Perl Tips

This article was written to provide a few Perl tips and examples that can help you quickly edit single or batch data files from a command line.
 
Perl is a general-purpose language that is very powerful at manipulating data files containing text and numbers. SeaBASS data submitters and users often need to search for and replace patterns throughout many text files, which is very time-consuming if you manually open, change and resave each file. If you have ever accidentally misformatted a header, needed to remove extra spaces at the beginning of every line, or faced some other repetitive search-and-replace task, don't fret; continue reading to learn some simple ways to fix those issues in a single command.
 
Perl is free and is even installed by default on several operating systems (e.g. most Macs). If you don't already have it, it can be downloaded from www.perl.org for most flavors of Windows, Mac and Linux operating systems. Of the options available, Windows users might wish to try Strawberry Perl (though we do not officially endorse any particular version). Many Perl guides and tutorials already exist on the Internet, so the content on this page focuses only on a few tips that are relevant to SeaBASS data. You can accomplish extremely complex tasks by writing longer Perl programs, but those are beyond the scope of this article. If Perl is new to you, you might also want to read a more comprehensive introduction such as the one found on the official Perl site, but you should be able to get started with the examples below.




Regular expressions allow you to search for patterns of text (including numbers) and replace them with other patterns (which can include variables). A short description and some examples are given below. For more information, consult a page like http://perldoc.perl.org/perlrequick.html.
 
One of the simplest uses for a regular expression is to search for a specific text pattern like "abc" and replace it with something else like "def". The search and replace regular expression is written using an "s" followed by 3 slashes: "s///". The slashes are delimiters that separate the search pattern from the replace pattern, and optional flags can follow the last delimiter. For example, the following regular expression will replace all instances of the first pattern ("abc") with the second ("def").
 
s/abc/def/gi
 
Breaking down the above example in a bit more detail: the "s" is necessary for the search and replace command, the "abc" can be any pattern you want to match, and the "def" is the replace pattern. The optional "g" flag makes the search "global" and replaces all instances of the search pattern in a given line (without "g", only the first instance found on each line is replaced). Normally Perl only matches patterns if they are the same case (e.g., lowercase abc would NOT match uppercase ABC); however, the optional "i" flag in the example makes the search case-insensitive. You do not need to use either the g or i if you don't want their functionality.
 
Also, while slashes ("/") are commonly used as the delimiter, you can instead use a different symbol such as "|" (called pipe, and found by pressing shift and backslash on most keyboards). The only reason you might want to switch is if your search or replace pattern contains the delimiter itself. If you keep the slash as the delimiter in those cases, you have to put a backslash before each slash inside the pattern so Perl does not mistake it for a delimiter. For example, the following lines do the same thing, but the first is a little easier to read:
 
s|/experiment=abc|/experiment=def|i
s/\/experiment=abc/\/experiment=def/i
 
Some punctuation and other symbols have special meanings in regular expressions (such as \ ( ) . + ^ $ and others). Generally, if you want to use one of those symbols as a literal part of your search pattern, you must put a backslash before it to tell Perl to ignore its special meaning. For example, \. must be used to search for a period in text, because a plain . acts like a wildcard that matches any single character (except a line break). The delimiter you use (such as / or |) also takes on special meaning: if your search or replace pattern contains that same character, you must put a backslash before each occurrence. That's why there are backslashes before "/experiment" in one of the above examples.
 
Regular expressions are very powerful. While it is very straightforward to replace an exact snippet of text or numbers with another, with the right syntax you can search for and change very complex patterns. You can also specify where the search patterns occur relative to other information in the file. For example, you can look for patterns that occur at the beginning or end of a line using the special characters ^ and $. One place you can put this concept to use is if you have saved a spreadsheet to a text file (e.g., a .csv file). If you open the new file in a text editor, you will probably see that a long string of commas has been appended to the end of all of your metadata headers, like this:
 
/begin_header,,,,,,,,,,
/investigators=Casey_Smith,Taylor_Garcia,,,,,,,,,
/delimiter=comma,,,,,,,,,,
 
In this case you want to remove all those extra commas in the header, but using s/,//g will cause trouble because it will remove ALL commas in the file, including the useful ones that are separating certain headers and the delimiters in the data matrix. The "bad" commas always show up at the end of a line, while the "good" commas are always followed by more information and never show up at the end of the line. In this case our strategy will be to target just the bad commas and delete them by replacing them with nothing (literally).
 
Search for one or more commas at the end of a line and delete all of them:
s/,+$//g
 
Similarly, instrument-formatted data files sometimes have whitespace (usually a combination of one or more spaces and tabs) at the beginning of each line of data. To search for one or more whitespace characters (specifically spaces and tabs, the characters listed within the square brackets) at the beginning of each line (specified by the ^) and delete all of them:
s/^[ \t]+//g
 
Once you understand the basics of regular expressions, you can put them to use in Perl to make replacements. A simple way to make replacements within files is to run Perl from a command line, which lets you quickly manipulate a whole batch of files. For more details on command switches, consult an article such as http://perldoc.perl.org/perlrun.html#Command-Switches. Note that while these methods can save you much time, you should be careful (particularly when you are first learning) not to make unintentional changes that scramble your files. If you make a typo, or write a pattern that matches more text than you intended, you do not want to be in a position where you lose your valuable data. Best practice is to back up your original files in a safe directory and work on copies in a different folder.
 
On Mac/Linux:
 
perl -pi.backup -e "s/search_for_this/replace_with_this/g" FILE(S)
 
The -p, -i and -e switches are all necessary: -p runs the command on every line of each file and prints the result, -i writes that result back into the file itself (editing it "in place"), and -e tells Perl that the text that follows is the code to run. The -i switch can be followed by an extension (in this case, we used the suffix .backup), which makes a backup copy of each original file, with that extension added to its name, before the search and replace occurs. It does not have to be -i.backup as listed above; you can write -i.anything. If you feel confident, you can skip putting anything after the -i switch; however, no automatic copies of your files will be created, so you will be stuck with the results if you make a mistake.
 
"FILE(S)" can be a single file name, or you can use a wildcard to change several files at once (for example, *.txt would affect every file in the current directory whose name ends in .txt... thus, use with caution).
 
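However, the Windows command prompt does not allow you to simply use an asterisk wildcard in the "files" part of the command for matching file names. Depending on your version of the Windows operating system, one solution is to format your command like this (if you put the command in a batch file rather than typing it at the prompt, write %%i instead of %i):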
for %i in (*.txt) do perl -pi.backup -e "s/search_for_this/replace_with_this/g" "%i"
 
Here are a few other examples of commands that can be useful when formatting text files:
 

Remove empty rows:

perl -pi -e "s/^\n//" file.txt
 
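Alternate method to remove empty lines, in case the above pattern does not work due to the way your files are formatted (for example, when the "empty" lines actually contain spaces or tabs). This command prints only the lines that contain at least one non-whitespace character. The single quotes work in Mac/Linux shells; at the Windows command prompt, use double quotes instead.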
perl -i.backup -n -e 'print if /\S/' INPUT_FILE(S)
 
Multi-line replace, also demonstrating how to save search patterns and move them around in the replace:
 
perl -0777 -pi -e 's|^(\!C.+)\n/begin_header|/begin_header\n$1|' *.txt
 
To expand on this example, consider a hypothetical case where data files were accidentally created with an important comment printed BEFORE /begin_header. We want to find that pattern and switch the order of the two lines. The command does several things. It looks for a line that begins with "!C" and is immediately followed by a line containing "/begin_header". The -0777 switch is necessary to match patterns that span multiple lines (\n indicates a line break), because it makes Perl read each file as a single block of text instead of one line at a time. As a consequence, the ^ here matches only at the very start of the file, which is where the misplaced comment sits in this scenario. The parentheses save (capture) the part of the text that matched the pattern inside them, and the $1 variable prints that captured text back out in the replace section of the command. It is also possible to use multiple sets of parentheses to capture and print out multiple variables (using $1, $2, $3, etc.). With this command, a file that begins with:
 
!C1 critical-Information XYZ
/begin_header
/more_headers
 
will be transformed to:
 
/begin_header
!C1 critical-Information XYZ
/more_headers
 
 
That covers the basics, though the information presented so far is only a drop in the metaphorical oceans of regular expressions and Perl. To learn more, use some of the references listed above or search out others.
Last edited by Chris Proctor on 2017-04-19
Created by Chris Proctor on 2013-10-24