perl

by Stuart Woodward


Several years ago, in Algorithmica Japonica we had a series of articles about the programming language AWK. At that time several people were having text conversion problems and someone, possibly Maynard Hogg, suggested writing small dedicated programs in the language AWK to solve these problems. Several non-programmers took up the challenge and started to write AWK programs. The next few months of BBS traffic were infused with cryptic program examples and message traffic on how to optimize them.

I never really liked AWK, probably because I spend most of my day programming in other, more conventional languages. AWK requires a different way of thinking to create a program that solves a problem. Getting my brain around something new on top of solving the problem itself was sometimes a bit too much like hard work.

A few years ago I would have said that most people don't need to program, because these days you can buy, or find, software to do almost anything that you want to do. I do believe, though, that most people would benefit from learning a language that they can use to automate small text processing tasks. For me, the language that I use and recommend for these tasks is Perl rather than AWK. I tend to use it most when I want to convert data from one format into another.

Here is a problem and a solution to demonstrate what you can do. I must say that this program is one of the first 10 programs I have written in the language Perl, so Perl language lawyers and optimizers beware!

While reading e-mail, magazines, news messages, etc., I kept coming across URLs, i.e. addresses of World Wide Web pages, that I wanted to follow up, so I started to collect them in a text file. When I was using the WWW I would then cut and paste them into the web browser to display the page. One night, when my collection of links to look at was getting quite large, I thought, "Wouldn't it be great to be able to jump directly from the links file without pasting the links in?"

To do this would mean turning the links file into an HTML file, HTML being the native language of the World Wide Web. Having written quite a few HTML files by hand, I roughly knew what to do, and having recently started learning Perl, I challenged myself to write a program to do it in Perl.

Here is my attempt:

$output_filename = ">c:\\windows\\desktop\\links.htm"; 
 
open(OUTPUT_FILE, $output_filename) || 
	die "Couldn't create $output_filename"; 
 
print OUTPUT_FILE 
"<html>\n<head>\n<title>AutoLinks!</title>\n</head>\n<body>\n"; 
print OUTPUT_FILE "<h1>AutoLinks!</h1>\n"; 
 
while (<>) { 
	print " $_"; 
	if(/(.*)(http:\S+)\s*(.*)/) { 
		print OUTPUT_FILE "$1<a href=\"$2\">$2</a> $3<p> \n"; 
	} 
	else 
	{ 
		if(/\S+/) 
		{ 
			print OUTPUT_FILE $_ . "<br>\n"; 
		} 
	} 
} 
 
print OUTPUT_FILE "</body>\n</html>\n";

If you are not familiar with HTML, then to understand what we are trying to achieve you need to see the data before and after it has been processed by the Perl script.

The original text file looks like this:

The lotus home page http://www.lotus.com

To make it into an HTML file readable by Netscape and other browsers, we need to format it like this:

<html> 
<head> 
<title>AutoLinks!</title> 
</head> 
<body> 
<h1>AutoLinks!</h1> 
The lotus home page  
<a href="http://www.lotus.com">http://www.lotus.com</a><p> 
</body> 
</html>

Now I will go through the program line by line so that you can see what is going on.

$output_filename = ">c:\\windows\\desktop\\links.htm";

This line just creates a filename for the output text, which in our case is HTML. It is not necessary to separate the naming of the file from the function call that opens it, but I like to write my programs in a self-documenting form. There are two points to note. First, a filename is marked as an output file by starting it with a '>'. Second, backslashes need to be doubled to avoid them being treated as special characters within the string. The file is created on the Windows 95 desktop so that the file of links is always readily available. The "Desktop," in case you haven't already realized, is just another normal subdirectory under the main Windows directory.

open(OUTPUT_FILE, $output_filename) || 
	die "Couldn't create $output_filename";

In English this line should be read, "open the output file, creating a file handle OUTPUT_FILE, OR if you can't do that, die (i.e. stop running), printing out a message telling what is wrong." This is a typical Perl idiom. It relies on the boolean logic principle that when evaluating an OR operator (written '||'), if the left hand side of the OR is true the computer doesn't even need to evaluate the right hand side. It already knows that the whole expression is true. So in the line above the die command is only run if the open fails. This is just a shorthand to avoid using an if() statement.
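The short-circuit behaviour described above can be seen in isolation with a small sketch. The helper try_step and the step names are hypothetical and are not part of the original script; they only record which side of each '||' actually ran.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): records that it was
# called, then returns the result we ask it to return.
my @called;
sub try_step {
    my ($name, $result) = @_;
    push @called, $name;
    return $result;
}

# Left side is true, so the right side is never evaluated.
try_step("open succeeds", 1) || try_step("die #1", 1);

# Left side is false, so Perl has to evaluate the right side.
try_step("open fails", 0) || try_step("die #2", 1);

print join(", ", @called), "\n";
```

Only "die #2" appears in the list, which is the same reason the die in the script runs only when the open fails.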

print OUTPUT_FILE  
"<html>\n<head>\n<title>AutoLinks!</title>\n</head>\n<body>\n"; 
print OUTPUT_FILE "<h1>AutoLinks!</h1>\n";

The output file is now ready to write to. In the lines above we set up the header and the title sections that are required for well behaved HTML files. Perl uses '\n' to signify a new line. Unlike a function call in the C language, the print statement doesn't require brackets around its arguments, and if the file handle OUTPUT_FILE is omitted, print prints to the screen.
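To see the difference between printing to a file handle and printing with no handle, here is a minimal sketch. It assumes a reasonably recent Perl that can open a file handle onto an in-memory string, and it uses a lexical handle ($out) rather than the bareword OUTPUT_FILE of the script, so that the example leaves no file behind.

```perl
use strict;
use warnings;

# Assumption: write into an in-memory string instead of a real file.
my $buffer = "";
open(my $out, '>', \$buffer) || die "Couldn't open in-memory file: $!";

print $out "<h1>AutoLinks!</h1>\n";   # handle given: goes into $buffer
print "Header written\n";             # no handle: goes to the screen

close($out);
print "Buffer now holds: $buffer";
```

Note that there is no comma between the file handle and the text to print, just as in the script.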

while (<>) {

Most high level languages have a while statement that executes something "while" something is true, but Perl has a really nice shortcut built in to help when processing files. Inside the parentheses above, the <> means "Try to read a line from the file that is mentioned on the command line. If you can, assign it to the string $_ and return true; if you can't, return false." So this while loop will go through the whole of the input file, processing each line in turn. If you are wondering which file is being read here, it is the text file which contains the links. To understand why, wait for the explanation of the Perl command line which comes later.
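Here is a small, self-contained sketch of the same loop. So that it can run without any command line arguments, it creates a temporary two-line file with the standard File::Temp module and places its name in @ARGV, the array where Perl keeps the command line arguments; both of those steps are assumptions made for the demonstration and are not in the original script.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demonstration only: create a small input file to read back.
my ($tmp, $name) = tempfile();
print $tmp "first line\nsecond line\n";
close($tmp);

# Pretend the file name was given on the command line.
@ARGV = ($name);

my $count = 0;
while (<>) {                 # each line read is placed in $_
    $count++;
    print "$count: $_";
}

unlink($name);               # tidy up the temporary file
```

The loop ends by itself when <> runs out of lines, so no explicit end-of-file test is needed.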

print " $_";

This is easy. Print the current line being read, known by the shorthand symbol $_, to the screen to keep the user amused. Probably pointless, as the program runs in less than half a second!

if(/(.*)(http:\S+)\s*(.*)/) {

This is not easy. This cryptic line, which is the heart of the whole program, does two things. It recognizes lines with a web link in them, and it chops up the line into separate parts known as $1, $2, etc. ready for processing. The search operator /something/ returns true if the "something" is found. In the above line we are looking for four types of text.

	Arbitrary text		signified by a  '.' (period) which can be anything 
	Specific text		that must be exactly matched in our case 'http:' 
	Non-Space text		signified by a  '\S', anything except spaces, tabs and newlines 
	Space text		signified by a  '\s', spaces, tabs and newlines

We are also looking for various combinations of the above text types on our line.

Zero or more characters of the type of character preceding the postfix  '*' 
One or more characters of the type of character preceding the postfix  '+'

The pattern we are looking for is quite complicated. For example, when the following line:

This is an interesting link: http://www2.gol.com/users/stuart/r.html You might like to check it out.

is processed it is broken up as follows:

	Some text	$1 = (.*)		Zero or more arbitrary characters: "This is an interesting link: "
	A Web link	$2 = (http:\S+)		'http:' followed by some non-space characters: "http://www2.gol.com/users/stuart/r.html"
	Spaces		\s*			Zero or more spaces
	More text	$3 = (.*)		Zero or more arbitrary characters until the end of the line: "You might like to check it out."

The parentheses indicate that we want to keep the text inside them for later use and these fragments, when found, are automatically assigned in order to the variables $1, $2, $3, etc.
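The breakdown can be checked with a short sketch that applies the same pattern to the sample line above and prints the three fragments. The variable names $line, $before, $url and $after are only for illustration; the script itself works on $_ directly.

```perl
use strict;
use warnings;

my $line = "This is an interesting link: "
         . "http://www2.gol.com/users/stuart/r.html "
         . "You might like to check it out.";

my ($before, $url, $after);
if ($line =~ /(.*)(http:\S+)\s*(.*)/) {
    # Copy the fragments straight away: $1, $2 and $3 are
    # overwritten by the next successful match.
    ($before, $url, $after) = ($1, $2, $3);
    print "1: [$before]\n";
    print "2: [$url]\n";
    print "3: [$after]\n";
}
```

The square brackets in the output make it easy to see that $1 keeps its trailing space, while the spaces between the link and the following text are swallowed by the \s* and kept nowhere.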

	print OUTPUT_FILE "$1<a href=\"$2\">$2</a> $3<p> \n";

The above line uses the fragments of the text that were found and recreates the line to include an HTML link to the URL mentioned in the text. Since the search operator /(.*)(http:\S+)\s*(.*)/ was the test condition of the if() statement, the above line is executed only when the test is true, i.e. when a line which looks like it has a link in it was found. In the final document any links to World Wide Web pages will now be highlighted when the output file is viewed under Netscape, and the user will be able to click on the links to see the documents that they link to.

	} 
	else 
	{ 
		if(/\S+/) 
		{ 
			print OUTPUT_FILE $_ . "<br>\n"; 
		} 
	} 

This part finishes off the if() statement. The else part is only run if the line didn't contain a link. It contains another search, this time for lines which contain one or more non-space characters ('\S+'), and prints them out. This effectively filters out, and thus does not print, lines which have no text. $_ is a shorthand for the current line we are looking at, and it is the unwritten default for many operators, including the search operator that we have been using.
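The whole decision (link line, plain text line, or blank line) can be sketched as one small function. The name convert_line is hypothetical, and unlike the script this version removes the line ending with chomp before testing the line, so the plain text case prints the text and the <br> on the same output line.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the HTML
# for one input line, or an empty string for a line with no text.
sub convert_line {
    my ($line) = @_;
    chomp $line;                              # drop the trailing newline
    if ($line =~ /(.*)(http:\S+)\s*(.*)/) {   # line contains a link
        return "$1<a href=\"$2\">$2</a> $3<p> \n";
    }
    elsif ($line =~ /\S+/) {                  # some text, but no link
        return "$line<br>\n";
    }
    return "";                                # nothing but white space
}

print convert_line("The lotus home page http://www.lotus.com\n");
print convert_line("Just a note\n");
print convert_line("   \n");
```

Putting the logic in a function like this is only one way to arrange it; the script does the same work inline inside the while loop.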

} 
 
print OUTPUT_FILE "</body>\n</html>\n";

The final brace ends the while loop, and here we print to the output file the "end of body" and "end of document" codes that terminate an HTML file.

Finally, to run this text to HTML converter we need to make a DOS batch file.

	perl txt2web.pl C:\WIN_APPS\INTERNET\HTTP.TXT

I always want to convert the same text file, so I have hard coded the input filename in the batch file. Perl scripts that read their input with <> take, by default, the name of the input file as the first argument after the name of the Perl script itself. The batch file above runs txt2web.pl, the program we have created, with the default input file:

C:\WIN_APPS\INTERNET\HTTP.TXT

As with most programming languages, Perl may look very daunting the first time you encounter it. If you can master opening and closing files, searching for patterns and writing to files, you can easily create some very sophisticated programs to manage text files. One of Perl's mottos is that you can write a Perl program in many different ways. If you posted the above program on the Internet perl Conference you would get no end of comments and criticism, for free, on making the script better...

Recommended books:

Programming Perl
Larry Wall & Randal L. Schwartz
O'Reilly & Associates, Inc.
ISBN 0-937175-64-1

Known as the Camel Book, this is the manual co-authored by the creator of Perl, Larry Wall.

Learning Perl
Randal L. Schwartz
O'Reilly & Associates, Inc.
ISBN 1-56592-042-2

An easy introduction to Perl. I bought both books at the same time and found it useful to refer to Programming Perl while reading through Learning Perl.

Internet Links

http://www.perl.com

Various links to information including frequently asked questions, an online manual and links to Perl implementations for nearly every platform that you can imagine.


© Algorithmica Japonica Copyright Notice: Copyright of material rests with the individual author. Articles may be reprinted by other user groups if the author and original publication are credited. Any other reproduction or use of material herein is prohibited without prior written permission from TPC. The mention of names of products without indication of Trademark or Registered Trademark status in no way implies that these products are not so protected by law.

Algorithmica Japonica

May 1996

The Newsletter of the Tokyo PC Users Group

Submissions : Editor Mike Lloret