Posts Tagged ‘Files’

Using Perl and Regular Expressions to Process Html Files – Part 2

In this article we will discuss how to change the contents of an HTML file by running a Perl script on it.

Note: To ensure that the code is displayed correctly, the HTML example code in this article uses square brackets ‘[..]’ in HTML tags instead of angle brackets ‘<..>’.

The file we are going to process is called file1.htm:

[html]
[head][title]Sample HTML File[/title]
[link rel="stylesheet" type="text/css" href="style.css"]
[/head]
[body]
[h1]Introduction[/h1]
[p]Welcome to the world of Perl and regular expressions[/p]
[h2]Programming Languages[/h2]
[table border="1" width="400"]
[tr][th colspan="2"]Programming Languages[/th][/tr]
[tr][td]Language[/td][td]Typical use[/td][/tr]
[tr][td]JavaScript[/td][td]Client-side scripts[/td][/tr]
[tr][td]Perl[/td][td]Processing HTML files[/td][/tr]
[tr][td]PHP[/td][td]Server-side scripts[/td][/tr]
[/table]
[h1]Summary[/h1]
[p]JavaScript, Perl, and PHP are all interpreted programming languages.[/p]
[/body]
[/html]

Imagine that we need to change both occurrences of [h1]heading[/h1] to [h1 class="big"]heading[/h1]. Not a big change and something that could be easily done manually or by doing a simple search and replace. But we’re just getting started here.

To do this, we could use the following Perl script (script1.pl):

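1 open (IN, "file1.htm");
2 open (OUT, ">new_file1.htm");
3 while ($line = <IN>) {
4     $line =~ s/<h1>/<h1 class="big">/;
5     print OUT $line;
6 }
7 close (IN);
8 close (OUT);

Note: You don’t need to enter the line numbers. I’ve included them simply so that I can reference individual lines in the script. Unlike the HTML listing above, the script is shown with real angle brackets, because <IN> is Perl syntax and the pattern on line 4 has to match the real tags in the file.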

Let’s look at each line of the script.

Line 1
In this line file1.htm is opened so that it can be processed by the script. In order to process the file, Perl uses something called a filehandle, which provides a kind of link between the script and the operating system, containing information about the file that is being processed. I’ve called this “opening” filehandle ‘IN’, but I could have used anything within reason. Filehandles are normally in capitals.

Line 2
This line creates a new file called ‘new_file1.htm’, which is written to by using another filehandle, OUT. The ‘>’ just before the filename indicates that the file will be written to. If a file with that name already exists, its contents are replaced.

Line 3
This line sets up a loop in which each line in file1.htm will be examined individually.

Line 4
This is the regular expression. It searches for one occurrence of [h1] on each line of file1.htm and, if it finds it, changes it to [h1 class="big"].

Looking at Line 4 in more detail:

$line – This is a variable that contains a line of text. It gets modified if the substitution is successful.

=~ is called the binding operator. It applies the substitution on its right to the variable on its left.

s is the substitution operator.

[h1] is what needs to be substituted (replaced).

[h1 class="big"] is what [h1] has to be changed to.
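To see the substitution operator at work on its own, here is a minimal sketch that applies it to a string instead of a file. The sample strings and variable names are my own, and the tags are written with real angle brackets. It also shows the effect of the g modifier, which the script in this article does not use.

```perl
use strict;
use warnings;

# A line from the file, written here with real angle brackets.
my $line = "<h1>Introduction</h1>\n";

# s/// returns a true value when it makes a replacement.
my $changed = ($line =~ s/<h1>/<h1 class="big">/);
print "Changed: ", ($changed ? "yes" : "no"), "\n";
print $line;

# Without the g modifier only the first match on a line is replaced.
my $two = "<h1>One</h1><h1>Two</h1>";
(my $first_only = $two) =~ s/<h1>/<h1 class="big">/;
(my $every_one  = $two) =~ s/<h1>/<h1 class="big">/g;
print "$first_only\n$every_one\n";
```

In our sample file each heading sits on its own line, so replacing one occurrence per line is enough.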

Line 5
This line takes the contents of the $line variable and, via the OUT file handle, writes the line to new_file1.htm.

Line 6
This line closes the ‘while’ loop. The loop is repeated until all the lines in file1.htm have been examined.

Lines 7 and 8
These two lines close the two file handles that have been used in the script. If you missed off these two lines the script would still work, but it’s good programming practice to close file handles, thus freeing up the file handle names so they can be used, for example, by another file.
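For comparison, here is a slightly more defensive sketch of the same idea. It is not the script described above: the subroutine name add_h1_class is my own, it uses the three-argument form of open with lexical filehandles, it stops with an error message if a file cannot be opened, and it adds the g modifier so that every [h1] on a line is changed. The tags are written with real angle brackets.

```perl
use strict;
use warnings;

# Copy $in_path to $out_path, adding class="big" to every <h1> tag.
sub add_h1_class {
    my ($in_path, $out_path) = @_;

    # Three-argument open with lexical filehandles; stop with a message on failure.
    open(my $in,  '<', $in_path)  or die "Cannot read $in_path: $!";
    open(my $out, '>', $out_path) or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        $line =~ s/<h1>/<h1 class="big">/g;   # g changes every match on the line
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot finish writing $out_path: $!";
    return;
}

# Run against the article's file names when the input file is present.
add_h1_class('file1.htm', 'new_file1.htm') if -e 'file1.htm';
```

Checking the result of open is worth the extra typing: if file1.htm is missing, the original script runs without complaint and simply produces an empty new_file1.htm.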

Running the Script

As the purpose of this article is to explain how to use regular expressions to process HTML files, and not necessarily how to use Perl, I don’t want to spend too long describing how to run Perl scripts. Suffice it to say that you can run them in various ways, for example, from within a text editor such as TextPad, by double-clicking the Perl script (script1.pl), or by running the script from an MS-DOS window.

(The location of the Perl interpreter will need to be in your PATH statement so that you can run Perl scripts from any location on your computer and not just from within the directory where the interpreter (perl.exe) itself is installed.)

So, to run our script we could open an MS-DOS window and navigate to the location where the script and the HTML file are located. To keep life simple I’ve assumed that these two files are in the same folder (or directory). The command to run the script is:

C:\>perl script1.pl

If the script does work (and hopefully it will), a new file (new_file1.htm) is created in the same folder as file1.htm. If you open the file you’ll see that the two lines that contained [h1] tags have been modified so that they now read [h1 class="big"].

In Part 3 we’ll look at how to handle multiple files.

John is a web developer working for My Health Questions Matter, a company dedicated to helping patients to get the most out of their interaction with health care professionals such as doctors, midwives, and consultants by generating a set of health questions a patient can ask at an appointment.

More HTML Tips and Tricks:

Using Perl and Regular Expressions to Process Html Files – Part 1

Like many web content authors, over the past few years I’ve had many occasions when I’ve needed to clean up a bunch of HTML files that have been generated by a word processor or publishing package. Initially, I used to clean up the files manually, opening each one in turn, and making the same set of updates to each one. This works fine when you only have a few files to fix, but when you have hundreds or even thousands to do, you can very quickly be looking at weeks or even months of work. A few years ago someone put me on to the idea of using Perl and regular expressions to perform this ‘cleaning up’ process.

Why write an article about Perl and regular expressions, I hear you say? Well, that’s a good point. After all, the web is full of tutorials on Perl and regular expressions. What I found, though, was that when I was trying to find out how to process HTML files, it was difficult to find tutorials that met my criteria. I’m not saying they don’t exist; I just couldn’t find them. Sure, I could find tutorials that explained everything I needed to know about regular expressions, and I could find plenty of tutorials about how to program in Perl, and even how to use regular expressions within Perl scripts. What I couldn’t find, though, was a tutorial that explained how to open one or more HTML or text files, make updates to those files using regular expressions, and then save and close the files.

The Goal

When converting documents into HTML the goal is always to achieve a seamless conversion from the source document (for example, a word processor document) to HTML. The last thing you need is for your content authors to be spending hours, or even days, fixing untidy HTML code after it has been converted.

Many applications offer excellent tools for converting documents to HTML and, in combination with a well designed cascading style sheet (CSS), can often produce perfect results. Sometimes though, there are little bits of HTML code that are a bit messy, normally caused by authors not applying paragraph tags or styles correctly in the source document.

Why Perl?

Perl is a good language to use for this task because it is excellent at processing text files, which, let’s face it, is all HTML files are. Perl is also the de facto standard for the use of regular expressions, which you can use to search for, and replace or change, bits of text or code in a file.

What is Perl?

Perl (Practical Extraction and Report Language) is a general purpose programming language, which means it can be used to do anything that any other programming language can do. Having said that, Perl is very good at doing certain things, and not so good at others. Although you could do it, you wouldn’t normally develop a user interface in Perl as it would be much easier to use a language like Visual Basic to do this. What Perl is really good at, is processing text. This makes it a great choice for manipulating HTML files.

What is a Regular Expression?

A regular expression is a string that describes or matches a set of strings, according to certain syntax rules. Regular expressions are not unique to Perl (many languages, including JavaScript and PHP, can use them), but Perl’s support for them is among the best of any language.
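As a small illustration of one pattern describing a set of strings, here is a sketch in Perl. The pattern and the sample strings are my own choices for the example: the single expression below matches any of the opening heading tags from h1 to h6, and nothing else in the list.

```perl
use strict;
use warnings;

# One pattern that describes a whole set of strings: the opening tags <h1> to <h6>.
my $heading = qr/<h[1-6]>/;

for my $text ('<h1>', '<h2>', '<h6>', '<p>', '<h7>') {
    my $result = ($text =~ $heading) ? 'matches' : 'does not match';
    print "$text $result\n";
}
```

The part in square brackets, [1-6], stands for any single digit from 1 to 6, which is how six different tags are covered by one short expression.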

In Part 2, we’ll look at our first example Perl script.

John Dixon is a web developer working through his own company John Dixon Technology. As well as providing web development services, John’s company also provides free open source accounting software written in PHP and MySQL.


Create Chm HTML Help Files Easily




Easily Create .CHM Files






Introduction to HTML Help CHM format


Nowadays, HTML Help (CHM) is the standard help format used in most modern Windows applications. An HTML Help system is completely stand-alone and can be distributed as a single file (for example, “My_Help_File.CHM”). This makes CHM a portable format for technical documentation: any Windows user can open such a file under Windows 98, ME, 2000, XP, and the latest Vista operating system.


An HTML Help CHM file includes all the features needed to provide the end user with an easily navigated tutorial. Most of us are probably familiar with the HTML Help viewer, which has the Table of Contents, an alphabetical Index, and the Search feature on a navigation pane to the left of the help topic text.



How do I create CHM HTML Help Files?



There are various tools in the marketplace that support HTML Help as an output format, from primitive applications to complex and expensive systems for writing technical documentation. However, a common problem with that software is a non-intuitive and sluggish interface, complexity, and a high price of about $999 per license or even more! Moreover, you will have to spend a lot of time learning the tool before you can create even a simple CHM file for your software product. Now you may be asking whether there is an easier way to create CHM help. Fortunately, the answer is “yes”.


HelpSmith takes a different approach to creating CHM Help. If you download and try HelpSmith, available on the vendor’s web site, you will be surprised by its straightforward and easy-to-use interface. There is practically no learning curve, unlike many other help authoring products that make you spend hours figuring out how to add a new help topic. Once you have installed HelpSmith on your computer, you can type “Hello, World”, click a button, and there it is: your first help file in the HTML Help format. Then you can easily add new help topics, create hyperlinks and help windows, insert graphics, and do everything else the HTML Help system allows; the process is as simple as working with Microsoft Office applications.



Creating CHM Files with HelpSmith


HelpSmith allows you to easily create CHM HTML Help files. Based upon the WYSIWYG (“What-You-See-Is-What-You-Get”) principle, HelpSmith provides you with a powerful text editor that makes the biggest part of working on a help system, writing and editing help topics, a pleasure to do. Use graphical images, insert full-featured tables, create hyperlinks, and check spelling as you type, just like in Microsoft Word. You will also be able to create the Table of Contents and the keyword Index for your CHM help system in just a few minutes. Other important HelpSmith features include the ability to create Web Help and printed documentation from the same source help project.



Microsoft HHC.EXE compiler


CHM is not an open file format. So how do third party products allow you to create it? Like other help authoring tools, HelpSmith uses the HHC.EXE help compiler to create CHM Files from the source project. The HHC.EXE HTML Help compiler is freely available with the Microsoft HTML Help Workshop package which can be downloaded from the product’s home page. Once HTML Help Workshop is installed, you should follow these simple steps to link HHC.EXE with HelpSmith:

Choose “Options\Tools” from the menu.
Select “General\Compilers” on the left sidebar.
Specify the full path to the “HHC.EXE” file on your computer (for example, “c:\Program Files\HTML Help Workshop\HHC.exe”).
Click the OK button to save the changed parameters.



Once the compiler is installed, the process of working with HHC.EXE is completely transparent to you, making it a breeze to create CHM HTML Help documentation.

Eugene Ivanov is a help authoring specialist, technical writing blogger, and a software developer at http://www.helpsmith.com, a company producing help authoring tools for technical authors and programmers.


Compare HTML Files

Fundamentals of HTML. HTML, or Hyper Text Markup Language, is a language used to describe the text of a Web page. It defines the syntax and placement of special instructions known as tags, which tell the browser how the text should be laid out. An HTML document consists of content plus the tags that tell the browser how to present it. One of the major features of HTML is the hypertext link, which lets the author link one page to another. This ability to jump to another page, or to a completely different Web site, is what sets Web documents apart. HTML files are plain ASCII text files, which means they can be created with a text editor such as Notepad for Windows or SimpleText for Macintosh. Dedicated HTML editors are also available for both Windows and Macintosh, but they are optional. WYSIWYG (What You See Is What You Get) editors are another kind of HTML editor. They work like word processors or page layout software for Web documents, and their main aim is to let authors build pages with no knowledge of HTML.


HTML Compare Files Overview. Comparing HTML files is one of the best ways to locate modifications between versions of Web sites and HTML pages. HTML Compare works much like the revision tracking feature of a word processor: it compares two Web pages and highlights the differences, whether modifications, deletions, or additions. With its simple interface you can easily choose the directories and files to compare and then view a merged file showing the changes to each page. You can also compare whole directories and produce a list of everything that has changed, which is one of the best ways to locate changed graphics and new files. To compare HTML files you need a Windows PC and less than 50K of disk space.


How Does HTML Compare Work? Once HTML Compare is open, select from the menu the file you want to compare. The Compare Directories or Files page appears; click Compare Files and select old. Next, choose the first directory from the dialog. After selecting the second directory, click HTML Compare. The comparison can take several minutes, depending on the size of the content. When it finishes, a browser window opens with a report file showing all the results. The report lists the deleted files, the new files, and the changed files. For the changed files there are three links: old, new, and composite. Old files are those found in the old directory, whereas new files are found in the new directory. You can also use the command line interface to compare a series of files in a batch.
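To make the idea of a deleted, new, and changed files report concrete, here is a minimal sketch in Perl. It is not how HTML Compare itself works: the subroutine name compare_file_sets and the sample data are hypothetical, and the two directories are stood in for by hashes that map a file name to its content.

```perl
use strict;
use warnings;

# Compare two sets of files, given as hashes of file name => file content.
# Returns references to three sorted lists: deleted, new, and changed names.
sub compare_file_sets {
    my ($old, $new) = @_;
    my (@deleted, @added, @changed);

    for my $name (sort keys %$old) {
        if (!exists $new->{$name}) {
            push @deleted, $name;            # only in the old set
        }
        elsif ($old->{$name} ne $new->{$name}) {
            push @changed, $name;            # in both, but content differs
        }
    }
    for my $name (sort keys %$new) {
        push @added, $name unless exists $old->{$name};   # only in the new set
    }
    return (\@deleted, \@added, \@changed);
}

# Hypothetical sample data standing in for two directories of pages.
my %old_site = ('index.htm' => '<h1>Home</h1>', 'about.htm' => '<p>About</p>');
my %new_site = ('index.htm' => '<h1 class="big">Home</h1>', 'news.htm' => '<p>News</p>');

my ($deleted, $added, $changed) = compare_file_sets(\%old_site, \%new_site);
print "Deleted: @$deleted\n";
print "New: @$added\n";
print "Changed: @$changed\n";
```

A real comparison tool goes further and highlights the differences inside each changed page; this sketch only sorts file names into the three groups that the report lists.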

If you are interested in comparing HTML files, check this web-site to learn more about web page comparison.
