Regular Expression Tutorial

Regular expressions are extremely useful for parsing data, and IT professionals should take advantage of this powerful tool. There are many ways to use regular expressions to accomplish a wide variety of tasks.

We often have to perform analysis on log or data files to determine the source of a problem. In the following example we will analyze a log file for errors and report the number of occurrences.

Data to Analyze
The following is based on the default W3C Extended Log File Format generated by IIS (C:\WINNT\system32\Logfiles\W3SVC1).

      11:57:17 127.0.0.1 GET /Website/images/logo.jpg 304
      11:57:17 127.0.0.1 GET /Website/images/pic.gif 304
      12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500
      12:01:35 127.0.0.1 GET /Website/css/main.css 304
      12:01:35 127.0.0.1 GET /Website/src/common.js 304
      12:01:35 127.0.0.1 GET /Website/process/page3.asp 200
      12:01:45 127.0.0.1 GET /Website/process/page3.asp 500
      12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500
      12:01:45 127.0.0.1 GET /Website/process/page4.asp 500
      12:01:45 127.0.0.1 GET /Website/process/page5.asp 200
    

Step 1 - Analyze the data and determine the information to extract
Examine the data for a pattern. This is not always an easy task, but our log file has a simple pattern:

TIME IPADDRESS METHOD URL STATUS
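
To make the pattern concrete, here is a minimal sketch that splits one log line into the five fields. It assumes every line has exactly five whitespace-separated fields, as in the sample data; parseLogLine is a hypothetical helper name, not part of JScript or LargeEdit.

```javascript
// Split one log line into its five fields: TIME IPADDRESS METHOD URL STATUS.
// parseLogLine is a hypothetical helper used only for illustration.
function parseLogLine(line) {
    // Trim leading/trailing whitespace, then split on runs of whitespace.
    var parts = line.replace(/^\s+|\s+$/g, '').split(/\s+/);
    if (parts.length != 5) {
        return null; // Line does not follow the expected pattern
    }
    return {
        time: parts[0],
        ip: parts[1],
        method: parts[2],
        url: parts[3],
        status: parts[4]
    };
}

var entry = parseLogLine('      12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500');
// entry.url is '/Website/subfolder/page1.asp' and entry.status is '500'
```

Splitting works well when the field layout is fixed. A regular expression is the better fit when only part of each line is of interest, which is the approach taken in the following steps.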

After you have established the pattern, determine what information you would like to extract. For this example we will extract a distinct list of URLs with a 500 (internal server error) status. The output we would like to see is:

URL COUNT

Step 2 - Construct a regular expression
The syntax for a regular expression can be difficult to understand, but there is plenty of help and many samples online. The syntax may vary depending on the programming language you're using, so check the documentation.

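The regular expression for extracting our data would be as follows:

(?:GET|POST)\s+(\S+)\s+500

(?:GET|POST)  Find text starting with either GET or POST. Parentheses are needed here; square brackets such as [GET|POST] would match only a single character
\s+  Followed by one or more whitespace characters
(\S+)  Match one or more (+) non-whitespace characters (\S) and group them (). This is the URL
\s+500  Stop when whitespace followed by 500 (the status code) occurs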
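
It can help to try each piece of a pattern against one sample line before using it on a whole file. The sketch below uses a single line from the sample log to compare a bracketed character class with a grouped alternation, and a greedy (.*) capture with a (\S+) capture. The variable names are illustrative only.

```javascript
var line = '12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500';

// Square brackets form a character class: [GET|POST] matches ONE character
// out of G, E, T, |, P, O, S, not the words GET or POST.
var classMatch = /[GET|POST]/.exec(line);
// classMatch[0] is 'G', a single character

// A non-capturing group with | matches a whole word.
var wordMatch = /(?:GET|POST)/.exec(line);
// wordMatch[0] is 'GET'

// Capturing the URL: (.*) is greedy and also picks up the space before 500,
// while (\S+) stops at the first whitespace character.
var greedy = /(?:GET|POST)\s+(.*)500/.exec(line);
var tight = /(?:GET|POST)\s+(\S+)\s+500/.exec(line);
// greedy[1] is '/Website/subfolder/page1.asp ' (note the trailing space)
// tight[1]  is '/Website/subfolder/page1.asp'
```

The trailing space matters when counting distinct URLs, because two captures that differ only in whitespace would otherwise be treated as different items.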

Step 3 - Choose a language to implement your data extraction
This expression will extract the matching data, but some programming code needs to be added to count the distinct matches. Many scripting languages can perform this task, but many require downloading, installing, licensing, etc. If you're using a Microsoft Windows based operating system, you already have a powerful scripting engine installed and available to use. Using a text editor, "I recommend LargeEdit :)", you can write JScript code and execute it directly with the Windows Script Host (WScript.exe).
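
As a rough sketch of running JScript without an editor, the script below could be saved under a name such as count500.js (a hypothetical file name) and started from a command prompt with cscript //nologo count500.js followed by the path to a log file. It reads the file through Scripting.FileSystemObject and prints every URL that returned a 500 status. The helper name find500Urls is an assumption made for this example.

```javascript
// Return every URL in 'text' whose request ended with status 500.
// find500Urls is a hypothetical helper name.
function find500Urls(text) {
    var re = /(?:GET|POST)\s+(\S+)\s+500/g;
    var urls = [];
    var m;
    while ((m = re.exec(text)) != null) {
        urls.push(m[1]);
    }
    return urls;
}

// The part below only runs under Windows Script Host, where the
// WScript object exists.
if (typeof WScript != 'undefined') {
    if (WScript.Arguments.length < 1) {
        WScript.Echo('Usage: cscript //nologo count500.js <logfile>');
        WScript.Quit(1);
    }
    var fso = new ActiveXObject('Scripting.FileSystemObject');
    var file = fso.OpenTextFile(WScript.Arguments(0), 1); // 1 = ForReading
    var text = file.ReadAll();
    file.Close();

    var urls = find500Urls(text);
    for (var i = 0; i < urls.length; i++) {
        WScript.Echo(urls[i]);
    }
}
```

Running the script with cscript.exe writes each line to the console. WScript.exe would show each WScript.Echo call in a separate message box instead.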

Step 4 - Create the Script to extract the data
This task can be simplified by using LargeEdit 2.0. We will create a scripting macro based on JScript and execute it within the scripting engine against an open log file.

  1. Open LargeEdit
  2. Select Tools from the main menu
  3. Select Script Macros and click Create
  4. Enter the following Script in the editor window
  5. Save your Script as "distinctlistcounts.js"
    You can also download the file distinctlistcounts.js
      
      function Run() {
          LargeEdit.ResultLog(' Regular Expression Search');

          var inpStr = LargeEdit.CurrentFile.Text; //Text inside the editor window

          //Regular Expression object: GET or POST, whitespace, the URL (captured), whitespace, 500
          var oRe = new RegExp("(?:GET|POST)\\s+(\\S+)\\s+500", "g");
          var arr; //Array to hold results of expression search

          var idxarray = new Array(); //Items array to hold all items found
          var cntarray = new Array(); //Count array to hold counts associated with items found

          var idx;
          var cnt = 0;      //Total number of matches
          var distinct = 0; //Number of distinct items

          //Begin searching text using regular expression
          while ((arr = oRe.exec(inpStr)) != null) {
            cnt++;
            idx = idxarray.indexOf(arr[1]); //Has item already been found?
            if (idx < 0) {
              //Item not found, add to item array and start count at 1
              idxarray.push(arr[1]);
              cntarray.push(1);
              distinct++;
            } else {
              //Item found, increment the count by 1
              cntarray[idx]++;
            }
          }

          //Report information to results window
          LargeEdit.ResultLog('');
          LargeEdit.ResultLog('Total matches found ' + cnt);
          LargeEdit.ResultLog('Total distinct matches found ' + distinct);
          LargeEdit.ResultLog('');
          LargeEdit.ResultLog('List of distinct matches with counts');
          LargeEdit.ResultLog('');
          for (var i = 0; i < idxarray.length; i++) {
              LargeEdit.ResultLog( idxarray[i] + ' = ' + cntarray[i] );
          }

      }

      //Extend existing JScript array object with a case-insensitive indexOf method
      function Array_indexOf(text) {
        var res = -1;
        for (var i = 0; i < this.length; i++) {
          if (text.toUpperCase() == this[i].toUpperCase()) {
            res = i;
            break;
          }
        }
        return res;
      }
      Array.prototype.indexOf = Array_indexOf;
      //copyright 2006 All rights reserved
      
      

Step 5 - Execute the scripting macro
Using LargeEdit we can now execute this scripting macro against any file open in the editor.

  1. Open the log file using LargeEdit
    You can also download the example log file example.txt
  2. Select Tools from the main menu
  3. Select Script Macros and click Play
  4. Browse for the scripting macro you saved in the previous step (distinctlistcounts.js)
  5. The Script will execute and report the information in the result window.
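
To give an idea of the counts to expect for the sample log, the sketch below runs the same kind of tally in plain JavaScript, without LargeEdit. It is an illustration under a few assumptions: countUrls is a hypothetical helper, it uses an object as a lookup table instead of two parallel arrays, and it compares URLs case-sensitively.

```javascript
var sampleLog = [
    '11:57:17 127.0.0.1 GET /Website/images/logo.jpg 304',
    '11:57:17 127.0.0.1 GET /Website/images/pic.gif 304',
    '12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500',
    '12:01:35 127.0.0.1 GET /Website/css/main.css 304',
    '12:01:35 127.0.0.1 GET /Website/src/common.js 304',
    '12:01:35 127.0.0.1 GET /Website/process/page3.asp 200',
    '12:01:45 127.0.0.1 GET /Website/process/page3.asp 500',
    '12:01:35 127.0.0.1 GET /Website/subfolder/page1.asp 500',
    '12:01:45 127.0.0.1 GET /Website/process/page4.asp 500',
    '12:01:45 127.0.0.1 GET /Website/process/page5.asp 200'
].join('\n');

// Count how often each URL appears with a 500 status.
// countUrls is a hypothetical helper name.
function countUrls(text) {
    var re = /(?:GET|POST)\s+(\S+)\s+500/g;
    var counts = {}; // URL -> number of occurrences
    var order = [];  // URLs in first-seen order
    var m;
    while ((m = re.exec(text)) != null) {
        var url = m[1];
        if (counts.hasOwnProperty(url)) {
            counts[url]++;
        } else {
            counts[url] = 1;
            order.push(url);
        }
    }
    return { counts: counts, order: order };
}

var result = countUrls(sampleLog);
// result.order holds three distinct URLs, with these counts:
//   /Website/subfolder/page1.asp = 2
//   /Website/process/page3.asp = 1
//   /Website/process/page4.asp = 1
```

The request for page3.asp that returned 200 is not counted, so only four of the ten lines contribute to the totals.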

Summary
Using regular expressions and scripting languages can save you huge amounts of time, and you can quickly reproduce your data analysis results on any number of files. If this task is something you perform regularly, you can create a custom toolbar and add a button to execute your script (see the Custom Toolbar tutorial).