Session 11: Regular expressions and CFScript
expression builder is linked
See O'Reilly, Chapter 18 (page 545), for regular expressions; examples are on pages 560-561.
CFScript is covered in Chapter 19 (page 567).
Regular expressions and string processing
Regular expressions are a way of specifying text processing conditions. They are used in a number of programming languages, but are unfortunately not standardized. They are a central part of the scripting language Perl, and the regular expression feature in ColdFusion is almost compatible with Perl 5. In ColdFusion, regular expressions are used in 2 basic ways:
- Searching for symbol patterns in a string of symbols.
- Replacing symbol patterns in a string with new symbol patterns.
There are 4 RE functions in CFMX (RE means regular expression). Arguments in square brackets are optional:
1. REFind(RegEx, String [, Start] [, ReturnSubExpressions]) - case-sensitive search
2. REFindNoCase(RegEx, String [, Start] [, ReturnSubExpressions]) - case-insensitive search
3. REReplace(String, RegEx, SubString [, Scope]) - case-sensitive replacement
4. REReplaceNoCase(String, RegEx, SubString [, Scope]) - case-insensitive replacement
The first 2 functions, used for searching, are identical except that the first is case sensitive and the second is case insensitive. These functions can take up to 4 arguments, of which only the first 2 are required. The first required argument is the regular expression containing the condition for identifying a symbol pattern in the text string given as the second required argument.
The same holds for the third and fourth functions, used for replacing. These functions have, however, 3 required arguments: a string which is to be processed, a regular expression identifying substrings, and a new substring to replace the identified substring(s). The optional Scope argument determines whether only the first match or all matches are replaced.
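The four calls can be sketched as follows. This is a minimal illustration and not part of the session examples; the sample sentence and the variable names are assumptions.

```cfml
<!--- Sample sentence and variable names are illustrative assumptions. --->
<cfset sentence = "The cat sat on the mat">

<!--- Case-sensitive search: lowercase "the" first occurs at position 16. --->
<cfset pos_cs = REFind("the", sentence)>

<!--- Case-insensitive search: "The" at position 1 also matches. --->
<cfset pos_ci = REFindNoCase("the", sentence)>

<!--- Replace only the first case-sensitive match (the default scope is "ONE"). --->
<cfset one = REReplace(sentence, "the", "a")>

<!--- Replace every match regardless of case. --->
<cfset all = REReplaceNoCase(sentence, "the", "a", "ALL")>

<cfoutput>
#pos_cs# #pos_ci#<br>  <!--- 16 1 --->
#one#<br>              <!--- The cat sat on a mat --->
#all#                  <!--- a cat sat on a mat --->
</cfoutput>
```

Both search functions return 0 when the pattern is not found, so a result of 0 should be tested for before the position is used.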
We have already used the second regular expression function, REFindNoCase(RegEx, String [, Start] [, ReturnSubExpressions]), for parsing text in the agent2.cfm template of the Agent2 example in Session 9 in order to identify wanted keywords in a text of news.
The syntax for forming regular expressions is based on sets of operators, character classes and/or Portable Operating System Interface (POSIX) classes. For a more detailed description of the syntax for regular expressions, consult RBB.
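As a rough illustration of the three building blocks, the sketch below applies a POSIX class expression, a plain character class, and an operator to the same string. The sample string and variable names are assumptions, and the explicit character class lists only a few delimiters, so it is narrower than [:punct:].

```cfml
<!--- Sample string and variable names are illustrative assumptions. --->
<cfset sample = "Hello, world">

<!--- POSIX classes inside a bracket expression: any punctuation or whitespace character. --->
<cfset p1 = REFind("[[:punct:][:space:]]", sample)>

<!--- Character class listing a few delimiters explicitly. --->
<cfset p2 = REFind("[ ,.;:!?]", sample)>

<!--- Operator +: one or more letters in a row. --->
<cfset p3 = REFind("[A-Za-z]+", sample)>

<!--- The comma is at position 6, the first run of letters starts at position 1. --->
<cfoutput>#p1# #p2# #p3#</cfoutput>  <!--- 6 6 1 --->
```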
Re-visiting the search engine
In Session 6, we studied how to build a search engine by means of the Verity module included in ColdFusion. One of the Verity functions was indexing. Figure 1 explains the content of this function. Given the documents of a registered collection, the indexing function processes each document by parsing, i.e., identifying and marking each word of the document, to build a frequency list containing each different word appearing in the document. When parsing of all documents is completed, a word frequency list for the whole collection has been generated. For each recorded word, links have been established to all documents in which the word occurs. The result is an inverted index of words, each with its frequency of occurrence in the document collection.
Documents containing requested words can then easily be localized by means of the index. Based on each word's total frequency in the collection and its local frequency in a document, different kinds of scores of relevance for the individual document can be computed.
In the following example, we demonstrate how to construct a template, which reads a text file, parses the text by means of a regular expression and builds an inverted word index for the document. To make the demonstration complete, the application has an introductory form by means of which you can upload your own text file. The second template, parser.cfm , uploads, parses and prints results from the specified file.
The introductory index.cfm file is simple:
<!--- index.cfm --->
<h3><font color="Blue">Text file indexing</font></h3>
<cfform action="parser.cfm" method="POST" enctype="multipart/form-data">
Name of file: <input type="file" name="file_up">
<input type="submit" value="Submit">
</cfform>
The purpose of this template is to permit the user to specify a text file at the client computer for uploading to the server for processing.
The next step is to search sequentially through the text string in the file to identify each separate word, record the identified word in a list with frequency 1 if it does not exist already or increment its frequency counter by 1 if it does, and repeat for the next word. When the document is exhausted, the words are sorted in descending order with the most frequent at the top of the list.
In the following template, parser.cfm, Lines 2-3 upload the file you specified in index.cfm and read it into the variable file_up as a string value, and Line 4 counts its characters. The core of the process is enclosed by 2 pairs of CFLOOP tags, one nested in the other, both functioning as 'while' loops. The first, starting at Line 8 and closing at Line 26, is traversed once for each word the process identifies. The second, enclosing Lines 13-20, is traversed at most once for each word already recorded in the frequency structure and stops as soon as the current word is found.
The core of the parsing process starts by finding each word. This is done by Line 9, in which the position of the end delimiter of the current word is identified. Since the beginning is the character following the end delimiter of the previous word, the word can be extracted by the tag of Line 10.
The position of the delimiter following the end of the current word is found by means of the function REFindNoCase(..), in which the criterion for finding the next word delimiter is the regular expression appearing as the first argument. In this template, two POSIX classes, [:punct:] and [:space:], are used. Note that the class names include their own square brackets and must in addition be enclosed in an outer pair of square brackets, giving [[:punct:][:space:]], to be correctly interpreted! The first class matches most punctuation characters not appearing as part of English words. The second class matches spaces. As indicated above, the same results could also be obtained by a regular expression based on character classes.
With the position of the end of a word and the start position of the word, the 3 arguments needed for the string function Mid() in Line 10 to extract the word are available. Lines 11-20 determine whether the current word is already in the frequency list. If the word has already been recorded, its frequency is incremented in Line 16. If not, Line 22 inserts the word with frequency 1. Lines 24-25 then advance the start position and the count of processed characters, and the next word is extracted and tested against the list of words.
1. <!--- parser.cfm --->
2. <CFFILE action="upload" filefield="file_up" destination="#path#\file_up.txt" nameconflict="overwrite">
3. <CFFILE action="read" file="#path#\file_up.txt" variable="file_up">
4. <CFSET document_size="#len(file_up)#">
5. <CFSET start="1">
6. <CFSET frequency_structure=StructNew()>
7. <CFSET characters_processed="0">
8. <CFLOOP condition="#characters_processed# LT #document_size#">
9. <CFSET position=REFindNoCase("[[:punct:][:space:]]",#file_up#,#start#)>
10. <CFSET word=Mid(#file_up#,#start#,#position#-#start#)>
11. <CFSET hit="0">
12. <CFSET counter="0">
13. <CFLOOP condition="#counter# LT #StructCount(frequency_structure)# AND #hit# EQ 0">
14. <CFIF StructKeyExists(frequency_structure,#word#)>
15. <CFSET frequency=StructFind(frequency_structure,#word#)>
16. <CFSET StructInsert(frequency_structure,#word#,#frequency#+1,"true")>
17. <CFSET hit="1">
18. </CFIF>
19. <CFSET counter=#counter#+1>
20. </CFLOOP>
21. <CFIF #hit# EQ 0>
22. <CFSET StructInsert(frequency_structure,#word#,1)>
23. </CFIF>
24. <CFSET start=#position#+1>
25. <CFSET characters_processed=#characters_processed#+#Len(word)#+1>
26. </CFLOOP>
27. <div align="center"><h2><font color="blue"> Word frequency list</font> </h2></div>
28. <table align="center" border="1">
29. <tr>
30. <th>Word:</th><th>Frequency:</th></tr>
31. <CFSET mylist=#ArrayToList(StructSort(frequency_structure,"numeric","desc"))#>
32. <CFLOOP index="word" list="#mylist#">
33. <CFOUTPUT>
34. <tr>
35. <td> #word# </td><td> #StructFind(frequency_structure,word)# </td>
36. </tr>
37. </CFOUTPUT>
38. </CFLOOP>
39. </table>
The last part of this template, Lines 27-39, is an ordinary tabulation of the list. Line 31 sorts the word frequency list in descending order of frequency, while Line 35 prepares the word and its frequency for display in a table row.
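The delimiter search and word extraction of Lines 9-10 can be tried in isolation with a short sketch like the one below. The sample string is an assumption, and the guard for a return value of 0 is an addition that is not part of parser.cfm: REFindNoCase returns 0 when no further delimiter exists, for example when the text does not end with a punctuation or space character.

```cfml
<!--- Hypothetical sample text; in parser.cfm this would be the content of the uploaded file. --->
<cfset sample = "Hello, world">
<cfset start = 1>

<!--- Position of the first punctuation or whitespace character at or after start. --->
<cfset position = REFindNoCase("[[:punct:][:space:]]", sample, start)>

<!--- 0 means no delimiter was found: treat the rest of the string as the last word. --->
<cfif position EQ 0>
    <cfset position = Len(sample) + 1>
</cfif>

<!--- Mid(string, start, count) extracts the characters between start and the delimiter. --->
<cfset word = Mid(sample, start, position - start)>

<cfoutput>#position#: #word#</cfoutput>  <!--- 6: Hello --->
```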
Figure 2 shows a short text, and Figure 3 demonstrates the indexing result.
Implementation
The application has been implemented as an example and is available by the link at the end of this session. Observe that the file used in this example should be a .txt file and that the absolute address of the file on your own computer must be specified. You can try files with other extensions, but the results may be affected by the formatting code of the particular file type. If you, for example, parse a .htm file containing a number of <P> tags, the P's are surrounded by the symbols < and >, which are considered word delimiters. Consequently, the "word" P appears with high frequency. The best way to try out the parser is to prepare a document in Notepad (or another ASCII text editor).
In a real application, frequent words such as 'and', 'or', 'but', 'the' and pronouns such as 'I', 'you', 'she', etc., are specified in a list called a stop-word list, which is used to exclude these words from the word frequency list since they have little significance when the frequency list is used for searching for keywords. The formatting specifications should also be eliminated from the text before indexing.
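Both clean-up steps can be sketched in a few lines. This is only one possible approach under simple assumptions: the stop-word list, the sample text and the variable names are hypothetical, and the tag pattern removes anything between < and > without understanding HTML.

```cfml
<!--- Hypothetical stop-word list and input; a real list would be much longer. --->
<cfset stop_words = "and,or,but,the,I,you,she">
<cfset raw = "<P>The cat and the dog</P>">

<!--- Replace everything that looks like a tag with a space before indexing. --->
<cfset plain = REReplace(raw, "<[^>]*>", " ", "ALL")>

<!--- Treat the text as a space-delimited list and keep only words not in the stop list. --->
<cfset kept = "">
<cfloop index="w" list="#plain#" delimiters=" ">
    <cfif ListFindNoCase(stop_words, w) EQ 0>
        <cfset kept = ListAppend(kept, w, " ")>
    </cfif>
</cfloop>

<cfoutput>#kept#</cfoutput>  <!--- cat dog --->
```

In the parser itself, the same ListFindNoCase test could be placed just before a word is inserted into the frequency structure.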
CFScript language
In the introductory session of this course, it was pointed out that ColdFusion is based on the tag-oriented language CFML. However, CFMX also includes a scripting language which can be activated by the tag CFSCRIPT. The CFScript language has a syntax similar to ordinary programming languages such as C. Syntactically, it also has many similarities with JavaScript (we shall return to JavaScript in the next session). While JavaScript is a tool for extending HTML on the client side, the scripting tool CFScript is an extension of CFMX on the server side.
The syntax of CFScript is simple , and is discussed by RBB in Chapter 19 of his book. You should take note of the following basic rules:
- CFScripts can be included in CFMX templates, but must always be enclosed by the tag pair CFSCRIPT and /CFSCRIPT .
- CF tags cannot be used within a CFScript .
- CF functions can be used in CFScripts .
- Variables defined in a CF template are available in a CFScript and vice versa.
CFScripts usually give more compact code than CFML , and many programmers trained in conventional programming feel more comfortable with CFScript than with the tag based statements of CFML .
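The rules listed above can be seen together in a very small template. The variable names and the text are assumptions chosen for illustration only.

```cfml
<!--- A variable set with a tag is visible inside the script block. --->
<cfset greeting = "Hello">

<cfscript>
    // CF functions can be called directly; each statement ends with a semicolon.
    shout = UCase(greeting) & "!";
    // Tags are not allowed here, so output goes through WriteOutput().
    WriteOutput(shout);
</cfscript>

<!--- A variable set in the script is visible to the tags that follow. --->
<cfoutput><br>#Len(shout)# characters</cfoutput>
```

The page shows HELLO! followed by the line "6 characters".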
Comparing CFML and CFScript
To demonstrate CFScript, we return to the parser.cfm template of the regular expression example discussed in the first part of this session. The code for parsing is now developed by means of CFScript instead of CFML tags. The CFFILE tags at the beginning of the parser template, which are needed for uploading and handling the file to be parsed, and the final TABLE tags for displaying the results are kept unchanged in parser_script.cfm, while the tags for processing and preparing the word frequency list are replaced with CFScript code.
The parser_script.cfm template starts with the CFFILE tags needed to upload the text file to location #path#\file2.txt and to read it. Then follows the CFSCRIPT block, and the template ends with the CFOUTPUT tag block. The value #path# must be set to be compatible with the web page organization of your server.
Note that like many other language syntaxes, the syntax of CFScript requires that each statement is terminated with a semicolon. As mentioned, CFScript permits the use of all CFMX functions and variables.
Almost a one-to-one correspondence between the tags of the previous example and the scripting statements can be obtained. Because of the strong similarity between the two templates, parser_script.cfm can easily be interpreted and understood without any further explanation. Note, however, the use of the special CFScript function WriteOutput(string) appearing in Lines 9 and 10. It writes text to the output stream. The visible gain of scripting compared with the tags in this example is less text.
- <!--- parser_script.cfm --->
- <CFFILE action="upload" filefield="file_up" nameconflict="OVERWRITE" destination="#path#\file2.txt">
- <CFFILE action="READ" file="#path#\file2.txt" variable="file_up">
- <CFSET document_size="#len(file_up)#">
- <CFSCRIPT>
- start=1;
- frequency_structure=StructNew();
- characters_processed=0;
- WriteOutput("Text: #file_up#<br>");
- WriteOutput("Document size: #document_size# characters.<br>");
- while (#characters_processed# lt #document_size#){
- position=REFindNoCase("[[:punct:][:space:]]",#file_up#,#start#);
- word=Mid(#file_up#,#start#,#position#-#start#);
- hit=0;
- counter=0;
- while(#counter# LT #StructCount(frequency_structure)# AND #hit# EQ 0){
- if (StructKeyExists(frequency_structure,#word#)){
- frequency=StructFind(frequency_structure,#word#);
- StructInsert(frequency_structure,#word#,#frequency#+1,"true");
- hit=1; }
- counter=#counter#+1; }
- if (#hit# eq 0)
- StructInsert(frequency_structure,#word#,1);
- start=#position#+1;
- characters_processed=#characters_processed#+#Len(word)#+1; }
- </CFSCRIPT>
- <CFOUTPUT>
- Time 2: #TimeFormat(now(),'hh:mm:ss:ll')#
- </CFOUTPUT>
- <div align="center"><h2><font color="Blue"> Word frequency list </font></h2></div>
- <table align="center" border="1">
- <tr>
- <th>Word:</th><th>Frequency:</th>
- </tr>
- <CFSET mylist=#ArrayToList(StructSort(frequency_structure,"numeric","desc"))#>
- <CFLOOP index="word" list="#mylist#">
- <CFOUTPUT>
- <tr><td> #word# </td><td> #StructFind(frequency_structure,word)#</td></tr>
- </CFOUTPUT>
- </CFLOOP>
- </table>
For this example, the code was implemented with the same index.cfm template as in the previous example, containing the form required for identifying the text file to be uploaded. The example can be tested by using the link at the end of this session. If used on the same text file as in the previous example, the results should be identical.
The advantage of CFScript will become significant in more complex processing tasks. It seems to be a trend that scripting is used for developing User Defined Functions and CF Components designed for intensive re-use . These topics will be discussed in the next sessions.
Conclusion
One of the objectives of ColdFusion is to be a RAD (Rapid Application Development) tool. The advantage of CFScript is more compact, elegant and efficient code. The price to be paid for using CFScript may, however, be less rapid development. The optimum development strategy will depend on the nature of the application task and the developer's preferences, knowledge and experience with the CFML tags and scripting.
Exercises
a. In the example, we have not paid attention to how the frequency list should be stored for efficient use. What about links to the indexed text documents? How should a stop word list be taken into account? List the storage alternatives you can think of, and prepare a pro & contra discussion for the alternatives imagining you need to convince a client for the solution you think is most suitable.
b. Implement your storage solution, and consider which search strategy will be optimal if you are searching for documents containing one, two or more words. It is usual to indicate document relevance with a score indicator . How would you assign scores to the different documents? Can you extend your design to work with logical expressions?
c. Do not forget to read Chapter 18 by RBB. The potential of RE is much wider than what has been discussed in this session, and knowledge of this potential is useful for designing a number of different systems.
d. Select one of the processing templates developed in your project and rewrite it by means of CFScript .
e. Insert time-stamps combined with CFOUTPUT at the beginning and end of both the tag and the script versions. Run both of them, compare the timing outputs and see if you can detect any speed differences. Discuss what you are in fact measuring.
f. I recommend studying Chapter 19 on scripting in RBB. My guess is that CFScript will become more important in the future in connection with User Defined Functions, Components and Web Services, which are 3 topics to be discussed in the following sessions.
Link to the session examples.
Link to the session test.
Using ColdFusion regular expression functions
ColdFusion supplies four functions that work with regular expressions:
REFind and REFindNoCase use a regular expression to search a string for a pattern and return the string index where it finds the pattern. For example, the following function returns the index of the first instance of the string " BIG ":
<cfset IndexOfOccurrence=REFind(" BIG ", "Some BIG BIG string")>
<!--- The value of IndexOfOccurrence is 5 --->
To find the next occurrence of the string " BIG ", you must call the REFind function a second time. For an example of iterating over a search string to find all occurrences of the regular expression, see Returning matched subexpressions.
REReplace and REReplaceNoCase use regular expressions to search through a string and replace the string pattern that matches the regular expression with another string. You can use these functions to replace the first match, or to replace all matches.
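A minimal sketch of both ideas follows: a loop that finds every occurrence by calling REFind repeatedly, and the two replacement scopes. The pattern "BIG" (without the surrounding spaces) and the variable names are assumptions chosen to keep the trace simple.

```cfml
<cfset source = "Some BIG BIG string">

<!--- Find all occurrences: restart the search just after each match. --->
<cfset start = 1>
<cfset positions = "">
<cfloop condition="start GT 0">
    <!--- With the fourth argument true, REFind returns a structure with pos and len arrays. --->
    <cfset match = REFind("BIG", source, start, "true")>
    <cfif match.pos[1] GT 0>
        <cfset positions = ListAppend(positions, match.pos[1])>
        <cfset start = match.pos[1] + match.len[1]>
    <cfelse>
        <!--- pos[1] is 0 when there is no further match: end the loop. --->
        <cfset start = 0>
    </cfif>
</cfloop>

<!--- Replace the first match only (default scope) and then all matches. --->
<cfset first_only = REReplace(source, "BIG", "small")>
<cfset every = REReplace(source, "BIG", "small", "ALL")>

<cfoutput>
#positions#<br>   <!--- 6,10 --->
#first_only#<br>  <!--- Some small BIG string --->
#every#           <!--- Some small small string --->
</cfoutput>
```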
Steve McKean
UH-Email
CT FORUM CF
user - enter