Session 6: Web search engine
A search engine
Search engines are among the most used tools on the web. Through a systematic and continuous search, huge indexes with references to the addresses of hundreds of millions of web files are created and maintained. Smaller organizations may also need search engines for their internal files. In this session, the development of a simple search engine is discussed and exemplified with a search engine for this course's Information section. You may already have used the search link available in the left window.
In contrast to databases, in which the data are neatly organized in tables, the data considered here are text files with varying content and length. Among the properties of a text file are, in addition to size, language, etc., the words included in the document. The tool we discuss in this session is a full text search engine, which can be implemented by means of features in ColdFusion. The basic approach of a full text search engine for a (large) set of text files is an index file with one record for each different (important) word occurring in the set of texts. Each word record has attached links to the files in which the word occurs one or more times. When the user asks for files containing one or several words, a search through the index file gives links to the requested files. Based on the number of occurrences in the requested files, scores can be computed.
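The index idea described above can be sketched in a few lines of Python. This is only an illustration of the principle, not how the Verity module works internally. The file names and texts are invented, and the rule that a file must contain every requested word is an assumption made for the sketch.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each word to {file name: number of occurrences in that file}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word][name] = index[word].get(name, 0) + 1
    return index

def search(index, words):
    """Return (file name, total occurrences) for files containing every word."""
    postings = [index.get(w.lower(), {}) for w in words]
    if not postings:
        return []
    common = set(postings[0])
    for p in postings[1:]:
        common &= set(p)
    hits = [(name, sum(p[name] for p in postings)) for name in common]
    # Most occurrences first; file name breaks ties.
    return sorted(hits, key=lambda h: (-h[1], h[0]))

# Hypothetical mini collection, invented for this sketch.
documents = {
    "session1.cfm": "Forms and templates. Forms send data to templates.",
    "session6.cfm": "A search engine builds an index of words.",
}
index = build_index(documents)
print(search(index, ["forms"]))
```

The search never opens the text files themselves. It only looks up word records in the index, which is why a prepared index answers queries quickly even for a large set of files.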
ColdFusion includes the components for such a search engine. They originate from a system module called Verity, included in CFMX.
To create search engine templates for your web system using CFML, you must take several steps:
- Prepare a menu
- Register the collection of files which should be covered by the system
- Index the files
- Construct a query interface to the system
- Prepare searching by means of the engine
- Prepare an option for deleting a collection
Figure 1 indicates the system components. The right-hand part of the figure represents how the search results first appear as references, while the left-hand part indicates that these references can be activated to retrieve the files found. When users receive the retrieved links, they should be able to select all or some of the links and retrieve the electronic documents in which they are interested.
In the following sections, the templates needed for implementing a search engine with the CFMX Verity module are discussed.
Selection menu
The implementation of the full text retrieval system requires the usual Application.cfm template, similar to those discussed in earlier applications (no datasource is needed). The first template, index.cfm, is a menu for selecting the required process:
1. <!--- index.cfm --->
2. <h2><font color="Blue">Search engine</font></h2>
3. <p>Select the process you want:</p>
4. <p>1. <a href="form_recording.cfm">Define </a> a collection of files </p>
5. <p>2. <a href="form_indexing.cfm">Indexing </a> the files of a collection</p>
6. <p>3. <a href="form_searching.cfm">Searching </a> in a collection index</p>
7. <p>4. <a href="form_deleting.cfm">Deleting </a> a file collection</p>
The menu, Figure 2, generated by the template, displays 4 options:
- registering
- indexing
- searching
- deleting
Observe that a set of files representing text documents is assumed to be already stored on your disk, ready for registration as a collection. Be careful when interpreting this term. For example, when the literature discusses creating a collection, it does not mean creating the documents but defining the location of the collection indexes. Likewise, when the documentation refers to deleting a collection, it means that the indexes and the definition of the collection are deleted, not the physical text files themselves.
File collection
The collection, which we are going to define for this example, comprises the session text files of this course. Usually, the number of text files comprised by a collection is of course much higher.
The files can be of different types, distinguished by their extensions: .htm, .html, .cfm, .cfml, .jpg, .gif, .pdf, .txt, .xls, etc. Note that graphical files can also be included, but they are only indexed if they contain some text. Not all types of files are always relevant for the application. The first task is to distinguish between relevant and irrelevant files by extension during the indexing. In the example, we use only .cfm files.
The first step is naming the collection, remembering that a collection is not the set of files themselves. The following templates make the required preparation for a collection.
The form_recording.cfm template displays a form, Figure 3, requesting the information needed for the definition of a collection:
1. <!--- form_recording.cfm --->
2. <h2><font color="Blue">Defining a collection</font></h2>
3. <form action="recording.cfm" >
4. <p>Name of collection:<input type="text" name="collection_name"></p>
5. <p>Folder for collection:<input type="text" name="collection_path"> </p>
6. <p><input type="submit" value="Record"></p>
7. </form>
The form requests 2 attribute values:
- What name should be attached to the collection,
- In which folder should the collection, i.e. the index files, be established
Recall that the collection consists of special index files referring to the files in the set in which we are interested, not the data files themselves. The collection path is therefore referring to the collection index.
The FORM tag leaves the process control to the template recording.cfm when the form is submitted. The recording.cfm template looks like this:
1. <!--- recording.cfm --->
2. <CFLOCK TIMEOUT="30" NAME="cfcollection_lock" TYPE="EXCLUSIVE">
3. <CFCOLLECTION ACTION="CREATE" COLLECTION="#collection_name#" PATH="#collection_path#" LANGUAGE="English">
4. </CFLOCK>
5. <p><h3><font color="Red">The collection is registered.</font></h3></p>
Keep in mind that the templates discussed here are tools for the developer to establish the search engine, not for the end users. Still, several people may be working with the files, and to avoid problems this template uses the CFLOCK and /CFLOCK tags. The lock ensures that the enclosed recording is carried out undisturbed by other developers and/or users. The CFLOCK tag in Line 2 has many possible attributes, of which we need only a few in the present application. TIMEOUT specifies the maximum time in seconds CFMX should wait to obtain the lock, not the duration of the lock. NAME identifies the lock. TYPE specifies the type of lock and is set to "EXCLUSIVE"; the alternative is "READONLY". An exclusive lock reserves the files completely for the developer's script while it is working within the locked area. CFLOCK tags should be used around all read-write operations for which there is a risk that 2 or more requests happen at the same time. Experience indicates that in applications with few users, this situation occurs rather infrequently.
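The difference between the time spent waiting for a lock and the time the lock is held can be illustrated with a small Python sketch. It mimics only the exclusive case and is not a description of how CFMX implements CFLOCK. The helper named_lock and its registry are hypothetical names introduced for this example.

```python
import threading
from contextlib import contextmanager

_locks = {}                          # one lock object per lock name
_registry_guard = threading.Lock()   # protects the registry itself

@contextmanager
def named_lock(name, timeout):
    """Hold the exclusive lock `name`; wait at most `timeout` seconds to get it."""
    with _registry_guard:
        lock = _locks.setdefault(name, threading.Lock())
    # The timeout limits the wait to OBTAIN the lock, not how long it is held.
    if not lock.acquire(timeout=timeout):
        raise TimeoutError(f"could not obtain lock {name!r} within {timeout} s")
    try:
        yield
    finally:
        lock.release()

# Usage: the body runs undisturbed by other holders of the same lock name.
with named_lock("cfcollection_lock", timeout=30):
    pass  # create or delete the collection here
```

Two requests that use different lock names do not wait for each other, which is why the templates in this session give related operations the same name.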
The CFCOLLECTION tag has several attributes. The first is ACTION, which takes the values "create" and "delete" in this example. The name of the COLLECTION is required and is received from the previous form, as is the attribute PATH, the path of the folder in which the collection is to be established. LANGUAGE has "English" as default. If the text in the files is in another language, an International Language add-on is available.
Indexing the files
The defined collection must be populated/indexed with data about the files in which we are interested. This is done by the indexing process. A second form, Figure 4, is implemented for this purpose by the template form_indexing.cfm:
1. <!--- form_indexing.cfm --->
2. <h2><font color="Blue">Indexing documents for a collection</font></h2>
3. <form action="indexing.cfm" >
4. <p>Name of collection:<input type="text" name="collection_name"></p>
5. <p>URL path to source folder:<input type="text" name="URLPATH"></p>
6. <p>Full path to source folder:<input type="text" name="KEY"></p>
7. <p>File extensions:<input type="text" name="extensions"></p>
8. <p><input type="submit" value="Index"></p>
9. </form>
In addition to the name of the collection, the URL path and the full path to the folder containing the documents to be indexed (not the collection folder!) are required. Remember that the URL path must start with http://.. The full path of the same folder on the server should also be provided. This path starts with c:\.. , or another relevant disk reference. Finally, the extension(s) of the files to be included must be specified. The specification must be in the form .cfm, .pdf, .jpg, etc. If more than one is needed, a comma should be used as delimiter.
Multiple collections and document folders can also be specified with comma delimiters between names and paths.
From this form, process control is passed to the template indexing.cfm. This short template contains the very powerful CFINDEX tag in Line 3:
1. <!--- indexing.cfm --->
2. <CFLOCK TIMEOUT="30" NAME="cfindex_lock" TYPE="EXCLUSIVE">
3. <CFINDEX ACTION="UPDATE"
COLLECTION="#collection_name#"
TYPE="PATH"
EXTENSIONS="#extensions#"
RECURSE="YES"
URLPATH="#urlpath#"
KEY="#key#"
LANGUAGE="English">
4. </CFLOCK>
5. <p><h3><font color="Red">The collection is indexed.</font></h3></p>
For the same reason given above for the definition of collections, CFLOCK and /CFLOCK tags are used to enclose the indexing process, because it can take some time if the number of files to be indexed is large. Indexing of the document files (populating the collection) is taken care of completely by the CFINDEX tag. ACTION is set to "UPDATE", which can be used both for initialization and for maintenance of an index. RECURSE is set to "YES", indicating that the process should traverse all subfolders of the source folder, if any. The remaining attribute values are obtained from form_indexing.cfm.
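What the combination of a comma-delimited extension list and recursion amounts to can be sketched in Python. This is an illustration of the selection step only, under the assumption that extensions are compared without regard to case. The function names are hypothetical and the sketch says nothing about how CFINDEX is implemented.

```python
import os

def parse_extensions(spec):
    """Turn a comma-delimited string such as '.cfm, .pdf' into a set."""
    return {ext.strip().lower() for ext in spec.split(",") if ext.strip()}

def collect_files(folder, spec, recurse=True):
    """List files under `folder` whose extension is in `spec`."""
    wanted = parse_extensions(spec)
    found = []
    for root, dirs, files in os.walk(folder):
        if not recurse:
            dirs.clear()  # stops os.walk from descending into subfolders
        for name in files:
            if os.path.splitext(name)[1].lower() in wanted:
                found.append(os.path.join(root, name))
    return sorted(found)

# Usage: all .cfm and .pdf files below the current folder.
print(collect_files(".", ".cfm, .pdf"))
```

With recurse set to False only the top folder is examined, which corresponds to leaving the subfolders out of the index.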
Searching the collection
The previous processes have established the search engine. We are now ready to prepare the use of the search engine. A simple search form, Figure 5, is the first step. It is implemented by the template form_searching.cfm:
1. <!--- form_searching.cfm --->
2. <h2><font color="Blue">Searching a collection</font></h2>
3. <form action="searching.cfm" >
4. <p>Name of collection:<input type="text" name="collection_name"></p>
5. <p>Search words:<input type="text" name="criteria">
6. <input type="submit" value="Search"></p>
7. </form>
A search requires the name of the collection(s) and the search criteria. A simple search criterion can be a single word, a phrase, several words delimited by commas, or an expression based on the logical operators OR, AND and NOT. More complex search criteria can be created by using a special search language available in the CFMX Verity module.
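The effect of the logical operators can be pictured as set operations on the lists of files attached to each word in the index. The Python sketch below is an illustration of that idea only and makes no claim about how Verity evaluates its criteria. The index content and file names are invented.

```python
# Hypothetical index: each word maps to the set of files containing it.
index = {
    "search": {"session6.cfm", "session7.cfm"},
    "engine": {"session6.cfm"},
    "database": {"session3.cfm", "session7.cfm"},
}

def files_with(word):
    """Files containing `word`; an unknown word matches no files."""
    return index.get(word.lower(), set())

both = files_with("search") & files_with("engine")        # search AND engine
either = files_with("engine") | files_with("database")    # engine OR database
without = files_with("search") - files_with("database")   # search AND NOT database

print(sorted(both), sorted(either), sorted(without))
```

AND narrows the result to the intersection, OR widens it to the union, and NOT removes the files that contain the excluded word.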
The search data are sent for processing by means of the template searching.cfm. The core of this template is the CFSEARCH tag. This tag has a number of attributes. Execution of the tag also provides a number of variables about the search. Attributes in addition to those transferred from the search form are TYPE, STARTROW and MAXROWS. TYPE is given the value "SIMPLE", which is the most compact form; the alternative is "EXPLICIT", which is more flexible but requires the criteria to be spelled out explicitly with all operators. STARTROW and MAXROWS refer to the references retrieved: the first reference to return and the maximum number of references to return. In our example we use the values "1" and "10", respectively, for the two attributes.
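The role of a start row and a maximum number of rows can be shown with a short Python sketch. The function page is a hypothetical helper written for this illustration. It assumes, as in the example values above, that rows are counted from 1.

```python
def page(results, startrow=1, maxrows=10):
    """Return at most `maxrows` results, beginning at the 1-based `startrow`."""
    first = max(startrow, 1) - 1   # convert the 1-based row to a list position
    return results[first:first + maxrows]

references = [f"file{n}.cfm" for n in range(1, 26)]   # 25 invented references
print(page(references))                  # references 1 to 10
print(page(references, startrow=21))     # references 21 to 25
```

Raising the start row by the maximum number of rows for each request is one way to let a user step through a long result list page by page.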
The searching.cfm template is listed below. Following the CFSEARCH tag, the results of the search are sent for display by Lines 3-6. Line 5 makes use of 2 of the variables provided by the search process, i.e. #collection_search.recordssearched# and #collection_search.recordcount#. The first variable gives the total number of files referred to in the index, the second the number of files relevant according to the criteria specified in the previous form.
1. <!--- searching.cfm --->
2. <CFSEARCH COLLECTION="#collection_name#"
NAME="collection_search"
TYPE="SIMPLE"
CRITERIA="#criteria#"
STARTROW="1"
MAXROWS="10"
LANGUAGE="English">
3. <h3><font color="Red">The search gave the following results:</font></h3>
4. <CFOUTPUT>
5. <p>The collection contained #collection_search.recordssearched# files, and #collection_search.recordcount# files satisfying the search criteria "<font color="Red">#criteria#</font>".</p>
6. </CFOUTPUT>
7. <CFIF #collection_search.recordcount# LTE 0>
8. <CFOUTPUT>Sorry, no files were found for the search criteria "<font color="Red">#criteria#</font>".</CFOUTPUT>
9. <cfelse>
10. <p><b>The files found were:</b></p>
11. <table>
12. <tr><td>Score:</td><td>Link:</td></tr>
13. <CFOUTPUT QUERY="collection_search">
14. <tr>
15. <td><b>#collection_search.score#</b></td>
16. <td><a href="#collection_search.url#">#collection_search.url#</a> </td>
17. </tr>
18. </CFOUTPUT>
19. </table>
20. </CFIF>
The remaining Lines 7-20 control a two-way branch by means of a set of CFIF, CFELSE and /CFIF tags. The decision criterion in Line 7 selects Line 8 if no relevant files were found. If relevant files were identified, Lines 10-19 are executed. The results are presented in a TABLE construct with 2 columns: Score and Link. The score is a number in the range 0 to 1, where 1 indicates the highest possible relevance. Link is a clickable link to the actual file, which can be retrieved by a click.
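One simple way to obtain scores in the range 0 to 1 from word occurrences is to divide each count by the highest count found. The Python sketch below shows this under the assumption that every listed file has at least one occurrence. It is not the formula used by Verity, which this session does not describe, and the file names and counts are invented.

```python
def normalized_scores(counts):
    """Scale positive occurrence counts to 0..1 relative to the best file."""
    if not counts:
        return []
    top = max(counts.values())
    # Highest count first; file name breaks ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(name, round(n / top, 2)) for name, n in ranked]

# Hypothetical occurrence counts for one search word.
print(normalized_scores({"session1.cfm": 4, "session2.cfm": 1, "session3.cfm": 2}))
```

Comparing such a frequency-based score with the scores reported by the search engine is a useful way to see whether relevance depends on more than plain word counts.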
Deleting a registered collection
After some time a search engine can become obsolete. To make the system complete, a possibility to delete a collection with the contained indexes must also be included. The form_deleting.cfm template generates the necessary form:
1. <!--- form_deleting.cfm --->
2. <h2><font color="Blue">Deleting a collection</font></h2>
3. <FORM ACTION="deleting.cfm" >
4. <p>Name of collection:<INPUT TYPE="text" NAME="collection_name"></p>
5. <p><INPUT TYPE="submit" VALUE="Delete"></p>
6. </form>
The only required information is the name of the collection. When the form, Figure 6, is submitted, the template deleting.cfm takes care of the deletion:
1. <!--- deleting.cfm --->
2. <CFLOCK TIMEOUT="30" NAME="cfcollection_lock" TYPE="EXCLUSIVE">
3. <CFCOLLECTION ACTION="delete" COLLECTION="#collection_name#">
4. </CFLOCK>
5. <h3><font color="Red">The collection is deleted.</font></h3>
The deleting.cfm uses the same CFCOLLECTION tag as the recording.cfm did.
Final remarks
It is important to note that the hidden dynamics in this system, i.e. using the results of a search to generate output, cannot be achieved with any static tool.
A search engine of the type discussed in this session is an interesting tool that raises a number of research questions. How do different groups express their search criteria? Do they learn over time to use a search engine more efficiently? How many users are able to express advanced search criteria? Are advanced search criteria more effective than simple ones? Should large text files be partitioned into smaller sub-files to obtain more efficient searches?
Exercises
a. Read Chapter 16 in RBB about the Advanced Techniques and try to see possible improvements of the templates presented in this session.
b. Copy all search engine templates to your PC. Prepare 3-4 small text files in English, or use some files you already have. Try to create a collection and to index the text files. Set up the search engine.
c. Do explorative searches in your collection using words and logical expressions as search criteria. Study the scores, and compare them with the frequencies of the word occurrences in your text file collection.
d. Report on the Message board about your experience.
e. Developers interested in crawlers should investigate Web Spider, a command-line utility, which can be found in the COLDFUSIONMX directory on your computer.
Link to the session application example.
This example works with a collection named "Collection", which comprises all files in the Sessions section. You will not be permitted to try the collection definition, indexing and deleting functions because they require detailed information about the directory structure of the server used by the course. You should, however, try out the searching with both simple and logical expression criteria.
Link to the session test.