This is the last part of my Understanding Information Store Essentials series, and it covers the Full-Text Indexing (FTI) feature of Exchange 2000/2003. In this article I cover the concepts of indexing, how Exchange implements it using MSSearch, and best practices for deploying full-text indexing with Exchange 2000/2003 Server.
If you have missed the other parts in this article series please read:
FTI on Exchange uses the MS-Search engine – When Exchange is installed, MS-Search 2.0 is installed on the machine. MS-Search is a generic search engine which is also used by other applications such as SQL Server and SharePoint Portal Server.
Creating the index is a very intensive process on the server – However, this processing can be scheduled to run overnight, and the time the server spends creating the index is more than repaid when clients actually submit queries. In effect, the server performs all the processing for searches in advance.
FTI means much faster searches, but less than 100% accuracy – The search results are only as accurate as the index is up to date. However, in nearly all observed scenarios a one-day-old index is acceptable to users.
FTI indexes attachments – So when a user performs a search they are searching attachments as well as the message body. The list of supported attachment types and how to extend this list are described later.
FTI preserves permissions – For example, if a user’s search hits documents in a folder that they do not have permission to see, those hits are not returned to them. FTI achieves this by reading the folder ACLs and storing them with the index.
FTI searches are word and not character based – This means that you have to search on whole words only and wildcards are not supported.
FTI supports stemming – This makes it easier for users to find information. Stemming means that a search for a particular word also searches for all variants of that word. For example, a search for 'start' also returns 'starts', 'started' and 'starting'.
FTI supports noise word removal – MS-Search strips noise words (such as and, if, is, the, etc.) out of content before indexing it. The noise words for each language are stored in text files which can be edited. These files are named "noise.xxx", where xxx is the language specifier, and are located in the following folder:
C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config
Remember to re-index your stores if you edit these noise files. When a client queries the database, their query is also passed through the noise word removal process. So for example, searching for “the sun” is the same as searching for “sun”.
FTI supports a Thesaurus – Thesaurus files are found in the same location as the noise word files with the file names tsxxx.xml, where xxx is the language specifier. These thesaurus files allow you to configure expansion and replacement sets.
Expansion Sets – For example, you can specify that a search for any of the words ‘NT5’, ‘W2K’ or ‘Windows 2000’ actually searches for all of these words. So if a user searches on ‘W2K’, they will be returned any documents which contain ‘NT5’, ‘W2K’ or ‘Windows 2000’.
Replacement Sets – For example, you can specify that a search for the words ‘IE’, ‘IE4’ or ‘IE5’ actually searches for the words ‘Internet Explorer’. So a user searching on ‘IE5’ will be given all documents which contain the words ‘Internet Explorer’.
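The query-side text handling described above (noise word removal, then thesaurus expansion or replacement) can be pictured with a small Python sketch. This only illustrates the idea and is not how MSSearch is implemented: the word list, the set contents and the function names strip_noise and apply_thesaurus are hypothetical, and the sketch does not read the real noise.xxx or tsxxx.xml files.

```python
# Toy model of query-side noise word removal and thesaurus handling.
# All data and names are illustrative assumptions, not MSSearch internals.

NOISE_WORDS = {"and", "if", "is", "the"}

# Expansion set: a search for any member searches for all members.
EXPANSION_SETS = [{"nt5", "w2k", "windows 2000"}]

# Replacement set: a search for any key searches for the substitute instead.
REPLACEMENTS = {
    "ie": "internet explorer",
    "ie4": "internet explorer",
    "ie5": "internet explorer",
}


def strip_noise(query):
    """Return the query words in lower case with noise words removed."""
    return [word for word in query.lower().split() if word not in NOISE_WORDS]


def apply_thesaurus(term):
    """Return the set of terms that should actually be searched for."""
    term = term.lower()
    if term in REPLACEMENTS:
        return {REPLACEMENTS[term]}
    for expansion in EXPANSION_SETS:
        if term in expansion:
            return set(expansion)
    return {term}


print(strip_noise("the sun"))          # ['sun']
print(sorted(apply_thesaurus("W2K")))  # ['nt5', 'w2k', 'windows 2000']
print(apply_thesaurus("IE5"))          # {'internet explorer'}
```

Because the same noise list is applied to the content and to the query, editing it changes what is stored in the index, which is why a re-index is needed after editing the noise files.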
The MSSearch process creates the indexes, and the indexes are controlled by MSSearch, not Exchange. When instructed to index an Exchange store, the MSSearch engine connects to the store and retrieves information using the ExOLEDB interface. MSSearch consists of two components: the Gatherer and the Indexing Engine. The diagram below shows how MSSearch controls indexing.

The Gatherer is responsible for gathering, or ‘crawling’, information and passing it to the Indexing Engine. It uses the OLE DB protocol handler in Exchange Server to connect to the data.
When the Gatherer locates a file (such as an attachment), it processes the file before passing it to the Indexing Engine. The Gatherer process is illustrated below:
The MSSearch language detection process is illustrated below:

When a client submits a search query to the Store, it is handled by the Query Processor. The store must decide whether MSSearch (i.e. FTI) can handle the search or whether it must perform it itself. Some searches cannot be performed using the indexes. For example, if a user searches for all messages less than 10KB in size, this type of property search must be handled by the store itself. MSSearch can only handle text queries. Some searches include both text and property queries.
For example, consider the following search: “All documents containing the word ‘Exchange’ AND less than 10KB in size”. In this case the query processor passes the ‘Exchange’ part of the search to MS-Search for an indexed search, which will be fast. When MS-Search returns the list of documents containing the word ‘Exchange’, the query processor then searches only those returned documents for ones less than 10KB in size, instead of searching the whole store. The results are merged (ANDed together) and returned to the user. As you can see, MSSearch has actually increased the speed of the property search in this case.
This is going to be an easy job!
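The combined search above can be sketched in Python as a two-step process: the text part is answered from an index, and the property part is checked only against the hits. This is a minimal illustration with made-up data. The Message class, build_index and search are hypothetical names, and the dictionary index merely stands in for the MSSearch catalogue.

```python
# Toy model of a combined text + property search.
# Names and data are illustrative assumptions, not the Exchange query processor.
from dataclasses import dataclass


@dataclass
class Message:
    msg_id: int
    body: str
    size_kb: int


def build_index(messages):
    """Map each lower-cased word to the set of message ids containing it."""
    index = {}
    for message in messages:
        for word in message.body.lower().split():
            index.setdefault(word, set()).add(message.msg_id)
    return index


def search(messages, index, word, max_size_kb):
    """Return ids of messages containing `word` that are smaller than max_size_kb."""
    # Step 1: the text part is answered from the index (the fast part).
    hits = index.get(word.lower(), set())
    # Step 2: the property part is evaluated only against the hits,
    # not against every message in the store.
    by_id = {message.msg_id: message for message in messages}
    return sorted(i for i in hits if by_id[i].size_kb < max_size_kb)


STORE = [
    Message(1, "Exchange deployment notes", 4),
    Message(2, "Exchange backup guide", 250),
    Message(3, "Lunch menu", 2),
]
INDEX = build_index(STORE)
print(search(STORE, INDEX, "Exchange", 10))  # [1]
```

Message 3 is small enough but never has its size checked, because the index lookup already excluded it. That is the saving the query processor gets from running the text part first.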
Creating the Index
To create an index, perform the following steps:
1. Right-click the store to be indexed from the Exchange System Manager
2. Click Create Full-Text Index.
3. Type the location of the index catalogue or accept the default
4. Click OK
This process creates the catalogue, or index files.


Populating the Index
The next step is to perform a full population of the index files.
1. Right-click the store to be indexed from Exchange System Manager
2. Click Start Full Population
This will start the crawling process, which could take some time and impact server performance. If you expand the store object being indexed and click on Full-Text Indexing you will see a summary of the indexing process.
You will need to hit the Refresh key (F5) to update this summary. At any time you can manually start a full or incremental population of the indexes for each store. Incremental population only indexes changes made to the store since the last index population.
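The difference between the two population types can be sketched in Python as follows. This is a simplified model, not the MSSearch algorithm: ToyCatalog, the integer change stamps and the store layout (item id mapped to a text and a last-modified stamp) are assumptions made for illustration, and deleted items are ignored.

```python
# Toy model of full versus incremental population.
# Class, fields and change stamps are illustrative assumptions only.


class ToyCatalog:
    def __init__(self):
        self.entries = {}        # item id -> indexed text
        self.last_populated = 0  # change stamp of the last population

    def full_population(self, store, now):
        """Rebuild the catalogue from every item in the store."""
        self.entries = {item_id: text for item_id, (text, _) in store.items()}
        self.last_populated = now
        return len(store)

    def incremental_population(self, store, now):
        """Re-index only items modified since the last population."""
        changed = {
            item_id: text
            for item_id, (text, modified) in store.items()
            if modified > self.last_populated
        }
        self.entries.update(changed)
        self.last_populated = now
        return len(changed)


# store maps item id -> (text, last-modified stamp)
store = {1: ("budget report", 1), 2: ("meeting notes", 2)}
catalog = ToyCatalog()
print(catalog.full_population(store, now=5))          # 2

store[2] = ("meeting notes, revised", 7)
store[3] = ("new message", 8)
print(catalog.incremental_population(store, now=10))  # 2
```

The incremental pass touches two items rather than all three, which is why it is the cheaper choice for scheduled updates once a full population has completed.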
Other options you will find by right-clicking the store object are:
Pause Population – Stops the current population process without deleting objects found to have changed during the current process. The process can be restarted at the same point it was paused.
Stop Population – Stops the current population process. All index updates obtained during the current process are lost.
Delete Full-Text Index – Deletes the index catalogue associated with the store.
If you open the properties of the Store object from ESM you will see a Full Text Indexing tab which contains the following settings:
Update interval – This is a scheduling option that will update the index with any changes that have been made to the store since the last update.

This index is currently available for searching by clients – It is recommended that you clear this option the first time an index is built. While the option is cleared, the store's query processor handles search requests itself. After the index is built and the option is selected, MSSearch processes the search requests. Clearing "This index is currently available for searching by clients" while building the index helps to avoid incomplete search results.
You can use gather files, the application log, and the system monitor to troubleshoot full-text indexing issues. When a full-text index is built, gather files are created. These files record errors and other information during indexing. Gather files can be found in the c:\Program Files\Exchsrvr\ExchangeServer\GatherLogs folder. All files with a .gthr extension are text files that you can view to identify every document and message that was not successfully indexed.
Analyzing a Gather File
Ideally, for each crawl, you will only see four lines in a gather file and the file size will be under 300 bytes. Each line in the file after the fourth line identifies the Uniform Resource Locator (URL) to the message or document that failed to index. Each line also includes the subject or file name along with the error code. The last number that you see in the line is the error code.
To decode this error number, use the utility Gthrlog.vbs found in the C:\Program Files\Common Files\System\MSSearch\Bin folder. The syntax for this utility, where filename is the name of the .gthr file, is Gthrlog filename.
You will be prompted with a series of dialog boxes that contain the data found in each line of the gather file. The error that was logged, and the error definition, appear in parentheses at the end of the text in the dialog box.
Note:
You can generate a file instead of a series of dialog boxes by using the following syntax at the command prompt:
Cscript Gthrlog filename > outputfilename.txt
where filename is the name of the .gthr file and outputfilename.txt is the name of the file you want the results written to.
Example of a Gather File
The following is an example of a gather file:
cb0ba7f0 1c0020a 4000000f 0 0
cb6a6590 1c0020a 40000017 0 0
d8cf6650 1c0020a 4000001f 0 40d83
e0797310 1c0020a
File:\\.\BackOfficeStorage\Nwtraders.msft\Public Folders\HR2 Folder 8000001c 0 80040e37
Object not found. No replica of the object can be found on this server.
Replication not enabled for this folder or object has been replicated.
e36c51b0 1c0020a 40000020 0 40d83
Using Gthrlog.vbs to translate gives the following output:
8/9/2000 7:04:56 AM Add The gatherer has started
8/9/2000 7:04:56 AM Add The initialization has completed
8/9/2000 7:05:18 AM Add Started Full crawl
8/9/2000 7:05:32 AM File:\\.\BackOfficeStorage\nwtraders.msft\Public Folders\HR2 Folder Add Error fetching URL, (80040e37 – Table does not exist. ) Object not found. No replica of the object can be found on this server. Replication not enabled for this folder or object has been replicated.
8/9/2000 7:05:36 AM Add Completed Full crawl
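To get a feel for the raw format without running Gthrlog.vbs, the following Python sketch decodes the first two fields of each record. It rests on two assumptions that the text above does not state: that the first two hexadecimal fields are the low and high halves of a Windows FILETIME value (100-nanosecond ticks since 1 January 1601, UTC), and that a record whose third field starts with File: is a failed item. Under the first assumption, the first sample record decodes to 9 August 2000, the same date that Gthrlog.vbs reports. The function names are hypothetical.

```python
# Sketch: decode the timestamp of a raw gather-file record.
# ASSUMPTION: fields 1 and 2 are the low/high 32-bit halves of a FILETIME.
# ASSUMPTION: a third field beginning with "File:" marks a failed item.
from datetime import datetime, timedelta, timezone

FILETIME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def decode_timestamp(low_hex, high_hex):
    """Combine the two hex halves and convert the ticks to a UTC datetime."""
    ticks = (int(high_hex, 16) << 32) | int(low_hex, 16)
    return FILETIME_EPOCH + timedelta(microseconds=ticks // 10)


def summarize(record):
    """Return (UTC timestamp, failed flag) for one gather-file record."""
    fields = record.split()
    when = decode_timestamp(fields[0], fields[1])
    failed = len(fields) > 2 and fields[2].lower().startswith("file:")
    return when, failed


SAMPLE = [
    r"cb0ba7f0 1c0020a 4000000f 0 0",
    r"e0797310 1c0020a File:\\.\BackOfficeStorage\Nwtraders.msft\Public Folders\HR2 Folder 8000001c 0 80040e37",
]

for record in SAMPLE:
    when, failed = summarize(record)
    print(when.isoformat(), "FAILED" if failed else "ok")
```

The decoded times are in UTC, so they will differ from the local times Gthrlog.vbs displays by the server's time zone offset. The error code itself still needs Gthrlog.vbs to translate.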
Application Log
If indexing cannot be performed on an item or is halted, an MSSearch error is logged in the Windows 200x Application Log. You will also see errors in the Application Log if the Microsoft Search service is down when the index is scheduled to update.
For example, if the MSSearch service is down at 12:30 and an incremental rebuild is attempted, an Exchange 200x error is added to the Event Log with the text “Failed to perform scheduled incremental Full-Text Index build at 12:30.”
System Monitor
You can use System Monitor to determine the impact of indexing on your server, and to identify performance bottlenecks. The following objects are available:
Microsoft Gatherer
Microsoft Gatherer Projects
Microsoft Search
Microsoft Search Catalogs
Microsoft Search Indexer Catalogs
Related Links
Microsoft Whitepaper "Best Practices for Deploying Full-Text Indexing"
http://www.microsoft.com/downloads/details.aspx [3]
Graphic Sources: Microsoft & My own Lab
Links:
[1] http://www.messagingtalk.org/content/227.html
[2] http://www.messagingtalk.org/content/252.html
[3] http://www.microsoft.com/downloads/details.aspx?FamilyID=d7d73256-459c-4b5e-827f-256fa21dd38a&displaylang=en