Understanding Information Store Essentials (Part 3)
Published date: Mon, 2005-12-19 00:03
Author: Shaji Firoz
Introduction
This article is the last part of my Understanding Information Store Essentials series and covers the Full-Text Indexing feature of Exchange 2000/2003. I will cover the concepts of indexing, how Exchange implements it using MSSearch, and best practices for deploying full-text indexing with Exchange 2000/2003 Server.
If you have missed the other parts of this article series, please read:
- Understanding Information Store Essentials (Part1)
- Understanding Information Store Essentials (Part2)
Overview
Full Text Indexing (FTI) enables you to create indexes for your information stores in order to substantially reduce the time it takes to search these databases. FTI is fully integrated with Exchange Server 2000 and 2003 (enabling FTI through the Exchange System Manager interface is covered later). Any store on an Exchange server can be indexed, whether Mailbox or Public. In practice, most people index public folders. Note that you can only index at the store level; you cannot create an index for just a specific folder.
Which clients are supported by FTI?
Outlook, IMAP4 and OWA clients are supported by FTI. In Outlook versions older than XP (2002), you had to use Advanced Find in order to use the index; a standard Find did not use FTI. Development interfaces including CDO, OLEDB/ADO and WebDAV can also be used to search the index programmatically.
Best Practice Considerations for Full Text Indexing (FTI)
FTI on Exchange uses the MS-Search engine – When Exchange is installed, MS-Search 2.0 is installed on the machine. MS-Search is a generic search engine which is also used by other applications such as SQL Server and SharePoint Portal Server.
Creating the index is a very intensive process on the server – However, this processing can be scheduled to run overnight, and the time the server spends creating the index is more than recovered when clients actually submit queries. In effect, the server performs all the processing for searches in advance.
FTI means much faster searches, but less than 100% accuracy – The search results are only as accurate as the index is up to date. However, in nearly all observed scenarios a one-day-old index is acceptable to users.
FTI indexes attachments – So when users perform a search, they are searching attachments as well as the message body. The list of supported attachment types and how to extend this list are described later.
FTI preserves permissions – For example if a user’s search hits documents in a folder that they do not have permissions to see, then those hits are not returned to them. FTI achieves this by reading the folder ACLs and storing them with the index.
FTI searches are word and not character based – This means that you have to search on whole words only and wildcards are not supported.
FTI supports stemming – This makes it easier for users to find information. Stemming means that a search for a particular word will also search all variants of the word. For example, a search for 'start' also returns 'starts', 'started' and 'starting'.
FTI supports noise word removal – MS-Search strips out noise words (such as and, if, is, the, etc.) from content before indexing it. The noise words for each language are stored in text files which can be edited. These files are named "noise.xxx", where xxx is the language specifier, and are located in the following folder:
C:\Program Files\Common Files\Microsoft Shared\MSSearch\Data\Config
Remember to re-index your stores if you edit these noise files. When a client queries the database, the query is also passed through the noise word removal process. So for example, searching for “the sun” is the same as searching for “sun”.
FTI supports a thesaurus – Thesaurus files are found in the same location as the noise word files, with the file names tsxxx.xml, where xxx is the language specifier. These thesaurus files allow you to configure expansion and replacement sets.
- Expansion Sets – For example, you can specify that a search for any of the words ‘NT5’, ‘W2K’ or ‘Windows 2000’ actually searches for all of these words. So if a user searches on ‘W2K’, they will be returned any documents which contain ‘NT5’, ‘W2K’ or ‘Windows 2000’.
- Replacement Sets – For example, you can specify that a search for the words ‘IE’, ‘IE4’ or ‘IE5’ actually searches for the words ‘Internet Explorer’. So a user searching on ‘IE5’ will be given all documents which contain the words ‘Internet Explorer’.
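To make the noise word and thesaurus behaviour above more concrete, here is a minimal Python sketch. It is not how MSSearch is implemented and it does not read the real noise.xxx or tsxxx.xml files; the word lists and the function names (strip_noise, expand_term) are assumptions made up for illustration, seeded with the examples from the text.

```python
# Illustrative data only: taken from the examples in the article,
# not loaded from the real noise.xxx / tsxxx.xml files.
NOISE_WORDS = {"and", "if", "is", "the"}

# Expansion set: searching for any member searches for all members.
EXPANSION_SETS = [{"nt5", "w2k", "windows 2000"}]

# Replacement set: each listed term is replaced by the substitute term.
REPLACEMENTS = {
    "ie": "internet explorer",
    "ie4": "internet explorer",
    "ie5": "internet explorer",
}


def strip_noise(text):
    """Return the words of `text` with noise words removed."""
    return [w for w in text.lower().split() if w not in NOISE_WORDS]


def expand_term(term):
    """Return the set of terms actually searched for one query term."""
    term = term.lower()
    if term in REPLACEMENTS:
        return {REPLACEMENTS[term]}      # replacement: original term is dropped
    for group in EXPANSION_SETS:
        if term in group:
            return set(group)            # expansion: every member is searched
    return {term}                        # no thesaurus entry: search as typed
```

With this sketch, strip_noise("the sun") and strip_noise("sun") give the same word list, and expand_term("W2K") yields all three Windows 2000 terms, while expand_term("IE5") yields only "internet explorer".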
Understanding the MSSearch Engine Architecture
The MSSearch process creates the indexes, and the indexes are controlled by MSSearch, not Exchange. When instructed to index an Exchange store, the MSSearch engine connects to the store and retrieves information using the ExOLEDB interface. MSSearch consists of two components: the Gatherer and the Indexing Engine. The figure below shows how MSSearch controls indexing.

What does the Gatherer do?
The Gatherer is responsible for gathering or ‘crawling’ information and passing it to the Indexing Engine. It uses the OLEDB protocol handler in Exchange Server to connect to the data.
When the Gatherer locates a file (such as an attachment), it processes the file before passing it to the Indexing Engine. The Gatherer process works as follows:
- Gatherer locates a file – The gatherer will crawl through the Exchange store and process every message that it finds. Also, the ACL for the document is retrieved so that permissions can be preserved during searches.
- An IFilter is selected – The IFilter selected is based on the file extension. Out of the box, Exchange includes IFilters for Word, PowerPoint, Excel, MIME, HTML and TIFF file formats (and of course text files). TIFF is the file format used by faxes. There is an IFilter for each format which understands the structure of the file and can strip it down to leave just the text. Only text is indexed. It is possible to add additional IFilters to MSSearch. These are available from third party vendors. For example, Adobe has produced an IFilter for its PDF file format, and Microsoft also has IFilters for Visio and XML file formats. Once processed, the IFilter will pass chunks of text to the gatherer process.
- Gatherer determines language of document – Processes such as stemming, noise word removal, word breaking and thesaurus lookup are language specific. Therefore, when indexing documents, MSSearch has to determine the language of the document so that it can apply the correct rules to it (see the MSSearch language detection process below for more details).
- The document is word broken – A word breaker (DLL) specific to the language of the document is used to break up the text into a list of words.
- Noise Words are removed – The noise word file specific to the language of the document is used to strip out unwanted words.
- Words are passed to the Indexing Engine – The Indexing Engine will then create the index files from these lists of words.
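The gatherer steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the filter functions stand in for real IFilters, the FILTERS registry, gather and build_index are hypothetical names, language detection is skipped, and a simple regular expression plays the role of a word breaker.

```python
import os
import re

NOISE_WORDS = {"and", "if", "is", "the"}  # illustrative subset only


def plain_text_filter(raw):
    """Stand-in for a text IFilter: the content is already plain text."""
    return raw


def html_filter(raw):
    """Stand-in for an HTML IFilter: strip tags and keep the text."""
    return re.sub(r"<[^>]+>", " ", raw)


# Hypothetical registry mapping a file extension to a filter function.
FILTERS = {".txt": plain_text_filter, ".htm": html_filter, ".html": html_filter}


def gather(name, raw, acl):
    """Return (words, acl) for one item, or None if no filter handles it."""
    ext = os.path.splitext(name)[1].lower()
    ifilter = FILTERS.get(ext)                # filter chosen by file extension
    if ifilter is None:
        return None                           # unsupported format: not indexed
    text = ifilter(raw)                       # filter leaves just the text
    words = re.findall(r"\w+", text.lower())  # crude stand-in for word breaking
    words = [w for w in words if w not in NOISE_WORDS]  # noise word removal
    return words, acl                         # ACL travels with the words


def build_index(items):
    """Toy indexing engine: build an inverted index of word -> item names."""
    index = {}
    for name, raw, acl in items:
        result = gather(name, raw, acl)
        if result is None:
            continue
        for word in result[0]:
            index.setdefault(word, set()).add(name)
    return index
```

The inverted index (each word mapped to the items that contain it) is a common textbook structure chosen here for clarity; the article does not describe the actual on-disk format of the MSSearch catalogue.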
The MSSearch language detection process works as follows:
- Check document’s properties. Some documents (for example HTML files) specify their language in their headers. If found then use that language.
- If there is no language specifier in the document, then MSSearch will detect the language from the actual content. If it finds more than one language in a document, then the language which is the most prevalent is used.
- If MSSearch cannot determine the language, or the language found is not supported, then the neutral word breaker is used. Stemming, noise word removal and the thesaurus are not applied in this case.
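The three-step fallback above can be expressed as a small decision function. The sketch below is an assumption-laden illustration: the SUPPORTED set, the function name choose_language, and the idea of passing in per-language counts are all invented here, and real content-based detection is not shown.

```python
SUPPORTED = {"en", "de", "fr"}   # hypothetical set of supported languages
NEUTRAL = "neutral"              # label for the neutral word breaker


def choose_language(declared, detected_counts):
    """Pick the word-breaker language for one document.

    declared        -- language taken from document properties, or None
    detected_counts -- mapping of language -> amount of text found in it
    """
    if declared:
        candidate = declared                       # step 1: use the header
    elif detected_counts:
        # step 2: the most prevalent detected language wins
        candidate = max(detected_counts, key=detected_counts.get)
    else:
        candidate = None                           # nothing could be determined
    # step 3: unknown or unsupported languages fall back to neutral
    return candidate if candidate in SUPPORTED else NEUTRAL
```

For example, choose_language(None, {"en": 10, "fr": 40}) picks "fr" because it is the most prevalent language, while an empty or unsupported result falls back to the neutral word breaker.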
Query Handling

When a client submits a search query to the Store, it is handled by the Query Processor. The store must decide whether MSSearch (i.e. FTI) can handle the search or whether it must perform the search itself. Some searches cannot be performed using the indexes. For example, if a user searches for all messages less than 10KB in size, this type of property search must be handled by the store itself. MSSearch can only handle text queries. Some searches include both text and property queries.
For example, consider the following search: “All documents containing the word ‘Exchange’ AND less than 10KB in size”. In this case the query processor passes the ‘Exchange’ part of the search to MS-Search for an indexed search, which will be fast. When MS-Search returns the list of documents containing the word ‘Exchange’, the query processor then searches only those returned documents for ones less than 10KB in size, instead of searching the whole store. The results are merged (ANDed together) and returned to the user. As you can see, MSSearch has actually increased the speed of the property search in this case.
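The combined text-and-property search can be illustrated with a short Python sketch. The STORE contents, the INDEX structure and the search function are hypothetical and exist only to show the idea: the text part is answered from the index first, and the size check then runs over the much smaller candidate set.

```python
# Hypothetical store: item name -> (size in bytes, text)
STORE = {
    "msg1": (4000, "Exchange migration notes"),
    "msg2": (25000, "Exchange sizing guide"),
    "msg3": (2000, "Holiday schedule"),
}

# Pre-built inverted index: word -> item names containing that word.
INDEX = {}
for name, (_, text) in STORE.items():
    for word in text.lower().split():
        INDEX.setdefault(word, set()).add(name)


def search(word=None, max_size=None):
    """Answer a query with an optional text part and an optional size part."""
    if word is not None:
        # Text part: answered from the index, no scan of the store.
        candidates = set(INDEX.get(word.lower(), set()))
    else:
        # Property-only query: every item in the store must be examined.
        candidates = set(STORE)
    if max_size is not None:
        # Property part: checked only against the remaining candidates.
        candidates = {n for n in candidates if STORE[n][0] < max_size}
    return sorted(candidates)
```

Here search(word="Exchange", max_size=10000) checks the size of only the two items returned by the index, whereas search(max_size=10000) has to look at all three.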
How to Configure Full Text Indexing
This is an easy job!
Creating the Index
To create an index, perform the following steps:
1. Right-click the store to be indexed from the Exchange System Manager
2. Click Create Full-Text Index.
3. Type the location of the index catalogue or accept the default
4. Click OK
This process creates the catalogue, or index files.


Populating the Index
The next step is to perform a full population of the index files.
1. Right-click the store to be indexed from Exchange System Manager
2. Click Start Full Population
This will start the crawling process, which could take some time and impact server performance. If you expand the store object being indexed and click on Full-Text Indexing, you will see a summary of the indexing process.
You will need to hit the Refresh key (F5) to update this summary. At any time you can manually start a full or incremental population of the indexes for each store. Incremental population only indexes changes made to the store since the last index population.
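The difference between a full and an incremental population can be sketched as follows. This is a simplified model, not the MSSearch algorithm: the populate function, the modified-time field and the in-memory index are assumptions for illustration, and deleted items are not handled.

```python
def populate(index, store, since=None):
    """Index items from `store`; with `since`, only items changed after it.

    index -- inverted index (word -> set of item names), updated in place
    store -- mapping of item name -> (modified_time, text)
    since -- time of the last population, or None for a full population
    Returns the sorted names of the items that were (re)indexed.
    """
    touched = []
    for name, (modified, text) in store.items():
        if since is not None and modified <= since:
            continue                       # unchanged since the last population
        for names in index.values():
            names.discard(name)            # drop stale entries for this item
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
        touched.append(name)
    return sorted(touched)
```

Calling populate(index, store) crawls everything, while populate(index, store, since=last_run) touches only the items modified after last_run, which is why an incremental population is so much cheaper on a large store.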
Other options you will find by right-clicking the store object are:
Pause Population – Stops the current population process without deleting objects found to have changed during the current process. The process can be restarted at the same point it was paused.
Stop Population – Stops the current population process. All index updates obtained during the current process are lost.
Delete Full-Text Index – Deletes the index catalogue associated with the store.
If you open the properties of the Store object from ESM you will see a Full Text Indexing tab which contains the following settings:
Update interval – This is a scheduling option that will update the index with any changes that have been made to the store since the last update.

This index is currently available for searching by clients – It is recommended that you clear this option the first time an index is built. This will allow the query processor to process the search request until the index is built. After the index is built, MSSearch will process the search request. Disabling "This index is currently available for searching by clients" while building the index will help to avoid incomplete search results.
Troubleshooting Full Text Indexing
You can use gather files, the application log, and the system monitor to troubleshoot full-text indexing issues.
When a full-text index is built, gather files are created. These files record errors and other information during indexing. Gather files can be found in the c:\Program Files\Exchsrvr\ExchangeServer\GatherLogs folder. All files with a .gthr extension are text files that you can view to identify every document and message that was not successfully indexed.
Analyzing a Gather File
Ideally, for each crawl, you will only see four lines in a gather file and the file size will be under 300 bytes. Each line in the file after the fourth line identifies the Uniform Resource Locator (URL) to the message or document that failed to index. Each line also includes the subject or file name along with the error code. The last number that you see in the line is the error code.
To decode this error number, use the utility Gthrlog.vbs found in the C:\Program Files\Common Files\System\MSSearch\Bin folder. The syntax for this utility, where filename is the name of the .gthr file, is Gthrlog filename.
You will be prompted with a series of dialog boxes that contain the data found in each line of the gather file. The error that was logged, and the error definition, appear in parentheses at the end of the text in the dialog box.
Note:
You can generate a file instead of a series of dialog boxes by using the following syntax at the command prompt:
Cscript Gthrlog filename > outputfilename.txt
where filename is the name of the .gthr file and outputfilename.txt is the name of the file you want the results written to.
Example of a Gather File
The following is an example of a gather file:
cb0ba7f0 1c0020a 4000000f 0 0
cb6a6590 1c0020a 40000017 0 0
d8cf6650 1c0020a 4000001f 0 40d83
e0797310 1c0020a
File:\\.\BackOfficeStorage\Nwtraders.msft\Public Folders\HR2 Folder 8000001c 0 80040e37
Object not found. No replica of the object can be found on this server.
Replication not enabled for this folder or object has been replicated.
e36c51b0 1c0020a 40000020 0 40d83
Using Gthrlog.vbs to translate gives the following output:
8/9/2000 7:04:56 AM Add The gatherer has started
8/9/2000 7:04:56 AM Add The initialization has completed
8/9/2000 7:05:18 AM Add Started Full crawl
8/9/2000 7:05:32 AM File:\\.\BackOfficeStorage\nwtraders.msft\Public Folders\HR2 Folder Add Error fetching URL, (80040e37 – Table does not exist. ) Object not found. No replica of the object can be found on this server. Replication not enabled for this folder or object has been replicated.
8/9/2000 7:05:36 AM Add Completed Full crawl
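If you only want a quick list of failed items and their raw error codes before running Gthrlog.vbs, a few lines of Python can scan a gather file. This is a rough sketch under assumptions that are mine, not Microsoft's: it treats any line containing "File:" as a failure record and takes the last 8-digit hexadecimal value on that line as the error code, which matches the sample above but may not hold for every gather file. The function name failed_items is hypothetical.

```python
import re

# An 8-digit hexadecimal token such as 80040e37.
HEX8 = re.compile(r"\b[0-9a-fA-F]{8}\b")


def failed_items(lines):
    """Yield (line, error_code) for lines that reference an item URL.

    Assumption: a failure record contains "File:" and the last 8-digit
    hexadecimal token on the line is the error code.
    """
    for line in lines:
        if "File:" not in line:
            continue                    # status line, not a failed item
        codes = HEX8.findall(line)
        if codes:
            yield line.strip(), codes[-1]
```

You could feed it the lines of a .gthr file, for example with open(path) as f: list(failed_items(f)), and then look up each code with Gthrlog.vbs as described above.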
Application Log
If indexing cannot be performed on an item or is halted, an MSSearch error is logged in the Windows 200x Application Log. You will also see errors in the Application Log if the Microsoft Search service is down when the index is scheduled to update.
For example, if the MSSearch service is down at 12:30 and an incremental rebuild is attempted, an Exchange 200x error is added to the Event Log with the text “Failed to perform scheduled incremental Full-Text Index build at 12:30.”
System Monitor
You can use System Monitor to determine the impact of indexing on your server, and to identify performance bottlenecks. The following objects are available:
Microsoft Gatherer
Microsoft Gatherer Projects
Microsoft Search
Microsoft Search Catalogs
Microsoft Search Indexer Catalogs
Related Links
Microsoft Whitepaper "Best Practices for Deploying Full-Text Indexing"
http://www.microsoft.com/downloads/details.aspx
Graphic Sources: Microsoft & My own Lab
About Shaji Firoz

Shaji Firoz is a Senior Consultant working for an IT outsourcing company in Singapore. Shaji works mostly on Microsoft technologies, with a keen interest in messaging environments. He leads development and implementation efforts in Exchange and Active Directory environments. He authors MS Exchange technical articles on this site and blogs about Hosted Exchange in the msexchange.org blog section. In recognition of his knowledge of Windows and Exchange Servers and his willingness to share information and help the community, Microsoft recognized him as an MVP in Exchange Server.
Shaji is a primary contributor to DigWin.com & MessagingBlogs.com. He can be reached at shajifiroz@messagingtalk.org