As you know, I have already written Part 1 of this three-part series. In case you missed it, you can read Understanding Information Store Essentials (Part1) [1].
If you would like to be notified when Shaji Firoz releases Understanding Information Store Essentials (Part3), please sign up to our Newsletter [2] and choose Exchange Server articles update.
From the user's standpoint, the store is organised into a hierarchy of folders containing items such as mail messages, contacts and calendar appointments. Internally, the store is structured in a similar way and provides the following features.
As Figure 1-1 below shows, the store is structured into various tables which are linked to each other. Storage of the tables is handled by ESE, which I will cover later in this article. The Store maintains the relationships between the tables. When accessing data, clients use these tables to locate and view messages. For example, Outlook will build the folder list from the Folders table.
Figure 1-1
It is important to note that tables are stored only in the EDB file. The STM file only contains raw data. If a message is to be stored in the STM file, then the tables in the EDB file will point to the location in the STM file where that message is located.
The main tables are described below:
| Table | Significant Columns | Comment |
| Mailbox Table | - Root folder ID | There is one row for each mailbox on this store. This entry will point to an entry in the Folders Table representing the root folder of the mailbox. There is only basic information in this table, there is no data. |
| Folders Table | - Folder ID | This is one big table containing an entry for every single folder in the entire store. Folders have pointers to their children and parent to provide the hierarchical structure. The Folders Table contains only the properties of the folders; message data is not stored here. |
| Message Folder Tables | - Message ID | There are many Message Folder Tables (MFTs). There is an MFT for each view in each folder in the Folders Table. Each row in the Folders Table may have one or more MFTs associated with it. For example, a particular user’s Inbox may have 6 views, so there will be 6 MFTs, one for each view. Each MFT will contain the message IDs of the messages in that view, plus the properties of those messages as defined by the view. The actual message body is not stored in the MFT. Note: Each time you create a new View in Outlook, the Store will create a new MFT to represent that view. You may notice that the first time you create the view there is a delay, but subsequent access to that view is much quicker, because the MFT already exists. Any MFT which is unused after 7 days is deleted. |
| Messages Table | - Message ID | This is one big table containing every single message in the entire store. The table contains the actual message body in compressed RTF format if the email is in the EDB file, or a pointer to the location of the message in the STM file. Each message in this table will have one or more rows in the MFTs pointing to it. When several MFT entries point to the same message in the Messages Table, the message is stored as a single instance, which saves disk space. |
| Attachments Table | - Attachment ID | Similar to the Messages Table but contains attachments to messages. Messages from the message table point to entries in the attachments table if they contain attachments. |
In this section I would like to give an overview of ESE as it is used by Exchange 2000 server. It includes details of how ESE works and how it provides recovery features. Also outlined is the optimum disk configuration for running ESE databases.
The Extensible Storage Engine, or ESE, is the heart of Exchange storage. It is responsible for physically storing Exchange data on the hard disk. ESE is a very robust storage technology and has been designed to be completely recoverable in a disaster scenario. ESE uses transaction logging technology to provide reliability in any disaster recovery.
As we move through this section, I will also cover a little on Transactions, Transaction Logs, ESE Operation, new log file creation, the Checkpoint file, File Signatures, Soft Recovery and Optimum Disk configuration.
Figure 1-2 shows the ESE file structure in the Exchange Store. ESE holds the data in two files, EDB and STM (the purpose of the STM file is described later, but for ESE, both files have the same structure).
Everything within the file is stored in B+trees (a variation on balanced trees) which provide for fast searching and efficient storage. Each store table (as described in the previous section) is a collection of B+trees. There is a master tree which references all other tables and their B+trees; this ‘tree of trees’ is called the System Catalogue. Because this table is critical it is stored twice within the file (one copy starting at page 4 and the other at page 24).
B+trees are designed to provide fast access to data on the disk. Going to the disk is expensive for the store, therefore a B+tree is designed so that ESE can get to the data it wants using the minimum number of disk I/Os.
A B+tree is broken down into 4KB pages. Each of these pages contains either pointers to other pages or the actual data that is being stored in the B+tree. All data is read from and written to the disk in units of 4KB. To increase performance the pages are cached in memory buffers for as long as possible, thus reducing the need to go to the disk. Pages are numbered within the ESE file from 0 up to the size of the file.
The structure of the ESE B+trees is not important to the store process, and in fact the store does not see this structure. Instead it only sees its own set of tables as described in the previous section. When the store saves a message to ESE, that message will be written to one or more pages within a B+tree in the ESE files.
The first four bytes of every page contain the checksum of the page. When a page is ready to be written to the disk, the last thing ESE does is calculate a checksum based on the data in the page and write it to the checksum field. The page header also contains the page number of the page itself.
Each time a page is read off the disk, the first thing ESE does is recalculate the checksum of the data and compare it to the checksum in the first four bytes of the page. The two should be identical. ESE also checks the page number of the page to make sure this is the page it asked for. If either of these tests fails, ESE reports a -1018 error (also known as a Jet Read Verify error). Basically, ESE is saying that the data it got off the disk was not the same data that was written to the disk.
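The two read-time checks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for ESE's real checksum algorithm (which this article does not describe), the header layout of a 4-byte checksum followed by a 4-byte page number is assumed, and `ReadVerifyError`, `build_page` and `verify_page` are hypothetical names.

```python
import struct
import zlib

PAGE_SIZE = 4096                 # 4KB pages, as described in the article
HEADER = struct.Struct("<II")    # ASSUMED layout: checksum, then page number


class ReadVerifyError(Exception):
    """Hypothetical stand-in for the -1018 (Jet Read Verify) error."""


def build_page(page_number: int, payload: bytes) -> bytes:
    """Assemble a page; the checksum is computed last, just before writing."""
    body_size = PAGE_SIZE - HEADER.size
    if len(payload) > body_size:
        raise ValueError("payload does not fit in one page")
    rest = struct.pack("<I", page_number) + payload.ljust(body_size, b"\x00")
    checksum = zlib.crc32(rest) & 0xFFFFFFFF   # crc32 is a stand-in algorithm
    return struct.pack("<I", checksum) + rest


def verify_page(raw: bytes, expected_page_number: int) -> bytes:
    """Recompute the checksum and check the page number after a read."""
    stored_checksum, stored_number = HEADER.unpack_from(raw)
    if zlib.crc32(raw[4:]) & 0xFFFFFFFF != stored_checksum:
        raise ReadVerifyError("checksum mismatch")
    if stored_number != expected_page_number:
        raise ReadVerifyError("disk returned a different page")
    return raw[HEADER.size:]
```

Flipping a single byte of a page built this way, or asking for page 8 when page 7 was read, makes `verify_page` raise, which is the situation the -1018 error reports.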
Note:
You can dump any page from an offline ESE file using ESEUTIL as follows:
Example: ESEUTIL /m I:\Exchsrvr\mdbdata\SG1MS1.edb /p100

The above command will read and dump page 100 from SG1MS1.edb. For example, if you dump page 4 and page 24 you will see that they are identical because that is where the System Catalogue is mirrored for resilience as described earlier.
Sometimes -1018 errors are returned because a heavy workload exposes a bug in the hardware or software driver, while the data physically on the disk is actually OK. This is considered a transient fault.
A hard fault means that the data is actually wrong (or corrupt) on the hard disk surface. It means something went wrong when the page was physically written or the disk became damaged afterwards.
Through experience with customers over the years, Microsoft concluded that a lot of -1018 errors are actually transient. So in Exchange 5.5 SP2 retry logic was added to ESE; if a read fails with -1018 then it retries up to 16 times with a short pause between each read. Each attempt is logged in the Application log. If after 16 retries ESE still cannot successfully read the page then the store will be dismounted. If you find transient faults in the event logs you should investigate your disk immediately as there could be an imminent failure.
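The retry behaviour described above can be sketched as a simple loop. This is a minimal sketch, not ESE's implementation: the function names, the pause length and the `log` callback are assumptions, and the sketch treats "16 times" as 16 read attempts in total.

```python
import time

MAX_READ_ATTEMPTS = 16   # figure quoted in the article


class ReadVerifyError(Exception):
    """Hypothetical stand-in for the -1018 error."""


def read_with_retry(read_page, page_number, pause_seconds=0.01, log=print):
    """Call read_page(page_number) until it succeeds or attempts run out.

    read_page is any callable that raises ReadVerifyError on a bad read.
    Every failed attempt is reported through log, mirroring the
    Application log entries mentioned in the text.
    """
    for attempt in range(1, MAX_READ_ATTEMPTS + 1):
        try:
            return read_page(page_number)
        except ReadVerifyError as exc:
            log(f"-1018 reading page {page_number}, attempt {attempt}: {exc}")
            if attempt < MAX_READ_ATTEMPTS:
                time.sleep(pause_seconds)   # short pause between reads
    # Every attempt failed: treat it as a hard fault.
    raise ReadVerifyError(
        f"page {page_number} unreadable after {MAX_READ_ATTEMPTS} attempts"
    )
```

A transient fault shows up as a few logged failures followed by a successful read; a hard fault exhausts all attempts, which is the point at which the store would be dismounted.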
Changes to the database are applied through transactions. A transaction is a series of operations which when complete leave the database in a healthy (consistent) state. An operation is the smallest unit of change to a database, and a transaction must always end with a commit operation to indicate that it has finished. By bundling operations into a transaction we can ensure that operations which leave the database in an inconsistent state can be rolled back if there is a crash halfway through a transaction.
Transactions carry the ACID properties defined below:
Atomic – A transaction is an indivisible unit of change in a database: either all operations of a transaction are completed, or else they are rolled back to the last completed transaction.
Consistent – A committed (i.e. completed) transaction always leaves the database in a consistent state.
Isolated – The operations of a transaction are not visible to other processes until the whole transaction is complete. In other words, during a transaction, other processes will see the database as it was before the transaction started. Only when the transaction is complete will they see changes in one go.
Durable – Once a transaction has completed then it is permanent. In other words you will not lose that transaction. Transaction log files help to achieve this, ensuring that a database can always be recovered to its last transaction when a failure occurs.
For example, consider moving a message from one folder to another. This involves many modifications: delete the message from the source folder, add the message to the target folder, and update the folder properties on each folder (e.g. item count, read/unread status etc). Either all of these operations must be completed or none at all, otherwise we end up with inconsistencies, such as the message disappearing if we crashed halfway through. ESE will ensure that none of the operations are permanently applied until the transaction is committed, and when it is committed all of the operations will be permanently applied.
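To make the all-or-nothing behaviour concrete, here is a small sketch of the message-move example. It uses Python's built-in `sqlite3` module purely as a stand-in transactional engine (it is not ESE), and the table layout, function names and the `crash_midway` flag are invented for the illustration.

```python
import sqlite3


def make_store():
    """Create a tiny in-memory store with two folders and one message."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE folders (id INTEGER PRIMARY KEY, name TEXT, item_count INTEGER);
        CREATE TABLE messages (id INTEGER PRIMARY KEY, folder_id INTEGER, subject TEXT);
        INSERT INTO folders VALUES (1, 'Inbox', 1), (2, 'Archive', 0);
        INSERT INTO messages VALUES (100, 1, 'Hello');
    """)
    db.commit()
    return db


def move_message(db, message_id, source, target, crash_midway=False):
    """Move a message between folders as one transaction."""
    with db:  # commits if the block finishes, rolls back if an exception escapes
        db.execute("UPDATE messages SET folder_id = ? WHERE id = ?",
                   (target, message_id))
        db.execute("UPDATE folders SET item_count = item_count - 1 WHERE id = ?",
                   (source,))
        if crash_midway:
            raise RuntimeError("simulated crash halfway through")
        db.execute("UPDATE folders SET item_count = item_count + 1 WHERE id = ?",
                   (target,))
```

Calling `move_message(db, 100, 1, 2, crash_midway=True)` raises, and afterwards the message is still in the Inbox with both item counts unchanged. Without `crash_midway` all three updates become visible together.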

For performance reasons, ESE performs all transactions in memory. However, we must consider what will happen if ESE or the machine crashes; all data in memory will be lost. Remember that once a transaction has committed it must be permanent otherwise we could lose data or introduce corruption.
Writing the transaction to the EDB/STM file is an expensive process. These files are large random-access files, and ESE would spend the majority of its time waiting for the disk heads to visit all the pages affected by the transaction (remember that data may be fragmented across multiple pages in the database file).
Instead, ESE uses very fast (and small), sequential transaction log files. Every transaction that occurs in memory must be immediately written to the end of the log file before the transaction is considered committed. That way we can guarantee that all committed transactions are indeed permanent, even if we have a crash and lose everything in memory.
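The idea of committing to a sequential log first and updating the database file later can be sketched as follows. This is a toy model under stated assumptions: the JSON record format, the class and its method names are invented, a dictionary stands in for the EDB file, and real ESE log records look nothing like this.

```python
import json
import os


class TinyWriteAheadLog:
    """Toy model: commit = append to the log; the database is updated lazily."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.dirty_pages = {}   # page number -> content, held only in memory
        self.database = {}      # stands in for the EDB file

    def commit(self, page_number, content):
        record = json.dumps({"page": page_number, "content": content})
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(record + "\n")   # sequential append to the end of the log
            log.flush()
            os.fsync(log.fileno())     # only now is the transaction durable
        self.dirty_pages[page_number] = content

    def flush(self):
        """Background step: write dirty pages to the database."""
        self.database.update(self.dirty_pages)
        self.dirty_pages.clear()


def replay(log_path):
    """Rebuild page contents from the log alone, as a recovery would."""
    recovered = {}
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            record = json.loads(line)
            recovered[record["page"]] = record["content"]
    return recovered
```

After a `commit`, the change exists in the log and in memory but not yet in `database`. If the process died at that point, `replay` could still reconstruct it from the log, which is the guarantee the paragraph above describes.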
Transaction log files are always 5MB in size for Exchange databases. Transaction log files are shared by all databases within each storage group. The current transaction log file that ESE is using is called Exx.log where xx is the storage group identifier (E00.log for the first storage group, E01.log for the second, E02.log for the third and E03.log for the fourth).
When a transaction log file becomes full, it is renamed using a 5-digit sequential Hex number. A new Exx.log is then created. Previous log files are critical to recovery procedures and should never be deleted manually because they can be used to reconstruct information which may be missing from a backup. See the Disaster Recovery section for more details.
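As a small illustration of the naming scheme, the sketch below builds the file names. It assumes the 5-digit hexadecimal number is appended directly to the Exx prefix and that numbering starts at 1; both helper functions are hypothetical.

```python
def current_log_name(storage_group: int) -> str:
    """Name of the log file ESE is currently writing, e.g. E00.log."""
    return f"E{storage_group:02d}.log"


def archived_log_name(storage_group: int, generation: int) -> str:
    """Name given to a full log file when it is renamed.

    ASSUMPTION: the 5-digit hex generation number follows the Exx prefix
    directly and starts at 1.
    """
    if not 1 <= generation <= 0xFFFFF:
        raise ValueError("generation must fit in 5 hex digits")
    return f"E{storage_group:02d}{generation:05X}.log"
```

Under these assumptions the first full log in the first storage group would be named `E0000001.log`, and the 26th `E000001A.log`.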
Transactions have to be applied to the EDB/STM files at some point in time. A background process handles this task. It is possible that modifications to the database remain in memory for many seconds before being written to the database file. This means that while the store is mounted, the database file will not contain the complete information. We say that the database is inconsistent. Consistency in this case has nothing to do with the health of the database file; the database is perfectly fine. We are simply indicating that not all of the data is in the file. In fact the EDB/STM header contains a consistency bit, which is always set to False when the database file is online.
If a database is dismounted (or shut down) cleanly, ESE will flush all transactions in memory to the file and mark the file as consistent. This indicates that all the data in the database is contained in the EDB/STM files.
If the database crashes, then the file would be inconsistent and ESE would discover this when it tried to mount the database. In this case ESE would initiate a Soft Recovery which is covered in a later section.
To view the consistency bit on a database file, run the following command to dump the database header (the file must be offline for ESEUTIL to work):
ESEUTIL /mh priv1.edb
Look for the property called State; it will indicate whether the database file is consistent. An example screen shot is given below:
Figure 1-4
Log files should never be deleted manually if you want to be able to recover data. However, as log files are generated they take up disk space. An Exchange online backup will remove log files older than the checkpoint once the database has been backed up to tape.
If you need to recover disk space fast, you can dismount all the databases in a storage group. This ensures that all transactions have been flushed to the database files. You can then delete all log files. When the stores are remounted, ESE creates a new Exx.log file and starts a new log series. You should immediately perform a full backup of Exchange, since previous backups are now invalidated because the log sequence has been reset.
Figure 1-5
The figure shows a typical cycle that occurs when a transaction needs to be executed. In the example the transaction involves modifying page number 7 in the EDB file. The process is as follows:
ESE is basically repeating this cycle continuously for transactions during normal operations. It is important to note a couple of important points:
Once the transaction log is full (5MB) then Exchange needs to create a new one:
Create new Log:
Figure 1-6
The checkpoint file is a small (8KB) file which contains information about which transactions in the log files have already been flushed to the disk. The checkpoint file points to the next un-flushed transaction in the log series. In other words, we know that every transaction before (or older than) the checkpoint has already been written to the database file.
Transactions after the checkpoint may or may not have been flushed. Remember that transactions are not flushed in the same order that they occur in the logs. ESE uses an arbitrary algorithm to flush transactions in order to free up memory.
The checkpoint is only ever used during Soft Recovery (see the Soft Recovery section later) and is in fact not essential. If the checkpoint is not available during soft recovery, ESE can still recover the database but the operation may take much longer. Hard Recovery (see the Disaster Recovery module) does not use the checkpoint at all.
The checkpoint file is called Exx.chk, where xx is the storage group designator.
Note:
To see the state of the checkpoint file you can use the following:
ESEUTIL /mk E00.chk
You can run this command even if the databases are online. You will see a label called Checkpoint: It will show three values separated by commas; the first value indicates the transaction log file that the checkpoint is at. The other two values are the offset into that transaction log file.
It is vital that ESE associates database files with their own transaction logs. Introducing transaction logs from a different set of database files (e.g. during recovery) will generate corruption in the database.
For this reason ESE uses File Signatures to verify that the correct log files are being used. Each set of log files has a unique signature. This signature is recorded in the header of every log file in a particular series. If a new series is generated (e.g. if you manually deleted log files) then a new file signature is also generated.
The database files contain a reference to the log file signatures. They also have their own file signatures which are in turn referenced by the log files. ESE cross-matches both sets of signatures when mounting a store to ensure that the log files do indeed belong to the current database files. If there is a signature mismatch then the store will not be mounted and event errors are logged.
The signatures themselves consist of a timestamp and a random number to ensure that they are unique, as shown in the slide.
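The cross-matching of signatures can be sketched like this. The field names, the 32-bit random number and the header classes are assumptions made for the illustration; the real header layouts are not described in this article.

```python
import secrets
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """A timestamp plus a random number, as described in the text."""
    created: int
    random: int


def new_signature() -> Signature:
    return Signature(int(time.time()), secrets.randbits(32))


@dataclass(frozen=True)
class LogHeader:
    log_signature: Signature
    db_signatures: frozenset   # signatures of the databases this series belongs to


@dataclass(frozen=True)
class DatabaseHeader:
    db_signature: Signature
    log_signature: Signature   # the log series this database expects


def can_mount(db: DatabaseHeader, log: LogHeader) -> bool:
    """Both directions must match before the store is mounted."""
    return (db.log_signature == log.log_signature
            and db.db_signature in log.db_signatures)
```

If the log files are replaced by a series with a different signature, `can_mount` returns `False`, which corresponds to the store refusing to mount and logging an error.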
There are two types of recovery that ESE can perform, Soft recovery and Hard recovery. Here we explain only Soft recovery. Soft recovery is an automatic process where Exchange can recover data after an unexpected shutdown such as a computer crash or forced power down.
Remember that when a database is online, its state is set to not consistent. If there is a crash or an unexpected stop then the database will not have shut down cleanly. This will cause the following:
As soon as Exchange is restarted and the database is remounted, the transactions that were still in memory just before the crash are recovered automatically. The following process happens when the store is brought back online:
*dbTime is a number which starts at zero and is incremented every time there is a change in the database. Each ESE database has a dbTime which is recorded in the EDB file header. Every time a page is modified, ESE increments dbTime and stamps the new value on the page. This allows ESE to work out whether a transaction in a log file is newer or older than the page it is trying to modify on the disk.
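The role of dbTime during log replay can be sketched as below. This is a simplified model: the record and page classes are invented, one record changes exactly one page, and the rule "apply only if the record's dbTime is newer than the page's" is the comparison the paragraph above describes, not a description of ESE's actual code.

```python
from dataclasses import dataclass


@dataclass
class Page:
    db_time: int   # dbTime stamped when the page was last modified
    data: str


@dataclass
class LogRecord:
    page_number: int
    db_time: int   # dbTime assigned to this change when it was logged
    data: str


def replay(pages: dict, log_records: list) -> int:
    """Apply logged changes that are newer than the page on disk.

    Returns the number of records that actually had to be applied.
    """
    applied = 0
    for record in log_records:
        page = pages.setdefault(record.page_number, Page(0, ""))
        if record.db_time > page.db_time:   # page on disk is older: redo
            page.db_time = record.db_time
            page.data = record.data
            applied += 1
        # otherwise the change was already flushed before the crash
    return applied
```

For a page stamped with dbTime 5, records with dbTime 4 and 5 are skipped because the page already contains them, while a record with dbTime 6 is applied.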
You can view the current dbTime of a database with the following command:
ESEUTIL /mh priv1.edb
Look for the dbTime: value. You can also view the dbTime on an individual page in the database. This value indicates when the page was last modified (or dirtied):
ESEUTIL /m priv1.edb /p150
The above example will dump page number 150 from priv1.edb. Look for the value called dbTimeDirtied
Figure 1-7
By default, on Exchange 2000, circular logging is disabled. This simply means that when ESE needs a new transaction log file it creates a brand new one by grabbing 5MB of space from the disk. In other words, it keeps all previous log files. This of course will take up disk space. The only safe way of removing previous log files is to perform regular online backups.
With circular logging enabled, when ESE needs a new transaction log file it will simply rename and overwrite an existing previous log whose transactions have already been flushed to the database file. In other words it will overwrite log files which are older than the checkpoint. If there are no log files older than the checkpoint (maybe because of high load the server has not had time to flush information to the database file) only then will it create a new log file.
Typically with circular logging you will see a handful of log files (four or five) which ESE is constantly circulating through. It is important to note that although soft recovery is still available with circular logging (because ESE only overwrites logs older than the checkpoint), hard recovery is not.
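The decision ESE makes when it needs a new log file can be sketched as a small function. The function name and return values are hypothetical, log files are represented only by their generation numbers, and reusing the oldest eligible log first is an assumption.

```python
def choose_log_action(log_generations, checkpoint_generation, circular_logging):
    """Decide how to obtain the next transaction log file.

    log_generations: generation numbers of the existing previous logs.
    checkpoint_generation: generation the checkpoint currently points into.
    Returns ("reuse", generation) or ("create", None).
    """
    if circular_logging:
        # Only logs older than the checkpoint are fully flushed and safe
        # to overwrite.
        reusable = [g for g in log_generations if g < checkpoint_generation]
        if reusable:
            return ("reuse", min(reusable))   # ASSUMPTION: oldest first
    # Circular logging disabled, or nothing is old enough to overwrite.
    return ("create", None)
```

With logs 10 to 13 on disk and the checkpoint in log 12, circular logging would reuse log 10. With the checkpoint still in log 10, or with circular logging disabled, a new file is created.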
This means that if you lose the database completely you can only recover to when the database was backed up. Without circular logging, ESE can roll forward changes made after the backup and bring the database to its most current state.
In short, never enable circular logging if your data is important enough to be backed up. The cost of an extra disk to hold transaction logs is most probably much less than the cost of losing days of data after a restore.
However, you should use circular logging on databases which are never backed up. If you do not, then your disks will quickly fill up and cause the store to stop because there is no other process which removes the log files. Examples of databases which do not need to be backed up include dedicated connector servers and public stores holding NNTP newsfeeds.
When designing a disk configuration for an Exchange server, there are two main criteria to keep in mind:
Understanding how ESE operates under normal conditions and how it can recover lost data is essential in order to design the optimum disk configuration. The points are as follows:
Fact 1
ESE can fully recover a failure in the database files, or a failure in the transaction log files, but not from both failures occurring at the same time. When we say ESE cannot fully recover from a simultaneous failure we mean that in such a scenario you can only recover up to the last back up time. You will lose data that was introduced after the backup. A full recovery will recover all data right up to the point of failure:
Fact 2
ESE is constantly writing to the log files in a sequential manner. ESE never reads from the log files under normal operations:
Important:
If using write-back caching, ensure that you use battery backed up controllers to minimise the risk of data loss during a power failure.
Fact 3
The database files (EDB and STM) are accessed randomly and will be the largest files on an Exchange server. Also, the majority of disk I/O accesses on database files are read operations; approximately twice (or more) the number of reads as there are writes:
Fact 4
The OS (system and boot partitions) and the Exchange binary files (\Program Files\Exchsrvr) are fairly static files. The paging file (Pagefile.sys) is accessed fairly regularly by Windows. Finally, the SMTP Queue folder is accessed heavily by Exchange, especially on a connector server sending and receiving many messages via SMTP. This folder is by default located with the Exchange binaries (\Program Files\Exchsrvr\Mailroot\vs1\Queue).
STM File
In previous versions of Exchange:
- All mail was stored in MAPI format (compressed rich text)
- The IMAIL process was used to convert between MAPI and Internet formats (MIME and UUencode)
- In an Internet-based environment (e.g. an ISP) this resulted in a big performance hit
In Exchange 2000:
- Messages submitted by a MAPI client are stored in MAPI format
- Messages submitted by other clients (e.g. SMTP/HTTP/WebDAV/IFS) are stored in their native format, i.e. not converted
- This reduces IMAIL conversion in both Internet and MAPI environments
- Mixed environments still need conversion
Since Exchange 2000/2003 are targeted at ISPs and Hosted Exchange providers, many such organisations are now in a pure Internet environment with very few or no MAPI clients. Therefore Exchange needs to be able to store data in its native (in this case Internet) format. However, at the same time we still need to cater for corporate organisations which are predominantly MAPI based. This is resolved in Exchange by implementing two database files for each store:
Exchange stores data in either the EDB or the STM file, deciding where to place an incoming message as it arrives. To do this Exchange must predict who is likely to access the message. If it is likely to be accessed by an Internet user then it should be placed in the STM file. If it is likely to be accessed by a MAPI client then it should go to the EDB file. The algorithm Exchange uses to do this is not very sophisticated:
Of course this simple logic will not always get it right. An incoming Internet message may be read by a MAPI client or vice-versa. In this case IMAIL has to be used to convert the message before being given to the client, as before.
Where the algorithm really succeeds is in environments which are predominantly MAPI or predominantly Internet based. A mixed environment will see relatively little gain, i.e. IMAIL will still be used extensively. Most Exchange organisations are predominantly one type or the other, so in most cases Exchange succeeds in reducing IMAIL activity.
Figure 1-13
As you have seen, the Store decides where to store messages based on the origin of the message: the EDB file for MAPI clients and the STM file for Internet clients.
There is an additional consideration. The folder view tables are all stored in the EDB file. All clients, Internet (e.g. POP3, IMAP4 and NNTP) and MAPI, use these tables to see what messages are available to download.
A client cannot download a message unless it can see it in a view table. For example, the POP3 command ‘LIST’ will result in a list of messages in the Inbox showing a message number and the message size. This is derived from a view table in the EDB file.
Therefore when a message arrives from the Internet, although it will reside in its entirety in the STM file, some of its header properties are copied to the EDB file in order to populate the view tables for that folder. This process is referred to as Property Promotion. The entry in the view table will contain a pointer to the location of the actual message (which is in the STM file).
So in the above example, after issuing a LIST command, the POP3 client may issue a RETR command to read the message off the server. The store will locate the message in the STM file using the pointer in the view table (which is in the EDB file), and stream the message directly to the client without the need to convert it using IMAIL.
MAPI messages are placed directly into the EDB file where again their properties are copied to the view tables. However, this all happens within the EDB file. Promotion only ever occurs in one direction, from the STM to the EDB file.
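Property promotion and the LIST/RETR flow can be sketched with a toy store. Everything here is illustrative: `TinyStore`, its method names and the choice of promoted properties (subject, sender, size) are assumptions, a list stands in for the STM file, and another list stands in for a view table in the EDB file.

```python
from email import message_from_bytes


class TinyStore:
    def __init__(self):
        self.stm = []          # raw Internet messages (stands in for the STM file)
        self.inbox_view = []   # promoted properties (stands in for an EDB view table)

    def deliver_internet_message(self, raw: bytes) -> None:
        """Keep the message in native format and promote a few properties."""
        self.stm.append(raw)
        parsed = message_from_bytes(raw)
        self.inbox_view.append({
            "subject": parsed["Subject"],
            "sender": parsed["From"],
            "size": len(raw),
            "stm_location": len(self.stm) - 1,   # pointer into the STM stand-in
        })

    def pop3_list(self):
        """Answer LIST from the view table alone: (message number, size)."""
        return [(n, row["size"]) for n, row in enumerate(self.inbox_view, start=1)]

    def pop3_retr(self, message_number: int) -> bytes:
        """Answer RETR by following the pointer; no format conversion needed."""
        row = self.inbox_view[message_number - 1]
        return self.stm[row["stm_location"]]
```

`pop3_list` never touches the raw message, and `pop3_retr` returns exactly the bytes that arrived, which mirrors how the message is streamed to the client without IMAIL conversion.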
Figure 1-14
There is one caveat to the process discussed above. Exchange 2000 uses SMTP to communicate between servers. So what happens when a message is sent by a MAPI client to another user on a different server? The message should end up in the EDB file, but since it arrives via SMTP the store logic will place it in the STM file, breaking our efficiency.
To handle this situation, the Store makes an exception for messages that arrive over SMTP but originated from MAPI clients. Messages which originate from MAPI clients and are sent across SMTP always have a header indicating that the content is MAPI. The header is implemented as content-type: application/ms-tnef. Such a message goes through the following procedure when it arrives on an Exchange store:
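As a rough illustration of the placement logic, including this exception, consider the sketch below. It is an assumption-laden simplification: the function name and protocol strings are invented, and it models only the final resting place of the message (EDB for MAPI content, STM for everything else), not the individual steps the store takes.

```python
def choose_store_file(arrival_protocol: str, content_type: str = "") -> str:
    """Return "EDB" or "STM" for an incoming message.

    arrival_protocol: how the message reached the store, e.g. "MAPI" or "SMTP".
    content_type: value of the message's Content-Type header, if any.
    """
    if arrival_protocol.upper() == "MAPI":
        return "EDB"   # submitted directly by a MAPI client
    # Ignore any parameters after the media type (e.g. "; name=winmail.dat").
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/ms-tnef":
        return "EDB"   # MAPI content that merely travelled over SMTP
    return "STM"       # native Internet content stays unconverted
```

A plain MIME message arriving over SMTP maps to the STM file, while a server-to-server message carrying `application/ms-tnef` maps to the EDB file even though it also arrived over SMTP.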
We have drilled down into the internal structure of the Information Store. We found relational database technology in its simplest form: it stores information in tables and uses matching values in the tables to relate information between them. If you understand Exchange Server's database technology, you can head off problems and optimize performance. We covered a lot of basics in this Part 2 of the three-part series by looking at the Exchange Information Store's structure, discussing how transaction logging works, and understanding the inside of the .EDB and .STM files.
In Part 3 I'll show you how full-text indexing indexes the content of an Exchange database, resulting in faster content searching. You will also see how Exchange full-text indexing gives your Outlook users a fast way of searching their e-mails and public folders.
Exchange Information Store Service Architecture [3]
Responsibilities of the Information Store [4]
How to Start the Microsoft Exchange Information Store Service (MSExchangeIS) [4]
Microsoft Whitepaper "Best Practices for Deploying Full-Text Indexing" [5]
Microsoft Exchange Server Information Store Viewer (MDBVU32) [6]
Links:
[1] http://www.messagingtalk.org/content/227.html
[2] http://www.messagingtalk.org/user/5/edit/newsletter
[3] http://www.microsoft.com/technet/prodtechnol/exchange/guides/E2k3TechRef/b5b94b4d-02d3-49e4-959f-b8bcf53d340b.mspx
[4] http://www.microsoft.com/technet/prodtechnol/exchange/guides/E2k3TechRef/8bc90fa8-4f2d-4ccc-81a7-3434ee1656c2.mspx
[5] http://www.microsoft.com/downloads/details.aspx?FamilyID=d7d73256-459c-4b5e-827f-256fa21dd38a&displaylang=en
[6] http://www.microsoft.com/downloads/details.aspx?familyid=3D1C7482-4C6E-4EC5-983E-127100D71376&displaylang=en#overview