Analyzing Exchange 2000 Server Performance Problems

This document covers what to capture with Performance Monitor to troubleshoot Exchange performance problems.  Section 1 lists the Performance Objects & Counters used when analyzing a performance problem with an Exchange Server.  Section 2 explains what data is gathered by the counters and some values to look for.

NOTE: If possible, capture these counters before performance degrades.  The resulting baseline makes it possible to compare normal performance with poor performance.
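As a rough illustration of comparing against a baseline, the sketch below computes how far a counter's current average has drifted from its baseline average. The function name and the sample values are hypothetical; real samples would come from whatever counter logs were captured.

```python
from statistics import mean

def percent_change_from_baseline(baseline_samples, current_samples):
    """Return the percent change of the current mean relative to the baseline mean.

    Returns None when the baseline mean is zero, because the ratio is undefined.
    """
    base = mean(baseline_samples)
    if base == 0:
        return None
    return (mean(current_samples) - base) / base * 100.0

# Hypothetical Avg. Disk sec/Read samples, in seconds.
baseline = [0.008, 0.010, 0.012]
current = [0.020, 0.030, 0.040]
change = percent_change_from_baseline(baseline, current)
```

A large positive change on a latency counter is only a hint; it still has to be read alongside the per-counter guidance in Section 2.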

[Section 1]

Breakdown of performance objects and corresponding counters needed when troubleshooting Exchange performance issues.

Database (Information Store)
-    Database Cache Size
-    Log Record Stalls/sec
-    Log Threads Waiting
-    Log Writes/sec
-    Table Opens/sec

PhysicalDisk
-    % Disk Time
-    Avg. Disk sec/Read
-    Avg. Disk sec/Transfer
-    Avg. Disk sec/Write
-    Current Disk Queue Length
-    Disk Transfers/sec

LogicalDisk (select all instances)
(To enable additional disk counters, run "diskperf -y" at a command prompt and reboot the server.)
-    % Disk Time (only after diskperf -y)
-    % Free Space
-    Avg. Disk Queue Length
-    Avg. Disk sec/Read
-    Avg. Disk sec/Write
-    Avg. Disk sec/Transfer
-    Current Disk Queue Length
-    Free Megabytes

Process (select all instances)
-    % Processor Time
-    % User Time
-    Elapsed Time
-    Handle Count
-    Page Faults/sec
-    Page File Bytes
-    Pool Nonpaged Bytes
-    Private Bytes
-    Virtual Bytes
-    Working Set

Memory
-    Available Bytes
-    Committed Bytes
-    Page Faults/sec
-    Pages/sec
-    Pool Nonpaged Bytes
-    Pool Paged Bytes
 
Processor (select all instances)
-    % Privileged Time
-    % Processor Time
-    % User Time

Redirector
-    Bytes Total/sec
-    Network Errors/sec

MSExchangeIS
-    Active Connection Count
-    Active User Count
-    Connection Count
-    RPC Average Latency
-    RPC Operations/sec
-    User Count
-    Virus Scan Queue Length
-    VM Largest Block Size
-    VM Total 16MB Free Blocks
-    VM Total Free Blocks
-    VM Total Large Free Block Bytes   
 
Server
-    Bytes Total/sec
-    Pool Nonpaged Bytes
-    Pool Nonpaged Failures
-    Work Item Shortages

Server Work Queues
-    Active Threads
-    Bytes Sent/sec
-    Queue Length
-    Read Bytes/sec
-    Write Bytes/sec
-    Write Operations/sec

MSExchangeIS Mailbox
-    Active Client Logons
-    Average Delivery Time
-    Average Local Delivery Time
-    Message Opens/sec
-    Receive Queue Size
-    Send Queue Size

SMTP Server
-    Categorizer Queue Length

System
-    Processor Queue Length
-    System Up Time

MSExchangeIS Public
-    Average Delivery Time
-    Average Local Delivery Time
-    Folders Open/sec
-    Message Opens/sec
-    Receive Queue Size
-    Send Queue Size

TCP
-    Segments Received/sec
-    Segments Retransmitted/sec

Thread (select all instances)
-    % Processor Time
-    ID Thread
-    Thread State
-    Thread Wait Reason
 
Network Interface (select all instances)
-    Bytes Received/sec
-    Bytes Sent/sec
-    Bytes Total/sec
-    Output Queue Length

Paging File
-    % Usage
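
The object and counter names above can be combined into counter paths of the form \Object(instance)\Counter. The sketch below builds such paths for a small subset of the list; the COUNTERS table and the counter_paths helper are hypothetical names used only for illustration, and "*" is assumed to stand for "all instances".

```python
# Subset of the Section 1 list: object -> (instance or None, counter names).
COUNTERS = {
    "Memory": (None, ["Available Bytes", "Pages/sec"]),
    "PhysicalDisk": ("*", ["Avg. Disk sec/Read", "Avg. Disk sec/Write"]),
    "MSExchangeIS": (None, ["RPC Average Latency", "VM Largest Block Size"]),
}

def counter_paths(counters):
    """Return one \\Object(instance)\\Counter path per counter, in table order."""
    paths = []
    for obj, (instance, names) in counters.items():
        prefix = f"\\{obj}({instance})" if instance else f"\\{obj}"
        for name in names:
            paths.append(f"{prefix}\\{name}")
    return paths

if __name__ == "__main__":
    for path in counter_paths(COUNTERS):
        print(path)
```

Keeping the list in one table makes it easier to capture the same counters for the baseline and for the problem period.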

 

[Section 2]

Performance Objects and Counters, including their description and values to look for.

Database (Information Store)

Database Cache Size: Database Cache Size is the amount of system memory used by the database cache manager to hold commonly used information from the database file(s) to prevent file operations.  If the database cache size seems to be too small for optimal performance and there is very little available memory on the system (see Memory/Available Bytes), adding more memory to the system may increase performance.  If there is a lot of available memory on the system and the database cache size is not growing beyond a certain point, the database cache size may be capped at an artificially low limit.  Increasing this limit may increase performance.  By default, JET DBA lets the cache grow to 900 MB.

Log Record Stalls/sec: Number of log records that cannot be added to the log buffers per second because they are full.  If this counter is non-zero most of the time, the log buffer size may be a bottleneck.

Log Threads Waiting: Number of threads waiting for their data to be written to the log in order to complete an update of the database.  If this number is too high, the log may be a bottleneck.

Log Writes/sec: Number of times the log buffers are written to the log file(s) per second.  If this number approaches the maximum write rate for the media holding the log file(s), the log may be a bottleneck.

Table Opens/sec: Number of database tables opened per second.  This is a good rate counter for gauging how busy JET is.

LogicalDisk

% Disk Time: Percentage of elapsed time that the selected disk drive is busy servicing read or write requests.  A sustained value above 90 percent indicates that the hard drive is a performance bottleneck.

% Free Space: Recommended Threshold - 15%

Avg. Disk Queue Length: Average queue length during the monitoring period. This value should not average more than two in normal operating conditions.

Avg. Disk sec/Read: Average time in seconds of a read of data from the disk.

Avg. Disk sec/Write: Average time in seconds of a write of data to the disk.

Avg. Disk sec/Transfer: Avg. Disk sec/Transfer is the time in seconds of the average disk transfer.

Current Disk Queue Length: Interpreting this counter depends on the function of the logical disk being monitored. On most Exchange servers there are two key logical disks - one for the Information Store and the other for the transaction logs. The Current Disk Queue Length must be interpreted differently for each.

- The log volume should never have a queue length above 1, because the I/Os are synchronous and single-threaded. Do not assume that you do not have a disk performance problem if the queue length is not above 1. It will NEVER be above one, in normal operations (not including backup operations). If a performance problem is detected on the log volume the only real remedy is to employ a write-back cache.

- The database volume can be subject to a burst of write operations every 30 seconds, with a maximum of 64. Between two bursts, the only I/O activity is read operations. So every thirty seconds you will get peaks above the acceptable queue length, which is generally the number of spindles divided by 2. If the queue length between the peaks is larger than half the number of spindles, you are short on read I/Os and should add more spindles. To shorten the duration of the peak queue length, use write-back caching, increase the number of spindles, and possibly move from RAID5 to RAID0+1 if the RAID array controller is not very powerful.
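
The spindle rule for the database volume can be sketched as a small check. This is a minimal illustration: the function name is hypothetical, and it assumes the caller has already excluded the samples taken during the 30-second write bursts.

```python
def read_queue_within_limit(between_burst_queue_lengths, spindles):
    """Compare the average queue length between write bursts with spindles / 2.

    Returns True when the average is at or below half the number of spindles.
    """
    if not between_burst_queue_lengths:
        raise ValueError("need at least one sample")
    average = sum(between_burst_queue_lengths) / len(between_burst_queue_lengths)
    return average <= spindles / 2

# Hypothetical samples from an 8-spindle database volume.
ok = read_queue_within_limit([2, 3, 4], spindles=8)
short_on_reads = not read_queue_within_limit([5, 6, 7], spindles=8)
```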

Free Megabytes: Displays the unallocated space on the disk drive in megabytes. One megabyte = 1,048,576 bytes. 

Memory

Available Bytes: Shows the amount of physical memory, in bytes, available to processes running on the computer. Microsoft recommends keeping this value above 4000 KB (4 MB).

Committed Bytes: Displays the size of virtual memory (in bytes) that has been Committed (as opposed to simply reserved).  Committed memory must have backing (i.e., disk) storage available, or must be assured never to need disk storage (because main memory is large enough to hold it.)  This is an instantaneous count, not an average over the time interval. Acceptable average range is less than the amount of physical RAM on the server. However, before making such an assumption, check Memory > Pages/sec and Memory > Page Faults/sec. If the Memory > Pages/sec is greater than 10 (10 is a reasonable guideline, but varies with disk hardware) and Memory > Page Faults/sec is greater than Memory > Cache Faults/sec then there is too much paging.
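
The paging test described for Committed Bytes can be written as a single condition. The sketch below is illustrative only; the function name is hypothetical, and the default limit of 10 pages/sec is the guideline from the text, which varies with disk hardware.

```python
def too_much_paging(pages_per_sec, page_faults_per_sec, cache_faults_per_sec,
                    pages_limit=10):
    """Apply the rule: Pages/sec above the limit AND Page Faults/sec above Cache Faults/sec."""
    return pages_per_sec > pages_limit and page_faults_per_sec > cache_faults_per_sec

# Hypothetical averages taken from a counter log.
heavy = too_much_paging(pages_per_sec=25, page_faults_per_sec=400, cache_faults_per_sec=150)
light = too_much_paging(pages_per_sec=4, page_faults_per_sec=400, cache_faults_per_sec=150)
```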

Page Faults/sec: This is the actual count of the times that application data was not found in its physical memory working set and had to be paged from disk. This counter should never stay at a consistently high value.

Pages/sec: This counter is a primary indicator of the type of page faults that can significantly slow down your system. It is the sum of Memory > Pages Input/sec and Memory > Pages Output/sec.  It is not strictly a count of pages moved to or from disk, because it includes pages on the standby list that may not hit disk.  Use Memory > Page Reads/sec and Memory > Page Writes/sec to see how many disk I/Os your paging is doing.  Microsoft recommends keeping this value below 20, and 5 pages/sec is a better target.  Once this counter averages consistently at 10 or above, performance is significantly degraded and disk thrashing is probably occurring.

Pool Nonpaged Bytes: Limited kernel memory counter that usually maxes out around 96 MB on servers with more than 512 MB of RAM.  A leak here is usually indicative of a driver leak (HBA/SCSI etc).  The system will become unresponsive and may blue-screen when the pool is maxed out.  Nonpaged Pool pages cannot be paged out to the paging file, but instead remain in main memory as long as they are allocated.  If this value is steadily increasing, look for a memory leak (this should be pretty level).

Pool Paged Bytes: Limited kernel memory counter that usually maxes out around 196 MB on a server that has more than 1024 MB of RAM and the /3GB switch set (270 MB without the /3GB switch).  When this maxes out the server can become unresponsive.  Growth in this value can be indicative of handle leaks (check process handles counters) or a growing SMTP queue.  .NET server also began caching files more aggressively so this counter can grow due to the system cache.

MSExchangeIS

Active Connection Count: Number of connections that have shown some activity in the last 10 minutes.  Baseline required, depends on number of users.

Active User Count: Number of user connections that have shown some activity in the last 10 minutes.  Baseline required, depends on number of users.

Connection Count: Connection Count is the number of client processes connected to the information store.  Baseline required, depends on number of users.

RPC Average Latency: Look at the average RPC latencies.  This is usually in the 10-20ms range on healthy servers.

RPC Operations/sec: RPC Operations/sec is the rate that RPC operations occur.  Baseline required, depends on number of users.

User Count: User Count is the number of users connected to the information store.  Baseline required, depends on number of users.

Virus Scan Queue Length: Current number of outstanding requests that are queued for virus scanning.

VM Largest Block Size: Size of the largest free virtual memory block.  This counter is a line that slopes down as virtual memory is consumed. When this counter drops below 32 MB, Exchange 2000 SP1+ logs a warning in the event log (Event ID=9582) and logs an error if this drops below 16 MB. It is important to monitor this counter to ensure that it stays above 32 MB.
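
The 32 MB and 16 MB thresholds can be expressed as a simple classification. This is a sketch under the assumption that the counter value is reported in bytes; the function name and the returned labels are hypothetical.

```python
MB = 1024 * 1024

def vm_largest_block_status(size_bytes):
    """Classify VM Largest Block Size against the 32 MB warning and 16 MB error levels."""
    if size_bytes < 16 * MB:
        return "error"
    if size_bytes < 32 * MB:
        return "warning"
    return "ok"
```

For example, a value of 40 MB falls in the "ok" band, 20 MB in "warning", and 8 MB in "error".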

VM Total 16MB Free Blocks: Total number of free Virtual Memory blocks larger than or equal to 16MB.  This line forms a pyramid as you monitor it. It starts with one block of virtual memory greater than 16 MB and progresses to smaller blocks greater than 16 MB. Monitoring the trend on this counter should allow a system administrator to predict when the number of 16 MB blocks is likely to drop below 3, at which point restarting all the services on the node is recommended.
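
One way to predict when the number of 16 MB blocks will drop below 3 is to fit a straight line to logged samples and extrapolate. The sketch below uses an ordinary least-squares fit; the function name and the (hours, block count) sample format are assumptions, and a straight-line trend is only an approximation of real fragmentation behavior.

```python
def hours_until_threshold(samples, threshold=3):
    """Estimate the time (in hours) at which a fitted line reaches the threshold.

    samples is a list of (hours, block_count) pairs. Returns None when there are
    too few samples or the fitted trend is not declining.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (b - mean_b) for t, b in samples) / var_t
    if slope >= 0:
        return None
    intercept = mean_b - slope * mean_t
    return (threshold - intercept) / slope

# Hypothetical log: the block count falls by one every hour.
estimate = hours_until_threshold([(0, 10), (1, 9), (2, 8)])
```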

VM Total Free Blocks: Displays the total number of free virtual memory blocks regardless of size. This line forms a pyramid as you monitor it. This counter can be used to measure the degree to which available virtual memory is being fragmented. The average block size is the Process\Virtual Bytes\STORE instance divided by MSExchangeIS\VM Total Free Blocks.
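
The average block size calculation mentioned above is a single division. The sketch below is illustrative; the function name is hypothetical, and both inputs are assumed to be read at the same sample time.

```python
def average_block_size(store_virtual_bytes, vm_total_free_blocks):
    """Divide Process\\Virtual Bytes (STORE instance) by MSExchangeIS\\VM Total Free Blocks."""
    if vm_total_free_blocks <= 0:
        raise ValueError("VM Total Free Blocks must be positive")
    return store_virtual_bytes / vm_total_free_blocks
```

A falling result over time suggests that the free space is splitting into smaller pieces.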

VM Total Large Free Block Bytes: Displays the sum in bytes of all the free virtual memory blocks that are greater than or equal to 16 MB. This line slopes down as memory is consumed.  This counter monitors store memory fragmentation.  Should stay above 50 MB on a healthy server.

MSExchangeIS Mailbox

Active Client Logons: Active Client Logons is the number of clients that performed any action within the last ten minute time interval.  Baseline required, depends on number of users.

Average Delivery Time: Average Delivery Time is the average time between the submission of a message to the mailbox store and submission to other storage providers for the last 10 messages.  Make note of this delay time when the load is low. A high value could indicate a performance problem with the MTA.

Average Local Delivery Time: The average length of time the last 10 local delivery messages waited for transport to a mailbox in the same information store. Make note of this delay time when the load is low. A high value could indicate a performance problem with the private information store. This counter should never remain at a non-zero value for longer than a few seconds.

Message Opens/sec: Message Opens/sec is the rate that requests to open messages are submitted to the information store.  Baseline required, depends on number of users.

Receive Queue Size: Receive Queue Size is the number of messages in the mailbox store's receive queue.  Should be close to 0 most of the time.

Send Queue Size: Send Queue Size is the number of messages in the mailbox store's send queue.  Should be close to 0 most of the time.

MSExchangeIS Public

Average Delivery Time: The average time between the submission of a message to the public store and submission to other storage providers for the last 10 messages.  Probably should be close to 0 most of the time.

Average Local Delivery Time: Average Local Delivery Time is the average time between the submission of a message to the public store and the delivery to all local recipients (recipients on the same server) for the last 10 messages.  Probably should be close to 0 most of the time.

Folders Open/sec: Folder opens/sec is the rate that requests to open folders are submitted to the information store.  This is a good indication of the Public Folder usage on the server.  Baseline required, depends on number of users.

Message Opens/sec: Message Opens/sec is the rate that requests to open messages are submitted to the information store.  Baseline required, depends on number of users.

Receive Queue Size: Receive Queue Size is the number of messages in the public store's receive queue.  Should be close to 0 most of the time.

Send Queue Size: Send Queue Size is the number of messages in the public store's send queue.  Should be close to 0 most of the time.

Network Interface

Bytes Received/sec: Bytes Received/sec is the rate at which bytes are received on the interface, including framing characters.  Baseline required, depends on number of users.

Bytes Sent/sec: Bytes Sent/sec is the rate at which bytes are sent on the interface, including framing characters.  Baseline required, depends on number of users.

Bytes Total/sec: Bytes Total/sec is the rate at which bytes are sent and received on the interface, including framing characters.  Baseline required, depends on number of users.

Output Queue Length: Indicates the length of the output packet queue. A queue length of 1 or 2 is often satisfactory. Longer queues indicate that the adapter is waiting for the network and thus cannot keep pace with the server.

Paging File

% Usage: Shows the amount of the paging file in use during the sample interval, as a percentage. A high value indicates that you may need to increase the size of your Pagefile.sys file or add more RAM. Microsoft recommends keeping this value below 75 percent.

PhysicalDisk

% Disk Time: Percentage of elapsed time that the selected disk drive is busy servicing read or write requests. This should be under 50%.

Avg. Disk sec/Read: Check the specified transfer rate for your hard disks to verify that this rate does not exceed the specifications. Some SCSI disks can handle 50 to 70 I/O operations per second. Recommended Threshold - depends on manufacturer’s specification, but generally should be below 20ms.

Avg. Disk sec/Transfer: Indicates how fast data is being moved, in seconds. A high value might mean that the system is retrying requests due to lengthy queuing or, less commonly, a disk failure. There are no benchmark recommendations from Microsoft. Watch for significant variances from baseline data.

Avg. Disk sec/Write: See Avg. Disk sec/Read.  Should be below 20ms.

Disk Transfers/sec: Shows the number of completed read and write operations per second. This counter is a rate, not a percentage, so judge disk utilization by comparing it with the I/O rate the disk subsystem can sustain. Values over 50 percent of that rate might indicate that the disk is becoming a bottleneck.

Current Disk Queue Length: See LogicalDisk > Current Disk Queue Length.  The queue should hit 0 at least 4 times in a 60-second sample.
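
The "hits 0 at least 4 times" guideline can be checked by counting zero readings. This sketch assumes the list passed in holds the Current Disk Queue Length readings from one 60-second window; the function name is hypothetical.

```python
def queue_drains_often_enough(samples_60s, min_zero_hits=4):
    """Return True when the queue length reads 0 at least min_zero_hits times."""
    zero_hits = sum(1 for queue_length in samples_60s if queue_length == 0)
    return zero_hits >= min_zero_hits

# Hypothetical readings taken every 5 seconds for one minute.
healthy = queue_drains_often_enough([0, 2, 0, 1, 0, 3, 0, 0, 1, 2, 0, 1])
busy = queue_drains_often_enough([3, 2, 4, 1, 0, 3, 5, 2, 1, 2, 6, 1])
```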

Process

% Processor Time: Records the percentage of time the processor is running non-idle threads. If your server has multiple processors, you can watch each instance. Microsoft Exchange Server services can use multiple processors. An average value that is below 20 percent indicates the server is unused or services are down. An average value that is consistently above 75-80 percent indicates that the server is overburdened.
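
The two % Processor Time bands can be sketched as a classification of the average value. The function name and labels are hypothetical, and 75 is used here as the lower edge of the "75-80 percent" range given above.

```python
def classify_processor_time(average_percent, low=20.0, high=75.0):
    """Label an average % Processor Time value using the bands from the text."""
    if average_percent < low:
        return "unused or services down"
    if average_percent > high:
        return "overburdened"
    return "normal"
```

The input should be an average over a meaningful period, since short spikes above the upper band are expected.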

% User Time: The percentage of elapsed time that this process' threads have spent executing code in user mode.  Applications, environment subsystems and integral subsystems execute in user mode.  Code executing in user mode cannot damage the integrity of the Windows NT Executive, Kernel, and device drivers.  Unlike some early operating systems, Windows NT uses process boundaries for subsystem protection in addition to the traditional protection of user and privileged modes.  These subsystem processes provide additional protection.  Therefore, some work done by Windows NT on behalf of your application might appear in other subsystem processes in addition to the privileged time in your process.

Elapsed Time: Records the number of seconds a process has been running. It gives you a quick way to see whether a server or service has recently been restarted without looking through the event log. A zero value here indicates a non-active process.

Handle Count: Sum of the handles currently open by each thread in the process.  MAD, MTA and Store handles should remain fairly constant.  Inetinfo handles can grow radically during queue buildup.

Page Faults/sec: This counter can be used to monitor individual processes. This helps to identify the process that is suffering most from lack of virtual memory.

Page File Bytes: The current number of bytes this process has used in the paging file(s).  Paging files are used to store pages of memory used by the process that are not contained in other files.  Paging files are shared by all processes, and lack of space in paging files can prevent other processes from allocating memory.

Pool Nonpaged Bytes: Number of bytes in the nonpaged pool, an area of system memory (physical memory used by the operating system) for objects that cannot be written to disk, but must remain in physical memory as long as they are allocated.  Memory: Pool Nonpaged Bytes is calculated differently than Process: Pool Nonpaged Bytes, so it might not equal Process: Pool Nonpaged Bytes: _Total.  This counter displays the last observed value only; it is not an average. 

Private Bytes: Current number of bytes this process has allocated that cannot be shared with other processes.  MAD, MTA and Store private bytes should remain fairly constant except when background tasks run.  Inetinfo private bytes can grow radically during queue buildup.

Virtual Bytes: Current size in bytes of the virtual address space the process is using.  Use of virtual address space does not necessarily imply corresponding use of either disk or main memory pages.  Virtual space is finite, and by using too much, the process can limit its ability to load libraries.  Virtual bytes should remain fairly constant across processes.  Virtual bytes is most important for the store process, which has only 2 GB of virtual address space to work with, or 3 GB when running with the /3GB switch.  On a large server with the /3GB switch, this counter should stay below 2.8 GB.

Working Set: Current number of bytes in the Working Set of this process.  The Working Set is the set of memory pages touched recently by the threads in the process.  If free memory in the computer is above a threshold, pages are left in the Working Set of a process even if they are not in use.  When free memory falls below a threshold, pages are trimmed from Working Sets.  If they are needed they will then be soft-faulted back into the Working Set before they leave main memory.  MAD, MTA and Store working set should remain fairly constant except when background tasks run.  Inetinfo working set can grow radically during queue buildup.

Processor

% Privileged Time: The percentage of non-idle processor time spent in privileged mode.  (Privileged mode is a processing mode designed for operating system components and hardware-manipulating drivers.  It allows direct access to hardware and all memory.)  A high rate of privileged time might be attributable to a large number of interrupts generated by a failing device.  This counter displays the average busy time as a percentage of the sample time.

% Processor Time: % Processor Time is the percentage of time that the processor is executing a non-Idle thread.  This counter displays the average percentage of busy time observed during the sample interval.  It is calculated by monitoring the time the service was inactive, and then subtracting that value from 100%.

% User Time: Percentage of processor time spent in User Mode in non-Idle threads.  All application code and subsystem code execute in User Mode.  The graphics engine, graphics device drivers, printer device drivers, and the window manager also execute in User Mode.  Code executing in User Mode cannot damage the integrity of the Windows NT Executive, Kernel, and device drivers.  Unlike some early operating systems, Windows NT uses process boundaries for subsystem protection in addition to the traditional protection of User and Privileged modes.  These subsystem processes provide additional protection.  Therefore, some work done by Windows NT on behalf of your application may appear in other subsystem processes in addition to the Privileged Time in your process. This should stay under 75%.

Redirector

Bytes Total/sec: Measures the number of bytes per second sent and received by the network redirector. Compare the maximum throughput of your network card with the maximum value of this counter to see if network traffic is a bottleneck in your system.

Network Errors/sec: Measures the number of unexpected errors the redirector receives. If you suspect network problems, check to see whether this counter is above zero. If it is above zero, check the system event log for details on the network error.

Server

Bytes Total/sec: If the sum of bytes total/sec for all servers is roughly equal to the maximum transfer rates of your network, you might need to segment the network.

Pool Nonpaged Bytes: The number of bytes of non-pageable computer memory the server is using.  This value is useful for determining the values of the MaxNonpagedMemoryUsage value entry in the Windows NT Registry.

Pool Nonpaged Failures: Indicates the number of times that allocations from the nonpaged pool have failed. If this number is high, either the amount of RAM is too little or the pagefile is too small or both. If this number is consistently increasing, increase the physical RAM and the size of the pagefile.

Work Item Shortages: If the value reaches the recommended threshold (3), consider tuning the InitWorkItems or MaxWorkItems entries in the registry (in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\Parameters).

Server Work Queues

Active Threads: Number of threads currently working on a request from the server client for this CPU.  The system keeps this number as low as possible to minimize unnecessary context switching.  Baseline required, depends on number of users.

Bytes Sent/sec: The rate at which the Server is sending bytes to the network clients on this CPU.  This value is a measure of how busy the Server is.  Baseline required, depends on number of users.

Queue Length: If the value reaches the recommended threshold (4), there might be a processor bottleneck. This is an instantaneous counter; observe its value over several intervals.

Read Bytes/sec: The rate the server is reading data from files for the clients on this CPU.  This value is a measure of how busy the Server is.  Baseline required, depends on number of users.

Write Bytes/sec: The rate the server is writing data to files for the clients on this CPU.  This value is a measure of how busy the Server is.  Baseline required, depends on number of users.

Write Operations/sec: The rate the server is performing file write operations for the clients on this CPU.  This value is a measure of how busy the Server is.  This value will always be 0 in the Blocking Queue instance.

SMTP Server

Categorizer Queue Length: Indicates how well SMTP is performing LDAP lookups against global catalog servers (GCs).  It should stay at or around zero, although it can rise occasionally while distribution lists (DLs) are being expanded.  This is an excellent indicator of GC health: if the GCs are slow, this counter creeps up.

System

Processor Queue Length: Number of threads in the processor queue.  There is a single queue for processor time even on computers with multiple processors.  Unlike the disk counters, this counter counts ready threads only, not threads that are running.  A sustained processor queue of greater than two threads generally indicates processor congestion.
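
A "sustained" queue can be approximated by requiring every reading in a window to exceed the limit. That definition of sustained is an assumption made for this sketch, as is the function name; a longer window gives a stricter test.

```python
def processor_queue_congested(samples, limit=2):
    """Return True when every sample in the window is above the limit."""
    return bool(samples) and all(queue_length > limit for queue_length in samples)

# Hypothetical windows of Processor Queue Length readings.
congested = processor_queue_congested([4, 5, 3, 6, 4])
spiky = processor_queue_congested([0, 7, 1, 0, 2])
```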

System Up Time: System Up Time is the elapsed time (in seconds) that the computer has been running since it was last started.  This counter displays the difference between the start time and the current time.

TCP

Segments Received/sec: Shows the rate at which segments are received, including those received in error. This count includes segments received on currently established connections. A low value means that you have too much broadcast traffic.

Segments Retransmitted/sec: Gives the rate at which segments containing one or more previously transmitted bytes are retransmitted. A high value might indicate either a saturated network or a hardware problem.

Thread

% Processor Time: The percentage of elapsed time that this thread used the processor to execute instructions.  An instruction is the basic unit of execution in a processor, and a thread is the object that executes instructions.  Watch for threads that consume a lot of processor time; this might indicate spinning threads or a deadlock condition.  Either condition would require a dump of the process to determine the exact issue.

ID Thread: The unique identifier of this thread.  ID Thread numbers are reused, so they only identify a thread for the lifetime of that thread.

Thread State: The current state of the thread.  It is 0 for Initialized, 1 for Ready, 2 for Running, 3 for Standby, 4 for Terminated, 5 for Wait, 6 for Transition, 7 for Unknown.  A Running thread is using a processor; a Standby thread is about to use one.  A Ready thread wants to use a processor, but is waiting for a processor because none are free.  A thread in Transition is waiting for a resource in order to execute, such as waiting for its execution stack to be paged in from disk.  A Waiting thread has no use for the processor because it is waiting for a peripheral operation to complete or a resource to become free.

Thread Wait Reason: Thread Wait Reason is only applicable when the thread is in the Wait state (see Thread State).  It is 0 or 7 when the thread is waiting for the Executive, 1 or 8 for a Free Page, 2 or 9 for a Page In, 3 or 10 for a Pool Allocation, 4 or 11 for an Execution Delay, 5 or 12 for a Suspended condition, 6 or 13 for a User Request, 14 for an Event Pair High, 15 for an Event Pair Low, 16 for an LPC Receive, 17 for an LPC Reply, 18 for Virtual Memory, 19 for a Page Out; 20 and higher are not assigned at the time of this writing.  Event Pairs are used to communicate with protected subsystems (see Context Switches).
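
When reading raw counter logs, the numeric thread state and wait reason codes listed above can be decoded with lookup tables. The sketch below takes its code-to-name mappings from the descriptions in this section; the table names, the describe_thread helper, and the fallback labels are hypothetical.

```python
THREAD_STATES = {
    0: "Initialized", 1: "Ready", 2: "Running", 3: "Standby",
    4: "Terminated", 5: "Wait", 6: "Transition", 7: "Unknown",
}

# Codes 0-6 and 7-13 share the same seven reasons.
_PAIRED_REASONS = [
    "Executive", "Free Page", "Page In", "Pool Allocation",
    "Execution Delay", "Suspended", "User Request",
]
WAIT_REASONS = {}
for code, name in enumerate(_PAIRED_REASONS):
    WAIT_REASONS[code] = name
    WAIT_REASONS[code + 7] = name
WAIT_REASONS.update({
    14: "Event Pair High", 15: "Event Pair Low", 16: "LPC Receive",
    17: "LPC Reply", 18: "Virtual Memory", 19: "Page Out",
})

def describe_thread(state, wait_reason):
    """Return a readable label; the wait reason is only shown for state 5 (Wait)."""
    state_name = THREAD_STATES.get(state, "Unrecognized")
    if state != 5:
        return state_name
    return f"{state_name} ({WAIT_REASONS.get(wait_reason, 'Unassigned')})"
```

The wait reason is ignored unless the state is 5, matching the note that it only applies to waiting threads.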

Additional References

Q317411 XADM: How to Gather Data to Troubleshoot Exchange Virtual Memory

Q266096 XGEN: Exchange 2000 Requires /3GB Switch with More Than 1 GB RAM

Q296073 XADM: Monitoring for Exchange 2000 Memory Fragmentation

Q273177 XADM: Exchange Database Engine Counters Not Installed

Q302254 XADM: Computer That Is Running Exchange 2000 and Windows 2000 Server May Run Out of Virtual Memory with Event ID 12800

Q289109 XADM: Limit the Number of Messages Opened for Each MAPI Logon

Q319937 XADM: Perfmon counter for Active users not accurate

