
This is the first installment in a series of blog posts on SAN performance monitoring and troubleshooting.
It's a typical crisis in the data center. Another application slowdown has the team working late nights and working with vendors to determine whose equipment is causing problems. Application performance monitoring says the servers have CPU and memory to spare. Storage Resource Monitoring (SRM) tools are telling you there is plenty of capacity and bandwidth, but the application is still unresponsive and the trouble tickets are pouring in.
Sound familiar? In spite of the numerous tools available to administrators today, understanding and overcoming application I/O performance problems often requires a deeper understanding of the protocol conversations occurring between devices in the storage network.
Traditional tools leave administrators with I/O ‘blind spots’. Diagnosing problems in these blind spots forces administrators to search for clues in hard-to-reach places. This search often involves adjusting driver settings, attaching ‘taps’ to capture traffic, contacting equipment vendors, and re-educating your team on the finer details of storage protocols.
There are the obvious capacity-related slowdowns, which occur when interconnect or storage equipment is overloaded. In these cases, the adapters, switches, or storage arrays are not able to handle the load. These ‘physical’ limits can significantly affect I/O performance but are often easily resolved by procuring more capacity or redistributing the I/O load. To avoid them, most organizations deploy tools that monitor physical thresholds and report when they are exceeded, before the overload becomes a problem.
Lesser known are the many ‘soft’ performance issues caused by misbehaving or misconfigured infrastructure attached to the SAN. These harder-to-detect problems can silently degrade the performance of other devices (servers) sharing the same infrastructure. Too often, the first signs of trouble are alerts from application monitoring tools and complaints from users. Diagnosis of these issues often happens during ‘downtime’ events, and few, if any, proactive detection techniques are available.
But there is good news here: these types of soft problems can be understood using the intelligence built into every one of your SAN-attached servers. This intelligence is often hidden from users, but when exposed and understood, it can be extremely helpful in diagnosing tricky interaction problems.
This intelligence is a fundamental part of the architecture of storage networking protocols (like SCSI, Fibre Channel and Fibre Channel over Ethernet). As the initiator of an I/O transaction, the server (or the adapter within it) can expect every other device to make every effort to push warnings and failures back to it. Generally speaking, these events fall into four categories: Extended Link Service (ELS), Fabric, SCSI Protocol, and physical hardware (on the initiator). Each deals with a separate layer of the conversation, and all are important to understanding it ‘end-to-end’.
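To make the four categories concrete, here is a minimal Python sketch of how events arriving at an initiator might be bucketed. The event records, the `source` tags, and the `categorize` helper are all hypothetical, invented for illustration; real adapters and drivers expose this information in their own formats.

```python
from collections import Counter
from enum import Enum


class EventCategory(Enum):
    """The four event categories named in the text."""
    ELS = "Extended Link Service"
    FABRIC = "Fabric"
    SCSI = "SCSI Protocol"
    HARDWARE = "Physical hardware (initiator)"


# Hypothetical mapping from a made-up "source" tag to a category.
# A real driver or adapter would use its own identifiers.
SOURCE_TO_CATEGORY = {
    "els": EventCategory.ELS,
    "fabric": EventCategory.FABRIC,
    "scsi": EventCategory.SCSI,
    "hba": EventCategory.HARDWARE,
}


def categorize(events):
    """Count events per category; events with unknown tags are skipped."""
    counts = Counter()
    for event in events:
        category = SOURCE_TO_CATEGORY.get(event["source"])
        if category is not None:
            counts[category] += 1
    return counts


# Illustrative records only, not real log output.
sample_events = [
    {"source": "scsi", "detail": "illustrative only"},
    {"source": "scsi", "detail": "illustrative only"},
    {"source": "fabric", "detail": "illustrative only"},
    {"source": "hba", "detail": "illustrative only"},
]

summary = categorize(sample_events)
for category in EventCategory:
    print(f"{category.value}: {summary[category]}")
```

Keeping the counts separate per category mirrors the point that each category describes a different layer of the conversation.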
In a sense, every initiator, when properly configured, can serve as a probe for its own I/O traffic. By looking down the path to the storage it’s interacting with, each initiator can be used to construct a detailed picture of the health of the devices involved in the conversation. Taken together, the initiator views from multiple servers can be combined to form a ‘whole SAN’ view of I/O performance and health.
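As a rough illustration of combining initiator views, the Python sketch below merges per-server reports into a per-target summary. The report format `(initiator, target, event_count)`, the server and array names, and the `build_san_view` helper are assumptions made for this example, not the output of any particular tool.

```python
from collections import defaultdict


def build_san_view(reports):
    """Merge per-initiator reports into a per-target summary.

    Each report is a hypothetical (initiator, target, event_count) tuple.
    Targets with no reported events are left out of the result.
    """
    view = defaultdict(lambda: {"initiators": set(), "events": 0})
    for initiator, target, event_count in reports:
        if event_count > 0:
            entry = view[target]
            entry["initiators"].add(initiator)
            entry["events"] += event_count
    return dict(view)


# Illustrative data only: names and counts are made up.
reports = [
    ("server-a", "array-1", 3),
    ("server-b", "array-1", 5),
    ("server-b", "array-2", 0),
    ("server-c", "array-2", 1),
]

san_view = build_san_view(reports)
for target in sorted(san_view):
    entry = san_view[target]
    print(target, sorted(entry["initiators"]), entry["events"])
```

One plausible reading of such a summary, offered as a heuristic rather than a rule: when several initiators report events against the same target, the target or a shared path is worth a closer look, whereas events seen by only one initiator may point back at that server.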
In the next installment, we will talk about interpreting the SCSI conversation taking place in your SAN and using that to give you clues to potential problems brewing. So watch this space for more. And, as always, let us know what you think.