The following are examples of real, high-priority data center performance or outage issues. I/O management helped data center managers set, monitor and control the right service level agreements for their in-house or outsourced environments. In some cases, I/O management techniques were applied to get to the root cause of an issue quickly. In others, traditional methods were used for days, with little progress, before I/O management was brought into the picture.
Do you need to outsource your storage but have reservations? Manage it with OneCommand™ Vision!
Internal storage IT SLAs: define and manage them with OneCommand™ Vision!
Manage the blind spot between the server and the storage with OneCommand™ Vision
Financial trading application is intermittently down, for days, with no clear explanation
Video editing performance is slow, but all the king’s horses (tools) and all the king’s men (experts) can’t restore performance
I/O contention causes Big Data application slow-down
1) Do you need to outsource your storage but have reservations? Manage it with OneCommand™ Vision!
Many companies are considering the financial benefit of outsourcing their on-premise storage infrastructure to a third party. Others, however, are hesitant about losing control over that infrastructure, particularly when it comes to ensuring adequate and competitive levels of performance to keep the business healthy. At a business level, it makes financial sense to outsource, reducing capital expenditures (CapEx) and operational expenses (OpEx), while maintaining a competitive performance level. But data center managers feel a little uneasy over their loss of control. This is where OneCommand Vision excels. It is often used to manage a data center’s third-party storage service level agreements (SLAs) from their domain.
Today storage IT SLAs are a vital part of business service management because so many companies rely on third-party vendors to host all or parts of their storage. To ensure quality service delivery, accelerate problem identification and protect customer experiences, Emulex customers turn to OneCommand Vision to validate application performance and report on the compliance of external providers.
OneCommand Vision is a tried and trusted software product that allows data center managers to monitor IT storage SLAs and performance metrics from outside the storage "cloud." OneCommand Vision helps businesses monitor application performance against a defined set of objectives that have been agreed to by external providers, and accelerate problem identification when issues occur.
IT storage SLAs and objectives are a key component of successful external vendor relationships. When SLAs are not being met, both the business and IT need to be quickly alerted to the problem. If you use OneCommand Vision to ensure that your SLAs are being met, your team will have the diagnostic detail needed to quickly determine the root cause of the problem and restore compliance before brand damage occurs and customer experience is impacted.
OneCommand Vision puts all the control you had with internally supported storage back into your hands with your externally supported storage. Feel confident about your outsourced storage by using OneCommand Vision to monitor your storage IT SLAs.
2) Internal storage IT SLAs: define and manage them with OneCommand™ Vision!
Defining and managing storage performance can cause conflict within an organization. A business unit VP reported, “We can’t compete if the application is running too slow.” That's when fingers start pointing, and storage becomes an easy target. This usually happens for two reasons: (1) there have been times when a storage slowdown slowed the application which, in turn, slowed the business; and (2) in many of these same cases, storage was originally deemed to be running just fine. Despite the fact that many times the storage is not to blame, storage performance has assumed the position of guilty until proven innocent. The reason is that storage performance management is difficult. It is almost always monitored from the storage end of the equation, causing both false-negatives and false-positives during root cause analysis of issues that are impacting the business. This no longer has to be the case with Emulex OneCommand Vision which is used to manage a data center’s internal IT storage SLAs.
Today's leading-edge enterprises and providers are establishing storage performance and availability SLAs as a vital part of business service management and of staying competitive. To ensure quality service delivery, accelerate problem identification and protect customer experiences, Emulex customers turn to OneCommand Vision to validate application performance and report on SLA compliance of internal providers.
OneCommand Vision is a tried and trusted software product that allows data center managers to establish and monitor IT storage SLAs and performance metrics from where they matter, the application interface. OneCommand Vision helps businesses monitor the critical storage component of application performance against defined sets of objectives that have been agreed to by both IT providers and the business units they support. This results in proactive performance management that keeps the business running at full speed.
IT storage SLAs and objectives are key components of successful relationships between an internal provider and the business units it supports. When SLAs are not being met, both IT and the business unit need to be quickly alerted to the problem.
OneCommand Vision fosters a teamwork approach between a business unit and the internal IT storage provider supporting it. When you use OneCommand Vision to ensure that your SLAs are being met by your internal provider, your team will not only know when SLAs are not being met, but will also have the diagnostic detail needed to quickly determine the root cause of the problem and restore compliance before brand damage occurs and customer experience is impacted.
OneCommand Vision gives you the control and management capabilities you need with your internally supported storage. Feel confident about your internal storage provider by using OneCommand Vision to monitor your storage IT SLAs.
3) Manage the blind spot between the server and the storage with OneCommand™ Vision
When I/O slowdowns occur, there is a significant difference in what the application is seeing for storage I/O performance and what the network or storage device is seeing, creating a blind spot between the server and the storage. Many enterprises are learning this the hard way. They spend money on storage management software that has monitoring capabilities on the storage side. But then they find that they continue to have a hard time helping the team answer the question, "is the application slowdown related to storage performance?" They then decide to buy tools that delve much deeper, maybe interrogate the network itself, only to arrive at the same dilemma again. Now they've spent additional money, have ongoing OpEx for a complex system, but still can't reliably answer the question, "is this a storage problem or not?"
The reason is embedded in a complex topic called queuing theory. The best way to describe it is to think about how you would answer the question, "how early should I leave for the airport?" You immediately start to think about what kind of traffic you might run into, if you have to go to the ticket counter and how long those lines might be, how busy airport security is going to be, etc. The answer is likely "anywhere from 45 minutes to two hours." Now consider that your flight is a one-hour flight. The variability in how long it takes to get to the airport is longer than the flight itself. This is how storage works. The duration of the flight is what you are considering when you are monitoring performance at the storage side. This is certainly an important factor. But, given the example and how storage I/O works, it is easy to understand why only monitoring performance at the storage side does not answer the question of, "how is the storage performance impacting the business?" It is because the storage I/O starts when the application requests it and there are many queues, lots of waiting in line if you will, that can happen before the request gets to the storage.
If you lose sight of that, you lose sight of reality. Consider the following, which uses the airport analogy to show what often results in extended periods of time when you are chasing the wrong problem, thereby impacting the business, sometimes for months:
- It is often a storage-related slowdown when all your performance data would tell you otherwise. This is the case when the flight took an hour, but traffic resulted in a one-hour drive to the airport.
- It is also often not a storage-related issue when you think it is. This is the case when the flight took 1.5 hours, 50% more than normal, but the trip to the airport only took 20 minutes and there was no line at security.
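The two cases above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how OneCommand Vision works internally: the `IoSample` type, its field names and the millisecond figures are all hypothetical. The idea is that the time an application waits is the sum of the time spent queued before the array (the drive to the airport) and the array's own service time (the flight), so a slowdown has to be attributed by comparing both parts against a baseline.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    """Hypothetical per-I/O timing, split the way the airport analogy splits a trip."""
    queue_wait_ms: float     # time queued before the request reaches the array ("drive to the airport")
    array_service_ms: float  # time the array itself reports for the I/O ("the flight")

    @property
    def end_to_end_ms(self) -> float:
        # What the application actually experiences.
        return self.queue_wait_ms + self.array_service_ms


def dominant_component(sample: IoSample, baseline: IoSample) -> str:
    """Say which part of the path grew the most relative to the baseline."""
    queue_delta = sample.queue_wait_ms - baseline.queue_wait_ms
    array_delta = sample.array_service_ms - baseline.array_service_ms
    if max(queue_delta, array_delta) <= 0:
        return "no slowdown"
    if queue_delta > array_delta:
        return "queueing before the array"
    return "array service time"


# Illustrative numbers only (assumed, not measured).
baseline = IoSample(queue_wait_ms=2.0, array_service_ms=8.0)
traffic_jam = IoSample(queue_wait_ms=20.0, array_service_ms=8.0)   # array looks normal, application is slow
slow_flight = IoSample(queue_wait_ms=2.0, array_service_ms=12.0)   # array is 50% slower than normal

print(traffic_jam.end_to_end_ms, dominant_component(traffic_jam, baseline))
print(slow_flight.end_to_end_ms, dominant_component(slow_flight, baseline))
```

In the first sample the array-side number is unchanged, so a tool that only watches the array would report nothing wrong even though the application waits almost three times as long.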
Emulex customers are using OneCommand Vision to address this issue. They refer to the problem as the storage performance "blind spot." They use OneCommand Vision to understand the real storage performance, from the application's perspective, along with Vision's unique analytics and correlation engine, to determine where slowdowns are happening downstream. Using traditional tools leaves you with traditional problems. Namely, it's never a storage problem when you think it is. And, it's always a storage problem when you’re sure it isn't. With OneCommand Vision, you have control of, and can manage, all of the possible slowdown spots.
Without a clear view of the blind spot between the server and the storage, you can experience I/O-related degradation issues, causing operational and economic pain points for your organization. When network performance becomes slow or unpredictable, employee productivity is affected and so is your competitiveness. Application outages can seriously damage your brand image and customer experience. This blind spot can make it difficult to maximize I/O infrastructure utilization without impacting performance, resulting in IT organizations that just throw more hardware at the problem.
Emulex OneCommand Vision is an economical and insightful solution that can readily tell you where the slowdown is: server, network or storage domain, and which specific infrastructure pieces are involved.
With OneCommand Vision, you not only catch slowdowns in the blind spot, you also have a much more manageable job fixing the problem before it causes a competitiveness issue or business outage. According to Gartner’s Sept 2011 Magic Quadrant for Application Performance Monitoring, “half or more of the database time is attributable to storage I/O.” Your applications operate on storage. If your storage is slower than your competition's, it’s like racing with a slower car. Don’t let the blind spot between the server and the storage slow you down. Use OneCommand Vision and know you are getting the most out of your applications.
For more information, read this interesting blog on the blind spot.
4) Financial trading application is intermittently down, for days, with no clear explanation
A large Wall Street institution just had a catastrophic application outage. Their derivatives trading application is down, preventing critical market data from reaching the floor traders who depend on such data to conduct their business. The issue is quickly raised to the data center CTO level. In the middle of this outage, the team learns that the immediate issue is the application has lost all visibility to its storage, including primary and redundant paths. What caused the loss of visibility is completely unclear.
All of the vendors are brought around the table for a tense, three-day, finger-pointing session. Included are the OS vendor, the server vendor, the multi-pathing vendor, the volume management vendor, the fabric vendor, the Host Bus Adapter (HBA) vendor and the RAID array vendor. The business unit managers are on hand to repeatedly remind everyone of the financial impact of the outage.
Vendors first scramble to ensure that the configuration of the offending servers is pristine. This takes over a day, as more and more configuration variables are surfaced over multiple roundtable sessions. In the end, it is determined that all the configurations are up to speed and, further, the server configuration has been stable for over a year, running the same application. Other variables are also validated, including the fabric configuration and the primary storage configuration. At this point, a change to the primary storage controller microcode is noted, but quickly dismissed as unassociated with the negative system behavior. This is not taken lightly, with technicians weighing in to come to this conclusion. The microcode change does not exactly coincide with the occurrence of the problem (it happened weeks before), and the same code change is working very well in the other 98%+ of the data center installation.
While trying to avoid perturbations to the environment, the team decides to install network analyzers in an attempt to understand what may be happening at the protocol layer. The analyzers are programmed to capture the conditions that cause the issue, but they fail to gather helpful information. In addition, the outage is affecting anywhere from one to ten application servers, and the analyzer cannot be placed on all of the appropriate links. Either the analyzer is not inserted in the right place, it is not triggered properly to capture the right window of information, or the information is captured but is too hard to find within the massive amount of data in the logs.
After the team fails to isolate the cause in three days, I/O management (IOM) experts are solicited for advice. A decision is made to enable logging on all affected servers. Based on their experience solving similar issues at other customer sites, the IOM team selects an important subset of protocol events to capture at the production server. This allows collection of information on all potentially affected servers and makes an overrun of log data unlikely. They apply the I/O protocol analysis technique, simultaneously, to all affected servers. Within four hours, the problem recurs. This time, important SCSI and Fibre Channel protocol events from the logs are captured and interpreted. In response to a target reset, the storage array is logging out all attached servers, and the affected servers are not appropriately logging back into the array.
The focus immediately turns to the devices that are mapped to the storage array port in question. The configuration of the servers connected to the particular storage port is inspected, and one is found to be misconfigured. Further inspection, also applying I/O protocol analysis at the misconfigured server, indicates that this server is sending sporadic target resets to the array. This server, which has been provisioned on the SAN but is not yet running an application, is then removed from the SAN. The intermittent, highly problematic outages cease thereafter. The team is then able to quickly summarize the issue as follows:
- There was an array microcode update weeks before the issue started to appear. The microcode update made changes, improvements really, in how targets would respond to resets.
- A misconfigured application server was attached to the SAN. The misconfiguration caused reset behavior that triggered the new microcode response.
- The new microcode response triggered a race condition inside the application SCSI stack that caused the storage to go offline.
While the problem had festered for three days, it was resolved within a few hours of applying IOM techniques, and the root cause was identified almost as quickly. Applying IOM techniques proactively could have reduced this even further.
5) Video editing performance is slow, but all the king’s horses (tools) and all the king’s men (experts) can’t restore performance
A large broadcasting company is experiencing very slow application performance. Its video editing and archiving system is barely keeping up with the needs of the time-critical news editing function. The application monitoring tools are showing low transactions, but the application folks have concluded that nothing has changed in the environment. As a result, they quickly turn to the distributed system and SAN teams for help. The SRM software is showing that the links are healthy and link utilization is well within range. The RAID array tools are showing that all I/O is completing in a timely fashion. The team goes as far as to manually inspect the I/O latency from the server’s perspective, and finds that the numbers appear to be OK. With no clear historical trends, it is difficult to understand, or implicate, slow response time on the network. Highly paid experts and tools are brought in, but they all draw the same conclusions:
- The application environment has not changed.
- The SAN has low utilization, and doesn’t seem to be a problem.
- The RAID array is completing I/O in what appears to be a timely fashion (less than 10ms).
- The server’s view of the latency also appears to be within range.
But the system is markedly slower than it was days earlier, before the performance issue arose.
At this point, additional experts are pulled in, and I/O management (IOM) techniques are applied. The team installs instrumentation to determine if any pushback is coming from the arrays. This would happen in the form of Queue Full or SCSI Busy responses. Lo and behold, there are lots of them coming back. The fabric is doing fine, and the RAID array quickly completes the I/O that it accepts. The problem is that the broadcasting company has added too many high-speed application servers, and inadvertently connected them all to a single array port. The array port is overloaded and rejecting many of the requests, just as designed. These rejected I/Os do not show up as increased latency, do not cause issues with link utilization and do not affect the timeliness with which the array handles I/Os that are not rejected. All of this masks the root cause of the issue.
Several IOM techniques were applied to identify and resolve this issue quickly. I/O protocol analysis was used to understand that the array port was over-provisioned. I/O utilization profiles were used to confirm the overload of the array port and to locate another array port that had a lesser load. These, as well as other IOM functions, could have been used to avoid this problem by more proactively managing the utilization of the data center infrastructure.
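As a rough illustration of the kind of check described above, the Python sketch below counts how often each array port answers with a rejection rather than a normal completion. The event format (a port name paired with a SCSI status byte) and the port names are hypothetical, and this is not the instrumentation the team used. The status values themselves are the standard SCSI status codes, where Queue Full is reported as TASK SET FULL.

```python
from collections import Counter

# Standard SCSI status byte values.
SCSI_GOOD = 0x00
SCSI_BUSY = 0x08
SCSI_TASK_SET_FULL = 0x28  # commonly called "Queue Full"

REJECT_STATUSES = {SCSI_BUSY, SCSI_TASK_SET_FULL}


def rejection_ratio_by_port(events):
    """Return {port: fraction of responses that were Busy or Queue Full}.

    `events` is an iterable of (port_name, scsi_status) pairs; this input
    shape is an assumption made for the example.
    """
    total = Counter()
    rejected = Counter()
    for port, status in events:
        total[port] += 1
        if status in REJECT_STATUSES:
            rejected[port] += 1
    return {port: rejected[port] / total[port] for port in total}


# Illustrative trace: "port0" is overloaded, "port1" is lightly used.
events = [
    ("port0", SCSI_GOOD), ("port0", SCSI_TASK_SET_FULL),
    ("port0", SCSI_BUSY), ("port0", SCSI_GOOD),
    ("port1", SCSI_GOOD), ("port1", SCSI_GOOD),
]
ratios = rejection_ratio_by_port(events)
print(ratios)
```

A per-port ratio like this is invisible to latency and link utilization measurements, because accepted I/Os still complete quickly. It also points to a port with spare capacity that servers could be moved to.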
6) I/O contention causes Big Data application slow-down
A large data center with vast amounts of Big Data has just purchased a handful of mid-range arrays. The array vendor performed an analysis of the application’s data patterns and guaranteed a service level. This was important because the Big Data application had a window of time to complete its job. In particular, this application was crunching data, overnight. The system was installed, tested and put into production. The system ran fine for months before data center staff started to notice that the jobs were not completing on time. They called in performance experts to look at the storage area network (SAN) performance. They looked at the fabric link utilization, quickly determining that link utilization was very low, which is good. Next, they had the array vendor service experts look into the response time of the array. They found that the response time was in line with the guaranteed service levels, but it is important to note that these measurements were taken during normal daytime working hours.
Days into the problem escalation, I/O management techniques were used to apply I/O protocol analysis and I/O latency measurements. The analysis was installed in such a fashion that performance could be monitored continuously, specific to the I/O paths used by the application. The I/O protocol analysis data revealed a very clean I/O operation throughout the system. The I/O latency analysis, however, indicated that between 1 a.m. and 3 a.m., latency increased significantly. This, effectively, added another two hours of processing time to the job. This latency issue was not seen earlier, as latency measurements had been taken during daytime hours. I/O utilization profiling was then used to investigate the I/O demand on the array ports during this time frame. The results indicated that, during this time frame, there was four times the I/O demand on the port. The team quickly determined that a backup job was running in this time frame, causing contention within the cache as well as contention at the logical volume’s spindles. This contention caused an increase in latency for all system components. The resolution involved the distributed system, SAN and backup teams working together to better schedule the backup.
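The value of continuous monitoring in this case can be shown with a minimal Python sketch. It groups latency samples by hour of day and flags the hours whose average is well above the typical hour. The sample format, the `factor` threshold and the numbers are assumptions made for the example and do not describe the product's analytics.

```python
from collections import defaultdict
from statistics import mean, median


def slow_hours(samples, factor=2.0):
    """Return the hours whose mean latency exceeds `factor` times the median hourly mean.

    `samples` is an iterable of (hour_of_day, latency_ms) pairs; both the
    input shape and the default factor are hypothetical choices.
    """
    by_hour = defaultdict(list)
    for hour, latency_ms in samples:
        by_hour[hour].append(latency_ms)
    hourly_mean = {hour: mean(values) for hour, values in by_hour.items()}
    typical = median(hourly_mean.values())
    return sorted(hour for hour, value in hourly_mean.items() if value > factor * typical)


# Illustrative data: latency is steady except while a backup runs from 1 a.m. to 3 a.m.
samples = [(hour, 25.0 if hour in (1, 2) else 5.0) for hour in range(6) for _ in range(3)]
flagged = slow_hours(samples)
print(flagged)
```

Measurements taken only during the day would never include the flagged hours, which is why the earlier daytime checks found response times in line with the guaranteed service level.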


