
In this article, performance specialist Steven Parker explains the USE Method and how to apply this key performance methodology. By understanding and using these techniques, developers can minimize cost and improve the performance and reliability of their cloud deployments.

At the heart of performance evaluation is the USE Method: Utilization, Saturation, and Errors. This performance methodology is all about getting a handle on the performance, cost, and efficiency of resource usage. You are probably aware of it even if you didn't know it by name, because it is the underpinning of Azure Insights. Azure Insights applies a set of “obvious” default resources and capacities and provides diagnosis based on the results it finds. In this article I will explain the methodology and its terms. Besides generally helping to improve performance, it can be used to reduce the cost of your cloud deployment and to demonstrate that you are fully utilizing what you’re paying for.

Terminology and Measures
First, let’s consider what we mean by these three key terms:

  • Utilization: The percentage of time, over a measurement interval, that the resource was busy. Examples would be CPU % utilization per minute, or the percentage of available RAM in use.
  • Saturation: The degree to which the resource has work beyond its capacity. An example is a disk queue, where I/O requests that cannot be serviced immediately wait in the queue and are counted by its length.
  • Errors: The count of error events.
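
To make these concrete, here is a minimal sketch, in plain Python with invented sample data, of how the three measures might be summarized from raw monitoring samples. The Sample fields and the summary choices (average utilization, peak queue length, total error count) are illustrative assumptions, not a specific monitoring API.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    cpu_busy_pct: float   # CPU busy % during this sampling interval
    disk_queue_len: int   # outstanding I/O requests at sample time
    error_count: int      # error events logged during this interval

def use_metrics(samples: list[Sample]) -> dict:
    """Summarize Utilization, Saturation, and Errors over a window."""
    n = len(samples)
    return {
        # Utilization: how busy the resource was, on average, over the window
        "avg_cpu_utilization_pct": sum(s.cpu_busy_pct for s in samples) / n,
        # Saturation: work queued beyond capacity (here, peak disk queue length)
        "peak_disk_queue": max(s.disk_queue_len for s in samples),
        # Errors: total error events -- always investigate these first
        "errors": sum(s.error_count for s in samples),
    }

window = [Sample(42.0, 0, 0), Sample(97.5, 3, 0), Sample(55.0, 1, 2)]
print(use_metrics(window))
```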

Errors need to be investigated immediately because they may indicate work that was not actually completed, or was incompletely processed; any other measurements may therefore be deceptive. For example, if you ran a test of your web service at 1,000 requests per second but got 999 errors, you probably haven’t measured the performance of anything except error handling.
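
As a small, hedged illustration of that point (all numbers are made up, and the 1% cut-off is just an assumption):

```python
# Check the error rate before trusting a benchmark's throughput number.
requests_sent = 1000
errors = 999
successes = requests_sent - errors

error_rate = errors / requests_sent
if error_rate > 0.01:  # assumed threshold for a trustworthy run
    print(f"error rate {error_rate:.1%}: this run measured error handling, "
          "not service performance")
else:
    print(f"effective throughput: {successes} successful requests/s")
```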

It is also important to keep in mind the time period of these metrics. If you measure average CPU utilization over an hour, a brief period at 100% CPU, during which the CPU wait queue grows long, can still leave the hourly average very low. In general, any burst of load significantly shorter than the measurement period can be lost in the average. Knowing the sampling interval keeps you aware of the limits of your measurements’ accuracy.
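
A quick illustration with invented numbers: one saturated minute inside an otherwise quiet hour barely moves the hourly average.

```python
# 59 quiet minutes at 5% CPU, then 1 saturated minute at 100%
per_minute_cpu = [5.0] * 59 + [100.0]

hourly_average = sum(per_minute_cpu) / len(per_minute_cpu)
peak_minute = max(per_minute_cpu)

print(f"hourly average: {hourly_average:.1f}%")     # ~6.6% -- looks healthy
print(f"peak 1-minute sample: {peak_minute:.1f}%")  # 100% -- the hidden burst
```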

Ideal goals might be near-100% utilization, no saturation, and no errors. Of course, those ideals are not achievable in practice, but they show the direction you are aiming for. Along with these measures, you also want to consider responsiveness (latency). It may be that above 50% CPU utilization, response time gets worse. (I will return to this balance in a subsequent article.)

The Procedure
Before you begin, identify all the resources in your deployment. Then iterate over each resource, checking it with the USE Method according to this flowchart:

[Flowchart: the USE Method checklist. Source: Brendan Gregg]

By the time you have worked through this method, you will have audited the possible cost savings from over-provisioning, validated that there are no undetected problems hiding in error events, and discovered any under-provisioned areas you may not have realized existed.
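
As a sketch of that iteration, here is a minimal Python loop over a hypothetical resource inventory. The inventory, the read_metrics stub, and the 90% utilization threshold are all assumptions for illustration; in practice the readings would come from your monitoring system (for example, Azure Monitor counters). Note the flowchart’s ordering: errors first, then utilization, then saturation.

```python
RESOURCES = ["cpu", "memory", "disk", "network"]  # assumed inventory

def read_metrics(resource: str) -> dict:
    # Placeholder: in practice, query your monitoring system for this
    # resource's error, utilization, and saturation counters.
    return {"errors": 0, "utilization_pct": 72.0, "queue_len": 0}

def check(resource: str) -> None:
    m = read_metrics(resource)
    if m["errors"] > 0:
        print(f"{resource}: {m['errors']} errors -- investigate these first")
    elif m["utilization_pct"] > 90.0:  # assumed high-utilization threshold
        print(f"{resource}: high utilization ({m['utilization_pct']}%)")
    elif m["queue_len"] > 0:
        print(f"{resource}: saturated (queue length {m['queue_len']})")
    else:
        print(f"{resource}: OK -- a candidate for right-sizing review")

for r in RESOURCES:
    check(r)
```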

Goals to Balance
A key way to make this work well is to settle on specific targets that balance performance and capacity. Since most web applications and cloud deployments focus on interactive performance, you will probably want to identify a goal latency; for example, an HTTP GET should take no more than 0.5s. Then, by changing CPU capacity and checking the impact on latency, you get beyond idealistic goals like 100% CPU utilization. For example, if you aim for a 0.5s average response time, you may discover that once CPU utilization per 1-minute interval goes above 65%, you no longer maintain it. That finding can now inform your goal. An example set of final metrics after experimentation (a sketch of such an experiment follows the list) might well be:

  • Peak CPU % over a 1-minute interval is 55%
  • Peak CPU wait queue length over a 1-minute interval
  • Average disk response time over a 15-minute period is 3.5ms
  • Peak RAM utilization over a 1-hour period is 80%
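
Here is a hedged sketch of the kind of experiment that produces such numbers. The load-test results are invented, and the 0.5s goal follows the example above: given (CPU %, average latency) pairs collected at increasing load, find the highest utilization that still meets the goal.

```python
LATENCY_GOAL_S = 0.5

# Invented load-test runs: (1-minute CPU %, average HTTP GET latency in s)
runs = [(35, 0.21), (50, 0.28), (65, 0.49), (75, 0.63), (85, 0.91)]

within_goal = [cpu for cpu, latency in runs if latency <= LATENCY_GOAL_S]
threshold = max(within_goal) if within_goal else None

print(f"highest CPU% meeting the {LATENCY_GOAL_S}s goal: {threshold}%")
# -> 65%; a target such as "peak 1-minute CPU below 55%" then leaves headroom
```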

Happy Cost Control Meeting!
With the above, you are in a position to justify your configuration, and, coupled with your performance goals, you can have a meaningful conversation with the Finance department about costs: “If we are going to maintain the user experience that no single fetch takes more than 0.5s, we need the CPU capacity we have, based on the measured results above.” Not only that, your review of errors lets you be confident there were no hidden problems in the deployment.

I hope this has been a helpful review of a key aspect of monitoring the performance of a cloud deployment. If you want to read more about the USE Method, check out Brendan Gregg’s book, “Systems Performance: Enterprise and the Cloud”.

About Steve:
Steven Parker works as a Full Stack Architect at Avanade Norway and has many years’ experience in software engineering, operating systems, software development, and project management. Originally from the US, Steve worked for many years at Apple, Sun Microsystems, and Fujitsu before joining Avanade earlier this year.
