Wed, 02/13/2008 - 23:18 by Jarod Jenson
Okay, maybe not totally meaningless, but the way that many people use the metric certainly is meaningless. This is especially apparent when it comes to what some believe is a black art - Capacity Planning. This ties back to my first blog entry when I spoke of how often simple upgrades go horribly wrong.
This problem is exacerbated by the fact that the systems we use today are becoming significantly more multi-[core|processor|thread]. Many people look at processor utilization as a single number and with no context as to the type or size of the underlying machinery. I routinely hear comments similar to: "We know it is not a CPU issue because the system is only 25% busy."
That number by itself is absolutely meaningless. Are we talking 25% busy on a single CPU system? Or, more likely, 25% of a multi-core system? If this is 25% busy on a 4-way box - then are we talking 1 thread at 100% utilization of a single core, 2 threads at 50% utilization of two cores, or 25 threads 1% busy across all four cores? Each of these characterizes a completely different usage profile that would dramatically change our perception of whether or not the application is processor bound or if the traditional capacity planning metrics could be used.
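To make the arithmetic concrete, here is a minimal sketch (not taken from any real system) that feeds the three profiles above into the same system-wide calculation. The function name and the sample numbers are illustrative assumptions. Per-thread figures are expressed as a percentage of one core, so the third profile is read as 25 threads that each use 1% of the whole 4-way box, which is 4% of a single core.

```python
def summarize(per_thread_busy, n_cores):
    """Return (system-wide %, busiest thread as % of one core).

    per_thread_busy: list of per-thread CPU use, each as a percent of ONE core.
    """
    system_wide = sum(per_thread_busy) / n_cores
    return system_wide, max(per_thread_busy)


N_CORES = 4  # the "4-way box" from the text

# Hypothetical thread profiles, all chosen to add up to the same total.
profiles = {
    "one saturated thread": [100.0],
    "two half-busy threads": [50.0, 50.0],
    "25 lightly loaded threads": [4.0] * 25,
}

for name, threads in profiles.items():
    system_wide, hottest = summarize(threads, N_CORES)
    print(f"{name}: system-wide {system_wide:.0f}%, hottest thread {hottest:.0f}% of a core")
```

All three profiles report the same system-wide figure, while the busiest-thread figure is what separates a thread that has run out of processor from a box with genuine headroom.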
Many times, it is the first or second profile that we encounter when we dig further into the problem. Sadly, the third is the only one that even has a slim possibility of fitting the original assumption that this is not a "CPU issue". What does all this mean then?
Well, it means that we have to provide context so that the utilization number makes sense. For instance, Solaris has a tool called prstat(1M) that allows us to observe utilization on a per-thread basis within an application. This is especially useful when combined with the microstate accounting information available in Solaris 10 (these would be the '-L' and '-m' options respectively - in fact, I never run prstat without these two options). This allows us to better qualify whether or not we are running into one of the first two profiles described above. If we find that we have one thread at 100% utilization in an application, then the 25% number is irrelevant (meaningless). The application will probably not perform any better regardless of the number of cores, and we do indeed have a "CPU issue". There are only two ways to address this: 1) wallet tuning¹ - buy faster cores (how boring is that?) or 2) fix the application by either reducing the number of instructions required or parallelizing the operation where possible.
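A small sketch of why the system-wide number is irrelevant in the single-saturated-thread case: the same thread pinned at 100% of one core looks "less busy" the more cores the box has, even though the application is no faster. The function name and the core counts are assumptions chosen for illustration.

```python
def system_wide_percent(saturated_threads, n_cores):
    """System-wide utilization when `saturated_threads` threads each pin one core."""
    return 100.0 * saturated_threads / n_cores


# One CPU-bound thread on progressively larger (hypothetical) machines.
for n_cores in (1, 4, 8, 64):
    pct = system_wide_percent(1, n_cores)
    print(f"{n_cores:>2} cores: system-wide {pct:.1f}% busy, thread still at 100% of its core")
```

The reported figure falls as cores are added, but the one thread doing the work is exactly as processor-bound as before.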
In addition to these simplistic cases, there are others that are even more challenging. I'll illustrate with an example. In the past month or so, I have run into a couple of applications that were "maxed out". This is a simple way of saying that if more units of work are added, the latency per transaction becomes unacceptable. Looking at the systems, neither was more than 50% utilized from a processor perspective when looked at as a single value. In both cases, however, I was able to improve performance, by as little as 30% and as much as about 120%, by running a single Solaris OS-level command: psradm(1M) with the '-f' option. That is correct; I disabled anywhere from 1/4 to 1/2 of the CPUs on the systems and performance improved.
This is what is called a negative scalability problem: the more CPUs we add, the worse the performance. It is usually a case of lock contention that results in overly burdensome e-cache line contention. This is the one that will really give those capacity planners heartburn.
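One way to picture negative scalability is with a toy model in which every added CPU contributes capacity but also adds contention and cache-line traffic with every other CPU. The sketch below uses the general form of Gunther's Universal Scalability Law as a stand-in; it is not the method used on the systems described here, and the two coefficients are invented for illustration rather than measured.

```python
def relative_throughput(n_cpus, contention=0.05, coherency=0.02):
    """Throughput relative to one CPU under a USL-style model.

    contention: assumed fraction of work serialized behind locks.
    coherency:  assumed pairwise cost of keeping caches consistent.
    Both defaults are made-up values, not measurements.
    """
    penalty = 1 + contention * (n_cpus - 1) + coherency * n_cpus * (n_cpus - 1)
    return n_cpus / penalty


for n in (1, 2, 4, 8, 16):
    print(f"{n:>2} CPUs: {relative_throughput(n):.2f}x the single-CPU throughput")

# With these assumed coefficients the curve peaks and then declines,
# so taking CPUs offline past the peak raises throughput.
best = max(range(1, 17), key=relative_throughput)
print(f"throughput peaks at {best} CPUs in this toy model")
```

Under a curve like this, a machine running past its peak gets faster when processors are removed, which is the behavior that disabling CPUs exposed.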
So what is one to do? Well, always put processor utilization in context. Almost never is it meaningful for me to hear a system wide utilization number. I would prefer a per-thread breakdown, and even better is to get some idea of the application metrics that accompany each utilization value. By far the best approach is to profile the application using tools like DTrace. With DTrace, we can very easily determine the running profile of an application and make informed decisions about how to interpret processor utilization metrics.
In coming entries, I will show some of the approaches we can use with DTrace to allow us to make these informed decisions. I just had to get this off my chest since it is only Monday, and I have hit this issue twice already this week.
¹I have been using the phrase "wallet tuning" for some time now, and I think it is high time that it makes its way into the lexicon. This is (sadly) the predominant method by which applications are "tuned". Spending hard earned money for performance gains should be the last resort.