Sunday, July 4, 2010

Performance Testing - Why a site can be slow, even with low CPU/RAM/disk utilization.


Sometimes a site appears to slow down significantly even though its CPU, RAM, and disk utilization barely rise. While those three metrics are often good indicators of why systems “slow down”, there are many other causes of performance problems. Today, we’re going to discuss one common root cause of slow websites that often gets overlooked: connection management.

Until very recently, most web browsers would only issue a maximum of two connections per host, as per the recommendation by the original HTTP/1.1 specification. This meant that if 1000 users all hit your home page at the same time, you could expect ~2000 open connections to your server. Let’s suppose that each connection consumes, on average, 0.01% of the server’s CPU and no significant RAM or disk activity.

That would mean that 2000 connections should be consuming 20% of the CPU, leaving a full 80% ready to handle additional load – or that the server should be able to handle another 4X load (4000 more users). However, this type of analysis fails to account for many other variables, most importantly the web server’s connection management settings.
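A quick way to sanity-check that arithmetic (the figures below are the illustrative ones from above, not measurements):

```python
# Back-of-envelope check of the numbers above: 1000 users, two connections each,
# 0.01% of the CPU per connection. These are the article's illustrative figures,
# not measured values.
users = 1000
connections_per_user = 2       # the classic HTTP/1.1 two-connections-per-host limit
cpu_per_connection = 0.0001    # 0.01% of the server's CPU

connections = users * connections_per_user
cpu_used = connections * cpu_per_connection

print(f"open connections: {connections}")       # 2000
print(f"CPU consumed:     {cpu_used:.0%}")      # 20%
print(f"headroom:         {1 - cpu_used:.0%}")  # 80% -> ~4x more of the same load
```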
Just about every web server available today (Apache, IIS, nginx, lighttpd, etc.) has one or more settings that control how connections are handled. These include connection pooling, maximum allowed connections, Keep-Alive timeout values, and so on. They all work basically the same way (the sketch after the list below models the same decision):
  • When a request (connection) comes into the server, the server looks at the maximum active connections setting (e.g., MaxClients in Apache) and decides whether it can handle the request.
  • If it can, the request is processed and the number of active connections is incremented by one.
  • If it can’t, the request is placed into a queue, where it will wait in line until it can finally be processed.
  • If that queue is too long (also a configuration setting in the server), the request will be rejected outright, usually with a 503 response code.
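A minimal Python sketch of that decision, assuming placeholder limits and stub handlers (the constants stand in for settings like Apache's MaxClients and listen backlog; none of this is a real server's API):

```python
import queue

# Toy model of the admit / queue / reject decision described in the list above.
# MAX_ACTIVE and MAX_QUEUE play the role of settings such as MaxClients and the
# listen backlog; the numbers are illustrative, not real defaults.
MAX_ACTIVE = 256
MAX_QUEUE = 511

active = 0
backlog = queue.Queue(maxsize=MAX_QUEUE)

def handle(request):
    print("processing", request)        # stand-in for real request handling

def reject(request, status):
    print("rejected", request, status)  # stand-in for sending an error response

def on_request(request):
    """Decide what happens to an incoming connection."""
    global active
    if active < MAX_ACTIVE:
        active += 1                      # a worker is free: process it now
        try:
            handle(request)
        finally:
            active -= 1
    elif not backlog.full():
        backlog.put_nowait(request)      # no free worker: wait in line
    else:
        reject(request, status=503)      # queue is full: turn the request away
```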
It’s this queue that can make your site appear slow despite low server utilization. Say the server allows up to 256 concurrent requests and each request takes 1 second to complete. If 1000 users visit the site at the same time, generating 2000 requests, the first 128 users (256 slots / 2 requests per user) get a 1-second response time, the next 128 users get a 2-second response time, and the last user gets an EIGHT-SECOND response time.
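Working the same numbers out in code (all values are the illustrative ones from the example above):

```python
import math

# 1000 users, 2 requests each, 256 concurrent request slots, 1 second per request:
# how long does each "batch" of users wait?
users = 1000
requests_per_user = 2
concurrency = 256
seconds_per_request = 1

users_per_batch = concurrency // requests_per_user  # 128 users served at once
batches = math.ceil(users / users_per_batch)        # 8 batches in total

for batch in range(1, batches + 1):
    first = (batch - 1) * users_per_batch + 1
    last = min(batch * users_per_batch, users)
    print(f"users {first}-{last}: ~{batch * seconds_per_request} second response time")
# The final batch waits ~8 seconds even though CPU, RAM, and disk stay nearly idle.
```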

The simple solution is to raise the concurrent request limit. Be careful here, though: if you raise it too high, your server may not have enough CPU or RAM to handle all the requests, and every user is affected (rather than just some of them, as in the example above).
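One way to pick a safer ceiling is to estimate which resource runs out first; every per-request figure in this sketch is an assumption you would replace with measurements from your own application:

```python
# Rough capacity estimate before raising a concurrency limit. All per-request
# costs below are made-up placeholders, not defaults from any real server.
ram_total_mb = 8 * 1024          # assume an 8 GB server
ram_per_request_mb = 25          # assumed memory footprint of one in-flight request
cpu_cores = 4
cpu_seconds_per_request = 0.05   # assumed CPU time a request actually burns
request_duration_s = 1.0         # wall-clock time a request stays in flight

max_by_ram = ram_total_mb // ram_per_request_mb
max_by_cpu = int(cpu_cores * request_duration_s / cpu_seconds_per_request)

print(f"RAM supports roughly {max_by_ram} concurrent requests")
print(f"CPU supports roughly {max_by_cpu} concurrent requests")
print(f"a safer ceiling is about {min(max_by_ram, max_by_cpu)}")
```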
Also remember that not all requests are equal: a request for a dynamic search result is far more expensive than one for a static CSS file. This is why larger sites place static files on dedicated web servers with different configurations, usually with host names like images.example.com, while their more complex content is handled by a larger pool of servers, each allowing fewer concurrent requests.

So next time you’re wondering why your site is slow, take a look at more than just CPU and RAM. Find out how the server is processing the content and see if perhaps your web server is the bottleneck.
Source: browsermob


