A scalable service is one that can maintain a roughly constant response time as the load increases, provided more nodes are added to the cluster and new server instances are started. Why is it so difficult to build a scalable service?

Of course, modern internet services are complex. They usually contain multiple components with very different responsibilities: web servers, application servers, messaging systems, caching systems, database systems, and so on. What if we start from the fundamentals: a web-based service with a moderate amount of business logic and a backend database? Would it still be challenging to build a highly concurrent web service?

To answer that question, we first need to understand where the bottleneck comes from. The traditional approach to handling web requests is to fork a new thread for each request and hope the thread finishes execution before the request times out.
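As a rough illustration, here is a minimal sketch of the thread-per-request model in Java; the port number and the `handle` method are placeholders of my own, not part of any particular framework:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class ThreadPerRequestServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                // Block until a client connects.
                Socket client = server.accept();
                // Fork a new thread for every incoming request.
                new Thread(() -> handle(client)).start();
            }
        }
    }

    private static void handle(Socket client) {
        // Placeholder: read the request, run business logic, write the response.
        try (client) {
            client.getOutputStream().write("HTTP/1.1 200 OK\r\n\r\n".getBytes());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
```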

However, as the load increases, the number of threads needs to increase as well. While threads are relatively lightweight, each one comes with a memory footprint, so the total number of threads a server can run is bounded by the available memory.
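A small experiment (my own sketch, not from this post) makes the limit visible: each thread reserves its own stack even if it does no work, so creating idle threads in a loop eventually exhausts memory or the OS thread limit:

```java
import java.util.concurrent.locks.LockSupport;

public class ThreadLimitDemo {
    public static void main(String[] args) {
        long count = 0;
        try {
            while (true) {
                // Each thread parks immediately and does no work,
                // yet still reserves its own stack (often around 1 MB by default on the JVM).
                new Thread(LockSupport::park).start();
                count++;
            }
        } catch (OutOfMemoryError e) {
            System.out.println("Created " + count + " threads before running out of memory");
        }
    }
}
```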

Then why not increase the number of server nodes, adding more processes to the cluster? Yes, adding more server nodes solves the problem temporarily. But don't forget: the server needs to handle database operations, which are very slow compared to CPU operations.

I/O operations are often the bottleneck of the service, because they are millions of times slower than CPU operations. For instance, a 3 GHz CPU finishes one cycle in about 0.3 ns, while a hard disk access takes 1–10 ms, which is roughly 3 to 30 million times slower.
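For concreteness, taking one cycle of a 3 GHz CPU and a 1 ms disk access, the ratio works out to about three million:

$$
\frac{1\ \text{ms}}{0.33\ \text{ns}} = \frac{10^{-3}\ \text{s}}{3.3 \times 10^{-10}\ \text{s}} \approx 3 \times 10^{6}
$$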

For a traditional web service with blocking I/O, the control flow is sequential. When an I/O operation is triggered, the control flow has to wait until it finishes. In practice, the thread hangs, and the kernel switches it off the CPU until the operation completes.
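To make the blocking behavior concrete, here is a minimal sketch using JDBC; the connection URL, credentials, and table are hypothetical, chosen only for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class BlockingQuery {
    public static void main(String[] args) throws SQLException {
        // Hypothetical database URL and credentials.
        String url = "jdbc:postgresql://localhost:5432/app";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")) {
            stmt.setLong(1, 42L);
            // executeQuery() is a blocking call: the thread is parked by the kernel
            // until the database responds, even though the CPU has nothing to do here.
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("name"));
                }
            }
        }
    }
}
```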

It’s true that the CPU can still be utilized by other threads while the I/O thread is waiting. However, there are two issues: 1) a context switch comes with a cost, and as the context is switched, the CPU caches are invalidated; 2) the number of concurrent I/O operations the backend storage system can handle is limited, as sketched below.
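The second issue can be pictured as a hard cap: a connection pool (here simulated with a semaphore, with the pool size and query time chosen arbitrarily for illustration) bounds how many requests can talk to the database at once, and every request beyond that cap simply waits:

```java
import java.util.concurrent.Semaphore;

public class BoundedDbAccess {
    // Assume the database can only serve 10 connections concurrently.
    private static final Semaphore dbConnections = new Semaphore(10);

    static void handleRequest(int requestId) throws InterruptedException {
        // Requests beyond the pool size block here, adding to overall latency.
        dbConnections.acquire();
        try {
            Thread.sleep(5); // Simulate a 5 ms blocking database query.
        } finally {
            dbConnections.release();
        }
    }

    public static void main(String[] args) {
        // 1,000 concurrent requests contend for 10 database connections.
        for (int i = 0; i < 1_000; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    handleRequest(id);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```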

Hence, increasing the number of servers might help handle more requests initially. However, once we start to hit the bottleneck of the database system, no additional concurrent requests can be served: more and more requests block on I/O operations. Until the current requests return, subsequent incoming requests can only wait in a queue, and the overall latency of the service increases.

As you can see, there are two critical bottlenecks when scaling the service: 1) the blocking model for handling I/O operations, and 2) the number of concurrent requests the backend storage system can handle. I will explain how to solve these problems in the following posts.