What is an antipattern?

Stability antipatterns are recurring patterns of failure that undermine service stability.

Why study antipatterns?

Software rarely crashes outright today. Each of these antipatterns creates, accelerates, or multiplies cracks in the system. Learning them helps us avoid applying them in our system and software design.

Pattern 1: Integration Point

An integration point is anywhere a system integrates with a remote system, for example a remote service or a database. Integration points are the number-one killer of systems, yet building a system without them is not an option. The key is to remember that every integration point you introduce can cause a failure in the system. The Circuit Breaker pattern is a good strategy for handling integration points: fail requests fast once the circuit opens, which typically happens after enough failures have been seen. Failing fast protects the system from continuing to hammer the troubled integration point and from tying up critical resources while waiting on the failed remote system.
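
A minimal circuit breaker sketch, assuming a hypothetical RemoteClient dependency and illustrative thresholds: it counts consecutive failures, opens after a threshold, and fails fast until a cool-down period expires.

    // Minimal circuit breaker sketch (RemoteClient and thresholds are illustrative).
    import java.time.Duration;
    import java.time.Instant;

    public class CircuitBreaker {
        private final int failureThreshold;
        private final Duration openDuration;
        private int consecutiveFailures = 0;
        private Instant openedAt = null;

        public CircuitBreaker(int failureThreshold, Duration openDuration) {
            this.failureThreshold = failureThreshold;
            this.openDuration = openDuration;
        }

        // Fails fast while the circuit is open; otherwise delegates to the remote call.
        public synchronized String call(RemoteClient client, String request) throws Exception {
            if (openedAt != null) {
                if (Instant.now().isBefore(openedAt.plus(openDuration))) {
                    throw new IllegalStateException("circuit open: failing fast");
                }
                openedAt = null;                 // cool-down elapsed, allow a trial call
            }
            try {
                String response = client.send(request);
                consecutiveFailures = 0;         // success closes the circuit
                return response;
            } catch (Exception e) {
                if (++consecutiveFailures >= failureThreshold) {
                    openedAt = Instant.now();    // enough failures seen: open the circuit
                }
                throw e;
            }
        }

        // Hypothetical remote integration point.
        public interface RemoteClient {
            String send(String request) throws Exception;
        }
    }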

Pattern 2: Chain Reaction

A chain reaction occurs when there is some defect in an application, usually a resource leak or a load-related crash. A chain reaction spreads because the death of one server makes the others pick up the slack; most of the time the trigger is a memory leak. Partitioning servers with Bulkheads can prevent chain reactions from taking out the entire service, and using Circuit Breakers also helps.
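
A sketch of a bulkhead: each dependency gets its own small, bounded thread pool, so a slow or leaking dependency can exhaust only its own pool. The pool sizes and the inventory example are illustrative.

    // Bulkhead sketch: one bounded pool per dependency (sizes are illustrative).
    import java.util.concurrent.*;

    public class Bulkheads {
        // A slow or leaking "inventory" dependency can exhaust only these 10 threads.
        private final ExecutorService inventoryPool = new ThreadPoolExecutor(
                10, 10, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(50),              // bounded queue: no unbounded pile-up
                new ThreadPoolExecutor.AbortPolicy());     // fail fast when the bulkhead is full

        public Future<String> checkInventory(String sku) {
            return inventoryPool.submit(() -> callInventoryService(sku));
        }

        private String callInventoryService(String sku) {
            // placeholder for the real remote call
            return "IN_STOCK:" + sku;
        }
    }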

Pattern 3: Cascading Failures

A cascading failure occurs when cracks jump from one system or layer to another, usually because of insufficiently paranoid integration points. A cascading failure can also happen after a chain reaction in a lower layer. It often results from a resource pool, such as a connection pool, that gets exhausted when none of its calls return: the threads holding the connections block forever, and all other threads block waiting for connections. Safe resource pools always limit the time a thread can wait to check out a resource. A cascading failure happens after something else has already gone wrong. Circuit Breaker protects your system by avoiding calls out to the troubled integration point, and using Timeouts ensures that you can come back from a call to it.
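
A sketch of a safe checkout, assuming a hypothetical pool backed by a BlockingQueue: the caller waits at most a bounded time for a connection instead of blocking forever.

    // Safe pool checkout sketch: bounded wait instead of blocking forever (pool type is hypothetical).
    import java.sql.Connection;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class SafeConnectionPool {
        private final BlockingQueue<Connection> idle;

        public SafeConnectionPool(BlockingQueue<Connection> idle) {
            this.idle = idle;
        }

        // Waits at most maxWaitMillis; a timeout surfaces as an error instead of a hung thread.
        public Connection checkOut(long maxWaitMillis) throws InterruptedException, TimeoutException {
            Connection conn = idle.poll(maxWaitMillis, TimeUnit.MILLISECONDS);
            if (conn == null) {
                throw new TimeoutException("no connection available within " + maxWaitMillis + " ms");
            }
            return conn;
        }

        public void checkIn(Connection conn) {
            idle.offer(conn);   // return the connection for other waiting threads
        }
    }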

Pattern 4: Users

Users in the real world do things that you won't predict. If there's a weak spot in your application, they'll find it through sheer numbers. Test scripts are useful for functional testing but too predictable for stability testing; you should run a dedicated load test to validate the stability of your service. Become intimate with your network design as well; it should help you avert attacks.
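
A sketch of a stability-oriented load driver, assuming a hypothetical visitSite() call: unlike a fixed functional script, it randomizes think time and lets some simulated users abandon the session mid-flow.

    // Stability load-test sketch: random think time and abandonment, unlike a scripted functional test.
    import java.util.Random;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class UnpredictableLoad {
        private static final Random RANDOM = new Random();

        public static void main(String[] args) {
            ExecutorService users = Executors.newFixedThreadPool(200);   // illustrative user count
            for (int i = 0; i < 200; i++) {
                users.submit(UnpredictableLoad::simulateOneUser);
            }
            users.shutdown();
        }

        private static void simulateOneUser() {
            try {
                for (int page = 0; page < 10; page++) {
                    visitSite("/page/" + page);
                    if (RANDOM.nextDouble() < 0.2) {
                        return;                                   // some users abandon mid-session
                    }
                    Thread.sleep(RANDOM.nextInt(5000));           // random think time, 0-5 s
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        private static void visitSite(String path) {
            // placeholder for a real HTTP call to the system under test
        }
    }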

Pattern 5: Blocked Threads

The Blocked Threads antipattern is the proximate cause of almost all failures. It usually shows up when you check resources out of a connection pool, deal with caches or object registries, or make calls to external systems. However, hanging threads are very hard to find during development because of the essential complexity of concurrent programming and testing. Be careful when you introduce the synchronized keyword into your code: while one thread is executing the synchronized block, every other thread that reaches it is blocked. When you have to write concurrent code, use proven primitives and defend against blocked threads with timeouts.
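
A sketch contrasting synchronized with a proven primitive, ReentrantLock.tryLock with a timeout, so a request thread gives up instead of blocking indefinitely. The cache name and the 250 ms timeout are illustrative.

    // Sketch: bounded lock acquisition instead of an unbounded synchronized block.
    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.locks.ReentrantLock;

    public class BoundedCache {
        private final Map<String, Object> cache = new HashMap<>();
        private final ReentrantLock lock = new ReentrantLock();

        // Returns null rather than blocking forever if another thread holds the lock too long.
        public Object get(String key) throws InterruptedException {
            if (!lock.tryLock(250, TimeUnit.MILLISECONDS)) {   // illustrative timeout
                return null;                                    // caller can degrade gracefully
            }
            try {
                return cache.get(key);
            } finally {
                lock.unlock();
            }
        }
    }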

Also be careful with third-party libraries; they often come with resource pooling built on multithreading. Your first problem with these libraries is determining exactly how they behave. If a library breaks easily, you need to protect your request-handling threads. If the library lets you set timeouts, use them.
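
For example, the JDK's built-in HttpClient (Java 11+) lets you set both a connect timeout and a per-request timeout explicitly; the URL and durations here are illustrative.

    // Setting explicit timeouts on the JDK HttpClient (Java 11+); URL and durations are illustrative.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;

    public class TimeoutsExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .connectTimeout(Duration.ofSeconds(2))        // bound the TCP connect
                    .build();

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/orders"))
                    .timeout(Duration.ofSeconds(5))               // bound the whole request
                    .GET()
                    .build();

            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }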

Pattern 6: Attack Of Self-Denial

On a retail or similar website, a special offer can trigger an attack of self-denial: your own promotion drives a surge of traffic that the system cannot absorb. In such cases, protecting the shared resource is critical.

Pattern 7: Scaling Effects

A shared resource can become a bottleneck when you try to scale the service, because in most cases it can only be used exclusively. When the shared resource saturates, you get a connection backlog. A shared-nothing architecture largely solves this problem: each request is processed by one node, and the nodes do not share memory or storage. Achieving a fully shared-nothing architecture is relatively hard, but many approaches can reduce the dependency on the shared resource, for example partitioning the database tables.
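
One way to reduce dependence on a single shared resource is to partition by key so that each node owns a disjoint slice of the data. A minimal hash-routing sketch; the node names are illustrative.

    // Key-based partition routing sketch: each node owns a disjoint slice, nothing is shared.
    public class PartitionRouter {
        private final String[] nodes;   // e.g. {"db-shard-0", "db-shard-1", "db-shard-2"}

        public PartitionRouter(String[] nodes) {
            this.nodes = nodes;
        }

        // The same customer always maps to the same partition.
        public String nodeFor(String customerId) {
            int bucket = Math.floorMod(customerId.hashCode(), nodes.length);
            return nodes[bucket];
        }
    }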

Pattern 8: Unbalanced Capacities

Production systems are deployed onto a relatively fixed set of resources. The unbalanced capacities problem is rarely observed during QA because in most QA environments the system is scaled down to two servers. How do you test whether your system suffers from unbalanced capacities? Use capacity modeling to make sure you are at least in the ballpark, then test your system with an excessive workload.
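
A back-of-the-envelope capacity model is enough to get in the ballpark. The sketch below uses illustrative numbers to compare how many concurrent calls the front end can generate against what the back end can serve.

    // Back-of-the-envelope capacity model with illustrative numbers.
    public class CapacityCheck {
        public static void main(String[] args) {
            int frontEndServers = 20;
            int threadsPerFrontEnd = 25;          // 20 * 25 = 500 concurrent requests possible
            int backEndServers = 2;
            int threadsPerBackEnd = 50;           // 2 * 50 = 100 requests served concurrently

            int frontEndDemand = frontEndServers * threadsPerFrontEnd;
            int backEndCapacity = backEndServers * threadsPerBackEnd;

            System.out.printf("front end can generate %d concurrent calls, back end handles %d (%.0fx gap)%n",
                    frontEndDemand, backEndCapacity, (double) frontEndDemand / backEndCapacity);
        }
    }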

Pattern 9: Slow Response

A slow response is worse than refusing a connection or returning an error: a fast failure lets the caller finish processing the transaction quickly. Slow responses from the lower layers gradually propagate to the upper layers and can cause a cascading failure. Common root causes of slow responses include contention for database connections, memory leaks, and inefficient low-level protocols.
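
A sketch of failing fast once a response deadline has passed, assuming a hypothetical handle() flow; the 500 ms budget is illustrative.

    // Fail-fast sketch: stop working on a request once its latency budget is spent.
    public class DeadlineGuard {
        private static final long BUDGET_MILLIS = 500;   // illustrative latency budget

        public String handle(String request) {
            long deadline = System.currentTimeMillis() + BUDGET_MILLIS;

            String record = loadRecord(request);
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("over latency budget, failing fast");
            }
            return render(record);
        }

        private String loadRecord(String request) { return "record:" + request; }
        private String render(String record) { return "<html>" + record + "</html>"; }
    }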

Pattern 10: SLA Inversion

A service-level agreement (SLA) is a contractual commitment about how well the organization must deliver its services. When building a software system, however, the best availability you can possibly promise is that of the worst of your service providers. SLA inversion means a system that must meet a high-availability SLA depends on systems of lower availability. To handle SLA inversion, on the one hand you can decouple from the lower-availability systems or degrade gracefully, making sure your system can continue to operate without the remote system. On the other hand, when you craft your SLAs, focus on particular functions and features; for features that require a remote dependency, you can only promise the best SLA that the remote system offers.

Pattern 11: Unbounded Result Sets

A commonly seen database query pattern is for the application to send a query and loop over all the results, without realizing that the result set can be far larger than the server can handle. This pattern is hard to detect during development because the test data set is usually small, and sometimes it surfaces only after the system has been in production long enough for the data to grow too big to handle. One solution is to use the LIMIT keyword to bound the result size the database sends back. For large query sets, it is worth introducing pagination at both the API and the database level.
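
A minimal JDBC paging sketch, assuming a hypothetical orders table: each round trip is bounded with LIMIT/OFFSET instead of pulling the whole result set at once.

    // JDBC paging sketch with LIMIT/OFFSET (table and column names are illustrative).
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PagedOrderReader {
        private static final int PAGE_SIZE = 500;   // bound on rows per round trip

        public void readAllOrders(Connection conn) throws SQLException {
            String sql = "SELECT id, status FROM orders ORDER BY id LIMIT ? OFFSET ?";
            int offset = 0;
            while (true) {
                int rowsInPage = 0;
                try (PreparedStatement stmt = conn.prepareStatement(sql)) {
                    stmt.setInt(1, PAGE_SIZE);
                    stmt.setInt(2, offset);
                    try (ResultSet rs = stmt.executeQuery()) {
                        while (rs.next()) {
                            process(rs.getLong("id"), rs.getString("status"));
                            rowsInPage++;
                        }
                    }
                }
                if (rowsInPage < PAGE_SIZE) {
                    break;                           // last page reached
                }
                offset += PAGE_SIZE;
            }
        }

        private void process(long id, String status) {
            // placeholder for real per-row work
        }
    }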