Re-Architecting for the Cloud

Last year I brought a room full of talented architects and lead engineers together to answer a simple question: how do we leverage the cloud? The assembled team was responsible for multiple complex, distributed SaaS products built on traditional physical and virtual infrastructure. The group wrestled with different options, eventually concluding that leveraging the cloud in their existing products was at best difficult and, at worst, impractical.

This question is being asked all across the world. How do I take advantage of the cloud in my existing services or products? How can I get the value of pay-per-use pricing, elasticity, and on-demand infrastructure?

Pre-Cloud Architectures

I learned the fundamentals of building complex distributed systems as a software engineer during the early dot-com boom. Through trial and error, I learned to build service-oriented architectures that were loosely coupled, scaled horizontally, and were highly optimized for performance. I started my learning at ProCD, where we built an early proprietary application server, and continued it at FireFly, which was a top-10 traffic site in its day. Those were heady days, when we were pioneering new techniques for building distributed web-based systems, eventually standardizing on many of the architectural components we all know today (e.g. application servers, distributed memory caches). The products I built were based on classic pre-cloud distributed architecture, and an entire tools industry quickly evolved to make building these architectures faster and easier. While we generally designed these architectures around highly available components - e.g. fault-tolerant servers, firewalls, storage, routers, databases, and application servers - we expected failure to be the exception, not the rule.

Cloud Architecture

In 2000, UC Berkeley professor Eric Brewer postulated that a distributed system could provide at most two of the following three guarantees: consistency, availability, and partition tolerance. This conjecture became known as the CAP Theorem, and it marked the beginning of a new architectural model for distributed computing. By the mid-to-late 2000s, CAP had become the underlying design principle behind a series of systems, including Google’s BigTable, Amazon’s Dynamo, and Facebook’s Cassandra. In Amazon’s 2007 paper Dynamo: Amazon’s Highly Available Key-value Store, the authors state that “customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.” In other words, the new architectures assumed failure is the rule, not the exception.
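To make that mindset concrete, here is a minimal sketch of the quorum-style write the Dynamo paper describes: a write goes to N replicas and is considered successful once W of them acknowledge, so the system stays available even while individual replicas are down. This is an illustration under my own assumptions, not Amazon's implementation; the Replica class and its put() method are hypothetical stand-ins.

```python
# Illustrative sketch of a Dynamo-style quorum write with N replicas and
# a write quorum W. The Replica class is a hypothetical stand-in.

class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.store = {}

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(f"replica {self.name} is unreachable")
        self.store[key] = value


def quorum_write(replicas, key, value, w):
    """Succeed if at least `w` replicas acknowledge the write."""
    acks = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            acks += 1
        except ConnectionError:
            continue  # failure is expected, not exceptional
    if acks < w:
        raise RuntimeError(f"only {acks} acks, needed {w}")
    return acks


# Three replicas (N=3), one of which has failed; the write still
# succeeds because the quorum W=2 can be met.
replicas = [Replica("a"), Replica("b", alive=False), Replica("c")]
print(quorum_write(replicas, "cart:123", ["item-42"], w=2))
```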

Comparing Pre- and Post-Cloud Architecture

While pre- and post-cloud architectures share many similarities, the CAP Theorem points to one area where they differ dramatically: their expectations about failure. Pre-cloud architectures assume failure could occur; cloud architectures assume failure will occur. Cloud architectures are built on shared compute, database, and storage services that make explicit CAP tradeoffs. Because network partitions cannot be designed away, each service chooses what to give up when a partition happens: a storage service may favor availability at the expense of strong consistency, serving possibly stale data rather than refusing requests, while a database service may favor consistency at the expense of availability, refusing writes it cannot safely coordinate, and so on.
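As a rough illustration of that tradeoff, the sketch below shows a read path that prefers a strongly consistent read but, when a partition makes the quorum unreachable, falls back to a possibly stale local replica instead of failing outright, trading consistency for availability. The client class, its methods, and the QuorumUnreachable error are hypothetical, not any specific vendor's API.

```python
# Hypothetical sketch of choosing availability over consistency during a
# network partition. Not a real client library.

class QuorumUnreachable(Exception):
    pass


class KeyValueClient:
    def __init__(self, local_replica, partitioned=False):
        self.local_replica = local_replica   # possibly stale local copy
        self.partitioned = partitioned       # simulate a network partition

    def consistent_read(self, key):
        if self.partitioned:
            raise QuorumUnreachable(key)
        return f"fresh value of {key}"

    def read(self, key):
        """Prefer a consistent read; degrade to a stale read under partition."""
        try:
            return self.consistent_read(key), "consistent"
        except QuorumUnreachable:
            return self.local_replica.get(key), "possibly stale"


client = KeyValueClient({"cart:123": ["item-42"]}, partitioned=True)
print(client.read("cart:123"))   # -> (['item-42'], 'possibly stale')
```

A service optimized the other way around would do the opposite: raise an error to the caller rather than return data it cannot guarantee is current.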

Conclusions

I’ve never been a proponent of “big bang” re-architectures, preferring instead to evolve a product incrementally over time. But when it comes to leveraging the cloud, I think we need to take a close look at the cloud-readiness of our pre-cloud architectures. Architectures that were never designed around the CAP Theorem’s failure assumptions sit on one side of a line of demarcation that will likely only be crossed with new products, or with new architectures for existing products.

So when you ask your architect to evaluate your readiness for the cloud, make sure to take a long, hard look at your product’s ability to withstand environments where failures - the loss of consistency, availability, or partition tolerance - are the rule and not the exception.
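One practical way to start that evaluation is simple fault injection in a test environment: wrap the calls your product makes to its dependencies, make a fraction of them fail, and watch whether the product degrades gracefully. The decorator below is a minimal sketch of that idea; the failure rate and the wrapped fetch_inventory function are illustrative assumptions, not part of any particular tool.

```python
import random
from functools import wraps

# Minimal fault-injection sketch: randomly fail a fraction of calls to a
# dependency so you can observe how the rest of the system copes.

def inject_faults(failure_rate=0.2):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if random.random() < failure_rate:
                raise ConnectionError(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(failure_rate=0.5)
def fetch_inventory(sku):
    return {"sku": sku, "in_stock": 7}


for _ in range(5):
    try:
        print(fetch_inventory("sku-42"))
    except ConnectionError as err:
        print(f"degraded path: {err}")
```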