
Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services


Authors: Mark Simms and Michael Thomassy

 

Overview

 

Cloud computing is distributed computing; distributing computing requires thoughtful planning and delivery – regardless of the platform choice. The purpose of this document is to provide thoughtful guidance based on real-world customer scenarios for building scalable applications on Windows Azure and SQL Database, leveraging the Platform-as-a-Service (PaaS) approach (such applications are built as Windows Azure Cloud Services, using web and worker roles).

 

Important
NOTE: All of the best practice guidance in this paper is derived from deep engagement with customers running production code on Windows Azure. This paper discusses the Windows Azure Cloud Services (PaaS) platform based on the v1.6 SDK release; it does not cover upcoming features such as Windows Azure Web Sites and Windows Azure Virtual Machines (IaaS). Future releases of this document will incorporate best practices associated with these and other features and services.

 

This document covers the underlying design concepts for building Windows Azure applications, key Windows Azure platform capabilities, limits and features, as well as best practices for working with the core Windows Azure services. The focus is on those applications amenable to a loosely consistent distributed data store (as opposed to strictly consistent or high-density multi-tenant data models).

 

Shifting aspects of your applications and services into Windows Azure can be attractive for many reasons, such as:

  • Conserving or consolidating capital expenditures into operational expenditures (capex into opex).
  • Reducing costs (and improving efficiency) by more closely matching demand and capacity.
  • Improving agility and time to market by reducing or removing infrastructure barriers.
  • Increasing audience reach to new markets such as mobile devices.
  • Benefiting from the massive scale of cloud computing by building new applications that can support a global audience in geo-distributed datacenters.

There are many excellent technical reasons to develop new applications or port some or all of an existing application to Windows Azure. As the environment is rich with implementation alternatives, you must carefully evaluate your specific application pattern in order to select the correct implementation approach. Some applications are a good fit for Windows Azure Cloud Services (which is a Platform-as-a-Service, or PaaS, approach), whilst others might benefit from a partial or complete Infrastructure-as-a-Service (IaaS) approach, such as Windows Azure Virtual Machines. Finally, certain application requirements might be best served by using both together.

 

Your application should have one or more of the following three key aspects in order to maximize the benefits of Windows Azure Cloud Services. (Not all of these need to be present in your application; an application may well yield a strong return on your investment by making very good use of Windows Azure with only one of the following aspects. However, a workload that does not exhibit any of these characteristics is probably not an ideal fit for Windows Azure Cloud Services.)

 

The key aspects against which an application should be evaluated are:

  • Elastic demand. One of the key value propositions of moving to Windows Azure is elastic scale: the ability to add or remove capacity from the application (scaling-out and scaling-in) to more closely match dynamic user demand. If your workload has a static, steady demand (for example, a static number of users, transactions, and so on) this advantage of Windows Azure Cloud Services is not maximized.
  • Distributed users and devices. Running on Windows Azure gives you instant access to global deployment of applications. If your workload has a captive user base running in a single location (such as a single office), cloud deployment may not provide optimal return on investment.
  • Partitionable workload. Cloud applications scale by scaling out – and thereby adding more capacity in smaller chunks. If your application depends upon scaling up (for example, large databases and data warehouses) or is a specialized, dedicated workload (for example, large, unified high-speed storage), it must be decomposed (partitioned) to run on scale-out services to be feasible in the cloud. Depending on the workload, this can be a non-trivial exercise.

 

To reiterate: your workload needs only one of the preceding three aspects to achieve a high return on the investment of moving to, or building on, Windows Azure Cloud Services. Applications that exhibit all three characteristics are likely to see a very strong return on that investment.

Design Concepts

While many aspects of designing applications for Windows Azure are very familiar from on-premises development, there are several key differences in how the underlying platform and services behave. Understanding these differences – and, as a result, how to design for the platform rather than against it – is crucial in delivering applications that redeem the promise of elastic scale in the cloud.

 

This section outlines five key concepts that are the critical design points of building large-scale, widely distributed scale-out applications for Platform-as-a-Service (PaaS) environments like Windows Azure Cloud Services. Understanding these concepts will help you design and build applications that do not just work on Windows Azure Cloud Services, but thrive there, returning as many of the benefits of your investment as possible. All of the design considerations and choices discussed later in this document will tie back to one of these five concepts.

 

In addition, it’s worth noting that while many of these considerations and best practices are viewed through the lens of a .NET application, the underlying concepts and approaches are largely language or platform agnostic.

Scale-out not Scale-up

 

The primary shift in moving from an on-premises application to Windows Azure Cloud Services is related to how applications scale. The traditional method of building larger applications relies on a mix of scale-out (stateless web and application servers) and scale-up (buy a bigger multi-core/large-memory system or database server, build a bigger data center, and so on). In the cloud, scale-up is not a realistic option; the only path to achieving truly scalable applications is by explicit design for scale-out.

 

As many of the elements of an on-premise application are already amenable to scale-out (web servers, application servers), the challenge lies in identifying those aspects of the application which take a dependency on a scale-up service and converting (or mapping) that to a scale-out implementation. The primary candidate for a scale-up dependency is typically the relational database (SQL Server / Windows Azure SQL Database).

 

Traditional relational designs have focused around a globally coherent data model (single-server scale-up) with strong consistency and transactional behavior. The traditional approach to providing scale against this backing store has been to “make everything stateless”, deferring the responsibility of managing state to the scale-up SQL server.

 

However, the incredible scalability of SQL Server carries with it a distinct lack of truly elastic scale. Instead of extremely responsive resource availability, each time you need more capacity you must buy a bigger server – with an expensive migration phase, and with capacity greatly exceeding current demand. In addition, there is an exponential cost curve when scaling past mid-range hardware.

 

With the architectural underpinning of Windows Azure Cloud Services being scale-out, applications need to be designed to work with scale-out data stores. This means design challenges such as explicitly partitioning data into smaller chunks (each of which can fit in a data partition or scale-out unit), and managing consistency between distributed data elements. This achieves scale through partitioning in a way that avoids many of the drawbacks of designing to scale up.
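
As a simple illustration of explicit partitioning, the following minimal sketch (in C#) routes data to one of several scale-out units – for example, SQL Database shards – by hashing a partition key. The shard list and connection strings are hypothetical; a production design would load the map from configuration and handle rebalancing when partitions are added or removed.

    using System;
    using System.Collections.Generic;

    // Minimal sketch: map a partition key (for example, a customer ID) to a
    // data partition (scale-out unit). Connection strings are placeholders.
    public static class PartitionRouter
    {
        private static readonly List<string> Shards = new List<string>
        {
            "Server=tcp:shard0.database.windows.net;Database=Orders0;...",
            "Server=tcp:shard1.database.windows.net;Database=Orders1;...",
            "Server=tcp:shard2.database.windows.net;Database=Orders2;..."
        };

        public static string GetShardFor(string partitionKey)
        {
            if (string.IsNullOrEmpty(partitionKey))
                throw new ArgumentException("partitionKey is required");

            // Use a stable hash; string.GetHashCode() is not guaranteed to be
            // stable across processes or CLR versions.
            uint hash = Fnv1aHash(partitionKey);
            return Shards[(int)(hash % (uint)Shards.Count)];
        }

        private static uint Fnv1aHash(string value)
        {
            uint hash = 2166136261;
            foreach (char c in value)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return hash;
        }
    }

The same pattern applies to storage accounts, queues or any other scale-out resource: the application owns the mapping from logical key to physical scale unit.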

 

Pivoting from well-known scale-up techniques and toward scale-out data and state management is typically one of the biggest hurdles in designing for the cloud; addressing these challenges and designing applications that can take advantage of the scalable, elastic capabilities of Windows Azure Cloud Services and Windows Azure SQL Database for managing durable data is the focus of much of this document.

Everything has a Limit: Compose for Scale

 

In a world where you run your own data center you have a nearly infinite degree of control, juxtaposed with a nearly infinite number of choices. Everything from the physical plant (air conditioning, electrical supply, floor space) to the infrastructure (racks, servers, storage, networking, and so on) to the configuration (routing topology, operating system installation) is under your control.

 

This degree of control comes with a cost – capital, operational, human and time. The cost of managing all of the details in an agile and changing environment is at the heart of the drive towards virtualization, and a key aspect of the march to the cloud. In return for giving up a measure of control, these platforms reduce the cost of deployment and management, and increase agility. The constraint that they impose is that the size (capacity, throughput, and so on) of the available components and services is restricted to a fixed set of offerings.

 

To use an analogy, commercial bulk shipping is primarily based around shipping containers. These containers may be carried by various transports (ships, trains and trucks) and are available in a variety of standard sizes (up to 53 feet long). If the amount of cargo you wish to ship exceeds the capacity of the largest trailer, you need to do one of the following:

  • Use multiple trailers. This involves splitting (or partitioning) the cargo to fit into different containers, and coordinating the delivery of the trailers.
  • Use a special shipping method. For cargo which cannot be distributed into standard containers (too large, bulky, and so on), highly specialized methods, such as barges, need to be used. These methods are typically far more expensive than standard cargo shipping.

 

Bringing this analogy back to Windows Azure (and cloud computing in general), every resource has a limit. Be it an individual role instance, a storage account, a cloud service, or even a data center – every available resource in Azure has some finite limit. These may be very large limits, such as the amount of storage available in a data center (similar to how the largest cargo ships can carry over 10,000 containers), but they are finite.

 

With this in mind, the approach to scale is to partition the load and compose it across multiple scale units – be that multiple VMs, databases, storage accounts, cloud services, or data centers.

 

In this document we use the term scale unit to refer to a group of resources that can (a) handle a defined degree of load, and (b) be composed together to handle additional load. For example, a Windows Azure Storage account has a maximum size of 100 TB. If you need to store more than 100 TB of data, you will need to use multiple storage accounts (i.e. at least two scale units of storage).
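
As a worked example, the sketch below computes how many scale units a workload needs when more than one dimension is constrained; the answer is the worst case across all dimensions. The 100 TB capacity figure is the storage account limit discussed here; the transaction throughput limit is illustrative only.

    using System;

    // Sketch of scale-unit arithmetic: required capacity divided by per-unit
    // limits, rounded up, taking the worst case across all dimensions.
    static class ScaleUnitMath
    {
        static int UnitsFor(double required, double perUnitLimit)
        {
            return (int)Math.Ceiling(required / perUnitLimit);
        }

        static void Main()
        {
            double requiredTerabytes = 250;     // total durable data to store
            double requiredOpsPerSec = 12000;   // peak storage transactions/sec

            int byCapacity = UnitsFor(requiredTerabytes, 100);    // -> 3 accounts (100 TB limit)
            int byThroughput = UnitsFor(requiredOpsPerSec, 5000); // -> 3 accounts (illustrative limit)

            int storageAccounts = Math.Max(byCapacity, byThroughput);
            Console.WriteLine("Storage accounts (scale units) needed: " + storageAccounts);
        }
    }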

 

The general design guidelines for the capacity of each core service or component of Azure are discussed in later sections, along with recommended approaches for composing these services for additional scale.

Design for Availability

 

Enormous amounts of time, energy and intellectual capital have been invested into building highly resilient applications on-premises. This typically boils down to separating the application into low-state components (application servers, networking) and high-state components (databases, SANs), and making each resilient against failure modes.

 

In this context, a failure mode refers to the combination of (a) the system being observed in a failure state and (b) the cause of that failure. For example, a database being inaccessible due to a misconfigured password update is a failure mode; the failure state is the inability to connect (connection refused, credentials not accepted), and the failure cause is a password update which was not properly communicated to application code.

 

Low-state components deliver resiliency through loosely coupled redundancy, with their “integration” into the system managed by external systems. For example, additional web servers are placed behind a load balancer; each web server is identical to the others (making the addition of new capacity a matter of cloning a base web server image), with integration into the overall application managed by the load balancer.

 

High-state components deliver resiliency through tightly coupled redundancy, with the “integration” tightly managed between components. Examples are:

  • SQL Server. Adding a redundant SQL Server instance as part of an active/passive cluster requires careful selection of compatible (that is, identical!) hardware, and shared storage (such as a SAN), to deliver transactionally consistent failover between multiple nodes.
  • Electrical supply. Providing redundant electrical supply is a sophisticated example, requiring multiple systems acting in concert to mitigate both local issues (multiple power supplies for a server, with onboard hardware to switch between primary and secondary supplies) and datacenter-wide issues (backup generators for loss of utility power).

 

Resiliency solutions based around tightly coupled approaches are inherently more expensive than loosely coupled “add more cloned stuff” approaches: they require highly trained personnel, specialized hardware, and careful configuration and testing. Not only is it hard to get right, it costs money to do correctly.

 

This approach of focusing on ensuring that the hardware platforms are highly resilient can be thought of as a “titanium eggshell”. To protect the contents of the egg, we coat the shell in a layer of tough (and expensive) titanium.

 

Experience running systems at scale (see http://www.mvdirona.com/jrh/TalksAndPapers/JamesRH_Lisa.pdf for further discussion) has demonstrated that in any sufficiently large system (such as data systems at the scale of Windows Azure) the sheer number of physical moving parts leads to some pieces of the system always being broken. The Windows Azure platform was designed to work with this limitation rather than against it, relying on automated recovery from node-level failure events. This design intent flows through all core Windows Azure services, and is key to designing an application that works with the Windows Azure availability model.

 

Shifting to Windows Azure changes the resiliency conversation from an infrastructure redundancy challenge to a services redundancy challenge. Many of the core services which dominate availability planning on-premises “just keep working” in Azure:

  • SQL Database automatically maintains multiple transactionally consistent replicas of your data. Node level failures for a database automatically fail over to the consistent secondary; contrast the ease of this experience with the time and expense required to provide on-premise resiliency.
  • Windows Azure Storage automatically maintains multiple consistent copies of your data (For further reading, see http://sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf). Node level failures for a storage volume automatically fail over to a consistent secondary. As with SQL Database, contrast this completely managed experience with the direct management of resilient storage in an on-premise cluster or SAN.

However, the title of this section is availability, not resiliency. Resiliency is only one part of the overall story of continuously delivering value to users within the bounds of an SLA. If all of the infrastructure components of a service are healthy, but the service cannot cope with the expected volume of users, it is not available, nor is it delivering value.

 

Mobile- or social-centric workloads (such as public web applications with mobile device applications) tend to be far more dynamic than those that target captive audiences, and they require a more sophisticated approach to handling burst events and peak load. The key concepts to keep in mind for designing availability into Windows Azure applications are discussed in detail throughout this document, based on these pivots:

  • Each service or component in Windows Azure provides a certain Service Level Agreement (SLA); this SLA may not be directly correlated with the availability metric required to run your application. Understanding all of the components in your system, their availability SLA, and how they interact is critical in understanding the overall availability that can be delivered to your users.
    • Avoid single points of failure that will degrade your SLA, such as single-instance roles.
    • Compose or fall back to multiple components to mitigate the impact of a specific service being offline or unavailable.
  • Each service or component in Windows Azure can experience failure, either a short-lived transient, or a long-lived event. Your application should be written in such a way as to handle failure gracefully.
    • For transient failures, provide appropriate retry mechanisms to reconnect or resubmit work.
    • For other failure events, provide rich instrumentation on the failure event (reporting the error to operations), and a suitable error message back to the user.
    • Where possible, fall back to a different service or workflow. For example, if a request to insert data into SQL Database fails (for any non-transient reason, such as invalid schema), write the data into blob storage in a serialized format. This would allow the data to be durably captured, and submitted to the database after the schema issue has been resolved.
  • All services will have a peak capacity, either explicitly (through a throttling policy or peak load asymptote) or implicitly (by hitting a system resource limit).
    • Design your application to degrade gracefully when it hits resource limits, taking appropriate action to soften the impact on the user.
    • Implement appropriate back-off/retry logic to avoid a “convoy” effect against services (a minimal back-off sketch follows this list). Without an appropriate back-off mechanism, downstream services will never have a chance to catch up after experiencing a peak event (as your application will be continuously trying to push more load into the service, triggering the throttling policy or resource starvation).
  • Services which can experience rapid burst events need to gracefully handle exceeding their peak design load, typically through shedding functionality.
    • Similar to how the human body restricts blood flow to the extremities when dealing with extreme cold, design your services to shed less-important functionality during extreme load events.
    • The corollary here is that not all services provided by your application have equivalent business criticality; they can be subject to differential SLAs.
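
As referenced in the back-off/retry point above, here is a minimal sketch of exponential back-off with jitter for transient failures. It is not the built-in retry policy of any Windows Azure client library; IsTransient is a placeholder for your own fault-detection logic, and the delay values are illustrative.

    using System;
    using System.Threading;

    // Minimal retry-with-back-off sketch. Real code would inspect SQL error
    // codes, HTTP status codes or storage exceptions to decide whether a
    // failure is transient and worth retrying.
    static class TransientRetry
    {
        static readonly Random Jitter = new Random();

        public static T Execute<T>(Func<T> operation, int maxAttempts)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return operation();
                }
                catch (Exception ex)
                {
                    if (attempt >= maxAttempts || !IsTransient(ex))
                        throw; // non-transient, or out of attempts: surface it

                    // Exponential back-off (200 ms, 400 ms, 800 ms, ...) plus
                    // random jitter, so many clients do not retry in lock-step
                    // and create a "convoy" against the recovering service.
                    int delayMs = (int)(200 * Math.Pow(2, attempt - 1)) + Jitter.Next(0, 250);
                    Thread.Sleep(delayMs);
                }
            }
        }

        static bool IsTransient(Exception ex)
        {
            // Placeholder: treat timeouts as transient for this sketch.
            return ex is TimeoutException;
        }
    }

Only wrap operations that are safe to retry (idempotent, or resubmitted as a new unit of work), for example TransientRetry.Execute(() => SubmitOrder(order), 5).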

 

These high-level concepts will be applied in more detail in each of the sections describing the core Windows Azure services, along with availability goals for each service or component, and recommendations on how to design for availability. Keep in mind that the datacenter is still a single point of failure for large applications; from electrical supply failures to system errors, infrastructure and application issues have brought down entire data centers. While relatively rare, applications requiring the highest levels of uptime should be deployed in multiple redundant data centers.

 

Deploying applications in multiple data centers requires a number of infrastructure and application capabilities:

  • Application logic to route users of the services to the appropriate data center (based on geography, user partitioning, or other affinity logic); a minimal routing sketch follows this list.
  • Synchronization and replication of application state between data centers, with appropriate latency and consistency levels.
  • Autonomous deployment of applications, such that dependencies between data centers are minimized (that is, avoid the situation wherein a failure in data center A triggers a failure in data center B).
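
The first capability above can start as simply as an affinity map from the user's region to a deployment endpoint, as in the minimal sketch below; the region names and URLs are hypothetical, and a production system would typically layer DNS-based traffic management on top.

    using System;
    using System.Collections.Generic;

    // Minimal affinity-routing sketch: map a user's region to the deployment
    // that should serve them, with a default for everything else. All region
    // names and endpoints are hypothetical.
    static class DataCenterRouter
    {
        static readonly Dictionary<string, Uri> Deployments =
            new Dictionary<string, Uri>(StringComparer.OrdinalIgnoreCase)
            {
                { "us",   new Uri("https://myapp-us.cloudapp.net") },
                { "eu",   new Uri("https://myapp-eu.cloudapp.net") },
                { "asia", new Uri("https://myapp-asia.cloudapp.net") }
            };

        static readonly Uri Default = new Uri("https://myapp-us.cloudapp.net");

        public static Uri EndpointFor(string userRegion)
        {
            Uri endpoint;
            return Deployments.TryGetValue(userRegion ?? string.Empty, out endpoint)
                ? endpoint
                : Default;
        }
    }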

 

Design for Business Continuity

 

As with availability, providing disaster recovery solutions (in case of data center loss) has traditionally required enormous time, energy and capital. This section focuses on approaches and considerations for providing business continuity in the face of system failure and data loss (whether system or user triggered), as the term “disaster recovery” has taken on specific connotations around implementation approaches in the database community.

 

Delivering business continuity breaks down to:

  • Maintaining access to and availability of critical business systems (applications operating against durable state) in the case of catastrophic infrastructure failure.
  • Maintaining access to and availability of critical business data (durable state) in the case of catastrophic infrastructure failure.
  • Maintaining availability of critical business data (durable state) in the case of operator error or accidental deletion, modification or corruption.

 

The first two elements have traditionally been addressed in the context of geographic disaster recovery (geo-DR), with the third being the domain of data backup and data restoration.

 

Windows Azure changes the equation significantly for availability of critical business systems, enabling rapid geographically distributed deployment of key applications into data centers around the globe. Indeed, the process of rolling out a geographically distributed application is little different than rolling out a single cloud service.

 

The key challenge remains managing access to durable state; accessing durable state services (such as Windows Azure Storage and SQL Database) across data centers typically produces sub-optimal results due to high and/or variable latency, and does not fulfill the business continuity requirement in case of data center failure.

 

As with resiliency, many Windows Azure services provide (or have in their roadmap) automatic geo replication. For example, unless specifically configured otherwise, all writes into Windows Azure storage (blob, queue, or table) are automatically replicated to another data center (each data center has a specific “mirror” destination within the same geographic region). This greatly reduces the time and effort required to provide traditional disaster recovery solutions on top of Windows Azure. An overview of the geo replication capabilities of the core Windows Azure services managing durable state is provided in later sections.

 

For maintaining business continuity in the face of user or operator error, there are several additional considerations to account for in your application design. While Windows Azure Storage provides limited audit capability through the Storage Analytics feature (described in a later section), it does not provide any point-in-time restore capabilities. Services requiring resiliency in the face of accidental deletion or modification will need to look at application-centric approaches, such as periodically copying blobs to a different storage account.

 

SQL Database provides basic functionality for maintaining historical snapshots of data, including DB Copy and import/export via bacpac. These options are discussed in detail later in this document.
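
For example, a DB Copy snapshot can be created by issuing a CREATE DATABASE ... AS COPY OF statement against the logical server's master database, as in the sketch below; the server, database and credential names are placeholders. The statement returns once the copy has started; the copy completes asynchronously and its progress can be checked (for example, via the sys.dm_database_copies view in master) before relying on it as a snapshot.

    using System;
    using System.Data.SqlClient;

    // Sketch of creating a point-in-time snapshot of a SQL Database using
    // the DB Copy feature. Server, database and credentials are placeholders.
    static class DatabaseSnapshot
    {
        static void Main()
        {
            // CREATE DATABASE ... AS COPY OF must be run against master.
            const string masterConnection =
                "Server=tcp:myserver.database.windows.net;Database=master;" +
                "User ID=admin@myserver;Password=...;Encrypt=True;";

            string copyName = "MyAppDb_copy_" + DateTime.UtcNow.ToString("yyyyMMdd");

            using (var connection = new SqlConnection(masterConnection))
            using (var command = connection.CreateCommand())
            {
                command.CommandText =
                    "CREATE DATABASE [" + copyName + "] AS COPY OF [MyAppDb]";
                connection.Open();
                command.ExecuteNonQuery(); // returns once the copy has started
            }
        }
    }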

Density is Cost of Goods

 

With the elastic scale provided by the Windows Azure platform, the supply curve can closely match the demand curve (rather than having a large amount of extra capacity to account for peak load).

 

With elastic scale, cost of goods is driven by:

  • How many scale units are employed in a solution – VMs, storage accounts, and so on (composing for scale).
  • How efficiently work is performed by those scale units.

 

We refer to how much work can be performed for a given amount of capacity as the density of the application. Denser services and frameworks allow a greater amount of work to be performed for a given resource deployment; that is to say improving density enables reduction in deployed capacity (and cost), or the ability to absorb additional load with the same deployed capacity. Density is driven by two key factors:

  • How efficiently work is performed within a scale unit. This is the traditional form of performance optimization – managing thread contention and locks, optimizing algorithms, tuning SQL queries.
  • How efficiently work is coordinated across scale units. In a world where systems are made up of larger numbers of smaller units, the ability to efficiently stitch them together is critical in delivering efficiency. This involves frameworks and tools that communicate across components, such as SOAP messaging stacks (such as WCF), ORMs (such as Entity Framework), TDS calls (SQL client code), and object serialization (such as Data Contracts or JSON).

 

In addition to the traditional optimization techniques used against a single computer (or database), optimizing the distributed communication and operations is critical in delivering a scalable, efficient Windows Azure service. These key optimizations are covered in detail in later sections:

  • Chunky not chatty. For every distributed operation (that is, one resulting in a network call) there is a certain amount of overhead for packet framing, serialization, processing, and so on. To minimize the amount of overhead, try to batch into a smaller number of “chunky” operations rather than a large number of “chatty” operations. Keep in mind that batching granular operations does increase latency and exposure to potential data loss. Examples of proper batching behavior are:
    • SQL. Execute multiple operations in a single batch.
    • REST and SOAP services (such as WCF). Leverage message-centric operation interfaces, rather than a chatty RPC style, and consider a REST-based approach if possible.
    • Windows Azure storage (blobs, tables, queues). Publish multiple updates in a batch, rather than individually.
  • Impact of serialization. Moving data between machines (as well as in and out of durable storage) generally requires the data be serialized into a wire format. The efficiency (that is, the time taken and space consumed) of this operation quickly dominates overall application performance for larger systems.
    • Leverage highly efficient serialization frameworks.
    • Use JSON for communication with devices, or for interoperable (human readable) applications.
    • Use very efficient binary serialization (such as protobuf or Avro) for service-to-service communication when you control both endpoints.
  • Use efficient frameworks. There are many rich frameworks available for development, with large, sophisticated feature sets. The downside to many of these frameworks is that you often pay the performance cost for features you don’t use.
    • Isolate services and client APIs behind generic interfaces to allow for replacement or side-by-side evaluation (either through static factories, or an inversion of control container). For example, provide a pluggable caching layer by working against a generic interface rather than a specific implementation (such as Windows Azure caching).
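
To make the last point concrete, the sketch below isolates caching behind a generic interface with a simple static factory; ICache, InMemoryCache and CacheFactory are illustrative names, and a Windows Azure Caching-backed implementation could be registered at startup without changing any calling code.

    using System;
    using System.Collections.Concurrent;

    // Sketch of isolating a service behind a generic interface so that
    // implementations (in-memory, Windows Azure Caching, a third-party
    // cache) can be swapped or evaluated side by side.
    public interface ICache
    {
        bool TryGet<T>(string key, out T value);
        void Put<T>(string key, T value, TimeSpan timeToLive);
    }

    // Trivial in-process implementation used for development and tests.
    // (Expiry handling is omitted to keep the sketch short.)
    public class InMemoryCache : ICache
    {
        private readonly ConcurrentDictionary<string, object> store =
            new ConcurrentDictionary<string, object>();

        public bool TryGet<T>(string key, out T value)
        {
            object raw;
            if (store.TryGetValue(key, out raw) && raw is T)
            {
                value = (T)raw;
                return true;
            }
            value = default(T);
            return false;
        }

        public void Put<T>(string key, T value, TimeSpan timeToLive)
        {
            store[key] = value;
        }
    }

    // Static factory: application code asks for ICache and never references
    // a concrete cache technology directly.
    public static class CacheFactory
    {
        private static ICache current = new InMemoryCache();

        public static ICache Current { get { return current; } }

        public static void Use(ICache implementation)
        {
            if (implementation == null) throw new ArgumentNullException("implementation");
            current = implementation;
        }
    }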

Exploring Windows Azure Cloud Services

 

In the previous section we introduced the key design concepts and perspectives for building applications that take advantage of the cloud fabric provided by Windows Azure. This section will explore the core platform services and features, illustrating their capabilities, scale boundaries and availability patterns.

 

As every Windows Azure service or infrastructure component provides a finite capacity with an availability SLA, understanding these limits and behaviors is critical in making appropriate design choices for your target scalability goals and end-user SLA. Each of the core Windows Azure services is presented in the context of four key pivots: features and their intent; density; scalability; and availability.

Windows Azure Subscription

 

A Windows Azure subscription is the basic unit of administration, billing, and service quotas. Each Windows Azure subscription has a default set of quotas, intended to prevent accidental overages and resource consumption, that can be increased by contacting support.

 

Each subscription has an account owner, and a set of co-admins, authorized through Microsoft Accounts (formerly Live IDs), who have full control over the resources in the subscription through the management portal. They can create storage accounts, deploy cloud services, change configurations, and can add or remove co-admins.

 

The Windows Azure Management APIs (REST-based web services) provide an automation interface for creating, configuring and deploying Windows Azure services (used by the management portal under the hood). Access to these APIs is restricted using management certificates.
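
For illustration, the sketch below lists the hosted services in a subscription by calling the Service Management REST API with a management certificate from the local certificate store; the subscription ID and certificate thumbprint are placeholders, and the x-ms-version value shown is one of the API versions available at the time of writing.

    using System;
    using System.IO;
    using System.Net;
    using System.Security.Cryptography.X509Certificates;

    // Sketch of calling the Windows Azure Service Management REST API,
    // authenticated with a management certificate. Subscription ID and
    // certificate thumbprint are placeholders.
    static class ManagementApiSample
    {
        static void Main()
        {
            const string subscriptionId = "<subscription-id>";
            const string thumbprint = "<management-certificate-thumbprint>";

            var store = new X509Store(StoreName.My, StoreLocation.CurrentUser);
            store.Open(OpenFlags.ReadOnly);
            X509Certificate2 cert = store.Certificates
                .Find(X509FindType.FindByThumbprint, thumbprint, false)[0];
            store.Close();

            var request = (HttpWebRequest)WebRequest.Create(
                "https://management.core.windows.net/" + subscriptionId +
                "/services/hostedservices");
            request.Headers.Add("x-ms-version", "2012-03-01"); // API version header
            request.ClientCertificates.Add(cert);               // certificate auth

            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
            {
                Console.WriteLine(reader.ReadToEnd()); // XML list of hosted services
            }
        }
    }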

 

Contributing authors: Jason Roth and Ralph Squillace

Reviewers: Brad Calder, Dennis Mulder, Mark Ozur, Nina Sarawgi, Marc Mercuri, Conor Cunningham, Peter Carlin, Stuart Ozer, Lara Rubbelke, and Nicholas Dritsas.

Original Post:   Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services



