Distributed Systems - what should be considered

Introduction

Many organizations now build large, complex systems that must handle massive amounts of data and traffic. These are typically distributed systems: the work is spread across many machines, which lets the system scale as load grows and keep running when individual components fail. This post outlines what to consider when building one, from the basic requirements through design, operations, and delivery.

Basic Requirements

For a distributed system to be truly reliable, it must possess the following characteristics:

  • Fault-Tolerant: It can recover from component failures without performing incorrect actions.
  • Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
  • Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
  • Consistent: The system can coordinate actions by multiple components often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
  • Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a “non-scalable” system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
  • Predictable Performance: The ability to provide desired responsiveness in a timely manner.
  • Secure: The system authenticates access to data and services.

Design in Distributed World

Composition

Typical Architecture

  • Load balancer with multiple backend replicas
  • Server with multiple backends
  • Server tree

Distributed State

The CAP Principle

  • Consistency
  • Availability
  • Partition Tolerance

The Eight Fallacies of Distributed Computing

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is only one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.
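
The first fallacy is the one that most often shapes code: since the network is not reliable, remote calls are usually wrapped in bounded retries. Below is a minimal sketch of retrying with exponential backoff and jitter. The names `TransientNetworkError`, `call_with_retries`, and `make_flaky` are hypothetical and exist only for this illustration; a real client would catch whatever errors its own transport library raises.

```python
import random
import time


class TransientNetworkError(Exception):
    """Hypothetical error standing in for a timeout or dropped connection."""


def call_with_retries(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == max_attempts:
                raise  # Give up: the caller must handle a persistent failure.
            # The delay ceiling doubles each attempt; random jitter keeps many
            # clients from retrying in lockstep.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            sleep(delay)


def make_flaky(failures_before_success):
    """Build a fake remote call that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientNetworkError("simulated packet loss")
        return "ok after %d calls" % state["calls"]

    return operation


if __name__ == "__main__":
    # The sleep function is replaced so the demo runs instantly.
    print(call_with_retries(make_flaky(2), sleep=lambda _: None))
```

Retries are only safe when the remote operation is idempotent, and the attempt limit matters: unbounded retries can turn a brief outage into an overload failure.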

Design for Operations

Operational Requirements

  • Configuration
  • Startup and shutdown
  • Queue draining
  • Software upgrades
  • Backups and restores
  • Redundancy
  • Replicated databases
  • Hot swaps
  • Toggles for individual features
  • Graceful degradation
  • Access controls and rate limits
  • Data import controls
  • Monitoring
  • Auditing
  • Debug instrumentation
  • Exception collection
  • Documentation for Operations

Platform Selection

Platform Description

A platform may be described along three axes:

  • Level of service abstraction: IaaS, PaaS, SaaS
  • Type of machine: Physical, virtual, or process container
  • Level of resource sharing: Shared or private

Selection Strategies

Common Strategies

  • Default to Virtual
  • Make a Cost-Based Decision
  • Leverage Provider Expertise
  • Get Started Quickly
  • Implement Ephemeral Computing
  • Use the Cloud for Overflow Capacity
  • Leverage Superior Infrastructure
  • Develop an In-House Service Provider
  • Contract for an On-Premises, Externally Run Service
  • Implement a Bare Metal Cloud

Application Architectures

General architecture categories

  • Single-Machine Web Server
  • Two-Tier Web Service
  • Three-Tier Web Service
  • Four-Tier Web Service

Load Balancer Types

  • DNS Round Robin
  • Layer 3 and 4 Load Balancers
  • Layer 7 Load Balancer

Load Balancing Methods

  • Round Robin (RR)
  • Weighted RR
  • Least Loaded (LL)
  • Least Loaded with Slow Start
  • Utilization Limit
  • Latency
  • Cascade
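
To make the first three methods concrete, here is a rough sketch of Round Robin, Weighted RR, and Least Loaded selection. The class names are hypothetical, and "load" is assumed here to mean the number of in-flight requests the balancer has handed to each backend; real load balancers may use other signals.

```python
import itertools


class RoundRobin:
    """Hand requests to each backend in turn."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def pick(self):
        return next(self._cycle)


class WeightedRoundRobin:
    """Like round robin, but a backend with weight w appears w times per cycle."""

    def __init__(self, weights):
        # weights maps backend name -> positive integer weight.
        expanded = [b for b, w in weights.items() for _ in range(w)]
        self._cycle = itertools.cycle(expanded)

    def pick(self):
        return next(self._cycle)


class LeastLoaded:
    """Pick the backend with the fewest in-flight requests (assumed load signal)."""

    def __init__(self, backends):
        self.active = {b: 0 for b in backends}

    def pick(self):
        # Ties go to the backend listed first.
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        # Call when the request finishes so the count stays accurate.
        self.active[backend] -= 1


if __name__ == "__main__":
    rr = RoundRobin(["a", "b", "c"])
    print([rr.pick() for _ in range(4)])
    wrr = WeightedRoundRobin({"big": 2, "small": 1})
    print([wrr.pick() for _ in range(6)])
```

This naive weighted variant sends consecutive requests to the heavier backend; production balancers usually interleave the picks more smoothly. Least Loaded with Slow Start exists because a freshly added backend has zero load and would otherwise receive a flood of traffic at once.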

Other architectures to consider

  • Reverse Proxy Service
  • Cloud-Scale Service
  • Message Bus Architectures
  • Service-Oriented Architecture

Design for Scaling

General Strategy

  1. Identify Bottlenecks
  2. Reengineer Components
  3. Measure Results
  4. Be Proactive

The AKF Scaling Cube

The cube was developed by Abbott, Keeven, and Fisher. Each axis is a different way to scale:

  • x: Horizontal Duplication (also known as horizontal scaling or scaling out.)
  • y: Functional or Service Splits
  • z: Lookup-Oriented Split
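
As an illustration of the z axis, the sketch below splits data by looking up which shard owns a key. It assumes a simple hash-modulo rule, which is only one of several possible lookup schemes (a directory table keyed by customer or region is another). `shard_for` and `ShardedStore` are hypothetical names, and the in-memory dicts stand in for separate database servers.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard index using a stable hash."""
    # hashlib gives the same result in every process, unlike Python's hash().
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Key-value store whose data is split across several shards."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


if __name__ == "__main__":
    store = ShardedStore(num_shards=4)
    for user in ["alice", "bob", "carol", "dave"]:
        store.put(user, {"name": user})
    print(store.get("carol"))
    print([len(shard) for shard in store.shards])
```

One known weakness of hash-modulo placement is that changing the shard count moves most keys to a different shard; consistent hashing or a lookup directory is commonly used to limit that movement.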

Other techniques to consider

  • Caching
  • Data Sharding
  • Threading
  • Queueing
  • CDN

Caching

  • Cache Effectiveness
  • Cache Placement
  • Cache Persistence
  • Cache Replacement Algorithms
  • Cache Entry Invalidation
  • Cache Size
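
Several of these concerns can be seen in one small example. The following is a sketch of a cache using least-recently-used (LRU) replacement, one common replacement algorithm among many, with a fixed size, explicit invalidation, and a hit ratio as a measure of effectiveness. The `LRUCache` class is hypothetical and kept in memory for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._entries = OrderedDict()  # Oldest entry first, newest last.
        self.hits = 0
        self.misses = 0

    def get(self, key, default=None):
        if key in self._entries:
            self._entries.move_to_end(key)  # Mark as most recently used.
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        return default

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # Evict the oldest entry.

    def invalidate(self, key):
        """Drop an entry whose source data has changed."""
        self._entries.pop(key, None)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")      # "a" becomes the most recently used entry.
    cache.put("c", 3)   # Cache is full, so "b" is evicted.
    print(cache.get("b"), cache.get("c"), cache.hit_ratio())
```

The hit ratio is what tells you whether the cache is worth its memory: a larger cache size only helps if it raises that ratio for the actual access pattern.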

Design for Resiliency

Key Features for Resiliency

  • Everything Malfunctions Eventually
  • Resiliency through Spare Capacity
  • Failure Domains
  • Software Failures
  • Physical Failures
  • Overload Failures
  • Human Error

Operations in Distributed World

Distributed Systems Operations

  • Defining SRE
  • Change versus Stability
  • Operations at Scale

Service Life Cycle

  • Service Launch
  • Emergency Tasks
  • Nonemergency Tasks
  • Upgrades
  • Decommissioning
  • Project Work

Organizing Strategy

Daily work categories

  • Emergency Issues
  • Normal Requests
  • Project Work

Team Member Day Types

  1. Project-Focused Days
  2. Oncall Days
  3. Ticket Duty Days

DevOps

Three Ways of DevOps

  1. Workflow
    • Ensure each step is done in a repeatable way
    • Never pass defects to the next step
    • Ensure no local optimizations degrade global performance
    • Increase the flow of work
  2. Improve Feedback
    • Understand and respond to all customers, internal and external
    • Shorten feedback loops
    • Amplify all feedback
    • Embed knowledge where it is needed
  3. Continual Experimentation and Learning
    • Rituals are created that reward risk taking
    • Management allocates time for projects that improve the system
    • Faults are introduced into the system to increase resilience
    • You try “crazy” or audacious things

Common Technical DevOps Practices

  • Same Development and Operations Toolchain
  • Consistent Software Development Life Cycle (SDLC)
  • Managed Configuration and Automation
  • Infrastructure as Code
  • Automated Provisioning and Deployment
  • Artifact-Scripted Database Changes
  • Automated Build and Release
  • Release Vehicle Packaging
  • Abstracted Administration

Service Delivery

Build-Phase Steps: Develop -> Commit -> Build -> Package -> Register

Benefits a delivery platform should provide

  • Confidence
  • Reduced Risk
  • Shorter Interval from Keyboard to Production
  • Less Wait Time
  • Less Rework
  • Improved Execution
  • A Culture of Continuous Improvement
  • Improved Job Satisfaction

Deployment-Phase Steps: Promote -> Install -> Configure

Upgrading

There are several ways to upgrade a running service:

  • Taking the Service Down for Upgrading
  • Rolling Upgrades
  • Canary
  • Phased Roll-outs
  • Proportional Shedding
  • Blue-Green Deployment
  • Toggling Features

Take feature toggles (also called flag flips) as an example.

Reasons to use flag flips

  • Rapid Development
  • Gradual Introduction of New Features
  • Finely Timed Release Dates
  • Dynamic Roll Backs
  • Bug Isolation
  • A-B Testing
  • One Percent Testing
  • Differentiated Services
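
Gradual introduction, one percent testing, and dynamic rollbacks all come down to the same mechanism: a flag whose exposure can be changed at runtime. Here is a minimal sketch assuming a percentage-based rollout keyed on a user ID. `FeatureFlags`, the flag name `new_ui`, and the bucketing rule are hypothetical and not taken from any particular flag library.

```python
import hashlib


class FeatureFlags:
    """In-memory feature flags with percentage-based rollout."""

    def __init__(self):
        self._rollout = {}  # flag name -> percentage of users, 0 to 100

    def set_rollout(self, flag, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[flag] = percent

    def is_enabled(self, flag, user_id):
        percent = self._rollout.get(flag, 0)  # Unknown flags are off.
        # Hash flag and user together so each user lands in a stable bucket
        # that differs from flag to flag.
        digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.set_rollout("new_ui", 1)  # One percent testing.
    exposed = sum(flags.is_enabled("new_ui", f"user-{i}") for i in range(10000))
    print("users seeing new_ui:", exposed)
    flags.set_rollout("new_ui", 0)  # Dynamic rollback: no deploy needed.
```

Because a user's bucket never changes, raising the percentage only adds users; anyone who saw the feature at 1% still sees it at 50%, which keeps the experience consistent during a gradual rollout.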

Continuous Deployment

Factors to consider when deciding whether to pause continuous deployment

  • Build Health
  • Test Comprehensiveness
  • Test Reproducibility
  • Production Health
  • Schedule Permission
  • Oncall Schedule
  • Manual Stop
  • Push Conflicts
  • Intentional Delays
  • Resource Contention
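
One way to apply this list is as an automated gate that runs before every push. The sketch below assumes each factor has already been reduced to a boolean meaning "this factor permits deployment"; how each value is computed (querying the build server, the oncall calendar, and so on) is left out. `should_pause_deployment` and the factor names are hypothetical.

```python
def should_pause_deployment(checks):
    """Decide whether to pause, given factor name -> True if it permits deploying.

    Returns a (pause, blocking_factors) tuple so the pipeline can report why
    it stopped.
    """
    blocking = sorted(name for name, ok in checks.items() if not ok)
    return bool(blocking), blocking


if __name__ == "__main__":
    checks = {
        "build_health": True,
        "production_health": False,  # For example, an alert is firing.
        "schedule_permission": True,
        "manual_stop": True,         # True here means no manual stop is set.
    }
    pause, reasons = should_pause_deployment(checks)
    print(pause, reasons)
```

Returning the blocking factors, and not just a yes or no, keeps the gate debuggable: engineers can see at a glance which condition stopped the pipeline.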

Terms to Know

  • Server
  • Service
  • Machine
  • QPS
  • Traffic
  • Performant: A neologism from merging “performance” and “conformant”.
  • IaaS
  • SaaS
  • PaaS
  • Oversubscribed
  • Undersubscribed
  • Static Content
  • Dynamic Content
  • Database-Driven Dynamic Content
  • Control Panel
  • Main Database
  • Trend Server
  • Link Redirect Servers
  • Content Delivery Networks
  • Outage
  • Failure
  • Malfunction
  • MTBF
  • Innovate
  • Oncall
  • Soft launch
  • SRE
  • Stakeholders
  • Artifacts
  • Service Delivery Flow
  • Cycle Time
  • Deployment
  • Release Candidate
  • Release
  • Domain-Specific Language
  • Toil

Reference

  • The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 (Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan)
  • https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture
  • https://en.wikipedia.org/wiki/Distributed_computing
  • https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
  • Understanding Distributed Systems (Roberto Vitillo)
  • Foundations of Scalable Systems (Ian Gorton)
  • Distributed Systems: Concepts and Design, 5th ed. (George Coulouris, Jean Dollimore, Tim Kindberg, Gordon Blair; Pearson, 2011)
  • Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services (Brendan Burns)
  • The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (Martin L. Abbott, Michael T. Fisher, 2009)
  • Scalability Rules: 50 Principles for Scaling Web Sites (Martin L. Abbott, Michael T. Fisher, 2011)
  • Introduction to Distributed System Design