Distributed Systems - what should be considered

Introduction

Many organizations need to build large, complex systems that can handle massive amounts of data and traffic. Such systems are usually distributed: their components run on many networked machines that coordinate to behave as a single service, which lets them scale with load and keep running when individual machines fail. This post outlines what should be considered when designing and operating a distributed system, from basic requirements through design, operations, and delivery.

Basic Requirements

For a distributed system to be truly reliable, it must possess the following characteristics:

Fault-Tolerant: It can recover from component failures without performing incorrect actions.
Highly Available: It can restore operations, permitting it to resume providing services even when some components have failed.
Recoverable: Failed components can restart themselves and rejoin the system, after the cause of failure has been repaired.
Consistent: The system can coordinate actions by multiple components, often in the presence of concurrency and failure. This underlies the ability of a distributed system to act like a non-distributed system.
Scalable: It can operate correctly even as some aspect of the system is scaled to a larger size. For example, we might increase the size of the network on which the system is running. This increases the frequency of network outages and could degrade a “non-scalable” system. Similarly, we might increase the number of users or servers, or overall load on the system. In a scalable system, this should not have a significant effect.
Predictable Performance: The ability to provide desired responsiveness in a timely manner.
Secure: The system authenticates access to data and services.

Design in Distributed World

Composition

Typical Architecture

Load balancer with multiple backend replicas
Server with multiple backends
Server tree

Distributed State

The CAP Principle

Consistency
Availability
Partition Tolerance

The 8 Fallacies of Distributed Computing

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn’t change.
There is only one administrator.
Transport cost is zero.
The network is homogeneous.
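
Code that assumes the first two fallacies tends to make one network call and trust it to succeed promptly. The sketch below shows one common countermeasure: retrying with capped exponential backoff and jitter. The function name call_with_retries and its parameters are illustrative assumptions, not something taken from the references.

```python
import random
import time


def call_with_retries(operation, attempts=4, base_delay=0.1, max_delay=2.0,
                      sleep=time.sleep):
    """Call operation(); retry on network-style errors with capped backoff.

    `operation` is any zero-argument callable (hypothetical here) that may
    raise ConnectionError or TimeoutError when the network misbehaves.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Double the delay ceiling on each attempt, cap it, and pick a
            # random point below it so many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, ceiling))
```

Retries are only safe when the operation is idempotent; otherwise a request that succeeded but whose reply was lost may be applied twice.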

Design for Operations

Operational Requirements

Configuration
Startup and shutdown
Queue draining
Software upgrades
Backups and restores
Redundancy
Replicated databases
Hot swaps
Toggles for individual features
Graceful degradation
Access controls and rate limits
Data import controls
Monitoring
Auditing
Debug instrumentation
Exception collection
Documentation for Operations
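
As an illustration of the "rate limits" item above, here is a minimal token-bucket sketch. The class name TokenBucket, its parameters, and the injectable clock are assumptions made for the example; a production service would more likely rely on its load balancer or gateway for this.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock            # injectable so the limiter can be tested
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A caller would typically keep one bucket per client or API key and reject (or queue) a request whenever allow() returns False.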

Platform Selection

Platform Description

A platform may be described along three axes:

Level of service abstraction: IaaS, PaaS, SaaS
Type of machine: Physical, virtual, or process container
Level of resource sharing: Shared or private

Selection Strategies

Common Strategies

Default to Virtual
Make a Cost-Based Decision
Leverage Provider Expertise
Get Started Quickly
Implement Ephemeral Computing
Use the Cloud for Overflow Capacity
Leverage Superior Infrastructure
Develop an In-House Service Provider
Contract for an On-Premises, Externally Run Service
Implement a Bare Metal Cloud

Application Architectures

General architecture categories

Single-Machine Web Server
Two-Tier Web Service
Three-Tier Web Service
Four-Tier Web Service

Load Balancer Types

DNS Round Robin
Layer 3 and 4 Load Balancers
Layer 7 Load Balancer

Load Balancing Methods

Round Robin (RR)
Weighted RR
Least Loaded (LL)
Least Loaded with Slow Start
Utilization Limit
Latency
Cascade
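
The first three methods in the list above can be sketched in a few lines. The function names and the use of active connection counts as the measure of "load" are assumptions for illustration; real load balancers use their own load metrics and smoother weighting schemes.

```python
import itertools


def round_robin(backends):
    """Return an endless iterator that visits backends in a fixed rotation."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Rotate through backends, repeating each one `weight` times per cycle.

    `weights` maps backend name -> positive integer weight.
    """
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_loaded(active_connections):
    """Pick the backend with the fewest active connections.

    `active_connections` maps backend name -> current connection count.
    Ties go to the backend listed first.
    """
    return min(active_connections, key=active_connections.get)
```

For example, next() on weighted_round_robin({"big": 2, "small": 1}) yields "big", "big", "small" and then repeats. This naive expansion sends requests to a heavy backend in bursts, which is one reason production balancers interleave the rotation instead.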

Also consider

Reverse Proxy Service
Cloud-Scale Service
Message Bus Architectures
Service-Oriented Architecture

Design for Scaling

General Strategy

Identify Bottlenecks
Reengineer Components
Measure Results
Be Proactive

The AKF Scaling Cube

Developed by Abbott, Keeven, and Fisher.

x: Horizontal Duplication (also known as horizontal scaling or scaling out)
y: Functional or Service Splits
z: Lookup-Oriented Split
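
A z-axis split routes each request to a segment chosen by looking something up about the request, such as the customer ID. One simple way to do the lookup, sketched below, is to hash the key to a shard number. The function name shard_for and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index in the range 0..num_shards-1.

    A cryptographic digest is used instead of Python's built-in hash()
    because hash() of a str varies between processes, while routing must
    give the same answer on every machine.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing moves most keys to a different shard whenever num_shards changes; consistent hashing or an explicit lookup table avoids that at the cost of extra machinery.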

Other techniques to consider

Caching
Data Sharding
Threading
Queueing
CDN

Caching

Cache Effectiveness
Cache Placement
Cache Persistence
Cache Replacement Algorithms
Cache Entry Invalidation
Cache Size
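
Several of the topics above (size, replacement, invalidation, effectiveness) can be seen together in a small least-recently-used cache. This is a minimal in-process sketch; the class LRUCache and its method names are assumptions for the example, and LRU is only one of several possible replacement algorithms.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity       # cache size, in entries
        self._entries = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def get(self, key, default=None):
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        return default

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def invalidate(self, key):
        """Drop an entry whose underlying data has changed."""
        self._entries.pop(key, None)

    def hit_ratio(self):
        """Fraction of lookups served from the cache (cache effectiveness)."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Watching hit_ratio() while varying capacity is a quick way to judge whether a larger cache would pay for itself.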

Design for Resiliency

Key features for resiliency

Everything Malfunctions Eventually
Resiliency through Spare Capacity
Failure Domains
Software Failures
Physical Failures
Overload Failures
Human Error
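
"Resiliency through Spare Capacity" comes down to simple arithmetic: provision enough replicas for peak load, then add spares to cover the failures you want to survive. The sketch below works that through; the function names and the assumption that every replica handles the same load are simplifications for illustration.

```python
import math


def replicas_needed(peak_qps, qps_per_replica, spares=1):
    """Replicas to provision so peak load is still served with `spares` down."""
    needed_for_load = math.ceil(peak_qps / qps_per_replica)
    return needed_for_load + spares


def survives_failures(replicas, failed, peak_qps, qps_per_replica):
    """True if the remaining replicas can still carry the peak load."""
    return (replicas - failed) * qps_per_replica >= peak_qps
```

With a peak of 1000 QPS and 300 QPS per replica, four replicas carry the load, so replicas_needed(1000, 300) returns 5. That pool survives one failure but not two, and only if the failures are independent, which is why failure domains matter.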

Operations in Distributed World

Distributed Systems Operations

Defining SRE
Change versus Stability
Operations at Scale

Service Life Cycle

Service Launch
Emergency Tasks
Nonemergency Tasks
Upgrades
Decommissioning
Project Work

Organizing Strategy

Daily work categories

Emergency Issues
Normal Requests
Project Work

Team Member Day Types

Project-Focused Days
Oncall Days
Ticket Duty Days

DevOps

Three Ways of DevOps

Workflow
- Ensure each step is done in a repeatable way
- Never pass defects to the next step
- Ensure no local optimizations degrade global performance
- Increase the flow of work
Improve Feedback
- Understand and respond to all customers, internal and external
- Shorten feedback loops
- Amplify all feedback
- Embed knowledge where it is needed
Continual Experimentation and Learning
- Rituals are created that reward risk taking
- Management allocates time for projects that improve the system
- Faults are introduced into the system to increase resilience
- You try “crazy” or audacious things

Common Technical DevOps Practices

Same Development and Operations Toolchain
Consistent Software Development Life Cycle (SDLC)
Managed Configuration and Automation
Infrastructure as Code
Automated Provisioning and Deployment
Artifact-Scripted Database Changes
Automated Build and Release
Release Vehicle Packaging
Abstracted Administration

Service Delivery

Build-Phase Steps: Develop -> Commit -> Build -> Package -> Register

Benefits a delivery platform should provide

Confidence
Reduced Risk
Shorter Interval from Keyboard to Production
Less Wait Time
Less Rework
Improved Execution
A Culture of Continuous Improvement
Improved Job Satisfaction

Deployment-Phase Steps: Promote -> Install -> Configure

Upgrading

There are several ways to upgrade a running service:

Taking the Service Down for Upgrading
Rolling Upgrades
Canary
Phased Roll-outs
Proportional Shedding
Blue-Green Deployment
Toggling Features

Take feature toggling as an example.

Reasons to use flag flips

Rapid Development
Gradual Introduction of New Features
Finely Timed Release Dates
Dynamic Roll Backs
Bug Isolation
A/B Testing
One Percent Testing
Differentiated Services
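
Gradual introduction, one percent testing, and dynamic rollback all rely on the same mechanism: a flag that is on for a controllable fraction of users. A minimal sketch follows; the function name flag_enabled and the hashing scheme are assumptions for illustration, not the API of any particular feature-flag product.

```python
import hashlib


def flag_enabled(flag_name, user_id, rollout_percent):
    """Return True if `flag_name` is on for this user.

    Each (flag, user) pair hashes to a stable bucket in 0..99, so a user
    keeps the same answer across requests and servers, and raising
    rollout_percent only ever adds users.
    """
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Setting rollout_percent to 1 gives one percent testing, raising it step by step gives a gradual introduction, and setting it back to 0 rolls the feature back without a new deployment.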

Continuous Deployment

Factors to consider when deciding whether to pause continuous delivery

Build Health
Test Comprehensiveness
Test Reproducibility
Production Health
Schedule Permission
Oncall Schedule
Manual Stop
Push Conflicts
Intentional Delays
Resource Contention
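
In an automated pipeline, factors like these usually become explicit checks that run before each push. The sketch below models a handful of them; the PipelineState fields and the function deployment_blockers are hypothetical names chosen for the example, and a real pipeline would read these values from its build, monitoring, and scheduling systems.

```python
from dataclasses import dataclass


@dataclass
class PipelineState:
    build_green: bool            # build health
    tests_passed: bool           # test results for this release candidate
    production_healthy: bool     # production health
    inside_release_window: bool  # schedule permission
    oncall_available: bool       # oncall schedule
    manual_stop: bool = False    # someone has paused pushes by hand


def deployment_blockers(state):
    """Return the reasons to pause; an empty list means the push may proceed."""
    checks = [
        (state.build_green, "build health"),
        (state.tests_passed, "tests"),
        (state.production_healthy, "production health"),
        (state.inside_release_window, "schedule permission"),
        (state.oncall_available, "oncall schedule"),
        (not state.manual_stop, "manual stop"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Returning the list of reasons rather than a bare boolean makes it easier to tell why a push was held back.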

Terms to Know

Server
Service
Machine
QPS
Traffic
Performant: A neologism from merging “performance” and “conformant”.
IaaS
SaaS
PaaS
Oversubscribed
Undersubscribed
Static Content
Dynamic Content
Database-Driven Dynamic Content
Control Panel
Main Database
Trend Server
Link Redirect Servers
Content Delivery Networks
Outage
Failure
Malfunction
MTBF
Innovate
Oncall
Soft launch
SRE
Stakeholders
Artifacts
Service Delivery Flow
Cycle Time
Deployment
Release Candidate
Release
Domain-Specific Language
Toil

Reference

The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 (Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan)
https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture
https://en.wikipedia.org/wiki/Distributed_computing
https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
Understanding Distributed Systems (Roberto Vitillo)
Foundations of Scalable Systems (Ian Gorton)
Distributed Systems: Concepts and Design, 5th ed. (Pearson, 2011)
Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services (Brendan Burns)
The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise by Abbott and Fisher (2009)
Scalability Rules: 50 Principles for Scaling Web Sites, also by Abbott and Fisher (2011)
Introduction to Distributed System Design
