Systems Design in the Real World
Introduction
Many organizations now build large, complex systems that must handle massive amounts of data and traffic. Such systems are usually distributed: they spread work across many machines, which lets them scale out and keep serving under high load. This post outlines how distributed systems are designed and operated, the benefits they offer, and some of the challenges that come with building them.
Design in Distributed World
Composition
Typical Architecture
- Load balancer with multiple backend replicas
- Server with multiple backends
- Server tree
Distributed State
The CAP Principle
- Consistency
- Availability
- Partition Tolerance
Design for Operations
Operational Requirements
- Configuration
- Startup and shutdown
- Queue draining
- Software upgrades
- Backups and restores
- Redundancy
- Replicated databases
- Hot swaps
- Toggles for individual features
- Graceful degradation
- Access controls and rate limits
- Data import controls
- Monitoring
- Auditing
- Debug instrumentation
- Exception collection
- Documentation for Operations
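To make "Startup and shutdown" and "Queue draining" concrete, here is a minimal sketch of a worker that stops accepting new work on shutdown but finishes everything already queued. The `Worker` class, its method names, and the doubling "work" are hypothetical, chosen only for illustration.

```python
import queue
import threading


class Worker:
    """Processes jobs from a queue and drains it before shutting down."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.done = []  # results, kept only for illustration
        self._stopping = threading.Event()
        self._thread = threading.Thread(target=self._run)

    def start(self):
        self._thread.start()

    def submit(self, job):
        if self._stopping.is_set():
            raise RuntimeError("shutting down: not accepting new work")
        self.jobs.put(job)

    def _run(self):
        # Keep working until a stop is requested AND the queue is empty.
        while not (self._stopping.is_set() and self.jobs.empty()):
            try:
                job = self.jobs.get(timeout=0.05)
            except queue.Empty:
                continue
            self.done.append(job * 2)  # stand-in for real work

    def shutdown(self):
        self._stopping.set()  # 1. stop accepting new work
        self._thread.join()   # 2. wait for queued work to drain


worker = Worker()
worker.start()
for n in range(5):
    worker.submit(n)
worker.shutdown()
print(worker.done)  # [0, 2, 4, 6, 8]
```

A real service would typically also bound how long it is willing to wait for the drain, so that a stuck job cannot block a shutdown forever.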
Platform Selection
Platform Description
A platform may be described along three axes
- Level of service abstraction: IaaS, PaaS, SaaS
- Type of machine: Physical, virtual, or process container
- Level of resource sharing: Shared or private
Selection Strategies
Common Strategies
- Default to Virtual
- Make a Cost-Based Decision
- Leverage Provider Expertise
- Get Started Quickly
- Implement Ephemeral Computing
- Use the Cloud for Overflow Capacity
- Leverage Superior Infrastructure
- Develop an In-House Service Provider
- Contract for an On-Premises, Externally Run Service
- Implement a Bare Metal Cloud
Application Architectures
General architecture categories
- Single-Machine Web Server
- Two-Tier Web Service
- Three-Tier Web Service
- Four-Tier Web Service
Load Balancer Types
- DNS Round Robin
- Layer 3 and 4 Load Balancers
- Layer 7 Load Balancer
Load Balancing Methods
- Round Robin (RR)
- Weighted RR
- Least Loaded (LL)
- Least Loaded with Slow Start
- Utilization Limit
- Latency
- Cascade
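Three of the methods above (Round Robin, Weighted RR, and Least Loaded) can be sketched in a few lines. This is an illustrative sketch, not a production balancer: the backend names are made up, and "load" is assumed here to mean the number of active connections, which is only one possible measure.

```python
import itertools


def round_robin(backends):
    """Yield backends in a fixed rotation."""
    return itertools.cycle(backends)


def weighted_round_robin(weights):
    """Yield backends in rotation, repeating each one `weight` times."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    return itertools.cycle(expanded)


def least_loaded(active_connections):
    """Pick the backend with the fewest active connections (assumed load metric)."""
    return min(active_connections, key=active_connections.get)


rr = round_robin(["a", "b", "c"])
print([next(rr) for _ in range(4)])    # ['a', 'b', 'c', 'a']

wrr = weighted_round_robin({"big": 2, "small": 1})
print([next(wrr) for _ in range(6)])   # ['big', 'big', 'small', 'big', 'big', 'small']

print(least_loaded({"a": 12, "b": 3, "c": 7}))  # b
```

Round robin ignores backend state entirely, which keeps it simple but can overload a slow replica; least loaded reacts to state but depends on the load metric being accurate and fresh.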
Also consider
- Reverse Proxy Service
- Cloud-Scale Service
- Message Bus Architectures
- Service-Oriented Architecture
Design for Scaling
General Strategy
- Identify Bottlenecks
- Reengineer Components
- Measure Results
- Be Proactive
The AKF Scaling Cube
Developed by Abbott, Keeven, and Fisher, the cube describes three axes of scaling:
- x: Horizontal Duplication (also known as horizontal scaling or scaling out.)
- y: Functional or Service Splits
- z: Lookup-Oriented Split
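As one possible illustration of the z-axis, the sketch below routes each request to a shard based on a value looked up at request time, here a customer ID. The shard names and the `shard_for` helper are hypothetical, and hashing with a modulo is just one simple way to do the lookup.

```python
import hashlib

# Hypothetical shard names, for illustration only.
SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(customer_id: str) -> str:
    """Map a customer ID to a shard using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]


# The same customer always lands on the same shard.
print(shard_for("customer-42") == shard_for("customer-42"))  # True
```

A plain modulo remaps most keys whenever the number of shards changes, so systems that reshard often tend to use a lookup table or consistent hashing instead.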
Other techniques to consider
- Caching
- Data Sharding
- Threading
- Queueing
- CDN
Caching
- Cache Effectiveness
- Cache Placement
- Cache Persistence
- Cache Replacement Algorithms
- Cache Entry Invalidation
- Cache Size
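Several of these caching concerns (replacement algorithm, invalidation, size, and effectiveness) can be seen together in a small example. The sketch below assumes a least-recently-used (LRU) replacement policy, one common choice among many; the `LRUCache` class and its method names are illustrative.

```python
from collections import OrderedDict


class LRUCache:
    """A fixed-size cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity      # cache size
        self.entries = OrderedDict()  # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

    def invalidate(self, key):
        """Drop an entry whose source data has changed."""
        self.entries.pop(key, None)

    def hit_ratio(self):
        """Cache effectiveness: the fraction of lookups served from cache."""
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
print(cache.get("a"))     # 1 (hit; "a" becomes most recently used)
cache.put("c", 3)         # evicts "b", the least recently used entry
print(cache.get("b"))     # None (miss)
print(cache.hit_ratio())  # 0.5
```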
Design for Resiliency
Key considerations for resiliency
- Everything Malfunctions Eventually
- Resiliency through Spare Capacity
- Failure Domains
- Software Failures
- Physical Failures
- Overload Failures
- Human Error
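"Resiliency through Spare Capacity" is often expressed as N+M redundancy: N replicas carry the peak load and M spares absorb failures. The arithmetic can be sketched as follows; the function names and the traffic figures are assumptions made up for the example.

```python
import math


def replicas_needed(peak_qps, qps_per_replica, spares=1):
    """Return N + M: replicas required for peak load, plus spares."""
    base = math.ceil(peak_qps / qps_per_replica)  # N
    return base + spares                          # N + M


def utilization(peak_qps, qps_per_replica, replicas, failed=0):
    """Fraction of remaining capacity used at peak after `failed` replicas die."""
    healthy = replicas - failed
    if healthy <= 0:
        raise ValueError("no healthy replicas left")
    return peak_qps / (healthy * qps_per_replica)


total = replicas_needed(peak_qps=4500, qps_per_replica=1000)
print(total)                                   # 6 (N=5, plus 1 spare)
print(utilization(4500, 1000, total, failed=1))  # 0.9 (survives one failure)
print(utilization(4500, 1000, total, failed=2))  # 1.125 (overloaded)
```

With one spare the service rides out a single failure, but a second failure inside the same failure domain pushes utilization past 100%, which is why spares are usually spread across failure domains.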
Operations in Distributed World
Distributed Systems Operations
- Defining SRE
- Change versus Stability
- Operations at Scale
Service Life Cycle
- Service Launch
- Emergency Tasks
- Nonemergency Tasks
- Upgrades
- Decommissioning
- Project Work
Organizing Strategy
Daily work categories
- Emergency Issues
- Normal Requests
- Project Work
Team Member Day Types
- Project-Focused Days
- Oncall Days
- Ticket Duty Days
DevOps
Three Ways of DevOps
- Workflow
    - Ensure each step is done in a repeatable way
    - Never pass defects to the next step
    - Ensure no local optimizations degrade global performance
    - Increase the flow of work
- Improve Feedback
    - Understand and respond to all customers, internal and external
    - Shorten feedback loops
    - Amplify all feedback
    - Embed knowledge where it is needed
- Continual Experimentation and Learning
    - Rituals are created that reward risk taking
    - Management allocates time for projects that improve the system
    - Faults are introduced into the system to increase resilience
    - You try “crazy” or audacious things
Common Technical DevOps Practices
- Same Development and Operations Toolchain
- Consistent Software Development Life Cycle (SDLC)
- Managed Configuration and Automation
- Infrastructure as Code
- Automated Provisioning and Deployment
- Artifact-Scripted Database Changes
- Automated Build and Release
- Release Vehicle Packaging
- Abstracted Administration
Service Delivery
Build-Phase Steps: Develop -> Commit -> Build -> Package -> Register
A delivery platform should provide:
- Confidence
- Reduced Risk
- Shorter Interval from Keyboard to Production
- Less Wait Time
- Less Rework
- Improved Execution
- A Culture of Continuous Improvement
- Improved Job Satisfaction
Deployment-Phase Steps: Promote -> Install -> Configure
Upgrading
There are several ways to upgrade a service:
- Taking the Service Down for Upgrading
- Rolling Upgrades
- Canary
- Phased Roll-outs
- Proportional Shedding
- Blue-Green Deployment
- Toggling Features
Take feature toggles as an example.
Reasons to use flag flips:
- Rapid Development
- Gradual Introduction of New Features
- Finely Timed Release Dates
- Dynamic Roll Backs
- Bug Isolation
- A/B Testing
- One Percent Testing
- Differentiated Services
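A percentage-based flag is one way to support gradual introduction, one percent testing, and dynamic rollbacks from the list above. The sketch below is a minimal illustration: the `FLAGS` table, the flag names, and the `is_enabled` helper are all hypothetical, and a real system would load flag values from a configuration service rather than a constant.

```python
import hashlib

# Hypothetical flags mapped to the percentage of users who see the feature.
FLAGS = {
    "new_checkout": 1,   # one percent testing
    "dark_mode": 100,    # fully released
    "beta_search": 0,    # rolled back: off for everyone
}


def is_enabled(flag: str, user_id: str) -> bool:
    """Return True if `flag` is on for this user.

    Each (flag, user) pair hashes to a stable bucket from 0 to 99, so a user
    keeps the same experience as the rollout percentage grows.
    """
    percent = FLAGS.get(flag, 0)  # unknown flags default to off
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


print(is_enabled("dark_mode", "user-1"))    # True
print(is_enabled("beta_search", "user-1"))  # False
```

Rolling a feature back is then a data change (setting the percentage to 0) rather than a new deployment.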
Continuous Deployment
Factors to consider when deciding whether to pause continuous delivery:
- Build Health
- Test Comprehensiveness
- Test Reproducibility
- Production Health
- Schedule Permission
- Oncall Schedule
- Manual Stop
- Push Conflicts
- Intentional Delays
- Resource Contention
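Some of these factors can be checked automatically before each push. The sketch below models five of them (build health, production health, schedule permission, oncall schedule, and manual stop) as a simple gate; the `PipelineStatus` fields and the `reasons_to_pause` function are hypothetical names, and how each signal is actually gathered is left out.

```python
from dataclasses import dataclass


@dataclass
class PipelineStatus:
    """Hypothetical snapshot of the signals a deployment gate might check."""
    build_healthy: bool
    production_healthy: bool
    within_schedule: bool      # schedule permission, e.g. no change freeze
    oncall_available: bool
    manual_stop: bool = False


def reasons_to_pause(status: PipelineStatus) -> list:
    """Return every reason to pause; an empty list means it is safe to push."""
    checks = [
        (not status.build_healthy, "build is unhealthy"),
        (not status.production_healthy, "production is unhealthy"),
        (not status.within_schedule, "outside the permitted schedule"),
        (not status.oncall_available, "no oncall engineer available"),
        (status.manual_stop, "manual stop is set"),
    ]
    return [reason for failed, reason in checks if failed]


status = PipelineStatus(
    build_healthy=True,
    production_healthy=False,
    within_schedule=True,
    oncall_available=True,
)
print(reasons_to_pause(status))  # ['production is unhealthy']
```

Returning the full list of reasons, rather than a single yes or no, makes it easier to see why a pipeline is paused.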
Terms to Know
- Server
- Service
- Machine
- QPS
- Traffic
- Performant: A neologism from merging “performance” and “conformant”.
- IaaS
- SaaS
- PaaS
- Oversubscribed
- Undersubscribed
- Static Content
- Dynamic Content
- Database-Driven Dynamic Content
- Control Panel
- Main Database
- Trend Server
- Link Redirect Servers
- Content Delivery Networks
- Outage
- Failure
- Malfunction
- MTBF
- Innovate
- Oncall
- Soft launch
- SRE
- Stakeholders
- Artifacts
- Service Delivery Flow
- Cycle Time
- Deployment
- Release Candidate
- Release
- Domain-Specific Language
- Toil
Reference
- The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 (Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan)
- https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture
- https://en.wikipedia.org/wiki/Distributed_computing
- https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
- Understanding Distributed Systems (Roberto Vitillo)
- Foundations of Scalable Systems (Ian Gorton)
- Distributed Systems: Concepts and Design, 5th ed. (Pearson, 2011)
- Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services (Brendan Burns)
- The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (Abbott and Fisher, 2009)
- Scalability Rules: 50 Principles for Scaling Web Sites (Abbott and Fisher, 2011)
- https://github.com/ByteByteGoHq/system-design-101
- Licensed under CC BY-NC 4.0