The Site Reliability Workbook. Practical Ways to Implement SRE
- Pages:
- 512
- Available formats:
- ePub, Mobi
Ebook description: The Site Reliability Workbook. Practical Ways to Implement SRE
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
- How to run reliable services in environments you don’t completely control—like cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE—including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Selected bestsellers
-
Although PowerShell scripts provide a convenient method for automating tasks, using them proficiently can be challenging. This all-inclusive guide begins with the basics and covers advanced concepts, equipping you with tips to become an expert in PowerShell Core 7.3 scripting.
Mastering PowerShell Scripting. Automate repetitive tasks and simplify complex administrative tasks using PowerShell - Fifth Edition
-
Mastering Linux Administration will help you become a proficient sysadmin and quickly adapt to the challenges of modern server and cloud administration technologies.
Mastering Linux Administration. Take your sysadmin skills to the next level by configuring and maintaining Linux systems - Second Edition
-
Discover a proven method for learning programming in an accessible style. Ideal for enthusiasts, this book guides you from fundamentals to advanced concepts, enabling you to code confidently and build your tools and libraries using PowerShell 7.
PowerShell 7 Workshop. Learn how to program with PowerShell 7 on Windows, Linux, and the Raspberry Pi
-
With this new edition, get to grips with Linux kernel development on the long-term 6.1 (S)LTS kernel in a hands-on way with the help of brilliant code examples. Linux Kernel Programming 2E teaches you how to write high-quality kernel modules suitable for real-world products, following industry be...
Linux Kernel Programming. A comprehensive and practical guide to kernel internals, writing modules, and kernel synchronization - Second Edition
-
This practical guide enables you to implement DevOps best practices while building systems with automation and reusability in mind. You’ll learn the modern-day infrastructure design best practices needed to create an impact on data-persistent technologies.
DevOps for Databases. A practical guide to applying DevOps best practices to data-persistent technologies
-
Implementing CI/CD Using Azure Pipelines contains everything you need to automate your CI/CD pipelines using Microsoft Azure. You’ll learn how to efficiently manage your CI/CD pipelines, deploy your apps, and set up workflow pipelines on Azure DevOps portal.
Implementing CI/CD Using Azure Pipelines. Manage and automate the secure flexible deployment of applications using real-world use cases
-
Okta is one of the leading IAM platforms that consolidate identities for company tools. Okta Administration Up and Running is a comprehensive introduction for anyone new to Okta’s products, and aims to help you understand and implement Okta’s features for enhanced security in your a...
Okta Administration Up and Running. Drive operational excellence with IAM solutions for on-premises and cloud apps - Second Edition
-
Unlock the understanding of the Microsoft 365 identity platform and security technologies for the MS-102 exam. From Entra ID essentials to core Microsoft 365 Defender deployment and key governance concepts, gain practical insights for success.
Microsoft 365 Administrator MS-102 Exam Guide. Master the Microsoft 365 Identity and Security Platform and confidently pass the MS-102 exam
-
This comprehensive guidebook provides a detailed overview of 100 essential Linux commands that every system administrator should know. With clear explanations and practical examples, this book is an invaluable resource for improving your skills and expertise in Linux administration. From package ...
Essential Linux Commands. 100 Linux commands every system administrator should know
Betsy Beyer, Niall Richard Murphy, David K. Rensin - other books
-
Whether you're part of a small startup or a multinational corporation, this practical book shows data scientists, software and site reliability engineers, product managers, and business owners how to run and establish ML reliably, effectively, and accountably within your organization. You'll gain... (207.52 zł lowest price in the last 30 days)
207.47 zł
249.00 zł(-17%) -
Can a system be considered truly reliable if it isn't fundamentally secure? Or can it be considered secure if it's unreliable? Security is crucial to the design and operation of scalable systems in production, as it plays an important part in product quality, performance, and availability. In thi...
Building Secure and Reliable Systems. Best Practices for Designing, Implementing, and Maintaining Systems
(206.94 zł lowest price in the last 30 days) 206.89 zł
249.00 zł(-17%) -
If you want to understand the SRE philosophy, you are holding the right, if unconventional, book. It is a collection of the most interesting essays and articles written by the people responsible for SRE at Google. From these essays you will learn how engagement with the entire software lifecycle made possible...
Site Reliability Engineering. Jak Google zarządza systemami producyjnymi
Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy
(39.50 zł lowest price in the last 30 days) 39.50 zł
79.00 zł(-50%) -
The overwhelming majority of a software system’s lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems? In this collection of essays and articl...
Site Reliability Engineering. How Google Runs Production Systems
(173.23 zł lowest price in the last 30 days) 172.73 zł
219.00 zł(-21%) -
Run your entire corporate IT infrastructure in a cloud environment that you control completely—and do it inexpensively and securely with help from this hands-on book. All you need to get started is basic IT experience. You’ll learn how to use Amazon Web Services (AWS) to build a privat...
Building a Windows IT Infrastructure in the Cloud. Distributed Hosted Environments with AWS
(98.12 zł lowest price in the last 30 days) 97.92 zł
129.00 zł(-24%) -
What once seemed nearly impossible has turned into reality. The number of available Internet addresses is now nearly exhausted, due mostly to the explosion of commercial websites and entries from an expanding number of countries. This growing shortage has effectively put the Internet community--... (139.78 zł lowest price in the last 30 days)
139.58 zł
169.00 zł(-17%)
You can read the ebook "The Site Reliability Workbook. Practical Ways to Implement SRE" on:
-
Inkbook, Kindle, Pocketbook, Onyx Boox, and other e-readers
-
Windows, MacOS, and other systems
-
Windows, Android, iOS, and HarmonyOS systems
-
any devices and applications that support the PDF, EPub, and Mobi formats
Ebook details
- Ebook ISBN:
- 978-14-920-2945-8, 9781492029458
- Ebook release date:
- 2018-07-25. The ebook release date is often the day the title went on sale and may not match the release date of the print book. Additional information can be found in the free sample. If in doubt, contact us at sklep@ebookpoint.pl.
- Publication language:
- English
- ePub file size:
- 10.1MB
- Mobi file size:
- 23.8MB
Ebook table of contents
- Foreword I
- Foreword II
- Preface
- Conventions Used in This Book
- Using Code Examples
- O'Reilly Safari
- How to Contact Us
- Acknowledgments
- 1. How SRE Relates to DevOps
- Background on DevOps
- No More Silos
- Accidents Are Normal
- Change Should Be Gradual
- Tooling and Culture Are Interrelated
- Measurement Is Crucial
- Background on SRE
- Operations Is a Software Problem
- Manage by Service Level Objectives (SLOs)
- Work to Minimize Toil
- Automate This Year's Job Away
- Move Fast by Reducing the Cost of Failure
- Share Ownership with Developers
- Use the Same Tooling, Regardless of Function or Job Title
- Compare and Contrast
- Organizational Context and Fostering Successful Adoption
- Narrow, Rigid Incentives Narrow Your Success
- It's Better to Fix It Yourself; Don't Blame Someone Else
- Consider Reliability Work as a Specialized Role
- When Can Substitute for Whether
- Strive for Parity of Esteem: Career and Financial
- Conclusion
- I. Foundations
- 2. Implementing SLOs
- Why SREs Need SLOs
- Getting Started
- Reliability Targets and Error Budgets
- What to Measure: Using SLIs
- Types of components
- A Worked Example
- Moving from SLI Specification to SLI Implementation
- API and HTTP server availability and latency
- Pipeline freshness, coverage, and correctness
- Measuring the SLIs
- Load balancer metrics
- Calculating the SLIs
- Using the SLIs to Calculate Starter SLOs
- Choosing an Appropriate Time Window
- Getting Stakeholder Agreement
- Establishing an Error Budget Policy
- Documenting the SLO and Error Budget Policy
- Dashboards and Reports
- Continuous Improvement of SLO Targets
- Improving the Quality of Your SLO
- Decision Making Using SLOs and Error Budgets
- Advanced Topics
- Modeling User Journeys
- Grading Interaction Importance
- Modeling Dependencies
- Experimenting with Relaxing Your SLOs
- Conclusion
- 3. SLO Engineering Case Studies
- Evernote's SLO Story
- Why Did Evernote Adopt the SRE Model?
- Introduction of SLOs: A Journey in Progress
- Breaking Down the SLO Wall Between Customer and Cloud Provider
- Current State
- The Home Depot's SLO Story
- The SLO Culture Project
- Our First Set of SLOs
- Availability and latency for API calls
- Infrastructure utilization
- Traffic volume
- Latency
- Errors
- Tickets
- VALET
- Evangelizing SLOs
- Automating VALET Data Collection
- TPS Reports
- VALET service
- VALET Dashboard
- The Proliferation of SLOs
- Applying VALET to Batch Applications
- Using VALET in Testing
- Future Aspirations
- Summary
- Conclusion
- 4. Monitoring
- Desirable Features of a Monitoring Strategy
- Speed
- Calculations
- Interfaces
- Alerts
- Sources of Monitoring Data
- Examples
- Move information from logs to metrics
- Problem
- Proposed solution
- Outcome
- Improve both logs and metrics
- Problem
- Proposed solution
- Outcome
- Keep logs as the data source
- Problem
- Proposed solution
- Outcome
- Managing Your Monitoring System
- Treat Your Configuration as Code
- Encourage Consistency
- Prefer Loose Coupling
- Metrics with Purpose
- Intended Changes
- Dependencies
- Saturation
- Status of Served Traffic
- Implementing Purposeful Metrics
- Testing Alerting Logic
- Conclusion
- 5. Alerting on SLOs
- Alerting Considerations
- Ways to Alert on Significant Events
- 1: Target Error Rate SLO Threshold
- 2: Increased Alert Window
- 3: Incrementing Alert Duration
- 4: Alert on Burn Rate
- 5: Multiple Burn Rate Alerts
- 6: Multiwindow, Multi-Burn-Rate Alerts
- Low-Traffic Services and Error Budget Alerting
- Generating Artificial Traffic
- Combining Services
- Making Service and Infrastructure Changes
- Lowering the SLO or Increasing the Window
- Extreme Availability Goals
- Alerting at Scale
- Conclusion
- 6. Eliminating Toil
- What Is Toil?
- Measuring Toil
- Toil Taxonomy
- Business Processes
- Production Interrupts
- Release Shepherding
- Migrations
- Cost Engineering and Capacity Planning
- Troubleshooting for Opaque Architectures
- Toil Management Strategies
- Identify and Measure Toil
- Engineer Toil Out of the System
- Reject the Toil
- Use SLOs to Reduce Toil
- Start with Human-Backed Interfaces
- Provide Self-Service Methods
- Get Support from Management and Colleagues
- Promote Toil Reduction as a Feature
- Start Small and Then Improve
- Increase Uniformity
- Assess Risk Within Automation
- Automate Toil Response
- Use Open Source and Third-Party Tools
- Use Feedback to Improve
- Case Studies
- Case Study 1: Reducing Toil in the Datacenter with Automation
- Background
- Problem Statement
- What We Decided to Do
- Design First Effort: Saturn Line-Card Repair
- Implementation
- Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
- Implementation
- Lessons Learned
- UIs should not introduce overhead or complexity
- Don't rely on human expertise
- Design reusable components
- Don't overthink the problem
- Sometimes imperfect automation is good enough
- Repair automation is not fire and forget
- Build in risk assessment and defense in depth
- Get a failure budget and manager support
- Think holistically
- Case Study 2: Decommissioning Filer-Backed Home Directories
- Background
- Problem Statement
- What We Decided to Do
- Design and Implementation
- Key Components
- Moonwalk
- Moira Portal
- Archiving and migration automation
- Lessons Learned
- Challenge assumptions and retire expensive business processes
- Build self-service interfaces
- Start with human-backed interfaces
- Melt snowflakes
- Employ organizational nudges
- Conclusion
- 7. Simplicity
- Measuring Complexity
- Simplicity Is End-to-End, and SREs Are Good for That
- Case Study 1: End-to-End API Simplicity
- Background
- Lessons learned
- Case Study 2: Project Lifecycle Complexity
- Background
- What we decided to do
- Lessons learned
- Regaining Simplicity
- Case Study 3: Simplification of the Display Ads Spiderweb
- Background
- What we decided to do
- Lessons learned
- Case Study 4: Running Hundreds of Microservices on a Shared Platform
- Background
- What we decided to do
- Design
- Outcomes
- Lessons learned
- Case Study 5: pDNS No Longer Depends on Itself
- Background
- Problem statement
- What we decided to do
- Lessons learned
- Conclusion
- II. Practices
- 8. On-Call
- Recap of "Being On-Call" Chapter of First SRE Book
- Example On-Call Setups Within Google and Outside Google
- Google: Forming a New Team
- Initial scenario
- Training roadmap
- Afterword
- Evernote: Finding Our Feet in the Cloud
- Moving our on-prem infrastructure to the cloud
- Adjusting our on-call policies and processes
- Restructuring our monitoring and metrics
- Tracking our performance over time
- Engaging with CRE
- Sustaining a self-perpetuating cycle
- Practical Implementation Details
- Anatomy of Pager Load
- Scenario: A team in overload
- Pager load inputs
- Preexisting bugs
- New bugs
- Identification delay
- Mitigation delay
- Alerting
- Rigor of follow-up
- Data quality
- Vigilance
- On-Call Flexibility
- Scenario: A change in personal circumstances
- Automate on-call scheduling
- Plan for short-term swaps
- Plan for long-term breaks
- Plan for part-time work schedules
- On-Call Team Dynamics
- Scenario: A culture of "survive the week"
- Proposal one: Empower your ops engineers
- Proposal two: Improve team relations
- Conclusion
- 9. Incident Response
- Incident Management at Google
- Incident Command System
- Main Roles in Incident Response
- Case Studies
- Case Study 1: Software Bug - The Lights Are On but No One's (Google) Home
- Context
- Incident
- Review
- Case Study 2: Service Fault - Cache Me If You Can
- Context
- Incident
- Review
- What went well?
- What could have been handled better?
- Case Study 3: Power Outage - Lightning Never Strikes Twice - Until It Does
- Context
- Incident
- Review
- Case Study 4: Incident Response at PagerDuty
- Major incident response at PagerDuty
- Tools used for incident response
- Putting Best Practices into Practice
- Incident Response Training
- Prepare Beforehand
- Decide on a communication channel
- Keep your audience informed
- Prepare a list of contacts
- Establish criteria for an incident
- Drills
- Conclusion
- 10. Postmortem Culture: Learning from Failure
- Case Study
- Bad Postmortem
- Why Is This Postmortem Bad?
- Missing context
- Key details omitted
- Key action item characteristics missing
- Counterproductive finger pointing
- Animated language
- Missing ownership
- Limited audience
- Delayed publication
- Good Postmortem
- Why Is This Postmortem Better?
- Clarity
- Concrete action items
- Blamelessness
- Depth
- Promptness
- Conciseness
- Organizational Incentives
- Model and Enforce Blameless Behavior
- Use blameless language
- Include all incident participants in postmortem authoring
- Gather feedback
- Reward Postmortem Outcomes
- Reward action item closeout
- Reward positive organizational change
- Highlight improved reliability
- Hold up postmortem owners as leaders
- Gamification
- Share Postmortems Openly
- Share announcements across the organization
- Conduct cross-team reviews
- Hold training exercises
- Report incidents and outages weekly
- Respond to Postmortem Culture Failures
- Avoiding association
- Failing to reinforce the culture
- Lacking time to write postmortems
- Repeating incidents
- Tools and Templates
- Postmortem Templates
- Google's template
- Other industry templates
- Postmortem Tooling
- Postmortem creation
- Postmortem checklist
- Postmortem storage
- Postmortem follow-up
- Postmortem analysis
- Other industry tools
- Conclusion
- 11. Managing Load
- Google Cloud Load Balancing
- Anycast
- Stabilized anycast
- Maglev
- Global Software Load Balancer
- Google Front End
- GCLB: Low Latency
- GCLB: High Availability
- Case Study 1: Pokémon GO on GCLB
- Migrating to GCLB
- Resolving the issue
- Future-proofing
- Autoscaling
- Handling Unhealthy Machines
- Working with Stateful Systems
- Configuring Conservatively
- Setting Constraints
- Including Kill Switches and Manual Overrides
- Avoiding Overloading Backends
- Avoiding Traffic Imbalance
- Combining Strategies to Manage Load
- Case Study 2: When Load Shedding Attacks
- What was happening?
- What went wrong?
- Lessons learned
- Conclusion
- 12. Introducing Non-Abstract Large System Design
- What Is NALSD?
- Why Non-Abstract?
- AdWords Example
- Design Process
- Initial Requirements
- One Machine
- Calculations
- Evaluation
- Distributed System
- MapReduce
- Evaluation
- LogJoiner
- Calculations
- Sharded LogJoiner
- Evaluation
- Multidatacenter
- Calculations
- Evaluation
- Conclusion
- 13. Data Processing Pipelines
- Pipeline Applications
- Event Processing/Data Transformation to Order or Structure Data
- Data Analytics
- Machine Learning
- Pipeline Best Practices
- Define and Measure Service Level Objectives
- Data freshness
- Data correctness
- Data isolation/load balancing
- End-to-end measurement
- Plan for Dependency Failure
- Create and Maintain Pipeline Documentation
- System diagrams
- Process documentation
- Playbook entries
- Map Your Development Lifecycle
- Prototyping
- Testing with a 1% dry run
- Staging
- Canarying
- Performing a partial deployment
- Deploying to production
- Reduce Hotspotting and Workload Patterns
- Implement Autoscaling and Resource Planning
- Adhere to Access Control and Security Policies
- Plan Escalation Paths
- Pipeline Requirements and Design
- What Features Do You Need?
- Idempotent and Two-Phase Mutations
- Checkpointing
- Code Patterns
- Reusing code
- Using the microservice approach to creating pipelines
- Pipeline Production Readiness
- Pipeline maturity matrix
- Pipeline Failures: Prevention and Response
- Potential Failure Modes
- Delayed data
- Corrupt data
- Potential Causes
- Pipeline dependencies
- Pipeline application or configuration
- Unexpected resource growth
- Region-level outage
- Case Study: Spotify
- Event Delivery
- Event Delivery System Design and Architecture
- Data collection
- Extract Transform Load
- Data delivery
- Event Delivery System Operation
- Timeliness
- Skewness
- Completeness
- Customer Integration and Support
- Documentation
- System monitoring
- Capacity planning
- Development process
- Incident handling
- Summary
- Conclusion
- 14. Configuration Design and Best Practices
- What Is Configuration?
- Configuration and Reliability
- Separating Philosophy and Mechanics
- Configuration Philosophy
- Configuration Asks Users Questions
- Questions Should Be Close to User Goals
- Mandatory and Optional Questions
- Escaping Simplicity
- Mechanics of Configuration
- Separate Configuration and Resulting Data
- Importance of Tooling
- Semantic validation
- Configuration syntax
- Ownership and Change Tracking
- Safe Configuration Change Application
- Conclusion
- 15. Configuration Specifics
- Configuration-Induced Toil
- Reducing Configuration-Induced Toil
- Critical Properties and Pitfalls of Configuration Systems
- Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
- Pitfall 2: Designing Accidental or Ad Hoc Language Features
- Pitfall 3: Building Too Much Domain-Specific Optimization
- Pitfall 4: Interleaving Configuration Evaluation with Side Effects
- Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
- Integrating a Configuration Language
- Generating Config in Specific Formats
- Driving Multiple Applications
- Integrating an Existing Application: Kubernetes
- What Kubernetes Provides
- Example Kubernetes Config
- Integrating the Configuration Language
- Integrating Custom Applications (In-House Software)
- Effectively Operating a Configuration System
- Versioning
- Source Control
- Tooling
- Testing
- When to Evaluate Configuration
- Very Early: Checking in the JSON
- Pros
- Cons
- Middle of the Road: Evaluate at Build Time
- Pros
- Cons
- Late: Evaluate at Runtime
- Pros
- Cons
- Guarding Against Abusive Configuration
- Conclusion
- 16. Canarying Releases
- Release Engineering Principles
- Balancing Release Velocity and Reliability
- What Is Canarying?
- Release Engineering and Canarying
- Requirements of a Canary Process
- Our Example Setup
- A Roll Forward Deployment Versus a Simple Canary Deployment
- Canary Implementation
- Minimizing Risk to SLOs and the Error Budget
- Choosing a Canary Population and Duration
- Selecting and Evaluating Metrics
- Metrics Should Indicate Problems
- Metrics Should Be Representative and Attributable
- Before/After Evaluation Is Risky
- Use a Gradual Canary for Better Metric Selection
- Dependencies and Isolation
- Canarying in Noninteractive Systems
- Requirements on Monitoring Data
- Related Concepts
- Blue/Green Deployment
- Artificial Load Generation
- Traffic Teeing
- Conclusion
- III. Processes
- 17. Identifying and Recovering from Overload
- From Load to Overload
- Case Study 1: Work Overload When Half a Team Leaves
- Background
- Problem Statement
- What We Decided to Do
- Implementation
- Lessons Learned
- Case Study 2: Perceived Overload After Organizational and Workload Changes
- Background
- Problem Statement
- What We Decided to Do
- Implementation
- Short-term actions
- Mid-term actions
- Long-term actions
- Effects
- Lessons Learned
- Strategies for Mitigating Overload
- Recognizing the Symptoms of Overload
- Reducing Overload and Restoring Team Health
- Identify and alleviate psychosocial stressors
- Prioritize and triage within one quarter
- Protect yourself in the future
- Conclusion
- 18. SRE Engagement Model
- The Service Lifecycle
- Phase 1: Architecture and Design
- Phase 2: Active Development
- Phase 3: Limited Availability
- Phase 4: General Availability
- Phase 5: Deprecation
- Phase 6: Abandoned
- Phase 7: Unsupported
- Setting Up the Relationship
- Communicating Business and Production Priorities
- Identifying Risks
- Aligning Goals
- Setting Ground Rules
- Planning and Executing
- Sustaining an Effective Ongoing Relationship
- Investing Time in Working Better Together
- Maintaining an Open Line of Communication
- Performing Regular Service Reviews
- Reassessing When Ground Rules Start to Slip
- Adjusting Priorities According to Your SLOs and Error Budget
- Handling Mistakes Appropriately
- Sleep on it
- Meet in person (or as close to it as possible) to resolve issues
- Be positive
- Understand differences in communication
- Scaling SRE to Larger Environments
- Supporting Multiple Services with a Single SRE Team
- Structuring a Multiple SRE Team Environment
- Adapting SRE Team Structures to Changing Circumstances
- Running Cohesive Distributed SRE Teams
- Ending the Relationship
- Case Study 1: Ares
- Case Study 2: Data Analysis Pipeline
- The pivot
- Communication breakdown
- Decommission
- Conclusion
- 19. SRE: Reaching Beyond Your Walls
- Truths We Hold to Be Self-Evident
- Reliability Is the Most Important Feature
- Your Users, Not Your Monitoring, Decide Your Reliability
- If You Run a Platform, Then Reliability Is a Partnership
- Everything Important Eventually Becomes a Platform
- When Your Customers Have a Hard Time, You Have to Slow Down
- You Will Need to Practice SRE with Your Customers
- How to: SRE with Your Customers
- Step 1: SLOs and SLIs Are How You Speak
- Step 2: Audit the Monitoring and Build Shared Dashboards
- Step 3: Measure and Renegotiate
- Step 4: Design Reviews and Risk Analysis
- Step 5: Practice, Practice, Practice
- Be Thoughtful and Disciplined
- Conclusion
- 20. SRE Team Lifecycles
- SRE Practices Without SREs
- Starting an SRE Role
- Finding Your First SRE
- Placing Your First SRE
- Bootstrapping Your First SRE
- Distributed SREs
- Your First SRE Team
- Forming
- Creating a new team as part of a major project
- Assembling a horizontal SRE team
- Converting a team in place
- Storming
- Risks and mitigations
- New team as part of a major project
- Horizontal SRE team
- A team converted in place
- Norming
- Performing
- Partnering on architecture
- Self-regulating workload
- Making More SRE Teams
- Service Complexity
- Where to split
- Pitfalls
- SRE Rollout
- Geographical Splits
- Placement: How many time zones apart should the teams be?
- People and projects: Seeding the team
- Parity: Distributing Work Between Offices and Avoiding a Night Shift
- Placement: What about having three shifts?
- Timing: Should both halves of the team start at the same time?
- Finance: Travel budget
- Leadership: Joint ownership of a service
- Suggested Practices for Running Many Teams
- Mission Control
- SRE Exchange
- Training
- Horizontal Projects
- SRE Mobility
- Travel
- Launch Coordination Engineering Teams
- Production Excellence
- SRE Funding and Hiring
- Conclusion
- 21. Organizational Change Management in SRE
- SRE Embraces Change
- Introduction to Change Management
- Lewin's Three-Stage Model
- McKinsey's 7-S Model
- Kotter's Eight-Step Process for Leading Change
- The Prosci ADKAR Model
- Emotion-Based Models
- The Deming Cycle
- How These Theories Apply to SRE
- Case Study 1: Scaling Waze - From Ad Hoc to Planned Change
- Background
- The Messaging Queue: Replacing a System While Maintaining Reliability
- The Next Cycle of Change: Improving the Deployment Process
- Lessons Learned
- Case Study 2: Common Tooling Adoption in SRE
- Background
- Problem Statement
- What We Decided to Do
- Design
- Implementation: Monitoring
- Lessons Learned
- Conclusion
- Conclusion
- Onward
- The Future Belongs to the Past
- SRE + <Insert Other Discipline>
- Trickles, Streams, and Floods
- SRE Belongs to All of Us
- On Gratitude
- A. Example SLO Document
- Service Overview
- SLIs and SLOs
- Rationale
- Error Budget
- Clarifications and Caveats
- B. Example Error Budget Policy
- Service Overview
- Goals
- Non-Goals
- SLO Miss Policy
- Outage Policy
- Escalation Policy
- Background
- C. Results of Postmortem Analysis
- Index
O'Reilly Media - other books
-
Keeping up with the Python ecosystem can be daunting. Its developer tooling doesn't provide the out-of-the-box experience native to languages like Rust and Go. When it comes to long-term project maintenance or collaborating with others, every Python project faces the same problem: how to build re... (200.93 zł lowest price in the last 30 days)
200.88 zł
239.00 zł(-16%) -
Bringing a deep-learning project into production at scale is quite challenging. To successfully scale your project, a foundational understanding of full stack deep learning, including the knowledge that lies at the intersection of hardware, software, data, and algorithms, is required. This book il... (241.26 zł lowest price in the last 30 days)
241.21 zł
289.00 zł(-17%) -
Frontend developers have to consider many things: browser compatibility, usability, performance, scalability, SEO, and other best practices. But the most fundamental aspect of creating websites is one that often falls short: accessibility. Accessibility is the cornerstone of any website, and if a... (200.09 zł lowest price in the last 30 days)
199.59 zł
239.00 zł(-16%) -
In this insightful and comprehensive guide, Addy Osmani shares more than a decade of experience working on the Chrome team at Google, uncovering secrets to engineering effectiveness, efficiency, and team success. Engineers and engineering leaders looking to scale their effectiveness and drive tra...(114.88 zł najniższa cena z 30 dni)
114.38 zł
149.00 zł(-23%) -
Data modeling is the single most overlooked feature in Power BI Desktop, yet it's what sets Power BI apart from other tools on the market. This practical book serves as your fast-forward button for data modeling with Power BI, Analysis Services tabular, and SQL databases. It serves as a starting ...(198.88 zł najniższa cena z 30 dni)
198.78 zł
239.00 zł(-17%) -
C# is undeniably one of the most versatile programming languages available to engineers today. With this comprehensive guide, you'll learn just how powerful the combination of C# and .NET can be. Author Ian Griffiths guides you through C# 12.0 and .NET 8 fundamentals and techniques for building c...(240.92 zł najniższa cena z 30 dni)
240.72 zł
289.00 zł(-17%) -
Learn how to get started with Futures Thinking. With this practical guide, Phil Balagtas, founder of the Design Futures Initiative and the global Speculative Futures network, shows you how designers and futurists have made futures work at companies such as Atari, IBM, Apple, Disney, Autodesk, Luf...(148.00 zł najniższa cena z 30 dni)
147.90 zł
179.00 zł(-17%) -
Augmented Analytics isn't just another book on data and analytics; it's a holistic resource for reimagining the way your entire organization interacts with information to become insight-driven.Moving beyond traditional, limited ways of making sense of data, Augmented Analytics provides a dynamic,...(174.54 zł najniższa cena z 30 dni)
174.34 zł
219.00 zł(-20%) -
Learn how to prepare for—and pass—the Kubernetes and Cloud Native Associate (KCNA) certification exam. This practical guide serves as both a study guide and point of entry for practitioners looking to explore and adopt cloud native technologies. Adrián González Sánchez ...
Kubernetes and Cloud Native Associate (KCNA) Study Guide Kubernetes and Cloud Native Associate (KCNA) Study Guide
(169.14 zł najniższa cena z 30 dni)177.65 zł
199.00 zł(-11%) -
Python is an excellent way to get started in programming, and this clear, concise guide walks you through Python a step at a time—beginning with basic programming concepts before moving on to functions, data structures, and object-oriented design. This revised third edition reflects the gro...(140.14 zł najniższa cena z 30 dni)
139.94 zł
179.00 zł(-22%)
Customer ratings and reviews: The Site Reliability Workbook. Practical Ways to Implement SRE, by Betsy Beyer, Niall Richard Murphy, David K. Rensin (0). Reviews are verified against the order history of the account that posts the review. The reviewer may have received points for publishing the review, which entitle them to a discount under the Points Program.