Playbook in Practice

Real stories from Orcta Engineering. Learn from what went well and what didn’t.

Story One: The Minimum Lovable Product That Launched in Two Weeks

Context

Product requested a comprehensive user dashboard with fifteen distinct features. The initial timeline was set for one month of development work. The team needed to deliver value quickly while managing scope.

What We Did

  • Asked the fundamental question: what is the one thing users need most?
  • Shipped just the critical metrics view in the first week
  • Gathered real user feedback on actual usage patterns
  • Added three more features in week two based on observed behavior
  • Delivered a complete, usable product on schedule

Outcome

  • Users engaged heavily with the initial release
  • Data showed 60% of originally planned features were unnecessary
  • Saved three weeks of engineering time
  • Delivered higher-quality features informed by real usage

Key Lesson

Start with why and validate with real users before building everything. Assumptions about user needs are often wrong until tested.

Engineering Philosophy: Minimum Lovable Product — Build the smallest version users can love


Story Two: The Code Review That Prevented a Critical Bug

Context

A pull request introduced payment processing functionality. All tests passed. The implementation appeared sound on first inspection. The PR was ready for approval.

What Happened

During review, an engineer noticed missing error handling for network failures. They asked a simple question: what happens if the payment API times out?

The author realized users would be charged but their orders wouldn’t be recorded in our system: a critical data integrity issue that would have caused significant problems in production.

What We Did

  • Added retry logic with exponential backoff for transient failures
  • Implemented transaction rollback on payment confirmation failure
  • Created integration tests specifically for timeout scenarios
  • Documented the edge case for future reference
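The retry-with-exponential-backoff piece of the fix can be sketched roughly as follows. This is a minimal illustration, not our payment code: `TransientPaymentError` and `charge_fn` are hypothetical stand-ins for the real API client.

```python
import random
import time


class TransientPaymentError(Exception):
    """Illustrative exception for timeouts and other retryable failures."""


def charge_with_retry(charge_fn, max_attempts=4, base_delay=0.5):
    """Call charge_fn, retrying transient failures with exponential backoff.

    Delays grow as base_delay * 2**attempt, with random jitter so that
    many clients retrying at once don't hammer the API in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return charge_fn()
        except TransientPaymentError:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the failure to the caller
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

The caller still has to handle the final failure (that is where the transaction rollback from the second bullet comes in); retries only paper over transient faults, not permanent ones.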

Outcome

  • Caught a critical bug before it reached production
  • Prevented potential revenue loss and customer trust issues
  • Improved the payment system’s overall reliability
  • Created reusable patterns for similar integrations

Key Lesson

Code reviews aren’t about finding typos. They’re about protecting users and the business by thinking through edge cases together.

Engineering Playbook, Section 5: Code reviews are sacred — feedback must be kind, clear, and focused on improvement


Story Three: When We Ignored Refactor as You Go

Context

While building a new API endpoint, an engineer noticed duplicated authentication logic across five different controllers. The code worked, but the duplication was obvious.

What We Did Wrong

The team decided to ship quickly with a note: we’ll refactor later when we have time. The authentication code was copied one more time. Development continued.

Three months passed. A security vulnerability was discovered in the authentication logic. The fix needed to be applied in six different places across the codebase.

The Incident

Two instances of the duplicated code were missed during the fix. Production experienced a two-hour outage when those endpoints were exploited. Customer data was not compromised, but trust was shaken.

What We Should Have Done

Invested thirty minutes to extract the authentication logic into a reusable middleware component. Fixed it once, used it everywhere. The vulnerability would have required one change in one place.
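In outline, the "fix it once, use it everywhere" extraction might look like the sketch below. The names (`require_auth`, the token check) are hypothetical; the point is that every controller shares one implementation, so a security fix lands in exactly one place.

```python
import functools


def _token_is_valid(token):
    # Single, central implementation of the authentication check.
    # A security fix here applies to every protected endpoint at once.
    return token == "valid-token"  # placeholder check, for illustration only


def require_auth(handler):
    """Reusable middleware: authenticate before invoking the handler."""
    @functools.wraps(handler)
    def wrapper(request):
        if not _token_is_valid(request.get("token")):
            return {"status": 401, "body": "unauthorized"}
        return handler(request)
    return wrapper


@require_auth
def get_profile(request):
    # Controller code no longer duplicates the auth logic.
    return {"status": 200, "body": f"profile for {request['user']}"}
```

With this shape, the vulnerability fix in the story would have been one change to `_token_is_valid`, with no endpoints to hunt down and miss.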

Outcome

  • Two-hour production outage affecting all users
  • Emergency incident response requiring all-hands effort
  • Blameless postmortem conducted within 48 hours
  • New team agreement: refactor duplicated code immediately

Key Lesson

Later usually means never. Technical debt compounds. What takes thirty minutes today costs ten times that in three months, plus the cost of the incident.

Engineering Philosophy, Section 3: Refactor as you go — Don’t postpone improvements


Story Four: The Documentation That Unblocked Three Teams

Context

A new internal API for user permissions was built and deployed. No formal documentation was written. Information was shared through Slack messages and verbal explanations.

What Happened

Over the next month, three different teams needed to integrate with the permissions API. Each team reached out with similar questions about authentication, endpoint structure, and error handling.

The original author spent over six hours answering repetitive questions in Slack DMs and ad-hoc meetings. Integration took each team longer than necessary due to missing context.

What We Did to Fix It

Invested fifteen minutes writing a clear README with essential information:

  • What the API does and why it exists
  • How to authenticate and handle tokens
  • Three common use cases with code examples
  • Known limitations and error scenarios
  • Inline code comments for complex logic

Outcome

  • Repetitive questions stopped immediately
  • Teams integrated independently without blocking the author
  • README was referenced over forty times in two months
  • Onboarding new engineers to the system became trivial

Key Lesson

Fifteen minutes of documentation saves ten hours of interruptions. Documentation is generosity to your teammates and your future self.

Engineering Philosophy, Section 2: Document to scale — Documentation is an act of generosity


Story Five: The Postmortem That Made Us Better

Context

The production database ran out of connections during peak traffic. The site went down for forty-five minutes. Users couldn’t access the application. Support tickets flooded in.

What We Did

Held a blameless postmortem within forty-eight hours of resolution. The team applied the Five Whys technique to trace root causes:

Five Whys Analysis:

Why did the database run out of connections? Connection pool wasn’t sized correctly for peak load.

Why wasn’t it sized correctly? We didn’t load test before launch.

Why didn’t we load test? No documented process or checklist for pre-launch testing.

Why no process? Never formalized what everyone assumed was known.

Why wasn’t it formalized? Tribal knowledge instead of written procedures.

Actions Taken

  • Created comprehensive pre-launch checklist including load testing
  • Set up database connection monitoring with alerts
  • Documented runbook for connection pool issues
  • Made pre-launch checklist mandatory via PR template
  • Scheduled quarterly review of operational procedures
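One concrete input for a pre-launch checklist is a back-of-envelope pool size. A common heuristic, Little's law, estimates concurrent connections as peak request rate times average query time, plus headroom for bursts. This sketch and its numbers are illustrative assumptions, not figures from the incident:

```python
import math


def estimate_pool_size(peak_rps, avg_query_seconds, headroom=1.5):
    """Estimate DB connections needed at peak via Little's law.

    Concurrent queries ~= arrival rate * time each query holds a
    connection; the headroom multiplier covers bursts above the
    measured peak.
    """
    concurrent = peak_rps * avg_query_seconds
    return math.ceil(concurrent * headroom)


# Illustrative numbers: 400 req/s at peak, 20 ms average query time.
# 400 * 0.02 = 8 concurrent queries; * 1.5 headroom -> 12 connections.
print(estimate_pool_size(400, 0.020))
```

A load test should still confirm the estimate; the heuristic only tells you what order of magnitude to test around.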

Outcome

  • No similar incidents in the following six months
  • Pre-launch checklist caught two other potential issues
  • Team felt safe discussing mistakes without blame
  • Culture of learning from failure was strengthened

Key Lesson

Systems fail. Humans make mistakes. How we respond to failure defines our culture. Blameless analysis and systematic improvement matter more than perfect execution.

Engineering Philosophy, Section 3: Fail fast, learn faster — Mistakes are okay, cover them in postmortems


Story Six: When Ownership Meant Heroism

Context

An engineer deployed a new feature Friday evening. At eleven PM, they received a page about a bug in production. The feature had an edge case that wasn’t caught in testing.

What Happened

The engineer felt personally responsible. They believed ownership meant solving it alone. They stayed up until two AM debugging and deploying a fix. The issue was resolved but the engineer was exhausted.

The following week, they felt burned out. Work-life balance suffered. The incident response, while successful, wasn’t sustainable.

What We Should Have Done

  • The on-call engineer should have handled the initial response
  • If the feature author wanted to help, pair with on-call instead of solo work
  • For critical issues, wake the tech lead for support and guidance
  • Follow established incident response procedures
  • No one should feel obligated to sacrifice sleep

What We Changed

  • Clarified that ownership doesn’t mean martyrdom
  • Set clear on-call expectations and rotation schedules
  • Added retro question: did anyone feel unsupported this week?
  • Emphasized collaborative incident response in documentation
  • Leadership modeled asking for help publicly

Outcome

  • Better work-life balance across the team
  • More effective collaborative incident response
  • Reduced individual stress and burnout
  • No one felt guilty for asking for help

Key Lesson

Ownership means responsibility, not isolation. We own problems collectively, not individually. Sustainable engineering requires sustainable practices.

Engineering Philosophy, Section 2: Ownership mentality — But we own problems collectively


How to Use These Stories

In Code Reviews

Reference relevant stories when providing feedback. “This reminds me of Story Two—can we add error handling here?”

In Standups

Connect current work to past lessons. “Feels like Story Three—should we refactor now before it spreads?”

In Retrospectives

Use stories to frame discussions. “This situation is similar to Story Four—let’s document this so it doesn’t happen again.”

In Onboarding

Share stories with new engineers to explain why the playbook exists and how principles apply in practice.

Add Your Own Stories

Saw something that reinforces or challenges our principles? Share it. These stories are our institutional knowledge.

  1. Write it up using the format: Context, What Happened, Outcome, Key Lesson
  2. Post in the engineering Slack channel for discussion
  3. Submit a pull request to add it to this document
  4. Share the lesson in the next team meeting