Managing technical quality in a codebase

Will Larson wrote a thoughtful post about managing technical quality in a codebase. The post walks through seven approaches to managing technical quality, ranging from fixing the most critical thing in your code right now to planning & executing an organization-wide quality program with program sponsors, metrics, tools, dashboards, and reviews.

The underlying premise, which I agree with, is that from engineering leadership's perspective it's rarely helpful to view low technical quality as some kind of crisis. This paragraph summarizes it quite nicely:

In most cases, low technical quality isn't a crisis; it's the expected, normal state. Engineers generally make reasonable quality decisions when they make them, and successful companies raise their quality bar over time as they scale, pivot, or shift up-market towards enterprise users. At a well-run and successful company, most of your previous technical decisions won't meet your current quality threshold. Rather than a failure, closing the gap between your current and target technical quality is a routine, essential part of effective engineering leadership.

Hot Spots

Before jumping to large, expensive process improvements, try picking the lowest-hanging fruit first. Give direct feedback, remove that one flaky test, fix that one bug.

Best Practices

When there are too many hot spots to fix one by one, roll out process improvements: planning processes, documentation, most of the practices from Accelerate. Roll them out one at a time, testing & tweaking your approach along the way.

Leverage Points

These are places where "extra investment preserves quality over time, both by preventing gross quality failures and reducing the cost of future quality investments".

  • Interfaces: invest in decoupling your components from each other. "Expose all the underlying essential complexity and none of the underlying accidental complexity". Exercise a mock implementation with several different clients.
  • Stateful systems: exercise failure modes, establish performance benchmarks.
  • Data models: prevent the expression of invalid state, enable evolution over time. (A small sketch of this idea follows the list.)
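
To make the data-model point concrete, here is a minimal TypeScript sketch - my own illustration, not from Larson's post - of preventing invalid state by making it unrepresentable in the type system:

```ts
// A naive model: independent flags can drift into impossible combinations,
// e.g. { connected: true, connecting: true, error: "timeout" }.
type NaiveConnection = {
  connected: boolean;
  connecting: boolean;
  error?: string;
};

// A discriminated union admits only the states that actually exist, so
// invalid combinations fail to compile instead of failing at runtime.
type Connection =
  | { state: "disconnected" }
  | { state: "connecting"; startedAt: Date }
  | { state: "connected"; sessionId: string }
  | { state: "failed"; error: string };

function describeConnection(conn: Connection): string {
  // Exhaustive switch: adding a new state forces every call site to handle
  // it, which is what keeps the model cheap to evolve over time.
  switch (conn.state) {
    case "disconnected":
      return "not connected";
    case "connecting":
      return `connecting since ${conn.startedAt.toISOString()}`;
    case "connected":
      return `connected (session ${conn.sessionId})`;
    case "failed":
      return `failed: ${conn.error}`;
  }
}
```

The naive version compiles happily with contradictory flags; the union version makes those states impossible to write down in the first place.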

Technical Vectors

How does your organization align its technical direction? A centralized architect role is one way, but it quickly becomes a bottleneck (and potentially disconnected from reality). What if everyone makes their own decisions? "An organization that allows any tool is an organization with uniformly unsupported tooling".

Some tools for aligning technical vectors:

  • Direct feedback. Feedback is quick & cheap, and a good way to sync contexts. Prioritize quick conversations over large process changes.
  • Articulate your vision & strategies. You cannot align your team to something you haven't articulated clearly enough to write down.
  • Encapsulate your approach in your workflows and tooling. Tool automation can effectively enforce technical and process decisions (see the sketch after this list).
  • Train new members during onboarding. It's easier to instill habits during onboarding than to change them years later.
  • Use Conway's Law. Structure your organization so that its communication patterns produce the software quality you want.
  • Curate technology change: architecture reviews, investment strategies, a process for adopting new tools.
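
As an example of encapsulating decisions in tooling, here's a small CI check - my own sketch, not from the post - that fails the build whenever code imports a module the team has agreed to retire. The module name and directory layout are hypothetical:

```ts
// check-banned-imports.ts - run in CI, e.g. `npx ts-node check-banned-imports.ts`.
import { readFileSync, readdirSync, statSync } from "fs";
import { join } from "path";

// Hypothetical module the team has decided to migrate away from.
const BANNED_MODULES = ["legacy-http-client"];

// Recursively yield all .ts files under a directory.
function* sourceFiles(dir: string): Generator<string> {
  for (const entry of readdirSync(dir)) {
    const path = join(dir, entry);
    if (statSync(path).isDirectory()) {
      yield* sourceFiles(path);
    } else if (path.endsWith(".ts")) {
      yield path;
    }
  }
}

let violations = 0;
for (const file of sourceFiles("src")) {
  const source = readFileSync(file, "utf8");
  for (const mod of BANNED_MODULES) {
    if (source.includes(`from "${mod}"`)) {
      console.error(`${file}: imports banned module "${mod}"`);
      violations += 1;
    }
  }
}
process.exit(violations > 0 ? 1 : 0);
```

A real setup would use a proper import parser or an ESLint rule like no-restricted-imports instead of string matching, but the principle is the same: the decision lives in the pipeline, not in people's memories.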

Measure Technical Quality

If there are still too many hot spots to fix, the leverage points have been exhausted, all technical vectors point in the right direction, and technical quality is still subpar, it's time to bring out the bigger guns. However, to know where to shoot, you need to define quality in concrete terms. It's time to measure.

The examples provided in the post are the usual suspects: test coverage, code churn rates, public interface sizes, various runtime performance metrics, etc. Every codebase may benefit from its own set of metrics, and every codebase may suffer immensely if the wrong set of metrics is enforced. I've personally seen team productivity destroyed by a randomly selected target like "> 95% test coverage for each function" applied to a UI layer that was extremely expensive to test due to limitations of the underlying UI framework. The exact same target would have been extremely effective in the data model layer of the same project. So involve your team here, and don't enforce metrics out of context.
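
One way to keep such a metric context-sensitive is to set different bars for different layers. Here's a sketch using Jest's per-path coverage thresholds; the directory names and numbers are made up for illustration:

```ts
// jest.config.ts - different coverage bars for different layers of the codebase.
export default {
  collectCoverage: true,
  coverageThreshold: {
    // Strict where tests are cheap and valuable: the data model layer.
    "./src/model/": { functions: 95, lines: 95 },
    // Relaxed where the UI framework makes tests expensive to write.
    "./src/ui/": { functions: 40, lines: 40 },
    // A modest global floor for everything else.
    global: { functions: 70, lines: 70 },
  },
};
```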

Quality Team

Going even further, a separate, centralized quality team may be established, dedicated to creating quality in the codebase. Personally, the idea that quality is created separately from product development sounds quite foreign to me, but I'll chalk that up to the fact that I've spent limited time in engineering organizations large enough to potentially benefit from such a structure.

Will Larson suggests starting with a team of three to six people, roughly one quality team engineer per ≈15 product development engineers.

The quality team's main responsibility would be to provide developer tooling that makes the quality metrics of every project visible. The adoption and usability of this tooling should be prioritized, since both its positive and negative effects ripple across the whole organization.
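
To make that a bit more tangible, here's a minimal sketch of what such tooling might look like - entirely my own illustration, with hypothetical project names and target; the coverage-summary.json shape is what Istanbul's json-summary reporter emits:

```ts
// quality-report.ts - aggregate one quality metric across projects and
// flag the ones below target. A real tool would track many metrics and
// feed a dashboard instead of printing to the console.
import { readFileSync } from "fs";

type CoverageSummary = { total: { lines: { pct: number } } };

const TARGET_PCT = 70; // hypothetical organization-wide floor
const projects = ["billing", "search", "ui"]; // hypothetical project names

for (const project of projects) {
  const summary: CoverageSummary = JSON.parse(
    readFileSync(`./${project}/coverage/coverage-summary.json`, "utf8")
  );
  const pct = summary.total.lines.pct;
  const flag = pct < TARGET_PCT ? "BELOW TARGET" : "ok";
  console.log(`${project.padEnd(10)} lines covered: ${pct.toFixed(1)}%  ${flag}`);
}
```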

Will also recognizes the fundamental tension between centralized quality teams and the teams they support. It's not that far from the natural tension between product development and traditional operations (launching new features all the time vs. minimizing the outage risk that every change carries). The tension between standardization and exploration is nicely covered in another post from Will, Magnitudes of exploration.

Quality Program

Finally, we move into the realm of operating organizational programs. To expand the impact of the quality team beyond developer tooling, the team leads an initiative to change the organization's behavior in order to reach the target quality level.

  • Find a powerful sponsor to advocate for the program
  • Automate the collection of quality metrics
  • Identify specific goals for each team affected by the program
  • Provide tools, examples & support to each team throughout the program
  • Provide dashboards for each team to see where they are and where they need to go next
  • Review overall progress with the sponsor to resolve prioritization conflicts between team and program goals

Summary

Technical quality is the output of a complex system with many different inputs:

  • Engineers and their skills & motivations. How well do you succeed in hiring, retaining, and growing talent?
  • Choices of programming languages & tools. Do you let your company chase every shiny new thing, or do you have a well-defined technology strategy that lets you invest deeply in chosen technologies, avoid fads, and still benefit from major innovations in the language & tooling space?
  • Best practices such as CI/CD, trunk-based development, small batches. Do you invest enough time & resources to get quick build times, lower MTTR, and fewer merge conflicts?
  • Quality of communication of vision and strategy. Do your people have enough context to make effective decisions on their own?
  • Definition, collection, and monitoring of quality metrics. Do you know your quality metrics before a pull request is merged? After merge? In production?
  • Clarity of roles and organizational structure. Does everyone know & accept their responsibilities? Are there gaps in your responsibility graph that leave some things owned by no one? Who reacts to a production outage of service X at time Y? Who responds to a customer's bug report in project X, and by when? Who is responsible for developer tooling? For release management?

Sure, each individual engineer has a major impact on quality - that's why you need to succeed in the first item. But if you ignore all the other items, you will waste the talent you have. You improve your technical quality by improving your engineering leadership.

The same could be said for data security and data privacy, by the way. You can have the most privacy-minded developers in the world, but if data privacy is continuously deprioritized by the rest of the organization, their efforts won't go far. The results can be catastrophic for both the company and society at large.