Designing at Scale:

Challenges, Pitfalls, and Lessons Learned

Maurício Aniche

CTO at Alura

Maurício Aniche.

CTO at Alura. Ex-{Uber, Adyen, Locaweb}. Former assistant professor in software engineering at TU Delft.

What do I mean by scale?

  • Traffic and uptime expectations
  • Large volumes of data and users
  • Many teams and many services
  • Security, regulatory, and compliance pressure
  • Mix of old and new systems all working together

We love software design

and that is why we tend to think this is the most important part of software.

The uncomfortable truth:

at scale, the biggest challenges live in architecture, infrastructure, and data.

Wait, but isn't the architecture implemented by the code?

Yes, but changing code is "easy"; defining a strategy that doesn't break production is the hard part.

Every architectural decision expires.

The real question is how and when it will age.

Story: caching migration at Uber

A story about "just" moving to a different Redis instance.

Safe architectural refactoring.

  1. Feature flags
  2. Shadow traffic
  3. Dual reads and writes
  4. Rolling upgrades
  5. Backward compatibility
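A minimal sketch of how feature flags, dual writes, and shadow reads can combine during a cache migration. It is an illustration under stated assumptions, not the migration from the story: the two caches are plain dictionaries standing in for the old and new instances, and `Flags` and `MigratingCache` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Flags:
    """Hypothetical flags; a real system would read them from a flag service."""
    dual_write: bool = False      # also write to the non-serving cache
    shadow_read: bool = False     # also read the non-serving cache and compare
    serve_from_new: bool = False  # final cut-over to the new cache


@dataclass
class MigratingCache:
    old: dict                     # stand-in for the old cache instance
    new: dict                     # stand-in for the new cache instance
    flags: Flags
    mismatches: list = field(default_factory=list)

    def _primary(self):
        return self.new if self.flags.serve_from_new else self.old

    def _secondary(self):
        return self.old if self.flags.serve_from_new else self.new

    def set(self, key, value):
        self._primary()[key] = value
        if self.flags.dual_write:
            # Keep the other side warm so the cut-over (or a rollback) is safe.
            self._secondary()[key] = value

    def get(self, key):
        value = self._primary().get(key)
        if self.flags.shadow_read:
            # Shadow traffic: exercise the other cache, but never serve from it.
            if self._secondary().get(key) != value:
                self.mismatches.append(key)
        return value
```

Each flag can be flipped, and flipped back, independently: first `dual_write`, then `shadow_read` until the mismatch count is acceptable, and only then `serve_from_new`.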

Wait, there's more.

  • Schema evolution and data migration plans
  • Backfills and reprocessing
  • Sunsetting old paths and deprecation windows
  • Monitoring and observability
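As a sketch of what a backfill might look like, the example below migrates rows in small, resumable batches and skips rows that are already migrated, so a rerun is harmless. The table is a hypothetical list of dictionaries, and the column names (`full_name`, `first_name`, `last_name`) are assumptions made for illustration.

```python
def needs_backfill(row):
    # A row counts as migrated once the new column is populated.
    return row.get("first_name") is None


def backfill_batch(rows, start, batch_size):
    """Migrate rows[start:start + batch_size] in place; return the next checkpoint."""
    end = min(start + batch_size, len(rows))
    for row in rows[start:end]:
        if needs_backfill(row):  # idempotent: rows already done are left alone
            first, _, last = row["full_name"].partition(" ")
            row["first_name"], row["last_name"] = first, last
    return end


def run_backfill(rows, batch_size=2, checkpoint=0):
    """Resume from `checkpoint` and process the remaining rows batch by batch."""
    while checkpoint < len(rows):
        checkpoint = backfill_batch(rows, checkpoint, batch_size)
        # A real job would persist the checkpoint and emit progress metrics here.
    return checkpoint
```

Small batches keep the load on production bounded, and the checkpoint lets the job stop and resume without starting over.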

Operational metrics help identify what's not working from an architectural point of view.

Even more so than any code metrics.

Socio-technical design matters.

  • Team boundaries shape architecture and may also need refactoring
  • Ownership determines how fast change can happen
  • Decision-making processes can make changes fast or slow
  • Migrations can fail organizationally before they fail technically

Let's talk code now

Good large-scale code design is not about elegance; it is about being "good enough" to be changed under new information and business pressure.

If "good enough" is the goal,

we should optimize for simplicity!

OO is great. Really?

OO is powerful when you truly need substitution, rich behavior, extension points, and domain flexibility.

Code challenges in large-scale software systems

  • Orchestrating work
  • Enforcing workflow
  • Moving data around
  • Calling services
  • Handling errors
  • Managing concurrency
  • Authentication and authorization
  • Code ownership

Flexibility is often overrated.

Most systems need clarity and safe change long before they need speculative extension points.
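To make the contrast concrete, here is a hypothetical fee calculation written twice: once with a speculative extension point that has a single implementation, and once as a plain function. All names and the 2.5% rate are invented for illustration.

```python
from abc import ABC, abstractmethod


# Speculative version: an interface and a factory for one implementation.
class FeePolicy(ABC):
    @abstractmethod
    def fee(self, amount_cents: int) -> int: ...


class FlatPercentageFee(FeePolicy):
    def __init__(self, basis_points: int):
        self.basis_points = basis_points

    def fee(self, amount_cents: int) -> int:
        return amount_cents * self.basis_points // 10_000


class FeePolicyFactory:
    @staticmethod
    def create(kind: str) -> FeePolicy:
        if kind == "flat":
            return FlatPercentageFee(250)
        raise ValueError(f"unknown fee policy: {kind}")


# Simple version: the same behavior, readable in one glance.
def fee(amount_cents: int, basis_points: int = 250) -> int:
    return amount_cents * basis_points // 10_000
```

Both versions return the same number. The second is easier to read, test, and change, and the interface can still be introduced on the day a second policy actually shows up.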

This is not permission to write sloppy code.

Readable, testable, flexible code still matters, but in the right place, and as a means to an end.

Story: bin ranges at Adyen

When the lack of a good code abstraction can cost you time and money.

Let’s get concrete:

a more practical design philosophy

Accept that "good enough" is good enough.

No code survives the test of time; overengineer only when you have a concrete need.

Architecture, data, and infrastructure as first-class.

They are where the refactoring and migration costs concentrate.

Observability and progressive rollout as the true safety net.

You can never land the airplane. Tests aren't enough.
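One way to picture a progressive rollout guarded by observability is the sketch below: users are assigned to a stable bucket, and exposure grows only while the observed error rate stays healthy. The function names, the doubling policy, and the 1% error threshold are assumptions chosen for illustration.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets.

    The same user always lands in the same bucket, so raising `percent`
    only adds users and never flips existing ones back and forth.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def next_percent(current: int, errors: int, requests: int,
                 max_error_rate: float = 0.01) -> int:
    """Double exposure while the error rate is healthy; roll back to 0 otherwise."""
    if requests == 0:
        return current  # no signal yet, so hold the current exposure
    if errors / requests > max_error_rate:
        return 0        # production says no: roll back
    return min(100, max(1, current * 2))
```

The decision to expand or roll back comes from production metrics rather than from the test suite, which is the sense in which observability acts as the safety net.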

Constant refactoring as the natural course of things.

"I respect the past, but what got us here won't get us there."

What’s in it for you, researchers?

  1. Go beyond code refactoring
  2. Include data, infrastructure, operations, and teams in the picture
  3. Help make change safer, cheaper, and more observable
  4. Bring research closer to production reality

A summary

  • Scale means traffic, data, uptime, many teams and services.
  • The hardest problems are often about architecture, infrastructure, and data.
  • The only certainty is that today's decisions will age, and refactoring will eventually be necessary.
  • Code design is still important, but simplicity is king.
  • The goal is not timelessness, but the ability to keep changing.

Stay in touch

Maurício Aniche

CTO at Alura

mauricio.aniche@alura.com.br

Twitter: @mauricioaniche