Designing at Scale:

Challenges, Pitfalls, and Lessons Learned

Maurício Aniche

CTO at Alura

Maurício Aniche.

CTO at Alura. Ex-{Uber, Adyen, Locaweb}. Former assistant professor in software engineering at TU Delft.

What do I mean by scale?

Traffic and uptime expectations
Large volumes of data and users
Many teams and many services
Security, regulatory, and compliance pressure
Mix of old and new systems all working together

We love software design

and that is why we tend to think this is the most important part of software.

The uncomfortable truth:

at scale, the biggest challenges live in architecture, infrastructure, and data.

Wait, but isn't the architecture implemented by the code?

Yes, but changing code is "easy"; defining the strategy so that you don't break production is the hard part.

Every architectural decision expires.

The real question is how and when it will age.

Story: caching migration at Uber

A story about "just" moving to a different Redis instance.

Safe architectural refactoring.

Feature flags
Shadow traffic
Dual reads and writes
Rolling upgrades
Backward compatibility
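A minimal sketch of how a few of these techniques can combine in a cache migration. It is not the actual Uber implementation: `InMemoryCache`, `MigratingCache`, the `read_from_new` flag, and the `mismatches` counter are all hypothetical names, and the in-memory dictionary stands in for a real cache client.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCache:
    """Hypothetical stand-in for a real cache client."""
    data: dict = field(default_factory=dict)

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


@dataclass
class MigratingCache:
    """Writes to both caches; reads from whichever one the flag selects."""
    old: InMemoryCache
    new: InMemoryCache
    read_from_new: bool = False  # feature flag, meant to be flipped at runtime
    mismatches: int = 0          # counter you would export as a metric

    def set(self, key, value):
        # Dual write: the old cache stays authoritative until cut-over.
        self.old.set(key, value)
        try:
            self.new.set(key, value)
        except Exception:
            pass  # a failure in the new path must not break production traffic

    def get(self, key):
        old_value = self.old.get(key)
        try:
            new_value = self.new.get(key)  # shadow read against the new cache
        except Exception:
            new_value = None  # treat a failing new cache as a miss
        if new_value != old_value:
            # Record the difference instead of failing the request.
            self.mismatches += 1
        return new_value if self.read_from_new else old_value
```

While `read_from_new` is off, callers keep getting answers from the old cache and `mismatches` shows how far the new one still diverges. Flipping the flag is the cut-over, and flipping it back is the rollback.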

Wait, there's more.

Schema evolution and data migration plans
Backfills and reprocessing
Sunsetting old paths and deprecation windows
Monitoring and observability
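A small sketch of schema evolution plus a backfill, under assumed names. The `first_name`, `last_name`, and `full_name` fields and both functions are hypothetical, and plain dictionaries stand in for database rows.

```python
def read_full_name(record: dict) -> str:
    """Tolerant reader: works for rows written before and after the change."""
    if "full_name" in record:
        return record["full_name"]
    return f'{record["first_name"]} {record["last_name"]}'


def backfill(records: list[dict], batch_size: int = 2) -> int:
    """Fills `full_name` in small batches; safe to re-run after a crash."""
    migrated = 0
    for start in range(0, len(records), batch_size):
        for record in records[start:start + batch_size]:
            if "full_name" in record:
                continue  # already migrated, so re-running is a no-op
            record["full_name"] = read_full_name(record)
            migrated += 1
        # In a real system you would commit, checkpoint, and throttle here.
    return migrated
```

The old fields are left in place on purpose: readers that have not been upgraded yet keep working until the deprecation window closes, and only then is the old path removed.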

Operational metrics help identify what's not working from an architectural point of view.

Even more so than any code metrics.
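As an illustration, the sketch below summarizes call samples per dependency. The sample format and the `percentile` and `summarize` helpers are assumptions made for this example; in practice these numbers would come from your metrics system.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `pct` is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


def summarize(samples: list[tuple[str, float, bool]]) -> dict:
    """Each sample is (dependency name, latency in ms, succeeded?)."""
    by_dependency = defaultdict(list)
    for dependency, latency_ms, ok in samples:
        by_dependency[dependency].append((latency_ms, ok))

    summary = {}
    for dependency, calls in by_dependency.items():
        latencies = [latency for latency, _ in calls]
        failures = sum(1 for _, ok in calls if not ok)
        summary[dependency] = {
            "calls": len(calls),
            "error_rate": failures / len(calls),
            "p99_ms": percentile(latencies, 99),
        }
    return summary
```

A dependency with a high error rate or a long latency tail points at a service boundary worth revisiting, which is something no class-level code metric would reveal.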

Socio-technical design matters.

Team boundaries shape architecture and may also need refactoring
Ownership determines how fast change can happen
Decision-making processes can make changes fast or slow
Migrations can fail organizationally before they fail technically

Let's talk code now

Good large-scale code design is not about elegance; it is about being "good enough" to be changed under new information and business pressure.

If "good enough" is the goal,

we should optimize for simplicity!

OO is great. Really?

OO is powerful when you truly need substitution, rich behavior, extension points, and domain flexibility.

Code challenges in large-scale software systems

Orchestrating work
Enforcing workflow
Moving data around
Calling services
Handling errors
Managing concurrency
Authentication and authorization
Code ownership

Flexibility is often overrated.

Most systems need clarity and safe change long before they need speculative extension points.

This is not permission to write sloppy code.

Readable, testable, flexible code still matters, but in the right place and as a means to an end.

Story: bin ranges at Adyen

When the lack of a good code abstraction can cost you time and money.

Let’s get concrete:

a more practical design philosophy

Accept that "good enough" is good enough.

No code survives the test of time; overengineer only when you have a concrete need.

Architecture, data, and infrastructure as first-class.

They are where the refactoring and migration costs concentrate.

Observability and progressive rollout as the true safety net.

You can never land the airplane. Tests alone aren't enough.
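One way to picture progressive rollout guarded by observability, as a sketch under stated assumptions. The function names, the 1% error threshold, and the 10-point step are illustrative choices for this example, not values from the talk.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Maps a user to a stable bucket in [0, 100) for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """A user sees the new path once the rollout reaches their bucket."""
    return bucket(user_id, feature) < rollout_percent


def next_rollout_percent(current: int, error_rate: float,
                         max_error_rate: float = 0.01, step: int = 10) -> int:
    """Advance while the new path looks healthy; otherwise roll back to 0."""
    if error_rate > max_error_rate:
        return 0
    return min(100, current + step)
```

Because the bucket is derived from a hash, a user who gets the new path at 10% keeps it at 50%, so the exposed population only grows. The observed error rate, not the test suite, decides whether the rollout continues.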

Constant refactoring as the natural course of things.

"I respect the past, but what got us here won't get us there."

What’s in it for you, researchers?

Go beyond code refactoring
Include data, infrastructure, operations, and teams in the picture
Help make change safer, cheaper, and more observable
Bring research closer to production reality

A summary

Scale means traffic, data, uptime, and many teams and services.
The hardest problems are often about architecture, infrastructure, and data.
The only certainty is that today's decisions will age, and refactoring will eventually be necessary.
Code design is still important, but simplicity is king.
The goal is not timelessness, but the ability to keep changing.

Stay in touch

Maurício Aniche

CTO at Alura

mauricio.aniche@alura.com.br

Twitter: @mauricioaniche