The Bar is Sooooooooo High

Dmitry / Undefined Behavior / Blog

This is a translation.

Original article: https://nekrolm.github.io/blog.html

October 17, 2022 - I started working at the Amazon Web Services office in London

October 17, 2025 - my last day at Amazon Web Services

I quit, turning down an incredible opportunity to earn at least another £100k or so in Amazon RSUs if I stuck around for one more year.

Some will say - it's about time. Others: how could you?! Just one more year, and you'd get promoted and everything would be great, right? And it's FAANG, after all! You could have just transferred instead of quitting. Amazon is huge!

Amazon is indeed huge. And truly, stories about inhumane conditions can be 100% true in one part of it and a complete myth in another. I can only speak specifically about AWS, and most confidently - about CloudFront. But there are also certain characteristics that apply to all of Amazon.

There are many reasons why I decided to leave AWS:

Compensation compared to the market
Return-to-office 5 days a week
Endless approvals
Desperate attempts to do something well
Oncalls
Project disappointments
Stress

And that last reason became the deciding factor.

High Success Brings High Arterial Pressure

Amazon is famous for its Leadership Principles. 16 commandments to follow, which you need to weave into daily Amazon conversations on any topic to be successful in the corporation. Funnily enough, when I joined in 2022, there were 14 commandments. Now there are 16. But that's not what this is about.

One of the principles is "Success and Scale Bring Broad Responsibility." Taken just as a headline, without its official definition, it explains everything perfectly.

I always considered myself fairly stress-resistant. And it seemed to me that nothing particularly intense was happening at work. At least not all the time.

However, when subjected to regular, almost daily tugs in different directions (as they love to say here: receive shouldertaps), the human nervous system can eventually break:

Over 3 years working at AWS, I developed a monstrous stress-induced cough that sometimes doubled me over and made me vomit. For a long time I didn't understand the cause of this cough - I went to doctors and spent my entire annual insurance quota on them. Doctors and tests ruled out respiratory and gastroenterological causes, leaving me with one - stress. I started paying closer attention to how my cough correlated with what I was doing throughout the day. And I noticed that:

I barely coughed on weekends
I barely coughed on vacation
It got worse after meetings
And also when I left home, heading to the office

So what went wrong?

Jack of All Trades, Impostor of All

In this section, I'll allow myself to drop in those very Leadership Principles, ironically and as buzzwords, to convey the flavor of corporate writing.

CloudFront is a CDN, content delivery network, or simply put - a large distributed cache for your cat photos. And quite successful at it. Something like 30% of all internet traffic passes through CloudFront one way or another. Cool, right?

In practice, this means that with any change you have a chance to take down 30% of the internet. Banks won't be able to show you very important stories. Media services won't display your slop feed. Your very precious JavaScripts for drawing snowflakes in the holiday website theme won't load. There's a lot you can break.

Of course, allowing this to happen is absolutely unacceptable. So there must be well-established processes for review, testing, and gradual rollout of changes and, most importantly, emergency rollback. Preferably - automatic.

CloudFront has such processes. But they have some amusing specifics. I won't go into details, but they are captured very well by the following:

When estimating costs for any feature, the manager might say "let's keep rollout aside."

Not because everything is so well-organized that you don't need to think about it. But because this rollout will take an indefinite amount of time, from a month to a year or more, and the whole time you'll be shaking and praying that nothing goes wrong and you don't have to roll back, so as not to upset our dear customers. Customer Obsession above all. And some customers are so sensitive that they might get upset if one request out of a million fails in a day.

If something does go wrong, you get to write a correction-of-errors document, which is not the most pleasant thing, and then present it to a very wide audience of principals and senior managers at L7, L8 and above (a regular developer at Amazon is L4 or L5).

Few people want this. Therefore, as they say, we must Think Big and Insist on Highest Standards. So before each feature you need to write a document. Even if you still have no idea whether the feature is feasible or how to implement it. Implementation details are a Two Way Door Decision, so not very important.

The document will be reviewed. Moreover, it will be reviewed right at the meeting called to discuss it. After all, few people will read it beforehand. In a large corporation, everyone has so many meetings and so little free time that reading documents also requires booking a separate slot in the calendar. So you'll spend the first 15-20 minutes of the meeting reading silently.

And then the grilling begins (Bar Raising!)

Are we sure we set the right goal?
Are we sure we understood our customers correctly?
Are we confident this is secure?
How will we release this?
Are we sure we're ready?
Who is the customer?
Are we certain?
Certain?
Are we sure we've considered everything?
All the metrics?
Are we sure they're sufficient?

You're expected to be like Doctor Strange, viewing all possible futures and, together with everyone, picking the one correct solution in advance.

I don't think you could design a better recipe for developing impostor syndrome. Such Bar Raising can make anyone doubt themselves. Plus the ground is quite fertile for it: CloudFront is incredibly large. There's a huge number of wiki articles on how to try to "operate" this or that thing in it. Many of them are outdated or have links leading nowhere. Or are simply useless. The codebase is enormous, spread across thousands of repositories that only pretend to be independent, but are actually combined into a monorepo. All this is also rolled out by pipelines synchronized god knows how. And on top of this sits a bunch of internal monitoring services. Uncertainty, doubts, enormous context that can't be held in your head.

You can, and absolutely should, seek help and consultation when writing documents on questions you don't understand very well. True, those who do understand are most likely sitting in Seattle, 8 hours behind you. And your options are limited:

Hope for asynchronous interaction (rarely works well and quickly)
Stay up late at night for meetings (not a very healthy approach)
Wake up colleagues at 6 AM (they often already have meetings scheduled then)

So that engineers better understand the product they work on and feel at one with it, there is an oncall rotation. It lets you demonstrate Ownership, Customer Obsession and Operational Excellence. And also report Ops Wins!

Leaders are owners. They think long term and don't sacrifice long-term value for short-term results. They act on behalf of the entire company, beyond just their own team. They never say "that's not my job."

This means the oncall operator must:

Know who to ping when things are really bad
Be able to block/unblock pipelines
Be able to deploy bypassing pipelines
Be able to dig through logs in 50 different formats and services (here's CloudWatch, there's Querylog, here's just a file, here's Grafana suddenly, and here's also a CodeDeploy log, oh, and for the old generation of servers we have another log and need to use completely different commands. And here you should also check the network log...)
Be able to redirect traffic from one Point-of-Presence to another
Monitor a couple hundred dashboards
Know where runbooks are
Be able to read Java stacktraces as well as core dumps (or equally poorly)
Also understand AWS Lambda and Step Functions
And it's also desirable to Raise the Bar and Make the World a Better Place - by improving runbooks, dashboards and such

And, oh yes,

Respond to tickets filed on behalf of service users... And all sorts of fun can happen there:

A user might be (in pain) complaining that they had 1 HTTP request out of a million fail. Figure it out or they'll leave.
A user might be (in pain) complaining that they have an infinitely ancient web server that doesn't give a damn about any RFC, so hand them an HTTP header in alternating case, please.
A user might come and say they fired all the programmers who wrote code for deploying resources in your service, so hop to it and finish writing that code for them. And this isn't a joke. I sat in meetings for two weeks on exactly such a case.

And the poor soul who first responds to this ticket (no matter what or how), or whoever is oncall at the moment, becomes responsible for resolving it - and will receive shouldertaps regularly:

So what's up? Any updates?
Do we have customer wording?
Can we do anything else?

A special delight is when the ticket actually highlights some problem that can only be solved by a code change - a hotfix.

The hotfix will pass review very quickly. We'll shove it into the pipeline... And it will travel through the pipeline for several months. And all this time you'll be regularly jerked around from all sides:

Do we have an expected delivery time for the fix?
Has the fix arrived yet?
The customer is still in pain!
Why is it taking so long?!

Scroll up: what should the oncall operator be able to do?

Be able to block/unblock pipelines
Be able to deploy bypassing pipelines

And there are about ten of you skilled ones sitting on the pipeline at once. And each has deployed a DIFFERENT version.

...

I hope everyone understood everything. Let's keep rollout aside.

Are You Optimistic About GenAI Future at Amazon?

You know, it's very difficult to maintain a positive attitude when:

They regularly announce layoffs of thousands and tens of thousands of people.
The CEO demands that everyone be herded into the office 5 days a week (Disagree and Commute) and feel like they're at the world's biggest startup

Jassy also urged employees to "move fast and act like owners," as some of the company's competition is "working seven days a week, 15 hours a day."

And, of course, he famously sends out a letter about his vision of a bright GenAI future and how it will help us reduce our workforce even further.

And then, three days before my last day, the news: more cuts in HR, plus some others for good measure. https://fortune.com/2025/10/14/amazon-layoffs-pxt-hr-andy-jassy/

Hearing this when you're on a work visa - well, that's very inspiring, of course.

Thanks to everyone who supported me, invited me to podcasts, read and shared UBBook - all of this led to the opportunity to get a Global Talent visa and no longer depend on the passing whims of corporate top leadership.

The GenAI hysteria is completely out of proportion

They want to see GenAI in almost every project, in any form whatsoever
Internal hackathon? Need to add GenAI. Your entry won't be evaluated without it
And here's a KPI for GenAI usage at work. You're falling behind!

Amazon has a daily survey system (taking the temperature of the iron), like:

Do you feel energized at work? Strongly Agree, Agree, Neither agree or disagree, Disagree, Strongly Disagree

Does your team make decisions fast? Strongly Agree, Agree, Neither agree or disagree, Disagree, Strongly Disagree

And so on. It's all supposedly anonymous. But at the end of each month you gather as a team and review the answers. My impression: you sit for 40 minutes playing Mafia. Who among us is dissatisfied? Why are we below the benchmarks? These are such important surveys, you can't just answer them offhand. You need to approach them with all seriousness...

I don't know how the benchmark for Job Satisfaction is calculated, but when it was regularly shown at 50% (whatever that means) - it was very funny.

This wonderful system, in the last weeks before my decision to quit, asked me every day how well informed I am about GenAI and how much I use it at work.

The hammering with these surveys became unbearable to the point of being a meme.

One More Document and Everything Will Roll

In 2022 I was hired as an L5 Software Engineer. After about a year, almost all L4 engineers get promoted to L5. So I thought it might be worth trying for a promotion too. We don't have a single L6 engineer in London. Besides, I had pushed a good project forward, got it approved, and even completed it. And in general I was a respected person with deep expertise, mentoring the team, and so on.

They gave me an "exceeded role expectations" rating. Even with the manager's great interest in my promotion, I kept receiving comments: the bar is so high. Need to write another little document. Just a bit more, need to demonstrate operational excellence. One more system document and then for sure. Everyone already has this perception that you're totally L6, but now we just need to show senior peers one more document so they understand how you'll feel outside your comfort zone...

Anyway, a year ago, having seen the endless meetings of L6 engineers and how much more they get jerked around (even more meetings until midnight), I decided I didn't need this promotion that badly.

Besides, L5 at Amazon is considered a terminal level - people sit on it for 10 years. Because the bar is sooooooooo high.

And yes, to get promoted, you must work for several years in a row at the level of that very promotion. But without the compensation of that higher level. And by many accounts, the compensation doesn't come close to matching the headache that awaits you.

The Best Way to Burn Out

I don't really love web development and standard enterprise CRUD-mongering. I started my career as a C++ developer, moving bytes from FPGA board registers and optimizing cool number-crunchers.

In CloudFront Compute I got to work on excellent tasks (many of which I found myself) on optimization and low-level systems programming.

Many tasks and fixes were done in a couple of days or weeks. They showed very good results in lab tests and passed every known and reasonable test. Well then, maybe we could try to release them and see the results?

Are we sure?
How many killswitches do we need?
Is this definitely the right goal?
What if this breaks something for someone?

And such discussions dragged on for months. And each time it was necessary to convince, convince, convince. You can't just try it, verify it, and roll back if needed...

At some point I simply couldn't write any code anymore, except for something really small. Any even slightly serious change of more than 10 lines would require rounds of:

Long nagging of reviewers with an 8-hour time difference
Doubts from everyone: are we sure about XXXX?
Writing endless documents for approval of testing on live traffic, since experimental testing isn't enough
More endless discussions of additional monitors, alarms and kill-switches
And then regular tracking of how it's going through pipelines that can stand still for months

I went through this ordeal once as part of a relatively small isolated project (about 5 thousand lines of changes total, and lots of tests on top). And once I realized that any other comparable change would go the same way, I completely lost all desire to start anything.

The release of a project I made in 2 weeks took a year and a half... A year and a half without being able to see the result of your work!

Make programmers write code, don't let them see the result, and jerk them around in different directions - that's the best recipe for burnout.

Of course, I perfectly understand that such a state of affairs may suit many people. And in general it's all correct: the system is large, with specific requirements, you need to carefully plan everything, account for everything, think through everything, so there's zero downtime, and then the caravan will slowly go for the next 5 years... But this only disappoints me. I'd prefer to work on less massive projects with an understandable delivery cycle, rather than on monstrous distributed clouds whose continuous deployment is a constant dance on rakes.

I'm almost certain that everything related to release processes and feature development applies not only to Amazon. So I'll stay as far away as possible from any position even remotely related to SRE (Site Reliability Engineering).