"We are not Google"
The most obvious, yet most often disregarded, piece of advice I used to give my colleagues was: “we are not Google”. By that I meant two things: we were not operating at Google’s scale and, organizationally, we had nothing in common with Google engineering.
As early partners of Google Cloud, we were exposed to, and influenced by, a lot of their practices – which, to be clear, is often a good thing. I am a proponent of SRE, progressive delivery and, more broadly, DORA. What catches outsiders like us out is that these practices need to be tuned to the reality of our own organization.
The first problem is scale. The statistical underpinnings of these practices assume a large, uniform set of users – two properties (large, uniform) that often don’t hold in the world of Enterprise and B2B software. When you have a billion users using your system globally, you can rely on statistics to give you a clear and timely picture. When there are far fewer users, when they rely on your system to work during very specific business hours, and when each tenant has its own SLA, then you have to adapt!
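To put rough numbers on the scale point, here is a minimal sketch of how uncertain a measured error rate is at different traffic volumes. It uses the textbook normal approximation for a proportion, which is my choice of illustration rather than anything SRE prescribes; the function name and the sample figures are made up, and the approximation is itself shaky at small volumes, which rather proves the point.

```python
import math


def error_rate_margin(observed_rate: float, requests: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed error rate.

    Uses the normal approximation to the binomial, z * sqrt(p * (1 - p) / n).
    This is an illustrative assumption and is unreliable when n * p is small.
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    return z * math.sqrt(observed_rate * (1 - observed_rate) / requests)


if __name__ == "__main__":
    # Hypothetical volumes: a consumer-scale service versus one B2B tenant
    # over the same measurement window, both showing a 1% error rate.
    for requests in (1_000_000, 500):
        margin = error_rate_margin(0.01, requests)
        print(f"{requests:>9} requests: 1% observed, +/- {margin:.4%}")
```

At a million requests the margin is a small fraction of the measured rate; at a few hundred it is nearly as large as the rate itself, so the same 1% could mean "fine" or "on fire".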
Charity Majors has spoken of her experience at Parse where the overall health of the platform would hide the fact that one specific tenant could be experiencing a major outage. This is a problem that we’d see over and over. It is very hard, tending towards impossible, to build a multi-tenant system that is resilient to localized problems without degradation for the tenant: the infrastructure and development time to do so are rarely affordable and there is always going to be something that you didn’t think of. So, if you have made non-functional promises to individual tenants, you also have to measure and react at tenant scale, not just globally. This is hard and, to some extent, anathema to Google’s concept of SRE.
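As a minimal sketch of what measuring at tenant scale can look like, the snippet below computes availability both globally and per tenant, and flags any tenant that falls below its own target. Everything in it is hypothetical: the `Request` record, the `availability_report` function, the tenant names and the SLO figures are illustrative assumptions, not a description of any particular system.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Request(NamedTuple):
    tenant: str  # hypothetical tenant identifier
    ok: bool     # whether the request succeeded


def availability_report(
    requests: Iterable[Request],
    slo_by_tenant: Dict[str, float],
    default_slo: float = 0.99,  # assumed fallback target for unlisted tenants
) -> Tuple[Optional[float], Dict[str, float]]:
    """Return (global availability, {tenant: availability} for tenants below SLO)."""
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for request in requests:
        totals[request.tenant] += 1
        successes[request.tenant] += int(request.ok)

    grand_total = sum(totals.values())
    if grand_total == 0:
        return None, {}  # no traffic, nothing to report

    global_availability = sum(successes.values()) / grand_total
    breaches = {}
    for tenant, total in totals.items():
        availability = successes[tenant] / total
        if availability < slo_by_tenant.get(tenant, default_slo):
            breaches[tenant] = availability
    return global_availability, breaches


if __name__ == "__main__":
    # One large healthy tenant drowns out one small tenant that is mostly down.
    traffic = [Request("big", True)] * 9_900
    traffic += [Request("small", True)] * 40 + [Request("small", False)] * 60
    overall, breaches = availability_report(traffic, {"big": 0.99, "small": 0.99})
    print(f"global: {overall:.3%}, tenants below SLO: {breaches}")
```

With this made-up traffic the global figure sits above 99% while the small tenant is failing most of its requests, which is exactly the kind of outage a platform-wide dashboard would hide.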
The other, multifaceted, problem is organization: very few companies have the depth of engineering that hyperscale tech companies do. Those are organizations that understand the value of in-house expertise and can afford it. Obviously, small companies can’t afford it. But companies that are big enough to afford experts still need to recognize the value of expertise and be organized in such a way as to make it available to those who need it. When you’re a “normie” – potentially very successful – company, you can’t rely on there being someone who can tackle deep, extremely challenging problems in esoteric areas. You may have deep knowledge in some domain (hopefully, your business domain!) but there won’t be a deep bench of experts to call in when some technology misbehaves in novel ways.
The two problems compound each other: not only are smaller organizations not set up to tackle deep technical problems, they are also unlikely to have the institutional memory to recognize those problems when they occur – nobody will have seen even all the mildly esoteric problems before, never mind know how to fix them. Another way of putting this is that you should avoid solutions that you aren’t able to diagnose and debug within your team – there won’t be an expert that the company can pull out of a hat to fix them for you when things go badly wrong. This is not an endorsement of technological stagnation – far from it! – but rather the observation that advances in technology need to come with commensurate advances in one’s ability to operate and debug them.