Desired state and DevOps
TL;DR
This has turned into a long-winded rant, so what point am I trying to make here? Well, a couple of them:
- Whenever one says “desired state”, there should be some form of reconciliation, not just unidirectional change propagation.
- The deterministic portion of the reconciliation should be fast, relying on reproducibility to deduplicate time-consuming steps. Without the needless slow “build” steps, the performance of the overall reconciliation process becomes easier to focus on and optimize.
The rant
Related to my example yesterday, where the desired state of a complete SaaS control plane can quite successfully be stored in a Git repo, I think it is worth taking a moment to unpack the intent of such an implementation and see how implementations frequently fail to live up to that intent. In short, when saying that a repo is the desired state of X, we imply that something will reconcile the actual and desired states. In Kubernetes terms, a reconciliation loop.
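As a minimal sketch of what that loop looks like (the helper functions here are hypothetical stand-ins, not any real API): a reconciler reads both sides and computes the actions needed to converge them, without caring why they diverged.

```python
def get_desired_state():
    # Hypothetical stand-in: in practice, read from the Git repo checkout.
    return {"web": 3, "worker": 2}

def get_actual_state():
    # Hypothetical stand-in: in practice, observe the running system.
    return {"web": 3, "worker": 1}

def reconcile(desired, actual):
    """Compute corrective actions, regardless of why the states diverged."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name)
        if have != want:
            actions.append((name, have, want))
    return actions

# A real reconciler would apply these actions and then loop, re-observing
# the world on every iteration rather than trusting its last write.
actions = reconcile(get_desired_state(), get_actual_state())
```

The essential property is that the loop starts from observation, not from a changelog of what the repo did last – so drift from any source gets repaired.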
Build systems don’t reconcile things
My experience has been that reality on the ground is usually not so nice – the old-fashioned “build system” roots of DevOps infrastructure lead everyone to treat this as an open-loop state machine rather than a closed-loop control system. This is where naming doesn’t help: “GitOps” as a term falls short of the full intent, quite possibly because it has been captured by vendors (just like “observability” has). Sure, the common usage of the “GitOps” term captures the idea that the desired state of the system is present in a Git repo and that something will make the real world reflect it, but what it doesn’t capture is how the real world will be changed. Because of the industry legacy of build systems, there is an assumption that the only source of changes is the repo – everything else flows unidirectionally from there. That’s a simplistic, but false, assumption.
The long shared history of “I have new source files, therefore I must build and deploy new binaries” is comfortable, but it isn’t really the job to be done. The job to be done is to ensure that the deployed binaries reflect the source code. It doesn’t matter why they don’t currently match, the situation needs to be remedied (by, absent any optimizations, building and deploying the code). So, we have a lot of tooling that was created for a subset of the job to be done (the hammer), leading people to treat everything as a nail.
People set up build systems to be inefficient
Aside from these build systems’ inability to recognize and recover from changes that don’t follow the strict state machine they were coded to implement, the heritage of applying them to flaky technologies leads to rigid and inefficient patterns. If merging a PR to the main branch means it is code that we want to see in production, why do I see so many teams run all the tests, security scans, etc. again after the merge? They should be prerequisites to the merge (and often are), but teams treat the “build” pipeline that runs as continuous integration as the one true source of truth, ignoring all checks that happened prior.
The sad thing is that the push towards immutable build artifacts and reproducibility has been underway for many years. In a world where “building” the same code produces the same results forever, there is no need to build it more than once. Those results can be cached for efficiency. The reason I put “build” in quotes is because so many of the time-consuming transformations that happen along the path from checking in code to having an updated running system are now reproducible, deterministic transformations. These aren’t just code compilation steps – unit testing, bill of materials determination and so forth are, if done properly, immutable. The work that remains based on external state, such as determining which software vulnerabilities are present in the bill of materials today, is limited.
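One way to exploit that reproducibility is to key each deterministic step by a hash of its exact inputs and run it at most once per distinct input. A rough sketch, with an in-memory dict standing in for what would really be a persistent artifact store:

```python
import hashlib
import json

_cache = {}  # Hypothetical: real systems use a persistent, shared store.

def cache_key(step_name, inputs):
    """Key a deterministic step by a stable hash of its name and inputs."""
    blob = json.dumps({"step": step_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def run_cached(step_name, inputs, run_fn):
    """Run a reproducible 'build' step at most once per distinct input."""
    key = cache_key(step_name, inputs)
    if key not in _cache:
        _cache[key] = run_fn(inputs)
    return _cache[key]
```

The same pattern covers compilation, unit tests, and bill-of-materials generation alike; only the genuinely external lookups (today’s vulnerability data, say) fall outside the cache.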
We shouldn’t live in a world where all the same expensive steps are performed over and over. Not just from a resource-efficiency perspective, but also from an operational perspective: if the reconciliation from code to deployed system is fast, there only needs to be one way to roll back code, for example – just update the main branch to reflect the desired version. So many of the special cases teams are forced to include in their operational processes are just there to work around inefficiencies that have been carried forward from decades-old infrastructure concepts. While I’m not in love with the implementation, I think that Dagger is a good example of somebody rethinking the problem in a modern setting.
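To make the rollback point concrete: when deploy and rollback share the same reconciliation path, “roll back” is just another desired-state update, not a separate procedure. A sketch with a hypothetical `deploy_fn`:

```python
def converge(desired_version, deployed_version, deploy_fn):
    # One code path for deploy, rollback, and drift repair alike:
    # whatever the desired version is, converge the deployment to it.
    if deployed_version != desired_version:
        deploy_fn(desired_version)
    return desired_version

# Rolling back is not a special case: pointing main at an older commit
# changes desired_version, and the same loop converges to it.
```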