Orchestration-as-Code

Orchestration and software are the same

What is Orchestration?

It starts simply. A developer creates a program. That program runs. Works on my machine! But the developer’s local workstation isn’t a stable place to run the program. So it needs to run somewhere else. Grab a server (how about EC2?) and run it there. Solved. Then it needs to run at a certain time. Easy – set up a cron job. Done. The team moves on.

Another program is written – developers like to write programs. This one needs to run after the first program. Maybe instead of programs we should think of them as jobs? Ok, jobs. So job 1 runs on a schedule. Job 2 needs to run after job 1. We could give job 2 its own cron schedule after job 1 with enough buffer so that there’s no overlap. Oh, except twice a year job 1 has massive runs that take triple the time. Ok, we’ll just remember that and cancel the cron for job 2 on those days. Solved.

And then another program is written – we will cleverly name this job 3. This needs to happen regularly, doesn’t have any dependencies, but is resource intensive. Well, we could either give it its own server but that’s a lot of money; or we could just schedule it to not run when jobs 1 and 2 run. Nice, that works … but also let’s make sure that those really long job 1 runs twice a year aren’t impacted by this job 3. So we’ll also cancel job 3. Who is going to keep track of all this? We can assign this to a person. Perfect. It’s easy enough to remember.

The pattern continues until eventually cracks in the flow form:

A job prevents another job from running
Two jobs with shared dependencies write over one another to create unpredictable state
The “assigned person” forgets that jobs n, y, and x need to be stopped during the longer-running jobs, so the jobs have to be re-run
New team members have no idea what is running or why
Simple solutions (every job gets its own resources) create linear (or worse?) cost increases

The team starts to feel like they are managing a symphony of jobs, spending more time instructing jobs to run than building and improving the systems in place. They feel like … orchestrators.

Simply put, orchestration is the management of processes in a system. In simple systems, it’s barely noticed (although it is there: a single-process cron job is a perfectly sufficient orchestrator). In complex systems it is much more pronounced.

Enter the field of Orchestration Systems

Orchestration systems are a class of technology designed to make it easy to manage the flow of complex systems: resource dependencies, process chaining, and job execution (retry, auto-scale, backoff, etc.). Done well, they remove the growth bottleneck and let teams keep scaling their processes without needing to linearly scale the team (people $$$$) or, sometimes, the execution space (CPU/GPU, memory, disk $$).

The Challenge

I recently had the challenge of selecting the next orchestrator. The process itself taught me a lot about some of the underlying assumptions guiding my thinking and led me to a striking conclusion.

First, what actually are the bottlenecks in an orchestration system? And, inversely, what does an orchestration system need to have in place in order to safely accelerate operations? I put forward that the essentials of an orchestrator are the following:

Human Review
Chaining
Observability

Everything else is sugar. These are the main courses.

Human Review

The first time a job or job set is moved to production is relatively straightforward. That is, the first deployment is always the easiest because there are no expectations or dependencies yet. The nth deployment or change to the job set is much trickier because there exist n-1 expectations governing the production system. These must be preserved.

Software teams know this well and have many tools to enable fast, safe delivery in the vein of CI/CD, a core set being automated testing and human review (PRs). Change proposals for job orchestration should follow the same practice.

Changes to orchestration should be reviewed and tested.

Chaining

A single process flow is pretty straightforward. Multiple ordered processes (i.e. a DAG) on shared compute need to be managed. This is the primary job of the orchestrator: the job flow is described, and the orchestrator makes sure that the jobs kick off in the correct order and that downstream jobs do not run if there is a problem upstream.
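
A minimal sketch of chaining in Python, assuming jobs are plain callables and using a hypothetical `run_chain` helper (neither comes from any particular orchestrator). It orders jobs with the standard library's `graphlib.TopologicalSorter` and skips any job whose upstream job failed or was itself skipped.

```python
from graphlib import TopologicalSorter


def run_chain(jobs, deps):
    """Run jobs in dependency order; skip anything downstream of a failure.

    jobs: dict of job name -> zero-argument callable (hypothetical jobs)
    deps: dict of job name -> set of job names it must run after
    Returns a dict of job name -> "ok" | "failed" | "skipped".
    """
    graph = {name: deps.get(name, set()) for name in jobs}
    status = {}
    for name in TopologicalSorter(graph).static_order():
        # Upstream jobs always appear earlier in a topological order.
        if any(status[up] != "ok" for up in graph[name]):
            status[name] = "skipped"
            continue
        try:
            jobs[name]()
            status[name] = "ok"
        except Exception:
            status[name] = "failed"
    return status


def failing_job():
    raise RuntimeError("job 1 failed")


# job2 depends on job1; job3 is independent.
status = run_chain(
    jobs={"job1": failing_job, "job2": lambda: None, "job3": lambda: None},
    deps={"job2": {"job1"}},
)
print(status)
```

Here job 1 fails, job 2 is skipped because it depends on job 1, and job 3 still runs because it has no dependencies.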

Observability

Job scheduling, results, and history need to be easily visible and inspectable. Failures need proper alerting. Communication hubs (email, PagerDuty, Slack) need to be activated when there is a critical failure and left quiet otherwise. Logs need to be easy to produce and easy to access. Inspectability is what builds trust in the orchestration.
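
As a rough illustration, assuming a hypothetical `run_observed` wrapper and an `alert` callable that stands in for a real email, PagerDuty, or Slack integration, the three concerns (history, logging, selective alerting) might look like this in Python:

```python
import logging
import time

log = logging.getLogger("orchestrator")


def run_observed(name, job, history, alert, critical=False):
    """Run a job, append the outcome to history, and alert only if critical."""
    started = time.time()
    try:
        job()
        outcome = "ok"
    except Exception as exc:
        outcome = "failed"
        # Every failure is logged with its traceback...
        log.exception("job %s failed", name)
        # ...but only critical jobs wake someone up.
        if critical:
            alert(f"{name} failed: {exc}")
    history.append({"job": name, "outcome": outcome, "started": started})
    return outcome


def broken():
    raise RuntimeError("disk full")


history, alerts = [], []
run_observed("nightly-report", broken, history, alerts.append)
run_observed("billing-export", broken, history, alerts.append, critical=True)
print(alerts)
```

Both failures land in the log and in `history`, but only the job marked critical produces an alert. The job names are invented for the example.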

Orchestration Product Positioning

With that outlined, there are three ways of evaluating orchestration technologies:

Job as overlay
Job as config
Job as code

Each of these philosophies sees the orchestration problem – and the solution space – slightly differently.

Job-as-overlay looks at orchestration as a distinct activity from software development. The orchestrator’s responsibility sits over the software. It doesn’t matter whether the process is custom Go code, an API call, or even a manual process. The orchestrator is fully agnostic to the implementation – and separate from it.

Job-as-config treats orchestration as something that sits alongside the process (think Apache Airflow). Similar to an overlay strategy, the structure and flow of job execution is defined separately from the job. However, it deviates from the overlay class because it understands program execution more closely; the orchestration itself is governed by code flow. This allows more granular governance within program execution: observability reaches further into software execution, which is typically where production failures occur. It’s this higher observability that has made Airflow especially popular.

Job-as-code treats orchestration as software. Orchestration flow is fundamentally the same as software process execution. This becomes especially important when the flow might not be known ahead of execution: dynamic runtime branching is much more difficult in an overlay or config paradigm. If instead we think of orchestration as a running software program, branching becomes simple to define. If the result of <x> is true, run Job A; if it’s false, run Job B. This kind of branching happens in every production system, and it is where many teams spend their time wrestling with the orchestrator (and often working around it).
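
In a job-as-code setting, that branch is just an `if` statement. A minimal Python sketch, with every function name invented for illustration:

```python
def run_flow(check, job_a, job_b):
    """Pick the next job at runtime from the result of an earlier step."""
    if check():
        return job_a()
    return job_b()


def has_new_records():
    # Hypothetical upstream check; a real one might query a database.
    return False


def load_records():
    return "loaded"


def send_no_data_notice():
    return "notified"


result = run_flow(has_new_records, load_records, send_no_data_notice)
print(result)  # notified
```

Because the branch is ordinary code, it can be unit-tested with a stubbed `check` and reviewed in a PR like any other change.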

Ultimately, I came to the conclusion that job-as-code is a superior paradigm within which to operate orchestration and more generally Business Process Automation.

Orchestration-as-Code

It’s striking to me how much the principles of IaC cross-apply to nearly every facet of the SDLC. Orchestration is no exception.

Orchestration needs to be safe. This is where Human Review comes in. Consider the overlay model. You work in a WYSIWYG to chain jobs, define resources, and build the job flow. How do you know it’s right? Maybe you create a duplicate environment – call it staging – with mocked resources and chain everything together. You run it and everything looks good. You’re ready to make the change in production. What do you do then? Re-configure everything? Copy and paste? A/B releasing is the safest but still needs resource swapping and the staging / prod environments need to be exactly the same except for the proposed change. Then you have the problem of change bottlenecks: at most one change at a time.

What if instead … it was code. that was auto-deployed. and changed on a branch. with PRs. and automated testing.

Instead of the copy-pasta, config-hell, wing-and-a-prayer releases, the orchestration follows a full CI/CD process flow. Branching supports multiple changes, automated deployments encourage smaller changes often, reviews ensure multiple humans inspect, and testing sits alongside the changes in the software. This is exactly what has made software releases faster and safer. It’s no different with orchestration.

That means that both job-as-config and job-as-code work in this model: config can follow branching, review, testing, and deployment alongside software.

There’s another interesting facet at play here: cognitive load. When the developer is building a process or even chaining together multiple processes, they are writing software.

Function invocation is orchestration
Multi-threading jobs is orchestration
API call-and-response is orchestration

And so on. That is, the developer’s work is software orchestration. Why shouldn’t orchestration be integrated into the software process?

When the developer needs to leave the program in order to describe program execution, the developer carries additional cognitive load. Now two systems need to be organized: the program, and the program orchestration.

What if instead the orchestrator came to the developer? What if, as the developer builds the system, the orchestrator were right there alongside, bringing the full support of enterprise orchestration software into the development process?

Ok, that’s great, but what about multi-program execution? What about when you actually need to call a black-box process? It’s an important question and one that every enterprise process flow needs to grapple with. But let’s consider it from a different starting point: shouldn’t multi-program execution itself be code? Shouldn’t that code be governed by CI/CD? Shouldn’t branching, retry, backoff, state management, halting, and fan-out be governed by review and automated tests? Wouldn’t we want to mock the retry logic and validate performance?
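
To make the last question concrete, here is a minimal Python sketch of retry with exponential backoff whose `sleep` function is injectable, so a test can replace it with a fake and check the delays without waiting. The `retry` helper and its parameters are assumptions for illustration, not the API of any particular orchestrator.

```python
import time


def retry(job, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call job until it succeeds, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return job()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


def test_retry_backs_off_then_succeeds():
    calls = []
    delays = []

    def flaky_job():
        # Fails twice, then succeeds on the third call.
        calls.append(1)
        if len(calls) < 3:
            raise RuntimeError("transient failure")
        return "done"

    # Inject a fake sleep so the test records delays instead of waiting.
    result = retry(flaky_job, attempts=3, base_delay=1.0, sleep=delays.append)
    assert result == "done"
    assert delays == [1.0, 2.0]


test_retry_backs_off_then_succeeds()
```

The retry policy lives in the same repository as the jobs it governs, so a change to the backoff schedule goes through the same review and test run as any other code change.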

The point is this: once you start to notice that orchestration is code, you realize that you weren’t solving anything different from basic software principles. You are just scaling out those principles in a specific domain.

The realization landed for me as I was reviewing what good orchestration systems needed. I wasn’t actually evaluating orchestration tools. I was evaluating delivery paradigms. And once I saw that, I realized we had already solved that problem with modern CI/CD processes. The moment I stopped asking “which orchestrator handles our job dependencies best?” and started asking “which process flow best governs this code?” the decision became obvious. Branching logic, retry behavior, state management, failure handling: these aren’t orchestration concerns sitting above the software. They are the software. Orchestration and software are the same. Treating them differently was always the illusion.

The Architecture of Meaning
