A migration story worth reading closely
On June 25, Slack's engineering organization published a detailed account of how its AI serving platform evolved across four phases over roughly three years, and it is one of the more useful infrastructure write-ups we have seen this quarter. Authored by Shaurya Ketireddy, a Lead Member of Technical Staff, the piece traces a journey from self-managed Amazon SageMaker to a provider-agnostic serving layer that now spans Amazon Bedrock and Google Cloud Vertex AI. What makes it valuable is not the destination but the sequence of trade-offs: each phase solved the pain the previous one created.
The context is brutal in the way real production systems always are. Slack reports that its AI workloads can fluctuate up to tenfold between peak and off-peak periods, the kind of variability that punishes any naive capacity plan. Serving millions of enterprise users while that demand swings wildly is the constraint that shaped every architectural decision. For engineering leaders, the honesty here is the point. This is not a vendor case study about a frictionless adoption; it is a record of how a serious team negotiated GPU scarcity, cost commitments, and availability over multiple years.
Phases one and two: from SageMaker to Bedrock
The first phase, the SageMaker era beginning in early 2023, ran inside an escrow VPC with cross-account IAM roles and immediately surfaced the classic problems: scaling latency, GPU scarcity, and over-provisioning. Slack patched these with on-demand capacity reservations and cron-based scaling, the duct tape that keeps early systems alive. We recognize this stage in almost every organization that built AI serving before managed inference matured. It works, but the operational tax is paid by the people who least want to be paying it, the engineers who would rather be improving the product.
The second phase moved to Amazon Bedrock in mid-2024, which eliminated most of the infrastructure-management overhead and reframed capacity in terms of Model Units rather than raw GPU instances. Slack describes this as a zero-incident migration that maintained 100 percent availability, and crucially it let the team adopt new model versions weeks or months earlier than the SageMaker path allowed. That second benefit is underrated. In a market where model quality jumps on a near-monthly cadence, the ability to upgrade faster is a competitive feature, not just an operational convenience.
Phase three: hybrid capacity and spillover
The third phase introduced a hybrid routing strategy that combined Bedrock's Provisioned Throughput for latency-sensitive features with On-Demand capacity for bursty workloads. Slack built a spillover pattern so that surges could overflow from reserved capacity into on-demand, smoothing the tenfold traffic swings without paying for peak capacity around the clock. This is the FinOps heart of the whole story. Provisioned throughput buys predictable latency at the cost of a commitment; on-demand buys flexibility at a premium. The engineering work is in routing each request to the cheaper option that still meets its latency budget.
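To make the spillover idea concrete, here is a minimal sketch of our own, not Slack's implementation. It assumes reserved capacity can be modeled as a fixed number of concurrent request slots; the class name SpilloverRouter, the provisioned_limit parameter, and the pool labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpilloverRouter:
    """Hypothetical router: fill reserved capacity first, overflow to on-demand."""

    provisioned_limit: int  # assumed max concurrent requests on reserved capacity
    in_flight: int = 0      # requests currently holding a reserved slot

    def acquire(self) -> str:
        """Pick a capacity pool for one request and return its name."""
        if self.in_flight < self.provisioned_limit:
            self.in_flight += 1
            return "provisioned"
        # Reserved capacity is saturated, so the surge spills over.
        return "on_demand"

    def release(self, pool: str) -> None:
        """Free a reserved slot when a provisioned request completes."""
        if pool == "provisioned" and self.in_flight > 0:
            self.in_flight -= 1


# Four simultaneous requests against two reserved slots.
router = SpilloverRouter(provisioned_limit=2)
pools = [router.acquire() for _ in range(4)]
```

In this toy run the first two requests land on reserved capacity and the next two overflow to on-demand. A production router would presumably also weigh each feature's latency budget and throttling signals, which this sketch leaves out.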
Removing the multi-month commitment lock-in was the quiet structural win of this phase. Long capacity commitments are comfortable until a better model or a better price appears and you are stuck. By designing the system so that provisioned and on-demand capacity coexist behind a single routing decision, Slack preserved the ability to move. We would encourage any team building serving infrastructure to treat commitment flexibility as a first-class design goal rather than something you negotiate with finance after the fact.
Phase four: the multi-cloud payoff
The final phase, arriving in early 2026, added Google Cloud Vertex AI alongside AWS behind a provider-agnostic serving layer that pairs intelligent routing with an automated circuit-breaker pattern. Slack is explicit that Vertex AI is not merely a failover for redundancy but a strategic engine for accessing a broader catalog of state-of-the-art models. The measured results are concrete: roughly a 10 percent improvement on complex reasoning tasks and about a 67 percent latency reduction for short, low-token workloads. Numbers like that justify the considerable complexity of running two clouds in production.
There is a sober counterpoint worth holding. Multi-cloud is not free; it doubles the surface area for failures, observability, and security review, and the circuit breaker that makes it resilient is itself a system you now have to operate. Slack's claim of zero customer-facing incidents across these migrations is impressive precisely because it is hard. The lesson we take away is sequencing. Slack earned the right to go multi-cloud by first mastering single-provider operations, and teams tempted to skip straight to the end state should study how much groundwork the earlier phases laid.
The routing layer is the real product
If there is one component worth extracting from Slack's account, it is the intelligent routing layer with its automated circuit-breaker pattern. Everything else in the architecture, the providers, the capacity modes, the model catalog, is interchangeable; the routing layer is what makes them interchangeable. It is the abstraction that lets a feature request a model by intent rather than by provider, and lets the platform decide at runtime whether that request lands on Bedrock provisioned throughput, Bedrock on-demand, or Vertex AI. That indirection is precisely what spared Slack from rewriting application code every time the underlying serving decision changed across four phases.
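As an illustration of requesting a model by intent rather than by provider, consider the following sketch. It is our own construction under stated assumptions: the intent names, the backend labels, and the ROUTES table are hypothetical and do not come from Slack's write-up.

```python
from typing import Dict, List, Set

# Hypothetical intents mapped to backends in preference order.
ROUTES: Dict[str, List[str]] = {
    "low_latency_chat": ["bedrock_provisioned", "bedrock_on_demand", "vertex_ai"],
    "batch_summary": ["bedrock_on_demand", "vertex_ai"],
}


def resolve(intent: str, available: Set[str]) -> str:
    """Return the first preferred backend for an intent that is currently available."""
    for backend in ROUTES[intent]:
        if backend in available:
            return backend
    raise LookupError(f"no available backend for intent {intent!r}")


# The caller names an intent; the platform picks the backend at runtime.
choice = resolve("low_latency_chat", {"bedrock_on_demand", "vertex_ai"})
```

The point of the design is that application code only ever names an intent. Changing the serving decision means editing the table, not the callers.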
The circuit breaker is the part engineering leaders should internalize. In a multi-provider world, the failure mode is not a clean outage but a degraded dependency: one provider gets slow, returns errors intermittently, or throttles you mid-surge. A circuit breaker that trips on those signals and reroutes traffic is the difference between a contained blip and a cascading incident. We would argue that this routing and isolation logic, not the choice of any particular cloud, is the genuine intellectual property in Slack's stack. Teams building serving infrastructure should invest there first and treat the providers behind it as commodities that come and go.
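For readers who have not built one, a circuit breaker can be sketched in a few dozen lines. The version below is a generic textbook form, not Slack's: it assumes the breaker opens after a fixed number of consecutive failures and allows a trial request after a cooldown. The thresholds, class name, and provider labels are hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Minimal sketch: open after N consecutive failures, retry after a cooldown."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock  # injectable so the breaker can be tested without sleeping
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if traffic may be sent to this provider."""
        if self.opened_at is None:
            return True
        # After the cooldown, let trial traffic through (half-open state).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """A successful call closes the breaker and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the breaker at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def pick_provider(breakers: Dict[str, CircuitBreaker],
                  preference: List[str]) -> Optional[str]:
    """Return the first preferred provider whose breaker allows traffic."""
    for name in preference:
        if breakers[name].allow():
            return name
    return None
```

A caller would wrap each provider request, calling record_success or record_failure on the outcome and treating timeouts and throttling responses as failures. Real systems usually add error-rate windows and limit half-open traffic to a single probe, which this sketch omits.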
Capacity planning when traffic swings tenfold
The tenfold peak-to-trough variability Slack reports is the constraint that should reframe how leaders think about AI serving budgets. Provisioning for peak wastes money for most of the day; provisioning for the average means falling over during spikes. Slack's answer evolved deliberately: cron-based scaling and on-demand reservations in the SageMaker phase, Model Units and provisioned throughput under Bedrock, and finally a spillover pattern where reserved capacity handles the steady state and on-demand absorbs the surge. Each step traded a little operational simplicity for a lot of cost efficiency, which is usually the right exchange once volumes are large enough to matter.
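A back-of-the-envelope calculation shows why the spillover pattern tends to win. The sketch below uses invented numbers, not Slack's: a day with 18 quiet hours at 1 unit of demand and 6 peak hours at 10 units, a reserved rate of 1.0 per unit-hour, and an on-demand rate of 2.0 per unit-hour. The function name daily_cost and every rate are assumptions for illustration.

```python
from typing import Sequence


def daily_cost(reserved_units: int, hourly_demand: Sequence[int],
               reserved_rate: float, on_demand_rate: float) -> float:
    """Cost of a day: reserved capacity is paid every hour, overflow is paid on demand."""
    reserved = reserved_units * reserved_rate * len(hourly_demand)
    overflow_units = sum(max(0, d - reserved_units) for d in hourly_demand)
    return reserved + overflow_units * on_demand_rate


# Hypothetical tenfold swing: 18 off-peak hours at 1 unit, 6 peak hours at 10 units.
demand = [1] * 18 + [10] * 6

peak_provisioned = daily_cost(10, demand, reserved_rate=1.0, on_demand_rate=2.0)
baseline_plus_spillover = daily_cost(1, demand, reserved_rate=1.0, on_demand_rate=2.0)
```

With these made-up figures, reserving for peak costs 240 per day, while reserving for the baseline and spilling the surge to on-demand costs 132. In this toy example the two only break even once the on-demand rate reaches four times the reserved rate, so the right split depends heavily on real pricing and the actual shape of the traffic.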
For finance-minded technology leaders, the structural lesson is that capacity commitments and traffic shape have to be designed together, not negotiated separately. A long provisioned-throughput commitment is cheap per unit but punishes you when a better model arrives or demand shifts. On-demand is flexible but expensive at the margin. The art is in the routing policy that decides, request by request, which budget to draw from while still meeting each feature's latency target. Slack's roughly 67 percent latency reduction for short workloads, reported for the multi-cloud phase, suggests the team got that policy right, and it is the kind of result that only shows up when capacity strategy and routing are treated as a single problem.
A reference path, not a template to copy
We want to be careful about how this story gets used. It would be easy for a leadership team to read Slack's four phases and conclude that multi-cloud AI serving is the obvious end state for everyone. It is not. Slack reached this architecture because it serves millions of enterprise users with strict availability expectations, compliance requirements such as FedRAMP Moderate, and traffic variability that makes single-provider economics genuinely painful. Organizations without that scale may find the operational cost of two clouds outweighs the roughly 10 percent quality gain on reasoning tasks that justified it for Slack.
The durable value of the write-up is the decision framework, not the destination. Each phase was a response to a specific, measured pain, and Slack only added complexity when the previous architecture demonstrably could not keep up. That discipline, adding sophistication in response to evidence rather than ambition, is the transferable lesson. Engineering leaders should map their own AI serving setup against Slack's phases to locate where they actually are, and resist the urge to leap ahead. The teams that skip the unglamorous middle phases tend to discover, painfully, why those phases existed in the first place.


