While there are already plenty of articles comparing the Total Cost of Ownership (TCO) of a fully managed service like ECS with one that shares more of the responsibility with its users, like EKS, the discussion is almost always very high-level and geared towards C-level executives. There’s certainly value in having those discussions, but the problem I see over and over again is at the ground level, where developers and DevOps teams struggle to internalize what it really means for them on a day-to-day basis.
I recently went through an exercise that highlights some of these key points, so I wanted to walk through how TCO actually plays out in practice in terms of concrete workstreams for both dev and infra teams.
Background
To lay out some context, there is a homegrown, legacy ETL system that has been running on ECS for years. This system was developed when there were no embedded DevOps engineers on the team, meaning that some developers on the team wrote some bespoke Terraform code and decided to use ECS as it required lower DevOps overhead upfront.
While the system is fairly simple (e.g., it moves files from S3 to a data lake and does some simple transformations), it became such a critical component of the entire data pipeline that it turned into one of those “don’t break what works” systems: always on the backlog for migration, but never with enough momentum to carry it through.
During this time, the DevOps team grew in size and EKS became the norm at the company for container orchestration. All of the new workloads were deployed onto EKS, and all the internal tooling, both for managing the cluster itself and for adding controls onto the applications (e.g., network policies, security), was geared towards supporting Kubernetes workloads.
At every quarterly planning event, the question of “why aren’t we using a single container orchestration system?” would be brought up. Every now and then, the DevOps team would do an initial analysis showing that ECS was actually costing more in operational and management terms, because backporting new EKS features to ECS was expensive in time and internal resources. This would in turn trigger the dev teams to do their due diligence in estimating how much effort a migration would take, but because things were “still working”, it would always fall behind in priority, and the issue would go stale and be forgotten until the next time the TCO discussion bubbled up.
Problems Bubbling Up
Cracks started showing when there were finally new feature requests to add to the legacy ETL system. From the dev side, this was a well-scoped problem. For example, instead of storing data in CSV, this system would now convert the format into Parquet for other systems to efficiently ingest. After the feature was developed, the dev team worked with infra teams to run some preliminary scaling analysis and pushed to prod with no problem.
Or so they thought.
After a few weeks, the team was getting paged for two reasons. First, the pods would sometimes consume too many resources on the node and prevent other pods, including observability agents, from being scheduled. Second, the finance team noticed a huge uptick in network costs as soon as this feature was released.
Both the dev and infra teams were confused. After all, they had done some scalability testing, and nothing they were doing was ground-breaking (meaning these exact problems were already solved on the EKS side). But what they found was that even though best practices like anti-affinity rules, container limits, and S3 Private Endpoints were thought to be in place, they were in fact not working as intended because of the bespoke Terraform code and subtle differences between ECS and EKS (e.g., the S3 Private Endpoint was only enabled for the VPCs hosting EKS and not the ones hosting ECS).
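To make that kind of drift easier to spot, here is a minimal sketch of the two checks that would have caught it. It is not what the team actually ran. It assumes you have already fetched the data as plain dictionaries, and the field names (VpcId, ServiceName, State, containerDefinitions, memory) follow the shape of the EC2 DescribeVpcEndpoints response and the ECS task definition format as I understand them. The function names, VPC IDs, and container names are hypothetical.

```python
from typing import Iterable


def vpcs_missing_s3_endpoint(vpc_ids: Iterable[str], endpoints: Iterable[dict]) -> list[str]:
    """Return the VPC IDs (in input order) that have no available S3 endpoint."""
    covered = {
        ep["VpcId"]
        for ep in endpoints
        # Service names look like "com.amazonaws.<region>.s3".
        if ep.get("ServiceName", "").endswith(".s3")
        and ep.get("State", "").lower() == "available"
    }
    return [vpc_id for vpc_id in vpc_ids if vpc_id not in covered]


def containers_without_memory_limit(task_definition: dict) -> list[str]:
    """Return names of containers that have no hard memory limit.

    A task-level "memory" value caps the whole task, so in that case
    nothing is flagged (a simplifying assumption for this sketch).
    """
    if task_definition.get("memory"):
        return []
    return [
        container["name"]
        for container in task_definition.get("containerDefinitions", [])
        if not container.get("memory")
    ]


if __name__ == "__main__":
    # Hypothetical sample data mirroring the situation described above.
    endpoints = [
        {"VpcId": "vpc-eks", "ServiceName": "com.amazonaws.us-east-1.s3", "State": "available"},
    ]
    task_definition = {
        "containerDefinitions": [
            {"name": "etl-worker", "memoryReservation": 512},  # soft limit only
            {"name": "log-agent", "memory": 256},
        ]
    }
    print(vpcs_missing_s3_endpoint(["vpc-eks", "vpc-ecs"], endpoints))
    print(containers_without_memory_limit(task_definition))
```

Keeping the checks as pure functions over already-fetched data means the same logic can run against either platform's VPCs in CI, regardless of which Terraform code created them.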
Takeaways
This “incident” finally illustrated to the dev teams what the hidden operational and maintenance costs are and how they can manifest in practice. Even though ECS is easier to manage and requires very little input from developers, there is a hidden cost to maintaining two different infrastructure systems across teams. So while the argument that “ECS is so easy to use and it’s working” is true, it does not change the fact that it is masking a TCO problem that can bubble up in the future.
TCO discussions often focus on how running EKS adds operational burden, but as this case study shows, the picture is more nuanced. If the rest of the organization runs on EKS and has more expertise there, maintaining a more “fully-managed” solution can bring its own challenges.