Why You Should Not Build an Internal Developer Platform

Maneesh Chaturvedi
8 min readJun 14, 2021

Most organizations today are investing heavily in application modernization, which in practice means a move to microservices and container-based technologies. The obvious choice is to run these containers in a public cloud, a private cloud, or a combination of the two. However, as the number of containerized applications grows, you need tooling to manage them: enter Kubernetes.

For enterprise organizations, the first step in the modernization journey is usually a managed Kubernetes offering from a cloud provider like AWS, Azure, or GCP. Unfortunately, many companies assume at this point that all the problems of managing containerized applications are solved. They fail to understand that every managed Kubernetes offering only brings up your cluster; after that, you are on your own. Three, maybe four years ago, bringing up a Kubernetes cluster was a hard problem. Not anymore: the complexity has significantly reduced, though that is not to say it is easy or trivial. Complexity now manifests in what are typically called Day 1 and Day 2 operations.

The lifecycle of an application has a lot to it. After the cluster is provisioned, there is still a plethora of work. The DevOps team or the application developers have to figure out how to:

- provision pods and allocate the right resources to them
- scale workloads up and down
- manage application configuration and secrets
- upgrade running code
- set up service-to-service communication
- capture logs and collect application and cluster metrics
- manage policies
- back up and restore clusters
- attach storage

The list goes on. Every company, irrespective of domain, faces these issues because the challenges of containerization are the same everywhere. Kubernetes does not provide capabilities for all of these; it relies on other tools, open source or custom-built, for service meshes, security, telemetry, networking, and infrastructure automation, to name a few. A lot has to be brought into the mix to run clusters across sites.
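To make this concrete, here is an illustrative sketch (not taken from any particular platform) of how many of those concerns already show up in a single Kubernetes Deployment. The service name `orders`, image, and config/secret names are hypothetical; the spec is built as a plain Python dict so the structure is easy to see.

```python
# Illustrative sketch: one Deployment already touches scaling, right-sizing,
# configuration, secrets, and traffic-readiness decisions. All names here
# ("orders", the registry, config/secret names) are invented for illustration.
import json

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "orders"},
    "spec": {
        "replicas": 3,  # a scaling decision
        "selector": {"matchLabels": {"app": "orders"}},
        "template": {
            "metadata": {"labels": {"app": "orders"}},
            "spec": {
                "containers": [{
                    "name": "orders",
                    "image": "registry.example.com/orders:1.4.2",
                    "resources": {  # a right-sizing decision
                        "requests": {"cpu": "250m", "memory": "256Mi"},
                        "limits": {"cpu": "500m", "memory": "512Mi"},
                    },
                    "envFrom": [
                        {"configMapRef": {"name": "orders-config"}},  # app configuration
                        {"secretRef": {"name": "orders-secrets"}},    # secrets handling
                    ],
                    # probes decide when traffic is routed and when to restart
                    "readinessProbe": {"httpGet": {"path": "/readyz", "port": 8080}},
                    "livenessProbe": {"httpGet": {"path": "/livez", "port": 8080}},
                }],
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```

And this is just one workload: multiply it by hundreds of services, each with its own sizing, configuration, and upgrade story, and the scope of the problem becomes clearer.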

So how do you go about solving the problem? There are a few alternatives, and all of them come with trade-offs; some are downright impossible to sustain in the long run.

The traditional approach has been to have the Ops team handle everything related to application deployment, management, monitoring, upgrades, security, and so on. Although this model worked in the past, it is not sustainable at scale. With a small number of applications, typically monoliths running on designated servers, a small operations team could proactively monitor the applications and servers. With hundreds or thousands of services running on hundreds of servers, however, it becomes infeasible for a small operations team to manage the ecosystem.

It's not just the larger number of applications that matters; it's the speed of rolling out changes as well. Rather than deploying once every three months, organizations have moved to frequent deployments, anywhere from once a day to multiple times a day, depending on the technical maturity of the team or organization. You could throw people at the problem, but how many is enough? Is a team of 100 enough, and if the stack grows, which it invariably will, do you keep adding people? Problems of scale cannot be addressed by adding headcount.

Another option is to let application engineers handle the complexity. However, when organizations push their software developers to write complex infrastructure software, they create a growth barrier for themselves: developers' time and effort should go into adding value to the business. Long-term investment in this approach is typically infeasible.

A better way to solve this problem is to develop an Internal Developer Platform (IDP). To make application development go faster, companies create internal platforms run by dedicated platform teams. IDPs represent an evolution of this idea: they focus on minimizing repetitive tasks for operational teams while providing developers a convenient environment that increases their productivity.

So it seems like building an IDP is the way to go, right? While this is undoubtedly an essential and valid initiative, it is not something an organization should get into without considering the challenges involved. We'll discuss a few of those challenges so you can make an educated decision before committing to build one.

Reason 1: Underestimating the Complexity

A lot of organizations underestimate the complexity of building an IDP. One reason is that the barrier to entry for Kubernetes can appear low: it is straightforward to run Kubernetes on a local machine using tools like kubeadm or Docker Desktop's built-in Kubernetes. Unfortunately, this low entry barrier leads many application teams and other stakeholders to assume that setting up or running an IDP amounts to firing off a couple of kubeadm/kubectl commands. That notion ignores the underlying complexity of running hundreds or even thousands of microservices, automating the developer loop, and simplifying operations. Running a small cluster with just the basic bells and whistles is easy, but scale changes everything: the challenges of building and operating an Internal Developer Platform manifest at scale. Running a few pods in a local cluster is nowhere near production-ready; it is the "hello world" of Kubernetes.

Another common misconception is that Kubernetes is a silver bullet: the false belief that setting up a Kubernetes cluster will solve all the issues, as discussed at the beginning of this post. This ties in with a general lack of understanding that developing infrastructure software is inherently complex. A well-built IDP abstracts many layers for developers, including infrastructure, networking, scale-in and scale-out, management and monitoring of applications, service-to-service communication, and a ton of other concerns. Building an IDP is far more complex than traditional application development, which typically leans on a high-level framework like Spring or on Single Page Applications built with React/Redux.

The lack of understanding and the low entry barrier can manifest in extremely aggressive timelines and targets for the platform team, who end up working round the clock to meet them, leading to eventual burnout. The industry standard for building an IDP ranges anywhere from 18 to 24 months. Setting expectations with stakeholders at the onset, and throughout the evolution of the IDP, is one of the most important criteria for success. Failing to do so, or working with inflated timelines, is a sure-shot recipe for failure.

This is where team dynamics, stakeholder management, and good product management can make a big difference. Stakeholders can differ in what they perceive as an IDP. For some, it is Infrastructure as a Service, which automates the provisioning of infrastructure; for others, Kubernetes as a Service, which lets individual teams set up Kubernetes clusters; and for yet others, a Platform as a Service that provides everything from infrastructure, deploying, and running applications to telemetry and even databases and messaging middleware. Proper and timely conversations with stakeholders can help steer the IDP in the right direction.

Reason 2: Organization Culture and Maturity Matters

The maturity of an organization, and the culture that percolates through it, is an important criterion when deciding whether to develop an internal developer platform. Very often, organizations start on the IDP journey too early, before fixing other issues related to technical maturity and culture, on the assumption that the IDP itself will solve those issues. This might work for small to medium startups with fifty to a hundred developers. In larger organizations with a lot of history and obsolete processes, however, it is not the wisest choice. Such organizations should first focus on fostering a culture of DevOps and removing silos and centralization.

The presence of these silos and obsolete processes creates a sense of fear when such organizations set out to build an IDP. That is natural, since an IDP is expected to replace much of the human effort required to build and run systems with an autonomous platform that handles most tasks associated with deploying and operating applications across the organization. Although an IDP can help you identify where those silos and bottlenecks are, surfacing them should not be the primary outcome of building one.

It is better to look at specific developer metrics like Time Spent Outside Code (TSOC) and try to bring that down to a small, manageable number. A continuous effort to reduce TSOC sprint by sprint for each team can pave a good path toward a culture of DevOps. Simply asking a team "What slows you down?" can surface many areas where the team needs help. As a leader, if you address those areas, you set the ball rolling in the right direction.
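TSOC is not a formally standardized metric, so as an illustration only, here is one plausible way a team might compute it: the fraction of tracked time spent on non-coding toil. The activity categories and hours below are invented.

```python
# Hypothetical sketch of computing Time Spent Outside Code (TSOC) as the
# share of tracked hours NOT spent on coding activities. The definition,
# categories, and sample data are assumptions made for illustration.
def tsoc(hours_by_activity, coding_activities=("coding", "code review")):
    total = sum(hours_by_activity.values())
    if total == 0:
        return 0.0
    coding = sum(h for a, h in hours_by_activity.items() if a in coding_activities)
    return round((total - coding) / total, 2)

# One team's (fictional) week: 40 tracked hours, 22 spent coding/reviewing.
week = {
    "coding": 18,
    "code review": 4,
    "deployments and env setup": 8,   # toil a platform could absorb
    "meetings and tickets": 10,
}

print(tsoc(week))  # 0.45 -- almost half the week spent outside code
```

Tracking a number like this sprint over sprint, per team, gives leaders a concrete target to drive down, rather than a vague aspiration to "improve culture."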

Reason 3: The dichotomy of Fast vs. Good

Numerous studies and publications from organizations that have successfully built IDPs show that an IDP can substantially bring down time to production; most report improvements of at least fifty percent. Yet there is still the question of ensuring that everything works properly, scales well, and is bug-free. What is the cost of breaking the customer experience? This is where "good" enters the equation. Good is orthogonal to speed: how good your software is depends on how well systems are architected, designed, developed, and tested. That means having solid software engineering practices in place. Proper testing, architecting and designing for scale and redundancy, sizing, and understanding application performance characteristics must be done beforehand.

Poor architecture can mean applications require inordinate amounts of compute or memory. Improper sizing and missing performance benchmarks can lead to under- or over-utilized application pods. Improper readiness checks can let applications receive traffic before they are ready to process requests, so initial calls return errors or get dropped. Similarly, improper health checks can leave you blind to whether an application is actually available.
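The readiness-versus-liveness distinction is worth making concrete. Below is a minimal, self-contained sketch (using Python's standard-library HTTP server; endpoint paths and port handling are illustrative choices, not prescriptions) of a service that exposes the two as separate endpoints, so the platform can tell "the process is up" apart from "the process is ready for traffic":

```python
# Minimal sketch: separate liveness and readiness endpoints, so traffic is
# not routed to a pod that is still initializing. Paths /livez and /readyz
# are conventional choices, not requirements.
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()  # set once startup work (caches, connections) completes

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/livez":
            # Liveness: the process is up and able to serve HTTP at all.
            self.send_response(200)
        elif self.path == "/readyz":
            # Readiness: report 200 only after initialization, so early
            # requests are not routed here and dropped.
            self.send_response(200 if ready.is_set() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):  # keep the sketch quiet
        pass

def serve():
    # Port 0 asks the OS for any free port; real services would pin one.
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With this split, a pod whose `/readyz` returns 503 is simply removed from load balancing, while a failing `/livez` tells the platform to restart it. Conflating the two is exactly how the error classes described above creep in.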

It takes a lot of upfront work during application development to ensure that an IDP can deliver on its promises. Investment in development is a prerequisite to investing in an internal developer platform; having an IDP does not remove the need for solid engineering practices. In really good software companies, this dichotomy of fast vs. good can be made irrelevant.

In closing, despite all the challenges, developing an IDP is a fabulous journey to undertake. Its impact on an organization, assuming the prerequisites are met, can be huge.


Maneesh Chaturvedi

Seasoned Software Professional, Author, Learner for Life