The Three Stages of Software Engineering
As the old engineering adage goes, “Fast, cheap or good: you can only have two.” While individual software developers often face decisions about when and where to make compromises in their code, tech companies as a whole face similar questions on a broader level as well. Things like complexity, speed, organizational alignment, and the availability of resources create restrictions as a company looks to scale software engineering. However, changes to the way you develop, build, test, and release apps can often help mitigate these problems.
Stage One: Simplicity
Scaling software development has to start somewhere. When I joined LinkedIn in 2011, the LinkedIn mobile development team was small: six people, to be exact. The company only had an iOS app, and it was built on the Three20 framework. We used a Ruby on Rails frontend, and our Rails instances interfaced with our backend API. Builds were created on my personal machine and releases happened on an as-needed basis. Testing was almost entirely manual and we only tracked a few server-side metrics. We had no mobile web presence or Android presence. As you can imagine, this was not the most efficient way to build and deploy software.
One of the most important first steps when starting any major new software development project is to find ways to make the develop-build-release process simple. An early focus on simplicity and straightforward process is key because it allows you to concentrate on creating software that solves your business problem, performs well for your users, and can scale as your user base grows.
In the mobile world, many people assume that they have no time to focus on building solid code. They are consequently lured into using tools like PhoneGap, which trade simplicity for portability. My recommendation is not to do that, because these tools essentially buy speed of development and deployment at the cost of technical debt, making it harder for you to understand your own code as your applications evolve. They also make it impossible to use the developer tools, both server side and client side, provided by big platform companies like Apple and Google. Those platform tools can meet the development and testing needs of the average project team when you have fewer than ten or so developers working on a project.
Instead, at the beginning of your scaling effort, you should focus on having specialists who understand your code from the ground up, and avoid tools that buy speed but complicate the design-build-test process or introduce technical debt in the early days of your project.
Another key thing to avoid during the early stages is trying to solve problems that don’t move your business forward. Lots of people hear about awesome tooling efforts at companies like Google, Facebook, or LinkedIn and want to emulate those companies by building their own custom tooling. But the only reason we and our peer organizations build those tools is that we have hundreds or thousands of developers. The number of users and developers is what forces us to build custom tooling, not the other way around. TestFlight, Gradle, the IntelliJ IDE, and other simple tools will meet the majority of your needs while your development team is still small. If someone proposes some crazy custom tooling to address a common engineering problem, then it’s probably the wrong answer.
At LinkedIn, the performance tradeoffs with our technology stack were causing headaches on a daily basis. We needed to architect our server to make it more I/O friendly, and we needed to fix the testing and release pipeline. By Q3, we had already begun scaling our organization to expand our mobile user base and our platform availability. We were at 11 percent of traffic, were launching on Android and HTML5-based mobile web, and had increased our headcount to 10 engineers. We also switched from a process-based system (Rails) to an evented system (Node.js) for increased performance in handling requests from clients.
Stage Two: Efficiency
Once your engineering organization has grown upwards of 30 people, the next important thing to establish is a sense of release cadence. After we had established a reasonably-sized mobile engineering team at LinkedIn, we ran into a problem with our branch development model, which I call “branch hell.” Essentially, if your releases are far enough apart, developers will try to commit code that isn’t ready for production for fear of missing out on the latest build. This leads to bad code being added to branches, compromises in branch readiness, and ultimately delays in shipping new versions of software.
To address this issue, we ended up adopting a very simple model for releases called the “train model.” In this model, every release goes out to production at a fixed time, just like a train leaving the station. A certain set of features can be on the train or not, but it’s definitely leaving, without question, which solves the issue of delayed releases. When you’re in an early-stage engineering organization, the train can leave whenever you want it to. But once you reach a certain size, you need to create and stick to a schedule—structure around how and when you release new versions.
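As a rough illustration of the train model, the Python sketch below computes the next fixed departure and splits features into those that board and those that wait for the next train. The `Feature` type, its `ready_on` field, the 14-day interval, and the dates are all hypothetical, chosen only for the example; they are not LinkedIn's actual schedule or tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Feature:
    name: str
    ready_on: date  # hypothetical: the day the feature passed its checks


def next_departure(today: date, first_departure: date, interval_days: int) -> date:
    """Return the first scheduled departure on or after `today`."""
    if today <= first_departure:
        return first_departure
    elapsed = (today - first_departure).days
    trains_elapsed = -(-elapsed // interval_days)  # ceiling division
    return first_departure + timedelta(days=trains_elapsed * interval_days)


def board(features: list[Feature], departure: date) -> tuple[list[str], list[str]]:
    """Split features into those on this train and those waiting for the next."""
    on_train = [f.name for f in features if f.ready_on <= departure]
    waiting = [f.name for f in features if f.ready_on > departure]
    return on_train, waiting


# Example with made-up dates: the train leaves on schedule either way.
features = [
    Feature("search_filters", date(2024, 1, 12)),
    Feature("new_inbox", date(2024, 1, 20)),
]
departure = next_departure(date(2024, 1, 10), date(2024, 1, 1), 14)
on_train, waiting = board(features, departure)
print(departure, on_train, waiting)
```

The point of the design is that the departure date never depends on the feature list: a feature that is not ready simply rides the next train.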
In order for the train model to work, you need to invest in a trunk-based development model, which is significantly simpler than branch-based development. A good experimentation platform and automated testing suite are good investments as well, since they let you accelerate your release schedule and increase your test coverage. At this point, I also recommend running tests against simulators (software emulators and the like) rather than focusing on device variability, at least until you get more users.
At LinkedIn, we use a system called LIX, which manages the lifecycle of all of our tests and experiments. Every product feature is behind a LIX test, and functional testing is performed with LIX turned on and off. All check-ins are made directly to the trunk and have to meet our testing criteria in order to ship. When we adopted this system in 2012, it meant that for the first time our automated testing could give us a comprehensive view of our code readiness: which features were ready to ship, and which were not. In short, we had banished “branch hell.”
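The idea of putting every feature behind a test and exercising it in both states can be sketched with a generic feature flag in Python. This is not LIX's API, which the article does not describe; `FlagStore`, `profile_header`, and the flag name `new_profile_header` are hypothetical names invented for the example.

```python
class FlagStore:
    """Hypothetical in-memory flag store; a real system would be a remote service."""

    def __init__(self, enabled=None):
        self._enabled = set(enabled or [])

    def is_on(self, flag: str) -> bool:
        return flag in self._enabled


def profile_header(name: str, flags: FlagStore) -> str:
    """Render a header; the new variant ships on trunk but stays dark until flagged on."""
    if flags.is_on("new_profile_header"):
        return f"{name} | View profile"
    return name


def check_both_states(render, flag: str, *args):
    """Run `render` once with the flag off and once with it on."""
    off_result = render(*args, FlagStore())
    on_result = render(*args, FlagStore([flag]))
    return off_result, on_result


off_result, on_result = check_both_states(profile_header, "new_profile_header", "Ada")
print(off_result)
print(on_result)
```

Because the unfinished variant is unreachable while the flag is off, code can land on trunk before the feature is ready for members, which is what makes trunk-based check-ins compatible with a fixed release schedule.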
At this point in our company’s history, we were tracking three primary metrics in order to measure success:
- The time from design created to code checked in.
- The time from code checked in to code in production.
- The time from code entering production to it being 100 percent ramped to members.
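The three metrics above are simply the gaps between four timestamps in a feature's life. A minimal Python sketch, assuming you record those four timestamps somewhere (the function name, the dictionary keys, and the sample dates are hypothetical):

```python
from datetime import datetime, timedelta


def pipeline_durations(
    design_created: datetime,
    checked_in: datetime,
    in_production: datetime,
    fully_ramped: datetime,
) -> dict[str, timedelta]:
    """Compute the three stage-to-stage durations for one feature."""
    stamps = [design_created, checked_in, in_production, fully_ramped]
    if stamps != sorted(stamps):
        raise ValueError("timestamps must be in pipeline order")
    return {
        "design_to_check_in": checked_in - design_created,
        "check_in_to_production": in_production - checked_in,
        "production_to_full_ramp": fully_ramped - in_production,
    }


# Made-up timestamps for a single feature.
durations = pipeline_durations(
    datetime(2024, 3, 1, 9, 0),
    datetime(2024, 3, 8, 17, 0),
    datetime(2024, 3, 11, 10, 0),
    datetime(2024, 3, 18, 10, 0),
)
for stage, elapsed in durations.items():
    print(stage, elapsed)
```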
Once you’re in a more rhythmic execution model, I recommend locking the development of features cross-platform, so that features launched on iOS can be launched on Android as well. This is not just to meet end-user expectations. Rather, diverging the two creates more complexity for everyone, from the designers and project managers to developers. Focusing on maintaining consistency between platforms will ease development at the tooling level and the cognitive level of understanding the product for everyone.
Stage Three: Organizational Alignment
The final stage of scaling any software development project is organizational alignment, when the project becomes so big that all of your engineering efforts (infrastructure, databases, operations, etc.) are supporting the same output. This typically happens when you start to get hundreds of developers committing to a single codebase and many millions of users using your software on a daily basis.
At this scale, you need quality assurance at every stage: pre-commit, post-commit, in production, etc. In many ways, this means you have to give up on the idea that all of your bugs will be caught in your testing cycle, and focus on monitoring the actual behavior of your software in the wild. There are lots of key metrics you can use as signals for how well your product is operating, including CPU utilization, DNS lookups, memory utilization, and crash rate. It may seem counterintuitive, but there’s no way to monitor what 100 million users are all seeing on their screens, so you actually have to release your software in order to truly understand how it is performing.
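One way to act on such signals is to compare a new release against the previous one while it is still only partly ramped. The Python sketch below does this for crash rate only. Using the comparison to pause a ramp is an assumption made for illustration, as are the function names, the 20 percent tolerance, and the sample counts; the article does not describe LinkedIn's actual monitoring rules.

```python
def crash_rate(crashes: int, sessions: int) -> float:
    """Crashes per session for one release."""
    if sessions <= 0:
        raise ValueError("sessions must be positive")
    return crashes / sessions


def should_pause_ramp(
    baseline_rate: float,
    candidate_rate: float,
    max_relative_increase: float = 0.2,  # hypothetical tolerance
) -> bool:
    """True when the candidate crashes more than the baseline plus the allowed margin."""
    return candidate_rate > baseline_rate * (1 + max_relative_increase)


# Made-up numbers: the previous release versus a candidate at partial ramp.
baseline = crash_rate(crashes=50, sessions=10_000)
regressed = crash_rate(crashes=9, sessions=1_000)
healthy = crash_rate(crashes=5, sessions=1_000)

print(should_pause_ramp(baseline, regressed))
print(should_pause_ramp(baseline, healthy))
```

A real check would also need enough sessions for the rates to be meaningful and would look at the other signals named above; this sketch shows only the shape of the comparison.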
Another counterintuitive idea is that as you grow bigger, you actually need to compress the release cycle. At LinkedIn, we went from a once-per-month release cycle to a 3×3 release cycle. A shorter cycle reduces the apprehension that leads developers to chuck a lot of features into a release at the last moment in order to avoid missing a deadline. With a smaller group of people, you can control this tendency through culture. But when you have a lot of people in many, many different organizations as part of your engineering team, there’s no way you can control that impulse through culture alone. Frequent releases also help developers feel less anxious about missing release deadlines, and as a result, they don’t rush their code into builds.
Having this kind of release cadence almost necessitates a fully automated release cycle, simply to keep up with the pace at which your teams are moving. This stage is when the custom tooling that I discussed earlier comes into play. At this scale, the default tools for iOS, Android, and even web development will be unable to deal with your growing codebase, and will end up slowing down the rate at which your teams can move. They’re simply not built to support that many people working on the same codebase, and features like quality library management are simply not provided by the big platform vendors by default. At LinkedIn, we split our projects between multiple repos to allow for more manageable multi-team collaboration. This necessitated a big investment in custom tooling to make multi-product development work while keeping up with our desired release cadence. Be prepared to make similar investments in your dev infrastructure as you grow your team.
Finally, this section is called “organizational alignment” for a reason. Today at LinkedIn we see mobile as a majority of our business and no longer make the distinction between engineering for mobile vs. desktop. Just as form follows function, your business goals and software projects will also eventually align as the number of users and customers grows. Those users will expect the same experience, regardless of the platform they use. Having two teams that own two different experiences fractures that experience and creates infighting between groups.
Also, you’re not going to keep hiring people into the mobile group if they outnumber the rest of the engineering team, combined. When your project merges with the overall organization, you can say that you truly have “scale” in your engineering organization.
Putting It All Together
Scaling a software engineering organization presents many challenges, even to the best organizations. It has the potential to introduce costs and complexity, but it also presents useful opportunities for improved collaboration and greater operational efficiency.
Often, these challenges are as much about culture and philosophy as they are about process and resources. By reducing organizational fear, increasing the speed of innovation, and creating alignment throughout your organization, you can reap the benefits of a large group of highly-motivated software engineers.