Continuous integration forms the backbone of every tech company. Without a smooth build and test setup, teams lose productivity and everything grinds to a halt. That’s why we took a hard look at our Travis CI setup last year. Was it really supporting our teams after the enormous growth we went through? It turned out that it was not.
We therefore migrated 75 repositories from Travis CI to TeamCity in a few months’ time. While doing that, we standardized our pipelines, reduced the build and queue time, and took control of our build agent hardware and environment. Quite impressive, right? In the remainder of this post, we’ll share why we did this and how we tackled this challenge.
Why did we even start?
We had quite a few regular complaints about our CI setup before migrating:
- Builds are too slow
It would take around 45 minutes (45 minutes, Carl!) to build our largest repo. And that was after we had already fixed a few obvious bottlenecks in the old setup.
- Time in the queue is too long
In the afternoon, your build could sit for over an hour waiting to start. We had a limit of 20 concurrent builds, which clearly was not enough anymore.
The final push arrived when Travis announced their new pricing model, increasing our cost by a factor of 4 without providing any benefit. It was time to look for a solution that better fit our scale and ambitions.
What goals did we have?
As migration was becoming inevitable, we decided to describe the ideal CI solution: in essence, everything we could improve, given that we had to re-implement our pipelines anyway. After a few rounds of discussions, we settled on the following list of priorities:
- We should standardize our build pipelines.
There should be reference pipeline definitions that can be used easily, but teams still should be able to customize everything.
- Time to feedback should be short.
Queuing and building together should be under 15 min for an “average” build.
- Maintenance should be low.
CI is not our core business, so we want a managed service that is as flexible as possible without us having to spend months or even weeks keeping it running.
- All our pipelines should be configured as code and versioned.
The main drivers for this are auditability, knowledge sharing, and history of changes.
- Scalability should come out of the box.
We should have no limits besides the depth of our wallet. No matter how much we grow or add to our CI, it should be able to scale with us. Ideally with (sub-)linear cost growth.
What alternatives did we consider and why did we choose TeamCity?
The full list included more than 10 options: TeamCity, GitHub Actions, Circle CI, CloudBees CodeShip, JFrog, AppVeyor, Argo, Buddy, Semaphore, GoCD, GitLab, Buildkite, and Bamboo.
From there we narrowed it down to 3 main candidates by reading the docs and checking against our requirements. The shortlist included: Circle CI, GitHub Actions, and TeamCity.
Next, we started hands-on testing of the shortlist. We took our largest repository and set up the pipeline with each vendor. Through this quick prototyping, we could see how the pipelines would look in reality and whether our goals were realistic.
The final step was a larger-scale evaluation with actual teams. We selected a few teams and migrated their repositories to one of the CI platforms, which they then used for a few weeks. Afterwards, we collected their feedback and did a few rounds of evaluation and discussions. Eventually, we settled on TeamCity Cloud because it was the most feature-rich, allowed us to build our own DSL, and had simple integration with our cloud provider. Shortly after, we started implementing the final setup.
So what does the final setup look like?
Now the juicy part! 😀 Here is the overview of our solution:
1. TeamCity configures a webhook in GitHub, which notifies TeamCity when a change is made.
2. According to our internal policy, each project has versioned settings enabled, so TeamCity executes the Kotlin scripts from the repositories being built to get the actual steps the builds should execute. To do that, it needs to fetch our build DSL library from the registry.
3. Once the configuration is clear, TeamCity provisions an agent in our AWS account using a cloud profile. That agent then executes the build.
4. The build agent executes the steps according to the configuration compiled from the repo settings. Usually, this involves resolving the dependencies of the project being built and uploading the newly produced artifacts.
5. Finally, TeamCity communicates the status of the commit back to GitHub.
Let’s dive into the most important and interesting parts — the integration with the cloud provider (step 3 in the overview) and the internal DSL we use to declare standard projects (step 2 in the overview).
Provisioning the agents
We configured an IAM user that can start instances in our AWS account; the TeamCity documentation has a detailed list of the required permissions. The cloud profiles refer to the latest version of our AWS launch templates. This way, to upgrade the build agent image we only need to create a new version of the launch template, and TeamCity picks it up right away. Setting this up is a one-time configuration: we use Terraform to configure the AWS side, and we configured the cloud profiles in TeamCity manually.
Creating build agent AMI
We use Packer in combination with Ansible to build our agent images. Packer takes care of creating a source instance and, once it’s available, hands control to the Ansible playbook to configure everything we want, eventually creating a new build agent AMI from that instance.
When a new AMI is ready, an engineer updates the corresponding Terraform configuration, which creates a new version of the launch template, and TeamCity picks that up automatically.
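To make the "picks up the latest version" behaviour concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the AWS or TeamCity API: the `LaunchTemplate` class, its methods, and the AMI IDs are hypothetical. The only things borrowed from AWS are the ideas that a launch template is versioned and that a consumer can reference the `$Latest` version instead of a fixed number.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchTemplate:
    """Toy model of a versioned launch template (not the real AWS API)."""
    name: str
    versions: list[str] = field(default_factory=list)  # one AMI ID per version

    def create_version(self, ami_id: str) -> int:
        """Append a new version and return its 1-based version number."""
        self.versions.append(ami_id)
        return len(self.versions)

    def resolve(self, version: str) -> str:
        """Return the AMI ID for a version number or for the "$Latest" alias."""
        if not self.versions:
            raise LookupError(f"{self.name} has no versions")
        if version == "$Latest":
            return self.versions[-1]
        index = int(version) - 1
        if not 0 <= index < len(self.versions):
            raise LookupError(f"{self.name} has no version {version}")
        return self.versions[index]


# Hypothetical AMI IDs, for illustration only.
template = LaunchTemplate("build-agent")
template.create_version("ami-old")
pinned, floating = "1", "$Latest"

template.create_version("ami-new")  # the kind of change a Terraform update makes

print(template.resolve(pinned))    # a pinned reference still gets the old image
print(template.resolve(floating))  # a floating reference gets the new image
```

The point of the floating reference is that rolling out a new agent image needs a change in one place only; the trade-off is that a broken image also reaches every new agent immediately.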
TeamCity uses an extremely powerful Kotlin DSL to describe the configuration, allowing you to configure every aspect of your build pipeline. While this offers a lot of flexibility, it does make the learning curve quite steep and the final configuration of a build is quite verbose.
With versioned settings enabled, TeamCity’s configuration is just a regular Maven project with a Kotlin script. We used this opportunity to build a small library of reference projects and builds. It allows teams to bootstrap a new repo with default pipelines faster.
Under the hood, javaProject declares 3 build configurations that would build the application using Maven, run all unit and component tests (if they exist), and push the resulting Docker image. It will also run relevant automated analysis and post the status back to GitHub.
If you downloaded the settings of this project in Kotlin format, the file would contain 1300 lines of code. Of course, some refactoring and sensible re-use would get it down to, let’s say, 500 or even 300 lines of code, but that is still not succinct enough, and it would repeat lots of boilerplate in every repository.
With our own DSL, the boilerplate is hidden and the reference build pipelines are implemented in a shared repository. Any engineer within our company can open a PR to improve it. Together with Renovate, this enables the rollout of new features across all repositories in a very short time.
Engineers still have full control over their pipelines. Our DSL has hooks to get to the underlying Kotlin DSL, so everything that is possible with Kotlin DSL is still possible with our DSL.
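The idea of a compact declaration that expands into reference pipelines, with a hook for customization, can be illustrated outside TeamCity. The sketch below is written in Python rather than Kotlin to keep it self-contained, and nothing in it is the actual library: the `java_project` function, the `BuildConfig` class, the `customize` hook, the repository name, and the step strings are all hypothetical, and the split into build, test, and publish configurations is a guess.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class BuildConfig:
    """One build configuration: a name plus an ordered list of steps."""
    name: str
    steps: list[str] = field(default_factory=list)


def java_project(
    repo: str,
    customize: Optional[Callable[[list[BuildConfig]], None]] = None,
) -> list[BuildConfig]:
    """Expand a one-line declaration into three default build configurations.

    The step strings are placeholders, not real commands.
    """
    configs = [
        BuildConfig(f"{repo}: build", ["mvn package"]),
        BuildConfig(f"{repo}: test", ["unit tests", "component tests"]),
        BuildConfig(f"{repo}: publish", ["push Docker image"]),
    ]
    if customize is not None:
        # Escape hatch: the caller receives the full objects and may change anything.
        customize(configs)
    return configs


def add_lint(configs: list[BuildConfig]) -> None:
    """Example hook that puts an extra step at the start of the first configuration."""
    configs[0].steps.insert(0, "lint")


default = java_project("shop-api")
custom = java_project("shop-api", customize=add_lint)

print([c.name for c in default])
print(custom[0].steps)
```

Because the defaults live in one shared function, improving them there changes every repository that calls it, while the hook keeps unusual projects from having to fork the whole definition.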
Did we achieve our goals?
Of course, we did! Otherwise, the title of this post would have been “Why you should not migrate to TeamCity” 😂
Most of our services use the reference pipelines provided in our DSL library. There are a few projects with advanced requirements or unusual tech stacks — these can only use some basic parts from our DSL and the rest is custom.
Build and queue time
The same repo that used to take 45 min to build is now ready in 10 minutes on average. The main reason for that is more powerful build agents. We are in full control of the underlying hardware, so we provision Compute Optimized instance types because the Java builds are mostly CPU bound. Most CI vendors we checked use General Purpose instance types, which have less suitable CPU/Memory ratios.
To be fair, not all our builds received such a boost. In some cases, resources were not the limiting factor, so we still have projects with build times past the 20-minute mark.
The regular queue time is now between 2 and 3 min. No rocket science here: we have a higher concurrency limit, so builds rarely have to wait.
Half of the queue time is spent waiting for an agent to be provisioned in our AWS environment. This is something we want to improve by either re-using the agents or keeping a few warmed-up agents waiting for builds.
Note: when you use TeamCity-managed agents they start nearly immediately because TeamCity provisions default agents in advance, so there is always a small pool of agents waiting to pick up your build.
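How a concurrency limit and agent provisioning each contribute to queue time can be illustrated with a small simulation. This is a sketch under made-up assumptions: every build takes the same time, each build provisions a fresh agent, and the function `queue_waits` and all numbers are hypothetical rather than measurements from the setup described here.

```python
import heapq


def queue_waits(arrivals, build_minutes, agents, startup_minutes=0.0):
    """Return the wait (in minutes) before each build starts running.

    arrivals: sorted arrival times in minutes.
    agents: maximum number of concurrent builds.
    startup_minutes: time to provision an agent before each build;
        0 models a pre-warmed agent.
    """
    free_at = [0.0] * agents  # min-heap of times at which each slot frees up
    heapq.heapify(free_at)
    waits = []
    for arrival in arrivals:
        slot_free = heapq.heappop(free_at)
        start = max(arrival, slot_free) + startup_minutes
        waits.append(start - arrival)
        heapq.heappush(free_at, start + build_minutes)
    return waits


# Hypothetical burst: 6 builds arrive at once, each takes 10 minutes.
burst = [0.0] * 6

print(queue_waits(burst, 10, agents=2))
# [0.0, 0.0, 10.0, 10.0, 20.0, 20.0]
print(queue_waits(burst, 10, agents=6, startup_minutes=1.5))
# [1.5, 1.5, 1.5, 1.5, 1.5, 1.5]
```

In the first call, the low concurrency limit dominates the wait. In the second, nothing queues behind other builds, and the only remaining wait is the provisioning time, which is the part a pool of warm agents would remove.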
After the migration we have a few ongoing maintenance tasks:
- Review PRs to upgrade or install the tooling on the build agents.
- Build and roll out new build agent images.
- Release new versions of the DSL library.
Combined, they take up a few days a quarter. While most could be automated, for the time being, we decided to focus on more urgent projects.
By now we have 135 repositories with 200 projects defined in them, and we haven’t noticed any signs that would suggest we can’t grow further by a factor of 2 or even 10. All we need is to add more credits, and the platform with our setup takes care of the rest.
How long did it take?
It did not happen overnight for sure. The migration of the most active repositories took around 3 months followed by another 3 months for the repositories with less active projects.
The outline of all stages
- We spent around two months distilling the requirements and getting to the shortlist.
- A month for hands-on experiments.
- PoCs with the teams took around two and a half months.
- A few weeks to set up things in the production environment.
- 80% of repositories were migrated within 3 months.
- It took another 3 months to migrate the rest.
A story in case you want to build PRs the way we do it
TL;DR: Do not use refs/pull/*/merge to trigger your builds.
We use pull requests and code reviews a lot at Picnic, so we configure our CI to build the pull requests. TeamCity has a Pull Requests build feature exactly for that. As the documentation says, it builds pull/*/head git references, which point exactly at the head of the PR source branch.
While this makes total sense, it’s not exactly what we want. We care less about whether the changes on the feature branch are OK and much more about whether the result of merging those changes into the target branch is OK.
We found that GitHub also maintains pull/*/merge git references, which is exactly what we were interested in. Those references are updated each time you push things to your branch and are not updated if you have a merge conflict. Sounds perfect, but as we learned the hard way, it’s actually too perfect.
What is not obvious about the merge references is that they are updated much more often than you would expect: they are also updated whenever something is merged into the target branch. As a result, lots of our engineers were confused because their PRs seemed to be rebuilt at random. It also affected our queues, because some of our repositories have quite a long list of open PRs.
Now we trigger PR builds on changes in pull/*/head references and merge that branch into the target branch of the PR ourselves as part of the build. For debugging purposes, we also attach the diff as a build artifact and push our own git reference for the history.
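The approach can be sketched in Python as two small helpers: one that decides whether a ref update should trigger a build, and one that lists the git commands for merging a PR head into its target branch. This is an illustration, not the configuration we run: the function names, the default remote name `origin`, and the exact merge flags are assumptions, although the git commands themselves are standard.

```python
import re

# GitHub publishes a head ref and a merge ref per pull request;
# only updates to the head ref should trigger a build.
HEAD_REF = re.compile(r"^refs/pull/(\d+)/head$")


def pr_number_to_build(updated_ref):
    """Return the PR number if the updated ref should trigger a build, else None."""
    match = HEAD_REF.match(updated_ref)
    return int(match.group(1)) if match else None


def merge_commands(pr_number, target_branch, remote="origin"):
    """Git commands that reproduce the merge result locally on a build agent.

    Returned as argument lists so they can be passed to subprocess.run.
    The merge step creates a commit, so a committer identity must be configured.
    """
    return [
        # Check out the current tip of the target branch without moving any branch.
        ["git", "fetch", remote, target_branch],
        ["git", "checkout", "--detach", "FETCH_HEAD"],
        # Fetch the PR head and merge it; a conflict makes the merge command fail.
        ["git", "fetch", remote, f"refs/pull/{pr_number}/head"],
        ["git", "merge", "--no-ff", "--no-edit", "FETCH_HEAD"],
    ]


print(pr_number_to_build("refs/pull/42/head"))   # 42
print(pr_number_to_build("refs/pull/42/merge"))  # None
for command in merge_commands(42, "main"):
    print(" ".join(command))
```

Filtering on the head ref means a merge into the target branch no longer rebuilds every open PR, at the cost of not noticing when a change on the target branch breaks a PR that has not been pushed to since.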
Although the migration took some time, it was a total success for us! Instead of just moving to another platform as-is, we took a step back, re-evaluated our CI setup, and implemented multiple improvements.
The only thing that I would do differently, in hindsight, is the trial of the shortlist by the teams. Our schedule was too ambitious. Each team tried two different CI platforms for two weeks each, which was too much new tooling in quite a short period of time. Besides that, we had 3 teams and 3 tools to evaluate, but each team only tried two tools out of 3, which made it harder to compare the feedback.
Our new setup opened many possibilities for us, which we want to explore now. Among the most important for us:
- Get rid of the queue time caused by provisioning the agent.
As we build more complicated build chains, the queue time adds up.
- Improve the observability of our CI.
We want to collect stats across all the projects so that we can focus on improving performance in the slowest parts of our setup.