
Bye Travis CI, hi TeamCity!

Ivan Babiankou, 8 Feb 2023, 9 min read

Why did we even start?

  1. Builds are too slow
    It would take around 45 minutes (45 minutes, Carl!) to build our largest repo. And that was after we had already fixed a few obvious bottlenecks in the old setup.
  2. Time in the queue is too long
    In the afternoon, your build could sit for over an hour waiting to start. We had a limit of 20 concurrent builds, which clearly was not enough anymore.
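
To see why a fixed concurrency limit turns into queue time, here is a minimal sketch of a build queue. The 45-minute build and the limit of 20 come from the numbers above; the arrival rate of one build per minute is purely an assumption for illustration, and `queue_waits` is a hypothetical helper, not part of any CI tool.

```python
import heapq


def queue_waits(arrivals, duration, slots):
    """Return how long each build waits (in minutes) before it starts.

    arrivals: sorted arrival times in minutes
    duration: length of every build in minutes (assumed equal for simplicity)
    slots:    how many builds may run at the same time
    """
    # Min-heap holding the time at which each slot becomes free.
    free_at = [0] * slots
    heapq.heapify(free_at)

    waits = []
    for arrival in arrivals:
        # Take the slot that frees up first.
        slot_free = heapq.heappop(free_at)
        start = max(arrival, slot_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + duration)
    return waits


# Assumed load: one build arrives every minute for two hours.
arrivals = list(range(120))
limited = queue_waits(arrivals, duration=45, slots=20)
roomy = queue_waits(arrivals, duration=45, slots=45)

print("max wait with 20 slots:", max(limited))
print("max wait with 45 slots:", max(roomy))
```

Under these assumptions the wait keeps growing for as long as builds arrive faster than 20 slots can finish them, while 45 slots (arrival rate times build duration) keep the queue empty.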

What goals did we have?

  1. We should standardize our build pipelines.
    There should be reference pipeline definitions that are easy to use, but teams should still be able to customize everything.
  2. Time to feedback should be short.
    Queuing and building together should be under 15 min for an “average” build.
  3. Maintenance should be low.
    CI is not our core business. We want a managed service that is as flexible as possible without us having to spend months, or even weeks, keeping it running.
  4. All our pipelines should be configured as code and versioned.
    The main drivers for this are auditability, knowledge sharing, and history of changes.
  5. Scalability should come out of the box.
    We should have no limits besides the depth of our wallet. No matter how much we grow or add to our CI, it should be able to scale with us. Ideally with (sub-)linear cost growth.

What alternatives did we consider and why did we choose TeamCity?

So what does the final setup look like?

  1. TeamCity configures a webhook in GitHub, which notifies TeamCity when a change is made.
  2. According to our internal policy, each project has versioned settings enabled, so TeamCity executes the Kotlin scripts from the repositories being built to get the actual steps the builds should execute. To do that, it needs to fetch our build DSL library from the registry.
  3. Once the configuration is clear, TeamCity provisions an agent in our AWS account using a cloud profile. That agent then executes the build.
  4. The build agent executes the steps according to the configuration compiled from the repo settings. This usually involves resolving the dependencies of the project being built and uploading the newly produced artifacts.
  5. Finally, TeamCity communicates the status of the commit back to GitHub.
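
The five steps above can be sketched as a toy model. This is not TeamCity or GitHub code: every name here (`Build`, `handle_webhook`, `repo_settings`, the agent naming) is hypothetical and only mirrors the order of the steps, with plain Python standing in for the real webhook, Kotlin settings, cloud profile, and commit status API.

```python
from dataclasses import dataclass, field


@dataclass
class Build:
    """One build triggered by a change in a repository (illustrative only)."""
    repo: str
    commit: str
    steps: list = field(default_factory=list)
    agent: str = ""
    status: str = "queued"
    log: list = field(default_factory=list)


def load_versioned_settings(build, repo_settings):
    # Step 2: the steps come from settings stored in the repo being built.
    build.steps = list(repo_settings[build.repo])
    build.log.append("settings loaded")


def provision_agent(build):
    # Step 3: a fresh agent is started for this build (here it is just a name).
    build.agent = f"agent-for-{build.commit[:7]}"
    build.log.append("agent provisioned")


def run_steps(build, runner):
    # Step 4: run each step in order and stop at the first failure.
    for step in build.steps:
        if not runner(step):
            build.status = "failure"
            build.log.append(f"step failed: {step}")
            return
    build.status = "success"
    build.log.append("all steps passed")


def report_status(build):
    # Step 5: the payload that would be reported back for the commit.
    return {"repo": build.repo, "sha": build.commit, "state": build.status}


def handle_webhook(event, repo_settings, runner):
    # Step 1: a webhook event announces that a change was made.
    build = Build(repo=event["repo"], commit=event["commit"])
    load_versioned_settings(build, repo_settings)
    provision_agent(build)
    run_steps(build, runner)
    return build, report_status(build)


# Hypothetical repository and steps, with a runner that always succeeds.
settings = {"shop/api": ["resolve dependencies", "test", "upload artifacts"]}
event = {"repo": "shop/api", "commit": "0123456789abcdef"}
build, status = handle_webhook(event, settings, runner=lambda step: True)
print(build.log)
print(status)
```

Keeping the steps in the repository (here, the `settings` dictionary keyed by repo) is what makes each build reproducible from a given commit; the server only orchestrates.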

Provisioning the agents

Creating build agent AMI

TeamCity DSL

Example of a configuration using our DSL
Example of adding a custom build step using our DSL

Did we achieve our goals?

Standard pipelines

Build and queue time

Maintenance

  • Review PRs to upgrade or install the tooling on the build agents.
  • Build and roll out new build agent images.
  • Release new versions of the DSL library.

Scalability

How long did it take?

  1. We spent around two months distilling the requirements and getting to the shortlist.
  2. A month for hands-on experiments.
  3. PoCs with the teams took around two and a half months.
  4. A few weeks to set up things in the production environment.
  5. 80% of repositories were migrated within 3 months.
  6. It took another 3 months to migrate the rest.

A story in case you want to build PRs the way we do it

Solution

Conclusion

  1. Get rid of the queue time caused by provisioning the agent.
    We are building more complicated build chains, so the queue time adds up.
  2. Improve the observability of our CI.
    We want to collect stats across all the projects so that we can focus on improving performance in the slowest parts of our setup.

Thanks for reading!
