This article is the second in a series of three. It outlines in detail the problems we faced and our journey to find a solution for each of them. If you haven’t read the first article, I suggest you read it first to get some context. The next article will describe the actual work of building the CI infrastructure in the cloud with Genymotion Cloud and GCP.
In the previous article, we were facing problems with our Continuous Integration setup.
Problem #1: Job queue time
The BBM Android team had around 30 engineers by the end of 2017, spread across Indonesia, Singapore, Canada, and Vietnam. Because pair programming is the norm at our company, that means around 15 pairs, each working on a separate stream of work at the same time.
Each pair works on their own branch and creates a pull request (PR), and a job to run automated tests for their changes is then generated automatically by a Multibranch Pipeline job. As we only have a limited number of agent nodes to run tests for branches with a PR, the test job often sits in the queue for around 30 to 60 minutes. If a problem with the agent nodes requires a restart, the queue grows, as there are no agent nodes available to run the jobs. After the agent nodes are restarted, jobs are sometimes executed only after 1 to 2 hours.
This queue problem prevents pairs from taking on the next task in the backlog effectively, as they need to keep checking that the tests for their PR are passing. If the changes break any functionality, the pair needs to switch context back to the previous task and make changes. And after making the necessary changes, the pair most likely has to wait quite a long time before the tests run again.
Goal: Start job as soon as possible!
Problem #2: Pipeline job runtime
The pipeline job for running automated tests consists of 14 stages. The stages are:
In total, the pipeline job took a little over an hour to complete.
Goal: Shorten job run time by half!
Problem #3: Network speed
As all agent nodes are located in KMK’s headquarters in Jakarta, Indonesia, the network connection used by those nodes is shared with hundreds of other employees from all departments.
Sometimes, jobs are aborted automatically because the agent node cannot download the required source code or third-party libraries. Isn’t it frustrating to see the job for your PR aborted after waiting an hour in the queue to get started? At better times, the test job does complete, but only after 1.5 hours because the download speed is slow.
Our office uses a few internet service providers to cater to everyone’s needs. But from time to time, one or more providers have a problem with their service, which slows down the network.
Goal: Stable and faster download speed!
Problem #4: Emulator stability
At times, the Genymotion emulators running on our GNU/Linux boxes become unstable for unknown reasons. After being used and reused multiple times, one or more emulators become unusable. Sometimes an emulator is stuck and you can’t interact with it. Other times, a random Activity under test is stuck in the foreground, though it can be dismissed. As a result, tests fail randomly here and there. We’re still unsure what causes this issue, though. Our theories so far point to CPU usage or memory management on the agent nodes.
Goal: Fresh emulators, please?
Problem #5: Power outage and manual agent nodes launching process
Yes! It has happened a few times. When it does, someone needs to physically visit the place where all agent nodes are located and ensure they all start up and are running. Otherwise, CI won’t be available to any engineer.
Did I mention that we only have one display for all machines? Imagine all the trouble of plugging and unplugging the display cable for each machine.
Goal: Automatically launch agent nodes!
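Let’s take a look at the five goals we’re trying to achieve: start jobs as soon as possible, cut job run time in half, get stable and faster download speeds, run tests on fresh emulators, and launch agent nodes automatically.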
It is obvious that scaling our existing CI infrastructure to achieve all that is not possible without making a radical change. So, what options do we have? First, we could explore a managed CI service provided by a well-known company. Or, we could rebuild our CI infrastructure in the cloud and have fun while doing that. We decided to go with the latter!
There are two key decisions in moving to the cloud. The first is the cloud provider where the master and agent nodes run. The second is the Android emulators used by all agent nodes to run instrumentation tests.
Deciding on the cloud provider was easy. As KMK is already a Google Cloud Platform (GCP) customer, it was the obvious choice. But when it came to how and where we run our Android emulators, we had a lot of homework to do.
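We had a few options in choosing an Android emulator: a device farm service such as AWS Device Farm or Firebase Test Lab, the Android emulator in headless mode with xvfb, and Genymotion.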
AWS Device Farm and Firebase Test Lab do not fit our needs, as they are designed to run the same tests on multiple devices. In our case, we split our automated tests according to the number of emulators we have, and run them all in parallel using adb shell commands. We scratched this option off and moved on to the next one.
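To make the split described above more concrete, here is a minimal sketch of one way to do it. It is an illustration under assumptions, not the exact script from our pipeline: it assumes the tests run through AndroidJUnitRunner, whose `numShards` and `shardIndex` arguments divide a test suite into shards, and the emulator serials and the test package name are hypothetical placeholders.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List

# Assumption: tests are run by AndroidJUnitRunner, which accepts the
# numShards / shardIndex instrumentation arguments.
DEFAULT_RUNNER = "androidx.test.runner.AndroidJUnitRunner"


def build_shard_commands(
    serials: List[str], test_package: str, runner: str = DEFAULT_RUNNER
) -> List[List[str]]:
    """Build one `adb shell am instrument` command per emulator.

    Each emulator gets its own shard index, so every test runs on
    exactly one emulator.
    """
    num_shards = len(serials)
    commands = []
    for shard_index, serial in enumerate(serials):
        commands.append([
            "adb", "-s", serial, "shell", "am", "instrument", "-w",
            "-e", "numShards", str(num_shards),
            "-e", "shardIndex", str(shard_index),
            f"{test_package}/{runner}",
        ])
    return commands


def run_in_parallel(commands: List[List[str]]) -> List[str]:
    """Run all shard commands at the same time and return their output.

    Parsing the instrumentation output for failures is left out here.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        results = pool.map(
            lambda cmd: subprocess.run(cmd, capture_output=True, text=True),
            commands,
        )
        return [result.stdout for result in results]
```

With two emulators attached, `build_shard_commands(["emulator-5554", "emulator-5556"], "com.example.app.test")` yields two commands, one with `shardIndex 0` and one with `shardIndex 1`, which `run_in_parallel` then starts together.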
Next, we tried running the Android emulator in headless mode with xvfb. During our testing period, we found that the tests were not stable. At times, tests would simply get stuck and never complete. So we parked this option and explored the last option on our list.
And finally, we turned to Genymotion. Genymotion is a cross-platform Android emulator with a set of sensors and features designed to help app developers and testers run their automated tests in a virtual environment. Besides the desktop version, which is what our existing CI configuration uses, Genymotion also provides a cloud version called Genymotion Cloud.
Genymotion Cloud comes in two options: PaaS and SaaS. Genymotion Cloud PaaS allows you to spin up an instance on most well-known cloud providers, such as GCP, AWS, or Alibaba. Genymotion Cloud SaaS, on the other hand, allows you to create emulators and start using them right away, all managed by the Genymotion team.
From the spike we did, we found Genymotion Cloud SaaS to be very reliable. We can create, start, and destroy emulators quickly and easily. Across the dozens of instrumentation test sessions we ran, every one of them ran perfectly fine. It also provides tools that are easy to use and have all the functionality we need to achieve what we want to do.
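Being able to create, start, and destroy emulators on demand is what makes the "fresh emulators" goal reachable: every job can get its own throwaway emulator. The sketch below shows only that pattern. `start_emulator` and `stop_emulator` are hypothetical callables standing in for whatever Genymotion Cloud tooling is used; they are not real Genymotion API calls.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def fresh_emulator(
    start_emulator: Callable[[], str],
    stop_emulator: Callable[[str], None],
) -> Iterator[str]:
    """Give the caller a brand-new emulator and always destroy it afterwards.

    start_emulator: hypothetical callable that creates and boots an
        emulator, returning its identifier.
    stop_emulator: hypothetical callable that destroys the emulator
        with the given identifier.
    """
    emulator_id = start_emulator()
    try:
        yield emulator_id
    finally:
        # Runs even if the tests raise, so no used emulator is left behind.
        stop_emulator(emulator_id)
```

A test job would wrap its run in `with fresh_emulator(start, stop) as emulator_id:`, so the emulator is never reused by a later job, even when the tests fail.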
Genymotion Cloud + GCP = 💖
How does combining Genymotion Cloud and GCP solve our problems, theoretically?
Here is the simplified version of the pipeline process.
In the next article, I will go into detail on the step-by-step process of rebuilding our CI infrastructure in the cloud.