The Data Metro – Automating Data Workflows Using Jenkins and EC2

Deep Learning AI Improving Workflows

Jacob Platin, Christopher Ye, Charles An – Aidoc A.I. Software Engineering Interns

Every summer, Aidoc hosts international interns who have the opportunity to take on active roles on our AI, R&D and Business teams. These individuals gain hands-on work experience, learn new skills and become better prepared for employment by working alongside our highly skilled team. During the summer of 2019, Aidoc had the opportunity to host Jacob Platin from the TAMID Internship program, and Christopher Ye and Charles An from the Princeton Startup Immersion Program (PSIP). Aidoc is proud to work with these internship programs and to have the opportunity to collaborate with those who are leading the way in excellence. Continue reading for their insights on ‘The Data Metro’.

Designing workflows has always been a tricky task: how do we effectively move from sophisticated programmatic pipelines that connect a variety of services to an efficient, streamlined, and accessible process that presents end-users with a clean user interface? An additional challenge facing a variety of high-tech companies, particularly those in the Machine Learning realm, is efficiently and seamlessly obtaining an appropriate amount of computing power to complete these pipelines.   

During our time at Aidoc Medical in Tel Aviv, Israel, we encountered these challenges and more first-hand.  Aidoc is a diverse, established company, and the employees we interacted with reflected that.  We met programmers, AI experts, radiologists, data analysts, marketers, and more.  But what we found refreshing was the amount of cross-team interaction, especially between the data analysis and AI teams.  Since we were interning in the AI department, we believed that our work would only pertain to this department, but we constantly found ourselves talking to members of the data and DevOps departments.

As interns, we were tasked with automating a variety of existing workflows within the company, including programs that retrieved metadata from CT scans, displayed scans in an online viewer, and ran medical prediction algorithms.  We began by cleaning up the existing code (with Bob Martin’s “clean code” principles in mind) by abstracting common functions, writing single-purpose functions, and structuring the code directory to enhance readability.  To read more about how we ensured the quality of our code, please refer to this article.  With clean code in place, we then faced the difficult task of automating the code pipelines. Given that these processes needed to be run on-demand and with adequate power, we needed a robust cloud-computing solution, so we chose EC2.

We continued in pursuit of our overall mission to automate and streamline our workflows by first running the relevant code manually on EC2 instances.  Since we only had access to the cloud computers through a terminal window, we decided to simplify the framework through which we called the processes.  Specifically, we used Python’s argparse module to create a usable, standardized command-line interface for each task. We began by simply reading in parameters in a prescribed order, but the need to handle null parameters, together with the sheer number of parameters required by certain tasks (some 10 or more), meant that keyword-specified parameters were far easier to use.
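As an illustration of the keyword-specified style described above, here is a minimal argparse sketch. The task and every flag name (`--input-dir`, `--output-file`, `--max-scans`, `--overwrite`) are hypothetical and chosen only for the example; they are not the parameters of the actual Aidoc tools.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a keyword-based CLI for a hypothetical metadata-extraction task."""
    parser = argparse.ArgumentParser(
        description="Extract metadata from CT scans (illustrative only)."
    )
    # Every parameter is named, so the order on the command line does not matter.
    parser.add_argument("--input-dir", required=True,
                        help="Directory containing the scans.")
    parser.add_argument("--output-file", default="metadata.csv",
                        help="Where to write the extracted metadata.")
    # An omitted optional parameter falls back to None instead of shifting
    # the position of every parameter after it.
    parser.add_argument("--max-scans", type=int, default=None,
                        help="Optional cap on the number of scans.")
    parser.add_argument("--overwrite", action="store_true",
                        help="Replace an existing output file.")
    return parser


# Parse an explicit argument list here so the example does not depend on sys.argv.
example_args = build_parser().parse_args(
    ["--max-scans", "10", "--input-dir", "/data/scans"]
)
print(example_args.input_dir, example_args.max_scans, example_args.output_file)
```

With positional parameters, skipping an optional value in the middle of a ten-parameter call would require a placeholder; with named flags it is simply left out.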

Moreover, we also developed an FSx-based storage system that not only allowed us to improve efficiency by relying on a designated storage endpoint, but also prevented the storage of redundant data through the use of a common data volume for all EC2 instances. In particular, when downloading files from object storage, we would preserve the file structure of the data in the cloud, and thus duplicate (and therefore unnecessary) downloads were easy to identify and eliminate.
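The duplicate check described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our production code: object keys are assumed to be slash-separated strings, and the helper names `local_path_for` and `keys_to_download` are hypothetical.

```python
from pathlib import Path, PurePosixPath
from typing import Iterable, List


def local_path_for(key: str, root: Path) -> Path:
    """Map an object-storage key to a path under the shared volume,
    keeping the key's folder structure intact."""
    parts = PurePosixPath(key.lstrip("/")).parts
    return root.joinpath(*parts)


def keys_to_download(keys: Iterable[str], root: Path) -> List[str]:
    """Return only the keys whose mirrored file is not yet on the shared volume."""
    return [key for key in keys if not local_path_for(key, root).exists()]
```

Because every instance mounts the same volume and uses the same key-to-path mapping, a file fetched by one instance is automatically skipped by all the others.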

Indeed, selecting the storage system was an immense task. We initially chose Elastic Block Store (EBS) because each time an EC2 instance is created, an EBS volume is created along with it and used as the instance’s root volume. Thus, we believed that EBS would be the simplest to work with and the best place to start. EBS provides durable, block-level storage volumes that can be attached to an EC2 instance. An EBS volume acts like a physical hard drive in the sense that it can be attached to and detached from an instance and still keep its contents. Setting up EBS and using it in our pipeline was rather straightforward, since AWS storage solutions and EC2 are tightly integrated.  First, we created an EBS volume of the desired size in the EBS management console. Second, we ensured that the volume and the relevant instance were in the same Availability Zone, which EBS requires before a volume can be attached.  Third, once the volume was attached to the instance, we mounted it to a directory on the instance in order to view and edit its contents.

The tasks that we were running required different types of instances to perform efficiently. For example, extracting metadata from a CT scan is best suited to an instance with a powerful CPU, whereas running the prediction algorithm on the scans is best suited to an instance with a powerful GPU. In the final pipeline, we wanted the ability to select which CT scans should be sent to prediction using the metadata extracted by a CPU instance, and then run the prediction algorithm on those scans with a GPU instance. To accomplish this, separate instances needed access to the same storage volume, and therefore to the same scans. An EBS volume can only be attached to one instance at a time, but it can be detached from one instance and attached to another. This would allow multiple instances to access the same storage volume, but it raises two problems: detaching and re-attaching EBS volumes can be a costly operation that hurts efficiency, and when an EBS volume runs out of storage space it must be manually resized, which can also be costly. We thus considered EFS, an alternative to EBS, in order to solve these issues.  EFS file systems can be mounted on multiple instances at the same time and have elastic storage capacity, meaning that they grow and shrink automatically based on the amount of stored data. These features made it simple to have multiple instances access the same data, and the steps to set up an EFS file system are effectively the same as those for EBS.
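The hand-off between the CPU and GPU stages can be sketched as a simple filter over metadata records. The record fields (`path`, `body_part`, `slice_count`) and the selection rule below are hypothetical, since this post does not describe the actual metadata schema or criteria.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ScanMetadata:
    # Hypothetical fields, used only for illustration.
    path: str          # location of the scan on the shared volume
    body_part: str
    slice_count: int


def select_for_prediction(
    records: Iterable[ScanMetadata],
    keep: Callable[[ScanMetadata], bool],
) -> List[str]:
    """CPU-side step: return the paths of scans the GPU-side step should process."""
    return [record.path for record in records if keep(record)]


records = [
    ScanMetadata("/shared/scans/a", "head", 120),
    ScanMetadata("/shared/scans/b", "chest", 300),
    ScanMetadata("/shared/scans/c", "head", 40),
]

# Example rule: only head scans with at least 100 slices go to prediction.
selected = select_for_prediction(
    records, lambda r: r.body_part == "head" and r.slice_count >= 100
)
print(selected)
```

Since both stages read from the same shared volume, the CPU instance only needs to pass along paths; no scan data has to be copied between instances.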

On paper, the EFS file system thus looked like a promising option. Because its baseline aggregate throughput scales with the file system size, the more data that is stored in the file system, the faster the read and write times will be. We could avoid the slow initial speeds of an empty file system by pre-loading it with data.  However, in practice, we discovered that the speeds of an EFS file system were, in general, too slow for our purposes. As a result, we finally considered the FSx file system. An FSx file system can be attached to multiple instances, and its performance is a significant upgrade over that of an EFS file system. The downside is that an FSx file system cannot be resized, but the significant improvement in performance was more valuable to us.

Total Prediction Time    1867 seconds    2434 seconds
Prediction Runtime       1577 seconds    1692 seconds
Write Runtime            206 seconds     354 seconds
A table comparing a variety of operations performed on the two storage types
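To make the gap easier to read, the figures in the table can be turned into relative differences. The short calculation below is only an illustration; it keeps the two columns in their original order, since the table itself does not label which storage type each column belongs to.

```python
# Runtimes in seconds, copied from the table above (first column, second column).
timings = {
    "Total Prediction Time": (1867, 2434),
    "Prediction Runtime": (1577, 1692),
    "Write Runtime": (206, 354),
}


def percent_slower(first: float, second: float) -> float:
    """How much longer the second column took, as a percentage of the first."""
    return (second - first) / first * 100


for operation, (first, second) in timings.items():
    print(f"{operation}: {percent_slower(first, second):.1f}% slower in the second column")
```

This works out to roughly 30% for the total time, 7% for the prediction itself, and 72% for writes, so most of the relative difference comes from write performance rather than from the prediction step.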

With our EC2 instances and processes configured properly, we next attempted to automate and consolidate the various pipelines through an intuitive command-line interface (CLI).  Using the Click library, we built the CLI and packaged it as an EXE file that could be deployed onto users’ local machines.  Unfortunately, given that we developed our code on Linux machines, creating the EXE entailed copying the relevant code to a Windows virtual machine and manually building the EXE file with PyInstaller.  This meant that enforcing version control was difficult and that any small hotfix, update, or new piece of code required a completely new version of the EXE, a cumbersome process that made this approach too inflexible and static.

The end product, however, proved to be an important first step towards automating the discussed pipelines.  What had once been five disparate tasks was now a single executable file.  However, the user experience was suboptimal.  Entering inputs on a command line proved difficult and prone to error, and we found ourselves constantly having to deploy new EXE files as new code was developed or old code was updated. In addition, EC2 instances were not launched on-demand for each separate task, so multiple different tasks could not be run simultaneously.  What we needed was a complete suite that provided both an easy-to-use user interface and the developer and infrastructure support necessary to deploy a company-wide solution for running multiple pipelines simultaneously.

After a bit of research, it became clear that Jenkins was a suitable option, especially given that Aidoc already had a strong Jenkins infrastructure in place and that Jenkins’ fully integrated platform (from Git to EC2) made our programming work simple. Jenkins provides a clean and easy-to-use interface for users unfamiliar with the terminal environment, while also handling EC2 resource management. In particular, instead of our having to manually create an instance and configure it to our needs, Jenkins launches an instance from the image we provide and runs a given start-up script to mount the necessary storage device. Finally, upon the completion of a task, Jenkins terminates the computing instance after a period of idle time to save computing time, and therefore money (those GPU instances get pretty expensive!).

By completely automating the process, Jenkins also handles version control, pulling code from the Git source provided, as well as updating the Anaconda environment and activating it before running the Python script. Therefore, we could effectively develop our code without having to worry about wasting time and resources integrating it into the existing pipeline.  The beauty of Jenkins is that we could simply test our code, push it to the right branch and Jenkins would handle the rest: from booting up the instance and activating the correct environment to ensuring that the correct code was being run. 
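The per-job sequence described above (pull the code, update the environment, run the script) can be sketched as an ordered list of commands. This is an illustration written in Python rather than an actual Jenkins job definition; the branch, environment, file, and script names are hypothetical, and `conda run` stands in here for the activate-then-run step.

```python
import shlex
from typing import List


def job_commands(branch: str, env_name: str, env_file: str,
                 script: str, script_args: List[str]) -> List[List[str]]:
    """Return the commands a job would run, in order, for one task."""
    return [
        # 1. Make sure the requested branch is checked out and up to date.
        ["git", "checkout", branch],
        ["git", "pull", "--ff-only", "origin", branch],
        # 2. Bring the Anaconda environment in line with the environment file.
        ["conda", "env", "update", "--name", env_name, "--file", env_file],
        # 3. Run the task's Python script inside that environment.
        ["conda", "run", "--name", env_name, "python", script, *script_args],
    ]


# Hypothetical values, printed rather than executed.
commands = job_commands(
    branch="main",
    env_name="pipeline",
    env_file="environment.yml",
    script="extract_metadata.py",
    script_args=["--input-dir", "/shared/scans"],
)
for command in commands:
    print(shlex.join(command))
```

Each command could be passed to `subprocess.run(command, check=True)` so that a failure at any step stops the job before the script runs against stale code or a stale environment.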

Finally, we had achieved the goal we had set for ourselves in our first days in June: distribute a clean, usable, and powerful user-facing interface to integrate Aidoc’s existing tasks into streamlined pipelines.  Our Jenkins solution, after unit testing and code reviews, was distributed to the company during the last days of our internship in August.  We immediately saw an improvement in user satisfaction and productivity compared with the existing CLI and especially the archaic MATLAB-based tools.  Users were now excited to run tasks, and a few users even remarked that, unlike before, they did not have to “babysit” tasks anymore by manually stringing together the outputs of various smaller processes.  These users validated the success of our Jenkins solution: tasks were now effectively in a streamlined pipeline that was easy to use and had increased performance.

With our finished product being used extensively on large-scale data throughout Aidoc, we think it’s quite easy to say that we thoroughly enjoyed our time at Aidoc.  We were able to, in three short months, develop an internal tool that will greatly expedite the on-boarding process for new hospitals, allowing Aidoc to offer its services to an increasing number of customers. For an intern project, we think that is an accomplishment to cherish.  In addition, we gained a first-hand look at the high-tech culture of Israel.  From talking with co-workers, we were truly astounded by the number who held advanced degrees and/or had served in some of Israel’s most elite military technological divisions.  We also learned about Israel from both locals and foreigners who had moved to Israel by making Aliyah, which expanded our cultural horizons.  Moreover, our boss, Rotem Rehani, facilitated our growth as aspiring computer scientists by constantly stressing the importance of both clean and efficient code through frequent code reviews and general feedback.  We can honestly say that we are now much better and more disciplined programmers than before our time at Aidoc, which is no less important than developing a substantive project as an intern.  Looking forward, we are excited by the direction in which Aidoc is moving, and we can’t wait to see what new markets and success await Aidoc!

