
The Data Metro - Applying the principles of “Clean Code”

Christopher Ye, Jacob Platin, Charles An – Aidoc A.I. Software Engineering Interns

Every summer, Aidoc hosts international interns who have the opportunity to take on active roles on our AI, R&D and Business teams. These individuals gain hands-on work experience, learn new skills and become better prepared for employment by working alongside our highly skilled team. During the summer of 2019, Aidoc had the opportunity to host Jacob Platin from the TAMID Internship program, and Christopher Ye and Charles An from the Princeton Startup Immersion Program (PSIP). Aidoc is proud to work with these internship programs and to have the opportunity to collaborate with those who are leading the way in excellence. Continue reading for their insights on ‘The Data Metro’.


In developing our code base (see our other blog post for a more general description of the system), the cleanliness and quality of our code were repeatedly emphasized. Indeed, our first task was to read Bob Martin’s Clean Code: A Handbook of Agile Software Craftsmanship. Because our repositories were so heavily used, they needed to be reliable and easily maintainable, whether that meant fixing an existing bug or adding a new feature. To achieve this, we relied on several key ideas and patterns.

SOLID Principles

Throughout our orientation, the SOLID Principles were emphasized as rules we had to adhere to. Of these, the Single-Responsibility and Open-Closed Principles proved the most prominent during our time at Aidoc.

At first, the Single-Responsibility Principle seemed strict, in that every function could do one and only one thing, which led Martin to the rule of thumb that no function should be more than five lines long. However, this principle was perhaps the most fundamental in improving the cleanliness of our code. With this principle and our best attempt at adhering to the five-line rule, our modules became more compact, and while the number of files and functions grew (a natural consequence of the principle was that each module shouldn’t perform more than one [albeit higher-level] function either), this was far preferable to having one gargantuan function that would become indecipherable over time. A great advantage we enjoyed by applying this principle effectively was the ability to edit, fix, or add code easily. Indeed, whenever there was an issue (with suitable testing), it was immediately obvious where the mistake was occurring. Since the intended effect of each function was also fairly easy to decipher, we caught mistakes easily, fixing them with little trouble.
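
To make the idea concrete, here is a rough sketch (the function names are hypothetical, not taken from our actual code base) of how one do-everything routine breaks down into small functions that each do exactly one thing:

def transfer_study(study_id):
    dicoms = download_study(study_id)
    relevant = remove_localizer_series(dicoms)
    upload_study(relevant)

def download_study(study_id):
    # Fetch the study's DICOM files from the source storage.
    ...

def remove_localizer_series(dicoms):
    # Drop series that are not needed for the analysis.
    return [d for d in dicoms if d.get('series_type') != 'localizer']

def upload_study(dicoms):
    # Persist the remaining series to the destination storage.
    ...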

This leads naturally to the Open-Closed Principle, which comes up again below in the context of the Factory Pattern. By effectively employing the Single-Responsibility Principle we successfully compartmentalized functionality into specific modules, meaning that we could minimize the impact of adding a new one. We could create a new module that interacted with previous modules without relying on any specific implementation, treating them as black boxes whose functionality we could simply use as required. As we will see with the Factory Pattern, this allowed us to add new code without having to modify any old material.
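
A minimal sketch of what this looks like in practice (the class and function names are illustrative, not our real interfaces): consuming code depends only on an abstraction, so a new implementation can be added without modifying it.

from abc import ABC, abstractmethod

class Storage(ABC):
    @abstractmethod
    def upload(self, file_path):
        ...

class GoogleCloudStorage(Storage):
    def upload(self, file_path):
        print('Uploading ' + file_path + ' to Google Cloud Storage')

def archive_files(storage, file_paths):
    # Depends only on the Storage abstraction, so adding a new backend
    # (another Storage subclass) never requires changing this function.
    for path in file_paths:
        storage.upload(path)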

Naming

In Clean Code, it is repeatedly emphasized that code requiring comments is a sign of messy code. In particular, it should be immediately clear what any module, function or line of code does. The easiest way to achieve this was to make our code read as much like English as possible, and the simplest improvement to make here was to our naming system.

Contrary to what we often saw in our classes, Clean Code encouraged descriptive naming over succinct naming. Of course, certain abbreviations such as np for numpy are so universal that there is no ambiguity in their meaning, but in most cases abbreviations lead to obfuscation. Thus, we generally tried to write our function and variable names in a way that approximated normal, grammatical English.

In many cases, even wrapping a one-line expression in a well-named function allowed the code to become more legible. For example,

def get_extension(file_name):
    return file_name.split('.')[-1]

This small (and simple) change nonetheless allowed us to make significant progress in writing clean, maintainable code. Even when editing others’ code, it was immediately clear what each function was supposed to do. 
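
A small usage example (with a made-up file name) shows the difference in legibility:

file_name = 'scan_001.dcm'

# Before: the reader has to decode the intent of the string slicing.
is_dicom = file_name.split('.')[-1] == 'dcm'

# After: the condition reads almost like English.
is_dicom = get_extension(file_name) == 'dcm'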

Utility Functions

Throughout the process, the modules we wrote required the same functionality again and again. Initially, this overlap was not obvious (perhaps because we did not plan our development strategy as closely as we should have, or did not communicate our respective plans well enough), but we soon noticed that when one of us ran into a problem, it was often one that another had already solved. Gradually, we built a habit of asking each other whether someone had already implemented a function we needed. Eventually, this brought forth our shared module, named Data Metro Utils. Here we collected the simplest, most commonly used functions, allowing the actual product modules to contain more conceptual, higher-level code. Of course, this also improved efficiency, increasing our development speed by avoiding redundant implementations.
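
The helpers collected there were along these lines (the specific functions below are illustrative guesses, not the actual contents of Data Metro Utils):

import json
import os

def get_extension(file_name):
    return file_name.split('.')[-1]

def is_dicom(file_name):
    return get_extension(file_name).lower() == 'dcm'

def load_json(path):
    with open(path) as json_file:
        return json.load(json_file)

def list_files(directory):
    return [os.path.join(directory, name) for name in os.listdir(directory)]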

Python Packages

While we needed this code to run with all our modules, we did not want to include a copy of it in each module. We also wanted to make updating the code (whether to fix bugs, adapt to new requirements, handle new use cases, or implement new utilities altogether) simple and efficient for our users. Thus, we turned the utilities repository into a Python package. After some struggles with twine, we eventually made this a reality and incorporated it into our pipeline, ensuring that any cloud computing instance installed the necessary packages (including our utility package) before executing any code.
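
For reference, a minimal packaging setup looks roughly like this (the package name and dependencies are placeholders, not our actual configuration); the distribution is then built and pushed to a package index with twine:

# setup.py (placeholder values)
from setuptools import setup, find_packages

setup(
    name='data-metro-utils',
    version='0.1.0',
    packages=find_packages(),
    install_requires=['numpy'],
)

# Built and uploaded with:
#   python setup.py sdist bdist_wheel
#   twine upload dist/*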

The Factory Pattern

Lastly, although not a general overarching principle, the Factory Pattern proved to be an invaluable tool in improving both the efficiency of the coding process and the quality of the final product. This was most obvious in our filtering module, but it also proved useful (and, we think, has the most potential) in the View Cases module.

In the former case, we used the factory to produce instances of filters. This was especially useful because we frequently needed to make new filters available, and because different filters had to be instantiated for every new run of the program, since different analyses may require different types of information.

Of course, each filter itself was an extremely simple program, yet the number of distinct filters required would have made the overall program too convoluted without this abstraction. Instead, with this implementation, we could simply pass each filter designated by the user to the factory, which returned the required instances. This allowed for clean, compact code and many small, simple files, each independent of the others, so that we could modify every filter independently as necessary. This was an excellent application of the Open-Closed Principle, making it very easy to modify or add programs without unnecessarily affecting other parts.
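
A condensed sketch of the pattern (the filter names and registry below are illustrative, not our production filters):

class ModalityFilter:
    def apply(self, cases):
        return [case for case in cases if case.get('modality') == 'CT']

class DateFilter:
    def apply(self, cases):
        return [case for case in cases if case.get('year', 0) >= 2018]

FILTER_REGISTRY = {
    'modality': ModalityFilter,
    'date': DateFilter,
}

def filter_factory(name):
    # Adding a new filter means writing a new class and registering it here;
    # no existing code has to change.
    return FILTER_REGISTRY[name]()

def apply_filters(cases, filter_names):
    for name in filter_names:
        cases = filter_factory(name).apply(cases)
    return cases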

In the latter case, while we only created one instance of a data transfer module (moving DICOM images from Azure to Google Cloud), we designed the module keeping in mind that other forms of data storage, such as Amazon Web Services, would be required in the future. In particular, the factory pattern we implemented meant that such new functionality could easily be added. Indeed, we saw this in practice: after implementing the transfer from Azure to Google Cloud, we still needed to move the DICOMs to Google Healthcare Storage. Relative to the first step, however, this took almost no time to implement, reaping the rewards of the Factory Pattern.
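
A simplified sketch of how that extensibility played out (the class names and registry are illustrative, not the actual transfer module): supporting the second destination amounted to one new class and one new registry entry.

class AzureToGoogleCloudTransfer:
    def transfer(self, dicom_path):
        print('Moving ' + dicom_path + ' from Azure to Google Cloud Storage')

class GoogleCloudToHealthcareTransfer:
    def transfer(self, dicom_path):
        print('Moving ' + dicom_path + ' into Google Healthcare Storage')

TRANSFER_REGISTRY = {
    'azure_to_gcs': AzureToGoogleCloudTransfer,
    'gcs_to_healthcare': GoogleCloudToHealthcareTransfer,  # the later addition
}

def transfer_factory(destination):
    # Existing callers are untouched when a new destination is registered.
    return TRANSFER_REGISTRY[destination]()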