A relatively recent paper by Baidu (Dec 2017), covered in The Morning Paper, empirically demonstrates something fascinating that could have major implications for the management of deep learning projects: deep learning output errors decrease predictably as a “power law” of the training set size, roughly error(m) ∝ m^β
(m is the number of samples in the training set, and the exponent β, "Beta", is usually between -1 and 0)
Below are three technical insights from Baidu’s promising studies and what they could mean for managing deep learning projects.
Insight 1: There are some specific caveats about the way this research and analysis were conducted (some of which I mention later*). Still, the studies find that you can use a relatively small amount of data to accurately extrapolate how much performance improves from adding X more data, without any extra research (except for hyper-parameter sweeps to increase model capacity). This can help companies prioritize and execute only the data efforts needed to reach the desired performance on time, and quantify how valuable it is to acquire and annotate X more data in each project.
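To make the extrapolation idea concrete, here is a minimal sketch, not the paper's actual procedure. It assumes the error follows error ≈ alpha * m**beta in the range you care about, where `alpha` is a scale constant I introduce for illustration. It also ignores the minimal achievable error, so it will be too optimistic close to that floor. The pilot numbers are synthetic, and the function names are my own.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit errors ~ alpha * sizes**beta with a straight line in log-log space."""
    beta, log_alpha = np.polyfit(np.log(sizes), np.log(errors), deg=1)
    return float(np.exp(log_alpha)), float(beta)


def predict_error(alpha, beta, size):
    """Extrapolated error at a given training set size."""
    return alpha * size ** beta


def samples_needed(alpha, beta, target_error):
    """Invert the power law: m = (target_error / alpha) ** (1 / beta)."""
    return (target_error / alpha) ** (1.0 / beta)


# Synthetic pilot measurements (made up for illustration, not from the paper).
pilot_sizes = np.array([1_000, 2_000, 4_000, 8_000, 16_000], dtype=float)
pilot_errors = 2.0 * pilot_sizes ** -0.3

alpha, beta = fit_power_law(pilot_sizes, pilot_errors)
print("fitted alpha, beta:", alpha, beta)
print("predicted error at 64k samples:", predict_error(alpha, beta, 64_000))
print("samples needed for error 0.05:", samples_needed(alpha, beta, 0.05))
```

In practice the pilot errors would come from models trained on nested subsets of your data, each with its own hyper-parameter sweep, and real measurements will be noisier than this toy curve.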
Insight 2: The studies have demonstrated Insight 1 in 4 different domains with “real-world” datasets, such as ImageNet classification, which is the closest to our application.
An example of the results can be seen in the images below.
ImageNet classification (*I find it suspicious that they didn’t show the graph beyond 2^9 samples per class):
Character-based language models:
Insight 3: Research achievements could potentially affect the following two properties of the learning curve:
- a) The minimal achievable error and the intercept of the graph with the y-axis. The studies state that, based on their experiments, this is the only thing affected by increasing the number of parameters or layers in the architecture.
- b) The exponent of the power law. Baidu has yet to explore this, but it is possible that smart training techniques (augmentation, data sampling, priors, meta-architectures, etc.) increase the steepness of the learning curve by enabling the machine to learn more from each marginal sample.
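To see why the two levers differ, here is a small sketch under my own simplifying assumptions, not results from the paper. It models (a) as multiplying the curve by a constant factor `intercept_factor`, which shifts the intercept on a log-log plot. It models (b) as adding `exponent_shift` to the exponent. It ignores the minimal achievable error entirely. All names and numbers are hypothetical.

```python
def error(alpha, beta, size):
    """Power-law learning curve: alpha * size**beta."""
    return alpha * size ** beta


def crossover_size(intercept_factor, exponent_shift):
    """Training set size at which a steeper exponent catches up with a lower intercept.

    Solves intercept_factor * m**beta == m**(beta + exponent_shift),
    which gives m = intercept_factor ** (1 / exponent_shift).
    """
    return intercept_factor ** (1.0 / exponent_shift)


alpha, beta = 2.0, -0.3   # hypothetical baseline curve
intercept_factor = 0.5    # hypothetical gain of type (a): curve shifted down
exponent_shift = -0.1     # hypothetical gain of type (b): steeper curve

m_star = crossover_size(intercept_factor, exponent_shift)
print("crossover at about", m_star, "samples")

for size in (100, 100_000):
    lower_intercept = error(alpha * intercept_factor, beta, size)
    steeper_exponent = error(alpha, beta + exponent_shift, size)
    print(size, lower_intercept, steeper_exponent)
```

Under these assumptions the lower intercept wins on small training sets, while the steeper exponent wins once the training set grows past the crossover, and its advantage keeps growing with more data.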
I would love to hear your thoughts about this, especially from those of you who have time to go deeper into this paper and gain additional insights.