Editor’s note: Keith is a speaker for ODSC East 2022. Be sure to check out his talk, “What Analytics Leaders Should Know About Human-in-the-Loop,” there!
Human-in-the-loop machine learning is changing the data science landscape faster than data science education can keep up. If you work in any of the areas powered by Deep Learning (computer vision, autonomous vehicles, drone delivery and inspection, and many others), you are likely working with data that require large amounts of manual annotation. However, it is unlikely that there was time to explore this world during your data science training. Keeping up with changing technologies, mastering coding, and learning algorithms leave little time for anything else. But I’ve grown convinced that everyone needs some familiarity with how this trend is changing our field.
The development of machine learning models traditionally relies on vast amounts of structured data for training. That requires the style of data preparation we are all familiar with: cleaning, formatting, and feature engineering. However, the vast majority of the world’s data is unstructured, existing in just about every form imaginable, from news articles to social media posts to images and videos. This data is among the most valuable resources of all, yet it remains largely untapped. I’ve spoken to some data scientists who come very close to suggesting that Deep Learning doesn’t involve data preparation at all.
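To make the contrast concrete, here is a minimal sketch of that familiar structured-data workflow. The table, column names, and rules are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical structured claims table -- names and values are illustrative.
df = pd.DataFrame({
    "claim_amount": ["1,200", "850", None],
    "state": ["MA ", "ny", "MA"],
})

# Cleaning and formatting: normalize numbers and text, drop unusable rows.
df["claim_amount"] = df["claim_amount"].str.replace(",", "").astype(float)
df["state"] = df["state"].str.strip().str.upper()
df = df.dropna(subset=["claim_amount"])

# Simple feature engineering: derive a flag the model can learn from.
df["large_claim"] = df["claim_amount"] > 1000
```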
But, of course, that isn’t true. Instead, data preparation is entirely different, focused on creating carefully annotated and assembled training datasets. In an academic environment, one can use an existing dataset, like the famous ImageNet dataset, for practice and training. When applying that same academic training to a major project in a corporate environment, we may have to build our own dataset, and that is not a trivial undertaking. What does it take, in terms of training and resources, to achieve such a feat at scale?
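As a small illustration of what that different kind of preparation feeds into, here is a hedged sketch assuming PyTorch and torchvision (the article names no framework) and an image-classification task where annotators have sorted images into one folder per label:

```python
import torch
from torchvision import datasets, transforms

# Assumes a manually annotated dataset laid out as one folder per class,
# e.g. data/train/defect/ and data/train/ok/ (hypothetical paths).
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("data/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

print(train_set.classes)  # class names are inferred from the folder names
```

The code itself is trivial; the expensive part is everything before it, which is the labeling effort that produces those folders.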
And it’s not just about training data during the model development phase. Human-in-the-loop also plays a role during deployment. Machine learning models produce propensity scores that vary in the confidence of the prediction. You can act on both high and low scores with confidence, but you also have to deal with what one colleague calls the “squishy middle,” where the scores indicate that either outcome is about equally likely. Where do you route such cases, with their middling level of confidence? One option is to route them to a human working in near real time. Time-sensitive predictions, like insurance claim processing, can benefit greatly from this kind of approach.
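One simple way to implement that routing is a pair of confidence thresholds, with everything between them sent to a human reviewer. The thresholds, labels, and scores below are purely illustrative, sketched in Python:

```python
# Illustrative thresholds -- real values would be tuned per use case.
AUTO_APPROVE = 0.85  # high confidence: act automatically
AUTO_DENY = 0.15     # equally high confidence in the other direction

def route_prediction(propensity: float) -> str:
    """Route one case based on the model's propensity score."""
    if propensity >= AUTO_APPROVE:
        return "auto_approve"
    if propensity <= AUTO_DENY:
        return "auto_deny"
    return "human_review"  # the "squishy middle"

scores = [0.97, 0.42, 0.08, 0.63]
print([route_prediction(s) for s in scores])
# ['auto_approve', 'human_review', 'auto_deny', 'human_review']
```

In practice, the thresholds would be tuned against the cost of a wrong automatic decision versus the cost of human review time.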