What to do when AI brings more questions than answers

Posted on: Jul 17, 2021

The concept of uncertainty in the context of AI can be difficult to grasp at first. At a high level, uncertainty means working with imperfect or incomplete information, but there are countless potential sources of it. Some, such as missing, unreliable, conflicting, noisy, or ambiguous information, are especially challenging to address without understanding their causes. Even the best-trained AI systems can't be right 100% of the time. In the enterprise, stakeholders must find ways to estimate and measure uncertainty to the extent possible.

It turns out uncertainty isn't necessarily a bad thing, provided it can be communicated clearly. Consider this example from machine learning engineer Dirk Elsinghorst: an AI is trained to classify animals on a safari to help safari-goers stay safe. The model trains on the available data, labeling each animal "risky" or "safe." But because it never encounters a tiger during training, it classifies tigers as safe, drawing a comparison between the stripes on tigers and those on zebras. If the model could communicate its uncertainty, humans could intervene to alter the outcome.
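For that intervention to happen, a model has to be able to say "I don't know" instead of guessing. Here is a minimal sketch of the idea, assuming a probabilistic classifier and an illustrative confidence threshold; the labels and numbers are assumptions for demonstration, not details from the article:

```python
# Hedged sketch: defer to a human when the model's confidence is low.
# The class labels, probabilities, and 0.8 threshold are illustrative.
import numpy as np

def classify_with_deferral(probs, labels, threshold=0.8):
    """Return a label when confident; otherwise flag for human review."""
    top = int(np.argmax(probs))
    if probs[top] < threshold:
        return "uncertain: refer to a human"
    return labels[top]

labels = ["safe", "risky"]
# A never-before-seen tiger may produce a flat probability distribution
# instead of a confident (and wrong) "safe" call, so it gets escalated.
print(classify_with_deferral(np.array([0.55, 0.45]), labels))  # uncertain
print(classify_with_deferral(np.array([0.05, 0.95]), labels))  # risky
```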

Uncertainty explained

There are two common types of uncertainty in AI: aleatoric and epistemic. Aleatoric uncertainty accounts for chance, such as differences in an environment or in the skill levels of the people capturing training data. Epistemic uncertainty is part of the model itself: models that are too simple in design can show high variation in their outcomes.
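One common way to surface these two kinds of uncertainty (a general technique, not one the article prescribes) is to train an ensemble: disagreement between the members is a rough proxy for epistemic uncertainty, while the entropy of the averaged probabilities reflects overall predictive uncertainty. A minimal sketch, assuming scikit-learn and synthetic data:

```python
# Hedged sketch: ensemble disagreement as an epistemic-uncertainty proxy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Variance of per-tree predictions on one sample: how much the ensemble
# members disagree (an epistemic-uncertainty proxy).
per_tree = np.array([tree.predict(X[:1])[0] for tree in forest.estimators_])
epistemic_proxy = per_tree.var()

# Entropy of the averaged class probabilities: overall predictive uncertainty.
probs = forest.predict_proba(X[:1])[0]
entropy = -np.sum(probs * np.log(probs + 1e-12))
print(f"tree disagreement: {epistemic_proxy:.3f}, entropy: {entropy:.3f}")
```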

Observations, or sample data, from a domain or environment often contain variability. Typically referred to as "noise," this variability can stem from natural causes or from error, and it affects not only the measurements an AI learns from but also the predictions it makes.

In a dataset used to train an AI to predict flower species, for instance, noise could take the form of flowers that are larger or smaller than normal, or of typos made when recording the measurements of petals and stems.
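A small simulation makes the effect concrete. The sketch below uses synthetic stand-in measurements rather than a real flower dataset: it trains a classifier on clean data, then scores it on the same flowers after adding measurement noise.

```python
# Hedged sketch: measurement noise degrading a flower classifier.
# All distributions and noise levels here are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Two species with overlapping petal length/width measurements (cm).
X = np.vstack([rng.normal([3.0, 1.0], 0.3, (200, 2)),
               rng.normal([4.5, 1.5], 0.3, (200, 2))])
y = np.repeat([0, 1], 200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
print("clean accuracy:", model.score(X_te, y_te))

# Simulate unusually large/small flowers and recording errors at test time.
X_noisy = X_te + rng.normal(0, 0.5, X_te.shape)
print("noisy accuracy:", model.score(X_noisy, y_te))
```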

Another source of uncertainty arises from incomplete coverage of a domain. In statistics, samples are collected randomly, and some bias is unavoidable. Data scientists need to strike a balance between variance and bias that keeps the data representative of the task the model will be used for.

Extending the flower-classifying example, a developer might choose to measure the size of randomly selected flowers in a single garden. The scope is limited to one garden, which might not be representative of gardens in other cities, states, countries, or continents.
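This failure mode can also be simulated directly. In the sketch below, where every distribution is an illustrative assumption, a model fit on flowers from one garden is evaluated on a region where the same species simply grow larger, and accuracy drops:

```python
# Hedged sketch: incomplete domain coverage as distribution shift.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def garden(loc_a, loc_b, n=200):
    """Synthetic two-species measurements centered on the given means."""
    X = np.vstack([rng.normal(loc_a, 0.3, (n, 2)),
                   rng.normal(loc_b, 0.3, (n, 2))])
    y = np.repeat([0, 1], n)
    return X, y

X_home, y_home = garden([3.0, 1.0], [4.5, 1.5])  # the sampled garden
X_away, y_away = garden([4.2, 1.4], [5.7, 1.9])  # an unsampled region

model = LogisticRegression().fit(X_home, y_home)
print("home accuracy:", model.score(X_home, y_home))
# Larger flowers land on the wrong side of the learned boundary,
# so accuracy away from the sampled garden is typically much lower.
print("away accuracy:", model.score(X_away, y_away))
```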