Oct 2025 · 5–6 mins

The Model Is Confident. The Model Is Wrong.


In 2018, a widely circulated paper showed that a standard image classifier trained on ImageNet would assign high confidence to images of pure noise. Show the network a pattern of random pixels and ask it to classify the image; it will report with 99% probability that this is a picture of a specific breed of dog. The network is not malfunctioning. It is doing exactly what it was trained to do: map inputs to probability distributions over classes. The problem is that those probability distributions are meaningless outside the region of input space the training data covered, and the network has no mechanism to know when it has left that region.

This problem does not go away in scientific applications. It gets more expensive.

How neural networks express confidence

A standard classification network ends with a softmax layer that converts a vector of raw scores into a probability distribution summing to one. A regression network predicts a point estimate with no uncertainty attached. In both cases, the network's output is defined for any input, regardless of whether that input resembles the training data. The network cannot say it does not know. It can only produce a number.
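A minimal sketch of the softmax step, in NumPy with made-up scores, shows why. Any vector of logits becomes a valid probability distribution, and nothing in the arithmetic records whether the input resembled the training data.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max()
    exp = np.exp(z)
    return exp / exp.sum()

# Raw scores from some hypothetical classifier head for an arbitrary input.
# The input itself never appears here: any score vector yields a distribution
# summing to one, however far the input sits from the training data.
logits = np.array([1.2, 0.3, 8.5, -0.7, 2.1])
probs = softmax(logits)

print(probs)           # sums to 1.0 by construction
print(probs.argmax())  # a confident "prediction", meaningful or not
```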

For an image classifier, this is annoying but rarely catastrophic. For a molecular property prediction model used to screen drug or material candidates, a confident wrong prediction directly determines what gets synthesized, at a cost of weeks of lab time and significant expense. The confidence interval around the prediction determines how much of the screening budget gets allocated to candidates that will fail. Getting this wrong systematically is not a minor inefficiency. It reshapes the direction of the research.

The specific failure mode in molecular ML

Property prediction models for molecules are typically trained on datasets of known compounds: drugs that have been tested, materials that have been characterized. The training distribution reflects the chemical space that has been historically explored. Novel scaffolds, new functional group combinations, and structural motifs not present in the training set are precisely the most interesting targets for generative design, and they are precisely the inputs on which the model's confidence is least reliable.

A model trained on ZINC drug-like molecules and asked to predict the binding affinity of a macrocyclic peptide with unusual backbone geometry will produce a number. That number will look exactly like the numbers it produces for drug-like small molecules. There is no flag, no asterisk, no widened interval. The prediction is indistinguishable in format from a prediction the model is actually equipped to make.

The tools that exist

Uncertainty quantification in neural networks is an active research area and the solutions are imperfect but real.

Monte Carlo dropout keeps the dropout layers, present in many networks for regularization during training, active at inference as well: the same input is run through the network many times with different neurons randomly zeroed, and the spread in outputs is treated as a proxy for uncertainty. It is easy to implement, and although it often underestimates the true uncertainty, it provides some signal at low computational cost.
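As a rough PyTorch sketch (the toy regressor and the 16-dimensional feature vector are placeholders, not any particular property predictor), the only trick is leaving the dropout modules in training mode at prediction time:

```python
import torch
import torch.nn as nn

# A toy regressor with dropout; stands in for a property prediction model.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    # train() keeps Dropout active; each forward pass zeroes different neurons.
    model.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # Spread across stochastic forward passes is the uncertainty proxy.
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.randn(1, 16)  # placeholder molecular feature vector
mean, std = mc_dropout_predict(model, x)
print(mean.item(), std.item())
```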

Deep ensembles train multiple networks with different random initializations and take the spread in their predictions as a measure of uncertainty. This is more expensive but better calibrated than MC dropout in most studies. The intuition is that networks trained from different starting points will agree on in-distribution inputs and disagree on out-of-distribution inputs, because the training data constrains all the models to similar solutions for seen cases but does not constrain them for unseen cases.
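A sketch of the same idea, again with a placeholder PyTorch regressor and synthetic data: the only ingredient beyond ordinary training is a different random seed per ensemble member.

```python
import torch
import torch.nn as nn

def make_model(seed):
    # Different seeds give different initializations, which is all a
    # deep ensemble requires beyond ordinary training.
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))

def train(model, x, y, epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

# Synthetic stand-ins; in practice these would be molecular features and labels.
x_train, y_train = torch.randn(256, 16), torch.randn(256, 1)
ensemble = [train(make_model(seed), x_train, y_train) for seed in range(5)]

x_new = torch.randn(1, 16)
with torch.no_grad():
    preds = torch.stack([m(x_new) for m in ensemble])
# Members agree where the training data constrained them, disagree elsewhere.
mean, std = preds.mean(dim=0), preds.std(dim=0)
```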

Gaussian Process models provide theoretically principled uncertainty estimates by maintaining a distribution over functions rather than a point estimate. The uncertainty is low near training data and grows in unexplored regions. GPs scale poorly to large datasets and high-dimensional inputs, which is why neural networks displaced them for many tasks, but for low-data scientific problems they remain competitive and their uncertainty estimates are better understood.
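The growth of uncertainty away from the data is easy to see on a one-dimensional toy problem with scikit-learn's GaussianProcessRegressor (the kernel choice here is illustrative, not a recommendation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D problem: uncertainty collapses near the training points
# and grows outside the sampled range.
rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = np.sin(X_train).ravel() + 0.1 * rng.standard_normal(20)

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_train, y_train)

X_query = np.linspace(-6, 6, 200).reshape(-1, 1)
mean, std = gp.predict(X_query, return_std=True)
# std is small inside [-3, 3] and widens outside the training range.
```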

Conformal prediction is a newer framework that provides distribution-free coverage guarantees: for a specified confidence level, the prediction interval will contain the true value that fraction of the time, regardless of the model architecture or the data distribution. It requires only a calibration set and produces wider intervals when the model is less certain. The intervals are valid in a precise statistical sense, which is more than most uncertainty estimates can claim.
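A minimal split conformal sketch for regression, with made-up calibration residuals, shows how little machinery the guarantee needs; the coverage holds under exchangeability of calibration and test points.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    # residuals_cal: |y - y_hat| on a held-out calibration set the model
    # was not trained on. Returns an interval that contains the true value
    # with probability >= 1 - alpha.
    n = len(residuals_cal)
    # Finite-sample corrected quantile of the calibration residuals.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(residuals_cal, q_level, method="higher")
    return y_pred_new - q, y_pred_new + q

# Hypothetical values: absolute errors on 500 calibration molecules and a
# point prediction of 7.2 for a new candidate.
residuals_cal = np.abs(np.random.default_rng(0).normal(0, 0.4, size=500))
lower, upper = split_conformal_interval(residuals_cal, y_pred_new=7.2, alpha=0.1)
print(lower, upper)
```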

Why calibration is a scientific responsibility

A model that is uncalibrated is not just technically flawed. It is producing misleading information dressed up as quantitative precision. In materials screening, an uncalibrated model that predicts high binding affinity for twenty novel candidates with apparent confidence of 95% has told you something that sounds like strong evidence but is not. The confidence numbers carry rhetorical weight whether or not they are warranted.

The practice, common in scientific ML papers, of reporting point predictions with no uncertainty carries over into the tools used by experimentalists, who may not have the background to question how the confidence is encoded. A computed number that looks like a measurement gets treated like a measurement. Acting on it as if it were one, when it is actually an interpolation of uncertain validity, wastes resources and sometimes sends research programs in the wrong direction for years.

Getting the uncertainty right is not a nice-to-have. It is part of what it means to do science with these tools.