Nov 2025 · 8–10 min read

Why Most AI-for-Science Papers Won't Reproduce


I want to be precise about this, because it is easy to make the argument sound like cynicism and I mean it as something closer to a diagnostic. The field of AI for science, meaning machine learning applied to molecular design, protein structure prediction, materials property prediction, and related problems, has published an extraordinary number of papers in the last five years. A non-trivial fraction of those papers will not reproduce when someone attempts to validate them experimentally, and a large fraction will not even reproduce computationally when someone attempts to rerun them on slightly different data. This is not primarily a problem of scientific misconduct. It is a problem of structural incentives meeting technical debt, and it has specific, identifiable failure modes.

Data leakage and the public database problem

The most common failure mode, and the one that produces the most dramatic-seeming results, is data leakage through public chemical databases. The standard datasets used for training and benchmarking molecular property prediction models (ChEMBL, PubChem, ZINC, the Cambridge Structural Database, the Materials Project) are all publicly available and extensively cross-referenced. A molecule that appears in one database often appears in several others, sometimes with slightly different representations, slightly different property measurements, or values computed at different levels of theory.
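To make the overlap concrete, here is a minimal sketch of the check I rarely see papers report, using InChIKey as a representation-independent identifier so that the same molecule written as two different SMILES strings still collides. It assumes an RDKit build with InChI support, and the molecule lists are placeholders.

```python
# Minimal cross-database overlap check via InChIKey. Assumes an RDKit
# build with InChI support; the molecule lists are placeholders.
from rdkit import Chem

def inchikeys(smiles_list):
    keys = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:  # skip unparseable records
            keys.add(Chem.MolToInchiKey(mol))
    return keys

train = ["CCO", "c1ccccc1O"]        # ethanol, phenol (one database)
test = ["OCC", "Oc1ccccc1", "CCN"]  # same molecules, different SMILES

overlap = inchikeys(train) & inchikeys(test)
print(f"{len(overlap)} of {len(test)} test molecules also appear in training")
```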

When a research group trains a model on one database and evaluates it on a held-out test set drawn from the same database or a closely related one, the test set is not truly held out in any meaningful sense. Both training and test data were generated by the same community, using similar experimental or computational methods, on similar classes of compounds. The model is not generalizing to a new distribution. It is interpolating within a distribution it has already seen, from multiple directions.

The correct evaluation is prospective: train on compounds synthesized before a cutoff date, evaluate on compounds synthesized after it. Very few papers do this, because it requires waiting, or because historical databases are not cleanly time-stamped, or because the retrospective split looks fine to reviewers who are not thinking carefully about it.
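The split itself is a few lines of code once you have dates, which is precisely the part that is usually missing. The sketch below assumes every record carries a date of first synthesis or deposition; the field names are hypothetical, and real databases rarely expose this so cleanly.

```python
# Sketch of a prospective (time-split) evaluation. Field names are
# hypothetical; real databases rarely record first-report dates cleanly.
from datetime import date

records = [
    {"id": "mol-001", "first_reported": date(2018, 3, 1), "y": 0.42},
    {"id": "mol-002", "first_reported": date(2022, 7, 9), "y": 1.10},
]

CUTOFF = date(2021, 1, 1)
train = [r for r in records if r["first_reported"] < CUTOFF]
test = [r for r in records if r["first_reported"] >= CUTOFF]

# Fit on train, evaluate on test. Unlike a random split, no test
# molecule could have influenced the process that generated the
# training data, because it did not exist yet.
```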

The benchmark saturation problem

Several standard benchmarks in molecular ML are now effectively saturated. Models achieve performance on QM9 property prediction that is close to the noise floor of the DFT calculations that generated the dataset. On protein structure prediction, AlphaFold2 achieved accuracy on CASP14 targets that, by reproducibility metrics, exceeded the agreement between many independently solved crystal structures. When benchmarks are saturated, continued improvement on them provides no signal about whether models are actually getting better at the underlying scientific task.

The community responded by creating harder benchmarks, which is reasonable. The problem is that harder benchmarks often have smaller datasets, which means overfitting is more likely, and the field moves fast enough that benchmark-specific tricks developed for one competition often appear in the code of subsequent submissions without being flagged. Benchmark-specific tuning is not the same as developing general methods.

The DFT accuracy problem

A particular failure mode in materials ML is the assumption that DFT is the ground truth. Neural network potentials are routinely benchmarked against DFT energies and forces, and a model that achieves DFT accuracy is considered state-of-the-art. But DFT is itself an approximation to quantum mechanics, one that makes specific errors in specific regimes. Van der Waals interactions are underestimated by standard GGA functionals. Strongly correlated electron systems are handled poorly by most functionals. Reaction barriers involving charge transfer are systematically off.

A model trained on DFT data that achieves DFT accuracy has learned to reproduce DFT's errors faithfully. Whether those errors matter depends entirely on the application. For screening candidate materials by relative stability, DFT accuracy is often sufficient. For predicting absolute reaction rates or spectroscopic properties, it frequently is not. Papers that claim experimental-level accuracy based on DFT benchmarks are making an inference that requires a separate validation step that is almost never performed.
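A back-of-envelope way to see this: by the triangle inequality, the error an application sees against experiment can be as large as the model's error against DFT plus DFT's own error against experiment. The numbers below are illustrative only, not drawn from any particular study.

```python
# Back-of-envelope error budget for a model trained on DFT labels.
# Numbers are illustrative, not from any specific study.
model_vs_dft = 0.02        # eV/atom: the headline "DFT accuracy" claim
dft_vs_experiment = 0.10   # eV/atom: illustrative GGA formation-energy error

print(f"benchmark headline:        {model_vs_dft:.2f} eV/atom vs DFT")
print(f"worst case in application: {model_vs_dft + dft_vs_experiment:.2f} eV/atom vs experiment")
```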

Why the incentive structure makes this worse

Machine learning conferences and journals reward novelty and benchmark improvement. A paper that introduces a new architecture achieving a 3% improvement on QM9 is publishable. A paper that carefully reproduces a prior result and finds that it does not hold on a different dataset is not, or is significantly harder to publish. The replication studies that would most improve the field's collective knowledge are the studies the field least rewards.

The timescale mismatch compounds this. A computational ML paper can be written, reviewed, and published in months. Experimental validation of a computational prediction in materials science takes a year or more: synthesis, characterization, property measurement, troubleshooting. By the time an experimental group has determined that the model's prediction was wrong, the field has moved on and the original paper has accumulated citations from subsequent papers that took its claims at face value.

What a higher standard would look like

The clearest improvement would be prospective blind challenges, run with experimental partners, where the models make predictions before the experiments are done and the outcomes are shared with the modeling teams afterward. The CASP competition for protein structure prediction worked this way and produced genuinely reliable information about model performance. Materials Genome Initiative databases contain enough data to run similar challenges for materials property prediction, but the coordination required is substantial and nobody's career advancement depends on organizing them.

At the individual paper level, the minimum viable improvement is a clear statement of what the training distribution is, what the test distribution is, why those two distributions should be considered separate, and what the performance would be on a truly prospective evaluation even if that evaluation is approximate. Most papers do not provide this information, not because the authors are hiding it, but because the field has not established a norm of requiring it.
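For concreteness, here is one possible shape such a statement could take, as a machine-readable stub. The field names are my suggestion rather than an established community standard, and every value is a placeholder.

```python
# One possible shape for the minimum data statement, as a plain dict.
# Field names are a suggestion, not a standard; values are placeholders.
data_statement = {
    "training_distribution": "ChEMBL 33, kinase IC50 records",
    "test_distribution": "assay results first reported after 2023-01-01",
    "why_separate": "temporal split; no test compound predates the cutoff",
    "prospective_estimate": "approximate backtest on an earlier cutoff, "
                            "reported alongside the random-split number",
}
```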

The papers that will matter in ten years are the ones whose predictions get tested in physical reality and hold up. Right now, those papers are not clearly distinguishable at publication time from the papers whose predictions do not hold up. Making that distinction possible before experiments are run is the structural change the field needs.