In July 2021, DeepMind released the structures of around 350,000 proteins predicted by AlphaFold 2. By 2022, that database held over 200 million predicted structures, covering nearly every catalogued protein. Structural biology had been working on this problem since the 1950s, and the Protein Data Bank held roughly 180,000 experimentally solved structures after 50 years of crystallography and cryo-EM work. The lesson for organizations sits in that ratio: more than a thousand predicted structures for every one solved at the bench. AlphaFold did not replace the wet lab. It removed a bottleneck that had defined an entire field.

The Problem Selection Was the Hard Part

Protein folding had three properties that made it suitable for a deep learning approach. The output space was constrained by physics, there was a public training corpus in the PDB with decades of curated labels, and there was an objective benchmark, CASP, that ran every two years and scored predictions blind. DeepMind did not pick a vague problem and apply transformers to it. They picked a problem where ground truth existed, evaluation was external, and progress was measurable in a single number, the GDT score.
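For readers who have not met GDT before, the number is less mysterious than it sounds. In its common GDT_TS form, it averages the percentage of residues that land within 1, 2, 4, and 8 angstroms of the experimental structure. Here is a minimal Python sketch, assuming the predicted and reference structures are already superimposed (the real CASP evaluation searches over many superpositions). The function name and the example distances are made up for illustration.

```python
from typing import Sequence

# Standard GDT_TS distance cutoffs, in angstroms.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(distances: Sequence[float]) -> float:
    """Approximate GDT_TS from per-residue C-alpha distances (angstroms).

    Assumption: the two structures are already superimposed. The official
    evaluation optimizes the superposition for each cutoff; this sketch
    skips that step, so it can only underestimate the official score.
    """
    if not distances:
        raise ValueError("need at least one residue distance")
    n = len(distances)
    # Fraction of residues within each cutoff of the reference position.
    fractions = [sum(d <= cutoff for d in distances) / n for cutoff in CUTOFFS]
    # GDT_TS is the mean of those four fractions, reported on a 0-100 scale.
    return 100.0 * sum(fractions) / len(CUTOFFS)


# Illustrative distances for a six-residue toy example (not real data).
example = [0.5, 0.9, 1.5, 3.0, 7.0, 12.0]
print(round(gdt_ts(example), 1))  # prints 58.3
```

The point for a business reader is less the formula than its properties: anyone can recompute it, it needs no judgment call, and it compares any two submissions on the same scale.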

Most enterprise AI initiatives fail this test before a single model is trained. Teams pick problems where success is defined retroactively, where the training data is whatever the CRM happens to contain, and where evaluation depends on whoever is loudest in the next quarterly review. The AlphaFold lesson is that the team spent enormous energy on problem framing and data curation before the architecture mattered. The 2020 CASP14 result rested on the AlphaFold team’s decision to treat the multiple sequence alignment as the core input, not on raw model scale.

Hard Problems Need Sustained Investment, Not Pilots

AlphaFold 1 placed first at CASP13 in 2018 with a median GDT score of around 58 on the hardest targets. AlphaFold 2 reached a median of about 92 across all CASP14 targets in 2020, and about 87 on the hardest ones. That is two years of focused work between a result that was interesting and a result that was field-changing. The first version would have been killed in most corporate environments as “not production ready.” Organizations that want AI to solve genuinely hard problems need to budget for the gap between the first working prototype and the version that actually moves the needle.

This applies directly to marketing and CRM contexts. A churn model that lifts retention by 0.4 points in its first iteration is usually shelved. The same model, after 18 months of feature work, calibration against holdout cohorts, and integration into the customer service workflow, can move retention by 3 or 4 points. The architecture rarely changes much. The surrounding system does.

The Model Is a Component, Not the Product

AlphaFold’s predictions are useful because they plug into existing scientific workflows. Researchers at the EBI built a search interface, structures are cross-linked with UniProt entries, and confidence scores (pLDDT) are exposed so biologists know when to trust a prediction and when to run an experiment. The model would be a curiosity without that scaffolding. Data Innovation, a Barcelona-based AI and data company that builds and operates intelligent systems where humans and AI agents work together, has documented that roughly 70 percent of the engineering effort in deployed AI systems goes into the integration layer, the feedback loops, and the confidence signals that let humans decide when to override the model.

This matches what AlphaFold’s adoption curve shows. The model was open-sourced in July 2021. The structures became broadly useful in research pipelines through 2022 and 2023, after the EBI database, the ColabFold notebook from Steinegger and colleagues, and integrations with tools like PyMOL and ChimeraX. The model’s accuracy was a precondition. The tooling around it was what changed daily practice.

What This Means for B2B Teams Considering AI

Three operational points carry over. First, pick problems where you can define success numerically and where ground truth exists or can be generated. Lead scoring against closed-won outcomes qualifies. “Improving customer experience” does not. Second, expect the first useful version to be embarrassing and plan for the iteration cycle that follows. The teams that get value from AI are the ones that ship version 0.3 internally, learn from it, and reach version 2.0 within 18 to 24 months. Third, invest in confidence signals. AlphaFold’s pLDDT score is what made biologists trust the predictions. A churn score without a confidence band is a number people argue about. A churn score with calibrated probabilities and a clear “model is uncertain here” flag is a tool people use.
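To make the third point concrete, here is one way a churn score could carry its own confidence signal. This is a sketch under stated assumptions, not a recommended production design. It calibrates raw scores against a labeled holdout cohort using equal-width bins, attaches an approximate 95% Wilson interval, and raises an “uncertain” flag when that interval is wide. The names (`build_bins`, `score_with_confidence`), the five-bin layout, and the 0.30 width threshold are all hypothetical choices.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ScoredPrediction:
    calibrated: Optional[float]  # observed holdout churn rate in the score's bin
    low: float                   # lower bound of the ~95% Wilson interval
    high: float                  # upper bound of the ~95% Wilson interval
    uncertain: bool              # True when the interval is wider than max_width


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return 0.0, 1.0  # no holdout evidence at all: maximally wide
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def build_bins(scores: Sequence[float], churned: Sequence[int],
               n_bins: int = 5) -> List[Tuple[int, int]]:
    """Count (churned, total) holdout customers per equal-width score bin."""
    counts = [[0, 0] for _ in range(n_bins)]
    for score, label in zip(scores, churned):
        i = min(int(score * n_bins), n_bins - 1)
        counts[i][0] += int(label)
        counts[i][1] += 1
    return [(c, t) for c, t in counts]


def score_with_confidence(raw_score: float, bins: List[Tuple[int, int]],
                          max_width: float = 0.30) -> ScoredPrediction:
    """Map a raw model score in [0, 1] to a calibrated rate plus a flag."""
    i = min(int(raw_score * len(bins)), len(bins) - 1)
    churned, total = bins[i]
    low, high = wilson_interval(churned, total)
    rate = churned / total if total else None
    return ScoredPrediction(rate, low, high, (high - low) > max_width)


# Toy holdout: 200 low-score customers (10 churned), 4 high-score (3 churned).
holdout_scores = [0.1] * 200 + [0.9] * 4
holdout_churned = [0] * 190 + [1] * 10 + [1, 1, 1, 0]
bins = build_bins(holdout_scores, holdout_churned)

print(score_with_confidence(0.12, bins))  # well-populated bin, narrow interval
print(score_with_confidence(0.95, bins))  # only 4 holdout examples, flagged
```

In this toy data the high score looks alarming, but it rests on four holdout customers, so the flag tells the account team to treat it as a prompt for a conversation, not a verdict. That is the same role pLDDT plays for a biologist deciding whether to run an experiment.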

If you are scoping an AI initiative, the most useful exercise is to write down the equivalent of a CASP benchmark for your problem before any modeling work starts. What is the held-out test set, who scores it, and what threshold counts as a real improvement over the current process? If you cannot answer those three questions, the model is not the bottleneck. The problem definition is. We are happy to compare notes with teams working through this stage.
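For teams that want a starting point, the three benchmark questions can be written down as a small, frozen specification before anyone trains anything. The sketch below is illustrative only. `BenchmarkSpec`, its field names, the accuracy metric, and every number in the example are hypothetical placeholders to be replaced with your own definitions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of held-out cases where the prediction matches the outcome."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must be non-empty and aligned")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


@dataclass(frozen=True)  # frozen: the goalposts should not move after the fact
class BenchmarkSpec:
    test_set: str           # what the held-out data is
    scorer: str             # who runs the scoring (ideally not the model team)
    metric: Callable[[Sequence[int], Sequence[int]], float]
    baseline_score: float   # what the current process achieves on the test set
    min_improvement: float  # gain over baseline that counts as real

    def passes(self, y_true: Sequence[int], y_pred: Sequence[int]) -> bool:
        gain = self.metric(y_true, y_pred) - self.baseline_score
        return gain >= self.min_improvement


# Hypothetical lead-scoring benchmark; all values are made up.
spec = BenchmarkSpec(
    test_set="closed-won outcomes for leads from a quarter unseen in training",
    scorer="revenue operations analyst outside the modeling team",
    metric=accuracy,
    baseline_score=0.60,
    min_improvement=0.10,
)

outcomes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]  # 8 of 10 correct
print(spec.passes(outcomes, predictions))  # prints True
```

Accuracy is used here only because it is easy to read. The structure matters more than the metric: the test set, the scorer, and the threshold are fixed in writing before the first model exists.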