50 Years of Clinical AI: What We Built, What We Lost, What Still Hasn’t Changed
The Question That Never Changed
Fifty years into the story of medical AI, one question has never changed. In the early 1980s, Ted Shortliffe and his team published a study in Computers and Biomedical Research documenting what physicians would require before trusting a computer-based clinical decision support aid. The answer:
“We don’t want to use something that we don’t know how it works. We wanna know why.”
That was 1981. In 2026, the tools are unrecognizable. The requirement is identical.
Shortliffe built MYCIN at Stanford in the 1970s, a rule-based expert system for diagnosing bacterial infections and recommending antibiotic therapy. Every recommendation could be traced through its reasoning chain. The system knew what it knew and could show its work. Today’s LLMs generate explanations that sound authoritative but are produced after the fact. “You can ask an LLM a question twice, five minutes apart, and you’ll get a different kind of answer back,” Shortliffe observes.
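To make the contrast concrete, here is a minimal sketch of the kind of traceable inference a rule-based system makes possible. Everything in it is invented for illustration (MYCIN itself was a Lisp program with hundreds of rules and certainty factors), but the principle carries over: every conclusion points back to a named rule and the facts that triggered it.

```python
# Toy forward-chaining rule engine with a reasoning trace, in the spirit of
# MYCIN-style expert systems. The rules and facts below are invented for
# illustration and are not real clinical guidance.

RULES = [
    # (rule_id, premises, conclusion)
    ("R1", {"gram_negative", "rod_shaped", "anaerobic"}, "likely_bacteroides"),
    ("R2", {"likely_bacteroides"}, "consider_metronidazole"),
]

def infer(facts):
    """Forward-chain over RULES, recording which rule produced each conclusion."""
    facts = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for rule_id, premises, conclusion in RULES:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                trace.append((rule_id, sorted(premises), conclusion))
                changed = True
    return facts, trace

facts, trace = infer({"gram_negative", "rod_shaped", "anaerobic"})
for rule_id, premises, conclusion in trace:
    # The "why": every conclusion cites a named rule and its premises.
    print(f"{conclusion} because {rule_id} fired on {premises}")
```

Running it prints a reasoning chain you can audit step by step, which is exactly the "why" those physicians were asking for in 1981.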
The Cognitive Science Twist: Why AI Explainability in Medicine Mirrors Human Reasoning
Here’s the uncomfortable parallel: human experts have the same problem. Shortliffe, drawing on decades of working alongside cognitive scientists at Stanford, points out that when you ask a physician consultant why they recommended a particular treatment, their explanation is reconstructed after the fact. Research going back to Nisbett and Wilson’s 1977 work shows that people’s self-reports of their reasoning diverge significantly from their observed behavior.
“Frankly, an expert consultant who you ask to give you advice, he’ll give you an explanation, but you’re not really sure that’s how he actually did it,” Shortliffe says. “There’s almost like an automated way in which people who are really familiar with a field come up with a decision.”
The implication: perhaps the standard for AI explainability in clinical decision support should be testable transparency rather than faithfulness to the underlying reasoning. Can you evaluate the explanation against the evidence? That may matter more than whether the explanation maps perfectly to the internal process.
Why Healthcare Is Harder Than Any Other AI Domain
When Shortliffe’s team repurposed their medical diagnostic tools for industrial applications, the results were striking. Diagnosing car engine failures was dramatically easier than diagnosing infections. “We knew exactly how an automotive engine should work. We built it,” Shortliffe explains. “You didn’t need that probabilistic stuff as much anymore because you knew precisely what was going on and what a test meant.”
This asymmetry explains a pattern that has repeated for decades: smart technology companies walk into healthcare and underestimate the domain. IBM Watson Health. Google Health. The companies aren’t stupid. The problem is structural. As Shortliffe puts it: “If you haven’t lived it, if you haven’t worked out on the wards, I don’t think you have quite that intuitive understanding of what makes medicine so challenging.”
The Wachter Response
Bob Wachter suggested on this podcast that clinical decision support might have been “too hard a problem to start with” for the field of medical informatics. Shortliffe heard that episode. His response is measured but firm: “Yes, decision science arguably is some of the hardest stuff. It’s not just a computing issue. It’s got a lot of psychology and understanding of the domain of medicine.”
But it was also the problem that mattered most, and it fit what AI could actually do in the 1970s and 80s. The alternative at the time was building electronic health records, which plenty of people were working on. Decision support was harder, yes. But “too hard to start with” misreads the history.
Not Failures, But Premature Ideas
To understand the history of medical AI, Shortliffe argues, you have to push back on the narrative that expert systems “failed.” The AI winter of 1988-1995 wasn’t caused by bad ideas. It was caused by a technological landscape that couldn’t support them yet. Memory was expensive. Processing was slow. Databases were local. The internet was in its infancy.
“Machine learning was a joke until we suddenly had the computational power and data sets to actually do machine learning of the sort that we now are taking for granted,” Shortliffe says. “Something that didn’t work once, in a day when we had really different tools, doesn’t mean that it wasn’t worthwhile.”
What concerns him now is that the pendulum swung too far. Machine learning researchers abandoned structured knowledge representation (ontologies, semantic relationships, domain models) as if those approaches had been disproven. They hadn’t. They were working, just slowly and with limited computational resources. Shortliffe sees retrieval-augmented generation (RAG) as a potential bridge, adding semantically structured knowledge back into systems that are otherwise pure data processing.
“We need to be looking for ways to take more explicit knowledge representation and leverage it to enhance the machine learning approaches of today,” he argues. The past isn’t dead. It’s infrastructure waiting for better hardware.
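What that bridge might look like, reduced to a toy: retrieve structured facts and ground the model’s prompt in them before asking the question. The triples, the keyword-overlap scorer, and the prompt format below are all assumptions made for illustration; a real system would use a curated ontology and a proper retriever.

```python
# Minimal sketch of retrieval-augmented generation over structured knowledge.
# The triples are hypothetical, simplified clinical facts, not real guidance.

KNOWLEDGE = [
    # (subject, relation, object)
    ("bacteroides fragilis", "is_a", "anaerobic gram-negative rod"),
    ("bacteroides fragilis", "treated_with", "metronidazole"),
    ("metronidazole", "contraindicated_with", "alcohol use"),
]

def retrieve(question, triples, k=2):
    """Rank triples by crude keyword overlap with the question."""
    words = set(question.lower().split())
    return sorted(
        triples,
        key=lambda t: -len(words & set(" ".join(t).lower().split())),
    )[:k]

def build_prompt(question):
    """Assemble a prompt that grounds the model in retrieved structured facts."""
    facts = retrieve(question, KNOWLEDGE)
    context = "\n".join(f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
    return (f"Known facts:\n{context}\n\n"
            f"Question: {question}\nAnswer using only the facts above.")

# The assembled prompt would be sent to a language model; printing it here
# makes the grounding step visible.
print(build_prompt("How is bacteroides fragilis treated?"))
```

The point is not the retrieval heuristic. It is that the facts the model reasons over are explicit, inspectable, and auditable, the very property the expert-systems era prized.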
The Message to the Next Generation
Shortliffe meets monthly with Columbia medical students taking informatics electives. Most are there because AI suddenly seems relevant to their futures. His message to them: “You are in the middle of a process. This is not the end. If you think anything that is hot off the press is going to be still great in 5, 10, 15 years, you’re kidding yourself.”
And clinical informatics, he insists, is medicine. “It’s just a different aspect of medicine, but it’s medically motivated. It’s aimed at improving medicine. It is as intellectually challenging as anything else in medicine.”
Five decades in, the creator of one of the first medical AI systems is still working on the same problem. The tools have transformed. The question hasn’t.
Listen to the full conversation.
You Might Also Enjoy
- S1E24: The Digital Doctor Revisited with Bob Wachter — The episode Ted responds to directly. Wachter argued CDS was “too hard to start with.” Ted disagrees.
- S1E30: Patient-Centered AI with Amy Price — A different angle on the same question: what happens when AI meets the irreducible complexity of individual patients?
- S1E27: When the Data Isn’t Ready with Charlie Harp — The data infrastructure problem Ted describes (local databases, no shared data sets) still echoes in today’s interoperability challenges.