Some thoughts on how to get to an AI scientist
Dec 12, 2023
6 minutes read

How do we know if we’re making progress towards an AI scientist?

I’ve always been fascinated by the prospect of automating science. In hindsight, it’s my favorite failed project: time and time again my grants have been rejected, my blog posts never posted, and my side projects have failed to produce anything interesting. I’ve never been satisfied by any approach to the problem, and at its core that’s because automating science is a very ill-posed problem. Science is in essence a social endeavor, and one really can’t separate the technical aspects of understanding systems from the refinement through dissent and synthesis provided by colleagues in night-long discussions, years-long back-and-forths, and the occasional conference bar crawl. So automating science would mean fully automating human behavior: an AI scientist is very much the embodiment of whatever general intelligence may be.

And because of that, it’s difficult to measure how much progress we are making towards the goal of an AI scientist. Is it through generating traces of reasonable argumentation and hypotheses? Do we require the AI to run experiments itself and iteratively improve? How far off the beaten path and how non-obvious does a discovery have to be for us to recognize that an AI system has truly contributed to science on its own? We can have multiple answers for each of these questions, which makes the issue a very slippery one indeed.

Nevertheless, over the years I’ve tried to come up with some ways to bisect the problem to make it a bit more systematized and tractable, at least in thought. The first division I have is between what I consider far-away science fiction and our current capabilities. The holy grail in the (maybe?) far future is a system that can both generate novel, non-trivial hypotheses and learn to experiment in the real world by itself. This is very much outside our systems’ current capabilities. Currently, we have made far more progress in hypothesis generation and simulation than in physical platforms that can run “operator-free” (though some exist, in a limited sense, e.g. integrated systems). We are either stuck with robots that can reason and run experiments in a very narrow domain (like Adam and Eve), or with AI used mostly as a hypothesis engine. Even with high-quality AI-generated hypotheses, you will be hard pressed to find a set of human scientists who will gladly and blindly test them. The humans will in all likelihood introduce bias, pre-filter hypotheses, and become part of the system. This is what happens in every company and lab that has “AI” at its core: the humans form a non-trivial part of that core. The question then becomes: how can we measure the success of an AI’s work in isolation, without much experimental feedback?

This leads to the second bisection: domain. Depending on the field, the lack of experimental feedback makes measuring progress either very difficult or not difficult at all. In math, it’s relatively straightforward to check whether a reasoning trace makes sense: you either proved the theorem or you didn’t. There are whole fields dedicated to proof checking and verification; math becomes, in a way, a game with a vast search space, fertile ground for some of our current RL or LLM systems. On the other extreme, an AI can hallucinate the most wonderful-looking drugs that work perfectly in silico but that you know will in all likelihood fail even an early assay screening. How then can you keep measuring progress in this setting without an experimental result?
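
To make the contrast concrete, here is a toy Lean 4 snippet (leaning on the standard-library lemma Nat.add_comm): the verdict is binary, since the file either compiles and the proof is accepted, or it doesn’t.

```lean
-- Toy example: a machine-checkable proof. If this compiles, the theorem is proved;
-- no human judgment is needed to score the reasoning.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```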

Reasoning traces

While it may seem logical to focus on outcomes (did the drug work or not?), I think the main thing to evaluate in our current systems is the process itself: how did the system arrive at its conclusion? How much evidence is there to support it? How novel is the reasoning behind it? And – a very important one – how likely is a human scientist to be convinced by it? The primitive we are interested in examining is the reasoning trace that the system outputs, and how we evaluate these traces in terms of evidence support and novelty will depend on the domain. An AI that needed to invent a clever new logic trick to prove a theorem is a lot more valuable than one that consistently brute-forces inelegant proofs. However, a clinician might be more inclined to believe a drug has a shot at working if it acts through known mechanisms, especially in the absence of experimental data.

Evaluating reasoning traces at scale will be necessary for progress, likely through a combination of crowdsourcing, AI-guided critique (probably something in the style of RL[H/AI]F), and bibliometric density measures (e.g. an out-of-distribution measure of the proposal to assess novelty). Evaluating each step of the trace will likely be necessary too, as work on LLM reasoning has shown. This is probably the hardest system to put together, and the appetite to fund it may end up in an awkward space between industry and academia.
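
As a rough illustration of what per-step evaluation could look like, here is a minimal Python sketch with entirely hypothetical interfaces: the `support` and `novelty` scorers stand in for crowdsourced ratings, an AI critic in the RL[H/AI]F style, or a bibliometric out-of-distribution measure.

```python
# Minimal sketch of per-step reasoning-trace evaluation (hypothetical interfaces).
# Each step gets an evidence-support score and a novelty score; the trace-level
# summary aggregates them.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    claim: str        # the assertion made at this step
    rationale: str    # the evidence or argument offered for it


def evaluate_trace(
    trace: List[Step],
    support: Callable[[Step], float],   # e.g. critic model or human rating in [0, 1]
    novelty: Callable[[Step], float],   # e.g. distance from prior literature in [0, 1]
) -> dict:
    support_scores = [support(s) for s in trace]
    novelty_scores = [novelty(s) for s in trace]
    return {
        # A chain of reasoning is only as convincing as its weakest step.
        "min_support": min(support_scores),
        "mean_support": sum(support_scores) / len(trace),
        "mean_novelty": sum(novelty_scores) / len(trace),
        "num_steps": len(trace),
    }
```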

The weakness of current LLMs

At first blush, the general-purpose power of LLMs and their capability to produce common-sense reasoning traces seem to make them good candidates for a first swing at an AI scientist. However, current LLMs have an inherent weakness in medium- and long-term planning that is tied to their autoregressive design: the longer a reasoning trace needs to go on and be correct at every step, the less likely the LLM is to sample it. This doesn’t seem to be an easy problem to solve, and even several attempts at extending LLMs with RL and search mechanisms don’t seem to help much more than simpler tricks like better prompting and majority voting.
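
A back-of-the-envelope illustration of why this happens, under the (strong) assumption that each step is correct independently with a fixed probability: the chance that an autoregressively sampled trace is correct at every step decays geometrically with its length.

```python
# Toy calculation, not a measurement of any real model: assume each reasoning step
# is independently correct with probability p_step. A trace is only useful if it is
# correct at *every* step, so its probability of success is p_step ** n_steps.
def p_trace_correct(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (5, 20, 50):
    print(n, round(p_trace_correct(0.95, n), 3))
# 5  -> 0.774
# 20 -> 0.358
# 50 -> 0.077
```

At 95% per-step accuracy, a 50-step trace comes out fully correct less than 8% of the time; majority voting over many samples raises the odds of finding a good trace, but it doesn’t remove the underlying decay.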

I therefore don’t see an LLM as a candidate for the core reasoning component of an AI scientist. A more promising direction is to use the LLM as ‘reasoning glue’ for other components like physical simulators and theorem provers: the LLM can shorten reasoning traces by delegating big stretches of the argument to tools, or even hand off the definitive conclusion, say, to a causal inference tool at the end. We have barely scratched the surface of the combinatorics of chaining LLMs with external tools (though we have some examples of even pairing them with real robots). Some of these tools are powerful in their own right, a good example being probabilistic programming languages, which could be used to build a quantitative and predictive world model. An alternative path is to train more general search or RL policy algorithms by having LLMs create and evaluate worlds and rewards at scale.
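
For intuition, here is a minimal sketch of the ‘reasoning glue’ idea in Python. None of these interfaces are real APIs: the three tools are hypothetical stand-ins for a physical simulator, a theorem prover, and a causal inference engine, and `llm` is any callable that, given the context so far, decides whether to call a tool or give a final answer.

```python
# Sketch of an LLM as 'reasoning glue' over external tools (all interfaces are
# hypothetical stand-ins, not a real API). The LLM decides which tool to call;
# the tool does the heavy, verifiable work, shortening the reasoning trace.
from typing import Callable, Dict


def simulate(query: str) -> str:       # stand-in for a physical simulator
    return f"simulation result for: {query}"


def prove(query: str) -> str:          # stand-in for a theorem prover / proof checker
    return f"proof status for: {query}"


def causal_effect(query: str) -> str:  # stand-in for a causal inference tool
    return f"estimated effect for: {query}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "simulate": simulate,
    "prove": prove,
    "causal_effect": causal_effect,
}


def reasoning_glue(llm: Callable[[str], str], question: str, max_steps: int = 5) -> str:
    """Let the LLM route sub-problems to tools; return its final answer."""
    context = question
    for _ in range(max_steps):
        # The LLM is prompted to reply either "CALL <tool>: <query>" or "ANSWER: <text>".
        decision = llm(context)
        if decision.startswith("ANSWER:"):
            return decision.removeprefix("ANSWER:").strip()
        tool_name, _, query = decision.removeprefix("CALL").partition(":")
        result = TOOLS[tool_name.strip()](query.strip())
        context += f"\n[{tool_name.strip()}] {result}"
    return "no conclusion reached"
```

The design point is that the long, verifiable stretches of work live inside the tools, so the LLM only needs to sample short routing decisions rather than a trace that has to be correct at every step.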

The humans in all of this

Even in the limit, when we somehow finally have a working AI scientist that is fully independent, humans will still play a crucial role. Remember that science is a social activity! And the more minds that end up discussing and falsifying hypotheses, the better.

What about narrow systems?

Narrow systems are not AI scientists: we already have plenty of narrow systems that can automatically produce discoveries without much reasoning, but they are more an exercise in experimental automation than truly independent AI systems. The main distinction is the reasoning angle: a system that produces new science through a reasoning trace is probably the bar we want to set.

