Benchmarking Scientific AI Research Tools

Jun 20, 2025

Grant Forbes


At Inquisite, a core part of what we are developing is an AI tool to assist researchers with literature review and evidence synthesis. It takes natural-language research queries as input and returns an annotated list of relevant published papers and journal articles. While this capability can increase researchers’ output by an order of magnitude, proper evaluation of tools like this is crucial to ensure that researchers can trust the outputs of our pipeline. Implicit in the use of Inquisite is the understanding that the articles presented should be the most relevant ones for the given query, rather than merely tangentially related, and that if a particularly relevant article exists for a given query, it ought to be included in the resulting report.


To this end, it’s vital that we have a benchmark to ascertain how well Inquisite meets this criterion, and to compare its efficacy with that of other, similar services. Benchmark development in this area is challenging, however, for a few reasons. The task is imprecisely defined: even experts in a particular field can disagree on which articles are most relevant to a particular query. Disagreement may stem from the perceived quality of an article, differing conceptions of how subfields intersect, or any number of subjective, fuzzy, and difficult-to-evaluate factors. We therefore expect a ceiling on how accurate any benchmark in this area can be, set by how much the human domain experts within a particular sub-field agree when forming “gold standard” responses. Alternatively, if a benchmark relies on an automated collection process, it becomes subject to the same magnitude and distribution of errors as the automated query-answering pipelines it is meant to evaluate.
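
One way to make that ceiling concrete is to measure inter-annotator agreement directly. The sketch below is a minimal illustration with hypothetical relevance labels, using scikit-learn’s `cohen_kappa_score`: if two domain experts only weakly agree on which articles are relevant, no benchmark built from their judgments can score an automated system much more precisely than that.

```python
# A minimal sketch of estimating the "human ceiling" on benchmark accuracy:
# measure agreement between two experts' relevance labels for the same
# (query, article) pairs. The labels below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# 1 = expert judged the article relevant to the query, 0 = not relevant
expert_a = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
expert_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]

kappa = cohen_kappa_score(expert_a, expert_b)
print(f"Cohen's kappa between experts: {kappa:.2f}")
# Low agreement here bounds how precisely any "gold standard" benchmark
# built from these judgments can evaluate an automated system.
```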

Existing benchmarks for scientific QA


There are a few existing benchmarks in this domain that are worth examining, along with a suite of relevant metrics. In particular, there are QA (question-answering) datasets and evaluation methods for assessing a model’s ability to accurately answer natural-language questions with natural-language answers. ScienceQA, for example, contains questions drawn from elementary and high school curricula, along with their answers and explanations. This is a different (and simpler) domain than most users’ use cases for Inquisite, but performance on it is likely still positively correlated with aptitude in accurately answering more complex and technical queries. BioASQ-QA and PubMedQA, on the other hand, are similar datasets targeted more specifically at query-answering in the medical sciences.

LFR-QA is a dataset meant to evaluate Retrieval-Augmented Generation (RAG) in particular: that is, automated systems that answer questions by referencing some corpus of documents to which they have access. This is particularly relevant for evaluating pipelines such as ours, which actively incorporate search processes to compile and sort through relevant articles (rather than simply relying on memorized data in an LLM’s training set). LFR-QA consists of human-generated answers to queries, written by referencing and summarizing a series of extractive summaries (relevant quotes) from selected relevant documents within the corpus. RAG-QA Arena is a corresponding evaluation method for the LFR-QA dataset: it compares the method to be evaluated against these human-generated answers and asks an LLM which is better.
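
For concreteness, that pairwise-comparison step follows the general LLM-as-judge pattern sketched below. This is a generic illustration under assumed names (the model choice and prompt wording are placeholders), not RAG-QA Arena’s actual prompt or code.

```python
# A minimal sketch of LLM-as-judge pairwise evaluation, in the spirit of
# RAG-QA Arena: show the judge a human-written reference answer and a
# system-generated answer, and ask which better answers the question.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge(question: str, human_answer: str, system_answer: str) -> str:
    prompt = (
        f"Question: {question}\n\n"
        f"Answer A:\n{human_answer}\n\n"
        f"Answer B:\n{system_answer}\n\n"
        "Which answer is more complete, accurate, and better grounded in "
        "the cited evidence? Reply with exactly 'A', 'B', or 'tie'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Aggregating these verdicts over a dataset yields a win rate against the
# human-generated reference answers.
```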


The last step in this automated evaluation pipeline - using an LLM to evaluate LLMs - raises a potential concern: is there a danger of building in biases with such automated methods of evaluation? While there is some evidence that bootstrapped evaluation and training can cause divergence from human evaluation (for example, this paper on “Model Autophagy Disorder”), there is also evidence that self-evaluation, if done carefully, can be surprisingly robust (for instance, this paper on LLM-based automated summary evaluation, and this paper on “Constitutional AI” for aligning LLMs).

Limitations of existing benchmarks


The aforementioned benchmarks, while relevant to our research assistant tool, are not necessarily the best fit for directly evaluating Inquisite’s effectiveness, as they are ultimately meant for evaluating natural-language answers to natural-language questions. While our AI research assistant does provide natural-language synthesis, it does so via a complex search process, and a substantial part of what the end user is interested in is the identified sources themselves rather than just the synthesis. As such, a more fitting benchmark would be one that maps natural-language queries not to natural-language answers but to a set of research articles, listed and ranked by relevance to the query. While LFR-QA provides a set list of documents per question for RAG-based approaches, this set is fixed and immutable, and isn’t comparable to the open-ended space of research articles that Inquisite’s research assistant searches for a query. The other QA-based methods listed above do not include a list of relevant articles at all, and so are even less useful for benchmarking an AI research assistant such as ours.
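
A query-to-ranked-articles benchmark of this kind would let us score runs with standard retrieval metrics. The sketch below (hypothetical article IDs) shows recall@k, a common choice for this sort of task; precision- and rank-weighted variants like NDCG follow the same pattern.

```python
# A minimal sketch of the kind of metric such a benchmark would support:
# given a ranked list of retrieved article IDs and a gold set of relevant
# articles for the query, compute recall@k. The IDs are hypothetical.
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the gold-standard relevant articles found in the top k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)

retrieved = ["arxiv:2401.00001", "arxiv:2312.04567", "arxiv:2305.09999"]
gold = {"arxiv:2312.04567", "arxiv:2207.01234"}
print(recall_at_k(retrieved, gold, k=3))  # 0.5: one of the two gold papers was retrieved
```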


Undermind, a research assistant with a similar goal to Inquisite’s, has done some work evaluating its model in this domain (mapping natural-language queries to a ranked list of relevant articles) and comparing the results to Google Scholar. However, this evaluation was based on the LLM-based metric used in Undermind itself, which is both proprietary (and thus not directly publicly accessible) and potentially prone to errors correlated with their workflow. Ideally, we want something like this benchmark (in that it maps natural-language queries to ranked lists) that is both publicly accessible and as non-automated as possible, to minimize any bias.


The best currently existing benchmark for this is LitSearch, a benchmark mapping queries to papers: specifically, recently published ML and NLP research papers. It was developed by first collecting the dataset of papers, then asking both automated systems and researchers (in particular, the authors of the papers themselves) to generate questions for which they would expect that paper to be highly ranked by a competent research assistant. While the scientific field covered by this benchmark (contemporary natural language processing papers) is somewhat narrow and differs from Inquisite’s target use cases, it is the most relevant approach we have identified for evaluating AI research assistants. The methodology behind generating it (particularly the automated query-generation portion) is itself fairly general, and could thus be extended to other domains to compile a broader benchmark. At Inquisite we are currently developing plans to do so, and will post news of our progress when we have it.
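
The automated query-generation step could look roughly like the sketch below: given a paper’s title and abstract, ask an LLM to write a literature-search question for which that paper should be a top result. This is a generic illustration under assumed names (model and prompt are placeholders), not LitSearch’s actual prompts or filtering pipeline.

```python
# A minimal sketch of LitSearch-style automated query generation, which a
# broader benchmark could apply to papers from other scientific domains.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_query(title: str, abstract: str) -> str:
    prompt = (
        "You are helping build a literature-search benchmark. Write one "
        "specific, natural-language research question that a scientist "
        "might ask, for which the paper below should be among the top "
        "results. Do not mention the paper by name.\n\n"
        f"Title: {title}\n\nAbstract: {abstract}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# Each generated (query, paper) pair then becomes a benchmark item, ideally
# after a manual relevance check by a domain expert.
```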


Benchmarks for AI applications are a tricky and diverse thing, and there is no one-size-fits-all solution – even for a simple automated task, which generating automated literature reviews is decidedly not. While there are several somewhat-relevant datasets and evaluative benchmarks out there, we have not been able to identify an existing benchmark that is sufficient to test and demonstrate the efficacy of Inquisite’s research assistant. As such, we are currently working on developing our own evaluative framework to better assess the thoroughness and relevance of our research assistant’s results. We plan to share it openly for use by others developing similar tools, in order to bring greater transparency to the performance of AI tools used in literature search and synthesis.
