Designing AI Retrieval for the Realities of Laboratory Data

Jonathan Montgomery of Indigo Designs Winning LLM Retrieval System For Laboratory Documents

Indigo BioAutomation Systems Specialist and member of Indigo’s R&D team, Jonathan Montgomery, recently took on the “ADLM Data Science Group: LabDocs Unlocked Challenge.” The challenge, developed by the ADLM Data Analytics Steering Committee, asked teams to build an AI tool that could help laboratories find information in document repositories more easily and accurately answer questions posed in plain language. Jonathan’s submission, a purpose-built combination of exact term matching and context-based concept matching, was selected as one of two finalists, along with a team from Baylor College of Medicine and Texas Children’s Hospital. On February 26, in a live head-to-head presentation of approaches and demonstrations (now available on demand), webinar participants selected Jonathan’s system as the winner.

Watch the LabDocs Unlocked and LLMs: Data Challenge Finalist Showdown On-Demand Now

The Challenge

The LabDocs Challenge tasked participants with designing an AI-based retrieval system capable of extracting reliable answers from a repository of laboratory documents. The dataset comprised 13,600 PDF documents of varying sizes and formats, split into two categories: SOPs (standard operating procedures) and FDA 510(k) documents. To be successful, the system needed to accurately answer questions written in plain language based on the content of the documents in the dataset, and to provide quick access to the documents used to generate the answers. The system also had to be fast and stable enough to run in a live demonstration.

Built For Purpose

Laboratory medicine is not a “one-size-fits-all” discipline. Some very specific technical terms have no counterparts in everyday language, and others require a specialized scientific context to understand.

"The R&D team was ready for a challenge like this. We work with so much raw measurement and operations data that this was a modest scale problem for Indigo. We had already explored how AI could be used safely for our internal use and tested some RAG or `Retrieval Augmented Generative` models on our own documentation. The ADLM challenge was a chance to test our work outside the R&D team."

Drawing on his extensive experience with lab documentation and customer needs, Jonathan developed a hybrid retrieval design combining exact-term matching (BM25) with semantic embedding search. This approach has been well established in the information retrieval community since around 2020, based on work done by Karpukhin and others, and was recently highlighted by the Anthropic engineering team. Using an LLM (large language model) approach to embedding requires pre-processing documents so that current-generation LLMs can work with them.
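The article does not describe how the two retrieval signals were combined, so the following is only a minimal sketch of one common way to build a hybrid retriever, not Jonathan's implementation. It scores documents with a small hand-written BM25 function, ranks them separately by cosine similarity of embedding vectors, and merges the two rankings with reciprocal rank fusion. The function names, the sample documents, and the two-dimensional vectors (which stand in for the output of a real embedding model) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (docs must be non-empty)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranking(scores):
    """Document indices ordered best-first; ties keep the original order."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first rankings; each adds 1 / (k + rank) per document."""
    fused = Counter()
    for order in rankings:
        for rank, doc_index in enumerate(order, start=1):
            fused[doc_index] += 1.0 / (k + rank)
    return sorted(fused, key=lambda i: (-fused[i], i))


def hybrid_rank(query, docs, query_vec, doc_vecs):
    """Fuse the exact-term (BM25) ranking with the embedding ranking."""
    lexical = ranking(bm25_scores(query, docs))
    semantic = ranking([cosine(query_vec, v) for v in doc_vecs])
    return reciprocal_rank_fusion([lexical, semantic])


# Hypothetical sample data; the vectors are made up for illustration only.
DOCS = [
    "Reject hemolyzed specimens and request recollection.",
    "Store reagents at 2 to 8 degrees Celsius.",
    "Samples showing red cell breakdown are unsuitable for potassium testing.",
]
DOC_VECS = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3]]
QUERY = "hemolyzed potassium sample"
QUERY_VEC = [1.0, 0.2]

if __name__ == "__main__":
    for doc_index in hybrid_rank(QUERY, DOCS, QUERY_VEC, DOC_VECS):
        print(doc_index, DOCS[doc_index])
```

Rank fusion is used here because it needs only the order of results, so BM25 scores and cosine similarities never have to be put on a common scale. A weighted sum of normalized scores would be an equally reasonable choice.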

LLM Retrieval System: Indigo Design Highlights

Medical domain-aware document chunking
Validated embedding generation
Hybrid retrieval – semantic & exact match
Tunable response
Transparent source attribution

Current LLMs can work reliably on only a limited amount of text at a time, and that amount is smaller than most documents. That meant using a common method called document chunking, but in a way that considers how laboratory medicine documents give language meaning by context. By chunking in a way that preserved the medical content of the chunks, Jonathan generated medically meaningful embeddings and evaluated them using advanced embedding-space visualization techniques. Accurate embedding allows the LLM to focus the query and retrieve a complete answer, across multiple documents if necessary. “Designing proper chunking and using the right embeddings was critical,” Jonathan explained. Further testing was done on the response time of various LLMs and on the content of system-level prompts to tune how answers were presented.
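The article does not spell out the chunking rules, so the sketch below illustrates one plausible form of context-preserving chunking and should not be read as the method used in the submission. It assumes an SOP is plain text with section headings on their own lines, keeps each section together where it fits, splits long sections into overlapping word windows, and prefixes every chunk with its section heading so the chunk still carries its context. The heading list, the `Chunk` fields, and the size limits are hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical heading names; a real SOP collection would need its own list.
HEADING = re.compile(
    r"^(purpose|scope|specimen requirements|reagents|procedure|"
    r"quality control|limitations|references)\s*:?\s*$",
    re.IGNORECASE,
)


@dataclass
class Chunk:
    doc_id: str   # source document, kept so answers can cite it
    section: str  # heading the text was found under
    text: str     # heading-prefixed chunk text to be embedded


def chunk_document(doc_id, text, max_words=200, overlap=20):
    """Split a document into section-aware chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")

    # Pass 1: group non-empty lines under the most recent heading.
    sections = []
    heading, lines = "Preamble", []
    for line in text.splitlines():
        stripped = line.strip()
        if HEADING.match(stripped):
            if lines:
                sections.append((heading, lines))
            heading, lines = stripped.rstrip(": ").title(), []
        elif stripped:
            lines.append(stripped)
    if lines:
        sections.append((heading, lines))

    # Pass 2: cut each section into overlapping word windows.
    chunks = []
    step = max_words - overlap
    for heading, lines in sections:
        words = " ".join(lines).split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(doc_id, heading, f"{heading}: {' '.join(window)}"))
            if start + max_words >= len(words):
                break  # the window already reached the end of the section
    return chunks


# Hypothetical sample document for illustration only.
SAMPLE_SOP = (
    "Purpose\n"
    "Measure potassium in serum.\n"
    "Specimen Requirements:\n"
    "Reject hemolyzed specimens.\n"
    "Procedure\n"
    + " ".join(f"step{i}" for i in range(10))
)

if __name__ == "__main__":
    for chunk in chunk_document("SOP-001", SAMPLE_SOP, max_words=6, overlap=2):
        print(chunk.section, "|", chunk.text)
```

Keeping the document identifier and section name on every chunk is also what makes source attribution straightforward later: whichever chunks are retrieved for an answer can be traced back to a specific document and section.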

Named The Winner

After being selected as one of two finalists, Jonathan presented the solution during an ADLM public webinar. Both teams reviewed their system designs and key decisions and then gave a live demonstration. Their models were tasked with answering three identical queries, allowing the audience to compare the quality, accuracy, and speed of the answers. After live voting, Jonathan and the Indigo team were named the winners.

What's Next

Winning the challenge was a lot of fun in the moment, but more important is continuing the work reflected in the submission. The R&D team sees substantial opportunity for AI to improve knowledge sharing, standardization, and collaboration across the workflow and continues to explore its potential through focused research and experimentation. More broadly, the Indigo team remains committed to advancing solutions that expand the capacity, capabilities, and confidence of our customers and of diagnostic and forensic testing in general.

More About Indigo

Indigo BioAutomation develops software and computational capabilities that bring structure, clarity, and operational discipline to clinical and forensic laboratory environments. Drawing on deep experience in analytical chemistry, laboratory science, and advanced data methods, the company focuses on improving the precision, efficiency, and scalability of diagnostic testing. ISO 13485 certified since 2014 and with software currently processing more than 15 million instrument signals a day, Indigo’s products and solutions reflect a long-standing commitment to scientific rigor and responsible innovation.

“We're really proud of Jonathan and the work he put into his submission. The whole R&D team supported his effort, and we're proud to have been a part of the challenge. It reflects Indigo's approach to innovation: laboratory medicine is a team sport. The potential of AI is exciting, but there are a lot of unknowns. This challenge and outcome were a perfect demonstration of our approach to solving laboratory problems starting from first principles.”

Sharing ADLM’s belief in the importance of expanding the impact of data science, Indigo was an early supporter of ADLM Data Science Group, serving as an inaugural sponsor of both its annual ADLM event and the recently announced Data Science Certificate program. Find out more about the Certificate program here.

Find out about analytical operations software available from Indigo BioAutomation
