Jonathan Montgomery of Indigo Designs Winning LLM Retrieval System For Laboratory Documents
Indigo BioAutomation Systems Specialist and member of Indigo’s R&D team, Jonathan Montgomery, recently took on the “ADLM Data Science Group: LabDocs Unlocked Challenge.” The challenge, developed by the ADLM Data Analytics Steering Committee, asked teams to build an AI tool that could help laboratories find information in document repositories more easily and accurately answer questions asked in plain language. Jonathan’s submission, a purpose-built combination of exact term matching and context-based concept matching, was selected as one of two finalists, along with a team from Baylor College of Medicine and Texas Children’s Hospital. On February 26, in a live head-to-head presentation of approaches and demonstrations (now available on demand), Jonathan’s system was selected as the winner by webinar participants.
The Challenge
The LabDocs Challenge tasked participants with designing an AI-based retrieval system capable of extracting reliable answers from a repository of laboratory documents. The dataset comprised 13,600 PDF documents of varying sizes and formats, split into two categories: SOPs (standard operating procedures) and FDA 510(k) documents. To be successful, the system needed to accurately answer questions written in plain language based on the content of the documents in the data set, and to provide quick access to the documents used to generate the answers. Naturally, the system had to be fast and stable enough to run in a live demonstration.
Read more about the challenge here.
Built For Purpose
Laboratory medicine is not a “one-size-fits-all” discipline. Some very specific technical terms have no counterparts in everyday language, and others require a specialized scientific context to understand.
Drawing on his extensive experience with lab documentation and customer needs, Jonathan developed a hybrid retrieval design combining exact-term matching (BM25) with semantic embedding search. This approach has been well established in the information retrieval community since around 2020, based on work done by Karpukhin and others, and was recently highlighted by the Anthropic engineering team. Embedding documents with an LLM (large language model) requires pre-processing them first so that current-generation LLMs can work with them.
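For readers who want a feel for how a hybrid design of this kind fits together, the following is a minimal sketch and not Jonathan's actual implementation. It scores documents with a small hand-written BM25 and with a stand-in "embedding" (a bag of character trigrams, used here only so the example runs without a model), then merges the two rankings with reciprocal rank fusion. The fusion method, the function names, the parameter values, and the sample documents are all assumptions made for illustration; the source does not say how the two result sets were combined.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Exact-term matching: one BM25 score per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """Stand-in for an embedding model (assumption): character trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(scores):
    """Document indices ordered best-first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def hybrid_search(query, docs, top_k=3, rrf_k=60):
    """Fuse lexical and semantic rankings with reciprocal rank fusion."""
    lexical = rank(bm25_scores(query, docs))
    q_vec = toy_embed(query)
    semantic = rank([cosine(q_vec, toy_embed(d)) for d in docs])
    fused = Counter()
    for ranking in (lexical, semantic):
        for position, doc_id in enumerate(ranking):
            fused[doc_id] += 1.0 / (rrf_k + position + 1)
    return [doc_id for doc_id, _ in fused.most_common(top_k)]


# Hypothetical sample documents, not taken from the challenge dataset.
docs = [
    "Hemolysis index threshold for potassium specimens",
    "Daily maintenance procedure for the centrifuge",
    "Calibration schedule for the mass spectrometer",
]
print(hybrid_search("potassium hemolysis threshold", docs, top_k=2))
```

In a real system the trigram stand-in would be replaced by a call to an embedding model, and the exact-term side is what keeps rare technical vocabulary from being lost when the semantic side generalizes too much.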
Current LLMs can reliably work on only a limited amount of text at a time, and that amount is smaller than most documents. That meant using a common method called document chunking, but in a way that considers how laboratory medicine documents give language meaning by context. By chunking in a way that preserved the medical content of the chunks, Jonathan generated medically meaningful embeddings and evaluated them using advanced embedding-space visualization techniques. Accurate embedding allows the LLM to focus the query and retrieve a complete answer, across multiple documents if necessary. “Designing proper chunking and using the right embeddings was critical,” Jonathan explained. Further testing was done on the response time of various LLMs and on the content of system-level prompts to tune how answers were presented.
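The chunking idea described above can be illustrated with a short sketch. This is one possible approach under stated assumptions, not the method used in the winning submission: it never lets a chunk cross a section heading, prefixes each chunk with its heading so the context travels with the text, and overlaps consecutive chunks by a few words. The heading rule (a short line that does not end in a period), the function name, and the size limits are hypothetical choices for the example.

```python
def chunk_document(text, max_words=120, overlap=20, is_heading=None):
    """Split text into heading-prefixed chunks that stay within one section."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    if is_heading is None:
        # Hypothetical heading rule: a short line that does not end in a period.
        def is_heading(line):
            return len(line.split()) <= 6 and not line.endswith(".")

    # Pass 1: group body words under the most recent heading.
    sections = []
    heading, words = "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if is_heading(line):
            if words:
                sections.append((heading, words))
            heading, words = line, []
        else:
            words.extend(line.split())
    if words:
        sections.append((heading, words))

    # Pass 2: slide an overlapping window over each section separately.
    chunks = []
    step = max_words - overlap
    for heading, words in sections:
        for start in range(0, len(words), step):
            body = " ".join(words[start:start + max_words])
            chunks.append(f"{heading}: {body}" if heading else body)
            if start + max_words >= len(words):
                break  # the window already reached the end of the section
    return chunks


# Hypothetical SOP fragment, not taken from the challenge dataset.
sop = """Specimen Requirements
Collect serum in a red-top tube. Reject hemolyzed specimens.
Calibration
Run a two-point calibration before each batch and record the results.
"""
for chunk in chunk_document(sop, max_words=8, overlap=2):
    print(chunk)
```

Keeping the heading with every chunk is a simple way to preserve context: a sentence such as "Reject hemolyzed specimens" means more to an embedding model when it arrives labeled as part of the specimen requirements.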
Named The Winner
After being selected as one of two finalists, Jonathan presented the solution during an ADLM public webinar. Both teams reviewed their system designs and key decisions and then gave a live demonstration. Their systems were asked to answer the same three queries, allowing the audience to compare the quality, accuracy, and speed of the answers. After live voting, Jonathan and the Indigo team were named the winners.
What's Next
Winning the challenge was a lot of fun in the moment, but more important is continuing the work reflected in the submission. The R&D team sees substantial opportunity for AI to improve knowledge sharing, standardization, and collaboration across the workflow and continues to explore its potential through focused research and experimentation. More broadly, the Indigo team remains committed to advancing solutions that expand the capacity, capabilities, and confidence of our customers and diagnostic and forensic testing in general.
More About Indigo
Indigo BioAutomation develops software and computational capabilities that bring structure, clarity, and operational discipline to clinical and forensic laboratory environments. Drawing on deep experience in analytical chemistry, laboratory science, and advanced data methods, the company focuses on improving the precision, efficiency, and scalability of diagnostic testing. ISO 13485 certified since 2014 and with software currently processing more than 15 million instrument signals a day, Indigo’s products and solutions reflect a long-standing commitment to scientific rigor and responsible innovation.
Sharing ADLM’s belief in the importance of expanding the impact of data science, Indigo was an early supporter of ADLM Data Science Group, serving as an inaugural sponsor of both its annual ADLM event and the recently announced Data Science Certificate program. Find out more about the Certificate program here.