How to use linked data to improve R&D efficiency
Lately we have been working on modules that enable our suite of products to automatically determine the quality of data produced in high-throughput measurement systems. The solution we have produced makes a prediction about the quality of final result values based on the values of process parameters used during the calculation. The approach leverages human knowledge about what kind of input data will result in good output – it’s an automated “garbage-in, garbage-out” detector.
In chromatographic peak picking, a large number of parameters determine the quality of computed results, and high-throughput systems can produce a lot of peaks. If humans have to review peaks prior to releasing results, the overall throughput of the system will be limited by the manual steps. Further, humans are actually pretty bad at detecting real problems in high-dimensional data, and we positively hate sitting in front of a computer screen staring at peak after peak after peak for hours on end. The problem screams for a computer to help, and thankfully there are solid approaches that eliminate the time sink (think credit card fraud detection). Although this is not a visualization problem, I will use a picture to explain the approach. With a 2-D projection of a multidimensional data set you can see the outliers (hint: if you can see them, so can a computer).

Peaks that fall outside a region known to produce good results should be checked. The weighting and projection angles were selected through human validation and are then used by an automated program.
This plot is part of our validation process for automated processing, and it can be made with many tools. We happen to like using R for algorithm development since it is a programming environment which allows us to develop high-powered statistical methods that can be used directly in our modules and solutions. This output from rggobi (the R interface to GGobi) shows the weighted projection of eight peak picking parameters onto a plane, and shows outliers which may represent errors in a quantitative analysis process. Again, the goal is not to turn a classification problem (good vs. bad) into a visualization problem, but to show that an n-dimensional system can be set up and run automatically over datasets to make predictions. If everything within the ellipse can reliably be used without human checking, then overall review time is shortened and attention goes to the data that is more likely to be problematic.
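The weighted projection and the ellipse check described above can be sketched in a few lines of Python. This is a minimal illustration, not our production module: the function names (`project`, `fit_ellipse`, `needs_review`), the default radius of 3, and the choice of a covariance-based ellipse are assumptions made for the example. In practice the weights and projection axes would come from the human validation step.

```python
def project(peak, weights, axis_x, axis_y):
    """Weight each parameter, then project the vector onto two plane axes."""
    weighted = [w * v for w, v in zip(weights, peak)]
    x = sum(a * v for a, v in zip(axis_x, weighted))
    y = sum(a * v for a, v in zip(axis_y, weighted))
    return (x, y)


def fit_ellipse(points):
    """Estimate the ellipse (center and 2x2 covariance) from projected
    points of peaks that a human has already validated as good."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two validated points")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def needs_review(point, center, cov, radius=3.0):
    """Return True when the point lies outside the ellipse, i.e. its
    squared Mahalanobis distance from the center exceeds radius**2.
    The radius is a hypothetical default chosen for this sketch."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance is degenerate; add more validated points")
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 > radius ** 2
```

A new peak is projected with the same weights and axes as the validated set, then passed to `needs_review`. Peaks that return `False` can be released automatically, and the rest are queued for a human.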
The challenge for these types of applications has never really been the math. The challenge is building a practical system in which the math can be used. It’s a shame that so little of this type of analysis is actually done in today’s labs: people could be spending their time more effectively and operations could be more efficient (instead of being the target of outsourcing to reduce the labor rate on all those manual processes). Again, it’s not the math of prediction that’s difficult; it’s getting the data into this type of leveraged prediction that’s hard.
If you look at the software used in most laboratories, you will find that most of the code is actually running the low-level workflow. It takes a shocking amount of code to even collect data these days. However, once the data is collected in the lab, you are still a long way from being able to perform leveraged predictions like the example above. We look at the situation like this:

To get data into systems which perform predictions and leverage human knowledge, data from lab workflows must be integrated and made actionable.
You have software for workflows, but before you can get to the maximum enterprise value, you need to share the knowledge (like what good peak picking means for you). Before you can share the knowledge, you have to make the information actionable. You need systems which allow you to act on data directly and quickly. The challenge is being able to understand the data you look at, because most people don’t have a complete picture of the experimental process. That’s where data linking comes in.
Information integration is a key requirement for achieving the ultimate goal. Even in companies where there is a knowledge sharing strategy (you know who you are), information integration has been expensive and difficult. The ability of knowledge sharing to impact the efficiency of the organization depends on the effectiveness of the information integration approach. Too many labs have great workflow processes which result in proprietary data locked into silos, where it cannot be shared effectively, let alone used for leveraged prediction. Too many knowledge sharing systems focus on reports and documents. Just try to build an automated data review and release system when the input data is a giant collection of PDF files randomly scattered across enterprise file systems or indexed by an ELN. Efficiency will result when knowledge sharing is based on linked data which is integrated and interoperable.
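To make the idea of linked data concrete, here is a small sketch in Python. It assumes two hypothetical systems, a sample tracking system (`lims`) and a chromatography data system (`cds`), that each export facts as subject, predicate, object triples using shared identifiers. All identifiers, predicate names, and helper functions are invented for the example and are not taken from any product.

```python
# Facts exported by a hypothetical sample tracking system.
lims = {
    ("sample:S-42", "partOf", "study:ST-7"),
    ("sample:S-42", "preparedAt", "site:contract-lab"),
}

# Facts exported by a hypothetical chromatography data system.
cds = {
    ("peak:P-1001", "detectedIn", "sample:S-42"),
    ("peak:P-1001", "tailingFactor", 1.1),
    ("peak:P-1002", "detectedIn", "sample:S-42"),
    ("peak:P-1002", "tailingFactor", 2.6),
}


def objects(graph, subject, predicate):
    """Values linked from a subject by a given predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def subjects(graph, predicate, obj):
    """Subjects that link to a given object, sorted for stable output."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


def peaks_for_study(graph, study):
    """Follow study -> sample -> peak links across both sources."""
    peaks = []
    for sample in subjects(graph, "partOf", study):
        peaks.extend(subjects(graph, "detectedIn", sample))
    return peaks


# Because both sources use the same identifiers, integration is a set union.
graph = lims | cds
```

Neither source alone can answer "which peaks belong to this study", but the merged graph can, and each peak's parameters are then one lookup away for an automated review step. A real deployment would more likely use RDF and a triple store than Python sets; the sketch only shows why shared identifiers make the integration step cheap.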
Indigo’s approach to automated systems uses open linked data for the information integration step. We give human experts the ability to develop knowledge about processes which can then be turned into automated tools that impact their workflows and their organizations.
The information integration step is critical. Now ask: “How many of the workflows I depend on are still inside my company?” The answer you give this year is probably lower than last year. Welcome to globally distributed R&D – distance and organizational boundaries make the integration step more important than ever before. Next, we’ll talk about how our products tackle the distributed R&D challenge.




