Development of a Small Volume Sampling Technique and LC-MS Orbitrap Assay for Pediatric Pharmacokinetic Studies of Fentanyl and its Metabolites (Uwe Christians, Clinical Research and Development, Department of Anesthesiology, University of Colorado Denver). In this talk, Uwe showed the sensitivity and selectivity of using full scan data for PK type data. Because you don’t have to select a specific transition, you get a full spectrum during each acquisition which can be interrogated later for metabolites that may not have been part of an initial hypothesis. Kevin Bateman from Merck showed this type of experiment a few years ago at ASMS, but it appears that the Orbitrap can really do this experiment. This blows up the amount of data collected in a PK study by at least an order of magnitude, and it increases the value of data stored for long-term access by at least as much.
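To make the "interrogate it later" idea concrete, here is a minimal sketch of pulling an ion trace for a metabolite out of full-scan data that has already been stored. It is not the assay from the talk: the scan layout, the function name, the 5 ppm window and the toy intensities are all assumptions made for illustration, and the two m/z values are approximate [M+H]+ masses.

```python
from typing import List, Tuple

# One full-scan spectrum: (retention time in minutes, list of (m/z, intensity) pairs).
Spectrum = Tuple[float, List[Tuple[float, float]]]


def extract_ion_chromatogram(
    scans: List[Spectrum], target_mz: float, tol_ppm: float = 5.0
) -> List[Tuple[float, float]]:
    """Sum the intensity within +/- tol_ppm of target_mz in every stored scan."""
    half_width = target_mz * tol_ppm / 1e6
    trace = []
    for rt, peaks in scans:
        total = sum(i for mz, i in peaks if abs(mz - target_mz) <= half_width)
        trace.append((rt, total))
    return trace


# Toy data (invented): three scans from a hypothetical stored run.
scans = [
    (1.0, [(233.1648, 10.0), (337.2274, 5.0), (400.0000, 50.0)]),
    (1.1, [(233.1650, 120.0), (337.2275, 900.0)]),
    (1.2, [(233.1647, 15.0), (337.2400, 800.0)]),  # 337.2400 is ~37 ppm off target
]

FENTANYL_MH = 337.2274     # approximate [M+H]+ of fentanyl, the original target
NORFENTANYL_MH = 233.1648  # approximate [M+H]+ of norfentanyl, queried afterwards

fentanyl_trace = extract_ion_chromatogram(scans, FENTANYL_MH)
norfentanyl_trace = extract_ion_chromatogram(scans, NORFENTANYL_MH)
```

The second call is the point: nothing about the acquisition had to anticipate the metabolite, because every scan kept the whole spectrum.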
Dr. Hood suggests that signal-to-noise in biological measurements is so bad that we must use 1) statistics, 2) a deep understanding of the pathways and 3) data integration to make any progress. All of his data is shown as networks – perfectly aligned with the large-scale linked data approach Indigo uses. Hood said: “Medicine is becoming an information science.” If that’s true, new approaches to informatics and IT will be essential.
Indigo BioSystems is now using BlueLock LLC to provide hosting/infrastructure for the Indigo Platform and our Software-as-a-Service offerings. Indigo selected BlueLock for security-intensive applications in the pharmaceutical industry and apparently we are not alone.
From ComputerWorld: “Cloud security: Try these techniques now”
BlueLock’s virtualized environment allowed data and volumes to move between systems in a dynamic, low-cost way that would be impossible with a traditional, hosted environment, Westgate says.
There were, however, security concerns to be addressed before Logiq³ would entrust its critical systems to BlueLock’s cloud. The life reinsurance company handles death records, which include personal information like social security numbers, as well as financial data and information about major assets that its large financial customers have on their books. Although Logiq³ isn’t regulated by the U.S. government’s Sarbanes-Oxley Act, its customers in the financial sector are, “so they’ll be auditing us,” says Westgate. As a result, Logiq³ needed potential cloud vendors to demonstrate that they were in compliance with applicable regulations and could provide high levels of security.
The thing we like about BlueLock is the data protection architecture and the ability to perform audits while still achieving the elasticity and location transparency needed for SaaS. We too are audited by our customers to ensure our applications protect data and prevent tampering. The idea of separating roles is key to security in externally hosted systems. Our approach discussed at the ALA Conference takes the separation one step further by encrypting the data so that neither the Indigo admins nor the BlueLock admins have the keys needed to access customer data.
Encryption adds to the security enabled by the “division of labor” described in the article:
The division of labor between Logiq³ and BlueLock actually strengthened security, because “no one person, or company, has all the keys to the kingdom.” says Westgate. Because BlueLock manages the firewall, for example, “none of my admins can go in and decide to sell or move the data,” he notes. “And BlueLock admins can’t do it either, because they don’t control the systems.”
Audits and accreditation are also needed because as good as this all sounds it won’t work if the SOPs are not being followed, or if there are holes in the procedures.
Therefore, due diligence is critical, Anderson says. Pfizer uses SAS 70 Type 2 certification, in which an independent third party audits the service provider’s internal and data security controls. Anderson also verifies the vendor’s level of Safe Harbor compliance and checks Dun & Bradstreet research to make sure it’s legitimate, he adds.
Another standard by which to evaluate a service provider is ISO 27001, which defines best practices for designing and implementing secure and compliant IT systems.
While such standards provide a useful starting point, their criteria tend to be generic, says Gartner’s Heiser. Companies still need to match a service provider’s specific controls to their specific requirements, he adds.
For example, after checking out BlueLock’s SAS 70 Type 2 accreditation, Logiq³’s IT staff did a further evaluation to “make sure the controls we require are supported by the controls they have in place,” Westgate says. His team then followed up on discrepancies, identifying missing controls and working with the vendor on solutions. The company plans to repeat the process at least once a year, he says.
It is clear that shared services and externally hosted data are a part of pharma’s future. Indigo is working hard to make sure that its customers gain the benefits of this new approach while minimizing the risks.
To read more of what we are up to, check out our website and blog.
I served on a panel in the informatics track at the Lab Automation Conference last week with people from Pfizer. We were each allowed a couple of slides to do an introduction to a key point. My slides are attached to this post. The interesting thing to me was how aligned Indigo and the Pfizer scientists were on the use of shared services to improve productivity in research. The idea expressed on the panel was that we can make our relationships with collaborators much richer by making data “location transparent” and the computational resources needed to process them “elastic”. These are the two main promises of so-called cloud computing infrastructures. The key is to encrypt everything in the shared service using security standards developed by other industries to ensure data protection while gaining the ‘elasticity’ and ‘location transparency’ by allowing selective access to data to those who need it.
The key idea expressed by the audience was that data security is the top concern of research organizations considering or using shared infrastructure. I was delighted that there was strong agreement between Indigo and Pfizer on how to solve this problem and that the benefits would be an increase in productivity for everyone.
At the “Chemical and Pharmaceutical Structure Analysis” (CPSA) conference last week, Indigo BioSystems presented an overview of our informatics platform. In my previous post I gave a rationale for simplifying to a single platform application and in my talk, I shared some thoughts contrasting the platform to using manually-intensive collaborative and analysis tools like e-mail and spreadsheets.
CPSA is undoubtedly the best conference for anyone working in bioanalysis today. The sense of community and personal connection created at this meeting are simply not possible at a larger meeting, and the scientific content is world-class.
My humble contribution to the meeting can be viewed via my slides: Julian CPSA 2009.
This blog is intended to give an inside view of Indigo’s product development including the rationale and thought processes we go through to create products that solve problems with data analysis in the life sciences. I usually try to explain the ideas behind our products.
This post is a little different.
As a technologist in a software company it is easy to surround yourself with like-minded technophiles. We thrive on the details of our solutions (which we sometimes treat better than we do our own children – just ask our spouses). But at some point, unless you are such a loser that you really want to unleash another crappy product with a blinking 12:00 on the display, you have to give up the love of complexity and strive for simplicity.
Easier said than done. It’s so hard it will make you want to quit. If you do, you’re in good company: crappy products are everywhere. If you don’t quit, you have a shot at really helping people.
I’m not saying that we’ve achieved the ultimate goal, but now that we know our approach to automating data analysis actually works, we have started the drive toward simplicity. This push has resulted in some unexpected benefits.
This is a long post so I will cut to the punch line. We have simplified our product to a platform that scientists can use to automate data analysis. After the work described below, it now looks like this:

It is an application which has a data repository function that makes raw and processed data available to an automation framework which includes all the features needed to automate lab data analysis. Period.
How we got to this from our “architecture” diagram:

is the story I’ll tell in this blog.
This summer I gave a talk at an AAPS meeting in Seattle on the “21st Century Bioanalytical Laboratory” and, during the vendor-heavy session I was in, the usual suspects showed up. “Get ‘yur LIMS here”, “What? You don’t have an eLN? What are you a caveman?” I wanted to shake everyone in the audience and say: “Do you really think that doing the same thing you’ve done for years will produce different results?” Come on. To paraphrase Kevin Spacey in K-PAX: “With so many doctors on this planet, why are so many people still sick?”
I loaded my PowerPoint for bear and got ready to try and wake everyone up. By the way, if you can choose to go last in a vendor-heavy session, always do it. It lets you search Google for images to make the big money marketing guys look flat footed. You can make your point in more contrast than the big boys whose uptight presentations feel like a slide-whipping…I’m just sayin’.
You can download my slides by clicking here.
The gist was this: we are doing informatics wrong for how pharma works these days, and if you keep doing things the old way, you will probably get laid off. Odds are you will get laid off anyway, but why walk around wearing a sign that says “You can probably do without me.”?
One way to not get laid off is to perform with magical efficiency. They cut your budget – you get more done. They take away people – you get more done. They ask you to work with CROs from far, far away – you get more done.
It seems unlikely that magical efficiency will come without using computers better than you do today. My favorite customer says most of pharma has reached the rarefied air of using Excel to do everything. He calls this “Spreadsheet Monkey Business” and can calculate the lost efficiency – it’s scary. Why are people content to abuse spreadsheets while waiting for the next round of cut-backs? I think it’s because Excel is easy to understand and Big Thinking Change is hard to understand.
At the end of my talk someone came to the microphone and earnestly said: “I understand LIMS, I understand eLN, but Indigo, I have never heard of anything like what you were talking about”. A few years ago that comment would have made my head swell with pride. It’s so nice to have someone confirm your genius by admitting to the room they don’t understand you. Not this time. This time, it felt like a wasp had stung my eyeball. I almost blacked out. If our product is so complex that a 20 minute talk, replete with jokes, teasing about Bioanalytical LIMS (sorry – that’s just too easy), and plenty of pictures, can’t get it across, we’ve got a big problem. It’s not just the story, it’s the way we think about the product that matters – what does it take for people to understand a new complex product? It’s only their jobs and the salvation of the industry that’s at stake. It takes a lot – new things are hard to understand. Give an audience ten new things, and you are talking gibberish.
It started when we committed the sin of creating artificial distinctions between the components of our platform. We did it so that we could tell everyone how smart we were and show them we were a real company because we had lots of cool products. It let us talk (and talk and talk) about how clever each and every part really is. Why? Steve Jobs doesn’t do this. Hell, he glued the iPhone shut so that no one would talk about its guts. Trust me they are WAY cool, but they don’t really matter. Not that everyone got the iPhone idea at first either (Why would I want an “app” for my phone?). They got that it was a phone and that it could do some other stuff too. Now looking at Apple’s “magical” profits, people are starting to sort of get it.
Indigo started with the idea that we could automate the slow, manual, error-prone analysis of lab data. We could speed up drug discovery and development, help cure disease and ease human suffering, and the market would pay for that help. We now know that this is true – now more than ever before. But automating data analysis is like saying “work smarter, not harder”. How do you actually pull it off? We figured out that in order to automate data analysis, first you have to organize data so that software can get to the raw materials of analysis easily. That means data integration and aggregation.
We are driving toward a basic science workflow:

Think of an industrial robot building iPhones (I don’t know if this is how they do it, but it should be). Robot arms are cool, but they are bolted somewhere and the parts they assemble have to be within reach. Also, if the parts are all in boxes with tape and foam peanuts, the robot will have trouble getting the part to grip. So, we need to bring the parts together, and “standardize” them, at least to the point that the robot can pick them up and make your iPhone.
We chose to use RDF and data linking (see my earlier post on this) to store structured data and a separate component called “HyperStore” and “OpenStore” to store unstructured data (XML files with raw data, etc.). Why two components? Bragging – the kind of hubris for which Zeus will smite you. It doesn’t really matter how it works. And there don’t need to be two components. It’s just a repository. It holds structured and unstructured data. The structured bit can hold the entire World Wide Web (we are not kidding about this – it scales to biblical proportions). The unstructured bit is blindingly fast and will allow you to search all your data at Google speeds (at least if you use our cloud approach and we parallelize the search…). That’s it. The platform has a repository. If you want a propeller-head deep dive come on in and we’ll show you the robot arm making the iPhone, but if you just want to make a call, you won’t care.
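For the propeller-heads, here is a rough sketch of what "structured data as links" means. RDF stores statements as subject-predicate-object triples; the toy class below mimics that with plain tuples. It is not Indigo's repository and not a real RDF library. The class, the identifiers and the predicates are hypothetical, and a real RDF store would use URIs and a query language such as SPARQL.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Tiny in-memory store of subject-predicate-object statements."""

    def __init__(self) -> None:
        self.triples: List[Triple] = []

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.append((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> List[Triple]:
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


store = TripleStore()
# Hypothetical identifiers, invented for this example.
store.add("sample:42", "measuredBy", "instrument:lcms-1")
store.add("sample:42", "hasRawFile", "file:run-0042.xml")
store.add("sample:43", "measuredBy", "instrument:lcms-1")
store.add("file:run-0042.xml", "storedIn", "repository:unstructured")

# Which samples were measured on lcms-1?
samples = [s for s, _, _ in store.match(p="measuredBy", o="instrument:lcms-1")]

# Follow a link from the structured side to the raw file for sample 42.
raw_files = [o for _, _, o in store.match(s="sample:42", p="hasRawFile")]
```

The structured side only needs to hold the link to the raw file; the file itself can live in whatever store is fastest for unstructured data.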
Solving the data integration and standardization problem for laboratory data was so challenging and rewarding that we couldn’t stop talking about it. Here’s another tip: only get involved in international standards bodies. And for real fun get on the site selection and menu committees as quickly as possible…I’m just sayin’.
With the integration task solved, we moved on to how to actually make people’s lives better by delivering functionality that helped people work faster. Here again, we were very clever – “too clever by half” (said the actor to the bishop). Our idea was to create plug-ins out of Java using a modularity standard called OSGi. This is a good idea, but it means that someone has to write actual compiled code to create the module – and that is just too slow. Interestingly, for years we have been using the “R” statistics environment to develop algorithms – most recently a new peak picking approach based on some science fiction that is now working its way through the patent office (more on that later). We would take these algorithms and then move them from “R” into a plug-in. After a while we realized that it makes no sense to have a platform for which you have to write Java plug-ins to get functionality, especially when the R script has been done for months already. Why not just skip straight to providing modules based on R? Everyone loves R. So, we created a component to run R scripts that operates on data in the repository. This way, as soon as we see something working we can incorporate it directly into a workflow. Customers can get R scripts from us, create them on their own or download any of the 2500 packages on CRAN and Bioconductor.
For example, we apply machine learning techniques implemented in R to automated chromatographic peak review and almost eliminate manual modification of peak-picking parameters. You can look at the lectures from a course I taught on this at Purdue a few years ago, but it starts with feature selection:
Then you can have experts help build training sets:

And then you can automate the identification of bad peaks:

Now we are getting somewhere: Indigo is a platform that has a repository which can automatically be populated from instruments and databases and it automates the task of running R scripts to do insanely good things with data – helps you do more with less.
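The three steps above (pick features, have experts label a training set, then flag bad peaks automatically) can be sketched in a few lines. This is a stand-in, not our algorithm: it uses a nearest-centroid classifier in Python rather than the R methods mentioned above, the two features (asymmetry and signal-to-noise) are hypothetical, and the numbers are invented. A real system would also scale the features before measuring distance.

```python
from math import dist
from statistics import mean
from typing import Dict, List, Tuple

Features = Tuple[float, ...]

# Step 1 (feature selection): each peak is reduced to a few numeric features.
# Step 2 (training set): experts label example peaks as "good" or "bad".
training: List[Tuple[Features, str]] = [
    ((1.0, 50.0), "good"),
    ((1.1, 60.0), "good"),
    ((0.9, 55.0), "good"),
    ((2.5, 5.0), "bad"),
    ((3.0, 8.0), "bad"),
]


def fit_centroids(examples: List[Tuple[Features, str]]) -> Dict[str, Features]:
    """Average the feature vectors of each label into one centroid per label."""
    by_label: Dict[str, List[Features]] = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def classify(centroids: Dict[str, Features], features: Features) -> str:
    """Step 3: assign the label of the closest centroid (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(training)
new_peaks = [(1.05, 58.0), (2.8, 6.0)]
labels = [classify(centroids, peak) for peak in new_peaks]
```

Only the peaks labeled "bad" would go to a human; everything else moves on without anyone staring at it.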
The platform needed a few other bits to really make people happy. It needed a way to generate the myriad of files that other systems needed to operate (text files, oh well…if you make me.). OK, we can use a standard template engine and let people add, change and remove templates all they want.
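As a rough idea of what template-driven file generation looks like, here is a sketch using Python's built-in `string.Template`. It is not the template engine the platform uses; the template text, the field names and the result rows are all made up for the example.

```python
from string import Template
from typing import Dict, List

# A hypothetical template for one line of a tab-separated import file.
# Users could add, change or remove templates like this without touching code.
RESULT_LINE = Template("$sample_id\t$analyte\t$concentration $units\n")


def render_rows(template: Template, rows: List[Dict[str, str]]) -> str:
    """Fill the template once per result row and join the lines."""
    return "".join(template.substitute(row) for row in rows)


# Invented result rows standing in for data pulled from the repository.
rows = [
    {"sample_id": "S-001", "analyte": "analyte-A", "concentration": "1.25", "units": "ng/mL"},
    {"sample_id": "S-002", "analyte": "analyte-A", "concentration": "0.80", "units": "ng/mL"},
]

output = render_rows(RESULT_LINE, rows)
```

Because `substitute` raises `KeyError` when a field is missing, a broken template or an incomplete row fails loudly instead of writing a half-filled file.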
We also needed a way for people to create web pages that would provide interactivity. Web pages collect data and put it in the repository, do queries, and fire off R-scripts. The Web UI engine is another open source project which allows editing, uploading and deleting of web forms.
Indigo integrates all these tools together and gives the customer total control. It makes it easy and fun to build automated data analysis processes – not just manual data analysis – which you can do any old way you want (unless you miss the cut on the next round of…well, you know).
That’s it: Indigo is a solution platform that uses an advanced repository to make data directly accessible to the best statistics system in the world and includes all the connective bits to talk to instruments, databases and other systems.
It’s just Indigo. It’s one thing. It has one name (like Oracle, Sting or Cher). If you want to know how the internal parts work you can do an exam, but looking at the insides won’t make you appreciate what it does unless you are already an Indigo employee.
Next time, I will talk about how if you have Indigo, you can do almost anything with R. I also want to explain how by putting the platform on the cloud you gain insane amounts of power cheap. The goal is magical efficiency – you can do it.
During my undergraduate studies at Purdue, a friend of mine once said, “An engineering degree indicates you are capable of learning a lot of difficult material quickly.” Those words have had more and more impact on me the longer I have worked in the software industry. When earning a technical/scientific degree, the most important thing you learn is not so much the material you are studying, but more of how to learn and apply such material. The sharpening and use of this skill is critical.
Time and again when designing software, there are new concepts to learn, new technologies available, new frameworks to implement. How does one keep up with this moving target we call progress? The best way for me is reading books. Other avenues are online journals and tutorials, podcasts by industry leaders, webinars, and conferences. These are the resources a technical person must use in order to grow in skill and thought.
If the technology or framework you want to learn is complex, a book is probably the best option. If it is only a matter of using something simple, look for online tutorials or documentation. If you want to stay current on emerging technologies, sign up for newsletters, read online journals, or listen to podcasts from industry leaders.
In the end, only you are responsible for your skill set. If you become complacent in learning, you only teach yourself to dislike change from what you already know. So be proactive, seek new skills, and don’t forget what your degree really taught you.
Here are various software resources recommended by the software team at Indigo BioSystems.
Books on software design:
Podcasts
Other References
- Jason Liechty, Developer, Indigo BioSystems

In my last post, I mentioned that there are now technologies which allow data intensive applications to support global collaborative R&D operations. Like most IT things, the approach got a nickname, which stuck, and was then used in every conceivable way until nobody could say what it really meant. The nickname is ‘Cloud Computing’ and if you are in the IT business and don’t have a cloud, you must not have a marketing department!
As end-users of IT resources, those of us in the lab have always talked about “Servers in the Sky”. We never had control over the computers we used (which is a real pain, by the way, especially if you run an instrument), and we had always relied on invisible storage systems and servers located in remote locations which we almost certainly never visited except on some kind of new employee tour.
Big IT departments then began noticing that some servers were operating at full capacity and generating complaints: “Why the $%^@ does my substructure search take longer than a Google search?”, while others were operating at almost 0% almost all of the time. Obviously, after locking us out of our desktops and automatically installing every patch issued from Redmond (right in the middle of our critical data acquisition), these smart guys could not leave this problem alone.
Some particularly smart guys at Amazon had noticed the same thing. Amazon had built a computer system capable of keeping up with all those One-Click orders in the last weeks of December. That’s a lot of computers in case you hadn’t thought about it. What do they do in February when everyone feels broke? Nothing. Well they were adding to the carbon footprint and slowly killing us all, but that’s another story…
I don’t know the details, but a while back almost every large IT group started to install some very cool software to do something called “Virtualization”. This meant that if an application was running on a slow machine, a virtual image could be created and moved to a faster machine. This is A Really Good Thing, especially given the mean-time-between-failures for cheap commodity hardware. It meant that new servers could be added to the overall server farm and pulled into use only when needed.
It seemed that Amazon decided that they could use their virtualization system to create bare operating system images anytime they wanted – start them up, run them and then shut them down when the work was finished. And what do you think Bezos and Friends did with that? Well, they gave it an ISBN number and sold it on Amazon.com! Not quite, but close.
They created Amazon Web Services (AWS) which is a pay-for-service web interface that allows you to create those bare OS instances and use them like regular computers on the internet. The instance you get looks like a computer, and acts like a computer, but in reality, it’s just a virtual machine running an image of a computer of a particular size and speed. It could really be running on anything, but you wouldn’t know it.
If you are an application developer like Indigo, you can put your credit card number in and get a pristine Linux image ready and waiting with a little blinking cursor in your SSH terminal application. What? You don’t want a little blinking cursor in an SSH window? Sorry. You could have a Windows 2003 Server image instead – not sure what you would want that for either really.
What you probably want is a useful application running on someone else’s computer that never runs out of space and never slows down. It would also be nice if that application was accessible by all of your collaborators both inside and outside your company or institute. Good News: that’s what we did with Amazon.

Indigo developed its entire application stack in such a way that it could run in the virtualization environment created by Amazon and other cloud systems.
Because of the massive telecom needed to keep up with December orders at Amazon, our applications enjoyed very nice speeds with contract labs in India and China and nearly insane speeds in the US and EU. And because it was so easy to add new instances in the Amazon system, we were able to monitor the performance of our applications and add new servers when things started to get heavy. If you want to be attractive at cocktail parties, tell everyone that you were working on a particularly difficult problem in the fight against – insert your therapeutic group name here – and solved it by invoking a “Cloud Burst” which “spun up” a thousand servers at once all for around $17. They really go for something like $0.12/hour so a thousand would really cost you $120 for each hour they run – you probably don’t need a thousand as they only needed about 100 for the final Human Genome assembly ($12 an hour).
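The cocktail-party arithmetic is simple enough to check. The sketch below assumes a flat hourly rate and a one-hour run; the function name is ours, and the $0.12 figure is just the rough number quoted above, not a current price.

```python
def cloud_burst_cost(instances: int, rate_per_hour: float, hours: float = 1.0) -> float:
    """Total cost of running `instances` servers for `hours` at a flat hourly rate."""
    return instances * rate_per_hour * hours


RATE = 0.12  # dollars per instance-hour, the rough figure quoted above

thousand_servers = cloud_burst_cost(1000, RATE)  # about 120 dollars for one hour
hundred_servers = cloud_burst_cost(100, RATE)    # about 12 dollars for one hour
```

The cost scales with both the number of instances and the hours they stay up, so shutting instances down when the work is finished is what keeps the bill small.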
Want to meet some really smart people? Go to Google. These guys said: “look at Google Maps, Gmail, Google Docs. People love our applications, they are beautiful. You think we want to sell bare images? No way.” Google and a few others have taken a different approach to the cloud. Google has something called AppEngine. This is an execution environment for languages supported by Google: mostly Python and Java. If you develop an application in a supported language, you can upload it to the AppEngine and it’ll be ready to use like Gmail. Microsoft has already done something similar with Windows Azure, with the expected support for the .NET platform (not the whole platform – mostly just the web-enabled bits). Even Salesforce.com has a cloud offering if you want to develop add-on applications and run them in their cloud. Like I said, there are clouds everywhere there are marketing departments.
The point is that there are many types of clouds. Compute clouds, storage clouds, application clouds, application-extension clouds. There are even very secure clouds.
This is particularly interesting in the world of pharma. It’s all well and good to use “Elastic Map Reduce” to speed up your molecular dynamics calculations on a bookstore’s spare machines, but come on, what about work that needs a validated system? Some of us have killed an unfair number of brain cells trying to validate applications so the idea of validating a virtual image when you can barely tell what country it’s in, let alone which machine it’s on, seems insane.
This is where some of the usual suspects show up. IBM and the other pharma-friendly companies have stepped up with clouds which operate under more security like the hosted systems found in the clinical trial outsourcing world. These clouds are managed with standard operating procedures and validation documentation consuming the requisite number of dead trees to prove it.
IBM’s “Cloud On Demand” is an evolutionary step of their original (and impossible to understand) “On Demand” idea. The IBM approach may be slightly less flexible than some of the others, but that is exactly what you want when dealing with regulators.
Indigo BioSystems is using all of these clouds to solve different problems in drug discovery and development. The Amazon Elastic Compute Cloud (EC2) has turned out to be a very good fit for non-regulated drug discovery applications. Our performance metrics on bringing data back from synthetic contractors in China have shown a significant improvement over having the contractor try to connect through a pharma firewall, and it is much faster than sending a DVD by FedEx.

Indigo BioSystems Architecture
Indigo has deployed its entire application framework on Amazon EC2 for discovery applications and IBM’s Cloud on Demand system for regulated processes (in addition to enterprise and turn-key platforms). The outcome is distributed R&D interconnection and interoperability based on open, accessible linked data, using hosted services to ensure global availability, high performance and a pay-as-you-go pricing model. The hosted model fits the needs of global research groups who work with collaborators and contractors and must integrate data on complex projects. By combining a hosted approach with the linked data information model, Indigo’s approach represents a breakthrough in research data integration and management.
Stay tuned! In our next post, we will describe some of the technical details behind our main product components: DataLink (linked data store), and LOLA (web-based plug-in system for applications).
Lately we have been working on modules that enable our suite of products to automatically determine the quality of data produced in high-throughput measurement systems. The solution we have produced makes a prediction about the quality of final result values based on the values of process parameters used during the calculation. The approach leverages human knowledge about what kind of input data will result in good output – it’s an automated “garbage-in, garbage-out” detector.
In chromatographic peak picking, there are a large number of parameters which determine the quality of computed results. In high throughput systems, there can be a lot of peaks. If humans have to review peaks prior to releasing results, the overall throughput of the system will be limited by the manual steps. Further, humans are actually pretty bad at detecting real problems in data in high dimensions and we positively hate sitting in front of a computer screen staring at peak after peak after peak for hours on end. The problem screams for a computer to help – and thankfully there are solid approaches to help eliminate the time sink (think credit card fraud detection). Although this is not a visualization problem, I will use a picture to explain the approach. With a 2-D projection of a multidimensional data set you can see the outliers (hint: then so can a computer).

Peaks that fall outside a region known to produce good results should be checked. The weighting and projection angles have been selected through human validation then used by an automated program.
This plot is part of our validation process for automated processing and it can be made with many tools. We happen to like using R for algorithm development since it is a programming environment which allows us to develop high-powered statistical methods that can be used directly in our modules and solutions. This output from RGobi shows the weighted projection of eight peak picking parameters onto a plane, and shows outliers which may represent errors in a quantitative analysis process. Again, the goal is not to make a classification (good vs. bad) problem into a visualization problem, but to show that an n-dimensional system can be set up and automatically run over datasets to make predictions. If everything within the ellipse can reliably be used without human checking, then overall review time is shortened and focus is placed on data which is more likely to be problematic.
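Here is a minimal sketch of the "project, then check the ellipse" idea in code. It is not our production module: the weight vectors, the ellipse center and radii, and the peak values are all hypothetical, and the ellipse is kept axis-aligned to keep the example short. In practice the weights and the trusted region would come out of the human validation step described above.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def project(params: Sequence[float], axis_x: Sequence[float], axis_y: Sequence[float]) -> Point:
    """Project an n-dimensional parameter vector onto a plane via two weight vectors."""
    x = sum(p * w for p, w in zip(params, axis_x))
    y = sum(p * w for p, w in zip(params, axis_y))
    return x, y


def inside_ellipse(point: Point, center: Point, radii: Point) -> bool:
    """True if the projected point lies within an axis-aligned ellipse."""
    dx = (point[0] - center[0]) / radii[0]
    dy = (point[1] - center[1]) / radii[1]
    return dx * dx + dy * dy <= 1.0


def needs_review(peaks, axis_x, axis_y, center: Point, radii: Point) -> List[int]:
    """Return the indices of peaks that fall outside the trusted region."""
    return [
        i for i, params in enumerate(peaks)
        if not inside_ellipse(project(params, axis_x, axis_y), center, radii)
    ]


# Hypothetical weights for eight peak-picking parameters (only four contribute here).
AXIS_X = [0.5, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
AXIS_Y = [0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
CENTER = (1.0, 1.0)  # hypothetical center of the region known to give good results
RADII = (0.5, 0.5)   # hypothetical half-widths of that region

peaks = [
    [1.0, 1.0, 1.0, 1.0, 0, 0, 0, 0],  # projects to (1.0, 1.0): inside
    [1.2, 1.0, 0.8, 1.0, 0, 0, 0, 0],  # projects to about (1.1, 0.9): inside
    [3.0, 3.0, 1.0, 1.0, 0, 0, 0, 0],  # projects to (3.0, 1.0): outside
]
flagged = needs_review(peaks, AXIS_X, AXIS_Y, CENTER, RADII)
```

Only the third peak ends up in `flagged`, so it is the only one a person would need to look at.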
The challenge for these types of applications has never really been the math – it’s been in building a practical system in which the math can be used. It’s a shame that so little of this type of analysis is actually done in today’s labs – people could be spending their time more effectively and operations could be more efficient (instead of being the target of outsourcing to reduce the labor rate on all those manual processes). Again, it’s not the math of prediction that’s difficult; it’s getting the data into this type of leveraged prediction that’s hard.
If you look at the software used in most laboratories, you will find that most of the code is actually running the low-level workflow. It takes a shocking amount of code to even collect data these days. However, once the data is collected in the lab, you are still a long way from being able to perform leveraged predictions like the example above. We look at the situation like this:

To get data into systems which perform predictions and leverage human knowledge, data from lab workflows must be integrated and made actionable.
You have software for workflows, but before you can get to the maximum enterprise value, you need to share the knowledge (like what good peak picking means for you). Before you can share the knowledge, you have to make the information actionable. You need systems which allow you to act on data directly and quickly. The challenge is being able to understand the data you look at. Most people don’t have a complete picture of the experimental process. That’s where data linking comes in. Information integration is a key requirement for achieving the ultimate goal. Even in companies where there is a knowledge sharing strategy (you know who you are), information integration has been expensive and difficult. The ability of knowledge sharing to impact the efficiency of the organization depends on the effectiveness of the information integration approach. Too many labs have great workflow processes which result in proprietary data locked into silos which cannot be shared effectively, let alone used for leveraged prediction. Too many knowledge sharing systems focus on reports and documents – just try to build an automated data review and release system when the input data is a giant collection of PDF files randomly scattered on enterprise file systems or indexed by an eLN. Efficiency will result when the knowledge sharing is based on linked data which is integrated and interoperable.
Indigo’s approach to automated systems uses open linked data for the information integration step. We give human experts the ability to develop knowledge about processes which can then be turned into automated tools that impact their workflows and their organizations.
The information integration step is critical. Now ask: “How many of the workflows I depend on are still inside my company?” The answer you give this year is probably lower than last year. Welcome to globally distributed R&D – distance and organizational boundaries make the integration step more important than ever before. Next, we’ll talk about how our products tackle the distributed R&D challenge.
It seems obvious that any project team scientist would prefer it if all their data were linked and easily accessible. This is the motivation behind the call for data integration. A typical project actually seems to be held together with spreadsheets, yellow sticky notes and the rare lab notebook entry. Why aren’t the electronic tools available better at keeping track of the relationships between data items in a project? One reason is that research data is incredibly varied, which means any application which deals with one data type well probably deals with others poorly or not at all. The sheer value of data aggregation has led some people to pursue what Ted Slater called “the Holy Grail…complete data integration” in this Bio-IT World article. Ted warns that total integration doesn’t work and that the ‘Semantic Web’ doesn’t exist yet. I don’t know if the semantic web exists yet or not, but I do know a lot of well-intentioned people who nearly gave themselves aneurysms trying to maintain drug discovery and development data warehouses. Slater calls for data interoperability – we think that is Really Important.
Our approach to interoperability is to start with the most accessible format of data possible, then make sure experimentally relevant links can be easily made. Let’s face it, the best way to manage a 4GB raw data file is as a BLOB (XML or not…). Our approach allows raw files to be made interoperable by connecting them to as much context as possible.

In this example, using graph notation, we can show that a “Project” (Step_21) had an “Acquired-File” (1008), which in turn had an “Operator” (99), was acquired in a lab (Lab A), and was run on a particular instrument (Instrument 12). The picture was generated using Cytoscape, a tool for visualizing linked information using graphs.
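For readers who prefer code to pictures, the same acquisition graph can be sketched as a set of subject/predicate/object triples. This is only an illustration in Python, not the storage format we use: the predicate names (`has_acquired_file`, `has_operator` and so on) and the node labels are invented for the example.

```python
# Each edge is a (subject, predicate, object) triple.
# Predicate names and node labels are illustrative assumptions.
acquisition_graph = {
    ("Step_21", "has_acquired_file", "File_1008"),
    ("File_1008", "has_operator", "Operator_99"),
    ("File_1008", "acquired_in_lab", "Lab_A"),
    ("File_1008", "run_on_instrument", "Instrument_12"),
}


def objects(graph, subject, predicate):
    """Follow one kind of link from a node and return the nodes it points to."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def context_of(graph, node):
    """Collect every outgoing link of a node as a predicate -> objects mapping."""
    context = {}
    for s, p, o in graph:
        if s == node:
            context.setdefault(p, []).append(o)
    return context


# Walk from the project to its files, then show the context of each file.
for file_id in objects(acquisition_graph, "Step_21", "has_acquired_file"):
    print(file_id, context_of(acquisition_graph, file_id))
```

The raw file itself stays a BLOB; the triples around it carry the context that makes it interoperable.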
Independent of the acquisition, the rest of the experiment could also have been captured in a graph:

Here, all the files from project “Step 20” are shown, along with the two analytes (Verapamil and Cortisone), the overall study (“230”) and the other project within that study (Step 21). We can also see that Verapamil was used in both projects and shared an analytical method with the analyte from the “Step 21” project, Alprazolam. Here we can also see that Method-3 is not used by any of the analytes in these experiments.
By combining both graphs (something that can be done either as a query, or within a tool like Cytoscape), new relationships can be identified:

To Slater’s point, the objective should be interoperable data. If we know that data collected on Instrument_12 is compatible with data collected on Instrument_13 via a link which was established at data collection time, then we know that it’s OK to aggregate these results. Without these links, differences between results could be the effect of the instrument (or the operator). We are clearly more effective at data analysis when we have more of the context of the experiment.
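Here is a rough sketch of that check in Python, assuming links are stored as simple triples. Combining graphs is just a set union, and the aggregation rule is a lookup on the combined graph. The file identifiers other than 1008, Instrument_14, and the `compatible_with` and `run_on_instrument` predicate names are hypothetical and exist only for this example.

```python
# Links recorded at acquisition time (file IDs 1009 and 1010 are hypothetical).
acquisition_links = {
    ("File_1008", "run_on_instrument", "Instrument_12"),
    ("File_1009", "run_on_instrument", "Instrument_13"),
    ("File_1010", "run_on_instrument", "Instrument_14"),
}

# Links recorded separately, stating which instruments give comparable results.
compatibility_links = {
    ("Instrument_12", "compatible_with", "Instrument_13"),
}


def merge(*graphs):
    """Combine several graphs; with triples this is simply a set union."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def instrument_of(graph, file_id):
    """Return the instrument a file was run on, or None if no link exists."""
    for s, p, o in graph:
        if s == file_id and p == "run_on_instrument":
            return o
    return None


def can_aggregate(graph, file_a, file_b):
    """Results may be pooled only if the instruments match or are linked as compatible."""
    inst_a = instrument_of(graph, file_a)
    inst_b = instrument_of(graph, file_b)
    if inst_a is None or inst_b is None:
        return False  # missing context: do not aggregate
    if inst_a == inst_b:
        return True
    return ((inst_a, "compatible_with", inst_b) in graph
            or (inst_b, "compatible_with", inst_a) in graph)


linked = merge(acquisition_links, compatibility_links)
print(can_aggregate(linked, "File_1008", "File_1009"))
print(can_aggregate(linked, "File_1008", "File_1010"))
```

Note that the answer for files 1008 and 1009 only becomes "yes" after the two graphs are combined; neither graph alone carries enough context to justify pooling the results.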
The Rubicon repository manages laboratory data, ensuring integrity and security, and goes one step further by providing connective links to additional information residing inside or outside the research organization. Rubicon simplifies the linking of large scale measurement data to critical descriptive data and makes links easy to add and easy to navigate. The result: all your project data is linked and easily accessible.
In future posts, I will describe the technology we use to build up these links. I will also describe how this approach, combined with cloud computing, leads to globally interoperable data.