Linking Research Data: The Big and the Small
It seems obvious that any scientist on a project team would prefer all of their data to be linked and easily accessible. This is the motivation behind the call for data integration. In practice, a typical project seems to be held together with spreadsheets, yellow sticky notes and the rare lab notebook entry. Why aren’t the available electronic tools better at keeping track of the relationships between data items in a project? One reason is that research data is incredibly varied, which means any application that handles one data type well probably handles others poorly or not at all.
The sheer value of data aggregation has led some people to pursue what Ted Slater called “the Holy Grail…complete data integration” in this Bio-IT World article. Ted warns that total integration doesn’t work and that the ‘Semantic Web’ doesn’t exist yet. I don’t know whether the semantic web exists yet, but I do know a lot of well-intentioned people who nearly gave themselves aneurysms trying to maintain drug discovery and development data warehouses. Slater calls for data interoperability instead, and we think that is Really Important.
Our approach to interoperability is to start with the most accessible format of data possible, then make sure experimentally relevant links can be easily made. Let’s face it, the best way to manage a 4GB raw data file is as a BLOB (XML or not…). Our approach allows raw files to be made interoperable by connecting them to as much context as possible.
In this example, using graph notation, we can show that a “Project” (Step_21) had an “Acquired-File” (1008), which in turn had an “Operator” (99), was acquired in a lab (Lab A), and was run on a particular instrument (Instrument 12). The picture was generated using Cytoscape, a tool for visualizing linked information as graphs.
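As a rough illustration of the same idea in code, the acquisition graph can be written down as a list of (subject, relation, object) links. This is only a sketch using plain Python: the relation names "Lab" and "Instrument" and the helper `context_of` are assumptions made for the example, not a description of how Rubicon stores its links.

```python
# Each link is a (subject, relation, object) triple.
# Node names come from the example above; the relation names
# "Lab" and "Instrument" are assumed for illustration.
acquisition_links = [
    ("Step_21", "Acquired-File", "1008"),
    ("1008", "Operator", "99"),
    ("1008", "Lab", "Lab A"),
    ("1008", "Instrument", "Instrument 12"),
]


def context_of(node, links):
    """Return {relation: object} for every link that starts at `node`."""
    return {relation: obj for subject, relation, obj in links if subject == node}


# Collect the context attached directly to the raw file.
file_context = context_of("1008", acquisition_links)
print(file_context)
```

The raw file itself stays an opaque BLOB; only its identifier takes part in the graph, which is what lets the context grow without touching the file.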
Independent of the acquisition, the rest of the experiment could also have been captured in a graph:
Here, all the files from project “Step 20” are shown, along with the two analytes (Verapamil and Cortisone), the overall study (“230”) and the other project within that study (Step 21). We can also see that Verapamil was used in both projects and shared an analytical method with the analyte from the “Step 21” project, Alprazolam. The graph also shows that Method-3 is not used by any of the analytes in these experiments.
By combining both graphs (something that can be done either as a query, or within a tool like Cytoscape), new relationships can be identified:
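To make the "combine as a query" idea concrete, here is a minimal sketch in plain Python. It treats each graph as a set of (subject, relation, object) links, merges them with a set union, and then follows links outward from the study. The relation names "Project", "Analyte" and "Instrument", the underscore spelling of the project names, and the helper `reachable` are all assumptions for the example; only a subset of the nodes described above is included.

```python
from collections import deque

# Acquisition graph (subset of the links described earlier).
acquisition = {
    ("Step_21", "Acquired-File", "1008"),
    ("1008", "Operator", "99"),
    ("1008", "Instrument", "Instrument 12"),
}

# Experiment graph (subset; relation names are assumed).
experiment = {
    ("230", "Project", "Step_20"),
    ("230", "Project", "Step_21"),
    ("Step_20", "Analyte", "Verapamil"),
    ("Step_20", "Analyte", "Cortisone"),
    ("Step_21", "Analyte", "Verapamil"),
    ("Step_21", "Analyte", "Alprazolam"),
}

# Combining the graphs is a set union; shared node names join them.
combined = acquisition | experiment


def reachable(start, links):
    """Breadth-first walk: every node reachable from `start` along the links."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for subject, _relation, obj in links:
            if subject == node and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


# The instrument is only reachable from the study in the combined graph.
print("Instrument 12" in reachable("230", experiment))
print("Instrument 12" in reachable("230", combined))
```

The two graphs join because they share the node "Step_21", so the quality of the combined view depends on using consistent identifiers when the links are first recorded.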
To Slater’s point, the objective should be interoperable data. If we know that data collected on Instrument_12 is compatible with data collected on Instrument_13 via a link which was established at data collection time, then we know that it’s OK to aggregate these results. Without these links, differences between results could be the effect of the instrument (or the operator). We are clearly more effective at data analysis when we have more of the context of the experiment.
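A small sketch of how such a link could gate aggregation, again in plain Python. The set `compatible_links`, the function `can_aggregate`, the file identifier "1009" and "Instrument_14" are hypothetical names invented for this example; the only fact taken from the text is that a compatibility link exists between Instrument_12 and Instrument_13.

```python
# Compatibility links recorded at data collection time.
# A frozenset makes each link symmetric: (12, 13) equals (13, 12).
compatible_links = {frozenset({"Instrument_12", "Instrument_13"})}


def can_aggregate(instrument_a, instrument_b, links):
    """True if results from the two instruments may be pooled."""
    if instrument_a == instrument_b:
        return True
    return frozenset({instrument_a, instrument_b}) in links


# Hypothetical mapping from acquired file to the instrument that produced it.
file_instruments = {
    "1008": "Instrument_12",
    "1009": "Instrument_13",  # hypothetical file
    "1010": "Instrument_14",  # hypothetical file and instrument
}

# Keep only the files whose instrument is compatible with the one used for file 1008.
reference = file_instruments["1008"]
poolable = sorted(
    file_id
    for file_id, instrument in file_instruments.items()
    if can_aggregate(reference, instrument, compatible_links)
)
print(poolable)
```

Without the recorded link, the safe default in this sketch is to leave a file out of the pool rather than assume its instrument is comparable.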
The Rubicon repository manages laboratory data, ensuring integrity and security, and goes one step further by providing connective links to additional information residing inside or outside the research organization. Rubicon simplifies the linking of large-scale measurement data to critical descriptive data and makes links easy to add and easy to navigate. The result: all your project data is linked and easily accessible.
In future posts, I will describe the technology we use to build up these links. I will also describe how combining this approach with cloud computing leads to globally interoperable data.







