Compute Clouds on the Horizon: the storm is closer than you think…
In my last post, I mentioned that there are now technologies which allow data-intensive applications to support global collaborative R&D operations. Like most IT things, the approach got a nickname, which stuck, and was then used in every conceivable way until nobody could say what it really meant. The nickname is ‘Cloud Computing’ and if you are in the IT business and don’t have a cloud, you must not have a marketing department!
As end-users of IT resources, those of us in the lab have always talked about “Servers in the Sky”. We never had control over the computers we used (which is a real pain, by the way, especially if you run an instrument), and we had always relied on invisible storage systems and servers in remote locations which we almost certainly never visited except on some kind of new employee tour.
Big IT departments then began noticing that some servers were operating at full capacity and generating complaints (“Why the $%^@ does my substructure search take longer than a Google search?”), while others were operating at almost 0% almost all of the time. Obviously, after locking us out of our desktops and automatically installing every patch issued from Redmond (right in the middle of our critical data acquisition), these smart guys could not leave this problem alone.
Some particularly smart guys at Amazon had noticed the same thing. Amazon had built a computer system capable of keeping up with all those One-Click orders in the last weeks of December. That’s a lot of computers in case you hadn’t thought about it. What do they do in February when everyone feels broke? Nothing. Well they were adding to the carbon footprint and slowly killing us all, but that’s another story…
I don’t know the details, but a while back almost every large IT group started to install some very cool software to do something called “Virtualization”. This meant that if an application was running on a slow machine, a virtual image could be created and moved to a faster machine. This is A Really Good Thing, especially given the mean-time-between-failures for cheap commodity hardware. It meant that new servers could be added to the overall server farm and pulled into use only when needed.
It seemed that Amazon decided that they could use their virtualization system to create bare operating system images anytime they wanted – start them up, run them and then shut them down when the work was finished. And what do you think Bezos and Friends did with that? Well, they gave it an ISBN number and sold it on Amazon.com! Not quite, but close.
They created Amazon Web Services (AWS) which is a pay-for-service web interface that allows you to create those bare OS instances and use them like regular computers on the internet. The instance you get looks like a computer, and acts like a computer, but in reality, it’s just a virtual machine running an image of a computer of a particular size and speed. It could really be running on anything, but you wouldn’t know it.
If you are an application developer like Indigo, you can put your credit card number in and get a pristine Linux image ready and waiting with a little blinking cursor in your SSH terminal application. What? You don’t want a little blinking cursor in an SSH window? Sorry. You could have a Windows Server 2003 image instead – not sure what you would want that for either really.
What you probably want is a useful application running on someone else’s computer that never runs out of space and never slows down. It would also be nice if that application was accessible by all of your collaborators both inside and outside your company or institute. Good News: that’s what we did with Amazon.
Indigo developed its entire application stack in such a way that it could run in the virtualization environment created by Amazon and other cloud systems.
Because of the massive telecom capacity needed to keep up with December orders at Amazon, our applications enjoyed very nice speeds with contract labs in India and China and nearly insane speeds in the US and EU. And because it was so easy to add new instances in the Amazon system, we were able to monitor the performance of our applications and add new servers when things started to get heavy. If you want to be attractive at cocktail parties, tell everyone that you were working on a particularly difficult problem in the fight against – insert your therapeutic group name here – and solved it by invoking a “Cloud Burst” which “spun up” a thousand servers at once, all for around $17. They really go for something like $0.12/hour, so a thousand would really cost you $120 an hour. You probably don’t need a thousand, though, as they only needed about 100 for the final Human Genome assembly ($12 an hour).
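For anyone who wants to check the cocktail-party math, here is a small sketch. The $0.12/hour rate is the rough figure quoted above; the function names and the per-server capacity number are made up for illustration and do not come from any Amazon API.

```python
import math

# Rough hourly rate quoted in this post; real pricing varies by instance type.
HOURLY_RATE_USD = 0.12


def burst_cost(instances, hours=1.0, rate=HOURLY_RATE_USD):
    """Cost in USD of running `instances` servers for `hours` hours."""
    return round(instances * hours * rate, 2)


def servers_needed(requests_per_sec, capacity_per_server):
    """Hypothetical scale-out rule: enough servers to cover the current load."""
    return max(1, math.ceil(requests_per_sec / capacity_per_server))


for n in (100, 1000):
    print(f"{n} instances for one hour: ${burst_cost(n):.2f}")

# Assumed numbers: 950 requests/sec, each server handling about 200.
print("servers needed:", servers_needed(950, 200))
```

The scale-out rule is deliberately naive: it only rounds the load up to whole servers, which is the "add new servers when things get heavy" idea in its simplest form.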
Want to meet some really smart people? Go to Google. These guys said: “look at Google Maps, Gmail, Google Docs. People love our applications, they are beautiful. You think we want to sell bare images? No way.” Google and a few others have taken a different approach to the cloud. Google has something called AppEngine. This is an execution environment for languages supported by Google: mostly Python and Java. If you develop an application in a supported language, you can upload it to the AppEngine and it’ll be ready to use like Gmail. Microsoft has already done something similar with Windows Azure, with the expected support for the .NET platform (not the whole platform – mostly just the web-enabled bits). Even Salesforce.com has a cloud offering if you want to develop add-on applications and run them in their cloud. Like I said, there are clouds everywhere there are marketing departments.
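To give a feel for how little code an "application cloud" asks of you, here is a minimal Python web application written against the generic WSGI interface from the standard library. This is only a sketch: it is not AppEngine's own API, the deployment configuration is left out entirely, and the names `application` and `call_locally` are chosen for illustration.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """A tiny WSGI app: the hosting platform calls this once per request."""
    body = b"Hello from someone else's computer\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_locally(app, path="/"):
    """Invoke the app in-process with a fake request, no server required."""
    environ = {}
    setup_testing_defaults(environ)  # fills in a minimal, valid WSGI environ
    environ["PATH_INFO"] = path
    seen = {}

    def start_response(status, headers, exc_info=None):
        seen["status"] = status
        seen["headers"] = headers

    body = b"".join(app(environ, start_response))
    return seen["status"], body


print(call_locally(application))
```

You write the function that answers requests and the platform worries about the machines. To try it on your own desktop, `wsgiref.simple_server.make_server` from the standard library can serve the same `application` object.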
The point is that there are many types of clouds. Compute clouds, storage clouds, application clouds, application-extension clouds. There are even very secure clouds.
This is particularly interesting in the world of pharma. It’s all well and good to use “Elastic Map Reduce” to speed up your molecular dynamics calculations on a bookstore’s spare machines, but come on, what about work that needs a validated system? Some of us have killed an unfair number of brain cells trying to validate applications so the idea of validating a virtual image when you can barely tell what country it’s in, let alone which machine it’s on, seems insane.
This is where some of the usual suspects show up. IBM and the other pharma-friendly companies have stepped up with clouds which operate under tighter security, like the hosted systems found in the clinical trial outsourcing world. These clouds are managed with standard operating procedures and validation documentation consuming the requisite number of dead trees to prove it.
IBM’s “Cloud On Demand” is an evolutionary step of their original (and impossible to understand) “On Demand” idea. The IBM approach may be slightly less flexible than some of the others, but that is exactly what you want when dealing with regulators.
Indigo BioSystems is using all of these clouds to solve different problems in drug discovery and development. The Amazon Elastic Compute Cloud (EC2) has turned out to be a very good fit for non-regulated drug discovery applications. Our performance metrics on bringing data back from synthetic contractors in China have shown a significant improvement over having the contractor try to connect through a pharma firewall, and it is much faster than sending a DVD by FedEx.
Indigo has deployed its entire application framework on Amazon EC2 for discovery applications and IBM’s Cloud on Demand system for regulated processes (in addition to enterprise and turn-key platforms). The outcome is distributed R&D interconnection and interoperability based on open, accessible linked data, using hosted services to ensure global availability, high performance and a pay-as-you-go pricing model. The hosted model fits the needs of global research groups who work with collaborators and contractors and who must integrate data on complex projects. By combining a hosted approach with the linked data information model, Indigo’s approach represents a breakthrough in research data integration and management.
Stay tuned! In our next post, we will describe some of the technical details behind our main product components: DataLink (linked data store), and LOLA (web-based plug-in system for applications).






