Using Informatics to Keep Up with the Data Deluge

If you’re feeling a tad overwhelmed by the rate of publications appearing lately, spare a little sympathy for those who use PubMed, a database of references and abstracts on life sciences and biomedical topics. In 1980, articles were being added at a rate of 250,000 per year; by 2011 that rate had quadrupled to one million per year—flooding in from 39,000 different refereed journals. The problem is, only ten or twenty of these articles might bear relevance to the work an individual researcher is carrying out. The question becomes: how does one efficiently find those ten or twenty papers?

In my grad student days (and my world of ChemAbstracts was admittedly not as vast as that of PubMed), we relied on restricting ourselves to a handful of journals, listening to the buzz at conferences, and word-of-mouth, especially word-of-supervisor’s-mouth. That could lead to a “silo” effect, and I remember re-writing a chapter in great haste to incorporate previously overlooked results tucked away in back issues of Berichte der Bunsengesellschaft für physikalische Chemie.

On June 7, 2012, Ian Stott, Informatics & Maths science leader at Unilever, presented a vastly better method for staying informed. In the Unilever R&D department, Stott leads an informatics team that assists in the molecular design of new ingredients for all parts of Unilever related to cleaning (hair, deodorants, toothpaste, household cleaners, skin creams as well as washing powders & liquids) and foods. His work involves both working with academic departments (such as the Unilever Centre for Molecular Informatics at Cambridge) and running internal projects. He was speaking at the webinar “Informatics-Led Literature Search: Keeping Up with the Data Deluge” held through the American Chemical Society (ACS) in partnership with Accelrys, a scientific software company. Accelrys owns Pipeline Pilot, a program that aggregates volumes of disparate research data, automates the scientific analysis of that data, and enables researchers to explore, visualize and report.

At Unilever, Stott devised an ingenious application (based on Pipeline Pilot) that is of direct benefit to scientists who suffer information overload and potential “silo formation.” As I listened, I wondered why such an approach could not be generalized to risk managers and finance professionals, who struggle with the same challenge: so many articles, so little time.

Stott’s application begins by defining a “general area” of publication interest. The general area should be big, maybe one or two thousand articles obtained from a keyword search, but not so big that it is unwieldy. Each article in PubMed is “text mined” to see if it contains topics of interest to an individual researcher by comparison against the most interesting papers.

Then, the PubMed update is downloaded to the system and each new article is analyzed for keywords and salient text in order to assign an “interesting” score. Call these articles “potentially interesting” because the scores are machine-based; no human has yet laid eyes on them. The titles and abstracts of these “potentially interesting” articles, along with the “interesting” scores, are e-mailed to the individual researcher.

“Now we come to the tedious bit,” said Stott, “training the model.” The recipient of the e-mail is asked to click his or her way through the list of “potentially interesting” articles, rating each article as being of “true” interest or not. “Usually the scientist only has to skim the title and abstract to determine this,” said Stott, although some might choose to download the full paper from PubMed.

Once the list has been scrutinized and rated True/False for “interesting-ness” by a human, the feedback is submitted to the training algorithm so the model can be continually refined.
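The score-rate-retrain loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the webinar summary does not say which algorithm Stott's Pipeline Pilot application uses, so a small naive Bayes text classifier is swapped in here, and the class name InterestModel, its methods, and the sample article titles are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class InterestModel:
    """Hypothetical naive Bayes scorer (an assumption, not Stott's method)."""

    def __init__(self):
        # Word and document counts, kept separately for each human rating.
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, interesting):
        """Fold one human True/False rating back into the model."""
        self.word_counts[interesting].update(tokenize(text))
        self.doc_counts[interesting] += 1

    def score(self, text):
        """Estimate the probability (0 to 1) that the text is interesting."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_prob = {}
        for label in (True, False):
            # Add-one smoothing keeps unseen words from zeroing the score.
            prior = (self.doc_counts[label] + 1) / (total_docs + 2)
            total_words = sum(self.word_counts[label].values())
            logp = math.log(prior)
            for word in tokenize(text):
                count = self.word_counts[label][word]
                logp += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_prob[label] = logp
        # Turn the two log scores into a probability for the True label;
        # the clamp guards math.exp against overflow on very long texts.
        diff = max(min(log_prob[False] - log_prob[True], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


# Seed the model with two rated abstracts (made-up examples).
model = InterestModel()
model.train("surfactant micelle formation in shampoo formulation", True)
model.train("clinical trial of a cardiac drug in elderly patients", False)

# Score a new batch and list the most promising articles first.
new_articles = [
    "novel surfactant blends for shampoo",
    "cardiac outcomes in elderly patients",
]
ranked = sorted(new_articles, key=model.score, reverse=True)
for title in ranked:
    print(f"{model.score(title):.2f}  {title}")
```

Each True/False click from the researcher would become another call to train, so the ranking of the next batch reflects the accumulated feedback. A production system would presumably use richer text features than single words.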

As a hands-on demonstration, Stott showed how a “general area” of about a thousand articles could be built by searching PubMed for all articles that contain “relativity” as a keyword or “Einstein” as an author. He showed how succeeding updates to PubMed would be trained against this original set. “This is just an example,” deadpanned Stott, noting the company was “not yet” applying relativity to improve the formulation of shampoo.

In this way, researchers continue to receive good suggestions about recently published articles of interest to them, thereby restricting the data deluge to a highly manageable trickle.

Can a similar technique be applied to managing the information flow in finance/economics/risk management? Alas, I don’t have access to a Pipeline Pilot-type system in the comparable finance and economics world, but let’s hope the application presented by Stott can be translated. For the time being, I’ll definitely rank the “interesting” score of this webinar as “True”.

The webinar presentation slides can be found at: http://event.on24.com/eventRegistration/EventLobbyServlet?target=lobby.jsp&eventid=461777&sessionid=1&key=F24AB6248E4B5ADAE008D1AA58AF1452&eventuserid=64514969

The Unilever Centre for Molecular Informatics at Cambridge can be found at: http://www-ucc.ch.cam.ac.uk/
