For years, manual curation of scientific publications has been the gold standard, with technology-based solutions ranking far behind in accuracy and completeness. Today, that’s no longer the case. Versatile, well-designed and well-tested applications, combined with significantly greater computational power, are bringing automated curation close to parity. Proprietary text-mining technologies now rival manual curation for some types of searches, helping to ensure that researchers aren’t missing out on valuable information.
Automated and manual curation of scientific papers each has its own strengths and weaknesses, and which works best for a given researcher, laboratory or organization depends on the application. For example, if the goal is to retrieve facts, terms and relationships from articles, automated systems outperform manual curation because of the speed and precision with which computers can match pre-defined lists of items (terminologies) across thousands or millions of documents. Accuracy in matching these terms can approach 98%, far higher than the average for human curation. Furthermore, when new terms are identified or added, the speed of computer processing allows the user to rapidly re-index all the existing content, extracting additional new information from older papers.
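To make that matching step concrete, here is a minimal sketch of dictionary-based term matching in Python. The terminology entries and concept IDs are invented for illustration; real systems use far larger terminologies plus tokenization and disambiguation rules.

```python
import re

# A toy terminology: each surface form maps to a canonical concept ID.
# All entries here are invented for illustration.
TERMINOLOGY = {
    "tp53": "GENE:TP53",
    "p53": "GENE:TP53",  # synonym, resolved to the same concept
    "tumor suppressor": "CONCEPT:TUMOR_SUPPRESSOR",
}

# One regex matching any term, longest alternatives first so that
# multi-word terms win over their substrings.
pattern = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, TERMINOLOGY), key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def index_document(text: str) -> list[tuple[str, str]]:
    """Return (matched text, concept ID) pairs found in a document."""
    return [(m.group(0), TERMINOLOGY[m.group(0).lower()]) for m in pattern.finditer(text)]

print(index_document("P53 is a well-known tumor suppressor; TP53 mutations are common."))
# [('P53', 'GENE:TP53'), ('tumor suppressor', 'CONCEPT:TUMOR_SUPPRESSOR'), ('TP53', 'GENE:TP53')]
```

Because the pattern is rebuilt from the terminology, adding a new synonym and re-running the indexer is all it takes to re-extract information from older papers, which is the re-indexing advantage described above.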
Manual (human) curation, on the other hand, excels at inferring conclusions from disconnected facts or from different sources; at identifying and translating complex concepts into clear, human-readable forms; and at summarizing large amounts of information into a distilled version. In addition, most automated text-mining systems struggle with information in tables or figures, something humans can handle easily.
Here are some key factors to consider when deciding between automated and manual text mining.
1. Scope and volume. The information needed to interpret experimental results, describe a cellular pathway or identify complex interactions in regulatory networks is often scattered across hundreds of articles and publications, more than any single researcher can review. That can leave you with a gnawing feeling of “What have I missed?”
The solution is to cast a wide net, reviewing as many articles from as many journals as possible across a specific field. But that’s simply not practical in most cases. One option is to read only the abstracts of papers. While abstracts are considerably shorter, they don’t contain all the important information found in the full text: multiple studies comparing the full text and abstract of the same papers have concluded that fewer than half of the key facts in the body of a paper appear in the abstract.
Manual curation is another option, relying on PhD researchers trained as curators. Although they can be very accurate, these experts can typically read and annotate only about 20 to 25 papers a day: fine, perhaps, for a highly targeted query across a small number of journals in a select area, but not adequate for comprehensive coverage of a topic, let alone multiple disconnected topics.
Bias can also be a factor when people have to winnow down an extraordinary amount of data. Manual curation can unintentionally introduce bias by limiting the journals and articles covered, owing to resource restrictions and assumptions about journal or paper value: curators may select only the most prominent articles from high-profile journals in a given field. Yet these days, critical information regarding particular pathways or relationships can turn up just about anywhere. Automated systems, with their much higher throughput, can scan far larger quantities of documents, for example, all of the abstracts in Medline and millions of full-text articles (the main limits here being legal restrictions and licensing fees).
Both automated and manual systems can be constrained by documents not written in English: automated translation systems often introduce errors, while it can be difficult to find large numbers of highly trained human curators who also speak the language of the paper. Currently, the vast majority of scientific papers are published in English, but there is a rapidly growing corpus of non-English journals that will need to be addressed in the future.
2. Accuracy. Some would argue that quality is more important than quantity, and that manual curation ensures accuracy. Overall, expert curators are about 90% accurate (as measured by inter-curator agreement on annotation) for specific tasks. In the past five to seven years, however, the accuracy of specialized automated text-mining systems has improved dramatically; for instance, in-house research at Elsevier indicates text-mining accuracy of about 85 to 90% overall. In addition, automated systems are exceptionally consistent in their annotation (~98%) from paper to paper and journal to journal, unlike human curators, who show natural variation both over time and between individuals.
3. Speed. In many situations, speed is of the essence: for example, when a researcher needs specific information to meet a grant proposal deadline, or to reduce the time required to get a new drug to market. In this realm, there’s no contest. A trained manual curator can read and annotate at most 20 to 25 papers a day, so overall daily throughput is limited by the number of available trained curators. In contrast, text-mining technologies can process more than 20 million abstracts overnight, and 80,000 to 100,000 full-text articles per hour. Moreover, automated systems can easily be updated to include new terms and concepts in biology, simply by adding new ontologies. The speed of automated systems also allows them to reprocess entire document collections whenever new names, synonyms or concepts of interest are added to the terminology, effectively gathering additional “new” information from previously “read” papers. This is something that rarely, if ever, happens with manually curated systems, due to the resource limitations mentioned above.
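A quick back-of-envelope calculation, using the figures quoted above, shows how stark the gap is; the short Python snippet below simply works through the arithmetic:

```python
# Back-of-envelope throughput comparison, using the figures quoted above.
papers_per_curator_per_day = 25        # upper bound for an expert curator
articles_per_hour_automated = 80_000   # lower bound for a text-mining system

automated_per_day = articles_per_hour_automated * 24
curators_needed = automated_per_day // papers_per_curator_per_day

print(f"Automated: {automated_per_day:,} articles/day")
print(f"Equivalent manual effort: ~{curators_needed:,} full-time curators")
# Automated: 1,920,000 articles/day
# Equivalent manual effort: ~76,800 full-time curators
```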
The other aspect of speed relates to timeliness: the sooner a researcher has access to information, the sooner they can act on it, so curated data need to be updated rapidly and frequently to reflect the latest literature. Although some manually curated systems add new data on a daily or weekly basis, here again the amount of new content added in a short time is limited by the availability of expert curators; as a result, some manually curated systems update their data only monthly or quarterly, because it takes time to read a significant number of new papers. In contrast, automated systems can be updated as often as needed, and frequently add large quantities of new content on a weekly basis. In certain cases, access to pre-press information even makes it possible to provide customers with extracted data weeks or months ahead of actual publication.
4. Molecular interactions. When trying to understand the underlying biology of a disease, process or drug response, identifying relationships between entities, for example protein-protein or drug-protein interactions, is at the heart of pathway analysis. To do this effectively, automated curation must mimic the human ability to identify these connections from text, and indeed it can. Natural-language text-mining systems identify meaningful relationships through a combination of specialized ontologies and linguistic rules, much the same way humans identify relationships between terms and concepts as they learn to read.
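As a rough sketch of what such a rule might look like, the Python below pairs a tiny, invented entity lexicon with a single "entity, verb, entity" pattern; production systems use full ontologies and far richer grammars than this.

```python
import re

# Toy entity lexicon and interaction verbs; all entries invented for illustration.
PROTEINS = ["MDM2", "p53", "BRCA1", "RAD51"]
INTERACTION_VERBS = ["binds", "phosphorylates", "inhibits", "activates"]

entity = r"(" + "|".join(map(re.escape, PROTEINS)) + r")"
verb = r"(" + "|".join(INTERACTION_VERBS) + r")"
# One linguistic rule: "<protein> <verb> [to] <protein>".
relation_pattern = re.compile(r"\b" + entity + r"\s+" + verb + r"\s+(?:to\s+)?" + entity + r"\b")

def extract_relations(sentence: str):
    """Yield (subject, relation, object, evidence sentence) tuples."""
    for m in relation_pattern.finditer(sentence):
        yield m.group(1), m.group(2), m.group(3), sentence

for rel in extract_relations("Our data show that MDM2 binds to p53 in vivo."):
    print(rel)
# ('MDM2', 'binds', 'p53', 'Our data show that MDM2 binds to p53 in vivo.')
```

Note that each extracted relationship keeps the sentence it came from as evidence, which is what enables the review workflow described under point 5 below.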
As noted previously, although abstracts are short and can be read quickly, they don’t contain all the key facts from the full-text publication; typically, fewer than half of the terms cited in a paper, including molecular relationships, are mentioned in the abstract. Because automated systems scan full-text articles as well as abstracts, they can identify many more relevant relationships than could be found by scanning abstracts alone.
5. Personal preference. Some researchers feel comfortable relying on a curator’s expertise to identify important information in the literature. Others want to be able to review the results and use their own expertise to decide whether a particular finding or relationship is relevant or credible. Automated systems do something most manually curated systems don’t: they show the sentence in the abstract or paper used to identify each relationship, so researchers can review it themselves and decide whether to include or exclude any given reference. Rather than relying solely on someone else’s interpretation, the final decision about what information to take into account rests with each user.
6. Text-mining at home. There are a number of commercial and open-source text-mining applications currently available. Before deciding to develop a “DIY” text-mining solution, researchers should be aware of several key factors that can significantly affect the success of their efforts:
Terminology quality: Automated systems rely primarily on terminologies, lists of specific terms and their related concepts, to extract matching information from source documents. With few exceptions, anything not included in the terminology can’t be reliably extracted from the literature. And the creation of comprehensive biomedical terminologies requires both deep domain expertise in the area of study and significant expertise in linguistics, pattern matching and ontology development: not something for the faint of heart. Fortunately, a number of high-quality, domain-specific terminologies are publicly available. The challenge then is to combine and de-duplicate them into a comprehensive set of terms, something commercial vendors will already have done for their commercial systems.
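To illustrate what combining and de-duplicating involves, here is a minimal sketch that merges two invented term lists under a crude normalization rule; a real merge must also reconcile conflicting concept mappings, which this sketch sidesteps.

```python
# Two hypothetical public term lists mapping surface forms to concept IDs.
list_a = {"Tumour Necrosis Factor": "TNF", "TNF-alpha": "TNF"}
list_b = {"tumor necrosis factor": "TNF", "Interleukin-6": "IL6", "IL-6": "IL6"}

def normalize(term: str) -> str:
    """Crude normalization used only for de-duplication: lowercase,
    unify UK/US spelling, and treat hyphens as spaces."""
    return term.lower().replace("tumour", "tumor").replace("-", " ")

merged: dict[str, str] = {}
for source in (list_a, list_b):
    for term, concept in source.items():
        merged.setdefault(normalize(term), concept)  # first occurrence wins

print(merged)
# "Tumour Necrosis Factor" and "tumor necrosis factor" collapse into one entry;
# the distinct synonyms ("tnf alpha", "interleukin 6", "il 6") are all kept.
```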
Content licensing: Depending on the source of the material, there may be restrictions (licensing terms, volume limits, the age of articles, etc.) on how end users can extract information from that content, particularly if they don’t have a subscription.
The bottom line is this: if you’re trying to decide between solutions based on manual curation and those based on automated curation, ask yourself a few questions. Would your project benefit from information obtained from a wide swath of journals, or just a chosen few? Does identifying a greater number of relevant relationships between entities give you more confidence in your data? Are you comfortable relying solely on an external perspective on what research is most relevant to your work, or do you want to review the information and make that decision yourself? The final choice is yours.