Introduction

Integrating bioinformatics, systems biology and biomedical text mining resources: Data, Tools, Methods and Use Cases

Tutorial @ Coling 2018 Santa Fe, New Mexico (USA), 20th Aug 2018

Motivation and aim of tutorial

There is increasing interest in the development of resources that can merge and strengthen the analysis of information derived from both structured and unstructured biomedical data. Structured data exploited by bioinformatics and systems biology analysis platforms include large-scale experimental results generated by genomics, proteomics and metabolomics tools, next-generation sequencing and even single-cell analysis.

Interpretation, analysis and even design of large-scale experiments typically require the extraction of key characterizations, e.g. interactions between biomolecules or associations between mutations, chemical substances and diseases, that are characterized in depth only in unstructured data repositories.

The rapid growth of natural language data in biomedical sciences (including scientific articles, patents and patient record repositories), together with the practical relevance of these resources, has motivated the implementation of a considerable number of new applications. Text-mining-assisted literature curation has been especially promising for the development and maintenance of manually annotated databases, as well as for the construction of gold standard datasets and gene lists in the context of Systems Biology and gene set enrichment.

We will begin with a gentle introduction to biomedical text mining and its application in various Biology and Bioinformatics related domains. Basic concepts and application types within bioinformatics and systems biology will be introduced, with emphasis on those aspects that are relevant to language technologies tailored to this particular domain, as opposed to general-purpose tools.

Existing resources for building biomedical text mining applications will be presented, including (1) useful data collections, (2) lexical/terminological resources, (3) features of biomedical language data exploited by text mining systems and (4) widely used basic components integrated in biomedical text mining pipelines. The main types of currently available biomedical text mining applications will also be discussed, including sophisticated domain-specific information retrieval and chemical/biomedical/clinical text classification systems. We will cover basic aspects underlying the design of high-impact biomedical named entity recognition systems for entities including genes, proteins, microRNAs, chemical compounds and cell types, followed by state-of-the-art tools for the detection and annotation of biomedical relations and events, including chemical-gene and protein-protein interactions, drug-disease relationships and functional descriptions of gene products. The use of literature for knowledge discovery and hypothesis generation in biomedicine will be explained. Evaluation and usability are crucial aspects of biomedical text mining systems; they will be introduced through a short overview of existing gold standard corpora and of community evaluation efforts and shared tasks such as the BioCreative challenges, CLEF and the BioNLP shared tasks. Finally, a practical case study will walk step by step through the implementation of a biomedical text mining system, showing how such a system can be constructed for a particular information need.

After the tutorial, participants should be aware of the importance and basic characteristics of biomedical text mining systems, comprehend key bioinformatics and systems biology concepts and understand how the results produced by biomedical language technologies can be combined with the output produced by lab experiments. They should be able to understand how existing biomedical text mining systems work, on what features they rely, and which annotated resources and basic components can be exploited to construct such applications in practice.

Intended Audience

The tutorial is intended for a broad range of researchers interested in biomedical text mining, both developers of new systems and end users of such resources. Potential participants include researchers and students specialized in language technologies or computational linguistics who would like to understand the key biology and bioinformatics concepts that matter for biomedical text mining, and how text mining results can be combined with traditional bioinformatics analysis.

Another audience that we expect will benefit from this tutorial consists of machine-learning experts developing supervised text processing tools who would like to know more about tasks and training corpora relevant to biomedical text mining.