Elsevier

Big Data Research

Volume 2, Issue 1, March 2015, Pages 2-11

Promises and Challenges of Big Data Computing in Health Sciences

https://doi.org/10.1016/j.bdr.2015.02.002

Abstract

With the development of smart devices and cloud computing, more and more public health data can be collected from various sources and analyzed in unprecedented ways. The huge social and academic impact of these developments has caused a worldwide buzz around Big Data. In this review article, we summarize the latest applications of Big Data in health sciences, including recommendation systems in healthcare, Internet-based epidemic surveillance, sensor-based health condition and food safety monitoring, Genome-Wide Association Studies (GWAS) and expression Quantitative Trait Loci (eQTL), inferring air quality using big data, and metabolomics and ionomics for nutritionists. We also review the latest technologies for Big Data collection, storage and transfer, and state-of-the-art analytical methods, such as the Hadoop distributed file system, MapReduce, recommendation systems, deep learning and network analysis. Finally, we discuss the future perspectives of health sciences in the era of Big Data.

Graphical abstract

We explain the steps of a Big Data project: 1. Formulate your question; 2. Find the right ways (smart devices, Internet, hospitals …) to collect your data; 3. Store the data; 4. Analyze your data; 5. Generate the analysis report with vivid visualization; 6. Evaluate the project: problem solved or start over. The latest applications of Big Data in health sciences are reviewed. The cutting-edge computational technologies for Big Data collection, storage and transfer, and the state-of-the-art analytical methods are introduced. The future perspectives of health sciences in the era of Big Data are discussed.

Introduction

The concept of Big Data is causing a worldwide buzz. Its successful applications in business [1], the sciences and healthcare [2] have radically changed traditional practices in these fields. The demand for Big Data analysis is increasing day by day. More than 200 colleges offer degrees in Data Science (http://colleges.datascience.community/, accessed on September 14, 2014).

But there are many misunderstandings about Big Data, and even its definition is debated. According to http://datascience.berkeley.edu/what-is-big-data/ (accessed on September 14, 2014), there are at least 43 different definitions. Generally speaking, people agree that Big Data has four V's: (1) big volume of data; (2) variety of data types; (3) high velocity of data generation and updating [3]; and (4) big value created from the data [4]. The first three V's concern data engineering, such as data collection, storage and transfer. The last V concerns data science, such as analytic and statistical methods, knowledge extraction and decision-making.

To start a Big Data project, several steps are suggested, as shown in Fig. 1. First, the right problem should be chosen. There are three kinds of problems. The first kind has already been solved with traditional methods, so there is no need to use big data technologies. The second kind cannot be solved with current technologies. We should focus on the third kind, which is solvable with current big data technologies. Second, after setting a practical goal, we need to generate the data with sensors, monitors or molecular profiling, or extract the data from public databases and other sources. Third, we need to pre-process the data to obtain clean and meaningful data. Data pre-processing is a critical step for the success of a Big Data project. A recent publication [5] showed that sample misalignment in eQTL (expression Quantitative Trait Loci) and mQTL (methylation Quantitative Trait Loci) studies can reduce the number of discovered associations 2- to 7-fold. The quality control of the data essentially determines the upper bound of the data product, i.e. garbage in, garbage out. The clean data are stored in a database for the next analysis step. Fourth, insight or knowledge is discovered from the processed data through statistical analysis. Finally, the analytic results are presented to the end user as a report, an online recommendation or a decision. Visualization of data, such as networks/graphs and charts, makes the analytic results easy to interpret and understand. If the results do not make sense, we need to reformulate the problem and start the steps over again.
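The pre-processing step can be illustrated with a minimal sketch. It assumes a hypothetical stream of heart-rate readings from a wearable device; the plausible range and the function names are illustrative assumptions, not taken from the article.

```python
# Hypothetical plausible range for heart rate in beats per minute.
# This threshold is an assumption chosen only for illustration.
PLAUSIBLE_BPM = (30.0, 220.0)


def clean_readings(readings):
    """Drop missing readings and readings outside the plausible range."""
    low, high = PLAUSIBLE_BPM
    return [r for r in readings if r is not None and low <= r <= high]


def mean_of_clean(readings):
    """Average the readings that survive quality control.

    Returns None when no usable reading remains, so that downstream
    analysis does not silently report a value computed from garbage.
    """
    cleaned = clean_readings(readings)
    if not cleaned:
        return None
    return sum(cleaned) / len(cleaned)
```

Returning None rather than 0 when nothing survives the filter is one possible design choice: it forces the analysis step to handle the absence of clean data explicitly.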

In health sciences, many problems can be addressed with big data technologies, such as recommendation systems in healthcare, Internet-based epidemic surveillance, sensor-based health condition and food safety monitoring, Genome-Wide Association Studies (GWAS) and expression Quantitative Trait Loci (eQTL), inferring air quality using big data, and metabolomics and ionomics for nutritionists.

To solve these problems, many advanced computational technologies are used. We will cover the following technological perspectives: (1) Infrastructure of Big Data; (2) Analysis of Big Data; and (3) Visualization of Big Data Results. Finally, the future perspectives of health sciences in the era of Big Data will be discussed.

The Big Data studies in health sciences

Big Data technologies have many successful applications in biomedicine, especially in health sciences. For example, data from search engines and social networks can help to gather people's reactions and monitor the status of epidemic diseases. Such analysis is worldwide and real-time, and much faster than official channels such as the CDC (Centers for Disease Control) and the WHO (World Health Organization). Several case studies are elaborated in the following paragraphs.
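As a rough illustration of how search data might be compared against official reports, the sketch below computes the Pearson correlation between two weekly series. It is a simplified stand-in, not the method of any study cited here; the function name and the idea of pairing weekly query counts with reported case counts are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Given weekly counts of a flu-related query and weekly reported cases, a value close to 1 would suggest that the query volume tracks the official figures, although correlation alone does not show that it would keep doing so in a later season.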

Infrastructure of Big Data

Even though the goals of Big Data projects differ, they share some key patterns and use similar computational technologies. In this section, we introduce these shared technologies, such as data collection, storage and transfer, from a computer science perspective.
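MapReduce, which the abstract lists among the reviewed technologies, can be sketched on a single machine to show its three phases. The example below counts term occurrences across free-text records; it is a toy, in-memory illustration with hypothetical function names, not the Hadoop API, and it does not distribute any work.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (term, 1) pair for each whitespace-separated token."""
    return [(token.lower(), 1) for token in record.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def map_reduce(records):
    """Run the three phases in sequence over an in-memory list of records."""
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run in parallel on different nodes and the shuffle would move data across the network; the structure of the computation is what this sketch is meant to convey.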

How to analyze Big Data

Different types of Big Data require different analysis methods. A comprehensive list of analysis methods can be found at https://github.com/onurakpolat/awesome-bigdata. We chose three analysis methods that are widely used in computer science and biomedicine to share with readers: (1) Recommendation System; (2) Deep Learning; and (3) Network Analysis.
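To make the first of these methods concrete, here is a minimal sketch of user-based collaborative filtering, one common way to build a recommendation system. It is an assumption that this variant matches what the full article describes; the data layout (a dict mapping each user to a dict of item ratings) and the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two rating dicts, treating unrated items as 0."""
    shared = set(a) & set(b)
    dot = sum(a[item] * b[item] for item in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Predict a rating as the similarity-weighted mean of other users' ratings.

    Assumes non-negative ratings, so that similarities are non-negative.
    Returns None when no other user with positive similarity rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        weight_total += weight
    if weight_total == 0:
        return None
    return weighted_sum / weight_total
```

In a healthcare setting the items might be, for example, health education articles and the ratings some measure of usefulness; the prediction leans toward the ratings of users whose past ratings resemble those of the target user.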

Visualization of Big Data results

Graphical presentation is the most intuitive way to convey the meaning of the data and the insights revealed by the analysis. There are many tools for visualizing Big Data, as shown in Table 1.

As a general-purpose programming language, R [116] has about six thousand high-quality packages (http://cran.r-project.org/web/packages/, accessed on September 14, 2014) that provide sophisticated functions. Its excellent help system and powerful functions make it the most widely used language in Data Science.

The future of health sciences

The changes that Big Data will bring to health sciences are much greater than most people expect. Take smart devices as an example. Health measurements from users will be stored, analyzed and shared in the cloud. Time-course health data with almost endless time points from millions of people will shift the public health research workforce from a large number of nurses and doctors to computer scientists and a few medical experts.

It will also change the usage of health

Acknowledgements

The study was supported by grants from the National Natural Science Foundation of China (31030039, 31225013 and 31330036 to F.W.) and by the Distinguished Professorship Program of Zhejiang University (to F.W.).

References (120)

  • L. Duan et al.

    Healthcare information systems: data mining methods in the creation of a clinical recommender system

    Enterp. Inf. Syst.

    (2011)
  • T.R. Hoens et al.

    Reliable medical recommendation systems with patient privacy

    ACM Trans. Intell. Syst. Technol.

    (2013)
  • L. Fernandez-Luque et al.

    Challenges and opportunities of using recommender systems for personalized health education

    Stud. Health Technol. Inform.

    (2009)
  • J. Ginsberg et al.

    Detecting influenza epidemics using search engine query data

    Nature

    (2009)
  • H.A. Carneiro et al.

    Google trends: a web-based tool for real-time surveillance of disease outbreaks

    Clin. Infect. Dis., Off. Publ. Infect. Dis. Soc. Am.

    (2009)
  • A.F. Dugas et al.

    Influenza forecasting with Google flu trends

    PLoS ONE

    (2013)
  • A. Signorini et al.

    The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic

    PLoS ONE

    (2011)
  • M. Paul et al.

    You are what you tweet: analyzing twitter for public health

    Artif. Intell.

    (2011)
  • Y. Jie

    Is your food safe? New ‘smart chopsticks’ can tell in: China real time

    Wall Street J.

    (2014)
  • G. Zheng et al.

    Analysis of Genetic Association Studies

    (2012)
  • P. Marjoram et al.

    Post-GWAS: where next? More samples, more SNPs or more biology?

    Heredity

    (2014)
  • D. Welter et al.

    The NHGRI GWAS catalog, a curated resource of SNP-trait associations

    Nucleic Acids Res.

    (2014)
  • M.J. Li et al.

    GWASdb: a database for human genetic variants identified by genome-wide association studies

    Nucleic Acids Res.

    (2012)
  • H. Zhang et al.

    Genome-wide association study identifies 1p36.22 as a new susceptibility locus for hepatocellular carcinoma in chronic hepatitis B virus carriers

    Nat. Genet.

    (2010)
  • G.S. Yeo

    Where next for GWAS?

    Brief. Funct. Genomics

    (2011)
  • M.L. Freedman et al.

    Principles for the post-GWAS functional characterization of cancer risk loci

    Nat. Genet.

    (2011)
  • K. Xia et al.

    seeQTL: a searchable database for human eQTLs

    Bioinformatics

    (2012)
  • T.P. Yang et al.

    Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies

    Bioinformatics

    (2010)
  • The Genotype-Tissue Expression (GTEx) project

    Nat. Genet.

    (2013)
  • A.A. Shabalin

    Matrix eQTL: ultra fast eQTL analysis via large matrix operations

    Bioinformatics

    (2012)
  • T. Huang et al.

    An information-theoretic machine learning approach to expression QTL analysis

    PLoS ONE

    (2013)
  • B.J. Lee et al.

    Air pollution exposure and cardiovascular disease

    Toxicol. Res.

    (2014)
  • Urban air pollution linked to birth defects

    J. Environ. Health

    (2002)
  • C.A. Hansen et al.

    Ambient air pollution and birth defects in Brisbane, Australia

    PLoS ONE

    (2009)
  • L.C. Vinikoor-Imler et al.

    Early prenatal exposure to air pollution and its associations with birth defects in a state-wide birth cohort from North Carolina

    Birth Defects Res., Part A, Clin. Mol. Teratol.

    (2013)
  • Xinhua

    China imposes air quality targets

  • Y. Zheng et al.

    U-Air: when urban air quality inference meets big data

  • Y. Zheng et al.

    A cloud-based knowledge discovery system for monitoring fine-grained air quality

  • S. Mei et al.

    Inferring air pollution by sniffing social media

  • R. Honicky et al.

    N-smarts: networked suite of mobile atmospheric real-time sensors

  • X. Chen et al.

    Indoor air quality monitoring system for smart buildings

  • J. Nielsen et al.

    Metabolomics: A Powerful Tool in Systems Biology

    (2007)
  • M. Baker

    Metabolomics: from small molecules to big ideas

    Nat. Methods

    (2011)
  • K. Suhre et al.

    Metabolic footprint of diabetes: a multiplatform metabolomics study in an epidemiological setting

    PLoS ONE

    (2010)
  • J. Lu et al.

    Metabolomics in human type 2 diabetes research

    Front. Med.

    (2013)
  • T. Ramirez et al.

    Metabolomics in toxicology and preclinical research

    ALTEX

    (2013)
  • R.M. Salek et al.

    The MetaboLights repository: curation challenges in metabolomics

    Database, J. Biol. Databases Curation

    (2013)
  • I. Baxter

    Ionomics: the functional genomics of elements

    Brief. Funct. Genomics

    (2010)
  • B. Lahner et al.

    Genomic scale profiling of nutrient and trace elements in Arabidopsis thaliana

    Nat. Biotechnol.

    (2003)
  • L. Sun et al.

    Associations between ionomic profile and metabolic abnormalities in human population

    PLoS ONE

    (2012)

    This article belongs to Comp, Bus & Health Sci.

    1

    Co-first author.
