Promises and Challenges of Big Data Computing in Health Sciences
Graphical abstract
We outline the steps of a Big Data project: (1) formulate the question; (2) find the right channels (smart devices, the Internet, hospitals, …) to collect the data; (3) store the data; (4) analyze the data; (5) generate an analysis report with clear visualizations; and (6) evaluate the project: either the problem is solved or the process starts over. We review the latest applications of Big Data in health sciences, introduce the cutting-edge computational technologies for Big Data collection, storage and transfer together with state-of-the-art analytical methods, and discuss the future perspectives of health sciences in the era of Big Data.
Introduction
The concept of Big Data is causing a worldwide buzz. Its successful applications in business [1], science and healthcare [2] have radically changed traditional practices in these fields. The demand for Big Data analysis is increasing day by day: more than 200 colleges offer degrees in Data Science (http://colleges.datascience.community/, accessed on September 14, 2014).
However, there are many misunderstandings about Big Data, and even its definition is debated. According to http://datascience.berkeley.edu/what-is-big-data/ (accessed on September 14, 2014), there are at least 43 different definitions. Generally speaking, people agree that Big Data has four V's: (1) a large volume of data; (2) a variety of data types; (3) a high velocity of data generation and updating [3]; and (4) big value created from the data [4]. The first three V's concern data engineering, such as data collection, storage and transfer. The last V concerns data science, such as analytic and statistical methods, knowledge extraction and decision-making.
To start a Big Data project, we suggest the steps shown in Fig. 1. First, the right problem should be chosen. There are three kinds of problems. The first kind has already been solved with traditional methods, so there is no need for Big Data technologies. The second kind cannot be solved with current technologies. We should focus on the third kind, which is solvable with current Big Data technologies. Second, after setting a practical goal, we need to generate the data with sensors, monitors or molecular profiling, or extract the data from public databases and other sources. Third, we need to pre-process the data to obtain clean and meaningful data. Data pre-processing is critical to the success of a Big Data project. A recent publication [5] showed that sample misalignment in eQTL (expression Quantitative Trait Loci) and mQTL (methylation Quantitative Trait Loci) studies reduces the number of discovered associations by 2- to 7-fold. Data quality control essentially determines the upper bound on the quality of the data product, i.e. garbage in, garbage out. The clean data are then stored in a database for the next analysis step. Fourth, insight or knowledge is discovered from the processed data through statistical analysis. Finally, the analytic results are presented to the end user as a report, an online recommendation or a decision. Visualization of the data, such as networks/graphs and charts, makes the analytic results easy to interpret and understand. If the results do not make sense, we need to reformulate the problem and start over.
In health sciences, many problems can be addressed with Big Data technologies, such as recommendation systems in healthcare, Internet-based epidemic surveillance, sensor-based health condition and food safety monitoring, Genome-Wide Association Studies (GWAS) and expression Quantitative Trait Loci (eQTL) analysis, air quality inference from Big Data, and metabolomics and ionomics for nutritionists.
Solving these problems requires many advanced computational technologies. We cover the following technological perspectives: (1) Infrastructure of Big Data; (2) Analysis of Big Data; and (3) Visualization of Big Data Results. Finally, we discuss the future perspectives of health sciences in the era of Big Data.
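The pre-processing step described above can be illustrated with a minimal sketch. It assumes hypothetical wearable-sensor records with `user_id`, `timestamp` and `heart_rate` fields, and an assumed plausible range of 30 to 220 beats per minute; none of these names or thresholds come from the original study.

```python
from statistics import mean


def clean_records(records, valid_range=(30.0, 220.0)):
    """Drop records with missing values, implausible values, or duplicate keys.

    The field names and the valid range are illustrative assumptions.
    """
    low, high = valid_range
    seen = set()
    cleaned = []
    for rec in records:
        value = rec.get("heart_rate")
        key = (rec.get("user_id"), rec.get("timestamp"))
        if value is None:
            continue  # missing measurement
        if not (low <= value <= high):
            continue  # outside the assumed plausible range
        if key in seen:
            continue  # same user and time point already kept
        seen.add(key)
        cleaned.append(rec)
    return cleaned


# Hypothetical raw data containing typical quality problems.
raw = [
    {"user_id": "u1", "timestamp": 1, "heart_rate": 72.0},
    {"user_id": "u1", "timestamp": 1, "heart_rate": 72.0},   # duplicate
    {"user_id": "u2", "timestamp": 1, "heart_rate": None},   # missing
    {"user_id": "u3", "timestamp": 1, "heart_rate": 999.0},  # sensor glitch
    {"user_id": "u4", "timestamp": 1, "heart_rate": 64.0},
]

cleaned = clean_records(raw)
print(len(cleaned), mean(r["heart_rate"] for r in cleaned))
```

Only the cleaned records would be passed on to storage and analysis; in a real project the filtering rules would depend on the instrument and the study design.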
Section snippets
The Big Data studies in health sciences
Big Data technologies have many successful applications in biomedicine, especially in health sciences. For example, data from search engines and social networks can help to gather people's reactions and monitor the status of epidemic diseases. Such analysis is worldwide and real-time, and much faster than official channels such as the CDC (Centers for Disease Control and Prevention) and the WHO (World Health Organization). Several case studies are elaborated in the following paragraphs.
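As a rough illustration of how search data might track an epidemic, the sketch below correlates weekly query counts with officially reported case counts and fits a straight line to estimate cases for a new week. All numbers are made up for illustration, and the functions `pearson` and `fit_line` are hypothetical helpers, not the method of any cited surveillance system.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return cov / sqrt(vx * vy)


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line to constant x values")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


# Hypothetical weekly data: flu-related query counts and reported cases.
weekly_queries = [120, 150, 210, 330, 400, 310, 190]
reported_cases = [14, 17, 25, 41, 47, 36, 22]

r = pearson(weekly_queries, reported_cases)
slope, intercept = fit_line(weekly_queries, reported_cases)

# Estimate cases for a week with 250 queries, before official counts arrive.
estimated_cases = slope * 250 + intercept
print(f"correlation={r:.3f}, estimated cases={estimated_cases:.1f}")
```

A strong correlation on historical weeks is what would justify using query volume as an early signal; a simple linear fit like this one can drift badly when search behavior changes, so it should be re-validated against official counts.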
Infrastructure of Big Data
Even though the goals of Big Data projects differ, they share some key patterns and use similar computational technologies. In this section, we introduce these shared technologies, such as data collection, storage and transfer, from a computer science perspective.
How to analyze Big Data
Different types of Big Data require different analysis methods. A comprehensive list of analysis methods can be found at https://github.com/onurakpolat/awesome-bigdata. We chose three analysis methods widely used in computer science and biomedicine to share with readers: (1) Recommendation Systems; (2) Deep Learning; and (3) Network Analysis.
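To make the first of these methods concrete, here is a minimal sketch of a recommendation system based on user-based collaborative filtering with cosine similarity. This is one common textbook approach, chosen here for illustration; the ratings, user names and item names are hypothetical, and the section itself may describe other variants.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(target, ratings, top_n=3):
    """Rank items the target user has not rated by similarity-weighted ratings."""
    scores, weights = {}, {}
    for user, items in ratings.items():
        if user == target:
            continue
        sim = cosine_similarity(ratings[target], items)
        if sim <= 0:
            continue  # ignore users with no overlap
        for item, rating in items.items():
            if item in ratings[target]:
                continue  # only recommend unseen items
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings (1-5) of health education materials.
ratings = {
    "alice": {"diet_plan": 5, "walking": 4, "sleep_tips": 1},
    "bob": {"diet_plan": 5, "walking": 5, "yoga": 4},
    "carol": {"sleep_tips": 5, "meditation": 4, "yoga": 2},
    "dave": {"diet_plan": 4, "walking": 4, "yoga": 5, "meditation": 2},
}

recs = recommend("alice", ratings)
print(recs)
```

Users whose past ratings resemble the target's contribute more to each predicted rating, so items liked by similar users rise to the top of the list. Real health recommender systems would also need to handle privacy and clinical safety, which this sketch ignores.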
Visualization of Big Data results
Graphical presentation is one of the best ways to intuitively grasp the meaning of the data and the insight revealed by the analysis. Many tools are available for visualizing Big Data, as shown in Table 1.
As a general-purpose programming language, R [116] has about six thousand high-quality packages (http://cran.r-project.org/web/packages/, accessed on September 14, 2014) that provide sophisticated functionality. Its excellent help system and powerful functions make it one of the most widely used languages in Data Science.
The future of health sciences
The changes that Big Data will bring to health sciences are much greater than most people expect. Take smart devices, for example. Users' health measurements will be stored, analyzed and shared in the cloud. Time-course health measurement data with almost endless time points from millions of people will shift the public health research workforce from large numbers of nurses and doctors to computer scientists and a few medical experts.
It will also change the usage of health
Acknowledgements
The study was supported by research grants from the National Natural Science Foundation of China (31030039, 31225013 and 31330036 to F.W.) and by the Distinguished Professorship Program of Zhejiang University (to F.W.).
References (120)
- et al. Five years of GWAS discovery. Am. J. Hum. Genet. (2012)
- et al. Air pollution and lung cancer incidence in 17 European cohorts: prospective analyses from the European Study of Cohorts for Air Pollution Effects (ESCAPE). Lancet Oncol. (2013)
- Centrality in social networks: conceptual clarification. Soc. Netw. (1979)
- et al. Functional association between influenza A (H1N1) virus and human. Biochem. Biophys. Res. Commun. (2009)
- et al. Deciphering the effects of gene deletion on yeast longevity using network and machine learning approaches. Biochimie (2012)
- et al. Big data: the management revolution. Harv. Bus. Rev. (2012)
- et al. Big data in science and healthcare: a review of recent literature and perspectives. Contribution of the IMIA Social Media Working Group. Yearb. Med. Inform. (2014)
- Volume, velocity and variety: key challenges for mining large volumes of multimedia information
- Trend: big data. Big data analytics: from volume to value. Healthc. Inform., Bus. Mag. Inf. Commun. Syst. (2013)
- et al. Health recommender systems: concepts, requirements, technical basics and challenges. Int. J. Environ. Res. Public Health (2014)
- Healthcare information systems: data mining methods in the creation of a clinical recommender system. Enterp. Inf. Syst.
- Reliable medical recommendation systems with patient privacy. ACM Trans. Intell. Syst. Technol.
- Challenges and opportunities of using recommender systems for personalized health education. Stud. Health Technol. Inform.
- Detecting influenza epidemics using search engine query data. Nature
- Google trends: a web-based tool for real-time surveillance of disease outbreaks. Clin. Infect. Dis., Off. Publ. Infect. Dis. Soc. Am.
- Influenza forecasting with Google flu trends. PLoS ONE
- The use of Twitter to track levels of disease activity and public concern in the U.S. during the influenza A H1N1 pandemic. PLoS ONE
- You are what you tweet: analyzing twitter for public health. Artif. Intell.
- Is your food safe? New ‘smart chopsticks’ can tell. In: China Real Time, Wall Street J.
- Analysis of Genetic Association Studies
- Post-GWAS: where next? More samples, more SNPs or more biology? Heredity
- The NHGRI GWAS catalog, a curated resource of SNP-trait associations. Nucleic Acids Res.
- GWASdb: a database for human genetic variants identified by genome-wide association studies. Nucleic Acids Res.
- Genome-wide association study identifies 1p36.22 as a new susceptibility locus for hepatocellular carcinoma in chronic hepatitis B virus carriers. Nat. Genet.
- Where next for GWAS? Brief. Funct. Genomics
- Principles for the post-GWAS functional characterization of cancer risk loci. Nat. Genet.
- seeQTL: a searchable database for human eQTLs. Bioinformatics
- Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies. Bioinformatics
- The Genotype-Tissue Expression (GTEx) project. Nat. Genet.
- Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics
- An information-theoretic machine learning approach to expression QTL analysis. PLoS ONE
- Air pollution exposure and cardiovascular disease. Toxicol. Res.
- Urban air pollution linked to birth defects. J. Environ. Health
- Ambient air pollution and birth defects in Brisbane, Australia. PLoS ONE
- Early prenatal exposure to air pollution and its associations with birth defects in a state-wide birth cohort from North Carolina. Birth Defects Res. Part A, Clin. Mol. Teratol.
- China imposes air quality targets
- U-Air: when urban air quality inference meets big data
- A cloud-based knowledge discovery system for monitoring fine-grained air quality
- Inferring air pollution by sniffing social media
- N-smarts: networked suite of mobile atmospheric real-time sensors
- Indoor air quality monitoring system for smart buildings
- Metabolomics: A Powerful Tool in Systems Biology
- Metabolomics: from small molecules to big ideas. Nat. Methods
- Metabolic footprint of diabetes: a multiplatform metabolomics study in an epidemiological setting. PLoS ONE
- Metabolomics in human type 2 diabetes research. Front. Med.
- Metabolomics in toxicology and preclinical research. ALTEX
- The MetaboLights repository: curation challenges in metabolomics. Database, J. Biol. Databases Curation
- Ionomics: the functional genomics of elements. Brief. Funct. Genomics
- Genomic scale profiling of nutrient and trace elements in Arabidopsis thaliana. Nat. Biotechnol.
- Associations between ionomic profile and metabolic abnormalities in human population. PLoS ONE