Advances in Bioinformatics
Volume 2018, Article ID 9391635, 9 pages
https://doi.org/10.1155/2018/9391635
Research Article

Framework for Parallel Preprocessing of Microarray Data Using Hadoop

1Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 Bangi, Malaysia
2Faculty of Creative Multimedia, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia

Correspondence should be addressed to Ravie Chandren Muniyandi; ravie@ukm.edu.my, Mahdi Sahlabadi; sahlabadi2002@gmail.com, and Hossein Golshanbafghy; h.golshan@gmail.com

Received 9 September 2017; Revised 29 January 2018; Accepted 13 February 2018; Published 29 March 2018

Academic Editor: Florentino Fdez-Riverola

Copyright © 2018 Amirhossein Sahlabadi et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Microarray technology has become a popular way to study gene expression and to diagnose disease. The National Center for Biotechnology Information (NCBI) hosts public databases containing large volumes of biological data that must be preprocessed because they carry high levels of noise and bias. Robust Multiarray Average (RMA) is a standard and popular method for preprocessing such data and removing noise. Most preprocessing algorithms are time-consuming and cannot handle large datasets with thousands of experiments. Parallel processing can address these issues. Hadoop is a well-known distributed storage and processing framework that provides a parallel environment in which to run the experiment. In this research, for the first time, the capability of Hadoop and the statistical power of R are leveraged to parallelize the existing RMA preprocessing algorithm so that microarray data can be processed efficiently. The experiment was run on a cluster of 5 nodes, each with 16 cores and 16 GB of memory. It compares the efficiency and performance of RMA parallelized with Hadoop against RMA parallelized with the affyPara package and against sequential RMA. The results show that the proposed approach outperforms both the sequential approach and the affyPara approach in speed-up.
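
The comparison in the abstract rests on speed-up, which is conventionally defined as the sequential run time divided by the parallel run time; dividing that ratio by the number of workers gives parallel efficiency. The following is a minimal sketch of those two quantities, not the authors' evaluation code. The cluster size (5 nodes with 16 cores each) comes from the abstract, while the function names and the two timings are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not measurements from this article.

```python
def speedup(sequential_seconds: float, parallel_seconds: float) -> float:
    """Return the ratio of sequential to parallel wall-clock time."""
    if sequential_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("timings must be positive")
    return sequential_seconds / parallel_seconds


def efficiency(sequential_seconds: float, parallel_seconds: float, workers: int) -> float:
    """Return speed-up divided by the number of workers (1.0 would be ideal)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(sequential_seconds, parallel_seconds) / workers


# Cluster size as stated in the abstract: 5 nodes, 16 cores per node.
NODES = 5
CORES_PER_NODE = 16
TOTAL_CORES = NODES * CORES_PER_NODE

# Placeholder timings in seconds (assumptions for illustration only,
# NOT results reported in the article).
PLACEHOLDER_SEQUENTIAL = 1000.0
PLACEHOLDER_PARALLEL = 50.0

if __name__ == "__main__":
    s = speedup(PLACEHOLDER_SEQUENTIAL, PLACEHOLDER_PARALLEL)
    e = efficiency(PLACEHOLDER_SEQUENTIAL, PLACEHOLDER_PARALLEL, TOTAL_CORES)
    print(f"cores: {TOTAL_CORES}, speed-up: {s:.1f}x, efficiency: {e:.2f}")
```

Reporting efficiency alongside speed-up helps when comparing approaches that use different numbers of workers, since a large speed-up on many cores can still reflect poor per-core utilization.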