Title: Integrating domain knowledge in supervised machine learning to assess the risk of breast cancer

Authors: Aniket Bochare; Aryya Gangopadhyay; Yelena Yesha; Anupam Joshi; Yaacov Yesha; Mary Brady; Michael A. Grasso; Naphtali Rishe

Addresses: University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA; University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA; University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA; University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA; University of Maryland Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland 21250, USA; National Institute of Standards and Technology, 100 Bureau Drive, Gaithersburg, Maryland 20899, USA; University of Maryland, School of Medicine, 655 West Baltimore Street, Baltimore, Maryland 21201, USA; Florida International University, 11200 SW 8th St, Miami, Florida 33174, USA

Abstract: We used several supervised machine learning and data mining techniques to build a model that predicts the risk of breast cancer in postmenopausal women from genomic data, family history, and age. In this paper, we propose an approach that selects the nine best SNPs using several feature selection algorithms, and we evaluate the performance of binary classifiers on the selected features. We have also designed an algorithm to incorporate domain knowledge into our machine learning model. Our observations show that the model built using both domain knowledge and feature selection performed better than the naive classification approach. It is also interesting to note that, in addition to selecting the nine best SNPs, feature selection removed age from the set of features used for cancer risk assessment.
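
The feature selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not name the feature selection algorithms or the data set, so everything below is an assumption made for illustration: a Pearson chi-square score stands in for the scoring method, the genotype data are synthetic, and the names `chi_square`, `select_top_k`, and `make_synthetic_data` are hypothetical. It is not the authors' method and does not cover their domain knowledge algorithm.

```python
import random
from collections import Counter


def chi_square(feature, labels):
    """Pearson chi-square statistic of a categorical feature vs. a class label."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected cell count under independence of feature and label.
            expected = f_count * c_count / n
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def select_top_k(columns, labels, k=9):
    """Return the names of the k features with the largest chi-square score."""
    scores = {name: chi_square(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def make_synthetic_data(n_samples=400, n_snps=30, n_informative=9, seed=0):
    """Build synthetic genotypes (0, 1, or 2 minor alleles) and case/control labels.

    Assumption for illustration only: the first n_informative SNPs have a
    genotype distribution that depends on the label; the rest are pure noise.
    """
    rng = random.Random(seed)
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    columns = {}
    for j in range(n_snps):
        name = f"snp_{j}"
        if j < n_informative:
            columns[name] = [
                rng.choices([0, 1, 2], weights=[1, 2, 5] if y else [5, 2, 1])[0]
                for y in labels
            ]
        else:
            columns[name] = [rng.choice([0, 1, 2]) for _ in labels]
    return columns, labels


columns, labels = make_synthetic_data()
selected = select_top_k(columns, labels, k=9)
print(selected)
```

The selected SNP columns would then be passed, together with any clinical features that survive selection, to a binary classifier. A filter-style score is used here only because it is simple to state; wrapper or embedded selection methods would fit the same interface.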

Keywords: breast cancer; classification; single nucleotide polymorphism; SNP; genome-clinical; domain knowledge; medical informatics; feature selection; supervised machine learning; risk assessment; cancer risk; data mining; post menopausal women; genome data; family history; age.

DOI: 10.1504/IJMEI.2014.060245

International Journal of Medical Engineering and Informatics, 2014 Vol.6 No.2, pp.87 - 99

Published online: 24 May 2014
