Breast Cancer Prognosis with Apache Spark Random Forest Pipeline

Authors

  • Timmana Hari Krishna
  • C. Rajabhushanam

Abstract

Breast cancer is one of the most common cancers diagnosed in women in Western countries. Breast cancer research and awareness support improvements in cancer diagnosis and treatment. Early detection of breast cancer improves survival rates and reduces the number of deaths related to the disease. Computing techniques have recently spread across all domains, including medicine and healthcare. Data science and machine learning techniques are used in cancer prediction and analysis to obtain rapid, accurate results. Cancer prediction involves identifying malignant cells among breast cells. Researchers and pathologists have used several machine learning algorithms for cancer prediction, such as K-Nearest Neighbors, logistic regression, support vector machines, artificial neural networks, and decision trees. However, these studies did not settle on a single feasible method for cancer prediction. In this paper we propose a scalable, fault-tolerant pipeline model that analyses big cancer data and predicts cancerous cells in real time. The model is developed on Apache Spark using the Machine Learning Pipeline API. We implemented the pipeline using the Random Forest algorithm and compared it with a baseline model in terms of accuracy and performance.

Published

2020-01-30

Section

Articles