Binary Classification with Apache Spark / HDFS

December 6, 2018 · View on GitHub

↖data source

The goal of the competition is to predict which parts will fail quality control

My goal is to utilize the hadoop ecosystem to handle a large dataset and establish a pipeline for machine learning

`munge` :

Aggregate columns using RDD transformations
Create a column that indicates which of those column aggregations are outliers.

`fit_predict` :

Model data with Spark Machine Learning package
Predict on test data

`munge_fit_predict` :

Run this as is to use the toy data set example