Colorectal Cancer Survival Prediction

End-to-end MLOps pipeline for 5-year survival classification

Built a reproducible MLOps pipeline for public healthcare data with Gradient Boosting, MLflow, and Kubeflow.

I built a reproducible machine-learning workflow on 167,497 public clinical records. It preprocesses the data, uses chi-square scoring to select the five most predictive features (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate), and trains a Gradient Boosting classifier. Predictions were served through a Flask UI that was live during the project period.

Context: Personal Project

Role: Machine Learning Engineer

Team: Solo

Date: Apr–Jun 2025

Built the full pipeline solo: preprocessing, feature selection, MLflow-tracked training, DAGsHub mirroring, the Kubeflow pipeline definition, Docker packaging, and the Flask prediction UI.

167,497 clinical records
28 → 5 chi-square feature selection
92.9% accuracy / 0.89 ROC-AUC (real data)

Python, scikit-learn, Machine Learning, Survival Prediction, Gradient Boosting, Feature Engineering

Overview

The pipeline starts with 167,497 clinical records and 28 input features, reduces them to 5 selected features (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate) with chi-square feature selection, and trains a Gradient Boosting classifier on the selected inputs. MLflow logs parameters and metrics locally, DAGsHub mirrors the runs remotely, and Kubeflow Pipelines compiles preprocessing and training into a two-step containerized DAG that ran on Minikube. A Flask model-serving UI provided the finished prediction path during the project period.

Problem

Public healthcare data becomes more useful when preprocessing, feature selection, training, and serving all run through a reproducible MLOps workflow rather than a one-off notebook.

Intended User

Built for ML and MLOps practitioners exploring reproducible survival-prediction workflows on public healthcare data.

Architecture

I preprocessed the 167,497-record dataset, applied chi-square feature selection to reduce 28 inputs to the 5 most predictive features (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate), and trained a Gradient Boosting classifier (100 estimators, a learning rate of 0.1, and a max depth of 3) on a stratified train-test split. Training runs, parameters, and evaluation metrics, including accuracy, ROC-AUC, precision, recall, and F1, were tracked in MLflow and mirrored to DAGsHub. A two-step Kubeflow Pipelines DAG, compiled to an Argo workflow and run in Docker on Minikube, orchestrated preprocessing and training, and a Flask model-serving UI handled the final prediction step.

My Contribution

Built the full pipeline solo: preprocessing, chi-square feature selection, Gradient Boosting training, MLflow tracking, DAGsHub mirroring, Kubeflow orchestration, Docker packaging, and the Flask prediction UI.

Implementation

Reduced the feature space by 82% (28 → 5 features) using chi-square feature selection, keeping healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate as the most predictive inputs.
Trained a Gradient Boosting classifier (100 estimators, learning rate 0.1, max depth 3) on a stratified train-test split, then logged accuracy, ROC-AUC, precision, recall, and F1 to MLflow.
Logged every training run to MLflow both locally and to a public DAGsHub server for remote experiment tracking.
Compiled the preprocessing-and-training DAG into a two-step Kubeflow Pipelines YAML spec and ran it on Minikube for containerized orchestration.
Packaged the serving path with Docker and delivered the clinician-facing Flask UI during the project period.
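
The training step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes scikit-learn's `GradientBoostingClassifier` and `train_test_split`, uses the hyperparameters and 80/20 stratified split stated in this write-up, and runs on synthetic data. The random seeds, the synthetic label rule, and the `metrics` dictionary keys are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the 5 selected features (NOT the real clinical data).
X = rng.random((2000, 5))
# Hypothetical label rule so the model has some signal to learn.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=2000) > 0.75).astype(int)

# Stratified 80/20 split on the label, as described in the write-up.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Hyperparameters taken from the write-up: 100 estimators, lr 0.1, depth 3.
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

# The five metrics the write-up says were tracked.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
}
print(metrics)
```

The numbers this prints describe the synthetic data only and say nothing about the reported real-data results.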

Key Decisions

Feature selection before training

Why — Chi-square selection reduced 28 clinical features to the 5 most predictive ones (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate) before training. That 82% reduction cut dimensionality without hand-picking features.

Trade-off — Aggressive feature reduction risks discarding signal that a less aggressive selection threshold might have kept.
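
A minimal sketch of how a 28 → 5 chi-square selection could look, assuming scikit-learn's `SelectKBest` with the `chi2` scorer. The source does not show the actual selection code, so the synthetic data, the label rule, and the min-max scaling step are assumptions; scaling is included because `chi2` only accepts non-negative inputs.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 28

# Synthetic stand-in for the 28 preprocessed inputs (NOT the real data).
X = rng.normal(size=(n_samples, n_features))
# Hypothetical label that depends on the first five columns only.
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# chi2 rejects negative values, so scale every column into [0, 1] first.
X_scaled = MinMaxScaler().fit_transform(X)

# Keep the k=5 columns with the highest chi-square score against the label.
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X_scaled, y)

selected_idx = selector.get_support(indices=True)  # column indices that survived
reduction = 1 - X_selected.shape[1] / n_features   # fraction of features dropped
print(selected_idx, f"{reduction:.0%}")
```

Changing `k` is the knob behind the trade-off above: a larger `k` keeps more potential signal at the cost of a wider input space.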

MLflow for experiment tracking

Why — Logged every training run's parameters and metrics locally and mirrored them to a public DAGsHub dashboard for remote visibility.
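
As a rough sketch of the logging step, assuming the standard MLflow tracking calls (`set_tracking_uri`, `set_experiment`, `start_run`, `log_params`, `log_metrics`): the function name `log_training_run`, the experiment name, and the `tracking_uri` argument are hypothetical, and the project's real logging code is not shown in this write-up. The parameter and metric values are the ones reported here.

```python
def log_training_run(params, metrics, experiment="colorectal-survival", tracking_uri=None):
    """Log one run's parameters and metrics to an MLflow tracking backend.

    `experiment` is a hypothetical name. `tracking_uri` is left to the caller;
    None keeps MLflow's default local store.
    """
    import mlflow  # imported lazily so the dictionaries below build without MLflow installed

    if tracking_uri is not None:
        mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment)
    with mlflow.start_run():
        mlflow.log_params(params)    # hyperparameters, logged once per run
        mlflow.log_metrics(metrics)  # evaluation scores for the same run


# Hyperparameters and headline metrics as reported in this write-up.
params = {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3}
metrics = {"accuracy": 0.929, "roc_auc": 0.89}

# Example call (requires MLflow; not executed here):
# log_training_run(params, metrics)
```

One way to mirror a run remotely would be to call the function a second time with the remote server's tracking URI, though the write-up does not say how the mirroring was actually wired up.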

Kubeflow for workflow orchestration

Why — Containerized and orchestrated the preprocessing-and-training DAG on Minikube rather than running scripts ad hoc.

Trade-off — Adds Kubernetes/Minikube operational overhead compared to a simpler script-based pipeline.

Flask for model serving

Why — Flask provided a lightweight serving layer for the finished prediction UI during the project period.

Trade-off — The serving layer is intentionally simple and better suited to a prototype than a production healthcare system.
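
To illustrate how lightweight such a serving layer can be, here is a hedged sketch of a Flask prediction endpoint. It is not the project's UI: the JSON `/predict` route, the field names in `FEATURES`, and the `create_app` factory are all assumptions for the example, and `model` is any fitted classifier exposing `predict_proba`.

```python
# Hypothetical field names for the five selected features, in model input order.
FEATURES = ["healthcare_costs", "tumor_size", "treatment_type", "diabetes", "mortality_rate"]


def parse_features(payload):
    """Return the five inputs as floats in model order, or raise ValueError."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")
    return [float(payload[name]) for name in FEATURES]


def create_app(model):
    """Build a Flask app around a fitted classifier with predict_proba."""
    from flask import Flask, jsonify, request  # lazy import: validation works without Flask

    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        try:
            row = parse_features(request.get_json(force=True) or {})
        except (TypeError, ValueError) as exc:
            # Reject incomplete or non-numeric input instead of guessing.
            return jsonify(error=str(exc)), 400
        probability = float(model.predict_proba([row])[0][1])
        return jsonify(survival_probability=probability)

    return app
```

Keeping validation in a plain function makes it testable without starting a server, which suits a prototype-level serving layer.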

A stratified train-test split for evaluation

Why — Stratifying the 80/20 split on the survival label kept the class balance consistent between training and evaluation, so accuracy and ROC-AUC reflected the same outcome distribution the model was trained on.

Trade-off — A single stratified split is simpler than cross-validation, but it gives one evaluation snapshot rather than a distribution of scores across folds.
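
The difference between the two evaluation styles can be sketched as below, assuming scikit-learn's `train_test_split`, `StratifiedKFold`, and `cross_val_score`. The data is synthetic, the label rule and seeds are assumptions, and cross-validation is shown only as the alternative named in the trade-off; the write-up does not say it was run.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 5 selected features and an imbalanced label.
X = rng.random((1000, 5))
y = (X[:, 0] + rng.normal(scale=0.3, size=1000) > 0.7).astype(int)

# Single stratified 80/20 split: the positive rate matches on both sides.
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(f"positive rate: all={y.mean():.3f} train={y_train.mean():.3f} test={y_test.mean():.3f}")

# Stratified 5-fold alternative: one ROC-AUC per fold instead of one snapshot.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"fold ROC-AUC mean={scores.mean():.3f} std={scores.std():.3f}")
```

The fold-to-fold spread is the information a single split cannot provide, at roughly five times the training cost.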

Testing & Validation

Validation used a stratified train-test split, with accuracy, ROC-AUC, precision, recall, and F1 tracked through MLflow and mirrored to DAGsHub on every run.

Results

Reduced the feature space by 82% (28 → 5 features) via chi-square selection and reached 92.9% accuracy with 0.89 ROC-AUC on real public clinical data. The contextual comparison on synthetic data was 59.9% accuracy with 0.50 ROC-AUC.
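
For reading the two ROC-AUC figures, it can help to see how the metric is computed. The sketch below is a plain-Python illustration using the pairwise definition (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counting half). The labels and scores are toy values, not project data.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # A positive "wins" a pair when it scores higher; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(labels, [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]))  # perfect ranking -> 1.0
print(roc_auc(labels, [0.5] * 6))                       # uninformative scores -> 0.5
```

A score of 0.50 therefore corresponds to a ranking with no discriminative power, while 0.89 means a positive case is ranked above a negative one in about 89% of pairs.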

Reliability & Failure Handling

Not a clinical system. The project is a public-data MLOps workflow with a simple Flask serving layer, and its predictions should not be read as medical guidance.

Deployment & Runtime

Docker packaged the workflow, Kubeflow Pipelines orchestrated preprocessing and training on Minikube, and the clinician-facing Flask UI was live during the project period. The UI is no longer active today, and the project should be read as a reproducible MLOps workflow rather than a clinical deployment.

Lessons Learned

I kept the synthetic-data results (59.9% accuracy, 0.50 ROC-AUC) as a contextual comparison to show how much the real-data pipeline outperformed a weaker baseline setting.

Evidence & Technical Proof

View data processing and feature selection
View training pipeline
View Kubeflow configuration
View Flask serving app
View MLflow tracking

Technologies

Python, scikit-learn, Machine Learning, Survival Prediction, Gradient Boosting, Feature Engineering, Chi-Square Feature Selection, MLflow, DAGsHub, Kubeflow Pipelines, Docker, Minikube, Flask, MLOps, Experiment Tracking, Reproducible Pipelines, Classification Metrics, ROC-AUC
