Colorectal Cancer Survival Prediction
End-to-end MLOps pipeline for 5-year survival classification
Built a reproducible MLOps pipeline for public healthcare data with Gradient Boosting, MLflow, and Kubeflow.
I built a reproducible machine-learning workflow on 167,497 public clinical records. It preprocesses the data, uses chi-square scoring to select the five most predictive features (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate), trains a Gradient Boosting classifier, and serves predictions through a Flask UI that was live during the project period.
Built the full pipeline solo: preprocessing, feature selection, MLflow-tracked training, DAGsHub mirroring, the Kubeflow pipeline definition, Docker packaging, and the Flask prediction UI.
Overview
The pipeline starts with 167,497 clinical records and 28 input features, uses chi-square feature selection to keep 5 of them (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate), and trains a Gradient Boosting classifier on the selected inputs. MLflow logs parameters and metrics locally, DAGsHub mirrors the runs remotely, and Kubeflow Pipelines compiles preprocessing and training into a two-step containerized DAG that ran on Minikube. A Flask UI served predictions from the trained model during the project period.
Problem
Public healthcare data becomes more useful when preprocessing, feature selection, training, and serving all run through a reproducible MLOps workflow rather than a one-off notebook.
Intended User
Built for ML and MLOps practitioners exploring reproducible survival-prediction workflows on public healthcare data.
Architecture
I preprocessed the 167,497-record dataset, applied chi-square feature selection to reduce 28 inputs to the 5 most predictive features (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate), and trained a Gradient Boosting classifier (100 estimators, a learning rate of 0.1, and a max depth of 3) on a stratified train-test split. Training runs, parameters, and evaluation metrics, including accuracy, ROC-AUC, precision, recall, and F1, were tracked in MLflow and mirrored to DAGsHub. A two-step Kubeflow Pipelines DAG, compiled to an Argo workflow and run in Docker on Minikube, orchestrated preprocessing and training, and a Flask model-serving UI handled the final prediction step.
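The selection and training steps described above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes scikit-learn (the write-up does not name the library), uses synthetic stand-in data from `make_classification` instead of the clinical records, and adds a `MinMaxScaler` step as an assumption because chi-square scoring needs non-negative inputs. Only the hyperparameters, the 28-to-5 reduction, and the stratified 80/20 split come from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data with 28 features and a binary label.
# This is NOT the project's dataset; it only makes the sketch runnable.
X, y = make_classification(
    n_samples=1000, n_features=28, n_informative=6, random_state=0
)

# Stratified 80/20 split keeps the label balance the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipeline = Pipeline([
    # Assumption: rescale to [0, 1] because chi2 rejects negative values.
    ("scale", MinMaxScaler()),
    # Keep the 5 features with the highest chi-square scores.
    ("select", SelectKBest(score_func=chi2, k=5)),
    # Hyperparameters as listed in the write-up.
    ("model", GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
    )),
])
pipeline.fit(X_train, y_train)

pred = pipeline.predict(X_test)
proba = pipeline.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
selected = pipeline.named_steps["select"].get_support(indices=True)
print("selected feature indices:", list(selected))
print("metrics:", metrics)
```

Wrapping the scaler, selector, and classifier in one `Pipeline` means the selection is fitted on the training split only, so the held-out scores are not influenced by test-set statistics.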
My Contribution
Built the full pipeline solo: preprocessing, chi-square feature selection, Gradient Boosting training, MLflow tracking, DAGsHub mirroring, Kubeflow orchestration, Docker packaging, and the Flask prediction UI.
Implementation
- Reduced the feature space by 82% (28 → 5 features) using chi-square feature selection, keeping healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate as the most predictive inputs.
- Trained a Gradient Boosting classifier (100 estimators, learning rate 0.1, max depth 3) on a stratified train-test split, then logged accuracy, ROC-AUC, precision, recall, and F1 to MLflow.
- Logged every training run to MLflow locally and mirrored it to a public DAGsHub server for remote experiment tracking.
- Compiled the preprocessing-and-training DAG into a two-step Kubeflow Pipelines YAML spec and ran it on Minikube for containerized orchestration.
- Packaged the serving path with Docker and delivered the clinician-facing Flask UI during the project period.
Key Decisions
Feature selection before training
Why — Chi-square selection ran before training and reduced 28 clinical features to the 5 most predictive ones (healthcare costs, tumor size, treatment type, diabetes status, and a population mortality rate). That 82% reduction cut dimensionality without hand-picking features.
Trade-off — Aggressive feature reduction risks discarding signal that a less aggressive selection threshold might have kept.
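To make the scoring idea concrete, here is a small sketch of the textbook Pearson chi-square statistic on a contingency table of feature category against outcome. The counts are hypothetical, and the library used in the project may compute the score differently (for example, from per-class sums of feature values). The point is only that a feature whose categories split the outcome unevenly scores high, while one that splits it evenly scores zero.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table given as rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if feature and outcome were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are feature categories,
# columns are (survived, did not survive).
informative = [[80, 20], [30, 70]]
uninformative = [[50, 50], [50, 50]]

print(round(chi_square_statistic(informative), 3))
print(chi_square_statistic(uninformative))
```

Ranking features by this score and keeping the top five is what a top-k selector does. The trade-off noted above follows directly: a feature ranked sixth is dropped even if its score is close to the fifth.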
MLflow for experiment tracking
Why — Logged every training run's parameters and metrics locally and mirrored them to a public DAGsHub dashboard for remote visibility.
Kubeflow for workflow orchestration
Why — Containerized and orchestrated the preprocessing-and-training DAG on Minikube rather than running scripts ad hoc.
Trade-off — Adds Kubernetes/Minikube operational overhead compared to a simpler script-based pipeline.
Flask for model serving
Why — Flask provided a lightweight serving layer for the finished prediction UI during the project period.
Trade-off — The serving layer is intentionally simple and better suited to a prototype than a production healthcare system.
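A lightweight serving layer of this kind might look like the sketch below. Everything specific in it is an assumption for illustration: the `/predict` route, the JSON field names, and the `placeholder_model` function are hypothetical, and the project's UI may have used HTML forms rather than a JSON endpoint.

```python
from flask import Flask, jsonify, request

# Hypothetical field names for the five selected features.
FEATURES = [
    "healthcare_costs",
    "tumor_size",
    "treatment_type",
    "diabetes",
    "mortality_rate",
]


def create_app(predict_fn):
    """Build a Flask app around any callable that maps a feature row to a score."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            # Reject incomplete requests instead of guessing values.
            return jsonify({"error": "missing features", "fields": missing}), 400
        row = [payload[name] for name in FEATURES]
        return jsonify({"survival_probability": float(predict_fn(row))})

    return app


def placeholder_model(row):
    """Stand-in for the trained classifier; always returns 0.5."""
    return 0.5


app = create_app(placeholder_model)
```

Passing the predictor into `create_app` keeps the route logic testable without loading a real model, which suits a prototype but leaves out the authentication, input validation, and auditing that a production healthcare system would need.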
A stratified train-test split for evaluation
Why — Stratifying the 80/20 split on the survival label kept the class balance consistent between training and evaluation, so accuracy and ROC-AUC reflected the same outcome distribution the model was trained on.
Trade-off — A single stratified split is simpler than cross-validation, but it gives one evaluation snapshot rather than a distribution of scores across folds.
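The effect of stratifying can be shown with a small standard-library sketch. The label counts below are hypothetical (the write-up does not state the class balance), and `stratified_split` is an illustrative helper, not the function used in the project.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced labels: 700 positives, 300 negatives.
labels = [1] * 700 + [0] * 300
train_idx, test_idx = stratified_split(labels)


def positive_rate(indices):
    """Share of positive labels among the given row indices."""
    return sum(labels[i] for i in indices) / len(indices)


print(len(train_idx), len(test_idx))
print(positive_rate(train_idx), positive_rate(test_idx))
```

Because each class is split separately, both partitions keep the same positive rate. A plain random split only matches it on average.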
Testing & Validation
Validation used a stratified train-test split, with accuracy, ROC-AUC, precision, recall, and F1 tracked through MLflow and mirrored to DAGsHub on every run.
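For reference, the five tracked metrics can be computed from labels, hard predictions, and scores as in the sketch below. It uses only the standard library and a tiny hypothetical example. The project most likely relied on library implementations, so treat this as a definition check rather than the actual evaluation code.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the decision threshold, while ROC-AUC depends only on how the scores rank the cases. A model that gives every case the same score gets a ROC-AUC of exactly 0.5.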
Results
Reduced the feature space by 82% (28 → 5 features) via chi-square selection and reached 92.9% accuracy with 0.89 ROC-AUC on real public clinical data. For context, a comparison run on synthetic data reached only 59.9% accuracy with 0.50 ROC-AUC, which is chance-level ranking.
Reliability & Failure Handling
Not a clinical system. The project is a public-data MLOps workflow with a simple Flask serving layer, and its predictions should not be read as medical guidance.
Deployment & Runtime
Docker packaged the workflow, Kubeflow Pipelines orchestrated preprocessing and training on Minikube, and the clinician-facing Flask UI was live during the project period. The UI is no longer active today, and the project should be read as a reproducible MLOps workflow rather than a clinical deployment.
Lessons Learned
- I kept the synthetic-data results (59.9% accuracy, 0.50 ROC-AUC) as a contextual comparison to show how far the real-data pipeline (92.9% accuracy, 0.89 ROC-AUC) outperformed a weaker baseline.