EM · Prototype · Research
Machine Learning & Research

Electricity Market Price Prediction

Electricity price forecasting with GRU, LSTM, Random Forest, and leakage-safe time-series evaluation

I built a time-series research dashboard that forecasts next-day electricity price direction and returns from 20 lagged, rolling, volatility, momentum, and trend features. I trained GRU, LSTM, and Random Forest models on a strictly time-ordered split, evaluated each one against trivial baselines, and exposed the full evaluation and strategy analysis through Streamlit.

Context: Research Project
Role: Researcher
Team: Solo
Date: Dec 2025

I built the full dashboard and modeling stack solo, including the time-series feature-engineering pipeline, the leakage-safe time-based split, all three models, both trading strategies, and the Optuna tuning integration.

0.689 ROC-AUC for next-day direction
20 engineered lag, rolling, momentum, and volatility features
Time-ordered split with leakage-safe scaling

Python, PyTorch, scikit-learn, LSTM, GRU, Random Forest

Overview

I built a Streamlit dashboard backed by a time-series feature-engineering pipeline that produces 20 features, including lagged prices, 7-day, 14-day, and 30-day rolling statistics, momentum, and trend-slope indicators. The system trains three models (LSTM, GRU tuned with Optuna, and Random Forest), backtests two rule-based trading strategies (Percentile Channel Breakout and Break of Structure), and evaluates classification, regression, and trading metrics against zero, mean, majority, and random baselines.

Problem

Evaluating a price-prediction model on accuracy alone says little about whether it adds real value over a trading-relevant baseline. I paired time-series forecasting models directly with baseline comparisons and backtested trading strategies so every prediction could be judged against a concrete reference point instead of in isolation.

Intended User

This is a quantitative research exploration rather than a deployed trading system. I built it as a research and coursework tool for electricity-market modeling in the NYU VIP program.

Architecture

A local CSV dataset of about 8,034 daily electricity-price records from 2001 to 2022 feeds a shared feature-engineering pipeline that builds 20 features per day. These include core market series such as price, volume, natural gas price, grid load, and temperature, along with calendar features, percent-change features, lagged values for price, volume, and gas price, 7-day rolling mean and standard deviation, a 14-day price-range position, 7-day and 30-day return volatility, 3-day and 7-day momentum, and a 14-day rolling trend slope.

I split the data strictly by time, so train, validation, and test never overlap or shuffle, and I fit the scaler on the training set only before transforming validation and test, which prevents look-ahead leakage.

LSTM and GRU consume 14-day sequences built from these features, while Random Forest consumes the tabular form directly. I trained all three models independently, tuning GRU with Optuna, fed their predictions into two rule-based trading strategies, and surfaced everything through a single Streamlit dashboard with classification, regression, and trading metric panels.
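The 14-day sequence construction for the recurrent models could look roughly like the sketch below. The function name `make_sequences` and the assumption that `targets[t]` already holds the next-day label for row `t` are mine, not taken from the project code.

```python
import numpy as np


def make_sequences(features, targets, window=14):
    """Turn a (n_days, n_features) table into overlapping windows.

    Hypothetical helper: `targets[t]` is assumed to be the next-day label
    aligned with row t, so each window is paired with the target of its
    last row and never sees anything beyond that row.
    """
    features = np.asarray(features, dtype=np.float32)
    targets = np.asarray(targets)
    n_windows = len(features) - window + 1
    if n_windows <= 0:
        raise ValueError("need at least `window` rows")
    # Sample i covers rows i .. i + window - 1.
    X = np.stack([features[i:i + window] for i in range(n_windows)])
    # The label for sample i belongs to the last row of its window.
    y = targets[window - 1:]
    return X, y
```

The resulting array has shape `(n_windows, 14, n_features)`, which is the batch-first layout that PyTorch's `nn.GRU` and `nn.LSTM` accept when constructed with `batch_first=True`.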

My Contribution

I built the full dashboard and modeling stack solo, including the time-series feature-engineering pipeline, the leakage-safe time-based train, validation, and test split, all three models, both trading strategies, the Optuna tuning integration, and the 20+ diagnostic visualizations.

Implementation

  • Engineered 20 time-series features per day, including lagged price, volume, and gas-price values, 7-day rolling mean and standard deviation, a 14-day price-range position, 7-day and 30-day return volatility, 3-day and 7-day momentum, and a 14-day rolling trend slope.
  • Built a strictly time-ordered train, validation, and test split with leakage-safe scaling, fitting the scaler on the training set only and transforming validation and test without ever refitting it.
  • Implemented optional rolling-window and expanding-window split modes for comparing evaluation strategies.
  • Tuned the GRU model with Optuna rather than fixed hyperparameters.
  • Implemented two distinct rule-based trading strategies (Percentile Channel Breakout and Break of Structure) on top of the model outputs.
  • Built 20+ diagnostic visualizations, including ROC and PR curves, a confusion matrix, a calibration curve, a Q-Q plot, rolling RMSE, and residual analysis.
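The rolling-window and expanding-window split modes mentioned above can be sketched as a small index generator. This is an illustrative version under my own assumptions (the name `walk_forward_splits` and its parameters are hypothetical), not the project's implementation.

```python
def walk_forward_splits(n_samples, initial_train, test_size, mode="expanding"):
    """Yield (train_range, test_range) pairs in strict time order.

    expanding: the training window always starts at index 0 and grows.
    rolling:   the training window keeps a fixed length of `initial_train`.
    """
    if mode not in ("expanding", "rolling"):
        raise ValueError("mode must be 'expanding' or 'rolling'")
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_start = 0 if mode == "expanding" else train_end - initial_train
        # The test block always begins right where training ends.
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        train_end += test_size
```

For example, `list(walk_forward_splits(10, 4, 2, mode="rolling"))` produces three folds whose training windows are rows 0-3, 2-5, and 4-7, each followed by a two-row test block.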

Key Decisions

Time-ordered train, validation, and test split with leakage-safe scaling

Why — Electricity prices are sequential, so a shuffled split would let the model see future information during training. Splitting strictly by time and fitting the scaler on the training set only kept the evaluation honest.
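A minimal sketch of this split-then-scale order, assuming scikit-learn's `StandardScaler` and illustrative 70/15/15 proportions (the actual scaler type and split fractions are not stated here):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def time_split_and_scale(X, train_frac=0.70, val_frac=0.15):
    """Split rows by position (oldest first) and scale without leakage.

    The fractions are assumed values for illustration only.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    X_train, X_val, X_test = X[:train_end], X[train_end:val_end], X[val_end:]

    # Fit on the training rows only, so validation and test statistics
    # never influence the scaling parameters.
    scaler = StandardScaler().fit(X_train)
    return (
        scaler.transform(X_train),
        scaler.transform(X_val),
        scaler.transform(X_test),
        scaler,
    )
```

Because the scaler is fitted once and only applied afterwards, validation and test values can fall well outside the training range, which is the expected behavior rather than a bug.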

Lag, rolling, and momentum features over the raw series alone

Why — Lagged values, rolling means and volatility, and momentum indicators gave the models recent market context, including autocorrelation, short-term trend, and volatility regime, that the raw price series alone does not expose.
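As a rough illustration, several of these features can be derived from a single price column with pandas. The column names, the exact formulas (for example, momentum as a percent change), and the helper name `add_price_features` are assumptions for this sketch, not the project's definitions.

```python
import numpy as np
import pandas as pd


def add_price_features(df):
    """Add lag, rolling, volatility, momentum, and trend features.

    Every feature at row t uses data up to and including day t only;
    the targets are the *next* day's return and direction.
    """
    out = df.copy()
    price = out["price"]
    ret = price.pct_change()

    out["price_lag_1"] = price.shift(1)
    out["roll_mean_7"] = price.rolling(7).mean()
    out["roll_std_7"] = price.rolling(7).std()

    # Position of today's price inside the 14-day low-high range
    # (undefined when the range is flat).
    low, high = price.rolling(14).min(), price.rolling(14).max()
    out["range_pos_14"] = (price - low) / (high - low)

    out["vol_7"] = ret.rolling(7).std()
    out["vol_30"] = ret.rolling(30).std()
    out["mom_3"] = price.pct_change(3)
    out["mom_7"] = price.pct_change(7)

    # Slope of a least-squares line fitted to the last 14 prices.
    x = np.arange(14)
    out["slope_14"] = price.rolling(14).apply(
        lambda w: np.polyfit(x, w, 1)[0], raw=True
    )

    out["target_return"] = ret.shift(-1)
    out["target_up"] = (out["target_return"] > 0).astype(int)
    # Drop warm-up rows and the final row, which has no next-day target.
    return out.dropna()
```

The longest window (30 days of returns) determines how many warm-up rows are lost at the start of the series.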

Pair prediction models directly with backtested trading strategies

Why — A price-prediction model evaluated in isolation said little about whether it was useful for trading decisions.
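The exact rules of the two strategies are not spelled out here, so the following is only one plausible reading of a percentile channel breakout: go long when a series rises above a high percentile of its recent history and short when it falls below a low percentile. The window length, the percentiles, and the function name are all assumptions.

```python
import pandas as pd


def percentile_channel_signal(series, window=20, lower_q=0.10, upper_q=0.90):
    """Return +1 (long), -1 (short), or 0 (flat) for each day.

    Illustrative rule only. The channel is built from the *previous*
    `window` days via shift(1), so today's value never defines its own band.
    """
    prior = series.shift(1).rolling(window)
    upper = prior.quantile(upper_q)
    lower = prior.quantile(lower_q)

    signal = pd.Series(0, index=series.index)
    signal[series > upper] = 1    # breakout above the channel
    signal[series < lower] = -1   # breakdown below the channel
    return signal
```

The input could be a price series or a model's predicted series. During the warm-up period the bands are undefined, and the signal stays flat.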

Benchmark every model against zero, mean, majority, and random baselines

Why — Accuracy or RMSE alone can look strong without context, and comparing every model against a trivial baseline showed whether the learned signal actually beat the simplest possible prediction.
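A compact sketch of how such baselines can be scored with scikit-learn follows. It assumes PR-AUC is computed as average precision and that the random baseline draws uniform scores, neither of which is confirmed by the text. The helper names are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    mean_squared_error,
    roc_auc_score,
)


def classification_baselines(y_train, y_test, seed=0):
    """Score majority-class and random-score baselines on the test labels."""
    rng = np.random.default_rng(seed)
    majority = float(np.mean(y_train) >= 0.5)
    scores = {
        "majority": np.full(len(y_test), majority),  # constant prediction
        "random": rng.random(len(y_test)),           # uniform scores in [0, 1)
    }
    return {
        name: {
            "accuracy": accuracy_score(y_test, (s >= 0.5).astype(int)),
            "roc_auc": roc_auc_score(y_test, s),
            "pr_auc": average_precision_score(y_test, s),
        }
        for name, s in scores.items()
    }


def regression_baselines(r_train, r_test):
    """RMSE of always predicting zero and of predicting the training mean."""
    preds = {
        "zero": np.zeros(len(r_test)),
        "mean": np.full(len(r_test), np.mean(r_train)),
    }
    return {
        name: float(np.sqrt(mean_squared_error(r_test, p)))
        for name, p in preds.items()
    }
```

A constant score cannot rank anything, so the majority baseline's ROC-AUC is 0.5 and its average precision equals the share of positive days in the test set.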

Optuna for GRU hyperparameter tuning

Why — Automated the search rather than relying on fixed hyperparameters.

Testing & Validation

I validated the system through the time-ordered train, validation, and test split, baseline comparisons (zero and mean for regression, majority and random for classification), and 20+ diagnostic visualizations, including ROC and PR curves, confusion matrices, calibration curves, and residual analysis, covering both the classification and regression tasks.

Results

The GRU did not beat the majority baseline on raw accuracy or the zero and mean baselines on regression error, but it showed a clear strength elsewhere. Its 0.689 ROC-AUC and 0.477 PR-AUC indicate a substantially stronger ranking signal for next-day direction than either classification baseline, which scored about 0.50 ROC-AUC and 0.321 PR-AUC. The comparison made the broader point clear: time-series models need baseline-aware, threshold-independent metrics, not just accuracy or RMSE, to reveal whether they actually add value.

Reliability & Failure Handling

The project changelog documents two correctness fixes made during development: a classification-metrics calculation bug and a BrokenPipeError. Both were addressed before the benchmark results above were produced.

Deployment & Runtime

The dashboard runs through Streamlit against the local CSV dataset, with model training, baseline comparison, backtesting, and visualization all available interactively in one application.

Lessons Learned

  • The full benchmark told a clear story. On next-day direction, the GRU reached 62.4% accuracy, 0.689 ROC-AUC, and 0.477 PR-AUC. The majority baseline scored 67.9% accuracy, 0.500 ROC-AUC, and 0.321 PR-AUC, and the random baseline scored 48.9% accuracy, 0.505 ROC-AUC, and 0.321 PR-AUC. On next-day return regression, the GRU posted 0.171 RMSE, 0.109 MAE, and −0.126 R², while the zero and mean baselines both landed near 0.161 RMSE. The GRU therefore lost to the majority baseline on accuracy and to both regression baselines on RMSE, and its only clear win was in ranking quality (ROC-AUC and PR-AUC).
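The regression figures are internally consistent, which a short arithmetic check can show. It assumes R² follows the usual definition, 1 − SS_res / SS_tot, with the test-set mean as the reference. The mean baseline may use the training mean instead, so the match is only approximate.

```python
import math

gru_rmse = 0.171  # reported GRU RMSE
gru_r2 = -0.126   # reported GRU R²

# R² = 1 - (RMSE_model / RMSE_mean)², so the mean predictor's RMSE is
# RMSE_model / sqrt(1 - R²).
implied_mean_rmse = gru_rmse / math.sqrt(1 - gru_r2)
print(round(implied_mean_rmse, 3))  # 0.161
```

A negative R² therefore says the same thing as the RMSE comparison: on this test set the GRU's return forecasts were slightly worse than predicting a constant.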

Technologies

Python, PyTorch, scikit-learn, LSTM, GRU, Random Forest, Time-Series Forecasting, Feature Engineering, Regression, Optuna, Streamlit, Pandas, Plotly