ResNet-18 Training Performance Benchmark

ResNet-18 training benchmarks across data loading, optimizers, and CPU versus GPU execution, profiled on an NVIDIA A100

I benchmarked ResNet-18 training performance on CIFAR-10 across optimizers, data loader worker counts, and CPU versus GPU execution, then profiled the GPU runs on an NVIDIA A100 with the PyTorch Profiler to see exactly where each training step spent its time.

Context: Academic Project

Role: Machine Learning Engineer

Team: Solo

Date: May 2025


NVIDIA A100-SXM4-40GB GPU profiling
7 benchmark variants plus a parameter-count check
Worker-count sweep 0 to 24, strongest range 4 to 8

Python, PyTorch, torchvision, CUDA, ResNet-18, CIFAR-10


Overview

I built a CLI-driven benchmarking suite around a ResNet-18 model I implemented from scratch in PyTorch, with seven selectable experiment variants: a baseline run, an optimized training loop, an input pipeline study that sweeps data loader worker counts from 0 to 24, a focused comparison of 1, 4, and 8 workers, a CPU versus GPU comparison, a five-optimizer comparison across SGD, SGD with Nesterov momentum, Adagrad, Adadelta, and Adam, and a batch normalization ablation that trains a second ResNet-18 variant with every BatchNorm layer removed. Every variant shares the same training loop and integrates with the PyTorch Profiler to record a timeline of CPU and GPU activity that can be opened in the Chrome trace viewer.

Problem

Model accuracy is only one axis of a training setup. I wanted to isolate the systems questions that decide how fast a model actually trains, including which optimizer, how many data loader workers, and CPU versus GPU execution, independent of accuracy tuning.

Intended User

Built for ML engineers and systems engineers tuning training pipeline throughput rather than model accuracy.

Architecture

A CLI driver dispatches to one of seven experiment scripts, each built around the same ResNet-18 architecture (four residual stages of 64, 128, 256, and 512 channels) and trained on CIFAR-10 with configurable epochs, batch size, worker count, learning rate, optimizer, and device. When profiling is enabled, the PyTorch Profiler captures CPU and CUDA activity on a wait, warmup, and active step schedule, tags the forward pass, loss calculation, backward pass, and optimizer step as named regions, and exports a timeline that can be opened in the Chrome trace viewer. I ran the GPU profiling on an NVIDIA A100-SXM4-40GB with CUDA runtime 12.4 and driver 12.6, where a single trace captured 1,089 individual GPU kernel launches.
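The profiling flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny stand-in model, the synthetic batches, the schedule values, the region names, and the `TRACE_PATH` location are all assumptions chosen so the snippet runs on CPU or GPU.

```python
import os
import tempfile

import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, record_function, schedule

# Tiny stand-in model; the real runs used a hand-written ResNet-18 on CIFAR-10.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Flatten(), nn.Linear(8 * 32 * 32, 10)
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Hypothetical output location for the exported Chrome trace.
TRACE_PATH = os.path.join(tempfile.gettempdir(), "resnet18_sketch_trace.json")

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)


def save_trace(prof):
    # Called by the profiler once the active window has finished.
    prof.export_chrome_trace(TRACE_PATH)


# Skip 1 step, warm up for 1 step, then record 2 steps (illustrative values).
with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    on_trace_ready=save_trace,
) as prof:
    for _ in range(4):
        inputs = torch.randn(16, 3, 32, 32, device=device)  # synthetic CIFAR-sized batch
        targets = torch.randint(0, 10, (16,), device=device)
        optimizer.zero_grad()
        with record_function("forward"):
            outputs = model(inputs)
        with record_function("loss"):
            loss = criterion(outputs, targets)
        with record_function("backward"):
            loss.backward()
        with record_function("optimizer_step"):
            optimizer.step()
        prof.step()  # advance the wait / warmup / active schedule
```

The wait and warmup steps keep one-time costs such as lazy initialization and cuDNN algorithm selection out of the recorded window, so the named regions reflect steady-state steps.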

My Contribution

I implemented every experiment variant solo, including the ResNet-18 model itself, the shared training loop, the CLI driver that dispatches between variants, and the PyTorch Profiler instrumentation used to capture and export each trace.

Implementation

Implemented a ResNet-18 model from scratch in PyTorch, including the residual basic blocks and the four-stage 64, 128, 256, 512 channel layout, instead of relying on a prebuilt torchvision model.
Swept data loader worker counts from 0 to 24 and found the optimal range sits between 4 and 8, shifting with system scheduling and CPU contention rather than landing on a single fixed number.
Compared five optimizers, SGD, SGD with Nesterov momentum, Adagrad, Adadelta, and Adam, and ran a focused CPU versus GPU comparison under the same training loop to isolate systems effects from modeling choices.
Built a second ResNet-18 variant with every BatchNorm layer removed to isolate its effect on training dynamics and measured accuracy.
Instrumented every run with the PyTorch Profiler, tagging the forward pass, backward pass, loss calculation, and optimizer step as named regions so the exported trace could be inspected operation by operation instead of relying on wall clock timing alone.
Profiled the GPU runs on the NVIDIA A100 and found that cuDNN selected TF32 tensor-core kernels for convolution execution, even though the training loop did not explicitly configure autocast or mixed-precision training.
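For readers unfamiliar with the layout, here is a minimal sketch of a residual basic block and the four-stage 64, 128, 256, 512 channel stack. It follows the standard ResNet-18 design of two basic blocks per stage; the strides, the 1x1 projection shortcut, and the helper name `make_stages` are assumptions, and the stem and classifier head are omitted, so this is not necessarily how the project's model is written.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """Two 3x3 convolutions with a skip connection around them."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # Identity shortcut unless the shape changes, in which case project with a 1x1 conv.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))


def make_stages(blocks_per_stage=2):
    """Stack the four residual stages; expects a 64-channel input from the stem."""
    layers, in_ch = [], 64
    for out_ch, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
        for i in range(blocks_per_stage):
            # Only the first block of a stage downsamples.
            layers.append(BasicBlock(in_ch, out_ch, stride if i == 0 else 1))
            in_ch = out_ch
    return nn.Sequential(*layers)


stages = make_stages()
features = stages(torch.randn(2, 64, 32, 32))  # 32x32 feature map, as for CIFAR-10
```

With these assumed strides, a 32x32 input leaves the last stage as a 4x4 map with 512 channels.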

Key Decisions

CLI-selectable experiment variants over separate scripts

Why: One shared training loop, with flags for optimizer, worker count, device, and other settings, kept each systems variable isolated, so a single benchmark question could be answered without touching the others.
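A dispatch layer of this kind might look like the sketch below. The flag names (`--exercise`, `--workers`, and so on), the default values, and the placeholder experiment functions are hypothetical and stand in for the real experiment code.

```python
import argparse


# Placeholder experiment entry points; the real ones would run the shared training loop.
def run_baseline(args):
    return f"baseline on {args.device}"


def run_worker_sweep(args):
    return f"worker sweep with batch size {args.batch_size}"


def run_optimizer_comparison(args):
    return f"optimizer comparison starting from {args.optimizer}"


EXPERIMENTS = {
    "baseline": run_baseline,
    "workers": run_worker_sweep,
    "optimizers": run_optimizer_comparison,
}


def build_parser():
    parser = argparse.ArgumentParser(description="ResNet-18 benchmark driver (sketch)")
    parser.add_argument("--exercise", choices=sorted(EXPERIMENTS), required=True)
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch-size", type=int, default=128)
    parser.add_argument("--workers", type=int, default=2)
    parser.add_argument("--lr", type=float, default=0.1)
    parser.add_argument(
        "--optimizer",
        choices=["sgd", "nesterov", "adagrad", "adadelta", "adam"],
        default="sgd",
    )
    parser.add_argument("--device", choices=["cpu", "cuda"], default="cpu")
    return parser


def main(argv=None):
    # Every experiment receives the same parsed settings, so one flag changes one variable.
    args = build_parser().parse_args(argv)
    return EXPERIMENTS[args.exercise](args)
```

Calling `main(["--exercise", "baseline", "--device", "cpu"])` selects the baseline variant while leaving every other setting at its default.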

PyTorch Profiler with named regions over wall-clock timing alone

Why: Tagging the forward pass, backward pass, loss calculation, and optimizer step let me see which part of a training step actually consumed GPU time, instead of guessing from a single epoch duration.

A hand-implemented ResNet-18 instead of a prebuilt model

Why: Building the residual blocks and layer layout myself meant every experiment variant, aside from the deliberate BatchNorm ablation, trained the exact same architecture, so timing and accuracy differences came from the systems change being tested, not from a different model definition.

A BatchNorm-free variant as a direct ablation

Why: Removing BatchNorm from a second copy of the model isolated its cost and benefit directly, rather than inferring its effect from unrelated runs.
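One way to build such an ablation is to make the normalization layer a constructor switch, as in this minimal sketch. The `use_bn` flag, the `ConvUnit` class, and the choice to re-enable the convolution bias when BatchNorm is removed are assumptions for illustration; the project may instead define the second model separately.

```python
import torch
import torch.nn as nn


def norm_layer(channels, use_bn):
    # nn.Identity passes activations through unchanged, removing BatchNorm cleanly.
    return nn.BatchNorm2d(channels) if use_bn else nn.Identity()


class ConvUnit(nn.Module):
    """Conv -> (optional BatchNorm) -> ReLU, the pattern repeated inside each block."""

    def __init__(self, in_channels, out_channels, use_bn=True):
        super().__init__()
        # With BatchNorm the conv bias is redundant; without it the bias is kept.
        self.conv = nn.Conv2d(in_channels, out_channels, 3, padding=1, bias=not use_bn)
        self.norm = norm_layer(out_channels, use_bn)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))


def count_batchnorm(module):
    return sum(isinstance(m, nn.BatchNorm2d) for m in module.modules())


def count_parameters(module):
    return sum(p.numel() for p in module.parameters())


with_bn = ConvUnit(3, 64, use_bn=True)
without_bn = ConvUnit(3, 64, use_bn=False)
x = torch.randn(4, 3, 32, 32)
```

Both variants accept the same input and produce the same output shape, so the training loop does not need to know which one it is running, and `count_parameters` doubles as a quick parameter-count check.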

Testing & Validation

I validated each experiment through reported loss, accuracy, and timing outputs, then inspected the exported PyTorch Profiler trace to verify CPU and CUDA activity at the operation and kernel level.
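Kernel-level inspection can also be done programmatically on the exported trace. The sketch below assumes the Chrome trace JSON layout, where complete events carry `"ph": "X"`, a `"cat"` category, a `"name"`, and a `"dur"` in microseconds. The category label `kernel` and the kernel names in the synthetic example are assumptions and may differ between PyTorch versions; the durations are made up, not measured.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Read a Chrome trace JSON file exported by the profiler."""
    with open(path) as handle:
        return json.load(handle)


def kernel_time_by_name(trace, categories=("kernel",)):
    """Sum the duration of GPU kernel events per kernel name, largest first."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") != "X":  # only complete events have a duration
            continue
        if str(event.get("cat", "")).lower() not in categories:
            continue
        totals[event["name"]] += float(event.get("dur", 0))
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Synthetic trace with made-up names and durations, for illustration only.
example_trace = {
    "traceEvents": [
        {"ph": "X", "cat": "kernel", "name": "conv_kernel", "dur": 40},
        {"ph": "X", "cat": "kernel", "name": "layout_kernel", "dur": 30},
        {"ph": "X", "cat": "kernel", "name": "layout_kernel", "dur": 25},
        {"ph": "X", "cat": "cpu_op", "name": "aten::conv2d", "dur": 100},
        {"ph": "M", "name": "process_name"},
    ]
}
ranking = kernel_time_by_name(example_trace)
```

Aggregating by name is what makes many short launches of the same kernel visible as a single large cost, which a per-launch view of the timeline can hide.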

Results

The worker-count experiments showed that the best data-loading configuration was machine-dependent, with the strongest range falling between 4 and 8 workers under the tested setup. GPU profiling revealed that cuDNN tensor-layout conversions between NCHW and NHWC consumed more aggregate kernel time than any individual convolution operation, while the convolution paths used TF32 tensor-core kernels on the A100.
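A worker-count sweep of this kind can be sketched as below. The synthetic `TensorDataset`, the batch size, and the helper names `time_loader` and `sweep_workers` are assumptions; the real sweep loaded CIFAR-10 and covered 0 to 24 workers.

```python
import time

import torch
from torch.utils.data import DataLoader, TensorDataset


def time_loader(dataset, num_workers, batch_size=128):
    """Time one full pass over the dataset with the given worker count."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    start = time.perf_counter()
    batches = 0
    for _images, _labels in loader:
        batches += 1
    return time.perf_counter() - start, batches


def sweep_workers(dataset, worker_counts):
    """Return a mapping of worker count to seconds for one pass."""
    return {n: time_loader(dataset, n)[0] for n in worker_counts}


# Synthetic CIFAR-shaped data so the sketch does not need a download.
dataset = TensorDataset(torch.randn(512, 3, 32, 32), torch.randint(0, 10, (512,)))
```

A call such as `sweep_workers(dataset, [0, 4, 8])` should sit under an `if __name__ == "__main__":` guard on platforms that start worker processes with spawn. Because the best count shifts with scheduling and CPU contention, a single pass per setting is noisy, and repeating each measurement gives a more trustworthy range.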

Reliability & Failure Handling

Every timed GPU section calls torch.cuda.synchronize() before and after measurement, ensuring that reported durations reflect completed CUDA work rather than asynchronous kernel launches still waiting in the execution queue.
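The synchronize-before-and-after pattern can be wrapped in a small context manager, sketched here. The name `cuda_timer` and the results dictionary are hypothetical; the availability check is an assumption that lets the same helper run on CPU-only machines.

```python
import time
from contextlib import contextmanager

import torch


@contextmanager
def cuda_timer(results, label):
    """Record wall-clock seconds for a block, draining queued CUDA work first and last."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish earlier kernels so they are not billed to this block
    start = time.perf_counter()
    try:
        yield
    finally:
        if torch.cuda.is_available():
            torch.cuda.synchronize()  # wait for kernels launched inside the block
        results[label] = time.perf_counter() - start


timings = {}
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
a = torch.randn(256, 256, device=device)
b = torch.randn(256, 256, device=device)
with cuda_timer(timings, "matmul"):
    c = a @ b
```

Without the second synchronize, the timer would stop as soon as the kernel was queued, which can make a GPU section look far faster than it is.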

Deployment & Runtime

Runs as local CLI scripts against CIFAR-10, with every variant selectable through the same exercise flag and runnable on CPU or GPU.

Lessons Learned

Profiling the GPU runs revealed that the largest single consumer of GPU kernel time was not the convolution math itself but the layout conversion kernels cuDNN inserts to convert between NCHW and NHWC tensor formats. The profiler trace showed that cuDNN selected TF32 tensor-core kernels for convolution execution on the A100, even though the training loop did not explicitly configure autocast or mixed-precision training.
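A likely explanation for the TF32 kernels, assuming default PyTorch settings were in effect, is that TF32 is controlled by backend flags that are independent of autocast. In the PyTorch releases I am aware of, `torch.backends.cudnn.allow_tf32` defaults to `True`, so cuDNN convolutions may use TF32 on Ampere-class GPUs such as the A100 without any mixed-precision code, while `torch.backends.cuda.matmul.allow_tf32` has defaulted to `False` since PyTorch 1.12. The helper name `tf32_settings` below is hypothetical; it only reads the flags.

```python
import torch


def tf32_settings():
    """Report the backend flags that decide whether float32 ops may use TF32."""
    return {
        # Governs cuDNN convolutions.
        "cudnn_allow_tf32": torch.backends.cudnn.allow_tf32,
        # Governs cuBLAS matrix multiplications.
        "matmul_allow_tf32": torch.backends.cuda.matmul.allow_tf32,
    }


settings = tf32_settings()
print(settings)
```

Setting `torch.backends.cudnn.allow_tf32 = False` before training is one way to check whether the TF32 kernels disappear from the trace and how much convolution time changes.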


Technologies

Python, PyTorch, torchvision, CUDA, ResNet-18, CIFAR-10, PyTorch Profiler, GPU Training, Performance Benchmarking, Data Loading Optimization, TF32 Tensor Cores
