DyT vs. LayerNorm in LLM Fine-Tuning
A transformer-normalization ablation study across DistilGPT-2 and Pythia, fine-tuned with LoRA
I implemented DyT (Dynamic Tanh) substitutions in DistilGPT-2 and Pythia 17M/410M, then fine-tuned the variants on Alpaca, ShareGPT, and RE-WILD with LoRA via Hugging Face PEFT, training under 1% of parameters. For frozen, selectively unfrozen, and full-SFT setups, I measured validation loss, inference time, and judged output quality on MT-Bench prompts.
Co-built with one collaborator. The work covered the DyT layer swaps, the LoRA fine-tuning scripts, the evaluation pipeline, and the plots used to compare DyT and LayerNorm.
Overview
The project consists of research pipelines that swap every LayerNorm in DistilGPT-2 and Pythia 17M/410M for Dynamic Tanh (DyT), fine-tune with LoRA via Hugging Face PEFT under three strategies while training under 1% of total parameters, and evaluate on Alpaca, ShareGPT, and RE-WILD using validation loss, inference time, and judged output quality on MT-Bench prompts.
Experimental Setup
For each model, I replaced LayerNorm layers with Dynamic Tanh, attached LoRA adapters via Hugging Face PEFT so only about 0.2% of parameters stayed trainable, and ran frozen, selectively unfrozen, and full-SFT experiments on a single A100 GPU. The pipeline logged training and validation loss, gradient norm, token-level accuracy, prompt outputs, and inference time for DyT and LayerNorm baselines across Alpaca, ShareGPT, and RE-WILD, then judged a sample of generated outputs head to head with a PairRM preference model.
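To make the "about 0.2% trainable" figure concrete, here is a rough back-of-the-envelope sketch of how a LoRA trainable fraction can be estimated. The rank (8) and the choice to adapt only the fused attention projection are assumptions for illustration; the write-up does not state the actual LoRA configuration. The 768-wide hidden size and 6 blocks are the commonly cited DistilGPT-2 dimensions, and 80M is the parameter count quoted in this write-up.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one d_in -> d_out linear layer.

    LoRA learns two low-rank factors: A with shape (rank, d_in) and
    B with shape (d_out, rank), so the count is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


def trainable_fraction(base_params: int, adapted_layers, rank: int) -> float:
    """Share of all parameters that are trainable when only LoRA factors train."""
    added = sum(lora_params(d_in, d_out, rank) for d_in, d_out in adapted_layers)
    return added / (base_params + added)


# Hypothetical configuration (not taken from the write-up):
# rank 8, adapters only on the fused QKV projection of each block.
hidden = 768                                  # DistilGPT-2 hidden size
adapted = [(hidden, 3 * hidden)] * 6          # one 768 -> 2304 projection per block
frac = trainable_fraction(80_000_000, adapted, rank=8)
print(f"{frac:.2%}")                          # prints 0.18%
```

Under these assumed settings the estimate lands close to the reported figure, but a different rank or set of target modules would shift it.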
What I Built
Co-built with one collaborator. The work covered the DyT layer swaps, the LoRA training scripts, reformatting the RE-WILD dataset after the source Hugging Face files came through corrupted, the evaluation runs, and the plotting workflow used to compare DyT and LayerNorm.
- Swapped every LayerNorm layer in DistilGPT-2 (80M), Pythia 17M, and Pythia 410M for Dynamic Tanh, DyT(x) = tanh(αx), where α is a learnable scalar.
- Attached LoRA adapters through Hugging Face PEFT so only about 0.2% of parameters remained trainable, then compared frozen DyT, selectively unfrozen DyT layers, and full supervised fine-tuning across Alpaca, ShareGPT, and RE-WILD.
- Reformatted the RE-WILD dataset by hand after the Hugging Face source JSON came through corrupted, building a usable fallback pipeline for training.
- Logged training and validation loss, gradient norm, and token-level accuracy for every configuration rather than only the best result, then timed inference on MT-Bench prompts and had a PairRM model judge DyT and LayerNorm outputs head to head.
- Produced the benchmark plots and result artifacts used to interpret the quality-versus-efficiency trade-off.
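The layer swap in the first bullet can be sketched in PyTorch as follows. This is a minimal illustration rather than the project's actual code: the class name `DyT`, the helper `swap_layernorm_for_dyt`, and the initial value `alpha_init=0.5` are assumptions, and the module implements only the tanh(αx) form stated above, without any extra affine scale or shift.

```python
import torch
from torch import nn


class DyT(nn.Module):
    """Dynamic Tanh: elementwise tanh(alpha * x) with one learnable scalar alpha."""

    def __init__(self, alpha_init: float = 0.5):  # alpha_init is an assumed default
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(float(alpha_init)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x)


def swap_layernorm_for_dyt(module: nn.Module, alpha_init: float = 0.5) -> int:
    """Recursively replace every nn.LayerNorm child with DyT, in place.

    Returns the number of layers swapped so the caller can sanity-check it.
    """
    swapped = 0
    # Snapshot the children so replacing entries cannot disturb iteration.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(alpha_init))
            swapped += 1
        else:
            swapped += swap_layernorm_for_dyt(child, alpha_init)
    return swapped
```

Calling `swap_layernorm_for_dyt(model)` on a loaded model and comparing the returned count against the expected number of normalization layers is a cheap check that nothing was missed. Note that the pretrained LayerNorm weights are discarded by this swap, which is one reason the fine-tuning strategy matters afterwards.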
Research Decisions
Compare DyT against LayerNorm baselines
Why — LayerNorm is the reference point for transformer stability, so comparing DyT directly against it showed whether the replacement changed quality or speed in the tested fine-tuning setups.
Run frozen, selective-unfreeze, and full-SFT strategies
Why — The three strategies isolated whether trainability choices, rather than normalization alone, explained DyT’s behavior during post-training, and showed the gap to LayerNorm narrowing as model scale and fine-tuning depth increased.
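One possible reading of the three strategies, expressed as which parameters receive gradients, is sketched below. The write-up does not spell out the exact rules, so the mapping here is an assumption: "frozen" trains only the LoRA factors, "selective" additionally unfreezes the DyT parameters, and "full_sft" trains everything. Parameters are matched by name, and both the `.alpha` suffix for DyT and the stub parameter names are hypothetical.

```python
from types import SimpleNamespace

STRATEGIES = {"frozen", "selective", "full_sft"}


def apply_strategy(named_params, strategy: str) -> list:
    """Set requires_grad per strategy and return the names left trainable.

    named_params is any iterable of (name, param) pairs whose params expose a
    requires_grad attribute, for example model.named_parameters() in PyTorch.
    """
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    trainable = []
    for name, param in named_params:
        is_lora = "lora_" in name          # PEFT names LoRA factors lora_A / lora_B
        is_dyt = name.endswith(".alpha")   # hypothetical naming for the DyT scalar
        if strategy == "full_sft":
            param.requires_grad = True
        elif strategy == "selective":
            param.requires_grad = is_lora or is_dyt
        else:  # "frozen": DyT and base weights stay fixed, only adapters train
            param.requires_grad = is_lora
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters so the sketch runs without loading a model.
params = [
    ("h.0.attn.c_attn.weight", SimpleNamespace(requires_grad=True)),
    ("h.0.attn.c_attn.lora_A.default.weight", SimpleNamespace(requires_grad=True)),
    ("h.0.ln_1.alpha", SimpleNamespace(requires_grad=True)),
]
print(apply_strategy(params, "selective"))
```

Keeping the rule in one function makes it easier to confirm that a difference between runs comes from the trainability choice and not from an accidental change elsewhere in the script.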
Evaluate across Alpaca, ShareGPT, and RE-WILD
Why — Using three datasets tested the ablation across instruction tuning, conversational data, and open-ended QA rather than a single narrow workload.
Judge generated outputs with a PairRM preference model
Why — Validation loss alone could not show whether DyT’s outputs were still usable, so a PairRM model judged DyT and LayerNorm completions on MT-Bench prompts head to head.
Trade-off — A single automated judge model is faster than human evaluation but is itself an approximation of human preference.
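The head-to-head judging step reduces to counting, per prompt, which of two completions the judge prefers. The sketch below shows that bookkeeping only. The `prefers_a` callable is a hypothetical stand-in for a wrapper around the PairRM model, and `toy_judge` is a placeholder so the example runs; neither reflects how PairRM actually scores text.

```python
from typing import Callable, Sequence, Tuple


def tally_pairwise(
    prompts: Sequence[str],
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    prefers_a: Callable[[str, str, str], bool],
) -> Tuple[int, int]:
    """Count how often the judge prefers model A versus model B.

    prefers_a(prompt, a, b) must return True when completion a is preferred.
    """
    if not (len(prompts) == len(outputs_a) == len(outputs_b)):
        raise ValueError("prompts and both output lists must have equal length")
    wins_a = sum(
        1 for p, a, b in zip(prompts, outputs_a, outputs_b) if prefers_a(p, a, b)
    )
    return wins_a, len(prompts) - wins_a


def toy_judge(prompt: str, a: str, b: str) -> bool:
    """Placeholder judge (NOT PairRM): prefers the longer completion."""
    return len(a) >= len(b)


wins_a, wins_b = tally_pairwise(
    ["p1", "p2", "p3"],
    ["long answer", "x", "same"],
    ["short", "longer", "same"],
    toy_judge,
)
print(wins_a, wins_b)  # prints 2 1
```

Because pairwise judges can be sensitive to the order in which the two completions are presented, one common safeguard is to judge each pair in both orders and only count consistent verdicts; whether that was done here is not stated.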
Manually reformat the RE-WILD dataset
Why — The HuggingFace RE-WILD JSON source files were corrupted, so a manual reformatting step was required before any model could train on the dataset at all.
Results & Evaluation
DyT’s own validation loss dropped by more than 87% over the course of training relative to its starting point, confirming it could still learn under post-training. Even so, in the matched comparison where the gap was narrowest, at the larger Pythia 410M scale under full fine-tuning, DyT’s converged loss still trailed LayerNorm’s by about 21-28%, and the gap was far larger under frozen or selectively unfrozen LoRA at smaller scale. Inference timing on MT-Bench prompts showed only about a 0.5% speedup for DyT, and a PairRM judge preferred LayerNorm’s outputs on 67 of 81 paired completions (about 83%) against 14 for DyT (about 17%).
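The percentages in this paragraph are simple ratios. The sketch below spells out one plausible set of definitions, assuming the "87% drop" is measured against DyT's starting loss and the "21-28% gap" is measured against LayerNorm's converged loss. The underlying loss values are not given in the write-up, so only the preference counts are plugged in.

```python
def relative_drop(start: float, end: float) -> float:
    """Fractional decrease from start to end, e.g. a loss falling during training."""
    return (start - end) / start


def relative_gap(candidate: float, baseline: float) -> float:
    """How far the candidate's loss sits above the baseline's, as a fraction."""
    return (candidate - baseline) / baseline


def share(wins: int, total: int) -> float:
    """Fraction of paired comparisons won."""
    return wins / total


# Preference counts reported above: 67 for LayerNorm, 14 for DyT, 81 pairs total.
print(f"{share(67, 81):.1%}")  # prints 82.7%
print(f"{share(14, 81):.1%}")  # prints 17.3%
```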
Validation used the standard research loop: training and validation loss curves, gradient norm, token-level accuracy, inference-time measurements on MT-Bench prompts, and a PairRM preference judge comparing DyT and LayerNorm completions, all logged and plotted across the tested models and datasets on a single A100 GPU.
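A speed difference as small as 0.5% is easy to lose in measurement noise, so the timing method matters. The following is a generic sketch of a latency harness, not the project's actual measurement code: the function name, the warmup and repeat counts, and the use of the median are all assumptions.

```python
import statistics
import time


def median_latency(fn, repeats: int = 20, warmup: int = 3, sync=None) -> float:
    """Median wall-clock seconds per call of fn().

    warmup calls are discarded so one-time costs (caching, kernel compilation)
    do not skew the result. sync, if given, is called around each timed call;
    on a GPU this should block until queued work has finished.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # make sure asynchronous work is included in the timing
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

For a PyTorch model on CUDA, `fn` could be a closure that runs generation on one prompt and `sync` could be `torch.cuda.synchronize`, since GPU kernels launch asynchronously. Running the DyT and LayerNorm variants alternately with the same prompts helps keep thermal and scheduling drift from favoring either one.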