DyT vs. LayerNorm in LLM Fine-Tuning

A transformer-normalization ablation study across DistilGPT-2 and Pythia, fine-tuned with LoRA

I implemented DyT (Dynamic Tanh) substitutions in DistilGPT-2 and Pythia 17M/410M, then fine-tuned the variants with LoRA via Hugging Face PEFT (training under 1% of parameters) on Alpaca, ShareGPT, and RE-WILD. I measured validation loss, inference time, and output quality on MT-Bench prompts across three setups: frozen, selectively unfrozen, and full SFT.

Context: Research Project

Role: Researcher

Team: 2 collaborators

Date: May 2025

Co-built with one collaborator: implemented the DyT layer swaps, the LoRA fine-tuning scripts, the evaluation pipeline, and the plots used to compare DyT and LayerNorm.

DistilGPT-2 + Pythia 17M/410M | LoRA / PEFT (~0.2% trainable) | Alpaca + ShareGPT + RE-WILD

PyTorch | Hugging Face | PEFT | LoRA | Dynamic Tanh | LayerNorm


Overview

The project consists of research pipelines that swap every LayerNorm in DistilGPT-2 and Pythia 17M/410M for Dynamic Tanh (DyT), fine-tune with LoRA via Hugging Face PEFT under three strategies while training under 1% of total parameters, and evaluate on Alpaca, ShareGPT, and RE-WILD using validation loss, inference time, and output quality on MT-Bench prompts.

Experimental Setup

For each model, I replaced LayerNorm layers with Dynamic Tanh, attached LoRA adapters via Hugging Face PEFT so only about 0.2% of parameters stayed trainable, and ran frozen, selectively unfrozen, and full-SFT experiments on a single A100 GPU. The pipeline logged training and validation loss, gradient norm, token-level accuracy, prompt outputs, and inference time for DyT and LayerNorm baselines across Alpaca, ShareGPT, and RE-WILD, then judged a sample of generated outputs head to head with a PairRM preference model.

What I Built

Co-built with one collaborator: implemented the DyT layer swaps, the LoRA training scripts, the RE-WILD dataset reformatting (needed because the source Hugging Face files came through corrupted), the evaluation runs, and the plotting workflow used to compare DyT and LayerNorm.

Swapped every LayerNorm layer in DistilGPT-2 (80M), Pythia 17M, and Pythia 410M for Dynamic Tanh, DyT(x) = tanh(αx).
Attached LoRA adapters through Hugging Face PEFT so only about 0.2% of parameters remained trainable, then compared frozen DyT, selectively unfrozen DyT layers, and full supervised fine-tuning across Alpaca, ShareGPT, and RE-WILD.
Reformatted the RE-WILD dataset by hand after the Hugging Face source JSON came through corrupted, building a usable fallback pipeline for training.
Logged training and validation loss, gradient norm, and token-level accuracy for every configuration rather than only the best result, then timed inference on MT-Bench prompts and had a PairRM model judge DyT and LayerNorm outputs head to head.
Produced the benchmark plots and result artifacts used to interpret the quality-versus-efficiency trade-off.
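
The layer swap described in the list above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's actual code: the class name `DyT`, the helper `swap_layernorm_for_dyt`, the initial value of α (0.5), and the toy model are all assumptions. It follows the form stated above, DyT(x) = tanh(αx) with a single learnable scalar α; the DyT layer as originally proposed also adds a learnable per-channel scale and shift, which are omitted here to match that formula.

```python
import torch
import torch.nn as nn


class DyT(nn.Module):
    """Dynamic Tanh: element-wise tanh(alpha * x) with a learnable scalar alpha."""

    def __init__(self, alpha_init: float = 0.5):
        super().__init__()
        # alpha_init = 0.5 is an assumption, not a value taken from this project.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.alpha * x)


def swap_layernorm_for_dyt(module: nn.Module, alpha_init: float = 0.5) -> int:
    """Recursively replace every nn.LayerNorm child with DyT; return the swap count."""
    swapped = 0
    # list(...) so the child mapping is not mutated while it is being iterated.
    for name, child in list(module.named_children()):
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(alpha_init))
            swapped += 1
        else:
            swapped += swap_layernorm_for_dyt(child, alpha_init)
    return swapped


# Toy stand-in for a transformer block (hypothetical, for illustration only).
toy = nn.Sequential(
    nn.LayerNorm(16),
    nn.Linear(16, 16),
    nn.Sequential(nn.LayerNorm(16), nn.Linear(16, 16)),
)
num_swapped = swap_layernorm_for_dyt(toy)
out = toy(torch.randn(2, 16))
```

On a loaded Hugging Face model, the same helper would presumably be called on the model object before the LoRA adapters are attached, so that the adapters wrap the already-modified network.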

Research Decisions

Compare DyT against LayerNorm baselines

Why — LayerNorm is the reference point for transformer stability, so comparing DyT directly against it showed whether the replacement changed quality or speed in the tested fine-tuning setups.

Run frozen, selective-unfreeze, and full-SFT strategies

Why — The three strategies isolated whether trainability choices, rather than normalization alone, explained DyT’s behavior during post-training, and showed the gap to LayerNorm narrowing as model scale and fine-tuning depth increased.
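
One way to read the three strategies is as three different `requires_grad` masks over the same model. The sketch below is an interpretation, not the project's training script: it assumes "frozen" means only the adapter weights train, "selective" additionally unfreezes the DyT α parameters, and "full" trains everything. The name filters (`"lora_"` for adapter weights, a trailing `alpha` for DyT) and the `ToyBlock` module are hypothetical.

```python
import torch
import torch.nn as nn


def apply_strategy(model: nn.Module, strategy: str) -> float:
    """Set requires_grad according to the strategy; return the trainable fraction."""
    if strategy not in {"frozen", "selective", "full"}:
        raise ValueError(f"unknown strategy: {strategy!r}")
    for name, param in model.named_parameters():
        is_adapter = "lora_" in name      # assumed adapter naming convention
        is_dyt = name.endswith("alpha")   # assumed DyT parameter name
        if strategy == "frozen":
            param.requires_grad = is_adapter
        elif strategy == "selective":
            param.requires_grad = is_adapter or is_dyt
        else:  # "full"
            param.requires_grad = True
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trainable / total


class ToyBlock(nn.Module):
    """Hypothetical block: a base projection, a low-rank adapter pair, one DyT alpha."""

    def __init__(self, dim: int = 32, rank: int = 2):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.lora_A = nn.Linear(dim, rank, bias=False)
        self.lora_B = nn.Linear(rank, dim, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.5))


block = ToyBlock()
fractions = {s: apply_strategy(block, s) for s in ("frozen", "selective", "full")}
```

Comparing the returned fractions across strategies makes it easy to confirm that each run trained the parameter set it was meant to.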

Evaluate across Alpaca, ShareGPT, and RE-WILD

Why — Using three datasets tested the ablation across instruction tuning, conversational data, and open-ended QA rather than a single narrow workload.

Judge generated outputs with a PairRM preference model

Why — Validation loss alone could not show whether DyT’s outputs were still usable, so a PairRM model judged DyT and LayerNorm completions on MT-Bench prompts head to head.

Trade-off — A single automated judge model is faster than human evaluation but is itself an approximation of human preference.

Manually reformat the RE-WILD dataset

Why — The Hugging Face RE-WILD JSON source files were corrupted, so the dataset had to be reformatted manually before any model could train on it.

Results & Evaluation

DyT’s validation loss fell by more than 87% from its starting point over the course of training, confirming that it could still learn under post-training. Even so, in the matched comparison where the gap was narrowest (Pythia 410M under full fine-tuning), DyT’s converged loss was still about 21-28% worse than LayerNorm’s, and the gap was far larger under frozen or selectively unfrozen LoRA at smaller scale. Inference timing on MT-Bench prompts showed only about a 0.5% speedup for DyT, and a PairRM judge preferred LayerNorm’s outputs on 67 of 81 paired completions (about 83%), against 14 (about 17%) for DyT.
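
The preference percentages above follow directly from the pair counts, and the loss gap is a relative difference against the LayerNorm baseline. A small sketch of both calculations is below; the function names are illustrative, and the two loss values in the last line are placeholders chosen only to show the formula, not measurements from this study.

```python
def win_rate_percent(wins: int, total: int) -> float:
    """Share of paired comparisons won, as a percentage."""
    return 100.0 * wins / total


def relative_gap_percent(candidate_loss: float, baseline_loss: float) -> float:
    """How much higher the candidate loss is than the baseline, as a percentage."""
    return 100.0 * (candidate_loss - baseline_loss) / baseline_loss


# Pair counts reported above.
layernorm_wins, dyt_wins = 67, 14
total_pairs = layernorm_wins + dyt_wins  # 81

layernorm_rate = round(win_rate_percent(layernorm_wins, total_pairs), 1)  # 82.7
dyt_rate = round(win_rate_percent(dyt_wins, total_pairs), 1)              # 17.3

# Placeholder losses (NOT results from this project), just to exercise the formula.
example_gap = relative_gap_percent(candidate_loss=2.5, baseline_loss=2.0)  # 25.0
```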

Validation used the standard research loop: training and validation loss curves, gradient norm, token-level accuracy, inference-time measurements on MT-Bench prompts, and a PairRM preference judge comparing DyT and LayerNorm completions, all logged and plotted across the tested models and datasets on a single A100 GPU.
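
An inference-time comparison of the kind described above can be sketched with a small timing harness. This is an assumed setup rather than the project's evaluation notebook: `median_latency` and `speedup_percent` are illustrative names, and the workload passed in at the end is a CPU stand-in for a generation call.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object], repeats: int = 5, warmup: int = 1) -> float:
    """Median wall-clock seconds per call, after a few untimed warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # Note: for GPU work, the device must be synchronized before reading the
        # clock (e.g. torch.cuda.synchronize()), or queued kernels go unmeasured.
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def speedup_percent(baseline_seconds: float, candidate_seconds: float) -> float:
    """Positive when the candidate is faster than the baseline."""
    return 100.0 * (baseline_seconds - candidate_seconds) / baseline_seconds


# Stand-in workload; in practice this would wrap a generate() call on one prompt.
latency = median_latency(lambda: sum(range(10_000)))
```

Using the median rather than the mean keeps a single slow call from dominating, which matters when the difference being measured is well under one percent.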

Evidence / Technologies

View training code | View MT-Bench evaluation notebook | View RE-WILD data reformatting | View result plots | View final report

PyTorch | Hugging Face | PEFT | LoRA | Dynamic Tanh | LayerNorm | DistilGPT-2 | Pythia 17M | Pythia 410M | Alpaca | ShareGPT | RE-WILD | Ablation Studies | Training Pipelines | MT-Bench Evaluation | Evaluation | Python
