Correcting Suppressed Log-Probabilities in Language Models with Post-Transformer Adapters
Abstract: Alignment-tuned language models frequently suppress factual log-probabilities on politically sensitive topics despite retaining the knowledge in their hidden representations. We show that a 786K-parameter (approximately 0.02% of the base model) post-transformer adapter, trained on frozen hidden states, corrects this suppression on 31 ideology-discriminating facts across Qwen3-4B, 8B, and 14B. The adapter memorizes all 15 training facts and generalizes to 11--39% of 16 held-out facts across 5 random splits per scale, with zero knowledge regressions via anchored training. Both gated (SwiGLU) and ungated (linear bottleneck) adapters achieve comparable results; neither consistently outperforms the other (Fisher exact p > 0.09 at all scales). On instruct models, the adapter corrects log-probability rankings. When applied at all token positions during generation, the adapter produces incoherent output; however, when applied only at the current prediction position (last-position-only), the adapter produces coherent, less censored text. A logit-space adapter operating after token projection fails to produce coherent generation at any application mode, suggesting hidden-state intervention is the correct level for generation correction. A previously undocumented silent gradient bug in Apple MLX explains all null results in earlier iterations of this work: the standard pattern nn.value_and_grad(model, fn)(this http URL()) returns zero gradients without error; the correct pattern nn.value_and_grad(model, fn)(model, data) resolves this. We provide a minimal reproduction and discuss implications for other adapter research using MLX.
Comments: 12 pages, 3 figures, code at this https URL
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2604.14174 [cs.CL] (or arXiv:2604.14174v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2604.14174
Submission history: From: Bryan Sanchez. [v1] Wed, 25 Mar 2026 16:27:43 UTC (39 KB)