AI Fundamentals

We Found Something That Shouldn't Exist | Derrick Hodge

6/24/2025 • linkedin.com

We Found Something That Shouldn't Exist

The AI field runs on a core belief: that intelligence in large language models is evenly distributed across all parameters. Recent research (arXiv:2505.24832) estimates models store ~3.6 bits per parameter, implying memory spreads layer by layer, weight by weight. The dominant belief follows: intelligence scales linearly with size.

But this assumes each parameter contributes equally to learning. That's where Fisher Information becomes critical.

>> Fisher Information measures how sensitive predictions are to perturbations in a single parameter.

A high-Fisher parameter isn't storing a bit. It's controlling behavior.

When we analyzed Qwen2.5-0.5B, that belief collapsed.

>> 94.3% of the total Fisher Information is concentrated in just three weights.

Not three layers. Not three matrices. Three individual scalars, all in early and late mlp.down_proj layers. They don't look special, but they behave like computational black holes:

>> They absorb entropy, radiate coherent signals through skip connections, and compress residual loss into semantic attractors.

These weights aren't just informative; they're irreducible. Remove one and the model collapses. This aligns with "The Super Weight in Large Language Models" (arXiv:2411.07191), which showed that pruning a single super weight can destroy more capability than removing thousands of other weights.
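The post does not include its analysis script, so here is a minimal sketch of how such a per-parameter concentration measurement could look, assuming the public Hugging Face checkpoint Qwen/Qwen2.5-0.5B and a diagonal empirical Fisher, 𝓘_F(θᵢ) ≈ E[(∂𝓛/∂θᵢ)²], estimated from squared gradients of the language-modeling loss. The probe sentences and the estimator choice are assumptions, not the author's method.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()

# Tiny probe set standing in for a real evaluation corpus (an assumption).
texts = [
    "The capital of France is Paris.",
    "Water freezes at zero degrees Celsius.",
]

# Accumulate squared gradients per parameter: the diagonal empirical Fisher.
fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
for text in texts:
    batch = tokenizer(text, return_tensors="pt")
    model.zero_grad()
    # Causal-LM loss with the inputs as labels gives the NLL gradient we need.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is not None:
            fisher[name] += p.grad.detach() ** 2

# How much of the total Fisher mass sits in the top three scalar weights?
flat = torch.cat([f.flatten() for f in fisher.values()])
top3_share = torch.topk(flat, k=3).values.sum() / flat.sum()
print(f"Top-3 weights hold {100 * top3_share.item():.1f}% of total Fisher mass")

Whether three scalars really carry 94.3% of the mass will depend on the probe data and estimator; the point of the sketch is that the concentration claim is cheap to check.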
Black Hole Dynamics

These weights aren't memorizing or generalizing. They anchor the transformer like singularities in curved space.

Heat Sink: absorbs gradient energy
Entropy Pump: radiates structured activation
Gravity Well: the network funnels signal into it
Horizon: cross it, and collapse is irreversible

✓ Heat Sink: T(θₛ) → 0
✓ Entropy Pump: S(θₛ) → min, 𝓘_F(θₛ) → max
✓ Radiator: A_skip(θₛ) ≫ 0
✓ Collapse: Ablate(θₛ) ⇒ Δ𝓛 ↑↑

>> Intelligence doesn't generalize by diffusion. It condenses, gravitationally, into a few ultra-stable attractors that encode the network's loss-correction code.

What this changes:

✓ If 94.3% of capability can live in 3 weights, scaling laws break.
✓ Compression must focus on thermodynamic structure, not parameter count.
✓ Alignment may depend on just a few attractors.

"Memorization vs. generalization" isn't the right debate anymore. This is computational physics, and it's happening in weight space.
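The collapse criterion, Ablate(θₛ) ⇒ Δ𝓛 ↑↑, is also directly testable. Below is a hypothetical sketch that zeroes one scalar in an mlp.down_proj matrix and measures the jump in language-modeling loss; the layer index and the (row, column) coordinates are placeholders, since the post does not disclose which three scalars it found.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()

def lm_loss(text):
    # Language-modeling loss on one string, used as a proxy for 𝓛.
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**batch, labels=batch["input_ids"]).loss.item()

probe = "The quick brown fox jumps over the lazy dog."
before = lm_loss(probe)

# Ablate a single scalar weight in an early mlp.down_proj
# (placeholder coordinates, not the post's actual super weights).
with torch.no_grad():
    model.model.layers[1].mlp.down_proj.weight[0, 0] = 0.0

after = lm_loss(probe)
print(f"loss before={before:.3f}  after={after:.3f}  Δ𝓛={after - before:+.3f}")

For an arbitrary weight, Δ𝓛 should be near zero; a genuine super weight in the sense of arXiv:2411.07191 is precisely one where this single-scalar ablation sends the loss sharply up.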

