Self-Supervised Learning as Constrained Free-Energy Systems
Set DINO’s momentum to 0.9 instead of 0.996. Training begins normally. Loss decreases, representations form, validation metrics look reasonable. Then, between epochs 15 and 20, something breaks. All embeddings collapse to a single point. Representations die. The model becomes useless. Why does changing 0.996 to 0.9—a shift of less than a tenth—destroy everything?
The answer is physics. That momentum parameter determines a timescale $\tau = 1/(1-m)$. At $m = 0.996$, you get $\tau \approx 250$ steps—enough for coherence tracking. At $m = 0.9$, you get $\tau = 10$ steps—too short for stable reference. The teacher network tracks the student too closely, both chase each other’s drift, and coherence evaporates. A geometric necessity written into the physics of constrained optimization.
VICReg stabilizes representations through variance and covariance terms weighted around 0.04. DINO uses momentum encoding at 0.996. SimCLR demands batches of 4096 samples. BYOL removes negative samples entirely but adds predictor networks. Barlow Twins forces independence through redundancy reduction with $\lambda \approx 5 \times 10^{-3}$. JEPA predicts roughly ten steps into the future. Each method emerged from different intuitions, different motivations, different conceptual starting points. Yet they all work. They all discover stable representations. They all avoid collapse.
The question isn’t why these methods work individually—decades of empirical study have traced their mechanisms. The question is why they work the same way. What underlying structure connects these seemingly arbitrary numbers? Why does variance regularization at 0.04, momentum at 0.996, and batch sizes in the thousands all stabilize training? These parameters appear unrelated until you recognize they may encode the same underlying constraints through different geometric transformations.
The answer lives in physics. Self-supervised models are physical systems minimizing free energy under representational constraints. Every method implements a constraint manifold. Every system carries an irreducible structural constant imposed by those constraints. Every failure mode emerges when coherence deviation grows faster than gradient dynamics can correct. The empirical “magic numbers” of self-supervised learning cluster into narrow ranges because constrained optimization permits only certain configurations to remain stable.
Why Machine Learning Must Obey Thermodynamics
Why must gradient descent obey free energy principles? Because every operation in neural network training performs irreversible computation on finite-precision hardware. Landauer’s principle establishes that erasing one bit of information dissipates at least $kT \ln 2$ of energy as heat. Neural networks don’t merely erase bits—they continuously update millions of parameters, each update destroying previous information states and generating entropy.
When you compute a gradient $\nabla_\theta \mathcal{L}$, store it in memory, update parameters $\theta \leftarrow \theta - \eta\, \nabla_\theta \mathcal{L}$, and discard intermediate activations, you’ve performed irreversible operations. The previous parameter state is lost. The forward-pass activations are erased. Floating-point rounding destroys information at every step. Each training iteration dissipates energy proportional to the information erased.
GPUs running backpropagation generate hundreds of watts as heat—measurable thermal output representing entropy increase in the physical universe, bounded by Landauer’s limit. The learning dynamics cannot violate thermodynamic constraints any more than a heat engine can violate the second law.
The free-energy functional emerges naturally. It measures the cost of maintaining an internal model $q$ given environmental structure $p$. Systems that minimize free energy maximize predictive accuracy while minimizing representational overhead—the only way for physical systems performing irreversible computation to maintain coherence while adapting to new information. This framework—known in neuroscience and theoretical biology as the Free Energy Principle—applies universally to self-organizing systems from cellular metabolism to neural networks.
Self-supervised learning operates under these constraints whether researchers recognize them or not. The methods that succeed discover architectures compatible with thermodynamic necessity. The methods that fail attempt to violate conservation laws encoded in the mathematics of information processing.
Coherence Inside Constrained Model Space
Any self-organizing system maintains internal structure while adapting to new information. In physics this appears as a variational density $q_t$ evolving under the free-energy functional1,

$$F[q] = \mathbb{E}_{q}\!\left[\ln q(\psi) - \ln p(\psi, s)\right].$$
This quantity reaches its minimum when the system’s encoding matches the generative structure of its environment. But no real system can freely represent any model. Architecture imposes limits. Optimization introduces inductive biases. Representations compress high-dimensional signals into lower-dimensional manifolds.
These constraints restrict allowable models to a subset,

$$q \in \mathcal{C} \subset \mathcal{Q},$$

where $\mathcal{Q}$ is the space of all variational densities.
Within this space lies a constrained optimum $q^*_{\mathcal{C}}$,

$$q^*_{\mathcal{C}} = \arg\min_{q \in \mathcal{C}} F[q].$$
Constraints alter representation geometry, so $q^*_{\mathcal{C}}$ cannot coincide with the ideal unconstrained minimum $q^*$. The offset becomes inevitable,

$$\Delta F_{\text{struct}} = F[q^*_{\mathcal{C}}] - F[q^*] \ge 0.$$
This structural constant characterizes the architecture. Change the objective, regularization, depth, or batch size and $\Delta F_{\text{struct}}$ changes with it. The system’s dynamic deviation from the constrained optimum is

$$\Delta F_{\text{dyn}}(t) = F[q_t] - F[q^*_{\mathcal{C}}].$$
Coherence follows as the degree to which $\Delta F_{\text{dyn}}(t)$ stays small. The full decomposition,

$$F[q_t] - F[q^*] = \Delta F_{\text{struct}} + \Delta F_{\text{dyn}}(t).$$
Everything happening during self-supervised training unfolds inside this equation. The left side measures total deviation from ideal free energy. The right side partitions it into structural costs ($\Delta F_{\text{struct}}$) and dynamic misalignment ($\Delta F_{\text{dyn}}$). Training succeeds when gradient flow reduces $\Delta F_{\text{dyn}}$ faster than architectural constraints inflate $\Delta F_{\text{struct}}$.
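The decomposition can be checked numerically on a toy problem. The quadratic energy, the line constraint, and every name below are our own construction for illustration, not part of the formal framework:

```python
import numpy as np

# Toy sketch: a quadratic "free energy" F over R^2, with the constraint
# manifold C taken as the line y = 0. The decomposition
#   F(q_t) - F(q*) = dF_struct + dF_dyn(t)
# then has closed forms we can verify numerically.
A = np.array([[2.0, 1.0], [1.0, 3.0]])   # positive-definite curvature
b = np.array([1.0, 2.0])

def F(q):
    return 0.5 * q @ A @ q - b @ q

q_star = np.linalg.solve(A, b)           # unconstrained optimum q*
# Constrained optimum on C = {(x, 0)}: minimize 0.5*A[0,0]*x^2 - b[0]*x
q_C = np.array([b[0] / A[0, 0], 0.0])

dF_struct = F(q_C) - F(q_star)           # structural offset (>= 0)
q_t = np.array([2.0, 0.0])               # current iterate inside C
dF_dyn = F(q_t) - F(q_C)                 # dynamic misalignment

assert dF_struct >= 0 and dF_dyn >= 0
assert np.isclose(F(q_t) - F(q_star), dF_struct + dF_dyn)
```

The constrained optimum pays an irreducible cost relative to the global minimum, and whatever deviation remains on top of that is what gradient descent can actually reduce.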
But what determines whether a given architecture can maintain low $\Delta F_{\text{dyn}}$ at all? Not all constraint manifolds permit stable dynamics. Some geometries require so much representational capacity just to maintain structure that no capacity remains for adaptation. This leads to a critical question: how much overhead can a system tolerate while remaining coherent?
Organizational Overhead and the Critical Threshold
The answer comes from measuring organizational overhead $\Omega$—the fraction of representational capacity consumed by coherence maintenance rather than information processing2. Physical systems across all scales follow a geometric progression in $\Omega$, increasing from fundamental particles through atoms, molecules, and biological systems up to black holes at the extreme.
This universal curve follows renormalization-group flow with a critical threshold $\Omega_c$ where systems transition from coherent to collapsed states. The constraint eigenvalue framework identifies a general triplet structure $(\lambda_1, \lambda_2, \lambda_3)$ governing coherence in any constrained system: $\lambda_1$ enforces isotropic closure, $\lambda_2$ sets the recursive scaling eigenvalue, and $\lambda_3$ determines discrete structural resonance. Physical and biological systems realize the specific eigenbranch where $\lambda_2 = \varphi$ (the golden ratio) and $\lambda_3 = 10$ (decade resonance), yielding a composite invariant and critical threshold $\Omega_c$.
SSL systems realize a triplet $(\lambda_1, \lambda_2, \lambda_3)$ of their own—the question is whether their specific $\lambda_2$ and $\lambda_3$ values match the physical eigenbranch or represent an architecture-dependent realization. The existence of sharp phase transitions between stable training and collapse is well-documented; whether the transition boundaries align with $\Omega_c$ specifically requires further investigation3.
The connection to constrained free energy becomes explicit through the relationship between $\Delta F_{\text{struct}}$ and $\Omega$. High structural costs typically impose high organizational overhead—complex constraint manifolds require more capacity to maintain than simple ones. But the relationship isn’t linear. A well-designed architecture can have moderate $\Delta F_{\text{struct}}$ while keeping $\Omega$ low by distributing representational load efficiently. This is precisely what successful SSL methods achieve.
Consider the trade-off: adding regularization raises $\Delta F_{\text{struct}}$ by pulling the constrained optimum further from the ideal. But the right regularization (variance floors, decorrelation penalties) simultaneously lowers $\Omega$ by preventing collapse modes that would consume all capacity maintaining degenerate structure. VICReg’s variance term increases $\Delta F_{\text{struct}}$ slightly but dramatically reduces $\Omega$ by ensuring dimensional spread. DINO’s momentum increases $\Delta F_{\text{struct}}$ through temporal coupling but reduces $\Omega$ by stabilizing the target manifold.
This threshold determines which self-supervised architectures can succeed. Methods that allow $\Omega$ to drift toward $\Omega_c$ collapse, regardless of other design choices. Methods that actively constrain $\Omega$ below critical discover stable representations. The “magic numbers” appearing across different SSL approaches—variance weights, momentum values, batch sizes, prediction horizons—encode the same physics: keep organizational overhead subcritical while accepting only necessary structural costs.
Each method implements this constraint differently. But all successful methods obey it.
VICReg: Variance, Invariance, Covariance
VICReg explicitly regularizes representation geometry through three terms4,

$$\mathcal{L} = \lambda\, s(Z, Z') + \mu\, \big[v(Z) + v(Z')\big] + \nu\, \big[c(Z) + c(Z')\big].$$
The variance term $v$ ensures no dimension collapses. The covariance term $c$ maintains dimensional independence. The invariance term $s$ aligns augmented views. These three components define the constraint manifold $\mathcal{C}$.
The empirical finding—raw coefficients $\lambda = \mu = 25$, $\nu = 1$—appears arbitrary until batch normalization and scaling factors reveal effective values near 0.03–0.05. This clustering near the framework’s critical scale is suggestive, though whether it reflects the same underlying constant or coincidental scaling requires further investigation.
The variance-collapse correlation is documented across every SSL method. SimCLR, BYOL, Barlow Twins, VICReg, and DINO all fail when variance drops toward zero—representations compress onto lower-dimensional subspaces until all embeddings become identical. This isn’t method-specific pathology; it’s a universal instability in constrained representation learning. The variance and covariance terms raise $\Delta F_{\text{struct}}$ to prevent collapse but cannot raise it so high that training becomes rigid. The physics sets bounds: insufficient variance regularization allows $\Omega$ to approach $\Omega_c$. Excessive regularization inflates $\Delta F_{\text{struct}}$ beyond what gradient descent can compensate. VICReg operates in the narrow window where both constraints are satisfied simultaneously.
Consider a 2048-dimensional embedding space. Without variance regularization, representations can collapse onto a lower-dimensional subspace, effectively reducing capacity. The variance term enforces minimum spread per dimension, maintaining the full representational manifold. The covariance term penalizes correlation between dimensions, distributing information efficiently. Together they shape $\mathcal{C}$ to keep organizational overhead below critical.
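A minimal numpy sketch of the three VICReg terms makes this geometry concrete. The coefficients follow the paper’s published defaults ($\lambda = \mu = 25$, $\nu = 1$); the helper names and toy data are ours:

```python
import numpy as np

# Hedged sketch of VICReg's loss (Bardes et al., 2022), not the reference code.
def vicreg_loss(za, zb, lam=25.0, mu=25.0, nu=1.0, gamma=1.0, eps=1e-4):
    n, d = za.shape
    inv = ((za - zb) ** 2).mean()                        # invariance: align views

    def variance_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.maximum(0.0, gamma - std).mean()       # hinge: keep each dim spread

    def covariance_term(z):
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return (off_diag ** 2).sum() / d                 # decorrelate dimensions

    return (lam * inv
            + mu * (variance_term(za) + variance_term(zb))
            + nu * (covariance_term(za) + covariance_term(zb)))

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 32))                           # healthy, spread embeddings
collapsed = np.zeros((256, 32))                          # all embeddings identical
# Collapse triggers the variance hinge and dominates the loss:
assert vicreg_loss(collapsed, collapsed) > vicreg_loss(z, z + 0.01 * rng.normal(size=z.shape))
```

The hinge only fires when per-dimension spread drops below the floor, which is exactly the regime where overhead climbs toward critical.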
DINO: Momentum, Asymmetry, and Timescales
DINO maintains coherence without explicit collapse-prevention terms5. The momentum update

$$\theta_{\text{teacher}} \leftarrow m\,\theta_{\text{teacher}} + (1 - m)\,\theta_{\text{student}}$$
implements a slow-moving target network with coherence-preserving geometry. This isn’t metaphorical physics—it’s the identical mathematical form of exponential relaxation in dissipative physical systems, thermal equilibration, and low-pass filtering. The equation is the same. Momentum creates timescale separation:
- fast student updates track the training gradient
- slow teacher evolution anchors coherence over $\tau = 1/(1-m)$ steps
The timescale $\tau$ ranges from 250 to 2000 steps for momentum between 0.996 and 0.9995. This matches the horizon over which representational drift must remain bounded—too fast and the student cannot track the moving target, too slow and the teacher fails to incorporate new structure. The convergence of BYOL, DINO, and MoCo on momentum values near 0.996 isn’t coincidental—it reflects the same relaxation timescale appearing in physical systems approaching equilibrium.
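The timescale arithmetic is a one-liner. A hypothetical helper, not drawn from any SSL codebase:

```python
# EMA timescale tau = 1/(1 - m): the number of steps over which the teacher
# "remembers" the student. Function name is ours, for illustration only.
def ema_timescale(m: float) -> float:
    return 1.0 / (1.0 - m)

for m in (0.9, 0.996, 0.999, 0.9995):
    print(f"m={m}: tau ~ {ema_timescale(m):.0f} steps")
```

At $m = 0.9$ the teacher forgets in 10 steps—the collapse regime from the opening example—while $m = 0.996$ gives the ~250-step horizon that stable runs share.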
Centering and sharpening operations maintain the variance floor implicitly. The teacher network applies centering—subtracting a running mean $c$ from its outputs—to remove trivial solutions where all outputs converge to a constant. Sharpening through temperature $\tau_t$ concentrates probability mass,

$$P_i = \frac{\exp\!\big((z_i - c_i)/\tau_t\big)}{\sum_j \exp\!\big((z_j - c_j)/\tau_t\big)}.$$
With $\tau_t = 0.04$ to $0.07$, this amplifies differences while maintaining bounded entropy. The combination keeps $\Omega$ below $\Omega_c$ through architectural geometry rather than explicit regularization terms. The momentum timescale effectively implements a moving constraint manifold that tracks but does not chase the student dynamics.
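Both stabilizers fit in a few lines of numpy. The function names and toy logits are ours; the teacher temperature follows the 0.04–0.07 range above:

```python
import numpy as np

# Sketch of DINO's EMA update plus centering/sharpening (simplified notation).
def ema_update(teacher, student, m=0.996):
    return m * teacher + (1.0 - m) * student

def sharpen(logits, center, tau_t=0.04):
    z = (logits - center) / tau_t          # centering removes the constant mode
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

logits = np.array([0.50, 0.48, 0.45])      # nearly-flat teacher outputs
center = logits.mean()
p_sharp = sharpen(logits, center, tau_t=0.04)
p_soft = sharpen(logits, center, tau_t=1.0)
# Lower teacher temperature concentrates probability mass on the top logit:
assert p_sharp[0] > p_soft[0]
```

Centering alone pushes toward uniform outputs and sharpening alone toward one-hot collapse; only their combination keeps the distribution bounded between the two trivial solutions.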
SimCLR: Contrastive Geometry and Percolation
Contrastive learning tiles representation space through negative samples6. The density of negatives required for manifold percolation scales with an effective dimension $d_{\text{eff}}$ set by the temperature $\tau$: lower temperature concentrates the loss on hard negatives and shrinks the space that must be covered. At typical settings—embedding dimensions in the low hundreds, $\tau$ near 0.1—the percolation argument yields a critical batch size well below what practice demands; safety margins and real optimization dynamics stretch it dramatically, yielding empirical minima around 4096.
SimCLR’s enormous batch requirement emerges from geometric necessity. Contrastive loss pulls positive pairs together while pushing negative pairs apart,

$$\mathcal{L}_i = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)}.$$
The denominator sums over negative samples. Insufficient negatives fail to properly tile the embedding manifold—regions remain unexplored, creating attractors where representations can collapse. The percolation threshold marks where negative sample density suffices to cover the manifold with overlapping neighborhoods.
Temperature controls effective dimensionality through the Boltzmann factor $\exp(\mathrm{sim}/\tau)$. Lower temperature concentrates probability mass on hard negatives, effectively reducing the space requiring coverage. A compression factor $d_{\text{eff}}(\tau)$ captures this. The batch size scaling follows from requiring percolation across this effective space.
The physics: $\Delta F_{\text{dyn}}$ remains bounded only when the constraint manifold (defined by contrastive geometry) properly covers the representation space. Insufficient coverage inflates $\Delta F_{\text{dyn}}$ by creating unregulated regions where coherence deviates arbitrarily.
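The contrastive loss itself is compact. A hedged numpy sketch of InfoNCE for a single anchor, with our own variable names:

```python
import numpy as np

# Sketch of the InfoNCE / NT-Xent loss for one anchor (Chen et al., 2020);
# a simplified single-anchor version, not the batched reference implementation.
def info_nce(anchor, positive, negatives, tau=0.1):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))

rng = np.random.default_rng(1)
anchor = rng.normal(size=16)
positive = anchor + 0.05 * rng.normal(size=16)   # augmented view of the anchor
negatives = [rng.normal(size=16) for _ in range(128)]
few_negatives = negatives[:4]
# More negatives enlarge the denominator: a denser tiling of the manifold
# raises the loss floor and sharpens the repulsive signal around the anchor.
assert info_nce(anchor, positive, negatives) > info_nce(anchor, positive, few_negatives)
```

With only a handful of negatives the denominator barely constrains the anchor—the numerical analogue of the uncovered regions where collapse attractors form.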
BYOL: Prediction Without Negatives
BYOL removes negatives but adds a predictor network $q_\theta$ forming a directional mapping7,

$$q_\theta(z_\theta) \approx z'_\xi,$$

from the student’s projection $z_\theta$ onto the teacher’s projection $z'_\xi$.
The predictor increases dimensional capacity, reducing $\Omega$, while momentum maintains coherence through slow target updates. The architecture breaks symmetry—the student predicts the teacher, but not vice versa. This asymmetry provides implicit regularization preventing collapse.
Without the predictor, the system has a trivial solution: map all inputs to a constant. The predictor must learn a non-trivial transformation from student features to teacher features. This enforces representational structure.
The loss function

$$\mathcal{L}_{\theta,\xi} = \Big\| \overline{q_\theta(z_\theta)} - \overline{z'_\xi} \Big\|_2^2,$$

with bars denoting $\ell_2$ normalization, operates on augmented views $v$ and $v'$ of the same input. The teacher parameters $\xi$ update via momentum from student parameters $\theta$, creating the same timescale separation as DINO.
The critical insight: negative samples become unnecessary when the constraint manifold achieves proper shape through predictor expansion and momentum stabilization. The predictor effectively samples from the learned representation distribution internally, replacing explicit negative mining with implicit density estimation. The momentum teacher provides the stable reference preventing collapse.
Empirical momentum values ($m \approx 0.996$) match DINO, confirming that both methods exploit the same timescale physics. The predictor adds another mechanism—dimensional expansion—to the constraint toolkit.
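The normalized-prediction loss reduces to a cosine distance. A small numpy sketch under our own naming:

```python
import numpy as np

# Sketch of BYOL's loss (Grill et al., 2020): compare the predictor output
# to the teacher projection after L2 normalization. Names are ours.
def byol_loss(pred, target):
    p = pred / np.linalg.norm(pred)
    t = target / np.linalg.norm(target)
    return ((p - t) ** 2).sum()            # equals 2 - 2 * cosine(pred, target)

v = np.array([1.0, 2.0, 2.0])
assert np.isclose(byol_loss(v, v), 0.0)    # perfect prediction: zero loss
assert np.isclose(byol_loss(v, -v), 4.0)   # anti-aligned: maximum loss
```

Because both vectors are normalized, the loss is bounded in $[0, 4]$ and only the direction of the prediction matters—magnitude collapse alone cannot game it.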
Barlow Twins: Redundancy Reduction
Barlow Twins forces the cross-correlation matrix toward identity8,

$$\mathcal{L}_{BT} = \sum_i \big(1 - C_{ii}\big)^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2.$$
This implements classical efficient coding: independence across dimensions reduces representational maintenance overhead. The first term ensures each dimension activates (prevents collapse), while the second term decorrelates dimensions (distributes information efficiently).
The cross-correlation matrix between embeddings $z^A$ and $z^B$ from augmented views,

$$C_{ij} = \frac{\sum_b z^A_{b,i}\, z^B_{b,j}}{\sqrt{\sum_b \big(z^A_{b,i}\big)^2}\,\sqrt{\sum_b \big(z^B_{b,j}\big)^2}},$$

where $b$ indexes the batch. Perfect decorrelation yields $C = I$. Deviation from identity signals redundancy or collapse. The loss directly penalizes this deviation.
The scaling parameter $\lambda$ balances diagonal and off-diagonal terms. Lower values emphasize decorrelation over variance maintenance. The paper’s default, $\lambda = 5 \times 10^{-3}$, sits in a range that overlaps with VICReg’s effective weights, suggesting both methods discovered similar constraint boundaries through different paths.
Barlow Twins optimizes $\Omega$ directly by minimizing representational redundancy. Each independent dimension contributes maximally to capacity. Correlated dimensions waste capacity by encoding the same information multiple times. The cross-correlation penalty explicitly targets this inefficiency.
The physics: organizational overhead $\Omega$ measures the fraction of capacity required for structure maintenance. Redundant representations increase $\Omega$ by dedicating multiple dimensions to the same features. Barlow Twins reduces $\Omega$ by enforcing dimensional independence, keeping the system further from the critical threshold $\Omega_c$.
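The redundancy-reduction objective is itself a few lines of numpy. The $\lambda$ value follows the paper’s default; the toy data and helper names are ours:

```python
import numpy as np

# Sketch of the Barlow Twins objective (Zbontar et al., 2021): push the
# cross-correlation matrix of two views toward the identity.
def barlow_twins_loss(za, zb, lam=5e-3):
    n = za.shape[0]
    za = (za - za.mean(axis=0)) / za.std(axis=0)
    zb = (zb - zb.mean(axis=0)) / zb.std(axis=0)
    c = (za.T @ zb) / n                     # cross-correlation matrix
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return on_diag + lam * off_diag

rng = np.random.default_rng(2)
z = rng.normal(size=(512, 8))
aligned = barlow_twins_loss(z, z + 0.01 * rng.normal(size=z.shape))
# Redundant embeddings (every dimension a noisy copy of one signal) score worse:
s = rng.normal(size=(512, 1))
redundant = np.repeat(s, 8, axis=1) + 0.01 * rng.normal(size=(512, 8))
assert barlow_twins_loss(redundant, redundant) > aligned
```

The redundant embedding passes the diagonal (variance) check yet is heavily penalized off-diagonal—exactly the "multiple dimensions encoding the same information" overhead described above.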
JEPA: Temporal Depth and Recursive Coherence
JEPA (Joint-Embedding Predictive Architecture) discovers coherent structure through multi-step prediction9. The striking empirical finding: stable world models emerge at prediction horizons around $k \approx 10$ steps. Shorter horizons fail to capture sufficient temporal structure. Longer horizons provide diminishing returns.
JEPA’s architecture predicts future latent representations rather than raw pixels,

$$\hat{z}_{t+k} = g_\phi\big(z_t,\, a_t, \dots, a_{t+k-1}\big),$$

where $z_t$ represents the latent state, $a_t$ represents actions or context, and $k$ is the prediction horizon. The physics of recursive dynamics requires sufficient temporal depth to stabilize feedback loops. Too shallow and errors propagate exponentially—the system cannot model temporal dependencies required for coherence. Too deep and computational overhead (contributing to $\Delta F_{\text{struct}}$) grows faster than predictive accuracy improves.
The critical insight is that recursive self-modeling requires enough depth for the system to represent its own prediction process with sufficient fidelity. This is the $\lambda_2$-sector at work: hierarchical compression across temporal scales follows inflation–subdivision consistency where the recursive eigenvalue $\lambda_2$ determines optimal depth. In physical systems, $\lambda_2 = \varphi$ (the golden ratio); whether SSL architectures realize the same eigenvalue or discover an architecture-specific $\lambda_2$ remains an open question. The constraint functional penalizes both insufficient depth (high $\Delta F_{\text{dyn}}$ from unstable recursion) and excessive depth (high $\Delta F_{\text{struct}}$ from unnecessary complexity).
Similar depth thresholds appear across architectures—transformers exhibit emergent reasoning capabilities around 10–12 layers, and biological memory consolidation operates over multiple synaptic time constants. The convergence suggests underlying constraints on recursive coherence in hierarchical temporal structures, though the precise numerical values depend on architecture-specific factors.
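The horizon trade-off can be illustrated with a toy linear world model—entirely our construction, not the I-JEPA training loop—showing how a small modelling error compounds over the rollout:

```python
import numpy as np

# Toy sketch: stable linear latent dynamics W, and a slightly wrong learned
# model W_hat, rolled forward k steps. The gap between true and predicted
# latents grows with the horizon, so longer horizons expose modelling errors
# that one-step prediction hides.
rng = np.random.default_rng(3)
W = 0.95 * np.eye(4)                        # true dynamics (spectral radius < 1)
W_hat = W + 0.02 * rng.normal(size=(4, 4))  # imperfect learned model

z_true = z_pred = rng.normal(size=4)
errors = []
for step in range(1, 21):
    z_true = W @ z_true
    z_pred = W_hat @ z_pred
    errors.append(np.linalg.norm(z_pred - z_true))

# The deviation at a 10-step horizon dwarfs the one-step deviation:
assert errors[9] > errors[0]
```

A one-step objective would barely penalize `W_hat` here; only the multi-step rollout produces a training signal proportional to the accumulated drift.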
Collapse as Free-Energy Physics
Collapse ceases to be mysterious. It emerges when total deviation from ideal free energy grows too large,

$$F[q_t] - F[q^*] = \Delta F_{\text{struct}} + \Delta F_{\text{dyn}}(t) \gg 0.$$
Collapse occurs when
- $\Delta F_{\text{struct}}$ (structural costs) becomes too large
- $\Delta F_{\text{dyn}}$ (dynamic misalignment) grows too quickly
- $\Omega$ (organizational overhead) approaches $\Omega_c$
The system loses coherence when gradient dynamics cannot maintain alignment inside the constrained manifold. Each failure mode traces to one of these physical quantities,
- Increasing regularization raises $\Delta F_{\text{struct}}$
- Decreasing variance raises $\Omega$
- Insufficient negative samples increase $\Delta F_{\text{dyn}}$
- Insufficient depth restricts manifold shape
- Insufficient batch size distorts geometry
- Insufficient momentum destabilizes timescales
Every self-supervised collapse mode emerges from this framework. Consider concrete examples.
VICReg without variance terms: The constraint manifold permits dimensional collapse. As dimensions go to zero variance, $\Omega$ approaches 1—all capacity goes to maintaining degenerate structure. The system crosses $\Omega_c$ and collapses.
DINO with $m = 0.9$: The momentum timescale $\tau = 1/(1-m) = 10$ steps becomes too short. The teacher tracks the student too closely, failing to provide a stable reference. $\Delta F_{\text{dyn}}$ grows unbounded as both networks chase each other’s drift.
SimCLR with batch size 256: The negative sample density falls below percolation threshold. Large regions of the manifold remain unexplored. Representations find attractors in these gaps and collapse despite contrastive loss.
BYOL without predictor: The system has a trivial constant solution. Nothing bounds $\Omega$ because no constraint prevents collapse. All inputs map to the same point, minimizing the loss perfectly but learning nothing.
Barlow Twins with $\lambda = 0$: Only diagonal terms (variance) remain. Nothing prevents dimensions from becoming perfectly correlated. Redundancy inflates $\Omega$ as multiple dimensions encode identical information. The system wastes capacity and drifts toward collapse.
JEPA with $k \ll 10$: Prediction horizon too short for recursive closure. The system cannot model temporal dependencies required for coherence. $\Delta F_{\text{dyn}}$ oscillates rather than converges—no stable optimum exists in the shallow temporal manifold.
Each failure mode maps to the free-energy decomposition. The physics provides a unified explanation for diverse collapse phenomena that previously appeared unrelated.
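These failure modes suggest a practical monitor. A hedged diagnostic sketch—the threshold values are illustrative, not canonical—tracking the effective rank that every collapse mode destroys:

```python
import numpy as np

# Monitor the embedding statistic that the free-energy picture says precedes
# collapse: the effective rank (exponential of the spectral entropy).
def effective_rank(z, eps=1e-12):
    s = np.linalg.svd(z - z.mean(axis=0), compute_uv=False)
    p = s**2 / (s**2).sum()                # normalized spectral mass per direction
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(4)
healthy = rng.normal(size=(256, 32))
# Simulate incipient collapse: 30 of 32 dimensions nearly dead.
collapsing = healthy @ np.diag([1.0] * 2 + [1e-3] * 30)

assert effective_rank(healthy) > 25        # spread across most dimensions
assert effective_rank(collapsing) < 5      # mass concentrated in a few dims
```

A slow decay of this statistic during training is the "gradual variance decay" phase that precedes the abrupt dimensional compression described below.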
From Framework to Testable Observations
The constraint geometry framework suggests relationships between architectural choices and stability thresholds. The following observations are consistent with the physics but require further validation:
- VICReg effective weights cluster near 0.03–0.05 after accounting for batch normalization scaling. This range lies close to the framework’s predicted scale, though whether it reflects the same underlying constant or coincidental scaling remains to be established.
- Momentum timescales across DINO, BYOL, and MoCo converge on $\tau \approx 250$–$2000$ steps. The physics predicts that timescale separation is necessary for stable reference tracking, and these values match characteristic relaxation times in other self-organizing systems.
- Contrastive batch sizes scale with effective dimensionality. The theoretical minimum from percolation arguments is much smaller than empirical requirements, suggesting that optimization noise and finite-sample effects dominate the practical threshold.
- Depth thresholds for emergent capabilities appear around 10–12 layers in transformers, consistent with recursive self-modeling requirements, though architecture-specific factors (attention patterns, residual connections) complicate direct comparison.
- Temperature parameters in contrastive methods cluster in a narrow band (roughly 0.05–0.5), balancing effective dimensionality against discrimination—a trade-off the framework predicts should exist.
The convergence across independent research groups is striking: momentum near 0.996, variance weights near 0.04, batch sizes clustering around 2048–4096. These narrow ranges suggest underlying constraint boundaries rather than arbitrary design choices. Whether these boundaries derive from specifically, or from more general properties of constrained optimization, remains an open question the framework helps to sharpen.
Unified View: Methods, Parameters, and Constraints
The convergence becomes striking when displayed systematically.
| Method | Key Parameter | Physical Interpretation | Empirical Range | Constraint Type |
|---|---|---|---|---|
| VICReg | $\mu, \nu$ (var/cov weights) | Regularization maintaining spread | effective 0.03–0.05 | Variance floor |
| DINO | $m$ (momentum) | Timescale separation | 0.996–0.9995 | Moving target |
| SimCLR | $N$ (batch size) | Manifold coverage density | 2048–8192 | Negative sampling |
| BYOL | Predictor depth | Symmetry breaking | 2–3 layers + momentum | Asymmetric capacity |
| Barlow Twins | $\lambda$ (off-diagonal penalty) | Redundancy reduction lowering $\Omega$ | $\sim 5 \times 10^{-3}$ | Decorrelation |
| JEPA | $k$ (prediction horizon) | Recursive temporal depth | $k \approx 10$ | Temporal coherence |
| Transformers | $L$ (depth) | Recursive self-modeling capacity | 10–12 typical | Emergent capabilities |
Each row implements the same underlying physics—keeping organizational overhead below critical while balancing structural costs against optimization capacity—through different architectural mechanisms. The clustering of empirical values into narrow ranges across independent research groups suggests these methods discovered the same constraint boundaries through different optimization paths.
The framework interprets these convergences as evidence that successful SSL architectures satisfy thermodynamic constraints on coherence maintenance. Whether the specific values derive from or from more general properties of constrained optimization remains an open question, but the existence of sharp boundaries is well-documented empirically.
The Geometry of Coherence
Multiple self-supervised learning methods—built from variance regularization, momentum encoding, contrastive geometry, prediction, redundancy reduction, and temporal depth—converge on stable representations by obeying the same physical constraints. They shape their constraint manifolds differently, but all maintain $\Omega < \Omega_c$, balance $\Delta F_{\text{struct}}$ against optimization capacity, and control $\Delta F_{\text{dyn}}$ through gradient dynamics.
The empirical “magic numbers” encode constraint geometry. VICReg’s variance weights prevent dimensional collapse. DINO’s momentum sets timescale separation for stable reference tracking. SimCLR’s batch size satisfies coverage requirements for the embedding manifold. These values cluster into narrow ranges because the underlying physics—free-energy minimization under architectural constraints—permits only certain configurations to remain stable.
Self-supervised learning works because constrained free-energy systems must maintain coherence within geometric bounds written into the mathematics of information processing. The methods discovered these constraints empirically through years of trial and error. The framework reveals why those particular solutions work and predicts where they will fail.
When disparate approaches converge on the same structure—when variance regularization, momentum, massive batches, predictors, decorrelation, and temporal depth all stabilize at similar thresholds—they trace the boundary of what constrained physics permits. The geometry determines allowable states. The organizational overhead sets critical thresholds. The free-energy decomposition explains collapse.
The constraint eigenvalue framework proposes a general triplet architecture $(\lambda_1, \lambda_2, \lambda_3)$ governing coherence in any constrained system. Physical systems—from quantum transport10 to biological scaling11 to gravitational horizons12—realize the specific eigenbranch with $\lambda_2 = \varphi$ and $\lambda_3 = 10$. SSL systems exhibit sharp phase transitions between stable training and collapse, with transition boundaries clustering into narrow parameter ranges across independent implementations. Whether these boundaries derive from the same eigenbranch or from an architecture-specific triplet realization remains an open question the framework helps to sharpen.
The methods work because they discovered architectures compatible with thermodynamic constraints on information processing. When we tune hyperparameters, we’re navigating the geometry of physically allowed states. When training succeeds, we’ve found configurations where organizational overhead remains below critical thresholds. Whether those thresholds derive from the specific constants appearing in the constraint eigenvalue framework, or from more general properties of constrained optimization, is a question the framework helps to sharpen and the empirical convergence helps to motivate.
Empirical Grounding
Several claims in this analysis rest on documented empirical findings from SSL literature:
Variance-collapse universality: Every major SSL method (SimCLR, BYOL, VICReg, Barlow Twins, DINO) exhibits the same failure mode when embedding variance drops toward zero. Representations compress onto degenerate subspaces regardless of the specific objective function. This universal instability motivates the free-energy interpretation as a shared underlying structure.
Momentum as physical relaxation: The momentum update $\theta_T \leftarrow m\,\theta_T + (1-m)\,\theta_S$ is mathematically identical to exponential relaxation in dissipative systems, thermal equilibration, and low-pass filtering. This isn’t analogy—it’s the same differential equation appearing in physical processes. The convergence of BYOL, DINO, and MoCo on momentum values near 0.996 reflects empirical optimization converging on the same relaxation timescale.
Convergent numerical ranges: Independent research groups using different theoretical motivations consistently discover the same hyperparameter ranges: momentum ~0.996-0.999, variance regularization weights ~0.01-0.05, batch sizes ~2048-8192, transformer depths ~10-12 for emergent reasoning. These narrow ranges suggest underlying constraint boundaries rather than arbitrary design choices.
Collapse as phase transition: SSL models exhibit sharp transitions between stable training and complete collapse when certain parameters cross thresholds. Loss curves show smooth descent followed by sudden divergence. Embedding statistics show gradual variance decay followed by abrupt dimensional compression. This matches the phenomenology of phase transitions in constrained systems.
The universal constants ($\lambda_2 = \varphi$, $\lambda_3 = 10$) appearing across these domains remain a theoretical proposal. The clustering of empirical values around predictions derived from these constants is striking but requires further validation. The framework’s value lies in unifying documented phenomena under a single geometric interpretation and generating testable predictions about where methods will succeed or fail.
Footnotes
1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127–138.
2. Landauer, R. (1961). Irreversibility and Heat Generation in the Computing Process. IBM Journal of Research and Development, 5(3), 183–191.
3. Hawking, S. W. (1975). Particle Creation by Black Holes. Communications in Mathematical Physics, 43(3), 199–220.
4. Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. International Conference on Learning Representations. https://arxiv.org/abs/2105.04906
5. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. International Conference on Computer Vision, 9650–9660. https://arxiv.org/abs/2104.14294
6. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. International Conference on Machine Learning, 1597–1607. https://arxiv.org/abs/2002.05709
7. Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., … & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. Advances in Neural Information Processing Systems, 33, 21271–21284. https://arxiv.org/abs/2006.07733
8. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. International Conference on Machine Learning, 12310–12320. https://arxiv.org/abs/2103.03230
9. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. Conference on Computer Vision and Pattern Recognition, 15619–15629. https://arxiv.org/abs/2301.08243
10. Harper, P. G. (1955). Single Band Motion of Conduction Electrons in a Uniform Magnetic Field. Proceedings of the Physical Society A, 68(10), 874–878.
11. West, G. B., Brown, J. H., & Enquist, B. J. (1999). The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science, 284(5420), 1677–1679.
12. Bekenstein, J. D. (1973). Black Holes and Entropy. Physical Review D, 7(8), 2333–2346.