index

Self-Supervised Learning as Constrained Free-Energy Systems

· 20min

Set DINO’s momentum to 0.9 instead of 0.996. Training begins normally. Loss decreases, representations form, validation metrics look reasonable. Then, between epochs 15 and 20, something breaks. All embeddings collapse to a single point. Representations die. The model becomes useless. Why does changing 0.996 to 0.9—a 0.4% shift—destroy everything?

The answer is physics. That momentum parameter determines a timescale τ=1/(1m)\tau = 1/(1-m). At m=0.996m = 0.996, you get τ=250\tau = 250 steps—enough for coherence tracking. At m=0.9m = 0.9, you get τ=10\tau = 10 steps—too short for stable reference. The teacher network tracks the student too closely, both chase each other’s drift, and coherence evaporates. A geometric necessity written into the physics of constrained optimization.

VICReg stabilizes representations through variance and covariance terms weighted around 0.04. DINO uses momentum encoding at 0.996. SimCLR demands batches of 4096 samples. BYOL removes negative samples entirely but adds predictor networks. Barlow Twins forces independence through redundancy reduction with λ0.005\lambda \approx 0.005. JEPA predicts exactly ten steps into the future. Each method emerged from different intuitions, different motivations, different conceptual starting points. Yet they all work. They all discover stable representations. They all avoid collapse.

The question isn’t why these methods work individually—decades of empirical study have traced their mechanisms. The question is why they work the same way. What underlying structure connects these seemingly arbitrary numbers? Why does variance regularization at 0.04, momentum at 0.996, and batch sizes in the thousands all stabilize training? These parameters appear unrelated until you recognize they may encode the same underlying constraints through different geometric transformations.

The answer lives in physics. Self-supervised models are physical systems minimizing free energy under representational constraints. Every method implements a constraint manifold. Every system carries an irreducible structural constant κ\kappa imposed by those constraints. Every failure mode emerges when coherence deviation CD(t)\mathrm{CD}(t) grows faster than gradient dynamics can correct. The empirical “magic numbers” of self-supervised learning cluster into narrow ranges because constrained optimization permits only certain configurations to remain stable.

Why Machine Learning Must Obey Thermodynamics

Why must gradient descent obey free energy principles? Because every operation in neural network training performs irreversible computation on finite-precision hardware. Landauer’s principle establishes that erasing one bit of information dissipates at least kBTln2k_B T \ln 2 of energy as heat. Neural networks don’t merely erase bits—they continuously update millions of parameters, each update destroying previous information states and generating entropy.

When you compute a gradient θL\nabla_\theta \mathcal{L}, store it in memory, update parameters θθηθL\theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}, and discard intermediate activations, you’ve performed irreversible operations. The previous parameter state θ\theta is lost. The forward-pass activations are erased. Floating-point rounding destroys information at every step. Each training iteration dissipates energy proportional to the information erased.

GPUs running backpropagation generate hundreds of watts as heat—measurable thermal output representing entropy increase in the physical universe, bounded by Landauer’s limit. The learning dynamics cannot violate thermodynamic constraints any more than a heat engine can violate the second law.

The free-energy functional F[q]=Eq[lnq(x)lnp(o,x)]F[q] = \mathbb{E}_q[\ln q(x) - \ln p(o,x)] emerges naturally. It measures the cost of maintaining internal model q(x)q(x) given environmental structure p(o,x)p(o,x). Systems that minimize free energy maximize predictive accuracy while minimizing representational overhead—the only way for physical systems performing irreversible computation to maintain coherence while adapting to new information. This framework—known in neuroscience and theoretical biology as the Free Energy Principle—applies universally to self-organizing systems from cellular metabolism to neural networks.

Self-supervised learning operates under these constraints whether researchers recognize them or not. The methods that succeed discover architectures compatible with thermodynamic necessity. The methods that fail attempt to violate conservation laws encoded in the mathematics of information processing.

Coherence Inside Constrained Model Space

Any self-organizing system maintains internal structure while adapting to new information. In physics this appears as a variational density q(x)q(x) evolving under the free-energy functional1,

F[q]=Eq[lnq(x)lnp(o,x)].F[q] = \mathbb{E}_{q}[\ln q(x) - \ln p(o,x)].

This quantity reaches minimum when the system’s encoding matches the generative structure of its environment. But no real system freely represents any model. Architecture imposes limits. Optimization introduces inductive biases. Representations compress high-dimensional signals into lower-dimensional manifolds.

These constraints restrict allowable models to a subset,

Mallowed={p(o,x)structural constraints satisfied}.\mathcal{M}_{\text{allowed}} = \{p(o,x) \mid \text{structural constraints satisfied}\}.

Within this space lies a constrained optimum pp^*,

p=argminpMallowedF[p].p^* = \arg\min_{p \in \mathcal{M}_{\text{allowed}}} F[p].

Constraints alter representation geometry, so pp^* cannot coincide with the ideal unconstrained minimum FF^*. The offset becomes inevitable,

κ=F[p]F.\kappa = F[p^*] - F^*.

This structural constant characterizes the architecture. Change the objective, regularization, depth, or batch size and κ\kappa changes with it. The system’s dynamic deviation from constrained optimum is

CD(t)=F[qt]F[p].\mathrm{CD}(t) = F[q_t] - F[p^*].

Coherence follows as C(t)=CD(t)C(t) = -\mathrm{CD}(t). The full decomposition,

F[qt]F=κ+CD(t).F[q_t] - F^* = \kappa + \mathrm{CD}(t).

Everything happening during self-supervised training unfolds inside this equation. The left side measures total deviation from ideal free energy. The right side partitions it into structural costs (κ\kappa) and dynamic misalignment (CD(t)\mathrm{CD}(t)). Training succeeds when gradient flow reduces CD(t)\mathrm{CD}(t) faster than architectural constraints inflate κ\kappa.

But what determines whether a given architecture can maintain low CD(t)\mathrm{CD}(t) at all? Not all constraint manifolds permit stable dynamics. Some geometries require so much representational capacity just to maintain structure that no capacity remains for adaptation. This leads to a critical question: how much overhead can a system tolerate while remaining coherent?

Organizational Overhead and the Critical Threshold

The answer comes from measuring organizational overhead η\eta—the fraction of representational capacity consumed by coherence maintenance rather than information processing2. Physical systems across all scales follow a geometric progression:

  • particles: η106\eta \sim 10^{-6}
  • atoms: 10310^{-3}
  • molecules: 10210^{-2}
  • biological systems: 10110^{-1}
  • black holes: η=1\eta = 1

This universal curve follows renormalization-group flow with a critical threshold where systems transition from coherent to collapsed states. The constraint eigenvalue framework identifies a general triplet structure (π,β,N)(\pi, \beta, N) governing coherence in any constrained system: π\pi enforces isotropic closure, β\beta sets the recursive scaling eigenvalue, and NN determines discrete structural resonance. Physical and biological systems realize the specific eigenbranch (π,φ,10)(\pi, \varphi, 10) where β=φ\beta = \varphi (the golden ratio) and N=10N = 10 (decade resonance), yielding composite invariant ρ=4πφ2/103.29\rho^* = 4\pi\varphi^2/10 \approx 3.29 and critical threshold ηc=1/ρ0.304\eta_c = 1/\rho^* \approx 0.304.

SSL systems realize a triplet (π,β,N)(\pi, \beta, N)—the question is whether their specific β\beta and NN values match the physical eigenbranch or represent an architecture-dependent realization. The existence of sharp phase transitions between stable training and collapse is well-documented; whether the transition boundaries align with ηc=0.304\eta_c = 0.304 specifically requires further investigation3.

The connection to constrained free energy becomes explicit through the relationship between η\eta and κ\kappa. High structural costs κ\kappa typically impose high organizational overhead η\eta—complex constraint manifolds require more capacity to maintain than simple ones. But the relationship isn’t linear. A well-designed architecture can have moderate κ\kappa while keeping η\eta low by distributing representational load efficiently. This is precisely what successful SSL methods achieve.

Consider the trade-off: adding regularization raises κ\kappa by pulling the constrained optimum further from ideal. But the right regularization (variance floors, decorrelation penalties) simultaneously lowers η\eta by preventing collapse modes that would consume all capacity maintaining degenerate structure. VICReg’s variance term increases κ\kappa slightly but dramatically reduces η\eta by ensuring dimensional spread. DINO’s momentum increases κ\kappa through temporal coupling but reduces η\eta by stabilizing the target manifold.

This threshold determines which self-supervised architectures can succeed. Methods that allow η\eta to drift toward ηc\eta_c collapse regardless of other design choices. Methods that actively constrain η\eta below critical discover stable representations. The “magic numbers” appearing across different SSL approaches—variance weights, momentum values, batch sizes, prediction horizons—encode the same physics: keep organizational overhead subcritical while accepting only necessary structural costs.

Each method implements this constraint differently. But all successful methods obey it.

VICReg: Variance, Invariance, Covariance

VICReg explicitly regularizes representation geometry through three terms4,

L=λLinv+μLvar+νLcov.\mathcal{L} = \lambda \mathcal{L}_{\text{inv}} + \mu \mathcal{L}_{\text{var}} + \nu \mathcal{L}_{\text{cov}}.

Lvar\mathcal{L}_{\text{var}} ensures no dimension collapses. Lcov\mathcal{L}_{\text{cov}} maintains dimensional independence. Linv\mathcal{L}_{\text{inv}} aligns augmented views. These three components define the constraint manifold.

The empirical finding—λ25,μ25,ν1\lambda \approx 25, \mu \approx 25, \nu \approx 1—appears arbitrary until batch normalization and scaling factors reveal effective values 0.04\sim 0.04. This clustering near ρ/1000.033\rho^*/100 \approx 0.033 is suggestive, though whether it reflects the same underlying constant or coincidental scaling requires further investigation.

The variance-collapse correlation is documented across every SSL method. SimCLR, BYOL, Barlow Twins, VICReg, and DINO all fail when variance drops toward zero—representations compress onto lower-dimensional subspaces until all embeddings become identical. This isn’t method-specific pathology; it’s a universal instability in constrained representation learning. The variance and covariance terms raise κ\kappa to prevent collapse but cannot raise it so high that training becomes rigid. The physics sets bounds: insufficient variance regularization allows η\eta to approach ηc\eta_c. Excessive regularization inflates κ\kappa beyond what gradient descent can compensate. VICReg operates in the narrow window where both constraints satisfy simultaneously.

Consider a 2048-dimensional embedding space. Without variance regularization, representations can collapse onto a lower-dimensional subspace, effectively reducing capacity. The variance term enforces minimum spread σ21\sigma^2 \geq 1 per dimension, maintaining the full representational manifold. The covariance term ijCij2\sum_{i \neq j} C_{ij}^2 penalizes correlation between dimensions, distributing information efficiently. Together they shape Mallowed\mathcal{M}_{\text{allowed}} to keep organizational overhead below critical.

DINO: Momentum, Asymmetry, and Timescales

DINO maintains coherence without explicit collapse-prevention terms5. The momentum update

θteachermθteacher+(1m)θstudent\theta_{\text{teacher}} \leftarrow m\theta_{\text{teacher}} + (1-m)\theta_{\text{student}}

implements a slow-moving target network with coherence-preserving geometry. This isn’t metaphorical physics—it’s the identical mathematical form of exponential relaxation in dissipative physical systems, thermal equilibration, and low-pass filtering. The equation is the same. Momentum m[0.996,0.9995]m \in [0.996, 0.9995] creates timescale separation:

  • fast updates track CD(t)\mathrm{CD}(t)
  • slow teacher evolution anchors coherence

The timescale ratio τ=1/(1m)\tau = 1/(1-m) ranges from 250 to 2000 steps. This matches the horizon over which representational drift must remain bounded—too fast and the student cannot track the moving target, too slow and the teacher fails to incorporate new structure. The convergence of BYOL, DINO, and MoCo on momentum values near 0.996 isn’t coincidental—it reflects the same relaxation timescale appearing in physical systems approaching equilibrium.

Centering and sharpening operations maintain the variance floor implicitly. The teacher network applies centering to remove trivial solutions where all outputs converge to a constant. Sharpening through temperature τs\tau_s concentrates probability mass,

Pt(x)=exp(ft(x)/τs)kexp(ft(xk)/τs).P_t(x) = \frac{\exp(f_t(x)/\tau_s)}{\sum_k \exp(f_t(x_k)/\tau_s)}.

With τs=0.04\tau_s = 0.04 to 0.10.1, this amplifies differences while maintaining bounded entropy. The combination keeps η\eta below ηc\eta_c through architectural geometry rather than explicit regularization terms. The momentum timescale effectively implements a moving constraint manifold that tracks but does not chase the student dynamics.

SimCLR: Contrastive Geometry and Percolation

Contrastive learning tiles representation space through negative samples6. The density threshold for manifold percolation follows

pc=1kp_c = \frac{1}{\langle k \rangle}

with effective dimension deff=dτd_{\text{eff}} = d\tau for temperature τ\tau. For d=128d=128 and τ=0.07\tau=0.07,

deff9.d_{\text{eff}} \approx 9.

The critical batch size becomes

Ncrit=exp(deff/ρ)15,N_{\text{crit}} = \exp(d_{\text{eff}}/\rho^*) \approx 15,

but safety margins and real optimization dynamics stretch this dramatically, yielding empirical minima around 4096.

SimCLR’s enormous batch requirement emerges from geometric necessity. Contrastive loss pulls positive pairs together while pushing negative pairs apart,

L=logexp(sim(zi,zj)/τ)k=12N1kiexp(sim(zi,zk)/τ).\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}.

The denominator sums over 2N12N-1 negative samples. Insufficient negatives fail to properly tile the embedding manifold—regions remain unexplored, creating attractors where representations can collapse. The percolation threshold marks where negative sample density suffices to cover the manifold with overlapping neighborhoods.

Temperature τ\tau controls effective dimensionality through the Boltzmann factor. Lower temperature concentrates probability mass on hard negatives, effectively reducing the space requiring coverage. The factor deff=dτd_{\text{eff}} = d\tau captures this compression. The batch size scaling Nexp(deff/ρ)N \propto \exp(d_{\text{eff}}/\rho^*) follows from requiring percolation across this effective space.

The physics: κ\kappa remains bounded only when the constraint manifold (defined by contrastive geometry) properly covers the representation space. Insufficient coverage inflates κ\kappa by creating unregulated regions where coherence deviates arbitrarily.

BYOL: Prediction Without Negatives

BYOL removes negatives but adds a predictor network forming a directional mapping7,

qθ(fθ(x))fξ(x).q_\theta(f_\theta(x)) \approx f_\xi(x').

The predictor increases dimensional capacity, reducing κ\kappa, while momentum maintains coherence through slow target updates. The architecture breaks symmetry—the student predicts the teacher, but not vice versa. This asymmetry provides implicit regularization preventing collapse.

Without the predictor, the system has a trivial solution: map all inputs to a constant. The predictor must learn a non-trivial transformation from student features to teacher features. This enforces representational structure.

The loss function

L=qθ(fθ(x))fξ(x)2\mathcal{L} = \|q_\theta(f_\theta(x)) - f_\xi(x')\|^2

operates on augmented views xx and xx' of the same input. The teacher parameters ξ\xi update via momentum from student parameters θ\theta, creating the same timescale separation as DINO.

The critical insight: negative samples become unnecessary when the constraint manifold achieves proper shape through predictor expansion and momentum stabilization. The predictor effectively samples from the learned representation distribution internally, replacing explicit negative mining with implicit density estimation. The momentum teacher provides the stable reference preventing collapse.

Empirical momentum values m0.996m \approx 0.996 match DINO, confirming that both methods exploit the same timescale physics. The predictor adds another mechanism—dimensional expansion—to the constraint toolkit.

Barlow Twins: Redundancy Reduction

Barlow Twins forces the cross-correlation matrix toward identity8,

LBT=i(1Cii)2+λijCij2.\mathcal{L}_{BT} = \sum_i (1 - C_{ii})^2 + \lambda \sum_{i\neq j} C_{ij}^2.

This implements classical efficient coding: independence across dimensions reduces representational maintenance overhead. The first term ensures each dimension activates (prevents collapse), while the second term decorrelates dimensions (distributes information efficiently).

The cross-correlation matrix between embeddings ZAZ^A and ZBZ^B from augmented views,

Cij=bzb,iAzb,jBb(zb,iA)2b(zb,jB)2.C_{ij} = \frac{\sum_b z^A_{b,i} z^B_{b,j}}{\sqrt{\sum_b (z^A_{b,i})^2} \sqrt{\sum_b (z^B_{b,j})^2}}.

Perfect decorrelation yields C=IC = I. Deviation from identity signals redundancy or collapse. The loss directly penalizes this deviation.

The scaling parameter λ[0.005,0.05]\lambda \in [0.005, 0.05] balances diagonal and off-diagonal terms. Lower values emphasize decorrelation over variance maintenance. The optimal range overlaps with VICReg’s effective weights, suggesting both methods discovered similar constraint boundaries through different paths.

Barlow Twins optimizes κ\kappa directly by minimizing representational redundancy. Each independent dimension contributes maximally to capacity. Correlated dimensions waste capacity by encoding the same information multiple times. The cross-correlation penalty explicitly targets this inefficiency.

The physics: organizational overhead η\eta measures the fraction of capacity required for structure maintenance. Redundant representations increase η\eta by dedicating multiple dimensions to the same features. Barlow Twins reduces η\eta by enforcing dimensional independence, keeping the system further from the critical threshold ηc=0.304\eta_c = 0.304.

JEPA: Temporal Depth and Recursive Coherence

JEPA (Joint-Embedding Predictive Architecture) discovers coherent structure through multi-step prediction9. The striking empirical finding: stable world models emerge at prediction horizons around h=10h = 10 steps. Shorter horizons fail to capture sufficient temporal structure. Longer horizons provide diminishing returns.

JEPA’s architecture predicts future latent representations rather than raw pixels,

st+h=fθ(st,at:t+h),s_{t+h} = f_\theta(s_t, a_{t:t+h}),

where sts_t represents the latent state, at:t+ha_{t:t+h} represents actions or context, and hh is the prediction horizon. The physics of recursive dynamics requires sufficient temporal depth to stabilize feedback loops. Too shallow and errors propagate exponentially—the system cannot model temporal dependencies required for coherence. Too deep and computational overhead (contributing to κ\kappa) grows faster than predictive accuracy improves.

The critical insight is that recursive self-modeling requires enough depth for the system to represent its own prediction process with sufficient fidelity. This is the β\beta-sector at work: hierarchical compression across temporal scales follows inflation–subdivision consistency where the recursive eigenvalue β\beta determines optimal depth. In physical systems, β=φ\beta = \varphi (the golden ratio); whether SSL architectures realize the same eigenvalue or discover an architecture-specific β\beta remains an open question. The constraint functional penalizes both insufficient depth (high CD(t)\mathrm{CD}(t) from unstable recursion) and excessive depth (high κ\kappa from unnecessary complexity).

Similar depth thresholds appear across architectures—transformers exhibit emergent reasoning capabilities around 10–12 layers, and biological memory consolidation operates over multiple synaptic time constants. The convergence suggests underlying constraints on recursive coherence in hierarchical temporal structures, though the precise numerical values depend on architecture-specific factors.

Collapse as Free-Energy Physics

Collapse ceases to be mysterious. It emerges when total deviation from ideal free energy grows too large,

F[qt]F=κ+CD(t).F[q_t] - F^* = \kappa + \mathrm{CD}(t).

Collapse occurs when

  • κ\kappa (structural costs) becomes too large
  • CD(t)\mathrm{CD}(t) (dynamic misalignment) grows too quickly
  • η\eta (organizational overhead) approaches ηc\eta_c

The system loses coherence when gradient dynamics cannot maintain alignment inside the constrained manifold. Each failure mode traces to one of these physical quantities,

  • Increasing regularization raises κ\kappa
  • Decreasing variance raises η\eta
  • Insufficient negative samples increase CD(t)\mathrm{CD}(t)
  • Insufficient depth restricts manifold shape
  • Insufficient batch size distorts geometry
  • Insufficient momentum destabilizes timescales

Every self-supervised collapse mode emerges from this framework. Consider concrete examples.

VICReg without variance terms: The constraint manifold permits dimensional collapse. As dimensions go to zero variance, η\eta approaches 1—all capacity goes to maintaining degenerate structure. The system crosses ηc\eta_c and collapses.

DINO with m=0.9m = 0.9: The momentum timescale τ=1/(1m)=10\tau = 1/(1-m) = 10 steps becomes too short. The teacher tracks student too closely, failing to provide stable reference. CD(t)\mathrm{CD}(t) grows unbounded as both networks chase each other’s drift.

SimCLR with batch size 256: The negative sample density falls below percolation threshold. Large regions of the manifold remain unexplored. Representations find attractors in these gaps and collapse despite contrastive loss.

BYOL without predictor: The system has trivial constant solution. κ=0\kappa = 0 because there’s no constraint preventing collapse. All inputs map to the same point, minimizing loss perfectly but learning nothing.

Barlow Twins with λ=0\lambda = 0: Only diagonal terms (variance) remain. Nothing prevents dimensions from becoming perfectly correlated. Redundancy inflates η\eta as multiple dimensions encode identical information. The system wastes capacity and drifts toward collapse.

JEPA with h=3h = 3: Prediction horizon too short for recursive closure. The system cannot model temporal dependencies required for coherence. CD(t)\mathrm{CD}(t) oscillates rather than converges—no stable optimum exists in the shallow temporal manifold.

Each failure mode maps to the free-energy decomposition. The physics provides a unified explanation for diverse collapse phenomena that previously appeared unrelated.

From Framework to Testable Observations

The constraint geometry framework suggests relationships between architectural choices and stability thresholds. The following observations are consistent with the physics but require further validation:

  • VICReg effective weights cluster near 0.03–0.05 after accounting for batch normalization scaling. This range lies close to ρ/1000.033\rho^*/100 \approx 0.033, though whether this reflects the same underlying constant or coincidental scaling remains to be established.

  • Momentum timescales across DINO, BYOL, and MoCo converge on τ=1/(1m)250\tau = 1/(1-m) \approx 25020002000 steps. The physics predicts that timescale separation is necessary for stable reference tracking, and these values match characteristic relaxation times in other self-organizing systems.

  • Contrastive batch sizes scale with effective dimensionality. The theoretical minimum from percolation arguments is much smaller than empirical requirements, suggesting that optimization noise and finite-sample effects dominate the practical threshold.

  • Depth thresholds for emergent capabilities appear around 10–12 layers in transformers, consistent with recursive self-modeling requirements, though architecture-specific factors (attention patterns, residual connections) complicate direct comparison.

  • Temperature parameters in contrastive methods cluster around τ[0.05,0.1]\tau \in [0.05, 0.1], balancing effective dimensionality against discrimination—a trade-off the framework predicts should exist.

The convergence across independent research groups is striking: momentum near 0.996, variance weights near 0.04, batch sizes clustering around 2048–4096. These narrow ranges suggest underlying constraint boundaries rather than arbitrary design choices. Whether these boundaries derive from ρ=3.29\rho^* = 3.29 specifically, or from more general properties of constrained optimization, remains an open question the framework helps to sharpen.

Unified View: Methods, Parameters, and Constraints

The convergence becomes striking when displayed systematically.

MethodKey ParameterPhysical InterpretationEmpirical RangeConstraint Type
VICRegμ,ν\mu, \nu (var/cov weights)Regularization maintaining η<ηc\eta < \eta_c0.03\sim 0.030.050.05Variance floor
DINOmm (momentum)Timescale separation τ=1/(1m)\tau = 1/(1-m)0.9960.9960.99950.9995Moving target
SimCLRNN (batch size)Manifold coverage density2048\sim 204881928192Negative sampling
BYOLPredictor depthSymmetry breaking2–3 layers + momentumAsymmetric capacity
Barlow Twinsλ\lambda (off-diagonal penalty)Redundancy reduction lowering η\eta0.0050.0050.050.05Decorrelation
JEPAhh (prediction horizon)Recursive temporal depthh10h \sim 10Temporal coherence
TransformersDD (depth)Recursive self-modeling capacityD12D \sim 12 typicalEmergent capabilities

Each row implements the same underlying physics—keeping organizational overhead below critical while balancing structural costs against optimization capacity—through different architectural mechanisms. The clustering of empirical values into narrow ranges across independent research groups suggests these methods discovered the same constraint boundaries through different optimization paths.

The framework interprets these convergences as evidence that successful SSL architectures satisfy thermodynamic constraints on coherence maintenance. Whether the specific values derive from ρ=3.29\rho^* = 3.29 or from more general properties of constrained optimization remains an open question, but the existence of sharp boundaries is well-documented empirically.

The Geometry of Coherence

Multiple self-supervised learning methods—built from variance regularization, momentum encoding, contrastive geometry, prediction, redundancy reduction, and temporal depth—converge on stable representations by obeying the same physical constraints. They shape their constraint manifolds differently, but all maintain η<ηc\eta < \eta_c, balance κ\kappa against optimization capacity, and control CD(t)\mathrm{CD}(t) through gradient dynamics.

The empirical “magic numbers” encode constraint geometry. VICReg’s variance weights prevent dimensional collapse. DINO’s momentum sets timescale separation for stable reference tracking. SimCLR’s batch size satisfies coverage requirements for the embedding manifold. These values cluster into narrow ranges because the underlying physics—free-energy minimization under architectural constraints—permits only certain configurations to remain stable.

Self-supervised learning works because constrained free-energy systems must maintain coherence within geometric bounds written into the mathematics of information processing. The methods discovered these constraints empirically through years of trial and error. The framework reveals why those particular solutions work and predicts where they will fail.

When disparate approaches converge on the same structure—when variance regularization, momentum, massive batches, predictors, decorrelation, and temporal depth all stabilize at similar thresholds—they trace the boundary of what constrained physics permits. The geometry determines allowable states. The organizational overhead sets critical thresholds. The free-energy decomposition explains collapse.

The constraint eigenvalue framework proposes a general triplet architecture (π,β,N)(\pi, \beta, N) governing coherence in any constrained system. Physical systems—from quantum transport10 to biological systems11 to gravitational horizons12—realize the specific eigenbranch (π,φ,10)(\pi, \varphi, 10) with ρ=4πφ2/103.29\rho^* = 4\pi\varphi^2/10 \approx 3.29 and ηc0.304\eta_c \approx 0.304. SSL systems exhibit sharp phase transitions between stable training and collapse, with transition boundaries clustering into narrow parameter ranges across independent implementations. Whether these boundaries derive from the same (π,φ,10)(\pi, \varphi, 10) eigenbranch or from an architecture-specific triplet realization remains an open question the framework helps to sharpen.

The methods work because they discovered architectures compatible with thermodynamic constraints on information processing. When we tune hyperparameters, we’re navigating the geometry of physically allowed states. When training succeeds, we’ve found configurations where organizational overhead remains below critical thresholds. Whether those thresholds derive from the specific constants appearing in the constraint eigenvalue framework, or from more general properties of constrained optimization, is a question the framework helps to sharpen and the empirical convergence helps to motivate.

Empirical Grounding

Several claims in this analysis rest on documented empirical findings from SSL literature:

Variance-collapse universality: Every major SSL method (SimCLR, BYOL, VICReg, Barlow Twins, DINO) exhibits the same failure mode when embedding variance drops toward zero. Representations compress onto degenerate subspaces regardless of the specific objective function. This universal instability motivates the free-energy interpretation as a shared underlying structure.

Momentum as physical relaxation: The momentum update θt=mθt1+(1m)θnew\theta_t = m\theta_{t-1} + (1-m)\theta_{\text{new}} is mathematically identical to exponential relaxation in dissipative systems, thermal equilibration, and low-pass filtering. This isn’t analogy—it’s the same differential equation appearing in physical processes. The convergence of BYOL, DINO, and MoCo on momentum values near 0.996 reflects empirical optimization converging on the same relaxation timescale.

Convergent numerical ranges: Independent research groups using different theoretical motivations consistently discover the same hyperparameter ranges: momentum ~0.996-0.999, variance regularization weights ~0.01-0.05, batch sizes ~2048-8192, transformer depths ~10-12 for emergent reasoning. These narrow ranges suggest underlying constraint boundaries rather than arbitrary design choices.

Collapse as phase transition: SSL models exhibit sharp transitions between stable training and complete collapse when certain parameters cross thresholds. Loss curves show smooth descent followed by sudden divergence. Embedding statistics show gradual variance decay followed by abrupt dimensional compression. This matches the phenomenology of phase transitions in constrained systems.

The universal constants (ρ=3.29\rho^* = 3.29, ηc=0.304\eta_c = 0.304) appearing across these domains remain a theoretical proposal. The clustering of empirical values around predictions derived from these constants is striking but requires further validation. The framework’s value lies in unifying documented phenomena under a single geometric interpretation and generating testable predictions about where methods will succeed or fail.

Footnotes

  1. Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127-138.

  2. Landauer, R. (1961). Irreversibility and Heat Generation in the Computing Process. IBM Journal of Research and Development, 5(3), 183-191.

  3. Hawking, S. W. (1975). Particle Creation by Black Holes. Communications in Mathematical Physics, 43(3), 199-220.

  4. Bardes, A., Ponce, J., & LeCun, Y. (2022). VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning. International Conference on Learning Representations. https://arxiv.org/abs/2105.04906

  5. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., & Joulin, A. (2021). Emerging Properties in Self-Supervised Vision Transformers. International Conference on Computer Vision, 9650-9660. https://arxiv.org/abs/2104.14294

  6. Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A Simple Framework for Contrastive Learning of Visual Representations. International Conference on Machine Learning, 1597-1607. https://arxiv.org/abs/2002.05709

  7. Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., … & Valko, M. (2020). Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. Advances in Neural Information Processing Systems, 33, 21271-21284. https://arxiv.org/abs/2006.07733

  8. Zbontar, J., Jing, L., Misra, I., LeCun, Y., & Deny, S. (2021). Barlow Twins: Self-Supervised Learning via Redundancy Reduction. International Conference on Machine Learning, 12310-12320. https://arxiv.org/abs/2103.03230

  9. Assran, M., Duval, Q., Misra, I., Bojanowski, P., Vincent, P., Rabbat, M., LeCun, Y., & Ballas, N. (2023). Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture. Conference on Computer Vision and Pattern Recognition, 15619-15629. https://arxiv.org/abs/2301.08243

  10. Harper, P. G. (1955). Single Band Motion of Conduction Electrons in a Uniform Magnetic Field. Proceedings of the Physical Society A, 68(10), 874-878.

  11. West, G. B., Brown, J. H., & Enquist, B. J. (1999). The fourth dimension of life: fractal geometry and allometric scaling of organisms. Science, 284(5420), 1677-1679.

  12. Bekenstein, J. D. (1973). Black Holes and Entropy. Physical Review D, 7(8), 2333-2346.