essay / control systems / representation learning
Inverse Problems as Memory Compression in Control Systems
An essay on perturbation, task-relative truth, compressed representations, universality, and the strange information chemistry created by the pursuit of new objectives.
Central thesis
A representation is a compressed record of the world’s controllable distinctions. An inverse problem is the attempt to find that record. A task is the lens that decides which distinctions matter. And the pursuit of new tasks is one of the most powerful ways a system can discover new structure in reality.
Introductory considerations
The closest established term for the idea that different systems may converge toward equivalent learned feature spaces while pursuing the same task is universality, especially in mechanistic interpretability. In classical systems theory, the older analogue is minimal realization: for linear time-invariant systems, the smallest state-space model preserving the same input-output behavior is reachable and observable, and its coordinates are unique only up to a similarity transformation.[1]
In automata and formal language theory, a related idea is Myhill–Nerode equivalence: histories that cannot be distinguished by any possible future continuation collapse into the same minimal state.[6] In controlled stochastic systems and reinforcement learning, related notions include predictive state representations, bisimulation, and behavioral equivalence.[7][8]
These are not identical theories. But they all point toward the same deep principle: the right internal state is the most compressed memory that preserves all distinctions relevant to future action.
Inverse problems as compressed memory
An inverse problem is usually described as the problem of inferring hidden causes from observable effects. Given measurements, recover the source; given a signal, recover the system; given behavior, recover the latent state, parameters, or law that produced it. But in a control system, this description is incomplete. A control system does not merely observe the world. It perturbs the world, receives responses, stores something about those responses, and uses that stored representation to act.
From this perspective, an inverse problem is not simply a reconstruction problem. It is a memory compression problem under intervention.
A controller does not need to know everything about the system it controls. It needs to remember the distinctions that matter for the control objective. If two hidden states, parameters, or histories lead to the same future action-relevant consequences under all admissible perturbations, then for that task they are the same state. Conversely, if a tiny hidden distinction changes the outcome of a future control action, then a good representation must preserve it.
The representation space of the inverse problem is therefore not the full state space of reality. It is the quotient of reality induced by perturbation, observation, and objective.
A dynamical system can be viewed as a relation between manipulable inputs, observable outputs, and disturbances. System identification traditionally proceeds by designing input signals, running experiments, measuring responses, and fitting models whose complexity should depend on the purpose for which they are built.[2] That last phrase is crucial. The “true” model for control is not necessarily the most detailed model. It is the most economical model that preserves the relevant causal affordances of the system.
A generic controlled system can be written as

x_{t+1} = f(x_t, u_t, θ) + w_t,    y_t = g(x_t, θ) + v_t.

Here x_t is the latent state, u_t is a control input or perturbation, y_t is the observation, θ is an unknown parameter or structure, and w_t, v_t represent process and measurement noise. The controller does not directly receive x_t or θ. It receives only the history h_t of past actions and observations.
A standard inverse problem asks for an estimate of the hidden state or hidden parameters. A control-centered view asks for something subtler: a compressed memory state r_t = φ(h_t) that is sufficient for future action. The representation is good if a policy can achieve the same task performance from r_t as it could from the full history h_t.
The representation is not judged by whether it stores all information. It is judged by whether it discards the right information.
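One way to state this criterion, as a sketch in notation not used elsewhere in this essay (J for expected task performance, π for a policy, and the maximization over policies are assumptions introduced here):

```latex
% Task-sufficiency of a compressed memory r_t = \phi(h_t): a policy reading only r_t
% should achieve what a policy reading the full history h_t can achieve.
\max_{\pi:\, r \mapsto u} J(\pi \circ \phi) \;=\; \max_{\pi:\, h \mapsto u} J(\pi),
\qquad \text{with } \phi \text{ as lossy as the task tolerance allows.}
```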
Perturbations are questions asked of the system
The obtainable representation space is constrained by the perturbations one is allowed to perform. A system reveals different aspects of itself under different interventions. A passive observer sees only the trajectories produced by the environment. An active controller can ask questions.
A sinusoidal perturbation probes frequency response. A step input reveals time constants and steady-state gains. A randomized excitation can reveal couplings hidden under ordinary operation. A task-directed perturbation asks an even more selective question: what do I need to know in order to do this?
This means that every inverse problem has an information cone. Inside the cone are distinctions that can be revealed by the allowed perturbations and observations. Outside the cone are distinctions that may exist in the physical system but remain unidentifiable, irrelevant, or both.
Control theory recognizes this through ideas such as persistent excitation and optimal experiment design. Persistent excitation ensures that the inputs are rich enough to make certain parameters identifiable.[3] More generally, an experiment is valuable when it distinguishes between competing compressed models that would otherwise make the same predictions.
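A minimal numerical sketch of that last point, using an assumed first-order system y_{t+1} = a·y_t + b·u_t; the system, its parameters, and the input signals are all illustrative:

```python
# Sketch: persistent excitation for identifying y[t+1] = a*y[t] + b*u[t].
# The first-order system, its parameters, and the input signals are illustrative assumptions.
import numpy as np

a_true, b_true, T = 0.9, 0.5, 200
rng = np.random.default_rng(0)

def simulate(u, y0):
    """Roll the noise-free system forward from y0 under the input sequence u."""
    y = np.empty(T + 1)
    y[0] = y0
    for t in range(T):
        y[t + 1] = a_true * y[t] + b_true * u[t]
    return y

def regressor_rank(u, y):
    # Least-squares regressor Phi with rows [y[t], u[t]]; identifying (a, b) needs rank 2.
    Phi = np.column_stack([y[:-1], u])
    return np.linalg.matrix_rank(Phi)

# A constant input applied at steady state keeps y constant too, so the regressor
# columns are collinear and the two parameters cannot be separated from the data.
u_const = np.ones(T)
y_const = simulate(u_const, y0=b_true / (1.0 - a_true))
print("constant input at steady state, rank:", regressor_rank(u_const, y_const))  # 1

# A random binary excitation keeps the regressor full rank: (a, b) become identifiable.
u_prbs = rng.choice([-1.0, 1.0], size=T)
y_prbs = simulate(u_prbs, y0=0.0)
print("random binary input, rank:", regressor_rank(u_prbs, y_prbs))               # 2
```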
A representation is not merely learned from data. It is learned from questions posed to the system. The chosen perturbations define the grammar of possible answers.
Compression as a measure of representation quality
The claim that a more compressed representation achieving the same task is likely better has strong relatives in information theory and statistical learning. The minimum description length principle treats regularities in data as opportunities for compression: to learn is, in part, to find the shorter description that still explains the observations.[4] The information bottleneck method similarly formalizes the search for a short code that preserves information about a relevant variable.[5]
For control, the relevant variable is not simply a label or prediction target. It is the future task outcome under possible interventions. A control-theoretic information bottleneck would compress the action-observation history H into a representation R while preserving information about future controllable consequences Z.
Here Z might include rewards, reachable sets, constraint violations, stability margins, goal-conditioned outcomes, or predicted observations under future action sequences. This gives a task-relative meaning of truth. A representation is “truer” not because it contains more variables, but because it captures the invariants that survive counterfactual intervention. It is true in the sense that it supports successful action across the relevant perturbation family.
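Written in information-bottleneck notation, the trade-off might be sketched as follows; the encoder q(r | h) and the trade-off weight β are notation introduced here rather than taken from the cited sources:

```latex
% Control-flavoured information bottleneck (sketch): forget as much of the
% history H as possible while keeping what predicts the task-relevant futures Z.
\min_{q(r \mid h)} \; I(H; R) \;-\; \beta \, I(R; Z)
```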
Compression alone is not enough. A representation can be too compressed. It can discard distinctions that matter in rare regimes, under distribution shift, or under newly introduced tasks. Compression is a quality measure only under a fixed or explicitly stated objective, perturbation class, and tolerance. The best representation is not the smallest representation absolutely. It is the smallest representation that preserves the required behavioral distinctions.
Minimal realization: the classical backbone
The cleanest classical analogue appears in linear systems theory. A linear time-invariant system may have many state-space realizations that produce the same input-output transfer function. Some realizations include redundant internal coordinates: states that cannot be reached by any input, or states that cannot affect any output. These states are not behaviorally meaningful for input-output control.
The minimal realization theorem says that a realization is minimal precisely when it is both reachable and observable. The Kalman decomposition separates the reachable and observable part from unreachable or unobservable components, and minimal structures are unique up to similarity transformation rather than unique in their raw coordinate representation.[1]
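A small numerical illustration of these rank conditions, using an invented three-state realization in which one coordinate is deliberately disconnected from the input:

```python
# Sketch: checking reachability and observability ranks of an assumed LTI realization.
# The matrices below are illustrative; the third state is decoupled from the input.
import numpy as np

A = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.3, 0.0],
              [0.0, 0.0, 0.8]])
B = np.array([[0.0],
              [1.0],
              [0.0]])   # the third state can never be excited by the input
C = np.array([[1.0, 0.0, 1.0]])

n = A.shape[0]
# Reachability matrix [B, AB, A^2 B] and observability matrix [C; CA; CA^2].
R = np.hstack([np.linalg.matrix_power(A, k) @ B for k in range(n)])
O = np.vstack([C @ np.linalg.matrix_power(A, k) for k in range(n)])

print("reachability rank:", np.linalg.matrix_rank(R))   # 2 < 3: not reachable
print("observability rank:", np.linalg.matrix_rank(O))  # 3
print("minimal:", np.linalg.matrix_rank(R) == n and np.linalg.matrix_rank(O) == n)
```

Because the reachability matrix is rank-deficient, the third coordinate carries no behaviorally meaningful information for input-output control and can be dropped without changing the input-output behavior.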
This is almost exactly the principle of inverse problems as memory compression. The controller does not need arbitrary hidden coordinates. It needs a minimal state that preserves the same input-output behavior. Internal coordinates may change, but the behaviorally relevant structure remains.
That also clarifies the feature-space convergence idea. If two different systems, algorithms, or neural networks solve the same sufficiently constraining control problem, we should not expect their internal variables to be identical neuron-by-neuron or coordinate-by-coordinate. We should expect equivalence up to transformation: a shared latent organization, common invariants, analogous circuits, or mutually translatable representations.
In other words, the correct analogue of “the same learned feature space” is usually not literal equality. It is behavioral isomorphism.
Equivalence classes of histories
The same idea can be stated more generally. Suppose two histories h and h′ have produced different observations. Are they truly different states for the controller?
They are different only if some future experiment can distinguish them in a task-relevant way. Define h ∼ h′ if, for every admissible future action sequence and every task-relevant outcome, the predictions from h and h′ are the same. Then the compressed state is the equivalence class [h].
This is the control version of a pattern that appears in several fields. Myhill–Nerode theory constructs minimal automata by merging strings that cannot be distinguished by future continuation.[6] Predictive state representations model controlled dynamical systems by representing state as predictions about observable outcomes of future experiments rather than as hidden nominal variables.[7] Bisimulation groups states by behavioral equivalence, often comparing rewards and transition distributions, and has become a control-relevant representation-learning tool in reinforcement learning and model predictive control.[8]
These theories all say, in different languages: state is memory modulo future distinguishability.
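As a concrete miniature of the automata-theoretic version, the sketch below merges the states of an invented four-state automaton whenever no future input string can separate them, via a simplified partition refinement:

```python
# Sketch: merging states that no future experiment can distinguish
# (Myhill–Nerode-style minimization by partition refinement).
# The automaton is an invented example: states 1 and 2 behave identically.
ALPHABET = ("a", "b")
DELTA = {  # (state, symbol) -> next state
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 0,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 3, (3, "b"): 3,
}
ACCEPTING = {3}
STATES = {0, 1, 2, 3}

def minimize(states, delta, accepting):
    # Start from the coarsest task-relevant split: accepting vs. non-accepting.
    blocks = [b for b in (accepting & states, states - accepting) if b]
    while True:
        block_of = {q: i for i, blk in enumerate(blocks) for q in blk}

        def signature(s):
            # Own block plus the block each one-step experiment leads to.
            return (block_of[s],) + tuple(block_of[delta[(s, c)]] for c in ALPHABET)

        refined = {}
        for s in states:
            refined.setdefault(signature(s), set()).add(s)
        new_blocks = list(refined.values())
        if len(new_blocks) == len(blocks):   # no block split further: fixed point
            return new_blocks
        blocks = new_blocks

print(minimize(STATES, DELTA, ACCEPTING))
# States 1 and 2 end up in the same block: histories reaching them are equivalent.
```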
Multi-task learning as representation enrichment
A single task produces a representation optimized for one slice of reality. It compresses aggressively around what matters for that task and discards what does not. This can be powerful, but also narrow. The representation may be efficient while still being brittle.
Now suppose the same system must solve many tasks: T_1, T_2, …, T_n. A shared representation R is useful if, for each task, there exists a policy or decoder that achieves near-optimal performance from the same internal memory. This shared representation must preserve distinctions relevant across multiple objectives.
As tasks accumulate, equivalence classes become finer in some directions and more structured in others. The representation is forced to discover latent factors that explain many different forms of controllability.
This is where deep neural networks become especially interesting. Their learned weights are not merely parameters for one mapping. In a multi-task or multi-modal model, the weights become a shared memory substrate shaped by many objectives. Vision, language, code, music, mathematical reasoning, physical prediction, and interactive control all impose different constraints. When these constraints meet inside the same representational medium, they can produce unexpected internal organizations.
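A toy sketch of such a shared substrate, with arbitrary layer widths and invented task names, just to make the structure concrete:

```python
# Sketch: a single compressed representation shared by several task heads.
# Layer widths, task names, and dimensions are illustrative assumptions, not a recipe.
import torch
from torch import nn

class SharedRepresentation(nn.Module):
    def __init__(self, obs_dim: int, rep_dim: int, task_out_dims: dict[str, int]):
        super().__init__()
        # phi: observation/history features -> compressed memory r
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, rep_dim),
        )
        # One lightweight decoder or policy head per task, all reading the same r.
        self.heads = nn.ModuleDict({
            name: nn.Linear(rep_dim, out_dim) for name, out_dim in task_out_dims.items()
        })

    def forward(self, obs: torch.Tensor, task: str) -> torch.Tensor:
        r = self.encoder(obs)          # shared compressed state
        return self.heads[task](r)     # task-specific readout

model = SharedRepresentation(obs_dim=32, rep_dim=16,
                             task_out_dims={"reach_goal": 4, "predict_next_obs": 32})
out = model(torch.randn(8, 32), task="reach_goal")
print(out.shape)  # torch.Size([8, 4])
```

Every gradient step taken for one task reshapes the encoder that every other task reads from, which is where the interaction between objectives enters.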
Some of those organizations may be genuinely compressive. A model trained on language and code may discover shared abstractions about syntax, recursion, reference, hierarchy, symbolic substitution, and long-range dependency. A model trained on music and language may discover temporal motifs, expectation, variation, rhythm, and phrase structure. A model trained on vision and action may discover object permanence, affordance, geometry, and causality. The representation becomes richer not because it stores separate databases for each domain, but because it discovers reusable invariants.
This is the sense in which multi-task learning can produce something like an emergent information chemistry. Objectives interact. Gradients from one task reshape features used by another. A representation that was once merely visual becomes useful for sound. A feature first learned for language becomes useful for code. A structure learned for mathematics becomes useful for music. New tasks act like reagents introduced into a latent chemical medium.
But the chemistry is nonlinear. Multi-task learning can also create interference, spurious shortcuts, or overgeneral abstractions. The richer representation is not guaranteed. It emerges when tasks are sufficiently diverse, sufficiently compatible, and sufficiently grounded by perturbations that punish false compression.
Universality: convergent feature spaces across systems
The machine-learning term closest to the idea of different models converging toward common internal features is universality. In mechanistic interpretability, universality is the hypothesis that different neural networks trained on similar tasks may learn similar features, circuits, or algorithms. A Transformer Circuits essay describes universality as repeated structure across networks and suggests that if features and circuits are universal, understanding one model can transfer to others.[9]
Work on early vision in neural networks likewise argues that early visual features and circuits recur across architectures and tasks, with some neuron families appearing across models.[10] More recent work has studied mechanistic similarity across architectures such as Transformers and Mamba models, explicitly asking whether different neural networks converge to similar algorithms on similar tasks.[11]
This does not prove that all models converge to one true feature space. The stronger claim would be too broad. But it supports a weaker and more plausible principle: when a task imposes strong enough constraints, and when optimization finds efficient enough solutions, different systems may converge toward equivalent internal abstractions, up to coordinate transformations, permutations, superposition, and implementation details.
This is exactly what minimal realization would lead us to expect in the clean linear case. It is also what Myhill–Nerode equivalence suggests in the automata case. And it is what predictive-state and bisimulation ideas suggest in controlled stochastic systems. The common theme is not identical internal machinery. The common theme is convergence toward a minimal sufficient structure.
In deep networks, this convergence may be obscured by superposition. Neural networks can pack many sparse features into fewer representational dimensions, leading individual neurons to respond to multiple unrelated concepts. Toy models of superposition describe this as networks storing additional sparse features in superposition, which makes the observed neuron basis less directly interpretable.[12] Thus two models may share a feature space in a deeper sense even if no single neuron lines up cleanly between them.
The task discovers the representation
The most radical implication is that representation learning is not merely about collecting more data. It is about discovering better tasks.
A task is not just a benchmark. It is an interrogation protocol. It tells the system what distinctions matter. It defines which perturbations will be attempted, which consequences will be measured, and which compressions will be rewarded.
A system trained only to classify images may learn visual categories. A system trained to act in the world must learn affordances. A system trained to predict language may learn syntax and semantics. A system trained to write code must learn executable structure, modularity, and formal constraint. A system trained on mathematics must learn invariance, proof-like transformation, abstraction, and compositional reasoning. A system trained on music must learn temporal expectation, recurrence, symmetry, and expressive deviation.
When these tasks are pursued together, the shared representation may find bridges that no single task would require. The model may discover that rhythm and syntax share hierarchical temporal structure; that code and mathematics share symbolic transformation; that language and social reasoning share latent models of agents; that vision and action share object-centered causal structure.
The pursuit of novel tasks therefore becomes a method of scientific discovery. Not because the system is explicitly told the laws of nature, but because each new task forces the representational memory to reorganize around new invariants. The world is not revealed all at once. It is revealed through the family of things one tries to do.
Toward systems that seek new objectives
The design principle that follows is simple:
Do not build intelligent systems only to optimize known objectives. Build systems that search for objectives whose pursuit improves their compressed, transferable representation of the world.
Such a system would not merely ask, “How do I maximize reward on this task?” It would ask, “What new task would force me to discover a representation that compresses more of the world while preserving more controllable consequences?”
The system would seek tasks that reduce uncertainty, merge previously separate domains, expose hidden variables, or create reusable abstractions. This is not curiosity in the shallow sense of novelty seeking. It is curiosity as representation pressure: the search for interventions that improve the minimal memory needed to control, predict, and understand.
In scientific practice, this is already familiar. A good experiment is not merely one that collects data. It is one that distinguishes between competing compressed models. A good theory is not one that stores every measurement. It is one that makes many phenomena fall out of a smaller set of principles. A good control representation is not one that mirrors every physical detail. It is one that preserves the distinctions needed for robust action.
Reality as discovered through controllable compression
Inverse problems, viewed through control, are not passive acts of reconstruction. They are active acts of compression. The system perturbs the world, observes the response, and compresses its history into a memory state sufficient for future action. The quality of that memory is measured by how little it stores while preserving how much it can do.
Single tasks produce narrow compressions. Multiple tasks produce shared representational pressures. Novel tasks can create unexpected internal chemistry. Across systems pursuing similar tasks, feature spaces may converge not because there is one privileged coordinate system, but because there are shared behavioral invariants that any efficient controller must represent.
The deepest version of the thesis is this: a representation is a compressed record of the world’s controllable distinctions. An inverse problem is the attempt to find that record. A task is the lens that decides which distinctions matter. And the pursuit of new tasks is one of the most powerful ways a system can discover new structure in reality.
References
- MIT OpenCourseWare, Dynamic Systems and Control, Lecture 21. Used for minimal realization, reachability, observability, and similarity-transform equivalence.
- ETH Zürich Control Systems Lab, “System Identification”. Used for the system-identification framing of models learned from dynamical input-output data.
- ScienceDirect Topics, “Persistent Excitation Condition”. Used for persistent excitation as a control and identification condition.
- Peter Grünwald, The Minimum Description Length Principle. Used for the compression-based view of learning and model selection.
- Tishby, Pereira, and Bialek, “The Information Bottleneck Method”. Used for the idea of compressing one variable while preserving information about a relevant variable.
- Cornell CS, “Myhill–Nerode Theorem” handout. Used for the equivalence-class view of minimal automata.
- Littman, Sutton, and Singh, “Predictive State Representations”. Used for the idea of representing state through predictions about future tests or experiments.
- “A Survey of Bisimulation Metrics in Markov Decision Processes and Reinforcement Learning”. Used for behavioral equivalence and bisimulation-related representation learning.
- Transformer Circuits, “Interpretability Dreams”. Used for the mechanistic-interpretability framing of universality across models.
- Distill, “An Overview of Early Vision in InceptionV1”. Used for recurring early-vision features and circuits across neural networks.
- “Mechanistic Similarity of Neural Network Architectures”. Used for the question of whether different architectures learn similar algorithms on similar tasks.
- Elhage et al., “Toy Models of Superposition”. Used for feature superposition as an explanation for why shared latent structure may not align neuron-by-neuron.