Knowledge Discovery Efficiency (KEDE) and Ashby's Law of Requisite Variety

Abstract

We address real-world applications of Ashby's Law by adopting Ashby's strict black-box perspective: only external behaviour is observable. First we define the multi-stage selection process of narrowing down and selecting the appropriate response from the set of alternative responses as the Knowledge Discovery Process. We then establish H(X|Y) as the knowledge to be discovered, i.e. the gap in internal variety that must be compensated by selection. This quantifies how much disorder the regulator still permits and, conversely, how close the system comes to meeting Ashby's requisite-variety condition. In information-theoretic terms, perfect regulation requires H(X|Y) = 0. We then quantify the knowledge to be discovered, H(X|Y), based on the observable outcomes. Building on this result, we generalize Knowledge Discovery Efficiency (KEDE), a scalar metric that quantifies how efficiently a system closes the gap between the variety demanded by its environment and the variety embodied in its prior knowledge. KEDE operationalises requisite variety when internal mechanisms remain opaque, offering a diagnostic tool for evaluating whether biological, artificial, or organisational systems absorb environmental complexity at a rate sufficient for effective regulation. Finally, we present applications of KEDE in diverse domains, including typing the longest English word, measuring software development, testing intelligence, a basketball game, assembling furniture, and the speed of light in a medium.

Introduction

The Law of Requisite Variety, formulated by W. Ross Ashby, states that for a system to effectively regulate its environment, it must have at least as much variety/complexity as its environment. This principle is foundational in disciplines such as cybernetics, control theory, and machine learning.

The concept of requisite variety has since been applied across diverse domains, including organizational theory, ecology, and information systems. It underscores the necessity for systems to adapt to environmental complexity in order to maintain stability and achieve intended outcomes.

Real-world attempts to apply Ashby's Law of Requisite Variety face three persistent obstacles. (i) Combinatorial explosion: enumerating all relevant states of a system and its environment quickly becomes intractable, especially when hidden or unmeasured variables are present. (ii) Dual control dilemma: a regulator must simultaneously amplify its own control variety and attenuate external variety—an optimization that is delicate in multiscale, hierarchical, and time-varying settings such as digital ecosystems or military command structures. (iii) Resource constraints: limited data, computational power, and organisational capacity often preclude sophisticated control architectures. Existing remedies—markup-language state catalogues, iterative multidimensional sampling, and distributed self-organising controllers—mitigate but do not eliminate these limitations.

In section 2, we provide a detailed overview of Ashby's Law of Requisite Variety, including its mathematical formulation and implications for system regulation. We also discuss the present-day understanding of residual variety and its significance in the context of Ashby's Law. In section 3, we discuss the challenges of applying Ashby's Law to real-world systems, including combinatorial explosion, the dual control dilemma, and resource constraints. We propose a solution to these challenges: treat the system as a black box, observe the probability of successful outcomes to disturbances, and estimate the gap in its internal variety based on that. In section 4, we begin establishing the solution by introducing the Knowledge Discovery Process: the multi-stage process of narrowing down and selecting the appropriate response from a set of alternative responses. We then establish H(X|Y) as the knowledge to be discovered, i.e. the gap in internal variety that must be compensated by selection. In section 5, we show how to quantify the knowledge to be discovered, H(X|Y), based on the observable outcomes. In section 6, we generalize Knowledge Discovery Efficiency (KEDE), a scalar metric that quantifies how efficiently a system closes the gap between the variety demanded by its environment and the variety embodied in its prior knowledge. Finally, in section 7, we explore applications of KEDE in various domains, demonstrating its utility as a diagnostic tool for evaluating system performance and adaptability.

The Law of Requisite Variety

In practice, the question of regulation usually arises in this way: The essential variables E are given, and also given is the set of states S in which they must be maintained if the organism is to survive (or the industrial plant to run satisfactorily). These two must be given before all else. Before any regulation can be undertaken or even discussed, we must know what is important and what is wanted. ... It is assumed that outside considerations have already determined what is to be the goal, i.e. what are the acceptable states S. Our concern...is solely with the problem of how to achieve the goal in spite of disturbances and difficulties[1].

Given a set of elements, its variety is the number of elements that can be distinguished. Thus the set {g b c g g c } has a variety of 3 letters. Variety comprises any attribute of a system capable of multiple 'states' that can be made different or changed.
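This counting definition is direct to compute; a minimal sketch using Ashby's example set:

```python
# Variety of a set is the number of distinguishable elements.
letters = ["g", "b", "c", "g", "g", "c"]
variety = len(set(letters))
print(variety)  # 3
```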

The Law of Requisite Variety, formulated by W. Ross Ashby states that:

For a system to effectively regulate its environment, it must have at least as much variety as its environment

Ashby's Law holds across diverse disciplines, including informatics, system design, cybernetics, communications systems, and information systems.

Mathematical Formulation

In Ashby's terms, we can think of the system as a transducer:

D → R → O → E

Where:

  • D is the set of disturbances, representing the possible states of disturbances that a system may experience.
  • R is the set of responses, representing the possible regulatory actions that counteract disturbances.
  • O is the set of realized outcomes, representing the possible outcomes that can result from the disturbance-response pairs (D, R), without any implication of desirability.
  • E is the set of values / essential-variable states that the system aims to achieve, induced by a valuation mapping v : O → E. The scale of values in E may be as simple as the 2-element set {good, bad}, and is commonly an ordered set.

This can be represented as a Table of Outcomes (T) where:

  • T is the pay-off matrix, i.e. a fixed mapping T : D x R → O, the fixed transition rule of the environment.
  • Rows represent disturbances D.
  • Columns represent regulatory responses R.
  • Each cell holds a realized outcome o = T(d, r) ∈ O, where d is a disturbance and r is a regulatory response, so the entries collectively form O = {o₁₁, o₁₂, o₁₃, ..., o₂₁, o₂₂, o₂₃, ...}.
T                  R
        r₁    r₂    r₃    ...
D  d₁   o₁₁   o₁₂   o₁₃   ...
   d₂   o₂₁   o₂₂   o₂₃   ...
   d₃   o₃₁   o₃₂   o₃₃   ...
   d₄   o₄₁   o₄₂   o₄₃   ...
   ...  ...   ...   ...   ...

The Table of Outcomes (T) is a fixed, structured space of possibilities, or the fixed transition rule of the environment, which directly determines the essential variables E. In simplified analyses, O and E are sometimes conflated, but they are conceptually distinct. In the deterministic pay-off-matrix case, the environment is the fixed function T : D x R → O, and evaluation is v : O → E. If D and R are treated as random variables, then the induced conditional distribution is P(E | D, R) determined by E = v(T(D,R)).
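As a minimal sketch (the disturbances, responses, outcomes, and the valuation are all illustrative), the fixed transition rule T : D x R → O and the valuation v : O → E can be modelled as lookup tables:

```python
# Hypothetical pay-off table T : D x R -> O (the fixed transition rule).
T = {
    ("d1", "r1"): "o11", ("d1", "r2"): "o12",
    ("d2", "r1"): "o21", ("d2", "r2"): "o22",
}

# Valuation v : O -> E over the 2-element scale {good, bad}.
v = {"o11": "good", "o12": "bad", "o21": "bad", "o22": "good"}

def essential_state(d, r):
    """Evaluate a disturbance-response pair: E = v(T(d, r))."""
    return v[T[(d, r)]]

print(essential_state("d1", "r1"))  # good
print(essential_state("d1", "r2"))  # bad
```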

The regulator's accumulated structure M is its learned law of action. In the deterministic case, M : D → R. In general (learning, uncertainty), treat it as a policy P(R | D), i.e. how probability mass is allocated across responses for each disturbance.

If the Table of Outcomes T is the fixed space of possibilities, then M is the mechanism that induces a measure over possible paths through that space. Across time, the object M is not fixed. Changes in M alter which disturbance-action pairs become more or less probable and therefore how the system's realized trajectories are distributed over the Table of Outcomes T. The probability mass is shifted away from previously selected or ineffective responses and reassigned to alternative responses for the same disturbance.

Each input from D is transformed via selection into a response r ∈ R, which is then processed to produce an outcome o ∈ O, which is then evaluated as e ∈ E, the essential-variable state that the system aims to achieve, induced by the valuation mapping v : O → E.

If R does nothing, i.e. keeps to one value, then the variety in D threatens to go through O to E, contrary to what is wanted. It may happen that O, without change by R, will block some of the variety and occasionally this blocking may give sufficient constancy at E. More commonly, a further suppression at E is necessary; it can be achieved only by further variety at R.

In Ashby's original formulation, a disturbance-response pair determines a realized outcome in the pay-off table, i.e. T : D x R → O. The goal is not defined directly on O by default. Rather, the facts determine a further valuation mapping v : O → E, where E is the set of values or essential-variable states. Given a subset of acceptable values of E defined as the goal, the inverse mapping of this subset will define, over O, the subset S of acceptable outcomes[8]. Thus is defined a relation equivalent to “ri, as response to di, gives an acceptable outcome s ∈ S”[8]. Thus O and E are conceptually distinct: multiple realized outcomes may map to the same value in E. In later cybernetic restatements, the law is often written directly in terms of the essential variables E, because regulation is evaluated by the residual variety that reaches the system's viability-relevant variables[52][54]. That compressed form is exact only under an additional modeling assumption that the control-relevant distinctions in O are preserved at the E-layer, e.g. when there is a one-to-one mapping between outcome and essential-variable state. Accordingly, in this article we keep the full Ashby structure D x R → O followed by O → E, and treat direct statements in terms of E as a reduced representation used only when that identification is explicitly assumed.

Under that additional assumption, the achievable variety in the values / essential-variable states cannot be less than the quotient of the variety of the disturbances divided by the variety of the responses:

V(E) ≥ V(D) / V(R)

Where:

  • V(D) = variety in the disturbances
  • V(R) = variety in the regulatory responses
  • V(E) = variety in the essential variables

This inequality shows that the variety in the essential variables cannot be reduced below the ratio of the variety in the disturbances to the variety in the regulatory responses.
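A toy numerical check of the bound (the varieties are illustrative):

```python
import math

V_D = 16  # variety of the disturbances (illustrative)
V_R = 4   # variety of the regulatory responses (illustrative)

# Ashby's bound: V(E) >= V(D) / V(R)
V_E_min = V_D / V_R
bits_residual = math.log2(V_E_min)

print(V_E_min)        # 4.0 distinguishable essential-variable states at best
print(bits_residual)  # 2.0 bits of residual variety
```

With 16 equiprobable disturbances and only 4 responses, at least 4 essential-variable states (2 bits of residual variety) remain uncontrolled.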

Perfect regulation, in the reduced essential-variable representation, means that the residual uncertainty in the essential variable is driven to zero, i.e. V(E) = 1. In the full Ashby formulation, however, this is a stronger condition than merely achieving acceptable regulation: the system may still realize multiple distinct outcomes o ∈ O while all of them map, through O → E, into the acceptable region of the essential variables. Accordingly, throughout this article, “perfect regulation” will denote the stronger reduced-model case of a deterministic essential-variable state, not merely the weaker condition that all realized outcomes remain within the acceptable set.

The law reflects a fundamental insight: control is about the ratio of disturbances to regulatory responses, not just their arithmetic difference.

Information-Theoretic Formulation

All acts of regulation can be related to the concepts of communication theory by noticing that the “disturbances” correspond to noise, and the “goal” is a message of zero entropy, because the target value E is constant. Thus, the Law of Requisite Variety says that R's capacity as a regulator cannot exceed R's capacity as a channel of communication.

Variety is measured by the logarithm of the number of distinguishable elements. If the logarithm is taken to base 2, the unit is the bit of information. Applying this to Ashby's Law we get:

log2(V(D) / V(R)) = log2 V(D) - log2 V(R)

Ashby's law of requisite variety can be expressed in Shannon-style notation by representing the relevant environmental distinctions and the system distinctions as random variables. In this formulation, the environment is not the whole world indiscriminately, but only the set of environmental behaviors that require distinct responses from the system[56].

In practice, we use the Shannon information entropy, denoted by H. For a quantifiable variable, entropy is another measure of dispersion. If we assume equiprobable disturbances/responses and treat variety as cardinality, then H reduces to log2 V, yielding:

H(E) ≥ H(D) - H(R)

under the assumption of equiprobable (uniform) prior probabilities.
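A quick numerical check that, under a uniform distribution over V states, Shannon entropy coincides with Ashby's log2 V:

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits of a discrete probability distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

V = 8
uniform = [1.0 / V] * V
print(entropy_bits(uniform))  # 3.0
print(math.log2(V))           # 3.0, so H reduces to log2 V
```

Any non-uniform distribution over the same 8 states yields strictly less than 3 bits, which is why the entropy form generalizes the counting form.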

Ashby's Law can be interpreted as a cybernetic analogue of Shannon's "Noisy channel coding theorem" which states that communication through a channel that is corrupted by noise may be restored by adding a correction channel with a capacity equal to or larger than the noise corrupting that channel. The disturbance D, which threatens to get through to the outcome E, clearly corresponds to the noise; and the correction channel is the system R, which is supposed to restore the outcome E[8].

Ashby's law can thus be reformulated clearly:

The information-processing capacity (entropy) of a control system must be at least as large as the information (entropy) in the system it regulates.

It has been shown that the law of requisite variety can be extended to include knowledge or ignorance by simply adding this conditional uncertainty term[31]. When buffering is present, part of the environmental variety is absorbed passively before reaching the regulator. This reduces the effective disturbance entropy by an amount K, which is the buffering capacity:

H(E) ≥ H(D) + H(R|D) - H(R) - K

Where:

  • H(E) is the residual variety, i.e. the entropy of the realized essential-variable distribution, not the size of the set E.
  • H(R) is the entropy of the regulator, representing its information-processing capacity.
  • H(D) is the entropy of the disturbances, representing the complexity of the environment.
  • H(R|D) is the conditional entropy of the regulator given disturbances, representing the lack of requisite knowledge, i.e. the ignorance of the regulator about how to react correctly to each appearance of a disturbance D. Only a regulator that knows how to use the available regulatory variety H(R) to react correctly to each disturbance D will reach the optimal result of regulation[31].
  • K is the buffering capacity measured in bits of disturbance variety absorbed before reaching the regulator. Buffering is the passive absorption or damping of disturbances i.e. the amount of noise that a system can absorb without requiring an active regulatory response.

A necessary but not sufficient condition for effective control (to make H(E) small) is: H(R) ≥ H(D) + H(R|D) - K. Sufficiency additionally requires H(R|D) → 0.

Successful (essential) outcomes E do not depend solely on the variety of responses H(R) available to a regulator R; the system must also know which response to select for a given disturbance. Effective compensation of disturbances requires that the system possess the ability to map each disturbance to an appropriate response from its repertoire. The absence or incompleteness of such knowledge can be quantified using the conditional entropy H(R|D)[31]. In other words, H(R|D) measures how much the regulator R lacks the requisite knowledge to match responses to disturbances. In the absence of such requisite knowledge, the system would have to try responses by trial and error until the disturbances are eliminated. Thus, merely increasing the response variety H(R) is not sufficient; it must be complemented by a corresponding increase in selectivity, that is, a reduction in H(R|D), i.e. an increase in knowledge. H(R|D) = 0 represents the case of no uncertainty, or complete knowledge, where the action is completely determined by the disturbance. This requirement may be called the law of requisite knowledge[29].

H(R|D) reminds us that response alone is not sufficient: if the regulator does not know which response is appropriate for the given disturbance, it can only try out regulatory actions at random, in the hope that one of them will be effective and that none of them would make the situation worse. The larger the H(R|D), the larger the probability that the regulator would choose a wrong regulatory response, and thus fail to reduce the variety in the outcomes H(E). Therefore, this term H(R|D) has a “+” sign in the inequality: more uncertainty (less knowledge) produces more variation in the essential variables E[54].

To achieve control, the regulator R must possess sufficient information-processing capacity (entropy) such that the following is achieved: H(R) ≥ H(D) and H(R|D) = 0.

In other words, the complexity of the environment D cannot exceed that of the system R, which means the system R fully matches the environment D[7].

Since H(R) - H(R|D) = I(R:D), the law simplifies to: H(E) ≥ H(D) - I(R:D) - K

The mutual information I(R:D) represents the requisite knowledge of the regulator R about how to react correctly to each disturbance D, i.e. the amount of regulatory variety that is effectively correlated with and therefore absorbs the variety in the disturbances. Such knowledge may be realized structurally as the regulator's learned law of action M represented by a mapping M: D → R , by which disturbances are mapped to regulatory responses[54]. The mutual information I(R:D) quantifies how much of the regulator's learned law of action M: D → R effectively couples disturbances to responses, while the remaining uncertainty H(R|D) quantifies the lack of requisite knowledge[29].
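The identity H(R) - H(R|D) = I(R:D) can be verified on any joint distribution P(D, R); the distribution below is illustrative:

```python
import math
from collections import defaultdict

# Illustrative joint distribution P(D, R) for a partly-trained regulator.
P = {("d1", "r1"): 0.4, ("d1", "r2"): 0.1,
     ("d2", "r1"): 0.1, ("d2", "r2"): 0.4}

def H(dist):
    """Shannon entropy in bits of a distribution given as {event: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# Marginal distributions P(D) and P(R).
PD, PR = defaultdict(float), defaultdict(float)
for (d, r), p in P.items():
    PD[d] += p
    PR[r] += p

H_R = H(PR)
H_RD = H(P) - H(PD)  # H(R|D) = H(D,R) - H(D)
I_RD = H_R - H_RD    # I(R:D) = H(R) - H(R|D)
print(round(H_RD, 3), round(I_RD, 3))  # 0.722 0.278
```

Here about 0.28 of the regulator's 1 bit of response variety is effectively coupled to the disturbances; the remaining 0.72 bits are lack of requisite knowledge.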

Core challenges in applying Ashby's Law to real systems

We conducted a literature review aimed at identifying the primary challenges and limitations associated with applying Ashby's Law in real-world systems.

A central challenge that emerges is the measurement of variety. In most of the reviewed literature, the concept of variety is either poorly defined or not explicitly measured, resulting in ambiguity and potential misinterpretation of the law's implications. Key obstacles to effective measurement include:

  • The direct measurement of variety is fundamentally incomputable for all but the simplest systems [14].
  • Hidden variables introduce uncertainty and complicate measurement efforts [15].
  • Trade-offs often arise between variety at different scales [16].
  • A combinatorial explosion occurs when attempting to enumerate all possible system states [15,16].
  • Resource limitations constrain the feasibility of comprehensive measurement [20].
  • Environmental complexity is frequently “unknowable,” preventing complete assessment [25].
  • Most studies lack explicit or standardized methods for quantifying variety [14,17-20,25,27].
  • Existing approaches often lack rigorous quantitative validation [17].

Several measurement methods have been proposed, including:

  • Markup language-based variety estimation [18],
  • Iterative sampling techniques [21],
  • Entropy and determinism metrics to evaluate communication complexity, where greater variety was correlated with improved effectiveness [22],
  • Social network and cluster analysis to assess resilience [23], and
  • Multiple Correspondence Analysis (MCA) for capturing organizational complexity [24].

In addition, a subset of studies estimate variety through observed performance rather than structural attributes. Notable examples include:

  • Communication-based performance measures, employing determinism metrics to evaluate repeatable patterns in team behavior [22];
  • Team performance assessments, using task-based surveys to evaluate an organization's risk-handling capabilities [23];
  • Leadership behavior analysis, based on actual behavioral responses to simulated scenarios [26]; and
  • Relative performance comparisons, assessing organizational effectiveness across contexts using perception-based rather than absolute metrics [14].

While these performance-based approaches provide practical insights, they often rely on subjective or indirect indicators of variety, which may introduce biases and limit their generalizability. For example, performance outcomes may fail to account for hidden variables or the underlying complexity of the system [15]. Moreover, these approaches remain underrepresented in the literature, where structural and theoretical analyses still dominate.

In summary, although numerous methods for measuring variety have been proposed, no single comprehensive or universally accepted solution has emerged. Quantification remains a persistent challenge in the application of Ashby's Law to complex real-world systems.

Solution

These challenges significantly hinder the practical application of Ashby's Law. Whether considering a human, an AI model, or an organization, we are typically limited to observing external behavior rather than internal mechanisms—unless we are able to "open the box."

Ashby himself emphasized that all real systems can be considered black boxes. He argued that while black boxes mimic the behavior of real objects, in practice, real objects are black boxes: we have always interacted with systems whose internal workings are, to some extent, unknown.

This leads to what Ashby termed the black box identification approach [2], which involves:

  1. Perturbing the system by applying external disturbances,
  2. Measuring the system's responses to these perturbations, and
  3. Inferring the internal variety or capacity from the observed input-outcome relationships.

In most practical scenarios, we are only able to observe the outcomes of a system. These observable outcomes can be used to infer bounds on the system's internal variety—specifically, the extent of variety it must possess or lack in order to exhibit the observed behavior.

We propose such an approach: treat the system as a black box, observe the probability of successful outcomes to disturbances, and estimate the gap in its internal variety from those observations. Let E denote the event that the system gives a successful response to disturbance D, and let R be the regulator's action. In information-theoretic terms, perfect regulation requires H(R|D) = 0[31]. Using our novel information-theoretic estimator, empirical estimates of P(E=1) are used to quantify H(R|D) in bits of information. This quantifies how much disorder the regulator still permits and, conversely, how close the system comes to meeting Ashby's requisite-variety condition.
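The estimator itself is developed later in the article. As standard information-theoretic background (not the estimator proposed here), Fano's inequality already ties an observed success probability to a bound on the residual uncertainty: if the regulator selects among n candidate responses and errs with probability Pe = 1 - P(E=1), then H(R|D) ≤ h(Pe) + Pe·log2(n - 1), where h is the binary entropy. A sketch:

```python
import math

def h_bin(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def fano_upper_bound(p_success, n_responses):
    """Fano's inequality: H(R|D) <= h(Pe) + Pe * log2(n - 1)."""
    pe = 1.0 - p_success
    return h_bin(pe) + pe * math.log2(n_responses - 1)

# Illustrative: a 90% success rate over 8 candidate responses.
print(round(fano_upper_bound(0.9, 8), 3))  # 0.75
```

As P(E=1) → 1 the bound goes to 0, matching the condition H(R|D) = 0 for perfect regulation.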

Knowledge Discovery Process

The process of selection may be either more or less spread out in time. In particular, it may take place in discrete stages. What is fundamental quantitatively is that the overall selection achieved cannot be more than the sum (if measured logarithmically) of the separate selections. (Selection is measured by the fall in variety.) 13/17[2]

Ashby's selection in design and regulation via requisite variety are structurally identical: they describe how constraints (or regulation) reduce the variety of possible outcomes from an initial space. In Ashby's framework, constraints, tests, feedback, rules, and observations are all selection mechanisms that reduce variety.

Ashby (13/15 in [2]) measures the amount of selection in bits as:

σ = log2(Vbefore / Vafter)

Where

  • σ is the amount of selection (the amount by which the variety is reduced), i.e. the information gained: how much the uncertainty has been reduced,
  • Vbefore is the variety before the selection, i.e. before a constraint (filter, decision, control action) is applied, and
  • Vafter is the variety after the selection, i.e. after the constraint is applied.

From here on, we treat "variety in bits" as Shannon entropy i.e., using the distribution over possible outcomes. If possible outcomes are equiprobable, this reduces to Ashby's log2|V| counting form.

Thus every time we introduce a rule or a constraint we throw away some of the possibilities and gain information equal to the logarithm of that reduction:

  • “By what factor has the set of possibilities shrunk?”: Vbefore / Vafter
  • “How many bits of information does this represent?”: σ = log2(Vbefore / Vafter)

Rather than a single act, selection is often a multi-stage process consisting of k successive selections from a range of possibilities[2][4]. Each selection stage reduces the set of admissible alternatives, progressively transforming an initial space of variety into a more constrained set of alternatives with the goal to produce an acceptable outcome. Formally, this process can be understood as a sequence of k uncertainty-reducing operations, where each selection narrows the possibility space and thereby decreases entropy. Mathematically we have:

V0 --σ1--> V1 --σ2--> V2 --σ3--> ... --σk--> Vk

We denote the amount of selection achieved at each stage i as σ1, σ2, ..., σk. The total selection is the sum of the amounts of selection achieved at each stage, because logarithms turn multiplications of ratios into additions:

σ_total = Σ_{i=1}^{k} σ_i

where k is the number of selections preceding an acceptable outcome.

The total process therefore consists of k such selections, each conditioned on the result of prior selections. Reductions add only for nested/refining partitions (each stage refines the previous stage's partition of possibilities). If two stages constrain the same dimension in overlapping ways, we must count the second stage's reduction relative to the first, not from the original variety. The number of selections k thus characterizes the depth of the selection process and corresponds to the number of distinct uncertainty-reducing decisions required to reach the final state.
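A minimal sketch of such a staged selection, with an illustrative (nested, refining) variety sequence:

```python
import math

# Illustrative variety at each stage: 64 -> 16 -> 4 -> 1 possibilities.
varieties = [64, 16, 4, 1]

# Per-stage selection: sigma_i = log2(V_before / V_after).
sigmas = [math.log2(v0 / v1) for v0, v1 in zip(varieties, varieties[1:])]
total = sum(sigmas)

print(sigmas)  # [2.0, 2.0, 2.0]
print(total)   # 6.0 == log2(64 / 1): the per-stage selections telescope
```

The sum telescopes to log2(V0 / Vk) precisely because each stage refines the previous stage's partition of possibilities.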

We refer to the multi-stage process of narrowing down and selecting the response from its set of alternative responses to produce an acceptable outcome as a Knowledge Discovery Process.

We can say that we've got "it from bit" - a phrase coined by John Wheeler. "It from bit" symbolizes the idea that every item in the physical world has knowledge as an immaterial source and explanation at its core[6].

Information-Theoretic Operationalization of Staged Selection

Aulin-Ahmavaara and Heylighen characterize the ignorance term H(R|D) at the level of disturbance, regulator, action, appropriate response, and quality of regulation. They describe it as uncertainty about how to react correctly to a disturbance and how to use the available regulatory acts appropriately or optimally. However, their prose does not fully fix the granularity of the response variable. It leaves open whether the response denotes: (i) an equivalence class of acts that achieve the same acceptable outcome, (ii) the exact concrete response emitted, or (iii) the optimal response among several acceptable ones.

In the present article, we operationalize that idea at the level of the concrete response to be fixed. Accordingly, X denotes the specific response actually selected by the regulator (a random variable), not an equivalence class of acceptable responses. Under this modeling choice, H(X|Y) is the residual uncertainty about which concrete response X will be fixed, given the observed disturbance Y. This is an operationalization of the requisite-knowledge idea, not a claim that the source texts uniquely force this exact response granularity.

This distinction matters because multiple concrete responses may be acceptable for the same disturbance. In such cases, H(X|Y) does not automatically measure uncertainty about whether regulation can succeed. Rather, it measures uncertainty about which concrete response will be selected under the regulator's current policy Pt(X|Y).

Thus, in the present model, H(X|Y) should be read primarily as response-fixation uncertainty. It coincides with lack of requisite knowledge in the stronger sense only under an additional assumption: for the task class under consideration, either there is a unique success-relevant response, or tie-breaking among multiple acceptable responses has itself been structuralized and is part of the regulator's stored requisite knowledge.

Illustrative example. If the disturbance is “pay $10”, both “use one $10 bill” and “use ten $1 bills” may be equally acceptable responses from the valuation layer's point of view. In that case the regulator may still spend time selecting one concrete act, and there may still be uncertainty about which concrete response will be fixed even if there is no uncertainty about whether payment can succeed. That residual uncertainty is not necessarily ignorance in Aulin's strong sense. It may just be unresolved choice among equally acceptable responses. In Ashby's own setup, acceptability is defined by the goal relation between disturbance and response yielding an acceptable outcome; multiple different responses can belong to that acceptable set. But if the environment imposes additional structure (fastest payment, preserve change, cashier policy, social norm, transaction cost), then the concrete choice does become part of “react correctly”. In that case, uncertainty over which bill combination to use is a genuine lack of requisite knowledge about how to use the available acts optimally. That reading fits Aulin and Heylighen well. This illustrates why H(X|Y) is defined here at the level of concrete response selection rather than success-equivalence classes.

The table below aligns Ashby's language of regulation with Shannon's information-theoretic quantities by showing that both describe the same process: the progressive reduction of uncertainty about which action will succeed.

Ashby term Symbol Shannon / Information-theoretic term Symbol
Disturbance (observed at decision time) D Conditioning variable (given side-information) Y
Regulator response selected R The specific concrete response that will be committed for the observed disturbance Y X
Requisite knowledge of the regulator R about how to react correctly to each disturbance D. I(R:D) Stored requisite coupling between the regulator's responses X and the disturbance Y. I(X:Y)
Lack of requisite knowledge of the regulator about which response will produce an acceptable outcome given a disturbance H(R|D) The regulator's residual uncertainty about which concrete response X will be fixed, given the disturbance Y. It coincides with lack of requisite knowledge only when correctness is defined at that same granularity — meaning either there is one uniquely appropriate response or tie-breaking among acceptable responses has been structuralized and is itself part of the requisite knowledge. H(X|Y)
Selection signals (tests, observations, feedback, constraints, rules, partial executions) Z1, Z2, …, Zk Auxiliary information sources that reduce uncertainty about which response X is acceptable for a given Y Z1, Z2, …, Zk
Residual variety after the i-th selection stage V(R | D, Z1, …, Zi) Conditional entropy remaining after i-th selection stage H(X | Y, Z1, …, Zi)
Selection achieved at stage i (reduction in variety due to one constraint) log2   Vbefore / Vafter,i Conditional mutual information acquired at stage i I(X ; Zi | Y, Z<i)
Residual variety after k selection stages V(R | D, Z1, …, Zk) Conditional entropy of the regulator's selected response, given the observed disturbance and k selection signals H(X | Y, Z1, …, Zk)
Total selection achieved (successful adaptation) log2 Vbefore / Vafter Total mutual information acquired through all selection stages I(X ; Z1, …, Zk | Y)

Following the information-and-selection reading of Ashby's law, which says that the amount of rational selection is limited by the information available[57], we introduce a stochastic formulation, where X is the response to be fixed, Y is the observed disturbance, and Zi is the information-bearing signal available at stage i. In that formulation, when mapping one stage of selection with one selection signal Zi, the expected selection achieved at stage i is represented by conditional mutual information:

σi = H(X | Y, Z<i) − H(X | Y, Z≤i) = I(X ; Zi | Y, Z<i)

The selection Ashby counts in bits at stage i is the conditional mutual information I(X ; Zi | Y, Z<i), and the expected remaining uncertainty is H(X | Y, Z≤i). Thus Ashby's staged selection is not replaced by mutual information; it is re-expressed probabilistically as expected uncertainty reduction under successive information-bearing constraints. This is a modeling restatement of Ashby's staged-selection idea, equivalent to it in expectation, not a claim that Ashby originally formulated conditional mutual information.

As selections accumulate, the remaining uncertainty H(X | Y, Z1, …, Zi) shrinks. The total amount of selection Ashby describes corresponds in expectation to the total mutual information accumulated across stages.

σtotal = H(X | Y) − H(X | Y, Z1, …, Zk) = I(X ; Z1, …, Zk | Y)

Shannon's chain rule for mutual information is:

I(X ; Z1, …, Zk | Y) = I(X ; Z1 | Y) [stage 1] + I(X ; Z2 | Y, Z1) [stage 2] + … + I(X ; Zk | Y, Z<k) [stage k]

Those k summands are exactly the k amounts of selection Ashby would add up when he says “the total selection is the sum of the separate selections.”

If stages share information or impose overlapping constraints, summing their marginal “reductions” overcounts. The correct decomposition credits each stage only for the reduction of residual uncertainty left by previous stages; the per-stage contributions must therefore be conditional (incremental). The correct staged accounting is the mutual-information chain rule over the selection signals Z:

I(X ; Z1, …, Zk | Y) = Σi=1…k I(X ; Zi | Y, Z<i)

i.e., each stage is credited only for the information it contributes beyond what earlier stages already provided.

After k selections we have:

H(X | Y, Z1, …, Zk) = H(X | Y) [initial variety] − I(X ; Z1, …, Zk | Y) [total selection]

The principle is independent of how each selection rule is expressed; all that matters is the fraction of the search space that each selection stage discards. But whatever mix we choose, the grand total must still cover H(X|Y) if the essential variable E is to reach (or stay at) zero entropy with respect to its target value.
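The staged-selection accounting above can be checked numerically. The following sketch uses an assumed toy joint distribution (a hypothetical regulator, not data from the article): it computes the per-stage conditional mutual informations and verifies that they sum to the total selection, leaving the residual H(X | Y, Z1, Z2).

```python
from collections import defaultdict
from itertools import product
from math import log2

# Assumed toy joint distribution over (x, y, z1, z2): a hypothetical regulator
# whose response X is pinned down by two selection signals Z1 and Z2.
p = {}
for x, y in product(range(4), range(2)):
    z1 = x // 2               # Z1 reveals the high bit of the response
    z2 = x % 2                # Z2 reveals the low bit
    p[(x, y, z1, z2)] = 1 / 8  # X uniform over 4 responses, Y uniform over 2

def H(target, given, joint):
    """Conditional entropy H(target | given) in bits; args are tuple indices."""
    pg, ptg = defaultdict(float), defaultdict(float)
    for k, pr in joint.items():
        g = tuple(k[i] for i in given)
        pg[g] += pr
        ptg[(tuple(k[i] for i in target), g)] += pr
    return sum(-pr * log2(pr / pg[g]) for (t, g), pr in ptg.items() if pr > 0)

X, Y, Z1, Z2 = 0, 1, 2, 3
h_xy     = H([X], [Y], p)                               # H(X|Y): initial ignorance
i_z1     = h_xy - H([X], [Y, Z1], p)                    # I(X;Z1|Y): stage 1
i_z2     = H([X], [Y, Z1], p) - H([X], [Y, Z1, Z2], p)  # I(X;Z2|Y,Z1): stage 2
residual = H([X], [Y, Z1, Z2], p)                       # H(X|Y,Z1,Z2)

# Chain rule: the per-stage selections sum to the total selection, and the
# residual equals H(X|Y) minus that total.
assert abs((i_z1 + i_z2) - (h_xy - residual)) < 1e-12
```

In this toy case H(X|Y) = 2 bits, each stage contributes 1 bit, and the residual is 0: the two signals jointly fix the response.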

Regulation (within-episode)

An episode represents a specific interaction or event in time, characterized by a basic structure:

  • A disturbance (Y): An external stimulus, perturbation, or environmental input that acts upon the system.
  • A response (X): The system's internal action or selection made or outcome generated in response to the disturbance Y.
  • A within-episode sequence of evidence (Z1, Z2, …, Zk): A sequence of constraints, tests, feedback signals, or rules that are applied to the system in response to the disturbance.
  • A closure: An externally visible event that terminates the search for one disturbance-response pair by the end of the episode. A counted closure event is only an external marker that a selected response has been committed and that its realized outcome is currently counted as acceptable by the ledger.

Following Ashby's idea that selection may be distributed across stages and that total selection is the sum of separate selections, we model within-episode response fixation as a staged uncertainty-reduction process. In this model, complete response fixation occurs when H(X | Y, Z1, …, Zk) = 0, or equivalently I(X ; Z1, …, Zk | Y) = H(X | Y): the sum of the bits removed by selection at every stage must at least equal the bits of uncertainty injected by the original range of possibilities or by disturbances. This is a specialization of Ashby's staged-selection idea to the response-fixation model, not a restatement of the general law of requisite variety.

A disturbance Y is fixed and fully observed at episode start and is treated as given. All remaining uncertainty is only about which response X will be selected; subsequent Z is evidence about X, not new disturbance information. This initial uncertainty is the lack of requisite knowledge, measured as the initial conditional entropy H(X|Y). The system then applies a sequence Z1, Z2, …, Zk of constraints, tests, feedback signals, or rules, each of which removes some possibilities and therefore reduces the lack of requisite knowledge. In Ashby's terms this is “selection”; in Shannon's terms each stage i contributes conditional mutual information I(X ; Zi | Y, Z<i). “Non-outcome time” is entirely spent on discriminating information about X for the current episode, and outcome emissions are atomic and do not hide extra selection. The “episode closes with an acceptable outcome” event occurs only after the regulator has already fixed the exact response X; the closure event therefore certifies that the episode's X has been identified.

Regulation uses staged evidence Z1:k to eliminate uncertainty within an episode t, where k is the number of selection signals within the episode t. During an episode staged selection supplies Z 1 : k until the response X is effectively determined:

H(X | Y, Z1:k) ≈ 0  ⟺  I(X ; Z1:k | Y) = H(X | Y) − H(X | Y, Z1:k) ≈ H(X | Y)

This is “regulation”: success happens because H(X|Y,Z=z) becomes small after receiving a particular selection signal z.

Regulation can use within-episode evidence Z to determine the response X for a given disturbance Y during an episode t. Learning makes that success persistent by updating the stored structural coupling (mapping) M so future episodes start with less residual uncertainty about which response will be selected for the same class of disturbances.

Learning (across episodes in a window)

A window contains consecutive episodes and is defined by a bounded observation interval, typically from a start time tstart to an end time tend. Within that window, the regulator may encounter multiple episodes belonging to the same task class, while its stored structure may be updated from one episode to the next.

Let Mt denote the regulator's internal stored structural coupling (mapping) at the start of episode t. In Ashby's terms, M is not a separate object but the regulator's law of action, i.e. the learned functional relation by which disturbances are mapped to responses to produce acceptable outcomes[54]. By contrast, the Table of Outcomes remains the fixed environmental relation T : Y × X → O, which specifies what outcome would result from each disturbance-response pair. Learning does not alter that table. Learning alters the regulator's stored coupling Mt, and therefore alters the response law induced at the start of later episodes.

For episode t, the stored coupling Mt induces a conditional response law Pt(X | Y) := P(X | Y ; Mt).

For each episode t in a window, Mt is treated as a parameter (not a random variable) fixed at the start of that episode. Therefore we do not write entropies or mutual informations “conditioned on Mt”. Instead, we write the information-theoretic quantities induced by the distribution generated from the mapping Mt:

Ht := Ht(X|Y) := HPt(X|Y),  It := It(X;Y) := IPt(X;Y)

Here It is the stored requisite coupling: the amount of response-selection structure that is already aligned with the disturbance distinctions relevant to the task. Dually, Ht is the residual lack of requisite knowledge at episode start: after the disturbance is known, it measures how uncertain the regulator still is about which response it will select under its current stored structure.

What changes across episodes is not the semantic definition of the task, but the regulator's stored coupling for that task. Thus, the learning question is: after one episode's within-episode discoveries have been incorporated into Mt, does the next comparable episode begin with stronger stored coupling Mt+1 and less residual ignorance than before?

Comparability assumption for stored requisite knowledge and ignorance. In the across-episode analysis that follows, we treat X as the same selected-response variable for the same task class across episodes: its response alphabet, coding granularity, and any tie-breaking convention among equally acceptable responses are held fixed. Thus, although the stored structural coupling Mt may be updated from one episode to the next, the semantic definition of X is unchanged across the comparison.

For the specific purpose of reading changes in stored requisite knowledge and changes in lack of requisite knowledge as exact duals, we impose the additional comparability assumption that the marginal response entropy is invariant across the comparison:

Ht(X) = Ht+1(X)

Under this assumption, the standard identity It(X;Y) = Ht(X) − Ht(X|Y) implies

ΔtI = It+1(X;Y) − It(X;Y) = −( Ht+1(X|Y) − Ht(X|Y) )

Therefore, within this across-episode comparison regime, an increase in stored requisite knowledge is exactly equivalent to an equal decrease in lack of requisite knowledge: ΔtI = −ΔtH(X|Y)

If this comparability assumption is relaxed, then ΔtI and −ΔtH(X|Y) need not coincide, because part of the change in mutual information may come from drift in the marginal entropy Ht(X), not only from the conditional ignorance term Ht(X|Y).

We do not relax this comparability assumption in what follows.

Learning Axiom (Structural Knowledge Accumulation). A system is said to learn, in the structural-coupling sense, if and only if the within-episode evidence stream Zt = (Zt,1, …, Zt,kt) is incorporated into an updated mapping Mt+1 = Update(Mt, Zt), such that for subsequent encounters with the same class of disturbances Y, the stored requisite coupling increases:

It+1(X;Y) > It(X;Y)

Under the comparability assumption stated above, this is equivalently expressed as a decrease in lack of requisite knowledge:

Ht+1(X|Y) < Ht(X|Y)

In words: learning means that after the update of stored structure, the regulator begins the next comparable episode with stronger disturbance-response coupling Mt+1 and therefore less residual ignorance about which response to select. We do not assume any particular memory mechanism (overwrite, patching, versioning, or full replacement), but only track the effect of the update on the regulator's induced coupling measures.

Strong Learning Assumption (Posterior-Becomes-Prior Rule). A stronger form of learning is obtained when the regulator stores and reuses, without loss, the uncertainty reduction achieved within episode t. For a stable task class, we then assume that the posterior coupling achieved by the end of episode t becomes the prior stored coupling at the start of episode t+1.

Under the ignorance view, this may be written as the following Posterior-Becomes-Prior Rule:

Ht+1(X|Y) := Ht(X|Y, Zt)
Under this assumption, the posterior uncertainty achieved by the end of episode t becomes the prior uncertainty at the start of episode t+1.

Under the dual knowledge view, and under the comparability assumption on Ht(X), this is equivalently:

It+1(X;Y) = Ht(X) − Ht(X|Y, Zt)

This Posterior-Becomes-Prior Rule is a strong additional assumption, not a general consequence of conditioning alone. It requires, at minimum:

  • the same task class or disturbance semantics across episodes,
  • successful retention of the within-episode discoveries,
  • reuse of that stored structure in later episodes,
  • no intervening forgetting or context drift that would invalidate the stored coupling.

Corollary (Complete Adaptation under Strong Learning). Under the Posterior-Becomes-Prior Rule, repeated successful adaptation drives the stored requisite coupling toward its task-class ceiling and drives the residual lack of requisite knowledge toward zero for the task class: It(X;Y) → H(X) and Ht(X|Y) → 0.

In that limit, the response becomes effectively determined by the disturbance for the task class under study: Ht(X|Y) = 0. This is the state of complete adaptation for the task class: no further within-episode discovery is required in order to determine the response.

Clarification. The weak learning axiom is sufficient to define learning. The strong form is only needed because in what follows our model will assume that within-episode discoveries are fully carried forward as next-episode prior structure.
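The corollary's limit can be illustrated with a minimal simulation of the Posterior-Becomes-Prior Rule for a single disturbance class. The evidence model below (one reliable binary discrimination retained per episode, over an 8-response alphabet) is an illustrative assumption, not part of the formal claims.

```python
from math import log2

# Minimal sketch of the Posterior-Becomes-Prior Rule for one disturbance
# class Y = y. Assumed: 8 candidate responses, one retained binary
# discrimination per episode.
prior = [1 / 8] * 8          # P_t(X | Y=y): complete ignorance at the start

def entropy(dist):
    """Shannon entropy in bits of a discrete distribution."""
    return sum(-p * log2(p) for p in dist if p > 0)

trace = [entropy(prior)]     # H_t(X | Y=y) at each episode start
while entropy(prior) > 0:
    # Within-episode evidence Z_t: rule out half of the currently
    # possible responses, then renormalize.
    support = [i for i, p in enumerate(prior) if p > 0]
    keep = set(support[: len(support) // 2])
    posterior = [p if i in keep else 0.0 for i, p in enumerate(prior)]
    z = sum(posterior)
    posterior = [p / z for p in posterior]
    prior = posterior        # posterior of episode t becomes prior of t+1
    trace.append(entropy(prior))

print(trace)   # [3.0, 2.0, 1.0, 0.0]: residual ignorance driven to zero
```

Each episode removes one bit of residual ignorance, so Ht(X|Y) falls 3 → 2 → 1 → 0: the state of complete adaptation, where no further within-episode discovery is required.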

Temporal scope of windows and operative outcomes

The present model combines two distinct structures that must be kept separate. First, there is Ashby's fixed two-dimensional Table of Outcomes T : Y × X → O, which specifies what realized outcome would result from each disturbance-response pair. Second, there is the one-dimensional temporal trace of completed episodes observed across time. The table of outcomes is a fixed space of possibilities. The temporal trace records only which episodes actually completed, and in what order.

A window is a bounded temporal slice of that one-dimensional trace. It contains the non-overlapping episodes that complete within the chosen start and end times, ordered by completion time. An episode remains defined as one temporally bounded search-and-fixation process ending in exactly one externally visible closure. Thus windows are temporal objects, while Ashby's table T is not temporal: it is the fixed outcome space within which temporally ordered episodes select and revise realized outcomes.

For a given decision target τ , let the operative outcome at the start of a time interval mean the outcome in O that is currently in force for that target immediately before the next episode acts on it. That operative outcome may have been established earlier in the same window or in prior history before the window began. Accordingly, a current episode may act upon a target whose operative outcome was not produced in the current window, but is nevertheless the currently operative point in the fixed outcome space.

This means that temporal correction is not restricted to outcomes first closed in the same window. A later episode in the current window may revise an already-operative outcome inherited from earlier time. For example, a later review episode may overturn an outcome that was previously accepted and currently remains operative at window start. In that case, the corrective episode occurs in the present window, but the outcome being corrected belongs to the same fixed Table of Outcomes T .

Accordingly, throughout what follows, “final” always means final relative to the currently observed bounded temporal window, not necessarily final for all future time. A closure may survive to the end of one window and still be revised by a later episode in a subsequent window. This does not imply that the earlier window was defined incorrectly. It means only that windows are local temporal observations of an ongoing process of regulation and correction unfolding over time.

Rework

Because the regulator is treated as a black box, the observer does not see the internal staged-selection path. What is observed is only the operative fixation at the end of an episode. Therefore, whenever a later episode fixes a different value for a target that already had an operative value, the model interprets that later episode as the observable residue of an additional unobservable staged selection process. The entire later episode is therefore counted as rework.

Having distinguished the fixed two-dimensional Table of Outcomes from the one-dimensional temporal trace of completed episodes, we can now define rework precisely. Rework should be understood temporally but not structurally. It is not a change in the structure of Ashby's table T : Y × X → O, nor a change in the valuation mapping v : O → E. What rework changes is which realized outcome, and therefore which induced value, is currently operative for the same target within that fixed outcome space. Thus rework is a temporally later revision of an operative outcome-value assignment inside a fixed two-dimensional space of possible outcomes.

Accordingly, a rework episode is a later non-overlapping episode that fixes a different operative outcome-value assignment for a target that already had an operative outcome in force when that episode began. The prior operative outcome may have been established earlier in the same window or in prior history before the window opened. Thus rework is counted in the window in which the corrective episode occurs, even when the outcome being corrected was already operative at window start. What matters is not where the earlier operative outcome was first produced in time, but that an additional corrective selection had to occur after a prior selection had already fixed the matter operationally. Under the information-selection reading of Ashby's law, this matters because the amount of selection that can be performed is limited by the information available; hence each rework episode is evidence of additional selection burden, and therefore of additional informational burden, beyond the initial fixing of response.

Illustrative example. Consider a football game. In Episode 1, the referee performs a staged selection process and closes the episode with response x1 to disturbance y1, yielding realized outcome o1 = T(y1, x1) with value v(o1) = goal allowed. Now suppose that in Episode 2, VAR reviews the same target τ. The earlier outcome o1 is already operative for that target when the new episode begins. VAR then performs an additional staged selection process, selects response x2, and closes with outcome o2 = T(y2, x2), where v(o2) = no goal ≠ v(o1). This closure is counted as rework because it changes the operative outcome-value assignment of a target that had already been fixed operationally. In the VAR example, the final accepted state “no goal” is not obtained by one selection process but by the accumulated effect of two temporally separated staged selections acting on the same target: the referee's initial fixation and VAR's later corrective overrule.

In Ashby's terms, if appropriate effects appear before the corresponding causes/questions have been fully resolved, one must look for the missing channel that carried the needed information. Ashby explicitly says that apparent overstepping of the limitation leads us to search for the additional communication channel that accounted for it[8]. That is very close to our model: the overrule reveals that the earlier fixing of response was not actually final.

Rework, retraction, and net learning

A mapping update may include both the addition of improved structure and withdrawal of previously stored structure. These internal components must be distinguished from externally observable rework and from the net epistemic effect of the update as a whole.

Observable rework. Rework is any externally visible correction, undo, replacement, deletion, or repair of prior commitments. It is defined at the level of artifacts and execution behavior, not at the level of the regulator's internal coupling ledger.

Internal retraction and accretion. A mapping update of M may contain both a retraction component in which previously stored structure is withdrawn or weakened, and an accretion component, in which new or improved structure is added or strengthened.

These internal components should not individually be identified with negative learning or positive learning. In particular, withdrawing false or obsolete structure is often part of successful learning.

Net learning. Learning is evaluated by the net epistemic effect of the whole update after it is complete. The relevant question is not whether some fragment of prior structure was removed, but whether the completed update leaves the regulator with stronger or weaker stored disturbance-response coupling than before.

Net learning criterion

We evaluate learning by the net change in stored requisite coupling induced by the whole update:

ΔtI := It+1 - It
This is the primary across-episode learning quantity.

Under the comparability assumption Ht(X) = Ht+1(X), the same net change may be written dually as a change in lack of requisite knowledge:

ΔtH := Ht+1 − Ht,  so that ΔtI = −ΔtH

Accordingly:

  • Net positive learning iff ΔtI > 0 (equivalently, under the comparability assumption, a net decrease in lack of requisite knowledge: ΔtH < 0).
  • Zero net learning iff ΔtI = 0 (equivalently, under the comparability assumption, ΔtH = 0).
  • Net negative learning iff ΔtI < 0, i.e. the final mapping is worse than the initial one with respect to stored requisite coupling (equivalently, under the comparability assumption, a net increase in lack of requisite knowledge: ΔtH > 0).

Thus, negative learning is reserved for a net worsening of stored requisite coupling after the whole update is complete. It should not be inferred merely from the fact that part of the update involved deletion, withdrawal, or correction of previously stored structure.

Exact knowledge ledger of net change

For knowledge accounting purposes, define the positive and negative parts of the net coupling change:

Gt := max(ΔtI, 0),  Lt := max(−ΔtI, 0)

Then the stored requisite coupling satisfies the exact ledger identity:

It+1 = It + Gt - Lt

Under the comparability assumption, the same net update may be read dually in ignorance form:

Ht+1 = Ht - Gt + Lt

This ledger records only the net epistemic effect of the update. It does not assert that the internal update process itself consisted of a pure gain or a pure loss. A single episode may contain both correction of prior structure and acquisition of improved structure, yet still end with Gt > 0 .
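The ledger identities can be exercised directly. In the sketch below the ΔtI values are made-up illustrations; the update step implements exactly the identities It+1 = It + Gt − Lt and, in dual ignorance form, Ht+1 = Ht − Gt + Lt.

```python
# Hypothetical coupling ledger across three updates. The delta values are
# illustrative assumptions; the identities are the exact ledger relations.
def ledger_step(I_t, H_t, delta_I):
    G = max(delta_I, 0.0)    # positive part of the net coupling change
    L = max(-delta_I, 0.0)   # negative part of the net coupling change
    I_next = I_t + G - L     # I_{t+1} = I_t + G_t - L_t
    H_next = H_t - G + L     # dual ignorance form under comparability
    return I_next, H_next

I_t, H_t = 1.0, 2.0              # assumed start; H_t(X) = I_t + H_t = 3.0
for delta in [0.5, -0.2, 0.7]:   # net positive, net negative, net positive
    I_t, H_t = ledger_step(I_t, H_t, delta)
    # Comparability: the marginal entropy H_t(X) stays invariant.
    assert abs((I_t + H_t) - 3.0) < 1e-12

print(round(I_t, 9), round(H_t, 9))   # 2.0 1.0
```

Note that the middle update (ΔtI = −0.2) is net negative learning even though the other two are gains; the ledger tracks only the net effect of each completed update.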

Why rework is not the same as negative learning

An episode may contain observable rework and still yield net positive learning. For example, a software developer may correct a previously committed line of code by deleting a wrong symbol and replacing it with the correct one. This is clearly rework at the artifact level. But if the completed update leaves the developer with stronger stored disturbance-response coupling for future comparable episodes, then the episode yields net positive learning, not negative learning.

Accordingly, all three combinations are possible:

  • Rework + net positive learning
  • Rework + zero net learning
  • Rework + net negative learning.

Therefore, observable rework does not by itself imply ΔtI < 0. Rework is an execution-level phenomenon; negative learning is a coupling-ledger judgment about the net epistemic effect of the completed update.

Defining Knowledge To Be Discovered

Ashby's Law of Requisite Variety provides the general control frame: disturbances must be countered by adequate regulatory variety if essential variables are to be kept within acceptable bounds; the regulator must possess sufficient variety relative to the relevant variety of the environment[1]. In modern entropy-based restatements, this requirement can be expressed as a matching condition between relevant environmental distinctions and the system distinctions available to answer them[56]. That control frame tells us what successful regulation requires, but it does not by itself define the regulator's residual uncertainty inside a concrete episode of action.

Heylighen's Law of Requisite Knowledge adds the missing condition for effective regulation: it is not enough for a regulator to possess a repertoire of possible responses; it must also possess the requisite knowledge needed to select the appropriate one for the disturbance encountered. Otherwise, increased action variety increases the chance of choosing the wrong action, forcing trial-and-error selection[29]. In information-theoretic form, that residual lack of requisite knowledge is represented by the conditional entropy H(R|D). In Heylighen's formulation, H(R|D) = 0 means complete knowledge, while H(R|D) = H(R) means complete ignorance[29].

In the present model, we retain that same object, but write it using the episode notation Y for the observed disturbance and X for the selected response. Thus the regulator's lack of requisite knowledge is represented by H(X|Y). This is not introduced here as a new replacement for Ashby's law. It is the same kind of ignorance term, now expressed in the symbols of the operational model.

Crucially, conditional entropy is itself an average quantity: it is the expected value of the entropies of the conditional response distributions, averaged over the conditioning variable[5]. Therefore the primary object of interest in this article is not a one-off episode quantity taken in isolation, but the regulator's average lack of requisite knowledge over a window of episodes.

Window-level estimand

Let a window W contain mW episodes. For episode t, let Mt be the regulator's stored structure at episode start, inducing a response law Pt(X|Y). If the regulator's structure changes during the window, then each episode may have a different induced law Pt(X|Y). The corresponding episode-start ignorance term is

Ht(X|Y) := HPt(X|Y)

This quantity measures, for episode t , how uncertain the regulator still is, after observing the disturbance, about which specific response it will select under its current structure.

Knowledge To Be Discovered is the regulator's average episode-start uncertainty about which concrete response will be fixed for the observed disturbance:

H̄W := (1/mW) Σt=1…mW Ht(X|Y)
as operationally revealed by the number of episodes mW that survive as accepted closures by the end of a particular bounded execution window W , under that window's ledger of observable closures and rework.

Here X denotes the specific response actually fixed after the disturbance is known, not merely an equivalence class of acceptable responses. Accordingly, H̄W measures the average response-fixation uncertainty induced by the regulator's stored structure over the episodes that survive as accepted closures by the end of the bounded execution window W, under that window's ledger of observable closures and rework. Under the additional standing assumption that correctness is defined at this same response granularity, or that tie-breaking among multiple acceptable responses has itself been structuralized, H̄W may also be read as the latent estimand of the regulator's average lack of requisite knowledge H(X|Y).
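As a minimal numerical sketch of this estimand, suppose a window whose mW = 3 surviving episodes start with uniform response laws over 8, 4, and 2 candidate responses respectively. This shrinking-candidate trajectory is an illustrative assumption, not data.

```python
from math import log2

# Assumed window of m_W = 3 surviving episodes; each episode-start law
# P_t(X|Y) is taken to be uniform over a shrinking set of candidates.
candidates_per_episode = [8, 4, 2]   # stored structure improves across episodes

H_t = [log2(n) for n in candidates_per_episode]   # H_t(X|Y) for uniform laws
H_bar_W = sum(H_t) / len(H_t)                     # (1/m_W) * sum_t H_t(X|Y)
print(H_bar_W)   # 2.0
```

The window-average Knowledge To Be Discovered is thus 2 bits per episode, even though no single episode starts with exactly 2 bits of ignorance.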

This has important consequences:

  • the estimand is not global over the full Ashby table,
  • it is not window-invariant,
  • it may change if the same work is sliced into different windows,
  • and it is non-retroactive.

Episode-level interpretation

Although the estimand is window-level, each episode still has a natural local interpretation. For a particular episode t with observed disturbance realization Y = yt, the regulator begins with a latent response-selection uncertainty Ht(X | Y = yt). Within the episode, staged evidence Z1:k reduces that uncertainty until a response is sufficiently fixed for closure. Thus the episode-level search process is the local mechanism, while H̄W is the window-level quantity summarizing those episodes on average.

Relation to a pooled window distribution

We treat the whole window as generating one empirical joint distribution P̂W(X, Y), and define the pooled window conditional entropy as:

HW(X|Y) = − Σy P̂W(y) Σx P̂W(x|y) log P̂W(x|y)

If the regulator's structure is approximately stable throughout the window, so that the induced response law does not materially drift across episodes, then the average ignorance term may be represented by the pooled window distribution:

HW(X|Y) ≈ H̄W

But in general they should be kept distinct. If learning, forgetting, or context drift changes Mt within the window, then the window-average ignorance term is more faithful than a single pooled conditional entropy.
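The gap between the window average and the pooled entropy appears precisely under drift. In the assumed two-episode sketch below, the stored structure improves between episodes, and pooling the episodes into one empirical law overstates the average ignorance (a consequence of the concavity of entropy).

```python
from math import log2

def cond_entropy(law):
    """Entropy in bits of P(X | Y=y) for one observed disturbance y."""
    return sum(-p * log2(p) for p in law if p > 0)

# Assumed drift: episode 1 starts ignorant between two responses,
# episode 2 starts certain (after learning) of the first response.
law1 = [0.5, 0.5]     # H_1(X|Y) = 1 bit
law2 = [1.0, 0.0]     # H_2(X|Y) = 0 bits

avg_H = (cond_entropy(law1) + cond_entropy(law2)) / 2   # window average = 0.5

# Pooling both episodes into one empirical law mixes the two distributions:
pooled = [(a + b) / 2 for a, b in zip(law1, law2)]      # [0.75, 0.25]
pooled_H = cond_entropy(pooled)                         # ~0.811 bits

assert pooled_H > avg_H   # pooling overstates ignorance when M_t drifts
```

Here H̄W = 0.5 bits while the pooled entropy is about 0.811 bits, which is why the per-episode average is the more faithful quantity when Mt changes within the window.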

What this definition does and does not mean

  • It does preserve the cybernetic meaning of the ignorance term: uncertainty about which response to select for a given disturbance[31].
  • It does not redefine Ashby's law as a theorem about one isolated episode.
  • It does not identify Knowledge To Be Discovered with arbitrary outcome uncertainty, with success probability, or with the full entropy of the environment's outcome table.
  • It does provide the latent quantity that the operational execution model will later try to estimate from window-level counts.

Quantifying Knowledge To Be Discovered

How is the desired regulator to be brought into being? With whatever variety the components were initially available, and with whatever variety the designs (i.e. input values) might have varied from the final appropriate form, the maker Q acted in relation to the goal so as to achieve it. He therefore acted as a regulator. Thus the making of a machine of desired properties (in the sense of getting it rather than one with undesired properties) is an act of regulation[2].

We now turn from definition to quantification. The quantity of theoretical interest is the regulator's Knowledge To Be Discovered, H(X|Y). Accordingly, the structure of the argument in what follows is: (1) define the latent window-average ignorance term, (2) relate episode-level binary discrimination depth to conditional entropy, and (3) construct a black-box, count-based observable proxy that serves as an operational estimator for that average ignorance under explicit counting assumptions.

The quantity H(X|Y) is latent. In practice, we do not directly observe the disturbance Y, the response X, the evidence Z1, Z2, …, Zk, or the internal selection process. What we do observe is the externally visible execution stream: bounded action capacity, episode-closing commitments, and observable rework. The goal is therefore not to recover the full response distribution directly, but to construct an observable count-based proxy for the decision depth associated with response fixation, and then to show how observable rework W increases the effective number of binary discriminations required to identify the finally accepted outcomes under fixed capacity N. We focus on the case where regulation is eventually successful, i.e. the within-episode residual uncertainty is driven toward 0 before closure, while the window-average ignorance term may decrease across windows under learning.

The present model does not estimate a property of Ashby's full two-dimensional outcome table T. It estimates a window-relative, ledger-relative quantity induced by the one-dimensional execution trace of completed episodes observed in that window. Accordingly, both the episode population and the corresponding average lack of requisite knowledge are defined relative to the closures that survive in the observer's ledger at window end. Thus, “final” in this model means final-within-the-window-ledger, not necessarily final for all future time.

That follows from three modeling choices we have already made.

  • First, the object we observe is not Ashby's full space of possibilities T : D × R → O. It is only the temporal trace of what actually completed in the window. So we are no longer estimating a property of the full outcome table. We are estimating a property of the regulator as revealed through the bounded execution channel in that window.
  • Second, episode identity is tied to observable closure in the trace, not to an abstract disturbance-response pair considered independently of observation. So whether something counts as an episode depends on whether it survives as a counted closure in that window's ledger.
  • Third, rework is defined relative to the history visible to the observer in or before that window. So finality is not metaphysical finality. It is finality relative to the ledger available at window end.

Accordingly, the operational model keeps distinct:

  • the observed disturbance Y
  • the selected response X
  • the acceptability criterion induced by the goal/valuation layer
  • the externally visible episode-closing event recorded in the execution stream B

Clarifying “outcomes” vs. “activity” (episode-closure). In this model an outcome is defined operationally as an episode-closing commitment: an externally visible event that terminates the search for one disturbance-response pair by the end of the episode. It is not “any produced artifact” or “any activity.” In software it might be a merge/accept/release; in other domains it might be an approved artifact, a published paragraph, a signed decision, etc. The theory depends only on having a consistent, externally observable episode-close event.

Clarifying “rework” (in-model). Rework is any externally observable behavior in which previously produced artifacts are revised, undone, deleted, replaced, or corrected due to new evidence (e.g. failing tests, defects, requirement changes, reversals, rollbacks). Rework is observable through external change signals such as deletions, reversions, churn, reopened items, or corrective follow-up actions. Cybernetically: rework is capacity spent on revising prior episode-closing commitments, which reduces marginal episode-closure per unit capacity because the system revisits and repairs earlier choices instead of closing new episodes.

Here are the assumptions our operational model is based on:

  • Observed disturbance.
    Each episode begins with an observed disturbance realization Y = y .
    During that episode, y is treated as fixed and fully observed. Thus the within-episode uncertainty is not about which disturbance occurred, but about which response must be selected.
  • Meaning of the response variable.
    The random variable X denotes the specific response that will be fixed in that episode. It does not denote the whole equivalence class of acceptable responses, and it is not identical to the closure indicator.
  • Acceptable closure definition.
    An acceptable closure is an episode-closing commitment whose realized outcome o belongs to the acceptable-outcome set induced by the goal on E, i.e. v(o) lies within the admissible region of essential-variable states. In Ashby's terms, this is the acceptable-outcome slice of the broader outcome table relative to the goal criterion, not the whole table. A closure event also records that the regulator has committed some specific response for that episode, but it does not imply uniqueness of the selected response. Multiple distinct responses may still yield acceptable outcomes. Therefore, acceptable closure records that the episode ended with a committed response whose realized outcome is acceptable, but it implies H(X|Y) = 0 only under the additional assumption that the acceptable response is unique or that tie-breaking among acceptable responses has itself been structuralized.
  • Scope of the observed outcome stream.
    The operational model does not track the full Ashby outcome space. It tracks only those episode-closing commitments whose realized outcomes are counted, by the operational ledger, as acceptable closures.
  • Acceptability, response, and closure are distinct.
    The model keeps separate:
    • the selected response X ,
    • the acceptability criterion induced by the goal/valuation layer,
    • the observed closure event recorded in the execution stream.
    These three are linked, but they are not the same object. A counted closure means that the episode has ended with an acceptable closure under the operational ledger; it does not erase the distinction between response selection and acceptability.
  • Shared execution channel.
    There is a single shared execution channel with finite window capacity N . Time is partitioned into atomic intervals. In the base model, each interval contains exactly one atomic action. No parallel channels, batching, or idle intervals are assumed.
  • Gross closure-event.
    Any externally observable episode-closing fixation commitment that the ledger counts as an acceptable closure for one episode. Some such commitments survive as counted episodes; others are later overruled and become rework.
  • Episode is defined at the net ledger level.
    In this operational model, an episode is defined not by any provisional closure-event, but by a closure that survives as an accepted observable closure relative to the ledger available at window end, not by eternal finality.
  • Binary selection units.
    Any interval that does not contain an episode-closing commitment is treated as one binary discriminating selection step. One such step contributes at most one bit of information relevant to fixing X given Y . If a real-world test yields more than one bit, it is represented as multiple counted binary units. This is the operational bridge from staged selection to question-depth.
  • Atomic action types. Each atomic interval is used either:
    • to perform one binary discriminating selection step, or
    • to emit one externally visible episode-closing commitment.
  • No hidden selection inside closure.
    An episode-closing commitment is atomic at the observation level. It records that the episode has closed; it does not hide additional uncounted selection work inside the same interval.
  • Closure occurs only after response fixation.
    Episode closure occurs only after the regulator has already fixed the episode's selected response X . Therefore a counted closure event certifies that the search for that episode has terminated with a finalized response.
  • Gross closure indicator.
    For each atomic interval t ∈ { 1,...,n} in the observation window define the binary event-type indicator:

    Bt = { 1 if interval t contains a gross closure-event, 0 otherwise }

    Bt is a purely operational gross channel marker, not a net-of-window-end marker. It is therefore an execution-level marker: not the response variable X itself, not the goal variable / acceptability criterion G, not a success probability, and not a value of the essential variable E.

Latent decision-depth bound and observable count-based proxy

Consider a regulator that interacts with an environment in episodes. In each episode, a disturbance Y = y is observed at episode start and is treated as fixed for the episode. The regulator ultimately emits an episode-closing commitment oi (an externally visible acceptable outcome) by selecting responses. Let 𝒳 denote the available response set (Ashby's R; the action alphabet). Let X denote the regulator's selected response random variable, with values in 𝒳, distributed according to the regulator-induced policy P(X|Y) (induced by its current coupling/structure Mt).

The quantity H̄W is the regulator's average uncertainty about which concrete response act will be fixed for an observed disturbance, operationalized through the execution-trace model. It is not read directly from the execution trace. What the trace provides is an operational count of atomic action units expended per net surviving closure. Accordingly, the argument is stated in two distinct layers.

Information-theoretic layer

Inside an episode t, let kt denote the number of binary discriminating selections required before the response is fixed and the episode can close acceptably. Then kt is a latent decision-depth variable. The expected number of selections needed to fix the response for a given disturbance per episode is E[kt | Y = yi].

For each Y = y, if the regulator's within-episode binary selection strategy is optimal (minimizes E[kt | Y = y] for the given posterior P(X | Y = y)), then the expected number of binary selections required to identify X satisfies the Shannon bound:

H(X | Y = y) ≤ E[kt | Y = y] < H(X | Y = y) + 1
Averaging over Y yields:

H(X | Y) ≤ E[kt] < H(X | Y) + 1

Equivalently, averaging over the mW net surviving episodes in the window yields

H̄W ≤ ĒW[k] < H̄W + 1

(1)

where ĒW[k] := (1/mW) Σ_{t=1}^{mW} E[kt].

Thus, the expected binary decision depth ĒW[k] needed to fix the response X under optimal binary questioning is the information-theoretic bridge that links the latent Knowledge To Be Discovered entropy to average questioning depth in Shannon-style prediction and source-coding arguments. The problem is that k is not directly observed episode by episode. Instead, we will infer it from black-box observable execution counts over a window.
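Optimal binary questioning for a known posterior is equivalent to an optimal binary prefix code, so the one-bit Shannon bracket above can be checked numerically with a Huffman construction. A minimal Python sketch, in which the posterior p_x_given_y is a hypothetical example distribution, not data from the model:

```python
import heapq
import math

def huffman_lengths(probs):
    """Question depths of an optimal binary questioning tree (Huffman construction)."""
    lengths = [0] * len(probs)
    # heap entries: (probability, unique id, leaf indices under this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    uid = len(probs)
    while len(heap) > 1:
        p1, _, leaves1 = heapq.heappop(heap)
        p2, _, leaves2 = heapq.heappop(heap)
        for leaf in leaves1 + leaves2:
            lengths[leaf] += 1  # one more question on the path to these leaves
        heapq.heappush(heap, (p1 + p2, uid, leaves1 + leaves2))
        uid += 1
    return lengths

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical posterior over candidate responses for one observed disturbance y
p_x_given_y = [0.5, 0.25, 0.125, 0.125]
H = entropy(p_x_given_y)
E_k = sum(p * l for p, l in zip(p_x_given_y, huffman_lengths(p_x_given_y)))
assert H <= E_k < H + 1  # Shannon bound: H(X|Y=y) <= E[k|Y=y] < H(X|Y=y) + 1
```

For this dyadic posterior the bound is tight (E_k = H = 1.75 bits); for non-dyadic posteriors the expected depth sits strictly inside the one-bit bracket.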

Operational observation layer

Let T: D × R → O be Ashby's two-dimensional outcome table: disturbances D, responses R, and outcomes O. In Ashby's framing, regulation works by selection in this table, and the law of requisite variety can be read as a law relating information and selection: the amount of selection that can be performed is limited by the information available.

Let Ω be an externally chosen observation window, and let EΩ = (e1, e2, …, en) be the non-overlapping episodes completed in that window, ordered by completion time. The sequence EΩ is the one-dimensional temporal trace observed in the window. This trace is distinct from Ashby's two-dimensional table T because it records only the episodes that completed in the window, not the entire table of possible outcomes. We refer to this trace as the execution channel divided into atomic time units.

We work with a deliberately narrow operational model. The regulator is a black box. We do not observe its staged selection process. Each episode begins with an observed disturbance Y=y , which is treated as fixed for that episode. The regulator then performs a sequence of internal or externally mediated discriminations until the response X is sufficiently fixed for the episode to close acceptably.

Over a fixed observation window, assume:

  • There is a single shared execution channel with fixed capacity, no idle time, and divided into atomic time units.
  • Each atomic time unit is fully used either to:
    1. perform one binary selection step, meaning the regulator receives (or generates) a binary discriminating signal whose realized value provides ≤ 1 bit of discriminating information relevant to determining which x ∈ 𝒳 will be selected for the given disturbance y in the current episode, or
    2. emit one episode-closing commitment (an externally visible acceptable outcome). Outcome emission is atomic and contains no hidden selection.
  • Episode closure occurs only once the regulator has fixed the episode's selected response X (i.e., the closing commitment corresponds 1-to-1 with a finalized response choice, even if we do not inspect its content).
  • Every episode in the window terminates in exactly one counted closure.
  • There are no open episodes crossing window boundaries.
  • No episode closes outside the window.
  • No counted closure represents more than one episode.

Operational window accounting (black-box observable count-based proxy)

Let {et} = {e1, …, en} denote the complete time-ordered sequence of events over a fixed window of n discrete time units, where each unit is fully utilized by exactly one atomic action on a single shared execution channel. Each interval is classified by the binary event-type indicator Bt, which is 1 if the interval contains an episode-closing commitment and 0 otherwise.

The total number of counted gross closure-events in the window is

S = Σ_{t=1}^{n} Bt

while the complementary count Q = Σ_{t=1}^{n} (1 − Bt) = n − S is interpreted, under the atomic shared-channel assumptions above, as the total number of binary selection-equivalent steps expended in the same window. Because the ledger counts only acceptable episode-closing commitments, S is not a count of arbitrary activities or arbitrary realized outcomes but a count of acceptable closures.

In this base section, Bt marks a closure that the ledger presently counts as an acceptable episode closure. Later, when rework is introduced, the accounting will be refined into gross closures, wasted closures, and net closures, without changing the basic meaning of Bt as an event-type marker.

Partition the event stream {Bt} into the S consecutive non-overlapping episodes {wi}, where episode wi corresponds to responding to one disturbance realization Y = yi, consists of exactly one outcome oi that closes the episode, and contains exactly ki selections preceding that outcome. By construction, the inferred total number of selections for the window satisfies Q = Σ_{i=1}^{S} ki. The window begins immediately after an episode-closing outcome and ends on an episode-closing outcome.

If the minimum outcome duration is one unit of time, equal to the time it takes to make one selection, then each action occupies one time unit of duration t. Hence the sum of S and Q equals the length n = T/t of the time series, and the total available time T satisfies Q·t + S·t = T. We define r = t⁻¹ as the execution rate of the channel; since n = T/t = T·t⁻¹ = T·r, it follows that Q + S = T·r. We now drop the symbol n and instead use N for the maximum action capacity (selections + outcomes) per window: N = T·r.

Consequently, over any observation window with maximum action capacity N and observed number of acceptable outcomes S, the inferred total number of binary selection-equivalent steps expended in the same window is:

Q = N - S

(2)

Hence the empirical mean selection depth per acceptable closure is: k̂ = Q/S = N/S − 1. The quantity k̂ is an observable count-based proxy for the latent expected binary decision depth E[k] required to fix a response and achieve acceptable closure.

The fraction of time intervals that contain realized outcomes (episode-closing commitments) is: θ = S/N.

Example of the knowledge discovery process

We illustrate the knowledge discovery process in Fig. 1.

Fig.1 A Knowledge Discovery Process. A regulator faces a sequence of disturbances Y and selects responses X over discrete time intervals t. The total available time T is divided into n unit intervals, each of which is used either to perform one optimal binary selection (one internal node traversal in a decision tree) or to emit one realized outcome in O. Three realized outcomes o1, o2, and o3 are produced. For disturbance y1, the response is already determined by the stored structural coupling between X and Y (i.e., H(X | Y = y1) = 0), so outcome o1 is emitted with zero selections. For disturbance y2, the regulator must search within the candidate response set Ω2, traversing four binary decision nodes (z1 → z2 → z3 → z4) before identifying the response and emitting o2. For disturbance y3, the candidate set Ω3 requires two binary selections (z5 → z6) before outcome o3 is produced. Accordingly, the per-disturbance decision depths are k1 = 0, k2 = 4, and k3 = 2. The expected number of selections per outcome is therefore E(k) = 2, which operationally estimates the latent conditional entropy H(X|Y) up to the standard +1 Shannon bound. This expected decision depth quantifies the Knowledge To Be Discovered required to produce an outcome.
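The window accounting above can be replayed on the Fig. 1 numbers. A minimal sketch, assuming the single-channel counting model and the figure's hypothetical decision depths (0, 4, and 2 selections):

```python
# Hypothetical per-episode decision depths k_i taken from Fig. 1
depths = [0, 4, 2]
S = len(depths)      # episode-closing outcomes in the window
Q = sum(depths)      # binary selection steps in the window
N = Q + S            # total atomic action capacity used (selections + outcomes)

k_hat = Q / S        # empirical mean selection depth per acceptable closure
theta = S / N        # fraction of intervals that contain outcomes

assert k_hat == N / S - 1 == 2.0   # matches E(k) = 2 in the caption
assert theta == 3 / 9
```

With the +1 Shannon bound, k̂ = 2 places the latent window-average Knowledge To Be Discovered between k̂ − 1 and k̂ for this window.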

Observable rework

For gross closure et, define the prior operative value as the most recent already-operative value for the same target, whether that value came from earlier in the window or from before the window. An episode et is an additional corrective selection iff it fixes, for a target τt that already had a prior operative value, a different value vt than the one previously in force.

Gross closure-events are externally visible fixation commitments. If a later corrective selection overrules an earlier closure for the same decision target within the same window, the earlier closure is not counted as a separate episode in the window's net ledger. It is counted as observable rework.

Target and operative value. For every interval t with Bt = 1 , let τt denote the decision target fixed by the gross closure at t and vt denote the operative value fixed for that target by that gross closure. These quantities are defined only when Bt = 1 .

Additional corrective selection / supersession indicator A gross closure in window Ωt is counted in Wt iff, when it occurs, the same target already has an operative outcome-value assignment, and the current closure changes that assignment.

Define the retrospective supersession indicator:

ACSt = { 1 if Bt = 1, target τt already had an operative value before interval t, and the closure at t changes that operative outcome-value assignment; 0 otherwise }

The binary indicator Bt is defined as a gross closure marker, not a net marker. This is necessary because rework must remain visible in the execution trace. A closure can only later be classified as rework if it was first recorded as a gross closure event in the channel. Accordingly, rework classification is applied after gross closure marking. The indicator ACSt is therefore retrospective: it identifies those gross closures that are later superseded within the same bounded window by a later gross closure for the same target fixing a different operative value.

Define rework operationally as the observable count: W = Σ_{t=1}^{n} ACSt. A current-window gross closure is counted in W iff it fixes, for a target that already had an operative realized outcome in Ashby's fixed outcome space, a different operative outcome-value assignment than the one previously in force.

Here Wt is an observable execution-level count. It does not directly reveal the sign of the regulator's internal coupling change, and it should not be identified with net negative learning Lt = max(-ΔtI, 0). A nonzero Wt is evidence that corrective activity occurred, but it does not by itself imply ΔtI < 0. A window may contain observable rework and still end with net positive learning in the regulator's stored coupling.

Including W yields an operational equivocation rate per net outcome under non-retroactive windowing; W increases the effective number of binary discriminations required to identify the finally accepted outcomes.

What Wt shows is that some previously counted closure-events were not yet final relative to the accepted-outcome ledger. Their correction required additional discrimination before the window's net accepted closures were fixed. Therefore, in the present model, observable rework is not treated as free. It is evidence that the effective number of binary discriminations required to reach the net accepted closures in the window was larger than the gross closure count alone would suggest.

Net closures. Then the number of net new accepted closures for the window is: Snet = Σ_{t=1}^{N} Bt (1 − ACSt) = Sgross − W.

Where:

  • Bt records everything that closed in the window,
  • ACSt classifies which of those gross closures were later superseded,
  • Sgross counts all gross closures in the window.
  • Wt counts the number of corrective closures,
  • Snet is the net new accepted closures for the window.

In this model, Snet counts net new accepted closures, not merely closures that remain operative at window end. A corrective closure may survive as the currently operative value and still be subtracted from Snet , because it revises a matter that had already been fixed operationally.

Therefore the window-level episode count is mW = Snet,t = Sgross,t - Wt . Under this convention, observable rework does not merely increase burden while leaving the episode population unchanged; it reduces the net number of surviving episodes and thereby increases the effective average binary selection depth per net accepted closure.

Illustrative example. Consider again the VAR example of a football game. In Episode 1, the referee performs a staged selection process and closes the episode with response x1 to disturbance y1, yielding realized outcome o1 = T(y1, x1) with value v(o1) = goal allowed. Since this closure fixes a target that was not yet operative in the current accounting horizon, Window 1 records: Sgross,1 = 1, W1 = 0, Snet,1 = 1. Now suppose that in Episode 2, VAR reviews the same target τ. The earlier outcome o1 is already operative for that target when the new episode begins. VAR then performs an additional staged selection process, selects response x2, and closes with outcome o2 = T(y2, x2), where v(o2) = no goal ≠ v(o1). This closure is counted as rework because it changes the operative outcome-value assignment of a target that had already been fixed operationally. Therefore Window 2 records: Sgross,2 = 1, W2 = 1, Snet,2 = Sgross,2 − W2 = 0. The significance of W is not merely that an outcome changed. Its significance is that an additional corrective selection had to be performed after a prior selection had already fixed the matter operationally. Under the information-selection reading of Ashby's law, the amount of selection that can be performed is limited by the information available. Hence each counted rework episode is evidence of additional selection burden, and therefore of additional informational burden, beyond the initial fixing of the response. In the VAR example, the final accepted state “no goal” is not obtained by one selection process but by the accumulated effect of two temporally separated staged selections acting on the same target: the referee's initial fixation and VAR's later corrective overrule. That is why the rework count W must be included in the operational model.
Thus W counts those gross closures in the current window that consume bounded selection capacity not to add a net-new accepted fixation, but to revise a target that had already been fixed earlier.
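The gross/net ledger logic, including the ACS supersession rule, can be sketched operationally. window_counts below is a hypothetical helper (not part of the formal model), and the target label "goal?" is illustrative; the two one-closure windows reproduce the VAR example's counts:

```python
def window_counts(closures, operative):
    """Classify one window's gross closures into net closures and rework.

    closures:  time-ordered list of (target, value) gross closure events.
    operative: dict target -> value already in force before the window;
               mutated to the end-of-window ledger state.
    Returns (S_gross, W, S_net).
    """
    s_gross = w = 0
    for target, value in closures:
        s_gross += 1
        # ACS = 1: the target already had an operative value and this closure changes it
        if target in operative and operative[target] != value:
            w += 1
        operative[target] = value
    return s_gross, w, s_gross - w

ledger = {}
# Window 1: referee fixes the target with value "goal allowed" -> net new closure
assert window_counts([("goal?", "goal allowed")], ledger) == (1, 0, 1)
# Window 2: VAR overrules the same target with "no goal" -> counted as rework
assert window_counts([("goal?", "no goal")], ledger) == (1, 1, 0)
```

Because the operative dict persists across calls, a closure from an earlier window remains "in force" for later windows, which is exactly the non-retroactive ledger convention used above.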

Selection-equivalent debit under fixed capacity

Per window t, let:

  • Nt be the anchored atomic action capacity of the window (all atomic action units in the window: non-closure discriminations plus all gross closure emissions, whether they survive or not).
  • Sgross,t be the number of accepted episode-closing outcomes observed in the window.
  • Wt be the number of those gross closures that are later corrected, overruled, withdrawn, replaced, or otherwise fail to survive as net accepted closures within the same window, with 0 ≤ Wt ≤ Sgross,t.
  • Snet,t be the number of net new accepted closures, defined by: Snet,t = Sgross,t - Wt .
  • Qt be the number of binary selections (non-closure atomic units) in the same window: Qt = Nt - Sgross,t .

Because observable rework outcomes consume the same constrained channel capacity while not contributing to net new closures, we treat them operationally as a selection-equivalent debit. This does not mean that rework outcomes are literally binary questions about X. Rather, it means that under fixed channel capacity each rework outcome occupies one atomic action unit that could otherwise have been used either:

  • to perform one further discriminating selection, or
  • to emit an accepted closure that survives correction.

Since wasted outcomes consume the same constrained channel capacity while not increasing Snet, define the effective non-net-progress selection-equivalent debit:

Qeff,t = Qt + Wt

Using Qt = Nt - Sgross,t and Snet,t = Sgross,t - Wt , we obtain:

Qeff,t = ( Nt - Sgross,t ) + Wt = Nt - Snet,t

Thus every observed corrective closure increases the effective debit against the net accepted closures that survive correction within the window.

Define the observed effective selection-equivalent depth per net accepted closure from the execution window by:

k̂eff = Qeff / Snet = (Q + W) / (Sgross − W) = N / (Sgross − W) − 1 = N / Snet − 1

(3)

This quantity is still interpreted in the language of information and selection:

  • Qt counts ordinary non-closure discrimination work,
  • Wt counts the closure-events that did not survive as net accepted closures,
  • therefore Wt raises the effective average number of binary discriminations associated with each net accepted closure.

This quantity k̂eff is an operational upper bound relative to the no-rework baseline, because Snet = Sgross − W ≤ Sgross, and therefore k̂eff = N/Snet − 1 ≥ N/Sgross − 1. Under the single-channel counting model, k̂eff is the observed effective binary question depth per net accepted closure that survives correction within the window, and it is always at least as large as the corresponding no-rework depth. Thus observable rework does not leave the effective question depth unchanged. It increases it mechanically and informationally: if a gross closure later fails to survive correction, then additional discrimination was effectively required before the window's accepted outcome was truly fixed. In that sense, rework raises the effective conditional uncertainty revealed by the execution stream.

When Wt = 0, the expression reduces to the no-rework case: k̂eff = N/Sgross − 1.

So observable rework does not change the general object of estimation; it changes the effective question depth required to reach the accepted outcomes that remain valid after correction within the window.
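The debit accounting reduces to a few lines of arithmetic. A minimal sketch, assuming the fixed-capacity single-channel model; effective_depth is a hypothetical helper and the window counts are illustrative:

```python
def effective_depth(n_capacity, s_gross, w):
    """Observed effective selection-equivalent depth per net accepted closure."""
    s_net = s_gross - w
    assert 0 <= w < s_gross, "need at least one net surviving closure"
    q = n_capacity - s_gross   # ordinary non-closure discriminations
    q_eff = q + w              # rework debited as selection-equivalents
    k_eff = q_eff / s_net
    # identical to N/S_net - 1, and reduces to N/S_gross - 1 when w = 0
    assert abs(k_eff - (n_capacity / s_net - 1)) < 1e-12
    return k_eff

baseline = effective_depth(100, 20, 0)     # no-rework case: 100/20 - 1 = 4.0
with_rework = effective_depth(100, 20, 5)  # 100/15 - 1
assert with_rework > baseline              # rework can only raise the depth
```

The final assertion is the monotonicity claim of this section: under fixed capacity, every gross closure that fails to survive correction strictly increases the depth per surviving closure.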

Operational upper bound

Under the assumptions that (i) each counted non-closure atomic unit contributes at most one bit of discriminating information relevant to fixing the selected response for the current disturbance, (ii) episode closure contains no hidden selection, (iii) episode closure occurs only after the response is fixed, and (iv) observable rework W records gross closures that do not survive as net accepted closures within the window, the quantity k̂eff is treated as a black-box observable count-based proxy for the latent average decision depth ĒW[k]: k̂eff = ĒW[k] + δW

The operational approximation error δW collects all departures between the latent decision-depth object and the observable count-based proxy, including finite-window sampling error, learning drift within the window, non-optimal within-episode search, hidden selection inside closure-events, batching, parallelism, and mismatch between observable rework and the true latent corrective search burden.

If these departures are bounded in magnitude by some window-dependent error term εW ≥ 0, so that |δW| ≤ εW, then combining the latent decision-depth bound with the operational approximation gives H̄W − εW ≤ k̂eff < H̄W + 1 + εW

Thus k ^ eff should be read as an operational observable count-based proxy for the latent window-average Knowledge To Be Discovered, not as a direct Shannon identity for conditional entropy.

In the idealized regime where the counting model is exact and the window is sufficiently large and stable so that δW ≈ 0, this reduces to

H̄W ≤ k̂eff < H̄W + 1

(4)

The meaning of the upper-bound claim is therefore precise. It does not say that W directly measures internal ignorance, and it does not identify observable rework with negative learning. What it says is narrower:

  • gross closures that later fail to survive correction show that the accepted outcome was not yet fully fixed,
  • the corrective events consume bounded channel capacity,
  • therefore the observed average number of binary-discrimination-equivalent action units required per surviving accepted closure must increase.

So the observable count-based proxy remains faithful to information theory as an effective average question depth, while explicitly accounting for the fact that some provisional closures do not survive as net accepted closures within the window.

Knowledge To Be Discovered Conservative Estimator

Under the present assumptions:

  • each non-closure atomic unit contributes at most one bit of discrimination relevant to fixing the accepted closure,
  • closure-events contain no hidden uncounted discrimination,
  • observable rework W records closures that did not survive as net accepted closures within the window,
  • all such actions consume the same bounded atomic channel capacity,
for a bounded observation window with maximum atomic action capacity N, gross closure count Sgross, and observable rework count W, the operational estimator

ĤW := k̂eff = N / (Sgross − W) − 1

(5)

is the observed effective binary decision-depth proxy per net closure that survives correction within the window, under non-retroactive windowing and the single-channel counting model.

ĤW is an operational estimator for the latent window-average Knowledge To Be Discovered H̄W in the specific sense that it aims to approximate the latent average decision depth ĒW[k], which in turn upper-bounds H̄W within one bit under optimal binary questioning.

The interpretation is:

  • the regulator must discriminate among alternatives before an accepted closure can be fixed,
  • gross closures that are later corrected show that the earlier fixing was not yet final relative to the window ledger,
  • therefore the observed execution stream reveals a larger effective number of binary discriminations per surviving accepted closure.

The atomic time unit for the rates is chosen fine-grained enough that each unit reveals ≤ 1 bit of discriminating information (i.e., any higher-bit test is decomposed into multiple counted units).

To connect the estimator to the latent target, write ĤW = ĒW[k] + δW with |δW| ≤ εW capturing finite-window error and model mismatch.

Hence, in the idealized regime of optimal binary discrimination and representative large-window averaging,

H̄W − εW ≤ ĤW < H̄W + 1 + εW
or
H̄W ≤ ĤW < H̄W + 1
up to sampling and nonstationarity error.

Hence the estimator should be interpreted as a window-level operational proxy for latent average response-selection uncertainty, with a one-bit information-theoretic bracket only after passing through the latent decision-depth quantity and only up to the residual approximation term ε W . Accordingly, the one-bit bracket and the count-based proxy should be interpreted as statements about response-fixation decision depth under the stated counting assumptions, not as a direct theorem about success-equivalence classes of acceptable outcomes.

We rearrange the bounds to be:

ĤW ≥ H̄W > ĤW − 1
or
N / (Sgross − W) − 1 ≥ H̄W > N / (Sgross − W) − 2

Both inequalities point the same way: Knowledge To Be Discovered is bounded within a unit interval.
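The rearranged bounds turn the observable counts (N, Sgross, W) into an interval estimate for the latent H̄W. A small sketch; entropy_bracket is a hypothetical helper and the window counts are illustrative:

```python
def entropy_bracket(n_capacity, s_gross, w):
    """One-bit bracket for the latent H_bar_W implied by H_hat = N/(S_gross - W) - 1."""
    h_hat = n_capacity / (s_gross - w) - 1
    # H_hat >= H_bar_W > H_hat - 1, i.e. H_bar_W lies in (h_hat - 1, h_hat]
    return h_hat - 1, h_hat

lo, hi = entropy_bracket(64, 10, 2)  # H_hat = 64/8 - 1 = 7
assert (lo, hi) == (6.0, 7.0)        # H_bar_W is bracketed within (6, 7]
```

Under the stated idealizations, the width of the interval is always exactly one bit, regardless of the counts; only its location moves with (N, Sgross, W).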

The estimator is “conservative” in the following narrow operational sense: observable rework makes the operational estimator weakly larger than the corresponding no-rework baseline. That is the conservative direction built into the estimator: under fixed capacity, failures of gross closures to survive correction can only increase the observed action-units-per-net-closure ratio. This is an operational monotonicity statement. It does not by itself mean that ĤW is always an exact upper bound on the true internal conditional entropy in arbitrary non-ideal systems.

Thus the conservative estimator remains anchored to the theory of information as an operational proxy for latent average response-fixation decision depth, with upper-side conservative behavior relative to the no-rework baseline. It remains faithful to the information-and-selection reading of Ashby's framework because it preserves the central operational idea: under bounded capacity, more corrective overrule means more effective questioning per surviving accepted closure. But the exact one-bit entropy bracket belongs to the latent optimal-questioning object, not directly to the raw count ratio.

The result is operational: it links an abstract information-theoretic quantity H ¯ W to directly observable execution counts ( N , S , W ) , without assuming access to the true distribution. It provides an operational upper bound on conditional entropy under idealized assumptions (optimal binary selection, single shared channel, no idle capacity). It does not claim to recover the true entropy exactly in arbitrary systems.

Increase in effective question depth due to observable rework

For decomposition purposes, define the corresponding no-observable-rework baseline while holding the gross observed closure count Sgross,t fixed:

H̄W,base,t := Nt / Sgross,t − 1

This baseline is a conditional comparison only. It asks: how much larger is the observed effective question depth when some gross closures fail to survive as net accepted closures within the same window? It is not a universal causal counterfactual about what would have happened in a different world.

Accordingly, define the increase in effective question depth attributable to observable rework, under the fixed-gross comparison, by:

ΔĤ_W,t := H̄_W,net,t − H̄_W,base,t = Nt (1/Snet,t − 1/Sgross,t) = Nt·Wt / (Sgross,t · (Sgross,t − Wt))

This expression is always nonnegative, and for fixed Nt and Sgross,t, it increases monotonically with Wt. Thus, when more gross closures fail to survive correction, the observed effective question depth per net accepted closure must increase.
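As a numeric sanity check of this decomposition, here is a minimal Python sketch (the function name is ours, not from the source); symbols follow the text: Nt action units, Sgross gross closures, Wt closures that fail to survive, with Snet = Sgross − Wt.

```python
def delta_H_rework(N_t: float, S_gross: float, W_t: float) -> float:
    """Increase in effective question depth attributable to observable rework,
    under the fixed-gross comparison:
    Delta = N_t * (1/S_net - 1/S_gross) = N_t*W_t / (S_gross*(S_gross - W_t))."""
    S_net = S_gross - W_t
    if S_net <= 0:
        raise ValueError("net closures must be positive")
    return N_t * (1.0 / S_net - 1.0 / S_gross)

# Nonnegative, and monotonically increasing in W_t for fixed N_t and S_gross:
deltas = [delta_H_rework(100, 20, w) for w in range(0, 10)]
assert deltas[0] == 0.0
assert all(b > a for a, b in zip(deltas, deltas[1:]))
```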

The interpretation is operational and information-theoretic:

  • Sgross,t counts all observed closure-events in the window,
  • Wt counts those that do not survive as net accepted closures,
Therefore the surviving accepted outcomes are supported by fewer net closures on the same bounded channel, which increases the average number of binary-discrimination-equivalent action units associated with each surviving accepted closure.

So the role of Wt here is not to represent internal ignorance directly, nor to determine whether learning was positive or negative. Its role is narrower and explicit: observable rework increases the effective question depth revealed by the execution stream, because some previously counted closures did not remain accepted and therefore additional discrimination was effectively required before the window's net accepted closures were fixed.

It should not be read as an identity between:

  • artifact-level observable rework Wt, and
  • internal net coupling loss Lt.

A window with Wt > 0 may still exhibit net positive learning in the internal ledger, and a window with ΔtI < 0 need not expose all of that loss through visible repair outcomes in the same window.

In summary, observable rework affects the estimator Ĥ_W implicitly through the reduction of Snet,t relative to Sgross,t. The quantity ΔĤ_W,t isolates exactly that increase, under the fixed-gross comparison, as the increment in effective question depth attributable to observable rework.

Knowledge-Discovery Efficiency (KEDE) Metric

Now we generalize the Knowledge-Discovery Efficiency (KEDE), a scalar metric that quantifies how efficiently a system closes the gap between the variety demanded by its environment and the variety embodied in its prior knowledge [28].

We rearrange formula (5) and, for notational simplicity, write H_X|Y instead of Ĥ_X|Y, obtaining the formula for the Knowledge-Discovery Efficiency (KEDE) metric [28]:

KEDE = 1 / (1 + H_X|Y) = (S − W) / N

(6)

KEDE is an acronym for KnowledgE Discovery Efficiency. It is pronounced [ki:d].

Efficiency here means that the smaller the average number of selections made per outcome, the better. In other words, the less knowledge there is to be discovered per outcome, the more efficient the knowledge discovery process is.

KEDE has the properties:

  • It is a function of the missing information H.
  • Its maximum value corresponds to H = 0, i.e. there is no need to make selections because all knowledge is already discovered.
  • Its minimum value corresponds to H = infinity, i.e. we have no knowledge to start with.
  • It is continuous on the closed interval [0, 1], which makes it very useful as a percentage. This is because we need to be able to rank knowledge discovery processes by efficiency: the best-ranked process will have 100% and the worst 0%. That is practical, and people are used to such a scale.
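These properties follow directly from the definition; a minimal sketch, assuming only KEDE = 1/(1 + H):

```python
def kede(H: float) -> float:
    """KEDE maps missing information H in [0, inf) onto the interval (0, 1]."""
    return 1.0 / (1.0 + H)

assert kede(0.0) == 1.0      # no missing knowledge: maximum efficiency
assert kede(1.0) == 0.5      # one bit to discover per outcome
assert kede(1e12) < 1e-9     # H -> infinity drives KEDE toward 0
```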

What does KEDE measure?

  • Regulation consumes execution capacity to cope with missing knowledge.
  • Knowledge discovery converts that consumed capacity into persistent internal variety, reducing future consumption of the execution capacity.
  • KEDE measures the efficiency of this conversion: how much execution capacity is spent on discovery versus production, under a successful-adaptation regime.

KEDE effectively converts the knowledge to be discovered H(X|Y), which can range from 0 to infinity, into a bounded scale between 0 and 1.

KEDE is a measure of how much of the required knowledge for completing tasks is covered by the prior knowledge.

Due to its general definition KEDE can be used for comparisons between organizations in different contexts. For instance to compare hospitals with software development companies! That is possible as long as KEDE calculation is defined properly for each context. In what follows we will define KEDE calculation for the case of knowledge workers who produce textual content in general and computer source code in particular.

Anchoring KEDE to Natural Constraints

In our model, N is always the theoretical maximum action rate (selections + outcomes) in an unconstrained environment, and S is the observed outcome rate under specific conditions over a given interval.

A key question is how to assign a natural constraint to N. That is, what constitutes an appropriate reference value for the maximum action rate (selections + outcomes)?

We may turn to physics for an instructive analogy. A quantum (plural: quanta) represents the smallest discrete unit of a physical phenomenon. For instance, a quantum of light is a photon, and a quantum of electricity is an electron. In this context, the speed of light in a vacuum serves as a fundamental upper bound for N. However, identifying an analogous natural constraint for human activity—particularly knowledge work—presents greater challenges.

Consider the example of typing. Here, the quantum can reasonably be defined as a symbol, since it is the smallest discrete unit of text. A symbol may be a letter, number, punctuation mark, or whitespace character. To determine the appropriate bin width Δt, we refer to empirical data on the minimum time required to produce a single symbol. Typing speed has been subject to considerable research. One of the metrics used for analyzing typing speed is the inter-key interval (IKI), which is the difference in timestamps between two keypress events. The IKI corresponds to the symbol duration time t, so we can use IKI research to find t. Studies have reported an average IKI of 0.238 seconds [26], yielding a maximum human typing rate of approximately N = 1/t = 1/0.238 ≈ 4.2 symbols per second.

A similar approach can be applied to tasks such as furniture assembly. In this case, a plausible quantum is a single screw tightened, since it represents a minimal, repeatable unit of outcome. We then identify Δt as the average time required to tighten one screw. Empirical studies report that this task typically takes between 5 and 10 seconds [34]. Using the upper bound on duration, we obtain a conservative estimate of the maximum screw-tightening rate: N = 1/t = 1/10 = 0.1 screws per second.

This methodology offers a principled way to estimate N using domain-specific quanta and empirically grounded time durations, enabling the application of our model to a broad range of human tasks.
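The anchoring recipe above reduces to N = 1/t for a domain quantum of duration t; a minimal sketch (the helper name is ours, not from the source):

```python
def max_action_rate(quantum_duration_s: float) -> float:
    """Natural cap N: one quantum per quantum-duration, i.e. N = 1/t."""
    return 1.0 / quantum_duration_s

typing_N = max_action_rate(0.238)   # ~4.2 symbols/second, from the IKI study [26]
screws_N = max_action_rate(10.0)    # 0.1 screws/second, upper-bound 10 s per screw
assert abs(typing_N - 4.2) < 0.01
assert screws_N == 0.1
```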

The next question concerns the appropriate definition of outcome for measuring S and N.

Both N and S can always be discretized—or “binned”—in a way that preserves the total information rate, regardless of whether the outcome arises from natural processes, human behavior, or machines. By choosing a bin width Δt small enough (e.g., milliseconds), the range of possible tangible outcomes within each bin shrinks dramatically. This reduced range leads to less uncertainty in each bin, which compensates for the smaller time interval. Yet the ratio

total outcome in bin / Δt

remains an accurate measure of information rate.

As Δt becomes smaller, the measurements of S and N become more precise, as they reflect outcomes over finer time intervals. But how small should Δt be? This dilemma is resolved by considering the granularity of the outcomes themselves. The set E of outcomes can be thought of as the effects of the regulation process — the resulting states after the regulator responds to disturbances. In our model E is a sequence of {0,1}, where 0 = wrong outcome (failure to regulate) and 1 = acceptable outcome. The presence of a concrete outcome thus leads to a natural binning of the observations. It also enables a clear distinction between signal (the entropy associated with producing the outcome) and noise (the residual variability unrelated to success or failure).

For example, two distinct symbols typed (e.g., ‘a' vs. ‘b') are clearly different outcomes. However, if one symbol is typed in 91 milliseconds and another in 92 milliseconds, this minute variation is inconsequential to the outcome. Such timing fluctuations are typically unintentional, irrelevant to task performance, and should not be considered part of the outcome. In practical terms, if the theoretical upper bound N is known (for instance, 4.2 symbols per second as derived from human typing speed) and the observed rate is S = 1 symbol per second, then time should be partitioned into one-second bins. Each bin then yields a single outcome: either 1 (a symbol was successfully typed) or 0 (no symbol typed or incorrect input).

This binning principle generalizes beyond typing. Whether analyzing foot strikes in trail running (where negligible spatial change occurs over milliseconds) or the discrete moves in solving a Rubik's cube (where each turn resolves multiple potential states into a single action), binning ensures that no intermediate state need be modeled explicitly.

Applications

The knowledge-centric perspective builds on Ashby's Law of Requisite Variety by emphasizing that successful outcomes depend not only on a system's range of possible responses, but also on its ability to select the right response for each disturbance. This requires internal “system knowledge” that maps disturbances to appropriate actions. As Francis Heylighen proposed in his “Law of Requisite Knowledge,” effective regulation demands more than variety—it demands informed selection[29]. This knowledge-centric lens provides a foundation for analyzing how systems—biological, technical, or organizational—achieve control not just through options, but through understanding. The model we present operationalizes this perspective by estimating the informational requirements a system must satisfy to achieve its observed level of regulatory performance.

In what follows, we apply this knowledge-centric perspective to a range of domains, including motor tasks and manual assembly, industrial assembly lines, software development processes, the speed of light in a medium, intelligence testing, and sports performance. In each case, the model enables us to estimate, in bits of information, the amount of knowledge a system must lack to produce its observed level of performance. By quantifying the knowledge to be discovered H(X|Y), we assess how much uncertainty there was in the system's ability to select appropriate responses. This allows us to compare systems not by tangible outcomes, but by the hidden knowledge structures required to achieve them, offering a unified lens for analyzing adaptation, skill, and control across diverse contexts.

Tightening screws

We can apply our model to motor tasks such as furniture assembly. In this context, a natural unit of outcome — or “quantum” — is the tightening of a single screw.

Skilled workers engaged in manual assembly tasks can typically insert and tighten standard screws at a rate of 6–12 screws per minute under optimal, repetitive conditions, such as those found in furniture construction or industrial assembly lines. In contrast, automated screw-tightening machines can achieve significantly higher rates, often between 30 and 60 screws per minute [34]. More complex manual tasks, such as high-torque applications involving ratchets or Allen keys, typically reduce the rate to 2–4 screws per minute due to the increased effort and precision required. In surgical or medical contexts, such as orthopedic screw insertion, accuracy and the avoidance of overtightening are paramount; here, rates often fall to 1–2 screws per minute, or approximately one screw every 30–60 seconds [46].

Context                  Typical Rate (screws/minute)   Notes
Automated (machine)      30–60                          For comparison, not manual
Fast, repetitive tasks   6–12                           Assembly line, minimal torque required
High-torque/manual       2–4                            Metalwork, ratchets, Allen keys
Surgical/precision       1–2                            Orthopedic, high accuracy, low speed

The key observation is that rates decrease as torque, task complexity, or required precision increases. If we take the machine rate as the maximum possible outcome N and the observed human rate as S, we can estimate the average number of bits of information H(X|Y) that the human operator must process per action.

KEDE = S/N

H(X|Y) = 1/KEDE − 1 = N/S − 1 = 60/12 − 1 = 4 bits/screw

This implies that the human must absorb approximately 4 bits of information, on average, to tighten a single screw under typical conditions.

The rate at which a person tightens screws depends on various factors, including:

  • Screw type and size
  • Material being fastened
  • Required torque
  • Tool used (screwdriver, ratchet, etc.)
  • Operator skill and fatigue
These constitute the disturbance variety D faced by the human operator. The operator, acting as a regulator, responds with selections from their internal repertoire of skills — the regulatory variety R.

This interpretation aligns with existing research, which suggests that task difficulty directly influences the amount of information a task imparts [47, 48]. When difficulty is appropriately matched to the individual's skill level, the task yields maximal informational value [49], and the time required reflects the interaction between task complexity and the individual's regulatory capacity [50].

Using our model, we transform a sequence of real-world actions in furniture assembly into a granular, time-based measure of regulatory capacity. This enables us to quantify — in bits — how much variety the individual must absorb in order to successfully complete the task.

Typing the longest English word

Let's use an example scenario to see Ashby's law applied to human cognition and knowledge work.

For that we'll have myself executing the task of typing on a keyboard the word “Honorificabilitudinitatibus”. It means “the state of being able to achieve honours” and is mentioned by Costard in Act V, Scene I of William Shakespeare's “Love's Labour's Lost”. With its 27 letters “Honorificabilitudinitatibus” is the longest word in the English language featuring only alternating consonants and vowels.

The way I will execute this task is to go to the "play text" or "script" of “Love's Labour's Lost”, look up the word and type it down. The manual part of the task is to type 27 letters. The knowledge part of the task is to know which are those 27 letters.

In order to track the knowledge discovery process I will put "1" for each time interval when I have a letter typed and "0" for each time interval when I don't know what letter to type.

I start by taking a good look at the word “Honorificabilitudinitatibus” in the script of “Love's Labour's Lost”. That takes me two time intervals. Then I type the first letters “H”, “o”, and “n”. I continue typing letter after letter: “o”, “r”. At this point I cannot recall the next letter. What should I do? I am missing information, so I open up the script of “Love's Labour's Lost” and look up the word again. Now I know what the next letter to type is, but acquiring that information took me one time interval. This time I have remembered more letters, so I am able to type “i”, “f”, “i”, “c”, “a”, “b”, “i”. Then again I cannot continue, because I have forgotten the next letters of the word, so I have to look it up again in the script. That takes two more time intervals. Now I can continue my typing with “l”, “i”, “t”. At this point I stop again because I am not sure what the next letters to type are, so I have to think about it. That takes one time interval. I continue my typing with “u”, “d”, “i”. Then I stop again because I have again forgotten the next letters, so I have to look the word up again in the script. That takes two more time intervals. Now I can continue typing “n”, “i”. At this point I cannot recall the next letter, so I have to look it up again in the script. That takes two more time intervals. Once I know the next letter, I can continue typing “t”, “a”, “t”, “i”, “b”, “u”, “s”. Eventually I am done!

At the end of the exercise I have the word “Honorificabilitudinitatibus” typed and along with it a sequence of zeros and ones.



    H o n o r   i f i c a b i     l i t   u d i     n i     t a t i b u s
0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1

In the table we have separated the manual work of typing from the knowledge work of thinking about what to type.

We made visible both the manual work and the knowledge discovery parts of a Knowledge Discovery process.

The first row of the table shows the knowledge I manually transformed into a tangible outcome — in this case the longest English word. The second row shows the way I discovered that knowledge. There is a "0" for each time interval when I was missing information about what to type next, and a "1" for each time interval when I had prior knowledge about what to type next. Each "0" represents a selection I needed to make in order to acquire the missing information about what letter to type next. Each "1" represents prior knowledge.

We know that there is knowledge applied when we see the tangible outcome of the process. We know there was knowledge discovered when we see there was at least one selection made.

In the exercise above we witnessed the discovery and transformation of invisible knowledge into visible tangible outcome.

KEDE calculation

We can calculate the KEDE for this sequence of outcomes.

KEDE = S/N = 27/37 ≈ 0.73

We can also calculate the knowledge discovered H(X|Y) in bits of information.

H(X|Y) = N/S − 1 = 37/27 − 1 ≈ 0.37

We've turned a real-world sequence of action and hesitation into a fine-grained, time-based measurement of regulatory capacity — effectively measuring how much variety I needed to absorb with external help i.e. my knowledge discovered.
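The calculation can be reproduced directly from the binary trace of the exercise; a minimal Python sketch:

```python
# Binary execution trace from the typing exercise:
# "1" = a letter typed (prior knowledge), "0" = a selection/lookup needed.
trace = "0011111011111110011101110011001111111"

N = len(trace)          # 37 time intervals: the total action capacity
S = trace.count("1")    # 27 letters typed: the observed outcomes
kede = S / N            # Knowledge-Discovery Efficiency
H = N / S - 1           # knowledge discovered, in bits per outcome

assert (N, S) == (37, 27)
assert round(kede, 2) == 0.73
assert round(H, 2) == 0.37
```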

Measuring software development

In order to use the KEDE formula (6) in practice we need to know both S and N. We can count the actual number of symbols of source code contributed straight from the source code files. For N we want to use some naturally constrained value.

N is the maximum number of symbols that could be contributed for a time interval by a single human being.

In the formula for N we want to use naturally constrained values. N is the product of the working time T and the maximum symbol rate r:

N = T × r

To estimate these, we pick T = 8 hours of work, because that is the standard length of a work day for a software developer.

To calculate the value of r we need to pick the symbol duration t.

The value of the symbol duration time t is determined by two natural constraints:

  1. the maximum typing speed of human beings
  2. the capacity of the cognitive control of the human brain

Typing speed has been subject to considerable research. One of the metrics used for analyzing typing speed is inter-key interval (IKI), which is the difference in timestamps between two keypress events. We see that IKI is defined equal to the symbol duration time t. Hence we can use the research of IKI to find the symbol duration time t. It was found that the average IKI is 0.238s [26]. There are many factors that affect IKI [6]. It was also found that proficient typing is dependent on the ability to view characters in advance of the one currently being typed. The median IKI was 0.101s for typing with unlimited preview and for typing with 8 characters visible to the right of the to-be-typed character but was 0.446s with only 1 character visible prior to each keystroke [7]. Another well-documented finding is that familiar, meaningful material is typed faster than unfamiliar, nonsense material[8]. Another finding that may account for some of the IKI variability is what may be called the “word initiation effect”. If words are stored in memory as integral units, one may expect the latency of the first keystroke in the word to reflect the time required to retrieve the word from memory[55].

Cognitive control, also known as executive function, is a higher-level cognitive process that involves the ability to control and manage other cognitive processes; it permits the selection and prioritization of information processing in different cognitive domains to reach the capacity-limited conscious mind. Cognitive control coordinates thoughts and actions under uncertainty. It is like the "conductor" of the cognitive processes, orchestrating and managing how they work together. Information theory has been applied to cognitive control by studying its capacity in terms of the amount of information that can be processed or manipulated at any given time. Researchers found that the capacity of cognitive control is approximately 3 to 4 bits per second [32][33]. That means cognitive control, as a higher-level function, has a remarkably low capacity.

Based on the above research we get:

  1. Maximum typing speed of human beings: r = 1/t = 1/0.238 ≈ 4.2 symbols per second.
  2. Capacity of the cognitive control of the human brain: approximately 3 to 4 bits per second. Since we assume one question equals one bit of information, we get 3 to 4 questions per second.
  3. Asking questions is an effortful task, and humans cannot type at the same time. If a symbol was NOT typed, then a question was asked. That means the question rate equals the symbol rate.
Since the question rate needs to equal the symbol rate, and 4.2 symbols per second is higher than 3 to 4 bits per second, we need a symbol rate between 3 and 4 symbols per second.

In order to get a round value of maximum symbol rate N of 100 000 symbols per 8 hours of work we pick symbol duration time t to be 0.288 seconds. That is a bit larger than what the IKI research found but makes sense when we think of 8 hours of typing. Having t of 0.288 seconds makes a symbol rate r of 3.47 symbols per second. That is between 3 and 4 and matches the capacity of the cognitive control of the human brain.

We define CPH as the maximum rate of characters that could be contributed per hour. Since r is 3.47 symbols per second, we get a CPH of 12 500 symbols per hour. We substitute T = h and r = CPH, and the formula for N becomes:

N = h × CPH

where h is the number of working hours in a day and CPH is the maximum number of characters that could be contributed per hour. We define h to be eight hours and get N to be 100 000 symbols per eight hours of work.

Total working time consists of four components:

  • Time spent typing (coding)
  • Time spent figuring out WHAT to develop
  • Time spent figuring out HOW to code the WHAT
  • Time doing something else (NW)

Let us assume an ideal system where the time spent doing something else TNW is zero. Using the new formula for N, the formula for H becomes:

H = (h × CPH) / S − 1

Note that since N is calculated over h working hours, S also needs to be counted over the same period.

We see that the more symbols of source code contributed during a time interval the less missing information was there to be acquired. We want to compare the performance of different software development processes in terms of the efficiency of their knowledge discovery processes. Hence we rearrange the formula to emphasize that.

S / (h × CPH) = 1 / (1 + H)

(7)

The right-hand side is the KEDE we defined earlier. Thus we define an instance of the general KEDE metric introduced earlier, for the case of knowledge workers who produce tangible outcomes in the form of textual content:

KEDE = S / (h × CPH)

(8)

KEDE from (8) contains only quantities we can measure in practice. KEDE also satisfies all the properties we defined earlier: it has a maximum value of 1 and a minimum value of 0; it equals 0 when H is infinite; it equals 1 when H is zero; and it is anchored on a natural constraint—the maximum typing speed of a human being.

If we convert the KEDE formula into percentages then it becomes:

KEDE = S / (h × CPH) × 100%

(9)

We can use KEDE to compare the knowledge discovery efficiency of software development organizations.
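Formula (9) can be sketched for illustrative numbers (the 5,000-symbol daily contribution below is a hypothetical figure, not from the source):

```python
CPH = 12_500  # natural cap: maximum characters contributed per hour (from the text)

def kede_percent(S: int, h: float = 8.0) -> float:
    """KEDE as a percentage: S symbols contributed over an h-hour work day,
    relative to the natural cap N = h * CPH (formula 9)."""
    return 100.0 * S / (h * CPH)

# Hypothetical example: 5,000 symbols of source code in an 8-hour day.
assert kede_percent(5_000) == 5.0
assert kede_percent(100_000) == 100.0  # theoretical maximum, N symbols contributed
```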

Testing Intelligence

Today all measure intelligence by the power of appropriate selection (of the right answers from the wrong). The tests thus use the same operation as is used in the theorem on requisite variety, and must therefore be subject to the same limitation. (D, of course, is here the set of possible questions, and R is the set of all possible answers). Thus what we understand as a man's “intelligence” is subject to the fundamental limitation: it cannot exceed his capacity as a transducer. (To be exact, “capacity” must here be defined on a per-second or a per-question basis, according to the type of test.)[3]

We can also apply our model to the testing of human and AI intelligence. We infer this capacity from performance under variety — i.e., how many different problems a system or a person can solve correctly.

The dominant mathematical models for testing intelligence by the number of answered problems are benchmark datasets like MMLU, GSM8K, MATH, and FrontierMath. These models measure intelligence by the raw count or percentage of correctly solved problems, with more advanced benchmarks designed to minimize guessing and require deep reasoning.

From the knowledge-centric perspective:

  • The disturbances are the questions Q = {q1, q2, q3, ..., qn}
  • The person gives responses R = {r1, r2, r3, ..., rn}
  • The outcomes are E = {e1, e2, e3, ..., en}, with each ei ∈ {0, 1}
So: intelligence is the capacity to consistently produce 1s in E, despite the variety in D.

Several mathematical models and benchmark datasets are used to evaluate intelligence—especially artificial intelligence (AI)—by measuring the number and complexity of math problems answered correctly. These models serve as standardized tests for both AI and, by analogy, human intelligence[52].

Massive Multitask Language Understanding (MMLU):

  • MMLU is a widely used benchmark that tests AI models on a broad range of subjects, including mathematics at various levels (high school, college, abstract algebra, formal logic).
  • The test is typically formatted as multiple-choice questions, and performance is measured by the percentage of correct answers out of the total number of questions
  • For example, advanced AI models have achieved up to 98% accuracy on math sections of MMLU, indicating high proficiency in standard math tasks but not necessarily deep reasoning

Grade School Math 8K (GSM8K)

  • GSM8K is a dataset of 8,500 high-quality, grade school-level word problems designed to test logical reasoning and basic arithmetic skills.
  • Evaluation is based on exact match accuracy: the number of problems answered exactly correctly divided by the total number attempted
  • This benchmark is used to assess step-by-step reasoning and the ability to handle linguistic diversity in problem statements.

MATH (Mathematics Competitions Dataset)

  • MATH consists of problems from high-level math competitions (e.g., AMC 10, AMC 12, AIME), focusing on advanced reasoning rather than rote computation.
  • Performance is measured by the percentage of correct answers, with human experts (e.g., IMO medalists) providing a reference for top-level performance
  • The dataset is challenging for both humans and AI, with LLMs typically scoring much lower than expert humans.

FrontierMath[53]

  • FrontierMath is a new benchmark featuring hundreds of original, expert-level math problems spanning major branches of modern mathematics.
  • Problems are designed to be "guessproof" and require genuine mathematical understanding, with automatic verification of answers
  • The benchmark is used to assess how well AI models can understand and solve complex mathematical problems, similar to human performance.

In human intelligence testing, psychometric models such as IQ tests also use the number of correctly answered problems as a key metric. These tests are standardized, and the raw score (number of correct answers) is often converted into a scaled score or percentile.

As an example we will use the Exact Match metric as the evaluation method[52]. Given that each question in our benchmark dataset has a single correct answer and the model produces a response per query, Exact Match ensures a rigorous evaluation by comparing the extracted answer to the ground truth.

Let ŷi represent the extracted answer from the model's output for the ith question, and let yi be the corresponding ground truth answer. The Exact Match accuracy is computed as:

Exact Match (%) = [ Σ_{i=1}^{N} 𝟙( normalize(ŷi) = normalize(yi) ) / N ] × 100

where:

  • N is the total number of evaluated questions.
  • 𝟙() is the indicator function, returning 1 if the extracted model response matches the ground truth after preprocessing, and 0 otherwise.
  • normalize() is a function that standardizes formatting, trims spaces, and normalizes numerical values.

The knowledge discovery efficiency of an LLM can be calculated as:

Exact Match accuracy = S/N = KEDE

where S is the number of correct answers and N is the total number of evaluated questions.

Let's pick the case of the performance of GPT-4o on the MATH benchmark, which achieved a significantly lower accuracy of 64.88%, lagging behind its peer models [52]. Now we can calculate the average knowledge discovered H(X|Y):

H(X|Y) = 1/KEDE − 1 = 100/64.88 − 1 ≈ 0.54 bits/problem
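The conversion from benchmark accuracy to bits of knowledge per problem can be sketched as follows (function name ours, not from the source):

```python
def bits_per_problem(accuracy_percent: float) -> float:
    """Knowledge to be discovered per problem: H(X|Y) = 1/KEDE - 1,
    with KEDE equal to the Exact Match accuracy."""
    return 100.0 / accuracy_percent - 1.0

# GPT-4o on the MATH benchmark, 64.88% accuracy [52]:
assert round(bits_per_problem(64.88), 2) == 0.54
# Perfect accuracy means no knowledge left to discover:
assert bits_per_problem(100.0) == 0.0
```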

Basketball Game

We can also use this model to assess the performance of a basketball player.

  • The timeframe is a basketball game.
  • We observe N total shot attempts.
  • S of them are successful (shot made).
  • We record a binary outcome sequence E ∈ {0,1}^N.
  • The empirical success rate θ = S/N is our observed probability of success.

Interpretation using Ashby's Law

The basketball shot is a regulation problem: the player must control their body and respond to the game environment to produce the desired outcome. The player is faced with a series of disturbances (D) in the form of different shots to make under different conditions. The player responds with a selection, drawn from their internal skills (regulatory variety R) in the form of different shooting techniques. Each shot is uncertain whether it will be successful. The outcome E is whether the shot is made (1) or missed (0).

Over N shots, the success rate

θ = S/N

reflects how often the player's internal variety is sufficient to absorb the variety in the environment — an operational measure of regulatory success.

In this case, θ becomes a practical proxy for how often the regulator (player) has sufficient internal variety to absorb the disturbance presented by the game. However, it is important to note that this is a simplified model and does not account for all the complexities of basketball performance. For example, the player may have different success rates depending on the type of shot, the position on the court, or the level of defense. These factors can all affect the player's ability to regulate their performance and should be considered when interpreting the results. Thus θ is a useful heuristic for P(E=1), but the full picture includes the quality of the mapping, not just the quantity.

Applying the Model

NBA keeps track of field goal attempts and makes for each player. The most field goal attempts by a player in a single NBA game is 63, achieved by Wilt Chamberlain during his legendary 100-point game against the New York Knicks on March 2, 1962. We take this as the natural constraint, so N = 63. We can also take the number of successful shots S = 36, which is the most field goals made in a single game by a player [13].

We can calculate the KEDE for this sequence of outcomes.

KEDE = S/N = 36/63 ≈ 0.571

We can also calculate the knowledge discovered H(X|Y) in bits of information.

H(X|Y) = N/S - 1 = 63/36 - 1 = 0.75

That means that the player needed to absorb 0.75 bits of information on average to make the shot.

We've turned a real-world sequence of basketball shots into a fine-grained, time-based measurement of a regulatory capacity — effectively measuring how much variety the player needed to absorb.
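The computation above can be sketched in a few lines of Python. This is only an illustration: the shot-by-shot sequence below is fabricated to match the S = 36, N = 63 totals, since the actual order of makes and misses is not part of the record, and the function name is ours.

```python
def kede_and_residual(shots):
    """Given a binary outcome sequence E in {0,1}^N, return
    (theta, h): theta = S/N is the empirical success rate (the
    KEDE estimate) and h = N/S - 1 is the average knowledge to
    be discovered per success, in bits."""
    n = len(shots)
    s = sum(shots)
    theta = s / n
    h = n / s - 1
    return theta, h

# Chamberlain's record game: S = 36 makes out of N = 63 attempts.
# The true shot-by-shot order is not recorded, so we fabricate a
# sequence with the right totals purely for illustration.
shots = [1] * 36 + [0] * 27
theta, h = kede_and_residual(shots)
print(f"KEDE   = {theta:.3f}")        # 36/63 ≈ 0.571
print(f"H(X|Y) = {h:.2f} bits/shot")  # 63/36 - 1 = 0.75
```

The same function applies unchanged to any binary outcome record, which is what makes the black-box reading of KEDE convenient in practice.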

We can also use this model to assess the performance of a basketball team. In this case the success rate coincides with the team's field goal percentage (FG%), the proportion of made shots to total shots that a player or a team takes in games. There is a statistical distribution for NBA field goal percentage (FG%)[10]. Analysts and researchers often study the distribution of FG% across players or teams to understand scoring efficiency and trends[11]. The NBA record for the highest FG% in a single game by a team is 69.3%, set by the Los Angeles Clippers on March 13, 1998, when they made 61 of 88 shots[12].

For example, in the 2023-24 season, team FG% ranged from about 43.5% (lowest) to 50.6% (highest), with the league average typically falling in the mid-to-high 40% range[11]. If we take an average FG% of 45%, we can calculate the average knowledge discovered H(X|Y).

H(X|Y) = 1/KEDE - 1 = 1/FG% - 1 = 1/0.45 - 1 ≈ 1.22

That means that a team needed to absorb 1.22 bits of information on average to make a shot.

Assembly Line

We can also use this model to assess the knowledge discovery efficiency of an assembly line.

The assembly line is a system that transforms raw materials into finished products. The assembly line has a set of disturbances (D) in the form of different raw materials, machines, and processes. The assembly line responds with a selection, drawn from its internal structure (R) in the form of different machines, processes, and workers.

From a knowledge-centric perspective, most of the knowledge discovery happens in the design phase of the assembly line. This is the planning for design, fabrication and assembly. This activity has also been called design for manufacturing and assembly (DFM/A) or sometimes predictive engineering. It is essentially the selection of design features and options that promote cost-competitive manufacturing, assembly, and test practices[51]. Thus most of the disturbances D are already absorbed by the design of the assembly line. That means the workers operate with most of the knowledge already built into the assembly line and the operational procedures.

Assembly line efficiency (AE) is the ratio of the outcome to the maximum possible outcome, often expressed as a percentage.

The efficiency of the assembly line can be calculated as:

AE = S/N = KEDE
where S is the actual outcome and N is the maximum possible outcome.

We can assume that an assembly line is designed to produce a certain number of successful products (S) with a maximum rate of N products per hour. So, for example, if a shoe manufacturer has an actual outcome of 100 shoes per day and a maximum potential outcome of 120 shoes per day, their production line efficiency would be 100/120 ≈ 83%. Now, we can calculate the average knowledge discovered H(X|Y).

H(X|Y) = 1/KEDE - 1 = 1/AE - 1 = 120/100 - 1 = 0.2 bits/shoe
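As a minimal sketch of the shoe-factory arithmetic (the 100 and 120 figures are the hypothetical ones from the example, and the function name is ours):

```python
def assembly_metrics(actual, maximum):
    """Assembly-line efficiency AE = S/N and the average knowledge
    to be discovered per unit, H = 1/AE - 1 (bits)."""
    ae = actual / maximum
    h = maximum / actual - 1   # same as 1/AE - 1
    return ae, h

# Hypothetical shoe factory: 100 shoes/day actual output,
# 120 shoes/day maximum potential output.
ae, h = assembly_metrics(actual=100, maximum=120)
print(f"AE     = {ae:.0%}")           # ≈ 83%
print(f"H(X|Y) = {h:.1f} bits/shoe")  # 120/100 - 1 = 0.2
```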

To optimize the AE, companies can apply DFA guidelines, such as minimizing the number and variety of parts, standardizing the fasteners and connectors, and simplifying the assembly sequence and orientation[51].

Interpreting the results involves a comprehensive analysis of the data to understand where and why inefficiencies occur. In general, the higher the AE, the better the design. On the other hand, AE close to 100% might indicate under-utilised capacity. It's essential to compare high efficiency with industry capacity standards to determine if an increase in production is feasible and beneficial.

If AE is consistently below industry benchmarks, this could highlight several potential issues:

  • Machinery: It may indicate that machines are outdated, malfunctioning, or not suitable for the required tasks.
  • Labour Skills: Low efficiency might be due to workforce training gaps.
  • Process Design: Sometimes, the workflow or layout of the production line itself causes inefficiencies.

Speed of Light in Medium

We can also use this model to support an interpretation of Ashby's Law of Requisite Variety to assess the speed of light in a medium where the medium acts as a disturbance to photon flow. Here's how this perspective aligns with the physics of light-matter interactions:

  • Disturbance: The medium's atomic/molecular structure introduces spatial and electromagnetic inhomogeneities (e.g., refractive index variations, turbulence).
  • Control Mechanism: Photons' ability to "counteract" disturbances through wavelength compression and phase synchronization.
  • Requisite Variety: Photons require sufficient adaptability (e.g., frequency range, polarization states) to navigate the medium's complexity without scattering or losing coherence.

The speed of light in a vacuum is 299,792,458 m/s. In a medium, the speed of light is reduced by a factor n, called the refractive index defined as:

n = c/v
where c is the speed of light in vacuum and v is the speed of light in the medium.

The refractive index is a measure of how much the speed of light is reduced in the medium. The higher the refractive index, the more the speed of light is reduced.

For example, the refractive index of water is 1.33, which means that the speed of light in water is:

v = 299 792 458 / 1.33 ≈ 225 000 000 m/s

The knowledge discovery efficiency of the speed of light in a medium can be calculated as:

KEDE = v/c = 1/n
where v is the actual speed of light in the medium and c is the maximum possible speed of light in vacuum.

Now, we can calculate the average knowledge discovered H(X|Y) by a photon in water:

H(X|Y) = 1/KEDE - 1 = 1/(1/n) - 1 = n - 1 = 1.33 - 1 = 0.33 bits/photon
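The water example can be reproduced with a short sketch (the function name is ours; the vacuum speed of light is the standard constant):

```python
C_VACUUM = 299_792_458  # speed of light in vacuum, m/s

def photon_kede(n):
    """For refractive index n: v = c/n, KEDE = v/c = 1/n, and the
    knowledge to be discovered H = 1/KEDE - 1 = n - 1 (bits)."""
    v = C_VACUUM / n
    kede = 1 / n
    h = n - 1
    return v, kede, h

v, kede, h = photon_kede(1.33)  # water
print(f"v      ≈ {v:,.0f} m/s")         # ≈ 225 million m/s
print(f"KEDE   = {kede:.3f}")           # 1/1.33 ≈ 0.752
print(f"H(X|Y) = {h:.2f} bits/photon")  # 1.33 - 1 = 0.33
```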

Appendix

What learning could also do (but we are explicitly excluding)

Not every form of learning improves regulation, i.e. reduces H(E|Y), in Ashby's sense. Other possibilities include:

  1. Expanding action variety without selectivity

    Learning might increase H(X) (more possible actions, tools, behaviors) without reducing H(X|Y).

    • The system becomes more capable in principle
    • But still does not know which action to take
    • Regulation does not improve

    This violates Ashby's requirement that variety must be constrained, not merely expanded.

  2. Improving buffering instead of knowledge

    Learning might increase buffering capacity K (delay, slack, tolerance), so disturbances are absorbed without better action selection.

    • Outcomes may improve
    • But I(X:Y) does not increase
    • Regulation improves without learning the mapping

    This is explicitly separated from knowledge in Ashby's extended formulation.

  3. Changing goals or success criteria

    Learning could redefine what counts as success E.

    • Apparent performance improves
    • But the structural coupling (mapping) is unchanged
    • Information-theoretically, nothing about H(X|Y) need change

    This is semantic drift, not cybernetic learning.

  4. One-off adaptation without structural retention

    The system may succeed through exploration Z without storing the result.

    • Regulation succeeds this time
    • Next encounter repeats the same uncertainty
    • No accumulation of I(X:Y)

    This is regulation, not learning.

Cumulative Knowledge To Be Discovered

Using

H(S) = N/S - 1
from (5) with constant N, the cumulative residual variety (C) as a function of performance level (S) has a clean closed form.

Cumulative w.r.t. S
Choose a baseline S0 > 0.
Define:

C(S; S0) = ∫_{S0}^{S} H(u) du = ∫_{S0}^{S} (N/u - 1) du = N ln(S/S0) - (S - S0).

Key properties:

  • dC/dS = H(S) = N/S - 1
  • d²C/dS² = -N/S² < 0, so C is concave in S.

Domain: S ∈ (0, N]. Since H(S) > 0 for S < N, C(S; S0) increases with S (for S ≥ S0) and is finite as long as S0 > 0.

Useful normalizations:
Dimensionless form with Ŝ = S/N: C(Ŝ; Ŝ0)/N = ln(Ŝ/Ŝ0) - (Ŝ - Ŝ0).

Total cumulative up to completion S = N: C(N; S0) = N ln(N/S0) - (N - S0).

This can be thought of as the total knowledge-effort curve or “cumulative residual variety as a function of performance level” i.e. how much “knowledge work” has been consumed to reach performance level S.

Fig. 2 Total Knowledge-Effort Curve. Here we see the cumulative residual variety as a function of performance level.

  • Blue curve: instantaneous H(S) = N/S - 1 (residual variety ratio).
  • Green dashed curve: cumulative residual variety C(S) as we accumulate uncertainty over growing performance level S.

Each point on the curve says: "At performance level S, there are H(S) bits of uncertainty to be eliminated for perfect regulation." We can see how H(S) declines hyperbolically, while C(S) rises concavely.
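The closed form for C can be sanity-checked against a direct numerical integration of H(u). This is a sketch with arbitrary illustrative values N = 63, S0 = 1; the function names are ours:

```python
import math

def H(u, N):
    """Instantaneous residual variety H(u) = N/u - 1."""
    return N / u - 1

def C_closed(S, S0, N):
    """Closed form: C(S; S0) = N ln(S/S0) - (S - S0)."""
    return N * math.log(S / S0) - (S - S0)

def C_numeric(S, S0, N, steps=100_000):
    """Midpoint-rule integral of H(u) from S0 to S, to check
    the closed form numerically."""
    du = (S - S0) / steps
    return sum(H(S0 + (i + 0.5) * du, N) for i in range(steps)) * du

N, S0, S = 63, 1, 36  # arbitrary illustrative values
print(C_closed(S, S0, N))   # ≈ 190.76
print(C_numeric(S, S0, N))  # agrees to several decimals
```

The check confirms dC/dS = H(S): integrating the instantaneous residual variety reproduces the closed-form cumulative curve.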

Residual Variety

In the context of Ashby's Law of Requisite Variety, residual variety refers to the variety that the regulator fails to absorb: the remaining uncertainty or uncontrolled states in a system after the regulator has applied its available counter-actions. Information-theoretically, this can be understood in two ways:

  • The residual variety is the uncertainty H(X|Y) about the regulator's response given the disturbance, i.e., the regulator's remaining uncertainty about what action to take given a known disturbance Y. It focuses on the input side - how much the regulator still does not know about which response matches the disturbance that is hitting the system.
  • The residual variety is the uncertainty H(E) in the essential variables. It focuses on the outcome side - how much uncertainty persists in what we care about: the essential variable.

Both interpretations quantify how much uncertainty is left once the regulator has made its move, because H(X|Y) upper-bounds the achievable reduction in H(E) given the fixed table T. If the regulator perfectly counters every disturbance (full requisite variety), the residual variety in both forms would be zero.

Observability of Residual Variety

Now we focus on one important consideration: E (essential variable) values are observable. These are the outcomes we can measure and care about - system performance, outcome quality, stability measures, etc. Being observable, we can empirically estimate H(E) by collecting data on how E varies given different regulator states X. This makes H(E) (residual variety) a measurable quantity in practice.

H(X|Y) presents observability challenges, as disturbances Y may not be directly observable i.e. they could be internal system dynamics, environmental factors we can't measure, or complex interactions we can't decompose. Even if some disturbances are observable, the full set Y might include hidden or latent factors that we cannot directly measure or quantify. This makes H(X|Y) potentially unobservable or only partially estimable and often theoretical or abstract.

This creates an important asymmetry in cybernetic analysis:

  • H(E) is observable and measurable via H(B), allowing us to empirically assess how well the regulator is performing in controlling outcomes.
  • H(X|Y) may not be fully observable, making it difficult to quantify the regulator's knowledge of disturbances it's regulating.

This is why H(E) is often the more practically useful measure - it tells us what we can observe about system performance. That is reflected in the literature as well, where work focuses on observable outcomes rather than unobservable disturbances[29,30,35,36,37,38,39,40,41,42,43,44,45].

How to cite:

Bakardzhiev D.V. (2025) Knowledge Discovery Efficiency (KEDE) and Ashby's Law https://docs.kedehub.io/knowledge-centric-research/kede-ashbys-law.html

Works Cited

1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. doi:10.1002/j.1538-7305.1948.tb01338.x

2. Ashby, W.R. (1956). An Introduction to Cybernetics. Chapman & Hall.

3. Ashby, W. R. (2011). Variety, Constraint, And The Law Of Requisite Variety. 13, 18.

4. MacKay, D. M. (1950). Quantal aspects of scientific information. Philosophical Magazine, 41, 289-311.

5. Cover, T. M. and Thomas, J. A. (1991), Elements of Information Theory, John Wiley and Sons, New York. p. 95, §5.7 Some Comments on Huffman Codes.

6. Wheeler, J. A. (1990). Information, physics, quantum: The search for links. In W. H. Zurek (Ed.), Complexity, entropy, and the physics of information (Vol. 8, pp. 3-28). Taylor & Francis.

7. Bar-Yam, Y. (2004). Multiscale variety in complex systems. Complexity, 9(4), 37-45.

8. Ashby, W.R. (1991). Requisite Variety and Its Implications for the Control of Complex Systems. In: Facets of Systems Science. International Federation for Systems Research International Series on Systems Science and Engineering, vol 7. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-0718-9_28

9. Shannon, C. E. Communication theory of secrecy systems. Bell System technical Journal, 28, 656-715, 1949

10. Kubatko J, Oliver D, Pelton K, et al. A starting point for analyzing basketball statistics. J Quant Anal Sports 2007; 3: 1-22.

11. Sports Reference LLC. "NBA League Averages." Basketball-Reference.com - Basketball Statistics and History. https://www.basketball-reference.com/leagues/NBA_stats_totals.html.

12. Bucks post highest single-game field-goal percentage by any team in 21st century https://sports.yahoo.com/article/bucks-post-highest-single-game-040313061.html

13. https://www.statmuse.com/nba/ask/most-field-goals-made-record-in-a-game-nba-player

14. Lewis, G. J., & Stewart, N. (2003). The measurement of environmental performance: an application of Ashby's law. Systems Research and Behavioral Science, 20(1), 31-52. https://doi.org/10.1002/sres.524

15. Norman, J., & Bar-Yam, Y. (2018). Special Operations Forces: A Global Immune System? In Springer Unifying Themes in Complex Systems IX (pp. 486-498). Springer International Publishing. https://doi.org/10.1007/978-3-319-96661-8_50

16. Norman, J., & Bar-Yam, Y. (2019). Special Operations Forces as a Global Immune System. In Springer Evolution, Development and Complexity (pp. 367-379). Springer International Publishing. https://doi.org/10.1007/978-3-030-00075-2_16

17. O'Grady, W., Morlidge, S., & Rouse, P. (2014). Management Control Systems: A Variety Engineering Perspective. SSRN Electronic Journal. https://doi.org/10.2139/ssrn.2351099

18. Love, T., & Cooper, T. (2007). Digital Eco-systems Pre-Design: Variety Analyses, System Viability and Tacit System Control Mechanisms. 2007 Inaugural IEEE-IES Digital EcoSystems and Technologies Conference, 452-457. https://doi.org/10.1109/dest.2007.372013

19. Love, T., & Cooper, T. (2007). Complex built-environment design: four extensions to Ashby. Kybernetes, 36(9/10), 1422-1435. https://doi.org/10.1108/03684920710827391

20. Bushey, D. B., & Nissen, M. E. (1999). A Systematic Approach to Prioritizing Weapon System Requirements and Military Operations Through Requisite Variety. Defense Technical Information Center. https://doi.org/10.21236/ada371943

21. Jones, H. P. (2018). Evolutionary stakeholder discovery: requisite system sampling for co-creation.

22. Grimm, D. A. P., Gorman, J. C., Robinson, E., & Winner, J. (2022). Measuring Adaptive Team Coordination in an Enroute Care Training Scenario. Proceedings of the Human Factors and Ergonomics Society Annual Meeting, 66(1), 50-54. https://doi.org/10.1177/1071181322661074

23. Becker Bertoni, V., Abreu Saurin, T., & Sanson Fogliatto, F. (2022). Law of requisite variety in practice: Assessing the match between risk and actors' contribution to resilient performance. Safety Science, 155, 105895. https://doi.org/10.1016/j.ssci.2022.105895

24. Tworek, K., Walecka-Jankowska, K., & Zgrzywa-Ziemak, A. (2019). Towards organisational simplexity — a simple structure in a complex environment. Engineering Management in Production and Services, 11(4), 43-53. https://doi.org/10.2478/emj-2019-0032

25. Chester, M. V., & Allenby, B. (2022). Infrastructure autopoiesis: requisite variety to engage complexity. Environmental Research: Infrastructure and Sustainability, 2(1), 012001. https://doi.org/10.1088/2634-4505/ac4b48

26. van der Hoek, M., Beerkens, M., & Groeneveld, S. (2021). Matching leadership to circumstances? A vignette study of leadership behavior adaptation in an ambiguous context. International Public Management Journal, 24(3), 394-417. https://doi.org/10.1080/10967494.2021.1887017

27. Ulrik, S., & Isabella, A. (2023). Variety versus speed: how variety in competence within teams may affect performance in a dynamic decision-making task.

28. Bakardzhiev, D., Vitanov, N.K. (2025). KEDE (KnowledgE Discovery Efficiency): A Measure for Quantification of the Productivity of Knowledge Workers. In: Georgiev, I., Kostadinov, H., Lilkova, E. (eds) Advanced Computing in Industrial Mathematics. BGSIAM 2022. Studies in Computational Intelligence, vol 641. Springer, Cham. https://doi.org/10.1007/978-3-031-76786-9_3

29. Heylighen, F., & Joslyn, C. (2001). Cybernetics and Second Order Cybernetics. In R. A. Meyers (Ed.), Encyclopedia of Physical Science and Technology, Eighteen-Volume Set, Third Edition (pp. 155-170). Academia Press. http://pespmc1.vub.ac.be/Papers/Cybernetics-EPST.pdf

30. Schwaninger, M., & Ott, S. (2024). What is variety engineering and why do we need it? Systems Research and Behavioral Science, 41(2), 235-246. https://doi.org/10.1002/sres.2964

31. Aulin-Ahmavaara, A. Y. (1979). "The Law of Requisite Hierarchy", Kybernetes, Vol. 8 No. 4, pp. 259-266. https://doi.org/10.1108/eb005528

32. Wu, T., Dufford, A. J., Mackie, M. A., Egan, L. J., & Fan, J. (2016). The Capacity of Cognitive Control Estimated from a Perceptual Decision Making Task. Scientific Reports, 6, 34025.

33. Abuhamdeh, S. (2020). Investigating the "Flow" Experience: Key Conceptual and Operational Issues. Front. Psychol. 11:158. doi: 10.3389/fpsyg.2020.00158

34. Automatic Screw Tightening Machine and Its Hidden Features

35. Keating, C. B., Katina, P. F., Jaradat, R., Bradley, J. M., & Hodge, R. (2019). Framework for improving complex system performance. INCOSE International Symposium, 29(1), 1218-1232. https://doi.org/10.1002/j.2334-5837.2019.00664.x

36. S. Engell (1985). An information-theoretical approach to regulation.

37. K. Kijima, Y. Takahara, B. Nakano (1986). ALGEBRAIC FORMULATION OF RELATIONSHIP BETWEEN A GOAL SEEKING SYSTEM AND ITS ENVIRONMENT.

38. W. Kickert, J. Bertrand, J. Praagman (1978). Some Comments on Cybernetics and Control. IEEE Transactions on Systems, Man and Cybernetics.

39. S. Engell (1985). Information-theoretical bounds for regulation accuracy. IEEE Conference on Decision and Control.

40. Hui Zhang, Youxian Sun (2003). Bode integrals and laws of variety in linear control systems. Proceedings of the 2003 American Control Conference, 2003.

41. R. Conant (1969). The Information Transfer Required in Regulatory Processes. IEEE Transactions on Systems Science and Cybernetics.

42. S. Engell (1987). Analysis of Regulation Problems based on Real-Time Rate-Distortion Theory. American Control Conference.

43. Hui Zhang, Youxian Sun (2003). Information theoretic limit and bound of disturbance rejection in LTI systems: Shannon entropy and H/sub /spl infin// entropy. SMC'03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483).

44. N. C. Martins, M. Dahleh (2008). Feedback Control in the Presence of Noisy Channels: “Bode-Like” Fundamental Limitations of Performance. IEEE Transactions on Automatic Control.

45. Hui Zhang, Youxian Sun (2003). H/sub /spl infin// entropy and the law of requisite variety. 42nd IEEE International Conference on Decision and Control (IEEE Cat. No.03CH37475).

46. Tsuji, M., Crookshank, M., Olsen, M., Schemitsch, E. H., & Zdero, R. (2013). The biomechanical effect of artificial and human bone density on stopping and stripping torque during screw insertion. Journal of the mechanical behavior of biomedical materials, 22, 146-156. https://doi.org/10.1016/j.jmbbm.2013.03.006

47. Akizuki, K., & Ohashi, Y. (2015). Measurement of functional task difficulty during motor learning: What level of difficulty corresponds to the optimal challenge point? Human movement science, 43, 107-117. https://doi.org/10.1016/j.humov.2015.07.007

48. Bootsma, J. M., Hortobágyi, T., Rothwell, J. C., & Caljouw, S. R. (2018). The Role of Task Difficulty in Learning a Visuomotor Skill. Medicine and science in sports and exercise, 50(9), 1842-1849. https://doi.org/10.1249/MSS.0000000000001635

49. Akizuki, K., & Ohashi, Y. (2013). Changes in practice schedule and functional task difficulty: a study using the probe reaction time technique. Journal of physical therapy science, 25(7), 827-831. https://doi.org/10.1589/jpts.25.827

50. Goldhammer, F.; Naumann, J.; Stelter, A.; Tóth, K.; Rölke, H.; Klieme, E.: The time on task effect in reading and problem solving is moderated by task difficulty and skill. Insights from a computer-based large-scale assessment - In: The Journal of educational psychology 106 (2014) 3, S. 608-626 - URN: urn:nbn:de:0111-pedocs-179679 - DOI: 10.25656/01:17967; 10.1037/a0034716

51. Boothroyd, G., and P. Dewhurst, "DESIGN FOR ASSEMBLY", Dept. of Mechanical Engineering, University of Massachusetts, Amherst, Massachusetts, 1983.

52. Jahin, A., Zidan, A. H., Bao, Y., Liang, S., Liu, T., & Zhang, W. (2025). Unveiling the mathematical reasoning in deepseek models: A comparative study of large language models. arXiv preprint arXiv:2503.10573.

53. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI

54. Heylighen, F. (2014). Cybernetic Principles of Aging and Rejuvenation: The Buffering-Challenging Strategy for Life Extension. Current Aging Science, 7(1). DOI: 10.2174/1874609807666140521095925

55. Ostry, D. J. (1980). Execution-time movement control. In G. E. Stelmach & J. Requin (Eds.), Tutorials in motor behavior (pp. 457-468). Amsterdam: North-Holland.

56. Siegenfeld, A. F., & Bar-Yam, Y. (2025). A Formal Definition of Scale-Dependent Complexity and the Multi-Scale Law of Requisite Variety. Entropy, 27(8), 835. https://doi.org/10.3390/e27080835

57. Umpleby, S. A. (2009). Ross Ashby's general theory of adaptive systems. International Journal of General Systems, 38(2), 231–238. https://doi.org/10.1080/03081070802601509
