The Mathematics of Knowledge Discovery Efficiency (KEDE)


Abstract

Introducing KEDE, a powerful metric for optimizing software development. KEDE analyzes source code repositories and quantifies the knowledge gap developers face when starting tasks. This knowledge gap impacts their happiness and productivity.

With values ranging between 0 and 100, higher KEDE scores indicate better standing. Managers can use KEDE to compare projects, teams, and departments.

KEDE measures the efficiency of knowledge discovery, which is the foundation of high productivity, rather than productivity itself.

KEDE showcases the efficiency of developers in acquiring and applying missing knowledge. This insight helps teams improve processes, ultimately boosting happiness, productivity, and success.

KEDE definition

A Knowledge Discovery Process transforms invisible knowledge into visible, tangible output.

Think of it as a black box, which may contain senior and junior developers potentially using AI tools like ChatGPT. To learn more about it, please refer to this article.

To explain the tangible output, we can use an analogy from physics, where a quantum is the smallest discrete unit of a physical phenomenon. In this context, the tangible output comprises the symbols produced. The quality of the output is assumed to meet target standards.

Inputs represent the knowledge developers lack before starting a task, i.e., the missing information that needs to be discovered, measured in bits.

The amount of missing information is measured in bits using Claude Shannon's Information Theory[1], and is the average number of binary "Yes/No" questions asked to gain the knowledge required to produce the output.
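As a minimal illustration of counting missing information in bits, the sketch below assumes the simplest case of equally likely alternatives, where the number of binary questions is the base-2 logarithm of the number of alternatives. The function name `questions_needed` is hypothetical and is not part of KEDE.

```python
import math


def questions_needed(n_alternatives: int) -> float:
    """Bits (yes/no questions) needed to single out one of
    n equally likely alternatives. Illustrative assumption only."""
    if n_alternatives < 1:
        raise ValueError("n_alternatives must be at least 1")
    return math.log2(n_alternatives)


# Picking one option out of 8 equally likely ones takes 3 questions.
print(questions_needed(8))  # 3.0
# If there is only one alternative, nothing is missing.
print(questions_needed(1))  # 0.0
```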

To quantify the knowledge developers didn't have before starting a task, we introduce a new metric called KEDE (KnowledgE Discovery Efficiency).

$$\mathrm{KEDE} = \frac{1}{1+H} \tag{1}$$

KEDE quantifies the knowledge software developers didn't have prior to starting a task, since it is this lack of knowledge that significantly impacts the time and effort required. KEDE is inversely related to the missing information H required to effectively complete a task: it decreases as H grows, and takes values in the half-open interval (0,1].
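The definition KEDE = 1/(1+H) can be sketched in a few lines. The function name `kede_from_missing_information` is an illustrative assumption.

```python
def kede_from_missing_information(h_bits: float) -> float:
    """KEDE = 1 / (1 + H), with H the missing information in bits."""
    if h_bits < 0:
        raise ValueError("missing information cannot be negative")
    return 1.0 / (1.0 + h_bits)


print(kede_from_missing_information(0))  # 1.0  (nothing is missing)
print(kede_from_missing_information(1))  # 0.5  (one question per symbol)
print(kede_from_missing_information(4))  # 0.2
```

The result never reaches 0 for any finite H, which is why the interval is open at 0 and closed at 1.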

Calculating Missing Information

To calculate KEDE for a knowledge discovery process, we need to find the missing information H in bits. For that we adopt the positivist credo that science should be based on observable facts; that is, we must derive our conclusions from observable quantities we can measure. In practice, the tangible output we can measure is the computer code produced. Therefore, we infer the average number of questions asked in a time interval, denoted as H, solely from the observable quantity, which is the number of symbols produced.

We define N as the maximum number of symbols that could be produced in a time interval, assuming that the minimum symbol duration is one unit of time and is equal to the time it takes to ask one question.

This notion is supported by the research on natural constraints:

  1. Asking questions is an effortful task that precludes simultaneous typing[3]. If a symbol was not typed, then a question was asked. This implies that the question rate is equivalent to the symbol rate, as explained here.
  2. The maximum typing speed for humans is approximately 4.2 symbols per second[8][9][10].
  3. The cognitive control capacity of the human brain is around 3 to 4 bits per second, and since we equate one question to one bit of information, this translates to 3 to 4 questions per second[11].

To align the symbol rate with the cognitive capacity, we set a maximum symbol rate N at 100,000 symbols per 8 hours of work, resulting in a symbol duration time t of 0.288 seconds. This yields a symbol rate r of 3.47 symbols per second, fitting within the cognitive control range of the human brain.
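The arithmetic behind these figures can be checked directly. The variable names below are illustrative; the 4.2 symbols per second and 3 to 4 bits per second bounds are the ones quoted in the list above.

```python
SECONDS_PER_HOUR = 3600

work_hours = 8
max_symbols = 100_000  # N for an 8-hour interval

# Symbol duration t: seconds available per symbol.
symbol_duration = work_hours * SECONDS_PER_HOUR / max_symbols
# Symbol rate r: symbols per second.
symbol_rate = max_symbols / (work_hours * SECONDS_PER_HOUR)

print(round(symbol_duration, 3))  # 0.288
print(round(symbol_rate, 2))      # 3.47

# The rate sits inside the 3-4 bits/s cognitive range
# and below the 4.2 symbols/s typing ceiling.
print(3 <= symbol_rate <= 4 < 4.2)  # True
```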

Therefore, we generalize the following relation between the sum of questions Q, symbols S, and the maximum symbol rate N:

$$Q + S = N$$

Here Q represents the total number of questions asked in a time interval, and S is the total number of symbols produced for the same time interval.
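Since Q is not observed directly, the relation Q + S = N lets it be inferred from the symbol count. A minimal sketch, with a hypothetical function name and example numbers chosen for illustration:

```python
def questions_asked(max_symbols: int, symbols_produced: int) -> int:
    """Infer Q from Q + S = N for a single time interval."""
    if not 0 <= symbols_produced <= max_symbols:
        raise ValueError("S must lie between 0 and N")
    return max_symbols - symbols_produced


# With N = 100,000 and S = 20,000 symbols, 80,000 time units
# are attributed to asking questions.
print(questions_asked(100_000, 20_000))  # 80000
```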

The KEDE theorem is derived from the aforementioned principles:

$$H = \frac{N}{S(1-W)} - 1 \tag{2}$$

Here, H denotes the amount of missing information, N represents the maximum possible symbols that could be produced within a given time frame, S is the actual number of symbols produced in that time frame, and W stands for the probability of waste.

This theorem provides a way to compute the missing information (measured in bits), which is not directly observable, using quantifiable entities.
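A minimal sketch of the computation, assuming S and N refer to the same time interval and W is given as a probability in [0, 1). The function name and the example values are illustrative assumptions.

```python
def missing_information_bits(max_symbols: float,
                             symbols_produced: float,
                             waste_probability: float = 0.0) -> float:
    """H = N / (S * (1 - W)) - 1, in bits."""
    if not 0.0 <= waste_probability < 1.0:
        raise ValueError("W must be in [0, 1)")
    useful_symbols = symbols_produced * (1.0 - waste_probability)
    if not 0.0 < useful_symbols <= max_symbols:
        raise ValueError("S * (1 - W) must be in (0, N]")
    return max_symbols / useful_symbols - 1.0


print(missing_information_bits(100_000, 20_000))       # 4.0
print(missing_information_bits(100_000, 20_000, 0.5))  # 9.0
print(missing_information_bits(100_000, 100_000))      # 0.0
```

Waste raises the inferred missing information: with half of the symbols wasted, the same symbol count implies 9 bits instead of 4.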

On the left side of the equation, we have an intangible quantity (bits of information), i.e., the missing information.

On the right side, we find tangible, measurable quantities. S quantifies the actual productive output, while N benchmarks this against the theoretical maximum output in a given time period. The term (1-W) adjusts this ratio by considering the probability of wasteful efforts.

By equating these tangible measures with the intangible concept of information, the theorem bridges the gap between what can be observed and measured directly (the actual work output and efficiency) and the theoretical concepts of information.

The proof of the theorem provides the necessary mathematical foundation to validate this relationship, ensuring that we can quantify the 'missing information' in a practical, real-world context.
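The proof itself is not reproduced here. One way to see how the terms fit together, offered as a sketch under stated assumptions and not as the author's proof: assume H is the average number of questions per non-wasted symbol, and that all time not accounted for by non-wasted symbols is attributed to questions. The subscripted names are hypothetical.

```latex
\begin{align*}
S_{\text{eff}} &= S(1-W)
  && \text{(assumed: symbols that are not wasted)} \\
Q_{\text{eff}} &= N - S_{\text{eff}}
  && \text{(from } Q + S = N \text{, applied to } S_{\text{eff}}\text{)} \\
H &= \frac{Q_{\text{eff}}}{S_{\text{eff}}}
   = \frac{N - S(1-W)}{S(1-W)}
   = \frac{N}{S(1-W)} - 1
\end{align*}
```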

The KEDE theorem isn't tied to the absolute values of typing speed or cognitive capacity. Instead, it's based on the ratio between the manual work done and the manual work that could theoretically be done given the cognitive constraints. The value of N, or the maximum symbol rate, is used as a constant to represent an idealized, maximum efficiency over a standard work interval (such as an 8-hour workday). The specific values used for maximum typing speed and cognitive capacity are not the core components of the theorem itself; they are parameters that give context and allow for the application of the theorem to real-world scenarios. Changes in research would prompt adjustments in the constant N but would not necessitate a change in the structure or application of the theorem itself.

Measuring KEDE in software development

To apply the KEDE theorem to software development, we need to define the values of S and N. We can count the actual number of symbols typed, S, straight from the source code files. N represents the maximum number of symbols that a single human being can contribute within a time interval. We define the value of N as:

N=h×CPH

Here h is the number of working hours in a day and CPH is the maximum number of symbols that could be contributed per hour.

To obtain a maximum of N = 100,000 symbols per 8 hours of work, which corresponds to a symbol rate r of 3.47 symbols per second, we define an eight-hour workday (h = 8) and a CPH of 12,500 symbols per hour.
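The relation N = h × CPH with these constants can be sketched as follows. The function name `max_symbols` is an illustrative assumption.

```python
CPH = 12_500  # maximum symbols contributed per hour


def max_symbols(working_hours: float, cph: float = CPH) -> float:
    """N = h * CPH: the ceiling on symbols for the interval."""
    if working_hours < 0:
        raise ValueError("working hours cannot be negative")
    return working_hours * cph


print(max_symbols(8))        # 100000
print(max_symbols(4))        # 50000
print(round(CPH / 3600, 2))  # 3.47 symbols per second
```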

When we substitute in (2) this formula for N, it becomes:

$$H = \frac{h \times CPH}{S(1-W)} - 1 \tag{3}$$

When we substitute (3) in (1) and convert the KEDE equation into percentages, it becomes:

$$\mathrm{KEDE} = \frac{S(1-W)}{h \times CPH} \times 100\% \tag{4}$$
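The substitution step can be written out explicitly. Starting from the definition of KEDE as 1/(1+H) and the expression for H in terms of h and CPH, the constant terms cancel:

```latex
\begin{align*}
\mathrm{KEDE}
  &= \frac{1}{1+H}
   = \frac{1}{1 + \dfrac{h \times CPH}{S(1-W)} - 1}
   = \frac{S(1-W)}{h \times CPH} \\[4pt]
\mathrm{KEDE}_{\%}
  &= \frac{S(1-W)}{h \times CPH} \times 100\%
\end{align*}
```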

KEDE has the following properties:
  • Minimum value of 0 and maximum value of 100.
  • KEDE approaches 0 when the missing information is infinite, which is the case when humans create new knowledge, as exemplified by intellectuals like Albert Einstein and startups developing new technologies like PayPal.
  • KEDE approaches 100 when the missing information is zero, which is the case for an omniscient being, such as God.
  • KEDE is higher when software developers apply prior knowledge instead of discovering new knowledge.
  • KEDE is anchored to the natural constraints of the maximum possible typing speed and the cognitive control of the human brain, supporting comparisons across contexts, programming languages and applications.

For an expert full-time software developer who mostly applies prior knowledge but also creates new knowledge when needed, we would expect a KEDE value of 20.
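To make the value of 20 concrete, here is a minimal sketch of the percentage formula KEDE = S(1-W) / (h × CPH) × 100%. The function name and the sample symbol counts are illustrative assumptions, not measured data.

```python
def kede_percent(symbols_produced: float,
                 waste_probability: float = 0.0,
                 working_hours: float = 8.0,
                 cph: float = 12_500.0) -> float:
    """KEDE as a percentage: 100 * S * (1 - W) / (h * CPH)."""
    if not 0.0 <= waste_probability <= 1.0:
        raise ValueError("W must be in [0, 1]")
    max_symbols = working_hours * cph  # N
    useful_symbols = symbols_produced * (1.0 - waste_probability)
    if not 0.0 <= useful_symbols <= max_symbols:
        raise ValueError("S * (1 - W) must be in [0, N]")
    return 100.0 * useful_symbols / max_symbols


# 20,000 symbols in an 8-hour day with no waste gives KEDE = 20.
print(kede_percent(20_000))                         # 20.0
# The same output with half of it wasted gives KEDE = 10.
print(kede_percent(20_000, waste_probability=0.5))  # 10.0
```

Under these constants, a KEDE of 20 corresponds to 20,000 non-wasted symbols per eight-hour day.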

Calculating the Balance between Individual Capabilities and Work Complexity

A knowledge discovery process has only two possible outcomes: symbols (S) and questions (Q), with probabilities KEDE and (1-KEDE), respectively. To calculate the balance between questions and symbols, we use Shannon's formula[1]:

$$\mathrm{Balance}(p_1, p_2) = -\sum_{i=1}^{2} p_i \log_2 p_i \tag{5}$$

In this case, p1 = KEDE and p2 = (1-KEDE) and the Balance function of one variable is:

$$\mathrm{Balance}(\mathrm{KEDE}) = -\mathrm{KEDE} \log_2 \mathrm{KEDE} - (1-\mathrm{KEDE}) \log_2 (1-\mathrm{KEDE}) \tag{6}$$

The figure below shows the function Balance(KEDE).

[Figure: Balance as a function of KEDE]

The balance function is non-negative, concave (concave downward), and reaches its maximum at KEDE = 1/2. It is zero at both KEDE = 0 and KEDE = 1.
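These properties can be checked numerically. The sketch below takes KEDE as a fraction in [0, 1], not a percentage, and uses the usual convention that 0 × log2(0) is treated as 0. The function name `balance` is illustrative.

```python
import math


def balance(kede: float) -> float:
    """Binary Shannon entropy of (KEDE, 1 - KEDE), in bits."""
    if not 0.0 <= kede <= 1.0:
        raise ValueError("KEDE must be a fraction in [0, 1]")
    if kede in (0.0, 1.0):
        return 0.0  # limit of p * log2(p) as p -> 0
    return -kede * math.log2(kede) - (1.0 - kede) * math.log2(1.0 - kede)


print(balance(0.5))  # 1.0, the maximum
print(balance(0.0))  # 0.0
print(balance(1.0))  # 0.0

# Values on either side of 1/2 are lower, and the curve is symmetric.
for k in (0.1, 0.2, 0.8, 0.9):
    print(k, round(balance(k), 3))
```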

We assume that the number of questions Q reflects the complexity of the work, and the number of symbols produced S reflects the individual capabilities of a developer. When they are in balance the developer is in a state called flow.

Flow is characterized by a balance between the challenges of software development and the individual's capabilities. According to Csikszentmihalyi, it occurs at the boundary between boredom and anxiety, and is an optimal work experience[4][5].

The maximum value of the balance function is one, reached when KEDE is equal to 1/2, as this indicates a balance between questions and symbols. When KEDE is equal to 0, the developer may be in a state of anxiety, as the challenges are too great. On the other hand, when KEDE is equal to 1, the developer may be in a state of boredom, as the challenges are too low. In general, values of KEDE less than 1/2 indicate a lack of balance and a tendency towards anxiety, while values greater than 1/2 indicate a lack of balance and a tendency towards boredom. In both cases, the level of balance is lower than at KEDE = 1/2.

What is the value of knowing KEDE?

Knowing KEDE makes it possible to quantify the human capital of any organization and to derive indicators such as Knowledge Discovery Efficiency (KEDE), Collaboration, Cognitive Load, Happiness (Flow State), Productivity (Value per Bit of Information Discovered), and Rework (Information Loss Rate). Together they present a multidimensional view of the developer work experience and the efficiency of the software development process.

Works Cited

1. Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. doi:10.1002/j.1538-7305.1948.tb01338.x

2. Drucker, P. F. (1999). Knowledge-Worker Productivity: The Biggest Challenge. California Management Review, 41(2), 79-94. doi:10.2307/41165987

3. Kahneman, D. (1973). Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall.

4. Csikszentmihalyi, M. (1990). Flow: The Psychology of Optimal Experience. United States of America: Harper & Row.

5. Csikszentmihalyi, M. (1975). Beyond Boredom and Anxiety: The Experience of Play in Work and Games. San Francisco: Jossey-Bass.

6. Dhakal, V., Feit, A. M., Kristensson, P. O., & Oulasvirta, A. (2018). Observations on Typing from 136 Million Keystrokes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1-12). Association for Computing Machinery. https://doi.org/10.1145/3173574.3174220

7. Wu T, Dufford AJ, Mackie MA, Egan LJ, Fan J. The Capacity of Cognitive Control Estimated from a Perceptual Decision Making Task. Sci Rep. 2016 Sep 23;6:34025. doi: 10.1038/srep34025. PMID: 27659950; PMCID: PMC5034293.

8. Shaffer, L. H. (1973). Latency mechanisms in transcription. In S. Kornblum (Ed.), Attention and performance IV (pp. 435-446). New York: Academic Press.

9. Ostry, D. J. (1980). Execution-time movement control. In G. E. Stelmach & J. Requin (Eds.), Tutorials in motor behavior (pp. 457-468). Amsterdam: North-Holland.

10. Dhakal, V., Feit, A. M., Kristensson, P. O., & Oulasvirta, A. (2018). Observations on Typing from 136 Million Keystrokes. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (pp. 1-12). Association for Computing Machinery. https://doi.org/10.1145/3173574.3174220

11. Wu T, Dufford AJ, Mackie MA, Egan LJ, Fan J. The Capacity of Cognitive Control Estimated from a Perceptual Decision Making Task. Sci Rep. 2016 Sep 23;6:34025. doi: 10.1038/srep34025. PMID: 27659950; PMCID: PMC5034293.
