Understanding the details of Knowledge Discovery

Through the lens of Information Theory.

Total unconditional missing information

"Lack of knowledge…that is the problem. we should not ask questions without knowledge. If we do not know how to ask the right question, we discover nothing." ~ W. Edwards Deming

In order to correctly define a problem, we must consider all possible definitions, which form the problem space (as shown in the image below).

The gold coin represents the correct definition for the problem we are addressing. We need to find the gold coin, i.e., correctly define our problem, as we believe a correct definition is necessary for solving a problem.

Finding the correct definition is an example of a process known as "knowledge discovery."

Figure 1

This process takes as input some knowledge to be discovered, or in other words, "what do we need to know?" The result of knowledge discovery is the knowledge that is discovered; in our case, that would be the correct definition of our problem. Knowledge discovery works by gaining information, i.e., by reducing missing information.

Missing information, also known as "uncertainty", is mathematically defined for a "random variable" using Shannon's formula. However, it is important to remember that physical systems are not purely mathematical; they are complex and often messy. When we observe a physical object through a measurement device, we are essentially looking at it through a "mathematical looking glass" with a finite resolution. This means that the missing information of the object is not inherent to the object itself, but rather defined by the measurement device we use to examine it. A measurement device means we already know something. It specifies what questions we can ask about a system.

In order to find the gold coin (i.e., correctly define the problem) in the problem space, we must establish a measurement device. This process, which is an example of "knowledge discovery," is illustrated in the image below.

Figure 2

In the example depicted in the figure, we go through the three stages of knowledge discovery in order to learn about a single problem. At the beginning of the process, we are not sure where the gold coin we are looking for is located within the problem space. However, as we progress through the stages, we are able to narrow down the possibilities.

Knowledge discovery involves three stages:

  1. We don't know anything about the problem space.

    At the start of the knowledge discovery process, we have little to no understanding of the problem space. It is like a cloud, where we cannot determine the number of alternative definitions or possible states. If we are not even aware of how much we don't know, we are dealing with "unknown unknowns." In this initial stage, we must first gather information and try to clarify the boundaries of the problem space.

  2. We know nothing about the problem space except how many states it has.

    When we have very little information about a probability distribution and cannot make any other assumptions, we must assume that each state has an equal probability of occurring. This reflects a maximal level of uncertainty. This principle, known as the "maximum entropy principle," states that the probability distribution that best represents our current understanding of a system is the one with the largest amount of missing information, or Hmax[4][5]. This is because, in the absence of other information, our uncertainty must be at its maximum.

    On Fig. 2, the knowledge to be discovered Hmax left after the transition from State 1 to State 2 can be expressed as: "partitioning the problem space of 'unknown unknowns' using some knowledge." This process can also be represented by the following formula:

    where I2 represents the knowledge (measurement device) we used to partition the problem space of "unknown unknowns" into N states.

    I2 represents the knowledge we gained in the second stage. Information gain is a measure of the reduction in missing information that occurs when objects are classified into different classes. It is frequently used in the construction of decision trees from a training dataset: the information gain of each variable is evaluated, and the variable that maximizes the information gain is selected, which in turn minimizes missing information and divides the dataset into the most effective groups for classification. Information gain is calculated by comparing the missing information of a dataset before and after a transformation[6].

    If we have no other knowledge about a system except for the number of possible states it can be in, the maximum missing information Hmax is equal to the logarithm of the number of possible states:

    Hmax = log2(N)

    where N represents the number of states.

    In this particular case presented on Fig. 2, we have divided the problem space into four parts. Therefore, Hmax = log2(4) = 2 bits of information.

  3. We know the states and the probabilities of each state.

    In the third stage of knowledge discovery, we are aware of the possible states and the probability of each state. This missing information, represented by H(X), is conditional on some new knowledge, which we will refer to as I3. H(X) can be calculated using Shannon's formula, H(X) = -Σ p(x) log2 p(x).

    On Fig. 2, the knowledge to be discovered H(X) left after the transition from State 2 to State 3 can be quantified as: "what remains to be discovered equals what we don't know minus what we know." This process can also be represented by the following formula:

    H(X) = Hmax - I3

    where I3 represents the knowledge we used to assign probabilities to the partitions of the problem space, based on a particular non-uniform probability distribution (see the sketch after this list).
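To make the stages concrete, here is a minimal sketch in Python. It assumes the four partitions of Fig. 2 and an illustrative non-uniform distribution (0.5, 0.25, 0.125, 0.125) for State 3; the probabilities are made up purely to show the arithmetic.

```python
import math

def shannon_entropy(probs):
    """Shannon's formula: missing information of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# State 2: we only know the problem space has N = 4 partitions.
N = 4
H_max = math.log2(N)            # 2.0 bits of maximum missing information

# State 3: knowledge I3 assigns (illustrative) non-uniform probabilities.
probs = [0.5, 0.25, 0.125, 0.125]
H_X = shannon_entropy(probs)    # 1.75 bits still to be discovered

I3 = H_max - H_X                # knowledge gained by the non-uniform assignment
print(H_max, H_X, I3)           # 2.0 1.75 0.25
```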

After reaching State 3, we can start reducing the missing information by eliminating the partitions that do not contain the gold coin. This process is explained in the following sections.

Once we have narrowed it down to a single partition, we can consider that partition as a new problem space at State 1. We can then iteratively follow the same process of knowledge discovery (State 1 -> State 2 -> State 3) until we find the gold coin (i.e., correctly define our problem).

Joint missing information

Let us quantify our uncertainty, i.e., how much we do not know about X, by the number of states N1 it can be in. Imagine that there is another system Y, which can be in N2 different states. How many states can the joint system (X and Y combined) be in? For each of the N1 states of X, Y can be in any of its N2 states, so the total number of states of the joint system must be N1 x N2. But our uncertainty about the joint system is not N1 x N2: missing information adds up, it doesn't multiply. In fact, the following inequality holds[2]:

H(X,Y) ≤ H(X) + H(Y)

where H(X,Y) is defined on the joint probability of occurrence of X and Y.

The equality holds when the two random variables are independent:

H(X,Y) = H(X) + H(Y)

In the figure below, the problem space is represented by X and our prior knowledge is represented by Y.

At this point, X and Y are not connected, i.e., they are independent. This means we have not yet tried to apply our prior knowledge Y to the current problem X.

The last two results imply that if we have two experiments (or two games) with independent outcomes, then the missing information about the outcomes of the two experiments is the sum of the missing information about the outcomes of each individual experiment.

On the other hand, if there is dependence between the two sets of outcomes, then the missing information about the compound experiment H(X,Y) will always be less than the missing information about the two experiments separately. In this case, we use conditional probabilities.
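As a quick numerical check of this additivity, here is a small sketch that compares H(X,Y) with H(X) + H(Y) for two illustrative 2x2 joint distributions, one independent and one dependent; the probability tables are made up for demonstration.

```python
import math

def H(probs):
    """Shannon entropy, in bits, of a flat list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative joint distributions p(x, y); rows are states of X, columns states of Y.
independent = [[0.25, 0.25], [0.25, 0.25]]   # p(x, y) = p(x) * p(y)
dependent   = [[0.40, 0.10], [0.10, 0.40]]   # knowing Y tells us something about X

for joint in (independent, dependent):
    px = [sum(row) for row in joint]                 # marginal of X
    py = [sum(col) for col in zip(*joint)]           # marginal of Y
    H_joint = H([p for row in joint for p in row])   # H(X,Y)
    print(H_joint, H(px) + H(py))                    # H(X,Y) <= H(X) + H(Y)
```

For the independent table the two numbers coincide (2.0 bits); for the dependent table the joint missing information (about 1.72 bits) is strictly smaller than the sum (2.0 bits).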

Conditional missing information

"Without theory, experience has no meaning. Without theory, one has no questions to ask. Hence without theory there is no learning."- W. Edwards Deming[3]

To reduce missing information, we need to gain more information about a system. One way to do this is by using a theory. Theories can often constrain probability distributions in such a way that we have certain expectations about them before making any measurements. In other words, a theory can shape our prior knowledge.

To formally introduce the concept of conditional missing information, we need to discuss conditional probabilities. A conditional probability represents the likelihood of an event occurring, given that another event has already occurred. To quantify the dependence between two events that happen simultaneously, we can examine two random variables. These variables may or may not be independent.

Probability distributions are initially uniform, meaning we have no knowledge about the variable except possibly the number of states it can take on. Once we gain information about the states, the probability distribution becomes non-uniform, with some states being more likely than others. This information is reflected in the form of conditional probabilities. We can only truly know that a particular state is more or less likely than the random expectation if we have some other knowledge about the system at the same time.

Any probability distribution that is not uniform (same probability for all states) is necessarily conditional[1].

We view all probabilities as being conditional on one or more hypotheses, rather than absolute, so they should not be interpreted as intrinsic properties of physical objects. Probabilities are subjective in the sense that they depend on the evidence or state of knowledge used to formulate them, but they are objective in that any rational observer with the same evidence would assign the same probabilities. As a result, any probability distribution, and the missing information it contains, is largely anthropomorphic.

In the figure below, we see the missing information of X after we have applied our prior knowledge Y.

The parts of the figure labeled 5, 6, and 7 represent the conditional missing information H(Y|X). The parts labeled 2, 3, and 4 represent the conditional missing information H(X|Y), or the missing information that remains to be discovered after we have applied our prior knowledge Y to problem X.

In the formula below, conditional missing information (i.e., knowledge to be discovered) H(X|Y) is defined as the missing information H(X) minus the mutual information I(X:Y):

H(X|Y) = H(X) - I(X:Y)

We also obtain the inequality[2]:

H(X|Y) ≤ H(X)

This means that the missing information on X can never increase by knowing Y. Alternatively, H(X|Y) is the average uncertainty that remains about X when Y is known. This uncertainty is always less than or equal to the uncertainty about X. If X and Y are independent, then the equality sign holds.

It is possible to refer to H(X|Y) as "observed entropy", as it represents the uncertainty or missing information in the random variable X, given that the value of another random variable Y is known. It quantifies the remaining missing information in X, after taking into account the information that is known from Y.

The "observed entropy" is a term commonly associated with the information that an observer has, it reflects how much knowledge or understanding the observer has about the problem or the distribution. In this sense, the conditional entropy H(X|Y) can be seen as the observed entropy of X, given the value of Y, as it reflects the missing information in X, after taking into account the information that is known from Y,

It's worth noting that the "observed entropy" is a subjective concept as it depends on the prior knowledge, understanding and perspective of the observer, while conditional entropy H(X|Y) is an objective concept and it depends only on the probability distribution of X and Y.

In information theory, the concept that is most closely related to "knowledge to be discovered" between two random variables X and Y is the concept of conditional entropy, denoted H(X|Y). The conditional entropy is a measure of the uncertainty of one random variable X, given that the value of another random variable Y is known. It is defined as the expected value of the entropy of X, given the value of Y. In other words, the conditional entropy represents the amount of missing information in X, given that the value of Y is known. Or, it represents the amount of missing information in X that remains after taking into account the information in Y. It's worth noting that the conditional entropy can be derived from the mutual information between X and Y, by using the equation H(X|Y) = H(X) - I(X;Y).

If Y shares little information with X, knowing Y does not help much to reduce the uncertainty of X, and therefore H(X|Y) will be high (close to H(X)). On the other hand, if Y shares a lot of information with X, knowing Y is useful for reducing the uncertainty of X, and therefore H(X|Y) will be low.
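The identity H(X|Y) = H(X) - I(X;Y), and the fact that conditioning never increases missing information, can be checked numerically. Below is a minimal sketch with a made-up 2x2 joint distribution; any normalized table would do.

```python
import math

def H(probs):
    """Shannon entropy in bits of a flat list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Made-up joint distribution p(x, y); rows are states of X, columns states of Y.
joint = [[0.50, 0.00],
         [0.25, 0.25]]

px = [sum(row) for row in joint]
py = [sum(col) for col in zip(*joint)]

H_X  = H(px)
H_Y  = H(py)
H_XY = H([p for row in joint for p in row])

H_X_given_Y = H_XY - H_Y           # conditional missing information ("observed entropy")
I_XY        = H_X + H_Y - H_XY     # mutual information

print(round(H_X_given_Y, 3), round(H_X - I_XY, 3))   # the two agree: H(X|Y) = H(X) - I(X;Y)
print(H_X_given_Y <= H_X)                             # knowing Y never increases missing information on X
```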

Mutual information

Let's refer back to the game of hiding a gold coin. This time we'll represent the eight boxes of equal size as the random variable X, which is a uniform distribution with 8 possible values and a true entropy of 3 bits. Let Y be another random variable that represents a uniform distribution over 4 states that are present in X. This means that Y represents a subset of the possible outcomes of X, and it can be seen as "prior knowledge" about X.

It's also important to note that having prior knowledge doesn't mean that we know the complete information of the problem or the distribution. In this case, we still have missing information to be found in X, but Y gives us a starting point for understanding X.

As we discussed previously, "prior knowledge" is the information and understanding that an individual has before engaging in a certain task or problem. It represents the information that can be used to reduce the uncertainty of the problem or the distribution, and it gives a starting point to understand the problem or the distribution.

Let's use another scenario, where Y is a uniform distribution with 8 states, 4 of which are present in X and 4 of which are not.

In this scenario, it's important to note that 4 of the states of Y are not present in X, meaning that Y also contains extraneous information that does not pertain to the problem or the distribution represented by X. Therefore, it would be more accurate to refer to Y as "prior information" or "partial prior knowledge" rather than "prior knowledge" of X. It's important to take into account the relationship between X and Y to know how to use this prior information effectively.

Mutual information is a measure of the amount of information that is shared between two random variables. It is defined as the reduction in uncertainty about one variable given that we know the value of the other variable.

It is possible to refer to mutual information as "prior knowledge", as it quantifies the amount of information that is shared between X and Y, and it can be seen as the reduction in the uncertainty of X, when the value of Y is known. In that sense, it represents the information that is gained by knowing the value of Y.

The mutual information I(X:Y) is presented in the figure below.

The shared part labeled 1 is the mutual information I(X:Y), which represents what Y knows about X and vice versa. Note that I(X:Y) is defined symmetrically with respect to X and Y. It tells us exactly where our information is coming from: the variable Y. If the area of I(X:Y) is equal to the combined area of parts 2, 3, and 4, then we were able to remove 1 bit of information from H(X).

The remaining missing information to be discovered H(X|Y) is conditional on Y.

It could be more accurate to say that mutual information I(X;Y) represents the "knowledge shared" between X and Y. It tells us how much information is shared between X and Y, but we still need to make use of this information to "discover knowledge".

Sometimes, I(X:Y) is referred to as the average amount of information conveyed by one random variable X on the other random variable Y, and vice versa. Thus, we see that I(X:Y) is a measure of an average correlation i.e., it is a measure of the extent of dependence between X and Y. The mutual information is always non-negative. It is zero when the two experiments are independent.

Mutual information and information gain are essentially the same concept, though the names may be used in different contexts. Technically, they calculate the same quantity when applied to the same data, so mutual information is sometimes used as a synonym for information gain.
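Since mutual information and information gain compute the same quantity, the equivalence can be verified on a small made-up dataset; here X is a class label and Y is the attribute used for the split, both invented purely for illustration.

```python
import math
from collections import Counter

def H(labels):
    """Shannon entropy in bits of a list of outcomes (labels or tuples of labels)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

# Toy data: X is the class label, Y is the attribute we split on.
pairs = [("yes", "left")] * 4 + [("no", "left")] * 1 + \
        [("yes", "right")] * 1 + [("no", "right")] * 4
X = [x for x, _ in pairs]
Y = [y for _, y in pairs]

# Mutual information from entropies: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_XY = H(X) + H(Y) - H(pairs)

# Information gain: H(X) minus the weighted entropy of X within each value of Y.
def gain(X, Y):
    n = len(X)
    remainder = 0.0
    for y in set(Y):
        subset = [x for x, yy in zip(X, Y) if yy == y]
        remainder += len(subset) / n * H(subset)
    return H(X) - remainder

print(round(I_XY, 4), round(gain(X, Y), 4))   # the two quantities coincide (about 0.278 bits)
```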

Multivariate mutual information

Multivariate mutual information is a measure of the amount of mutual dependence between multiple random variables. It is a generalization of the concept of mutual information, which is used to quantify the dependence between two random variables. In the case of four dependent variables, it quantifies the degree of association between all possible pairs of the variables, as well as the association between all possible combinations of three and four variables.

The full range of their possible information-theoretic relationships appears in the information diagram (I-diagram) below:

Here we are interested only in the quantities listed below:

  • I(X:Y:Z:U) is the joint mutual information of X,Y,Z and U
  • I(X:Z|Y,U) is the conditional mutual information between X and Z given Y and U. It quantifies the dependence between X and Z given that the variables Y and U are known.
  • I(Y:Z:X|U) is the conditional multivariate mutual information (co-information) between three random variables Y, Z and X, given another random variable U. It is a measure of the amount of mutual dependence between Y, Z and X given U. In terms of conditional mutual information, it can be expressed as I(Y:Z:X|U) = I(Y:Z|U) - I(Y:Z|X,U).
  • I(U:Z:X|Y) is the conditional multivariate mutual information between three random variables U, Z and X, given another random variable Y. It quantifies the dependence between U, Z, and X given that the variable Y is known.

It is important to note that this representation of conditional multivariate mutual information is based on the assumption that all variables are discrete.
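As a rough sketch of how these conditional quantities can be computed for discrete variables, the snippet below builds a made-up joint distribution over four binary variables and evaluates I(X:Z|Y,U) and I(Y:Z:X|U) from joint entropies; the probability table is invented solely for illustration.

```python
import math
from itertools import product
from collections import defaultdict

# Made-up joint distribution p(x, y, z, u) over four binary variables:
# start from uniform, then shift a little probability to create dependence.
joint = {outcome: 1 / 16 for outcome in product((0, 1), repeat=4)}
joint[(0, 0, 0, 0)] += 1 / 32
joint[(1, 1, 1, 1)] -= 1 / 32

def H(indices):
    """Joint entropy in bits of the marginal over the given positions (0=X, 1=Y, 2=Z, 3=U)."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in indices)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

# Conditional mutual information: I(X;Z|Y,U) = H(X,Y,U) + H(Y,Z,U) - H(Y,U) - H(X,Y,Z,U)
I_XZ_given_YU = H((0, 1, 3)) + H((1, 2, 3)) - H((1, 3)) - H((0, 1, 2, 3))

# Conditional co-information: I(Y;Z;X|U) = I(Y;Z|U) - I(Y;Z|X,U); it may be negative.
I_YZ_given_U  = H((1, 3)) + H((2, 3)) - H((3,)) - H((1, 2, 3))
I_YZ_given_XU = H((0, 1, 3)) + H((0, 2, 3)) - H((0, 3)) - H((0, 1, 2, 3))
I_YZX_given_U = I_YZ_given_U - I_YZ_given_XU

print(I_XZ_given_YU, I_YZX_given_U)
```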

Lost information

Consider an example. Suppose we have eight boxes of equal size, with a gold coin hidden in one of them. The goal is to locate this gold coin by asking binary (yes/no) questions. I start by dividing the eight boxes into two groups of four (a left and right group) and inquire if the coin is in the left group. The response is affirmative, leading me to divide the left group into two smaller groups of two boxes each (left and right), and I ask the same question. Again, the answer is affirmative. At this point in time I learn that my initial binary question was answered incorrectly; the coin was not in the left group after all. This means I need to start my questioning from the beginning due to this misinformation.

In the scenario we've described, the incorrect answers to my questions have led to a loss of information. I initially gained information from the answers to my questions, which allowed me to reduce the number of boxes where the coin could be. However, when I found out that those answers were incorrect, the information I gained turned out to be false, and I had to discard it and start over. This is a form of information loss.

In terms of information theory, this could be seen as an increase in the missing information, or uncertainty, of the system. When I received the answers to my questions, I reduced the missing information by narrowing down the possible locations of the coin. But when I found out the answers were incorrect, the missing information increased again because the possible locations of the coin again became as uncertain as in the beginning.

So in this context, the "lost information" could be quantified as the increase in remaining missing information resulting from the discovery that the initial answers were incorrect.

Let's frame our example in terms of two random variables X and Y, mutual information I(X;Y), and conditional missing information H(Y|X).

Let's define:

  • X as the true location of the gold coin.
  • Y as the answers to my questions.
  • Missing information of X, H(X): This is the total uncertainty or missing information of the system before any questions are asked. In our case, with 8 boxes and the coin equally likely to be in any box, the missing information is maximal and can be calculated as H(X) = log2(8) = 3 bits.
  • Mutual Information, I(X;Y): This is the amount of information gained about X (the location of the coin) by knowing Y (the answer to a question). If the answer to a question is correct, it reduces the number of possible locations of the coin by half, so each correct answer provides 1 bit of information, reducing the remaining uncertainty H(X|Y) of X by 1 bit.
  • Conditional missing information, H(X|Y): This is the remaining uncertainty about X (the location of the coin) after knowing Y (the answer to a question). If an answer is incorrect, it means that the information gained from that answer was false, and the missing information of X given that answer is actually higher than previously thought. Each incorrect answer increases the conditional missing information H(X|Y) by 1 bit, because it adds back one bit of uncertainty about the location of the coin.
  • The relation between the three quantities is: I(X:Y)=H(X)-H(X|Y)

Initially, I have no information about X, so the missing information H(X) is at its maximum (log2 of the number of boxes, in this case log2(8) = 3 bits of information).

When I ask a question and receive an answer, I gain some information about X. This is represented by the mutual information I(X;Y). If the answers are correct, each question reduces the uncertainty of X by 1 bit (since each question divides the number of possible locations of the coin by 2).

However, if I find out that an answer was incorrect, the information I gained from that answer turns out to be false. This could be represented as an increase in the conditional missing information H(X|Y), the remaining uncertainty about X given the knowledge of Y. In this case, the conditional missing information increases by 1 bit for each incorrect answer, because each incorrect answer adds back one bit of uncertainty about the location of the coin.

When an answer is correct, it reduces the uncertainty about X, and this reduction in uncertainty is quantified by the mutual information. When an answer is incorrect, the supposed reduction in uncertainty turns out to be false, and the actual uncertainty about X is higher than what the mutual information suggested. This discrepancy could be thought of as the "lost information".

The "lost information" is the information I initially thought I gained from the incorrect answers, but which turned out to be false.

So, in this context, the "lost information" due to an incorrect answer could be quantified as an increase in the conditional missing information H(X|Y). For each incorrect answer, I "lose" 1 bit of information, because I have to add back one bit of uncertainty about the location of the coin. The "lost information" can be thought of as the difference between the reduction in uncertainty suggested by the mutual information and the actual reduction in uncertainty, which is less due to the incorrect answer.

This can be represented as:

Hlost=H(X|Yincorrect)-H(X|Ycorrect)

Where:

  • H(X|Yincorrect) is the conditional missing information of X given the incorrect answer, which is higher because the incorrect answer adds uncertainty.
  • H(X|Ycorrect) is the conditional missing information of X given the correct answer, which is lower because the correct answer reduces uncertainty.

This equation gives a measure of the "lost information" due to incorrect answers.

Let's try to incorporate mutual information into the picture. In the context of the gold coin example, the total information received (H) would be the sum of the information I gained from the correct answers and the information I initially thought I gained from the incorrect answers.

The equation H = Mutual Information + Information Lost is a general conceptual equation that describes the total information received (H) as the sum of the mutual information (the information successfully conveyed) and the information lost (the information intended to be conveyed but not successfully received due to errors, noise, or other forms of interference).

Let's break it down:

  1. Information from Correct Answers (Mutual Information from Correct Answers): Every time I asked a binary question and received a correct answer, I gained mutual information. This mutual information I(X;Y_correct) reduced the uncertainty about the location of the coin (X). For example, when I asked if the coin is in the left group of 4 boxes and the answer was "yes", I gained 1 bit of information that reduced the number of possible locations for the coin from 8 to 4.
  2. Information from Incorrect Answers (Apparent Mutual Information from Incorrect Answers): When I received an incorrect answer, I initially thought I was gaining mutual information. This apparent mutual information I(X;Y_incorrect) did not actually reduce the uncertainty about the location of the coin, but I initially believed it did. For example, when I asked if the coin is in the left group of 2 boxes and the answer was "yes" (but this was incorrect), I initially thought I had gained mutual information that reduced the number of possible locations for the coin from 4 to 2.
  3. Lost Information: The "lost information" is the apparent mutual information I initially thought I gained from the incorrect answers, but which turned out to be false. This can be thought of as negative mutual information, because it represents a decrease in the actual mutual information about the location of the coin. This lost information is equal to the apparent mutual information from incorrect answers I(X;Y_incorrect) minus the actual mutual information (which is zero for incorrect answers).

Based on the previous explanation, the "lost information" can be quantified as the difference between the apparent mutual information I initially thought I gained from the incorrect answers and the actual mutual information. Since the actual mutual information from incorrect answers is zero (because they do not reduce uncertainty about the coin's location), the "lost information" is simply the apparent mutual information from incorrect answers.

If we denote the apparent mutual information from incorrect answers as Iapparent, the equation for the lost information Hlost would be:

Hlost = Iapparent - Iactual

Since Iactual = 0 for incorrect answers, this simplifies to:

Hlost=Iapparent

This equation quantifies the "lost information" as the apparent mutual information I initially thought I gained from the incorrect answers.

As an example, if the total missing information H(X) = 3, this means that there are 2^3 = 8 possible outcomes (or locations for the coin), which is consistent with our example.

If I experienced 1 wrong answer to a binary question, this means that I initially thought I had reduced the number of possible outcomes by half. In terms of missing information, this would correspond to a reduction of 1 bit of missing information, since each binary question reduces the missing information by 1 bit.

However, since this answer was incorrect, this apparent reduction in missing information did not actually occur. Therefore, the "lost information" Hlost would be equal to the apparent reduction in missing information that I initially thought I had achieved, which is 1 bit.

So, in this case, Hlost = 1 bit. This represents the amount of information that I initially thought I had gained from the incorrect answer, but which turned out to be false.

Perceived missing information is the sum of the actual missing information of a system H(X) and the additional missing information perceived due to incorrect or misleading information Hlost. It represents the total amount of uncertainty perceived about a system, taking into account both the actual uncertainty and the additional uncertainty introduced by incorrect or misleading information.

Mathematically, we could express this as:

Hperceived=H(X)+Hlost

where:

  • H(X) is the actual missing information of the system, representing the actual number of questions needed to locate the gold coin, which is akin to the mutual information I(X;Y), and not the inherent uncertainty about the state of the system. This is because the mutual information I(X;Y) quantifies the amount of information gained about one random variable (in this case, the location of the gold coin) by observing another random variable (in this case, the answers to my questions). So, if I were lucky and found the gold coin with just one question, then H(X) would be 1 bit, even though the average minimum number of binary questions needed to locate the gold coin in 8 boxes would be 3 bits.
  • Hlost is the "lost information" due to incorrect or misleading information, representing the additional perceived uncertainty introduced by this incorrect or misleading information.

In the context of our example, Hperceived would represent the total amount of uncertainty I perceived about the location of the coin, taking into account both the actual uncertainty and the additional uncertainty introduced by the incorrect answer.

So, if H(X) = 3 bits (representing 8 possible locations for the coin) and Hlost = 1 bit (representing the incorrect answer), then the total perceived missing information would be Hperceived = 3 bits + 1 bit = 4 bits. This would represent a perceived uncertainty of 16 possible locations for the coin, even though there are actually only 8 possible locations.
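A minimal sketch of the perceived-missing-information arithmetic for the 8-box example, assuming (as in the text) that each incorrect answer appears to contribute exactly 1 bit:

```python
import math

# Gold-coin example: 8 equally likely boxes, binary (yes/no) questions.
N = 8
H_X = math.log2(N)                     # 3 bits of actual missing information

incorrect_answers = 1                  # answers later discovered to be wrong
I_apparent = incorrect_answers * 1.0   # information we thought those answers gave us
H_lost     = I_apparent                # the actual gain from wrong answers is zero

H_perceived = H_X + H_lost
print(H_X, H_lost, H_perceived)        # 3.0 1.0 4.0 -> 2**4 = 16 perceived locations
```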

The term "perceived missing information" is not a standard term in Information Theory or other scientific fields. The concept of "perception" introduces a subjective element that is typically not present in the mathematical formulations of Information Theory, which deals with quantifiable measures of information, uncertainty, and missing information.

However, in broader contexts, the idea of "perceived missing information" or "perceived uncertainty" could potentially be used to describe situations where the subjective perception or understanding of a situation differs from the objective, quantifiable uncertainty. This might be relevant in fields like psychology or decision theory, where the focus is on understanding human perception and decision-making processes[7][8].

Information Loss Rate

Hlost, as we've defined it, represents the amount of information that was thought to be gained but was actually not due to incorrect answers. In the context of searching for a gold coin, a lower Hlost would indicate a more efficient or accurate process, because it means less information was lost due to incorrect answers.

We can use Hlost as a metric to compare two different processes searching for a gold coin. Remember, this metric assumes that all other things are equal, such as the total number of questions asked and the initial missing information (total number of possible locations for the coin). If these factors vary between the two processes, we need to take them into account when comparing the processes. For that, we need to normalize the Hlost value to make a fair comparison.

Let's consider the gold coin example where we have 8 boxes and I am asking binary questions to find the coin. Let's say I asked a total of 3 questions. The first two questions were answered incorrectly, and the third question was answered correctly. So, we have Hlost = 2 bits (from the two incorrect answers), and the total number of questions asked was 3.

Each question reduces the perceived missing information by 1 bit (since they are binary questions), so after asking 3 questions, I have reduced the perceived missing information by 3 bits.

The fact that the first two questions were answered incorrectly doesn't change the amount of perceived missing information reduction, because from my perspective at the time I asked those questions, I believed I was reducing the missing information. It's only in hindsight, after learning that those answers were incorrect, that I realize some of that perceived missing information reduction was actually "lost information".

In our definitions, Hperceived is the sum of the actual missing information H(X) and the additional perceived missing information due to incorrect or misleading information (Hlost).

In our scenario, where I asked 3 questions and the first two were answered incorrectly, H(X) would be 1 bit (since only the one correct answer was actually needed to locate the coin), and Hlost would be 2 bits (from the two incorrect answers). So, Hperceived would be H(X) + Hlost = 1 bit + 2 bits = 3 bits.

In the context of our gold coin example, H(X) represents the actual number of questions needed to locate the gold coin, which is akin to the mutual information I(X;Y) in this scenario.

This is because the mutual information I(X;Y) quantifies the amount of information gained about one random variable (in this case, the location of the gold coin) by observing another random variable (in this case, the answers to my questions).

So, if I were lucky and found the gold coin with just one question, then H(X) would be 1 bit, even though the average minimum number of binary questions needed to locate the gold coin in 8 boxes would be 3 bits.

Thus, normalizing by the perceived missing information Hperceived instead of the total number of questions could provide a more accurate measure of efficiency in the context of our gold coin example.

This is because Hperceived takes into account both the actual number of questions needed to locate the gold coin (H(X)) and the additional perceived missing information due to incorrect or misleading information (Hlost).

By normalizing by Hperceived, we're essentially measuring the proportion of the total perceived missing information that was due to incorrect or misleading information. This gives a measure, which can be useful for comparing different processes even if they involve different numbers of questions or different initial entropies.

"Information Loss Rate" is a suitable term for the ratio Hlost / Hperceived. It accurately describes the concept we're trying to convey: the rate at which information is lost in relation to the total perceived information.

Thus, for our example scenario, the Information Loss Rate (L) would be:

L = Hlost / Hperceived = 2 / 3 ≈ 0.67

This means that about 67% of the total perceived missing information was lost due to incorrect answers. A lower value would indicate a more accurate process, because it means a smaller share of the perceived missing information was lost to incorrect answers.

Here's how we can calculate Information Loss Rate (L):

  1. Calculate Hlost for each process: For each process, calculate Hlost based on the number of incorrect answers and the apparent reduction in missing information that was initially thought to have been achieved with each incorrect answer.
  2. Calculate the Information Loss Rate (L): Divide Hlost by the total perceived missing information for each process. By normalizing by Hperceived, we're essentially measuring the proportion of the total perceived missing information that was due to incorrect or misleading information. A lower value would indicate a more efficient process, because it means a smaller proportion of the perceived missing information was due to incorrect or misleading information. The formula would be:

    L = Hlost / Hperceived

    This normalized metric allows us to compare the efficiency of different processes even if they involve different numbers of questions or different initial entropies. The initial missing information (Hinitial) for each process would typically be H(X), which is the actual (objective) missing information of the system, representing the average minimum number of binary questions needed to reduce uncertainty to zero. A short sketch of this calculation follows this list.
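Here is a minimal sketch of the Information Loss Rate calculation for the scenario above (3 questions, the first two answered incorrectly), assuming each answer corresponds to exactly 1 bit:

```python
# Scenario from the text: 3 binary questions asked, the first two answered incorrectly.
questions_asked   = 3
incorrect_answers = 2

H_X    = float(questions_asked - incorrect_answers)   # 1 bit: only the correct answer located the coin
H_lost = float(incorrect_answers)                      # 2 bits thought to be gained, then discarded

H_perceived = H_X + H_lost                             # 3 bits perceived in total
L = H_lost / H_perceived                               # Information Loss Rate
print(H_X, H_lost, H_perceived, round(L, 2))           # 1.0 2.0 3.0 0.67
```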

Knowledge Discovered

Let's delve into the world of knowledge acquisition and decision-making through a focused scenario: a person is tasked with typing the first letter of a word. This situation offers a window into how decisions are influenced and made based on available knowledge and external information sources. In this context, we define key variables to navigate the process:

  • X: A binary random variable representing the outcome of the decision process. It has two possible states: knowing (1) signifies that the person knows the first letter of the word to type, while not knowing (0) indicates the opposite. X embodies the immediate goal of the decision-making process—identifying the correct letter to type.
  • Y: Represents the person's existing knowledge base, encompassing all prior information relevant to making the decision about X. Y influences the initial state of certainty or uncertainty regarding X, acting as the foundation upon which decisions are based. Initially, Y holds a certain level of information that may or may not be sufficient to resolve X.
  • Z: An external source of information, such as a book, from which the person can obtain answers to binary questions that directly pertain to X. Each interaction with Z is aimed at reducing the uncertainty about X. Z is a mechanism through which Y is enhanced, enabling a more informed decision-making process.

Given that X has two states (know or don't know the letter), initially, there is a certain level of entropy H(X) associated with X. In theory, H(X) represents the initial uncertainty about X before considering any existing knowledge (Y) they might have. It's a measure of how uncertain the situation is in general, based on all possible outcomes of X. H(X) is conceptually the total uncertainty before any interaction or knowledge acquisition process begins. H(X) is more about the potential variety and unpredictability in X itself, rather than the person's specific state of knowledge at any given moment.

The mutual information I(X;Y) quantifies how much the information obtained from Y reduces the person's uncertainty about X. It also captures the variability in the utility of Y. In instances where no questions are needed, Y has effectively reduced all uncertainty about X by itself, suggesting high mutual information for those instances. Conversely, when questions are needed, Y has not fully reduced uncertainty about X, indicating lower mutual information for those specific instances. The overall I(X;Y) would reflect an average across all instances, considering both when Y is and isn’t sufficient by itself.

Conditional Entropy H(X|Y) quantifies the uncertainty that remains about X after considering the existing knowledge Y. It reflects the uncertainty with the preliminary knowledge Y before augmentation through Z. In practical terms, this means observing whether, in a specific instance, they need to ask further questions (indicating some remaining uncertainty about X) or if Y alone was enough to make the decision (no remaining uncertainty about X). Thus, H(X∣Y) can be understood or estimated based on the person's experience in each instance — how often and under what conditions they find Y sufficient or insufficient. The person will experience H(X∣Y) directly through the necessity of asking questions to reduce uncertainty. Over many instances, our observations about when and how often Y is sufficient could inform an estimate of H(X∣Y), and indirectly, the nature of H(X) itself by aggregating our experiences across different contexts or states of knowledge (Y).

Dynamic Process of Knowledge Augmentation

Initially, Y may or may not be sufficient to resolve X, i.e., the person may not know the letter to type. As the person encounters uncertainty about X, they engage in a process of querying Z to obtain specific information that addresses this uncertainty. Each query to Z is formulated as a binary question, the answer to which directly reduces the uncertainty about X by one bit of information. The key aspect of this scenario is that as the person receives answers from Z, their knowledge base (Y) is incrementally augmented, enhancing their capability to make an informed decision about X.

  • Incremental Increase in I(X;Y): With each piece of information acquired from Z and integrated into Y, there is a corresponding increase in the mutual information I(X;Y). This reflects a reduction in the uncertainty about X due to the enhanced state of Y.
  • Reduction of H(X∣Y): H(X∣Y) quantifies the remaining uncertainty about X given the current state of knowledge Y. As Y is augmented with each answer from Z, the uncertainty about X decreases. Therefore, the process of augmenting Y effectively reduces H(X∣Y), because Y becomes increasingly informative about X.

This process highlights a dynamic view of knowledge acquisition, where Y is not static but evolves through interaction with Z. The evolving state of Y and its impact on decision-making about X can be modeled as a sequence of updates, each contributing to a gradual increase in I(X;Y) and decrease in H(X∣Y). Eventually, the uncertainty about X is reduced to zero.

Let's look at a numerical example of this knowledge acquisition process. Say we observe the person and notice they asked 3 binary questions to determine the letter they need to type.

Initial State (Before Asking Questions):

  • Entropy (H(X)): We don't know the value of H(X), because it depends on the context of the decision to be made (i.e., the selection of the letter).
  • Conditional Entropy (H(X∣Y)): Initially, we don't know the value of H(X∣Y).
  • Mutual Information (I(X;Y)): Initially, the mutual information between X and Y is unknown.

After Asking Questions (and successfully typing the letter):

  • Entropy (H(X)): We still don't know the value of H(X).
  • The mutual information I(X;Y) increases by 3 bits, because 3 binary questions were asked, each gaining 1 bit of information. This reflects the total amount of information gained by Y that was necessary to eliminate all uncertainty about X. However, we still don't know the value of I(X;Y); we know only the delta.
  • The conditional entropy H(X∣Y) is 0 bits, indicating no remaining uncertainty about X given the information obtained from Y.

The initial value of H(X∣Y) can be inferred from the necessity to ask questions (and thus gain information) until H(X∣Y)=0. If gaining 3 bits of information was necessary to eliminate all uncertainty about X (i.e., knowing which letter to type), and after this information gain, H(X∣Y)=0 bits (indicating no remaining uncertainty about X given Y), then it is reasonable to conclude that the initial value of H(X∣Y) was 3 bits. We reiterate that the reduction in H(X∣Y) is directly related to the information gain measured by I(X;Y).

The process we've described suggests that the total amount of information needed and gained to resolve the uncertainty was equal to the initial conditional entropy. Counting the number of questions asked allows us to make the concept of conditional entropy H(X∣Y) concrete and measurable.
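A tiny sketch of this bookkeeping for the single-letter example (3 questions asked, each assumed to be worth exactly 1 bit):

```python
# Single-letter example: the person asked 3 binary questions before typing the letter.
questions_asked = 3

delta_I_XY    = questions_asked * 1          # each answer adds 1 bit to I(X;Y)
H_XgY_final   = 0                            # no uncertainty remains once the letter is typed
H_XgY_initial = H_XgY_final + delta_I_XY     # inferred initial conditional entropy H(X|Y)

print(delta_I_XY, H_XgY_initial)             # 3 3
```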

Knowledge Discovered calculation for the longest English word

Let's explain how to measure conditional entropy H(X∣Y) for the whole word “Honorificabilitudinitatibus” from the exercise presented here. At the end of the exercise we have typed the word “Honorificabilitudinitatibus” and, along with it, produced a sequence of zeros and ones:

0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1

In this sequence:
  • A "1" indicates an interval where the person knew what letter to type next, implying no additional information from the external source (Z) was needed at that moment. This suggests that their existing knowledge (Y) at that point was sufficient to make a decision about X (the letter to type).
  • A "0" indicates an interval where the person did not know what letter to type and had to query Z to obtain the necessary information. Each "0" represents a binary question asked to Z, which directly reduces the uncertainty about X by one bit of information, contributing to the conditional entropy H(X∣Y).

To measure the conditional entropy H(X∣Y) for typing the entire word:

  1. Count the "0"s: Each "0" in the sequence represents a binary question asked, which corresponds to a bit of information needed to resolve uncertainty about the next letter to type. Counting the number of "0"s gives us the total number of binary questions asked throughout the process of typing the word.
  2. Calculate H(X∣Y): The total number of "0"s (binary questions asked) quantifies the total conditional entropy H(X∣Y) for the word. This is because the conditional entropy represents the remaining uncertainty about X (each letter to be typed) given the existing knowledge (Y) before each decision is made. The process of asking questions and receiving answers effectively reduces this uncertainty.

The sequence contains 10 instances of "0". This implies that 10 binary questions were needed to resolve the uncertainty about what letters to type for the entire word. Therefore, the total conditional entropy H(X∣Y) for typing the word, given the existing knowledge (Y) and the process of querying Z, is 10 bits.
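Counting the questions in the sequence can be done directly; a minimal sketch:

```python
sequence = "0 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1".split()

questions_asked = sequence.count("0")   # intervals where Z had to be queried
letters_typed   = sequence.count("1")   # intervals where a letter was typed from existing knowledge Y

H_X_given_Y = questions_asked           # total conditional missing information, in bits
print(questions_asked, letters_typed, H_X_given_Y)   # 10 27 10
```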

Calculating the Information per Symbol

If we assume that typing a symbol takes as much time as asking a binary question, and the sequence contains 10 instances of asking questions (represented by "0"s) and 27 symbols (letters) that need to be typed, then the total number of time intervals taken to type the word is the sum of these two activities. Thus, typing the word in this scenario took 37 time intervals.

Given that it took 10 bits of information to type 27 symbols (the letters of the word), dividing the total bits of information by the number of symbols gives the average information per symbol for the typing process: H = 10 bits / 27 symbols ≈ 0.37 bits per symbol.

This means, on average, each symbol in the word required about 0.37 bits of information to resolve the uncertainty about what letter to type next, given the existing knowledge and the process of querying an external source Z.

This calculation is based on the total conditional entropy H(X|Y) = 10 bits needed to type all 27 letters of the word, where the conditional entropy represents the total uncertainty resolved or information gained through the process of asking questions.

This metric provides insight into the efficiency of the knowledge acquisition process in this specific scenario. It indicates how much information, on average, was needed per letter to type the entire word, reflecting the average reduction in uncertainty or the average information gain per symbol, given the initial knowledge state Y and the assistance from querying Z.

Importantly, this calculation of information per symbol specifically reflects the conditional entropy H(X|Y) aspect and the process of augmenting existing knowledge Y to resolve uncertainty about the sequence X, not the mutual information between the sequence and the knowledge base per se.

Symbol Rate

The symbol rate refers to the number of symbols (letters) typed per time interval. Given that 27 symbols are typed over 37 time intervals (since typing a symbol and asking a binary question each take one time interval), the symbol rate is calculated as: Rsymbol = 27 symbols / 37 time intervals ≈ 0.73. This means that, on average, about 73% of the time intervals were dedicated to typing a symbol.

Information Rate

The information rate, on the other hand, refers to the amount of information processed or communicated per unit of time. In the context of our scenario, it relates to the number of binary questions asked (which represent gaining information to reduce uncertainty about what to type) per total time intervals. Since 10 binary questions were asked over the same 37 time intervals, the information rate is: Rinfo = 10 bits / 37 time intervals ≈ 0.27. Note that the denominator must be the total number of time intervals (37) rather than the number of symbols (27) for this to represent the rate of information gained over time. This suggests that, on average, about 27% of the time intervals were dedicated to gaining the information needed to reduce uncertainty about what to type next.

These rates provide insight into how the time was allocated between typing symbols and resolving uncertainties through asking questions in the process of typing the word.
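The three ratios discussed above follow directly from the same counts; a short sketch:

```python
questions_asked = 10                                 # "0" intervals: binary questions to Z
letters_typed   = 27                                 # "1" intervals: symbols typed
time_intervals  = questions_asked + letters_typed    # 37 intervals in total

bits_per_symbol = questions_asked / letters_typed    # ~0.37 bits of information per symbol
symbol_rate     = letters_typed / time_intervals     # ~0.73 symbols per time interval
info_rate       = questions_asked / time_intervals   # ~0.27 bits per time interval

print(round(bits_per_symbol, 2), round(symbol_rate, 2), round(info_rate, 2))   # 0.37 0.73 0.27
```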

Appendix A - Information theory with I-diagrams

If we treat our random variables as sets (without being precise as to what the sets contain), we can represent the different information-theoretic concepts using an information diagram, or I-diagram for short. It’s essentially a Venn diagram but concerned with information content.

This intuition allows us to create and visualize more complex expressions instead of having to rely on the more cumbersome notation used in information theory.

Yeung’s “A new outlook on Shannon's information measures” shows that our intuition is more than just that: it is actually correct! The paper defines a signed measure M that maps any set back to the value of its information-theoretic counterpart and allows us to compute all possible set expressions for a finite number of random variables. The variables have to be discrete because differential entropy does not behave consistently; for example, it can be negative and is not invariant under a change of variables. (This is not a weakness of the approach but a weakness of differential entropies in general.)

M is a signed measure because the mutual information between more than two variables can be negative. This means that, while I-diagrams are correct for reading off equalities, we cannot use them to intuit about inequalities (which we could with regular Venn diagrams).

How many atoms are there?

Intuitively, each atom is exactly a non-overlapping area in an I-diagram.

For n variables, there are 2^n - 1 meaningful atoms, as we choose for each variable whether it goes into the atom as part of the mutual information or as a variable to be conditioned on (two options for each of the n variables). The degenerate case that conditions on all variables can be excluded.

How can we compute the atoms?

Every set expression is a union of some of these atoms, and its area is a sum in which each atom contributes with a factor of 0 or 1. Such a binary combination is a special case of a linear combination. All joint entropies are well-defined, and thus we can compute the information measure of every atom, which we can use in turn to compute the information measure for any other set expression.
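As a concrete sketch of this computation, the snippet below evaluates a few atoms of the three-variable I-diagram from joint entropies, using the classic XOR construction (Z = X XOR Y with X and Y fair and independent); this illustrative example also demonstrates the signed-measure point above, since the central three-way atom comes out negative.

```python
import math
from itertools import product
from collections import defaultdict

# Joint distribution of three binary variables with Z = X XOR Y (X, Y fair and independent).
joint = defaultdict(float)
for x, y in product((0, 1), repeat=2):
    joint[(x, y, x ^ y)] = 0.25

def H(indices):
    """Joint entropy in bits of the variables at the given positions (0=X, 1=Y, 2=Z)."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in indices)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

H_X, H_Y, H_Z = H((0,)), H((1,)), H((2,))
H_XY, H_XZ, H_YZ, H_XYZ = H((0, 1)), H((0, 2)), H((1, 2)), H((0, 1, 2))

# Three atoms of the 2**3 - 1 = 7, each a linear combination of joint entropies.
I_XYZ   = H_X + H_Y + H_Z - H_XY - H_XZ - H_YZ + H_XYZ   # central atom I(X;Y;Z), can be negative
I_XY_gZ = H_XZ + H_YZ - H_Z - H_XYZ                       # I(X;Y|Z)
H_X_gYZ = H_XYZ - H_YZ                                     # H(X|Y,Z)

print(I_XYZ, I_XY_gZ, H_X_gYZ)   # -1.0 1.0 0.0 for the XOR example
```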

Works Cited

1. Adami C. (2016). What is information? Philosophical Transactions. Series A, Mathematical, Physical, and Engineering Sciences, 374(2063), 20150230. https://doi.org/10.1098/rsta.2015.0230

2. Ben-Naim, A. (2008), A Farewell to Entropy: Statistical Thermodynamics Based on Information, World Scientific Publishing, Singapore

3. Deming, W. Edwards, (2000). The New Economics for Industry, Government, Education, 2nd Edition,The MIT Press. Cambridge, Mass

4. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics" (PDF). Physical Review. Series II. 106 (4): 620–630. Bibcode:1957PhRv..106..620J. doi:10.1103/PhysRev.106.620. MR 0087305.

5. Jaynes, E. T. (1957). "Information Theory and Statistical Mechanics II" (PDF). Physical Review. Series II. 108 (2): 171–190. Bibcode:1957PhRv..108..171J. doi:10.1103/PhysRev.108.171. MR 0096414.

6. Quinlan, J.R. Induction of decision trees. Mach Learn 1, 81–106 (1986). https://doi.org/10.1007/BF00116251

7. Corso, K. B., & Löbler, M. L.. (2011). Understanding the subject's behavior in the interaction with a decision support system under time pressure and missing information. JISTEM - Journal of Information Systems and Technology Management, 8(3), 581–604. https://doi.org/10.4301/S1807-17752011000300004

8. Garcia-Retamero, R., & Rieskamp, J. (2009). Do people treat missing information adaptively when making inferences? Quarterly Journal of Experimental Psychology, 62(10), 1991–2013. https://doi.org/10.1080/17470210802602615
