The game of hiding a coin
In general, “knowledge” and “information” are vague terms. To show the difference between them, let's consider the following game.
Suppose we have eight boxes as shown below. There is a coin hidden in one of the boxes.
1  2  3  4  5  6  7  8
In order to find the coin we are allowed to ask binary questions, i.e. questions with “Yes” or “No” answers. The boxes are of equal size, hence there is a probability of 1/8 of finding the coin in any specific box.
There are many ways we can decide what binary questions to ask. Here are two extreme and well-defined strategies.
Brute force strategy
1) Is the coin in box 1? No, it isn't.
2) Is the coin in box 2? No, it isn't.
3) Is the coin in box 3? No, it isn't.
4) Is the coin in box 4? No, it isn't.
5) Is the coin in box 5? No, it isn't.
6) Is the coin in box 6? No, it isn't.
7) Is the coin in box 7? No, it isn't.
Optimal strategy
1) Is the coin in the right half of the eight boxes, that is in one of boxes 5, 6, 7, or 8? Yes, it is.
2) Is the coin in the right half of the remaining four boxes, that is in boxes 7 or 8? Yes, it is.
3) Is the coin in the right half of the remaining two boxes, that is in box 8? Yes, it is.
[Animation: the brute force strategy searching for a gold coin hidden in 1 of 8 boxes.]
With the brute force strategy, the first question removes 1/8 of the possible coin locations. The second question removes 1/7 of the remaining possible coin locations, and so on. The seventh question removes 1/2, or 50%, of the remaining possible coin locations.
[Animation: the optimal strategy searching for a gold coin hidden in 1 of 8 boxes.]
With the optimal strategy, each question removes 1/2, or 50%, of the remaining possible coin locations.
With the brute force strategy we have a small 1/8 probability of finding the coin with one question, but we may need to ask up to seven questions. With the optimal strategy we need to ask exactly three questions. Comparing the two, brute force takes at least one question and at most seven, or 35/8 ≈ 4.4 questions on average when the coin is equally likely to be in any box, while the optimal strategy always takes three.
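The comparison can be checked with a small sketch. It assumes the coin is equally likely to be in any of the 8 boxes and that brute force asks about boxes 1 to 7 in order. The function names are hypothetical and chosen only for this illustration.

```python
def brute_force_questions(coin_box: int, n_boxes: int = 8) -> int:
    """Count questions of the form 'Is the coin in box k?', asked in order."""
    for asked, box in enumerate(range(1, n_boxes), start=1):
        if box == coin_box:
            return asked          # a "Yes" answer ends the search
    return n_boxes - 1            # all "No": the coin must be in the last box


def halving_questions(coin_box: int, n_boxes: int = 8) -> int:
    """Count questions of the form 'Is the coin in the right half?'."""
    low, high = 1, n_boxes        # the coin is somewhere in boxes low..high
    asked = 0
    while low < high:
        middle = (low + high) // 2
        asked += 1
        if coin_box > middle:     # "Yes": keep the right half
            low = middle + 1
        else:                     # "No": keep the left half
            high = middle
    return asked


boxes = range(1, 9)
brute = [brute_force_questions(b) for b in boxes]
halving = [halving_questions(b) for b in boxes]
print(brute, sum(brute) / 8)      # [1, 2, 3, 4, 5, 6, 7, 7] 4.375
print(halving, sum(halving) / 8)  # [3, 3, 3, 3, 3, 3, 3, 3] 3.0
```

Brute force is sometimes lucky, but on average it needs more questions than the halving strategy, which always needs three.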
Many investigations have been carried out on the way children play the popular “20 questions” game, in which the goal is to find the required information with the minimal number of questions. The results show that first-graders (aged about 6) almost always chose the “brute force strategy.” Of the third-graders (aged about 8), only about one third used the “brute force strategy” and the rest used a kind of “optimal strategy.” Of the sixth-graders (aged about 11), almost all used a kind of “optimal strategy.” Clearly, the young children leap hurriedly to ask specific questions, hoping to be instantly successful. We see that at quite an early age children intuitively feel that certain strategies of asking questions are more efficient than others, that is, efficient in the sense of asking the minimum number of questions to obtain the required information. The older children invest in thinking and planning before asking[2].
Missing information
"Information is not knowledge. Let's not confuse the two.” ~ W. Edwards Deming
Now we are ready for definitions from information theory.
In using the term information in information theory, one must keep in mind that it is not the value of the information itself, nor the amount of information carried by a message, that we are concerned with, but the length of the message that carries the information.
“Missing information” is defined within the information theory introduced by Claude Shannon[1].
We can express the amount of missing information H in terms of the distribution of probabilities. In this example, the distribution is {1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8} because we have 8 boxes of equal size.
For calculating H we use Shannon's formula[1]:
H = -\sum_{i=1}^{n} p_{i} \log_{2}(p_{i})
The amount of missing information H in a given distribution of probabilities (p_{1}, p_{2}, ..., p_{n}) is an objective quantity. H is independent of the strategy of asking questions that we choose to acquire it. In other words, the same amount of information is there to be obtained regardless of the way or the number of questions we use to acquire it.
H depends only on the number of boxes and their sizes. The larger the number of boxes, the larger the average number of questions we need to ask in order to find the coin. Depending on the strategy, we may acquire the same amount of information with fewer or more questions than that average. The optimal strategy guarantees that we will get all the missing information with the average number of questions.
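As a minimal sketch, Shannon's measure H = -Σ p_i log2(p_i) can be computed directly from a list of probabilities. The function name `missing_information` is an assumption made for this illustration, not something defined in the cited sources.

```python
from math import log2


def missing_information(probabilities):
    """Shannon's H in bits: -sum(p * log2(p)), skipping zero-probability outcomes."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * log2(p) for p in probabilities if p > 0)


eight_boxes = [1 / 8] * 8
print(missing_information(eight_boxes))  # 3.0
```

The result depends only on the distribution passed in. Nothing about the questioning strategy enters the calculation.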
We gain missing information by asking binary questions. If an answer to a binary question allows us to exclude some of the possible locations of the coin then we say that we acquired some “missing information”. If the question allows us to exclude 50% of the possible locations we say that we acquired 1 bit of information.
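To make the bookkeeping concrete, here is a small sketch of how many bits a single answer provides. It assumes all remaining locations are equally likely, so an answer that narrows N candidates down to M provides log2(N/M) bits. The helper `bits_gained` is a hypothetical name used only here.

```python
from math import log2


def bits_gained(locations_before: int, locations_after: int) -> float:
    """Bits provided by one answer that narrows equally likely locations."""
    return log2(locations_before / locations_after)


# Brute force with the coin in box 8: seven "No" answers leave 7, 6, ..., 1 candidates.
brute_force = [bits_gained(k, k - 1) for k in range(8, 1, -1)]
# Halving: three answers leave 4, 2 and 1 candidates.
halving = [bits_gained(8, 4), bits_gained(4, 2), bits_gained(2, 1)]

print([round(b, 3) for b in brute_force])
print(sum(brute_force), sum(halving))
```

Each brute force answer provides less than one bit until the last one, while each halving answer provides exactly one bit. Both sequences add up to the same 3 bits, up to floating point rounding.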
The anthropomorphic nature of missing information
"Even at the purely phenomenological level, entropy is an anthropomorphic concept. For it is a property, not of the physical system, but of the particular experiments you or I choose to perform on it." ~ Edwin Jaynes [5]
While Jaynes' statement originally pertained to thermodynamic entropy, it also applies to missing information, as the two are closely related[5].
Both experimental and theoretical definitions lead to the conclusion that missing information is an anthropomorphic concept, not in the sense that it is somehow nonphysical, but because it is determined by the particular set of variables, parameters, and degrees of freedom defining an experiment or observation. We can always introduce as many new degrees of freedom as we like[5].
Once we recognize the anthropomorphic nature of missing information, it is easier to understand the next logical observation: that for each physical system, there corresponds an infinite number of information systems, defined once again by the variables we choose to use and monitor.
Thus, missing information is technically infinite, but made finite only by the finiteness of our measurement device (experiment, observation, interview) used to query the system.
Knowledge
"All knowledge is in response to a question. If there were no question, there would be no scientific knowledge." ~ Gaston Bachelard, The Formation of the Scientific Mind
We define information as "answers without the questions". The answers might be stored in a book, a video, DNA, source code, etc.
We define knowledge as the combination of questions and their answers. In this view, knowledge represents the context and the meaning behind the answers, whereas information is the raw data or facts themselves. Information is a necessary ingredient for knowledge, but knowledge is more than just the sum of the information that it contains. This distinction helps us understand that knowledge is always embedded in some context or questions, and that this is what gives it value, while information alone is valueless because it has no meaning, context or questions to relate to. It is also worth noting that while knowledge represents the understanding of certain facts, it can also be used to generate new questions and thus new information, which can lead to further growth of knowledge.
We define "missing information" as "questions without answers," in contrast to the definition of "knowledge" as "questions and their answers." This distinction highlights the idea that knowledge is complete, having both questions and answers, while missing information refers to the gaps in knowledge, the things that are unknown or not yet understood. It also corresponds to the idea that knowledge is an accumulation of information: every new question that is answered contributes to this accumulation, and any question that hasn't been answered yet represents missing information.
We define "ignorance" as "no questions and no answers," the opposite of "knowledge" as "questions and their answers." In this context, "ignorance" refers to a lack of curiosity or interest in a particular subject or topic, and a lack of awareness or understanding of it. It is the state of being uninformed and unaware, having no questions to ask and no answers to seek. While "knowledge" refers to an accumulation of information that results from asking questions and seeking answers, ignorance refers to a lack of motivation or capability to do so. However, "ignorance" shouldn't be viewed as a negative thing. It is just a state of not having knowledge about a particular subject or thing. One can have knowledge about a variety of things and still be ignorant about others.
Name  Questions  Answers
Ignorance  No  No
Information  No  Yes
Missing information  Yes  No
Knowledge  Yes  Yes
In order to understand what knowledge is, let's return to the game of hiding a gold coin. If I know that the coin is in the right half of the eight boxes while you do not know where it is, I have less missing information on the location of the coin than you have. If I know the exact box where the coin is while you do not, I have zero missing information on the location of the coin, while you still have all of it. Notice the word "know". This knowledge is clearly subjective: you're missing more information than I am.
True and observed missing information
In information theory, when you acquire new information about a probability distribution, it reduces the uncertainty of the distribution for you, but it does not change the actual distribution itself.
For example, the probability distribution of the game of hiding a gold coin has 3 bits of missing information. If you acquire 1 bit of that missing information, then for you the distribution now has 2 bits of missing information, but for someone else who is not aware of the information you have acquired, the distribution still has 3 bits of missing information. We refer to this difference in the perception of the missing information as the "observed missing information" versus the "true missing information".
Observed missing information is a measure of the missing information that is perceived by an observer, given the information that the observer has acquired. The conditional missing information H(X|Y) can be seen as a measure of the "observed entropy" of X, given the value of Y, as it represents the remaining missing information in X after taking into account the information that is known from Y. It gives us an idea of how much the knowledge of Y helps to understand X. So, if an observer has acquired some information about a distribution and the observed entropy is 2 bits, it means that the observer is aware of some information about the distribution, but there are still 2 bits of missing information that the observer is not aware of.
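The 3-bit and 2-bit figures from the coin game can be reproduced with a short sketch. It assumes X is the box holding the coin (equally likely among 8) and Y is the answer to "Is the coin in the right half?". The function names and the dictionary layout of the joint distribution are choices made for this illustration.

```python
from math import log2


def entropy(probabilities):
    """Missing information in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def conditional_entropy(joint):
    """H(X|Y) in bits from a joint distribution given as {(x, y): p(x, y)}."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, 0.0) + p          # marginal probability of each answer
    total = 0.0
    for y_value, p_of_y in p_y.items():
        conditional = [p / p_of_y for (_, y), p in joint.items() if y == y_value]
        total += p_of_y * entropy(conditional)  # weight H(X | Y = y) by p(y)
    return total


# X: box holding the coin (1..8). Y: True if the coin is in boxes 5..8.
joint = {(box, box >= 5): 1 / 8 for box in range(1, 9)}
true_h = entropy([1 / 8] * 8)
observed_h = conditional_entropy(joint)
print(true_h, observed_h)  # 3.0 2.0
```

Whoever has heard the answer Y is left with 2 bits of missing information, while someone who has not heard it still faces the full 3 bits.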
This information that the observer is aware of can be referred to as "prior knowledge", as it represents the understanding and awareness that the observer has about the distribution before engaging in a task to acquire all of the missing information.
The "true missing information" of a probability distribution is the amount of missing information it has when all possible information is considered. True missing information is not subjective. It is an objective measure of the missing information for a certain distribution or problem, but it is a theoretical concept that cannot be measured in practice.
True missing information represents the total unconditional missing information H(X) in a random variable X, regardless of any other information that might be available. It's a measure of the best possible understanding of the information in X.
The "observed missing information" is the amount of missing information that remains for an observer who has acquired new information. For another person who is not aware of the information you have acquired, the observed missing information is equal to the true missing information. For you, the observed missing information is 2 bits, because that is the amount of missing information you perceive after acquiring the new information, while the true missing information, that of the distribution from a general point of view, remains 3 bits.
"Observed missing information" is subjective because it depends on the observer's prior knowledge, understanding and perspective. Thus, the amount of missing information that is perceived by the individual will be different from that of another individual who has a different set of information. It can also change with the acquisition of new information.
However, that kind of subjective missing information is not the subject matter of information theory. To define the true missing information, we have to formulate the problem as follows: “Given that a coin is hidden in one of the M boxes, how many questions does a person need to ask to be able to find out where the coin is?”
In this formulation, the true missing information is built into the problem. It is as objective as the given number of boxes, and it is indifferent to the person who hid the coin in the box.
It is not entirely correct to say that "prior knowledge" is the difference between "true missing information" and "observed missing information". That difference, H(X) - H(X|Y), is the mutual information between X and the information Y that the observer already holds. It quantifies how much information the two have in common, but it doesn't necessarily reflect the level of understanding or the level of awareness of an observer.
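For the coin game this difference can be written out as a short worked equation. It assumes X is the location of the coin among 8 equally sized boxes and Y is the answer to the single question "Is the coin in the right half?". The symbol I(X;Y) for mutual information is standard notation, but it is not used elsewhere in this text.

```latex
I(X;Y) = H(X) - H(X \mid Y) = 3 - 2 = 1 \text{ bit}
```

One answered halving question therefore accounts for exactly one bit shared between the answer and the coin's location.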
Knowledge is subjective and conditional
"There is no neutral observation. The world doesn't tell us what is relevant. Instead, it responds to questions. When looking and observing, we are usually directed toward something, toward answering specific questions or satisfying some curiosities or problems." ~ Teppo Felin
Individuals receive information about something by asking binary questions. This way they may or may not reduce their observed missing information. At least they have a chance to reduce it. If the observed missing information of a distribution is 0 bits, it means that there is no remaining missing information in X, given the information they have. It means that the observer has a complete understanding of the distribution.
It's important to keep in mind that the observed missing information is subjective. The same distribution could have different observed missing information for different observers, depending on their prior knowledge, understanding and perspective.
Additionally, true missing information is a theoretical concept that represents the best possible understanding of a certain distribution, and it's usually impossible to achieve it in practice. It could be that even if your "observed missing information" is 0 bits, there might still be some missing information that you are not aware of, so even if you have complete knowledge from your perspective, it might not be the case from a theoretical perspective.
So, if your observed missing information of a distribution is 0 bits, it means that you are not aware of any missing information in the distribution, and that you have complete knowledge of the distribution given certain assumptions. It doesn't mean that you have a complete understanding of the distribution regardless of any other information. You might know everything about the distribution given certain assumptions, but other information that is not taken into account could still be missing.
The assumptions among other things include the specific questions you've had. The questions you choose to ask shape the way you think about a problem or a distribution, and they can influence the kind of information you gather and the way you interpret it. The questions you ask can reveal patterns, relationships, and connections that you would not have noticed otherwise, and they can help you understand and make sense of the information you have.
If you add new specific questions the observed missing information may change.
The various interpretations of the quantity H
The value H has the same mathematical form as the entropy in statistical mechanics. Shannon referred to the quantity H he was seeking to define as “choice,” “uncertainty,” “information” and “entropy.”[1] The first three terms have an intuitive meaning.
The term "choice" is commonly understood to refer to the number of alternative options available to a person. In the context of the game, we have to choose between n boxes to find the gold coin. When n = 1 we only have one box to choose from and therefore have zero choice. As n increases, the amount of choice we have also increases. However, the "choice" interpretation of H becomes less straightforward when considering unequal probabilities. For example, if the probabilities of eight boxes are 9/10, 1/10, 0, ..., 0, it is clear that we have less choice than in the case of a uniform distribution. However, in the general case of unequal probabilities, the “choice” interpretation of H is not satisfactory. Therefore, we will not use the “choice” interpretation of H[3].
The meaning of H as the amount of uncertainty is derived from the meaning of probability. We can say that H measures the average uncertainty that is removed once we know the outcome of a random variable.
The “missing information” interpretation of H is intuitively appealing since it is equal to the average number of questions we need to ask in order to acquire the missing information[3]. If we are asked to find out where the coin is hidden, it is clear that we lack information on “where the coin is hidden.” It is also clear that if n = 1, i.e. we know where the coin is, we need no information. As n increases, so does the amount of the information we lack, or the missing information. This interpretation can be easily extended to the case of unequal probabilities. Clearly, any nonuniformity in the distribution only increases our information, or decreases the missing information. All the properties of H listed by Shannon are consistent with the interpretation of H as the amount of missing information. For this reason, we will use the interpretation of H as the amount of missing information.
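The effect of nonuniformity can be illustrated with the eight-box distribution 9/10, 1/10, 0, ..., 0 mentioned above. This is a minimal sketch using Shannon's formula. The function name is an assumption made for this example.

```python
from math import log2


def missing_information(probabilities):
    """Shannon's H in bits; zero-probability boxes contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


uniform = [1 / 8] * 8
skewed = [9 / 10, 1 / 10, 0, 0, 0, 0, 0, 0]

print(missing_information(uniform))           # 3.0
print(round(missing_information(skewed), 3))  # 0.469
```

With the same eight boxes, the skewed distribution leaves less than half a bit of missing information, compared with 3 bits in the uniform case.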
The basic idea of information as number of questions
The relation between the missing information and the number of questions is the most meaningful interpretation of the quantity H. By asking questions, we acquire information. The larger the missing information, the larger the average number of questions to be asked. There is a mathematical proof that the minimum average number of binary questions required to determine an outcome of a random variable X is at least H(X) and less than H(X) + 1[4].
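The bound can be checked numerically. The sketch below uses Huffman's algorithm, which is not mentioned in this text but is a standard way to build an optimal tree of yes/no questions. The average depth of that tree is the minimum average number of questions, and it equals the sum of the probabilities created by each merge. The function names are hypothetical.

```python
import heapq
from math import log2


def entropy(probabilities):
    """Missing information H in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def min_average_questions(probabilities):
    """Average depth of a Huffman tree built over the given probabilities."""
    heap = [p for p in probabilities if p > 0]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        # Merge the two least likely groups; every outcome below the merged
        # node needs one more question, which adds `merged` to the average.
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged
        heapq.heappush(heap, merged)
    return total


for distribution in ([1 / 8] * 8, [1 / 4, 1 / 4, 1 / 2], [0.9, 0.1]):
    h = entropy(distribution)
    questions = min_average_questions(distribution)
    print(round(h, 3), round(questions, 3), h - 1e-12 <= questions < h + 1)
```

For the eight equal boxes the two numbers coincide at 3. For the 0.9 and 0.1 case one question is still needed even though H is below one bit, which is why the bound has the form H(X) + 1 on the upper side.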
The amount of missing information in a given distribution (p_{1}, . . . , p_{n}) is independent of the strategy, or of the number of questions we ask. In other words, at the end of questioning the same amount of missing information is obtained regardless of the way or the number of questions one asks to acquire it.
The same missing information H can come from different search spaces, each with a different number of states N. On the other hand, totally different missing information H might come from search spaces that each have the same number of states N. It all depends on the probability of occurrence of each of the N states in a search space. In general, the inequality below holds for any random variable with N outcomes:
H \leq \log_{2}(N)
The equality holds when each state has the same probability of occurring. For example, consider the 26 letters of the English alphabet (27 symbols if we include the space between words). The frequency of occurrence of each of these 27 symbols is different and known. The total missing information for this distribution is 4.138 bits. This means that with the “best strategy” we need to ask, on average, about 4.138 questions. However, for the English language, log_{2}(N) = log_{2}(27) = 4.755. In this case the equality doesn't hold because the letters occur with different probabilities.
In general, when we have n equally sized boxes, the average number of questions needed to find the coin location is:
H = -\sum_{i=1}^{n} \frac{1}{n} \log_{2}\frac{1}{n} = \log_{2}(n)
In our case we have 8 equally sized boxes, so n = 8 and H is:
H = \log_{2}(8) = 3
This is indeed what we achieved using the optimal strategy.
In the case of 16 boxes H is:
H = \log_{2}(16) = 4
And for 32 boxes H is:
H = \log_{2}(32) = 5
So far we have looked at the case of an even number of equally sized boxes. For a given number of boxes, equal sizes give an upper limit on the missing information, and hence on the average number of questions. When the boxes are not equally sized, the average number of questions necessary to obtain the missing information can only be smaller. For example, below we have a coin hidden in one of three unequally sized boxes, where box #3 is twice as big as each of boxes #1 and #2.
Box  1  2  3
Probability  1/4  1/4  1/2
The probabilities of the boxes are 1/4, 1/4, 1/2. In this case H is:
H = -\left(\frac{1}{4}\log_{2}\frac{1}{4} + \frac{1}{4}\log_{2}\frac{1}{4} + \frac{1}{2}\log_{2}\frac{1}{2}\right) = \frac{1}{2} + \frac{1}{2} + \frac{1}{2} = 1.5
The value of H is 1.5 because each question can exclude 50% of the possible locations. If we first ask whether the coin is in box #3 and it is there, which happens half of the time, we find the coin with only one question. If the coin is in box #1 or box #2, we have to ask two questions. That's why the average number of questions is (1/2)(1) + (1/2)(2) = 1.5.
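The three-box case can be verified with a few lines. The sketch assumes one particular question order: first "Is the coin in box #3?", then, if needed, "Is the coin in box #1?". The variable names are chosen only for this illustration.

```python
from math import log2

probabilities = {1: 1 / 4, 2: 1 / 4, 3: 1 / 2}
# Questions needed for each box under the assumed question order.
questions_needed = {1: 2, 2: 2, 3: 1}

average_questions = sum(probabilities[box] * questions_needed[box] for box in probabilities)
h = -sum(p * log2(p) for p in probabilities.values())
print(average_questions, h)  # 1.5 1.5
```

The average number of questions is weighted by how likely each box is, which is why it matches H here. Each question splits the remaining probability exactly in half.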
What is one bit of information?
In the case of 2 equally sized boxes we need to ask only one question. The value of the missing information H in this case is one, as calculated below:
H = -\left(\frac{1}{2}\log_{2}\frac{1}{2} + \frac{1}{2}\log_{2}\frac{1}{2}\right) = 1
With p equal to 1/2, we acquire exactly 1 bit of information.
This bit can be represented by one binary digit. Hence, many people confuse one bit of information with a binary digit. However bits are not binary digits. The word bit is derived from binary digit, but a bit and a binary digit are fundamentally different types of quantities. A binary digit is the value of a binary variable, whereas a bit is an amount of information. To mistake a binary digit for a bit is a category error analogous to mistaking a one litre bottle for a litre of water. Just as a one litre bottle can contain between zero and one litre of water, so a binary digit (when averaged over both of its possible states) can convey between zero and one bit of information.
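The bottle analogy can be made concrete with a sketch of how much information one binary digit conveys on average, as a function of the probability that the digit is 1. The function name `bits_per_binary_digit` is an assumption made for this example.

```python
from math import log2


def bits_per_binary_digit(p_one: float) -> float:
    """Average information, in bits, conveyed by a binary digit that is 1 with probability p_one."""
    if p_one in (0.0, 1.0):
        return 0.0                # the digit's value is certain, so it conveys nothing
    p_zero = 1.0 - p_one
    return -(p_one * log2(p_one) + p_zero * log2(p_zero))


for p in (0.5, 0.9, 1.0):
    print(p, round(bits_per_binary_digit(p), 3))
# 0.5 1.0
# 0.9 0.469
# 1.0 0.0
```

Only when both values are equally likely does one binary digit carry a full bit. A digit that is almost always 1 is a mostly empty bottle.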
Uncertainty and precision
Shannon initially used “uncertainty” for his “missing information” measure. Uncertainty is the opposite of precision. If uncertainty is reduced by one bit, precision is doubled, or increased by one bit.
For example, an observer initially believes a distance is between 4 and 6 m. If after an observation the observer believes it to be between 4.5 and 5.5 m, then the observation has halved the range of uncertainty about the length in question. The initial uncertainty was U = 6 - 4 = 2 m, and after the observation it was U = 5.5 - 4.5 = 1 m. Since half of the possible values were removed, the observation has provided one bit of information about the thing measured.
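The same calculation can be sketched for any narrowing of a range. It assumes every value within the range is considered equally likely before and after the observation. The function name is hypothetical.

```python
from math import log2


def bits_from_narrowing(range_before: float, range_after: float) -> float:
    """Bits gained when an equally likely range of values shrinks."""
    return log2(range_before / range_after)


print(bits_from_narrowing(6 - 4, 5.5 - 4.5))  # 1.0
print(bits_from_narrowing(2, 0.25))           # 3.0
```

Halving the range gives one bit, and narrowing it to one eighth of its width gives three bits, just like three halving questions in the coin game.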
Works Cited
1. Shannon, C. E. (1948), A Mathematical Theory of Communication, Bell System Technical Journal, 27(3), 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x
2. Ben-Naim, A. (2012), Discover Entropy and the Second Law of Thermodynamics: A Playful Way of Discovering a Law of Nature, World Scientific, Singapore.
3. Ben-Naim, A. (2008), A Farewell to Entropy: Statistical Thermodynamics Based on Information, World Scientific Publishing, Singapore.
4. Cover, T. M. and Thomas, J. A. (1991), Elements of Information Theory, John Wiley and Sons, New York.
5. Jaynes, E. T. (1965), Gibbs vs. Boltzmann Entropies, American Journal of Physics, 33, 391–398. doi:10.1119/1.1971557