The internal workings of a knowledge discovery process
Knowledge discovery process
The process of knowledge discovery starts with identifying the knowledge to be discovered, which can be thought of as the question "what do we need to know?" This represents the gaps in understanding or information that an individual or group believes they have. By recognizing what is unknown, i.e., the questions without answers, the process of knowledge discovery aims to fill in these gaps by finding answers through research, investigation, and other means. To learn more about the Knowledge Discovery Process, please refer to this article.
The tangible output represents the knowledge discovered through the knowledge discovery process, encapsulating the new information that has been acquired. In contrast, the input represents the missing information or knowledge that needs to be discovered.
We do not address the question of the quality of the tangible output, assuming that it meets appropriate standards.
We consider the knowledge discovery part as a black box, and we don't know how it produces tangible output in response to input questions. We do not consider whether the black box is operated by senior or junior developers.
In the remainder of this article, we will explore how the knowledge discovery actually takes place inside the black box.
The process of knowledge discovery is a journey of exploration and investigation, as the individual seeks to fill in the gaps in their understanding and gain the knowledge necessary to complete a task.
The knowledge required to successfully complete a task, referred to as "knowledge to be discovered," is the information about the task that the individual is still missing before engaging in it. The knowledge to be discovered is what we think we don't know, given what we think we do know, also known as "prior knowledge". However, it is not possible to accurately measure either the prior knowledge or the knowledge to be discovered in advance, as demonstrated by the concept of "it from bit". Instead, the amount of knowledge discovered can be quantified after the task has been completed by counting the number of questions asked, providing insight into the subjective knowledge that needed to be discovered.
The distinction between knowledge to be discovered and knowledge discovered is that the former is subjective and conditional, while the latter is objective and real.
The knowledge discovered during a task refers to the information that is gained by an individual through their engagement in that task, and is the difference between their prior knowledge and the knowledge required to successfully complete the task. Prior knowledge represents the understanding and information that an individual already possesses before beginning the task, while the knowledge required to complete the task represents the information that still needs to be learned in order to achieve success. The knowledge discovered is the information that bridges this gap and allows the individual to successfully complete the task.
The outcome of the knowledge discovery process is the knowledge discovered, which is information that has been verified and confirmed through investigation or experience. This knowledge can be considered real as it has been tested and validated. However, the knowledge to be discovered, which represents the gap in understanding or the information that is believed to be missing, may not always be real. The distinction between knowledge to be discovered and knowledge discovered can be seen in games such as "20 questions" and "Surprise 20 questions", where the knowledge to be discovered is an object or concept that is not certain until it is guessed or revealed.
The diagram below illustrates the internal workings of a knowledge discovery process.
The input to the process is the knowledge required to successfully complete the task at hand, which is represented as a cloud with a gold coin inside. The knowledge discovery process aims to uncover the gold coin, which represents the knowledge that is essential to complete the task.
When prior knowledge is applied and used to group the possible states and decide on the number of states in each class, we can say that an understanding of the knowledge to be discovered has been gained. This knowledge to be discovered is represented by missing information, measured in bits. The result is the remaining knowledge to be discovered, which still contains some missing information.
If the remaining missing information is greater than zero, the discovery process continues by applying prior knowledge and gathering new information from the world through various methods such as active perception, monitoring, testing, experimenting, exploring, and inquiry, as explained here.
If the remaining missing information is zero, the discovery process is considered complete, and the required knowledge has been discovered. The knowledge discovered is confirmed, and the process can be stopped.
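The loop described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the possible states are equally likely, every epistemic action is a yes/no question, and the names `discover` and `is_target_in` are hypothetical helpers introduced for this sketch.

```python
import math

def discover(candidates, is_target_in):
    """Narrow `candidates` down to one state; return it and the questions asked."""
    remaining = list(candidates)
    questions = 0
    # For equally likely states, the missing information is log2(number of states).
    while math.log2(len(remaining)) > 0:
        middle = len(remaining) // 2
        first_half = remaining[:middle]
        questions += 1  # one epistemic action: "is it in the first half?"
        if is_target_in(first_half):
            remaining = first_half
        else:
            remaining = remaining[middle:]
    # Remaining missing information is zero: the discovery is complete.
    return remaining[0], questions

hidden = 37  # hypothetical state of the world, unknown to the searcher
found, asked = discover(range(64), lambda subset: hidden in subset)
print(found, asked)
```

The loop stops exactly when the remaining missing information reaches zero, which mirrors the stopping rule in the text.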
If the task is completed successfully, it can be inferred that the knowledge discovered during the process is equal to the knowledge that was deemed necessary to complete the task. In other words, all the information and understanding required to complete the task has been obtained.
Additionally, knowledge is a human construct that does not have a scientific foundation. To measure it, we utilize the language and tools of information theory. This means that we interpret knowledge through the perspective of information theory. In the table below, we have related human terms to concepts from information theory.
|Human perspective||Information Theory lens||Human perspective||Information Theory lens|
Remaining knowledge to be discovered
Knowledge to be discovered to complete the task
The first column of the table lists various subjective elements from both human and information theory perspectives. The second column of the table contains knowledge discovered, which is the information that can be observed and verified through objective reality. We can measure knowledge discovered (or the information gained) by counting the number of questions asked.
Below is an animated example of calculating Knowledge Discovered when we search for a gold coin hidden in 1 of 64 boxes. In all six cases, the required knowledge is 6 bits. This means that, on average, we need to ask six binary questions to find the gold coin. You can see that the knowledge to be discovered reflects the difference between the required knowledge and the prior knowledge.
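The arithmetic behind this example can be written down as a short sketch. It assumes all boxes are equally likely, and the three cases in the loop are hypothetical; they are not the six cases from the animation. The helper names `bits` and `split_knowledge` are introduced here for illustration only.

```python
import math

TOTAL_BOXES = 64  # from the example: one gold coin hidden in 1 of 64 boxes

def bits(n_states):
    """Missing information, in bits, for n equally likely states."""
    return math.log2(n_states)

def split_knowledge(boxes_left):
    """Return (prior knowledge, knowledge to be discovered) in bits.

    `boxes_left` is how many boxes the searcher still considers possible
    after applying prior knowledge (a hypothetical input for this sketch).
    """
    required = bits(TOTAL_BOXES)    # required knowledge: 6 bits
    to_discover = bits(boxes_left)  # what is still missing
    prior = required - to_discover  # what was already known
    return prior, to_discover

for boxes_left in (64, 16, 1):
    prior, to_discover = split_knowledge(boxes_left)
    print(f"{boxes_left:2d} boxes left: prior = {prior} bits, to discover = {to_discover} bits")
```

With no prior knowledge all 6 bits must be discovered; if prior knowledge already rules out all but 16 boxes, only 4 bits remain.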
Acquiring knowledge from the world
"Lack of knowledge…that is the problem. You should not ask questions without knowledge. If you do not know how to ask the right question, you discover nothing." ~ W. Edwards Deming
It has always been the case that the most difficult task is to find the right questions to ask. To ask the right question you need to know half the answer.
Missing information quantifies how much we do not know about something. However, to quantify how much we do not know, we have to have some knowledge about the thing we do not know. The things we could potentially know, or ask questions about, for any object are infinite.
An epistemic action is any action taken to gather information from the world. This could include any act of active perception, monitoring, checking, testing, verifying, experimenting, exploring, enquiring, or looking.
The value of new information has been extensively studied in economics, where it is referred to as the "value of information," or "how much an agent is willing to pay for obtaining that information?" An example of this would be how much you would pay for the information "Each of the houses in this city costs one million dollars."
An epistemic action is the act of seeking information from a source. Sources of information include:
- direct experience (such as perceptual evidence);
- information provided by other people;
- reasoning (about other beliefs);
- categorization (reasoning about classes and similarities).
We can view observing the world as the equivalent of receiving a message. Observation has the same property of changing the observer's probability distribution over the observable states of the world and creating meaning in the observer's mind as receiving a message. The key difference is that the world has no intention of being observed, so the receiver of the message must proactively create it.
For example, before looking out the window, you might say there is a 50% chance it is raining. After looking out the window, you know whether or not it is raining: the probability of rain is now either 100% or 0%. Regardless of the outcome of your observation, you have gained one bit of information, because the observation resolved a choice between two equally likely alternatives. What matters is how the observation changes the observer's understanding of the thing observed.
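The rain example can be checked with Shannon's entropy formula. The sketch below is a minimal illustration; `entropy_bits` is a helper name introduced here, and the 90/10 prior at the end is a hypothetical variation that is not part of the original example.

```python
import math

def entropy_bits(probabilities):
    """Shannon entropy in bits; outcomes with probability 0 contribute nothing."""
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

before = entropy_bits([0.5, 0.5])  # rain / no rain, before looking
after = entropy_bits([1.0, 0.0])   # after looking, the outcome is certain
gained = before - after
print(gained)  # 1.0

# Hypothetical variation: with a strong prior belief there is less to learn.
print(entropy_bits([0.9, 0.1]))
```

The one-bit figure depends on the 50% prior. An observer who already believed rain was very likely would, on average, gain less than one bit from the same look out of the window.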
The game of twenty questions is a useful example of acquiring knowledge.
The game of twenty questions
"Thus, twenty skillful hypotheses will ascertain what two hundred thousand stupid ones might fail to do. The secret of the business lies in the caution which breaks a hypothesis up into its smallest logical components, and only risks one of them at a time. What a world of futile controversy and of confused experimentation might have been saved if this principle had guided investigations into the theory of light!" ~ Charles Sanders Peirce 
The game "20 questions" is an old game that gained popularity in the late 1940s when it was used as the format for a successful weekly radio quiz program. In the traditional game, the inquirer leaves the room while the remaining people agree on an object - a person, place, or thing. The inquirer then comes back and has to guess what the object is by asking successive questions that can be answered with a simple "yes" or "no". If the inquirer cannot guess the object after asking 20 questions, the respondents have stumped the inquirer.
The traditional version of the 20 questions game has a deterministic solution. Using Shannon's formula, we can calculate that with 20 yes/no questions, one should be able to screen 2^20 ≈ 10^6 alternative words, that is, about one million.
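The figure of roughly one million alternatives is easy to verify. The sketch below assumes equally likely alternatives and perfectly halving questions; `questions_needed` is a helper name introduced for this illustration.

```python
import math

QUESTIONS = 20

# Each yes/no question can at best halve the set of alternatives,
# so 20 questions can separate 2**20 equally likely alternatives.
alternatives = 2 ** QUESTIONS
print(alternatives)  # 1048576

def questions_needed(n_alternatives):
    """Smallest number of yes/no questions that can single out one of n alternatives."""
    return math.ceil(math.log2(n_alternatives))

print(questions_needed(1_000_000))  # 20
```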
Let's suppose the object to be guessed is "Abraham Lincoln's stovepipe hat". The initial clue is "sugarloaf with animal associations".
For two teams of competent players presented with the initial clue the game might go as follows:
|First team||Second team|
|1) Are the animal associations human? Yes.||1) Is the object useful? Yes.|
|2) Male or female? Male.||2) Is it an item of dress? Yes.|
|3) Famous or not? Famous.||3) Male or female? Male.|
|4) Connected with the arts? No.||4) Worn below or above the belt? Above.|
|5) Politician? Yes.||5) Worn on the head or not? Head.|
|6) USA or other? USA.||6) Is it a famous hat? Yes.|
|7) This century or not? Not.||7) Winston Churchill's hat? No.|
|8) Twentieth or Nineteenth century? Nineteenth.||8) Abraham Lincoln's hat? Yes.|
|9) Connected with the Civil War? Yes.||
|10) Lincoln? Yes.||
|11) Is the object Lincoln's hat? Yes.||
To consistently excel at 20 Questions or knowledge work, you need a good mix of both:
- Prior knowledge (experience, instincts, and expertise about a specific subject matter)
- Strategy (for learning as much as possible with each question you ask in the game)
The choice of strategy allows us to acquire the same amount of information with more or fewer questions than the average number. The optimal strategy is to ask each question in a way that divides the remaining objects into approximately equal probability halves. For example: "male or female?", "worn below or above the belt?". The optimal strategy ensures that we will get all the missing information in the average number of questions.
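To see how much the strategy matters, the sketch below compares two ways of finding one object among 64 equally likely ones: halving the remaining candidates with each question, and naming candidates one by one. Both functions and the choice of 64 objects are assumptions made for this illustration.

```python
def questions_halving(n, target):
    """Questions asked when each question splits the candidates in half."""
    low, high = 0, n  # candidates are the integers low .. high - 1
    asked = 0
    while high - low > 1:
        middle = (low + high) // 2
        asked += 1  # "is it below `middle`?"
        if target < middle:
            high = middle
        else:
            low = middle
    return asked

def questions_one_by_one(n, target):
    """Questions asked when guessing "is it object k?" for k = 0, 1, 2, ..."""
    asked = 0
    for guess in range(n - 1):  # the last candidate needs no question
        asked += 1
        if guess == target:
            break
    return asked

N = 64
avg_halving = sum(questions_halving(N, t) for t in range(N)) / N
avg_one_by_one = sum(questions_one_by_one(N, t) for t in range(N)) / N
print(avg_halving, avg_one_by_one)
```

Both strategies acquire the same 6 bits, but the one-by-one strategy needs far more questions on average because most of its questions carry much less than one bit each.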
It's worth noting that you can still play the game without the optimal strategy, but your play will be suboptimal. Without subject matter knowledge, however, the game cannot be played at all, for example, if your opponent is thinking about Abraham Lincoln's stovepipe hat but you have never heard of Abraham Lincoln.
Expertise in the relevant subject matter is critical.
This is also true for success in knowledge work.
It from bit
"Not until you start asking a question, do you get something. The situation cannot declare itself until you've asked your question. But the asking of one question prevents and excludes the asking of another." ~ John Archibald Wheeler
So far, we have applied the Shannon formula to cases where the object to be found is known in advance. By asking binary questions of one bit each, we looked for and found "it" - the hidden object. If the object is "it" and the result is "bit," we can say that we've got "bit from it".
However, in the reality of knowledge work, the "it" is not known in advance. If the knowledge worker is a software developer and the "it" to be delivered is a software program, the software developer needs to acquire missing information from various sources. At the minimum, the business should answer questions about the requirements and user manuals should answer questions about the technology to be used. Ideally, each question brings back one bit of information. The software developer needs to maintain a coherent frame of all the answers received.
We see that the "it" emerges from many bits of information. The software itself is constructed from the answers provided by participants at all levels. We can say that we've got "it from bit" - a phrase coined by John Wheeler. "It from bit" symbolizes the idea that every item in the physical world has knowledge as an immaterial source and explanation at its core. Reality arises from the posing of binary "yes"/"no" questions.
John Archibald Wheeler, who popularized the term 'black hole', drew attention to the connections between physics and information theory. He likened the job of a physicist to someone playing the "surprise" version of the game "20 questions".
"Surprise Twenty Questions" game
In the "Surprise Twenty Questions" game, the inquirer leaves the room while the respondents, unbeknownst to the inquirer, do not agree on an object. When the inquirer re-enters the room, they try to guess the object by asking a series of questions that can only be answered with a "yes" or "no". However, the group has decided to play a trick on the guesser - there was no object agreed upon to start with! The first person to be questioned will only think of an object and answer the question after the inquirer asks their question. Each person after that will do the same, making sure their response is consistent with the immediate question and all previous answers. A complex vortex of decision making is set up, a logical but unpredictable chain of ifs and thens. Yet somehow, this steady improvisation leads, though not always, to a final answer that everyone can agree on, despite the odds.
"...they had agreed in advance that this would be one where no word was agreed upon to start with. Every answer, however, would have to be consistent with all the answers that had gone before. So it was really harder for the people playing the game than it was for me. The point was the word "cloud" that had been produced really came more out of the questions that were asked than out of anything that they had agreed upon before the thing started."
"However, the power I had in bringing the particular word "cloud" into being was partial only. A major part of the selection lay in the 'yes' or 'no' replies of the colleagues around the room. ... In the game, no word is a word until that word is promoted to reality by the choice of questions asked and answers given."
At any given moment, there are many possible objects that are compatible with the answers already given. The set of possible objects is in a state of messy coherence. The inquirer can find things, but often finds things they didn't know they needed. It is like a partially constructed spider's web of connections that becomes visible during the questioning. Each successive question selects a subset among the possible objects, but the possible answers to the question are determined by the possible objects that remain.
Changes in the inquirer's cognitive state will alter the respondents, and changes in the respondents will likewise ripple into the inquirer's cognitive state. The inquirer and the respondents are working together as a larger cognitive system because they are able to affect each other and therefore satisfy the 'mutual manipulability criterion', which specifies that two entities that can reciprocally alter each other's state belong to the same system.
To make this work, the respondents have to ensure that their combined answers still define at least one possible real object. Their answers must be coherent with each other, requiring logic, context, vision, and cognition that is common to all respondents. If this requirement is met, then gradually, over a varying number of questions, an object finally emerges. This object is discovered together by all persons present during the questioning process - an object that was not selected ahead of time and could not have been predicted. The result is truly deterministic, but only in hindsight, only in retrospect.
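The consistency requirement can be illustrated with a toy simulation. Everything in it is an assumption made for the sketch: the tiny universe of objects, the three yes/no attributes, and the rule that respondents pick at random among the answers that keep at least one object possible. Real respondents draw on shared context rather than a lookup table.

```python
import random

# Hypothetical universe: each object is described by yes/no attributes.
OBJECTS = {
    "cloud": {"natural": True, "alive": False, "in_sky": True},
    "bird": {"natural": True, "alive": True, "in_sky": True},
    "tree": {"natural": True, "alive": True, "in_sky": False},
    "rock": {"natural": True, "alive": False, "in_sky": False},
    "kite": {"natural": False, "alive": False, "in_sky": True},
    "chair": {"natural": False, "alive": False, "in_sky": False},
}

def play(questions, rng):
    """No object is chosen in advance; it emerges from consistent answers."""
    candidates = set(OBJECTS)
    transcript = []
    for attribute in questions:
        if len(candidates) == 1:
            break
        # Only answers that keep at least one object possible are allowed.
        allowed = [
            answer
            for answer in (True, False)
            if any(OBJECTS[name][attribute] == answer for name in candidates)
        ]
        answer = rng.choice(allowed)
        candidates = {n for n in candidates if OBJECTS[n][attribute] == answer}
        transcript.append((attribute, answer))
    return candidates, transcript

final, transcript = play(["natural", "alive", "in_sky"], random.Random(0))
print(final, transcript)
```

Running the game with different seeds yields different final objects from the same questions, yet every transcript is consistent with the object that emerges, which is the sense in which the result looks deterministic only in hindsight.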
The "Surprise Twenty Questions" game is a beautiful example of how humans discover knowledge.
1. Castelfranchi, C., Lorini, E. (2003). Cognitive Anatomy and Functions of Expectations. IJCAI03 Workshop on Cognitive modeling of agents and multi-agent interaction, Acapulco, Mexico.
2. Box, G. E. P., Hunter, J. S., & Hunter, W. G. (2005). Statistics for Experimenters: Design, Innovation, and Discovery, 2nd Edition (2nd edition). Wiley-Interscience.
3. Peirce, C. S. (1998). The Essential Peirce, Volume 2: Selected Philosophical Writings (P. E. Project, Ed.). Indiana University Press.
4. Pezzulo, G., Lorini, E., & Calvi, G. (2004). How do I Know how much I don't Know? A cognitive approach about Uncertainty and Ignorance. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 26, No. 26).
5. Wheeler, J. A. (1990). Information, physics, quantum: The search for links. In W. H. Zurek (Ed.), Complexity, entropy, and the physics of information (Vol. 8, pp. 3–28). Taylor & Francis.
6. Oral history interview with John Archibald Wheeler, 1967 April 5. by Wheeler, John Archibald, 1911-2008 [Online]. Available: Transcript
7. [Online]. Available: Do Our Questions Create the World?
8. P.C.W. Davies and J.R. Brown, The ghost in the atom, Cambridge University Press, 1986.
9. Garner, W.R. (1962). Uncertainty and Structure as Psychological Concepts, New York, Wiley.
How to cite:
Bakardzhiev D.V. (2022) Knowledge discovery. https://docs.kedehub.io/kede/kede-knowledge-discovery.html