Datalinks, An All-Purpose Model of Digital Information Processing
This is the second draft of this essay.
Modern information systems, such as the World Wide Web, do reasonably well at communicating information between humans, but they are poor at making that information accessible to computers. Without machine comprehensibility, tasks such as searching for and extracting particular information, automatically summarizing it, or using it for automated decision-making become much harder. While many of these problems have been solved to a reasonably acceptable degree through brute force of engineering, especially using modern machine learning techniques for natural language processing, these solutions are narrowly scoped and not very reusable. I believe it should be possible to create a universal model for representing most factual human knowledge, largely by applying technologies that have already been developed in relatively obvious ways, combined with a model of computation that treats code as a kind of knowledge.
In this essay, I will present "Datalinks" (a reference to the future-Internet/rebranded Civilopedia in the excellent Sid Meier's Alpha Centauri, and a literal name based on its fundamental structure), a rough theoretical model for an all-purpose system for representing and processing arbitrary information. It is based on a graph model, with inspiration taken from logic programming (such as Prolog), knowledge bases like Wikidata, and other data models like RDF. Much of this emerged from a conversation with Nate Cull, who has written their own summary of their takeaways from that conversation. Their takeaways are different, focusing much more heavily on these concepts as the basis for a programming environment, including a lot about syntactic considerations, which Datalinks doesn't address. Some concepts were also contributed by Sophie, who convinced me to use at least some machine learning and reminded me of the importance of sourcing.
Knowledge and questions
The fundamentals of Datalinks are a very simple data model, consisting of a graph built out of subject-predicate-object triples. Each triple, known as a fact, is an edge in the graph, and consists of references to three entities, which are nodes. The predicate entity specifies the nature of a directional relationship between the subject entity and the object entity. Predicates are self-describing, having associated facts that explain their meaning, usage, related predicates (such as the reverse), etc. Facts are themselves entities, and can have facts about themselves to an unlimited degree of recursion. For example, the fact that an entity has a particular name would itself be described by a fact specifying what language that name is in. The language, then, would have more facts about it, identifying the language, explaining its grammar, etc. Most parts of this are very similar to Wikidata and RDF, with some expansion to increase self-definition, and some inspiration taken from classical, "Lisp era" AI systems.
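To make this concrete, here is a minimal sketch of the data model in Python. It is purely illustrative: the class names, entity names, and representation choices are invented for this example, and a real implementation would look quite different.

    from dataclasses import dataclass, field
    from itertools import count

    _ids = count()

    @dataclass(frozen=True)
    class Entity:
        # A node in the graph; predicates and facts are entities too.
        id: int = field(default_factory=lambda: next(_ids))

    @dataclass(frozen=True)
    class Fact(Entity):
        # An edge: a directional subject-predicate-object triple. Because a
        # Fact is itself an Entity, further facts can be stated about it.
        subject: Entity = None
        predicate: Entity = None
        object: Entity = None

    # "The planet has a name" ... and a fact about that fact: the name is in English.
    has_name = Entity()      # predicate entity: "has name"
    in_language = Entity()   # predicate entity: "is in language"
    english = Entity()       # entity for the English language
    planet = Entity()
    name = Entity()          # the name itself (literal values come later, under "Practicalities")
    naming = Fact(subject=planet, predicate=has_name, object=name)
    about_the_naming = Fact(subject=naming, predicate=in_language, object=english)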
Datalinks builds on this using "fill-in-the-blanks" logic programming for native data processing. A programmer, or software accessing the system, writes a partially-populated graph, called a question, in which some entities are replaced by placeholders. The system answers the question by searching the knowledge base for subgraphs that match the facts it states, returning the matching graphs with the placeholders filled in.
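Continuing the illustrative sketch above, question answering could be as simple as a backtracking pattern match over the stored facts. This is a toy illustration, not a serious query engine; it returns bindings for the placeholders rather than full graphs.

    class Placeholder:
        # A blank in a question; it can bind to any entity (or literal).
        def __init__(self, label):
            self.label = label

    def match_term(pattern, value, binding):
        # A placeholder matches anything consistent with what it has already
        # bound to; a concrete term only matches itself.
        if isinstance(pattern, Placeholder):
            if pattern in binding and binding[pattern] != value:
                return None
            return {**binding, pattern: value}
        return binding if pattern == value else None

    def answer(question, knowledge):
        # Yield every way of filling the question's blanks from the knowledge
        # base, as dictionaries mapping each placeholder to the value it bound
        # to (a naive backtracking search, fine for a toy example).
        def search(remaining, binding):
            if not remaining:
                yield binding
                return
            q, rest = remaining[0], remaining[1:]
            for fact in knowledge:
                b = match_term(q.subject, fact.subject, binding)
                if b is not None:
                    b = match_term(q.predicate, fact.predicate, b)
                if b is not None:
                    b = match_term(q.object, fact.object, b)
                if b is not None:
                    yield from search(rest, b)
        yield from search(question, {})

    # "What is the planet called?" -- one fact with the object left blank.
    what = Placeholder("what")
    question = [Fact(subject=planet, predicate=has_name, object=what)]
    for binding in answer(question, [naming, about_the_naming]):
        print(binding[what])   # prints the entity bound to the blank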
Predicates and computation
It is important to understand that a predicate is a computable relation (not, strictly speaking, a function, since multiple facts can have the same subject and predicate with different objects). Instead of actually storing all true facts, we can predefine questions that derive the objects from information related to the subject, associated with the predicate by yet more facts, transforming the predicate from a data label into executable code. We can invoke this code by asking questions that involve the associated predicate, exactly as if it were a simple data lookup. Since a fact has only one subject, someone accustomed to abstract models of computation such as the lambda calculus might expect currying to be used, but it is probably simpler to just use the graph structure, building an entity with facts representing each part of the input. This doesn't mean currying wouldn't be possible; any part of a fact can be a placeholder, and predicates are first-class, so higher-order programming is possible by asking a question with a placeholder predicate, which can be derived (or merely identified) from other facts stated about that predicate in the question graph.
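As a toy illustration of a computable predicate, building on the same sketch: the predicate names are invented for this example, and the plain dict stands in for what would really be more facts describing the predicate.

    # Code attached to predicates: instead of storing every true fact, a
    # predicate can carry a derivation that computes its objects from other
    # facts about the subject.
    derivations = {}

    def derive(predicate):
        def register(fn):
            derivations[predicate] = fn
            return fn
        return register

    parent_of = Entity()        # stored predicate: "is a parent of"
    grandparent_of = Entity()   # derived predicate: "is a grandparent of"

    @derive(grandparent_of)
    def derive_grandparents(subject, knowledge):
        # X is a grandparent of Z when X is a parent of some Y who is a parent of Z.
        children = [f.object for f in knowledge
                    if f.subject == subject and f.predicate == parent_of]
        for child in children:
            for f in knowledge:
                if f.subject == child and f.predicate == parent_of:
                    yield Fact(subject=subject, predicate=grandparent_of, object=f.object)

    def objects_of(subject, predicate, knowledge):
        # Stored facts answer the question directly; otherwise the predicate's
        # derivation, if any, is invoked exactly as if the facts were stored.
        stored = [f for f in knowledge
                  if f.subject == subject and f.predicate == predicate]
        if stored:
            return stored
        if predicate in derivations:
            return list(derivations[predicate](subject, knowledge))
        return []

    alice, bob, carol = Entity(), Entity(), Entity()
    family = [Fact(subject=alice, predicate=parent_of, object=bob),
              Fact(subject=bob, predicate=parent_of, object=carol)]
    objects_of(alice, grandparent_of, family)   # one derived fact: alice -> carol

Note that the derivation can yield several facts for the same subject and predicate, matching the point above that a predicate is a relation rather than a function.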
Practicalities
While Datalinks doesn't in theory need more than knowledge graphs, questions, and computable predicates, it would be very difficult to use in practice without certain additional features. The object of a fact (and possibly the subject, though the usefulness of that is questionable) can be a literal value instead of an entity: at minimum a text string, a boolean, an integer or floating-point number, or a media file (image, audio, video, etc.). This allows meaningful data to be stored without complex and computationally- and storage-intensive encoding schemes (nth-character-is capital-A [n-is 0], Church numerals). Additionally, predicate questions can be written in external code, to allow defining primitive predicates and predicates that require access to external functionality.
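In the sketch used above, this amounts to letting a fact's object hold a plain value as well as an entity; the particular Python types chosen here are only illustrative.

    from dataclasses import dataclass
    from typing import Union

    # Literal values allowed directly as fact objects, alongside entities;
    # bytes stands in, very loosely, for media files.
    Literal = Union[str, bool, int, float, bytes]

    @dataclass(frozen=True)
    class Fact(Entity):
        # A revised Fact whose object may be an entity or a literal value.
        subject: Entity = None
        predicate: Entity = None
        object: Union[Entity, Literal] = None

    # A name can now simply be a string, with facts about the naming fact as before.
    naming = Fact(subject=planet, predicate=has_name, object="Alpha Centauri")
    Fact(subject=naming, predicate=in_language, object=english)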
Applications
There are many possible applications of a general-purpose knowledge model, as the many attempts to build one have found. We will consider a few to demonstrate the potential effectiveness of the Datalinks model. These applications would not necessarily be built as separate systems in practice; in fact, they would likely benefit from sharing the same underlying knowledge base and having a high degree of integration with one another.
Encyclopedia
One of the most obvious applications for general-purpose knowledge modeling is an encyclopedia that makes that knowledge human-accessible. By extensively (but not necessarily exhaustively) describing the grammar of one or more natural languages, and assigning names in those languages to every entity in Datalinks, it would be possible to automatically generate (extremely dry and uncreative) prose by walking the graph, building up a textual expression of the facts as they are encountered, based on the patterns defined in the grammar description. By categorizing and prioritizing predicates and facts, passages of generated text could be organized into a sensible order and broken into sections, and possibly into paragraphs, producing encyclopedia-like output. The level of detail of different parts of the output could be dynamically adjusted by changing how far the system will "wander" from the original subject, i.e. how many facts are followed before backing up and walking down a different branch of the graph.
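A very rough sketch of this graph walk, continuing the earlier Python illustration: the name_of lookup stands in for the name facts and grammar description discussed above, and the output really would be extremely dry.

    def describe(subject, knowledge, name_of, depth=2, seen=None):
        # Produce (extremely dry) prose by walking outward from `subject`,
        # following at most `depth` facts before backing up. `name_of` maps an
        # entity or literal to a natural-language name; in Datalinks that
        # mapping, and the sentence patterns, would themselves be facts.
        seen = set() if seen is None else seen
        if depth == 0 or subject in seen:
            return []
        seen.add(subject)
        sentences = []
        for fact in knowledge:
            if fact.subject == subject:
                sentences.append(f"{name_of(fact.subject)} {name_of(fact.predicate)} "
                                 f"{name_of(fact.object)}.")
                # Wander one step further from the original subject.
                sentences.extend(describe(fact.object, knowledge, name_of,
                                          depth - 1, seen))
        return sentences

The depth parameter is the "wander" limit described above: raising it produces a more detailed (and more rambling) article about the same subject.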
Obviously, this system would need a lot of tuning before it would produce good output, and a lot of supporting information would need to be created and maintained to enable it to work at all. Additional mechanisms would likely need to be introduced to control "tangents" into irrelevant topics, to improve its ability to sequence and break up text, and possibly to reduce the "dryness" of the output text to make reading it more bearable. Machine learning systems, similar to but structurally different from powerful generative transformers like GPT-3, would likely be extremely beneficial in producing concise and readable output text. Since these systems would not need to memorize facts themselves, they could be significantly smaller than GPT-3 while achieving similar or superior results.
Inputting information could also be done using natural language processing. If a user enters a single fact at a time, the system, again probably using machine learning tools like those used in modern NLP, would be reasonably able to parse the input into facts, prompting with its own restatements of them to ensure they are interpreted correctly and allow corrections. Synonyms would present the most trouble, but it might be possible to improve the "guesses" by using the currently-displayed article, and other facts entered at the same time, as context and by comparing entered facts to the existing graph to see how well they "fit" with what was known prior to the current entries.
When all of this is combined, an interesting possible editing experience emerges: A user looks something up, receiving an article containing the existing knowledge on the subject, then simply starts typing, entering everything they can think of on that subject. Each sentence is rewritten and then automatically slotted into an appropriate place in the article, possibly merging with other existing statements, with sections and paragraphs "budding off" from the existing ones as they grow from the new entries. As the system makes mistakes, the user selects the proper interpretations from drop-down lookup lists. Everything external to the subject that is mentioned automatically becomes a link to those entities, with no user effort involved.
Search by description
Another potential application is finding something one doesn't know the name or "proper search term" for based on other information one does know. If the user enters enough facts about an entity (perhaps talking about a placeholder like "it", entered into the same text-interpretation subsystem as used for the encyclopedia), these can be built into a question graph which can be searched to find entities with similar subgraphs around them. Even when names don't match up, it might be possible to search for sufficiently complicated questions based solely on the "shape" of the graph, looking for entities that have similar arrangements of related ideas without necessarily needing to identify the entities at all.
As an idealized example:
User:
It's a book.
It's by Terry Pratchett and another author.
The main characters are an angel and a demon.
System:
"Good Omens" by Terry Pratchett and Neil Gaiman
Inference and maintenance
It is likely that humans would have considerable trouble entering facts into Datalinks in a consistent and exhaustive fashion. To reduce the difficulty of achieving a satisfactory level of detail, it would be possible to build systems based on both classical analysis techniques and machine learning to analyze the knowledge base and infer missing facts based on similarities and patterns in the existing graph, then feed these back into the graph to fill in the gaps. Machine learning systems are already capable of drawing impressive inferences from relatively unstructured data; given well-ordered data it would be reasonable to expect considerably improved performance, though inferences drawn by them would likely need to be checked by humans to avoid a cascade of spurious neural network imaginings building up rapidly and polluting the knowledge base.
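As a toy example of the classical side of this (a deliberately crude heuristic, nothing like the machine learning systems discussed above): if almost every entity that has some predicate also has a second predicate, the entities that are missing the second one can be flagged for human review.

    from collections import defaultdict

    def propose_missing(knowledge, threshold=0.8):
        # If predicate `b` appears on at least `threshold` of the entities that
        # have predicate `a`, flag the entities that have `a` but not `b` as
        # probably missing a `b` fact. The proposals are suggestions for a
        # human (or a better inference system) to review, not new facts.
        by_subject = defaultdict(set)
        for f in knowledge:
            by_subject[f.subject].add(f.predicate)
        predicates = {p for preds in by_subject.values() for p in preds}
        proposals = []
        for a in predicates:
            with_a = [s for s, preds in by_subject.items() if a in preds]
            if len(with_a) < 3:        # too few examples to call it a pattern
                continue
            for b in predicates - {a}:
                have_both = [s for s in with_a if b in by_subject[s]]
                if len(have_both) / len(with_a) >= threshold:
                    proposals.extend((s, b) for s in with_a if b not in by_subject[s])
        return proposals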
To enable this type of analysis to work effectively, a system of confidence and source tracking would be required. It should be possible for the system to automatically deduce the reliability of sources by comparing them against one another for consistency, with some manually-entered reliability information to provide a nucleus. Given a model of information confidence and source reliability, the system would then be able to assign confidence levels to inferred facts based on the sources and confidence levels of the information used to make the inferences. These confidence levels could be used to filter inferred facts, fleshing out the knowledge base with interpolated information without polluting it with unreliable guesses. With this in place, it would also be possible to scan through the existing knowledge base to find and excise unreliable information, again probably with human oversight.
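One very simple way confidence might propagate (a sketch of the general idea only; a real model would be far more sophisticated): an inferred fact is treated as no more trustworthy than the least reliable fact or source used to infer it, and a cut-off filters guesses before they re-enter the graph.

    def inferred_confidence(used_fact_confidences, source_reliabilities):
        # Conservative rule: an inference inherits the weakest link among the
        # facts it used and the reliability of the sources they came from
        # (all values between 0 and 1).
        return min([*used_fact_confidences, *source_reliabilities])

    def accept_inferences(proposed, cutoff=0.7):
        # `proposed` maps candidate facts to their computed confidence; only
        # those above the cut-off are allowed back into the knowledge base.
        return {fact: conf for fact, conf in proposed.items() if conf >= cutoff}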
Improvements needed
As this essay currently stands, there are some unsatisfying areas that could do with improvement. In particular:
- The answer mechanism is underdeveloped. As it stands, the answer has to be completely finished, with no provision for partially answering a question, or for speculative answers. Additionally, the result is a set of graphs, which isn't as elegant as it could be.
- There isn't any clear way to modify the knowledge graph while staying within the model.