Knowledge Libraries - Business and IT Glossary

AI systems for problem solving, including Cognitive Computing systems, require a base collection of knowledge, or corpus. The corpus is a digital representation of what is known about a particular domain, such as all the works of Shakespeare, or all of the defining characteristics of disorders that are codified in the American Psychiatric Association's Diagnostic and Statistical Manual of Mental Disorders. This knowledge must be represented in a consistent form so that the system can use it to draw inferences and make decisions, and so that the corpus can be updated when appropriate.
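As a rough illustration of what "a consistent form" can mean, the sketch below stores every fact as a (subject, relation, object) triple, supports updates, and draws one simple kind of inference by following "is_a" links. The class name TripleStore, the relation names, and the sample facts are assumptions made for this example; they are not taken from any particular cognitive computing product.

```python
class TripleStore:
    """Toy corpus: every fact is a (subject, relation, object) triple."""

    def __init__(self):
        self._by_subject = {}  # subject -> set of (relation, object) pairs

    def add(self, subject, relation, obj):
        """Update the corpus with one new fact."""
        self._by_subject.setdefault(subject, set()).add((relation, obj))

    def objects(self, subject, relation):
        """Return every object linked to subject by the given relation."""
        pairs = self._by_subject.get(subject, set())
        return {o for (r, o) in pairs if r == relation}

    def is_a(self, subject, category):
        """Infer category membership by following 'is_a' links transitively."""
        seen, frontier = set(), [subject]
        while frontier:
            node = frontier.pop()
            if node == category:
                return True
            if node in seen:
                continue
            seen.add(node)
            frontier.extend(self.objects(node, "is_a"))
        return False


# Hypothetical sample facts, for illustration only.
corpus = TripleStore()
corpus.add("Hamlet", "is_a", "tragedy")
corpus.add("tragedy", "is_a", "play")
corpus.add("Hamlet", "written_by", "Shakespeare")
```

Because every fact has the same shape, the inference routine needs no special cases: here, corpus.is_a("Hamlet", "play") holds even though that fact was never stored directly, and a later call to add extends what can be inferred without changing any code.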

The data required for corpora in some domains, such as medical diagnostics, insurance claim codes, and regulatory filings, are already available in text form from government and professional association sources. Packaging this data for use by AI/cognitive systems, with or without additional metadata, is a natural extension of the conventional content publishing model and is in progress for several domains.

Common knowledge, the data that helps us interpret natural language in context, has utility across industries and is generally more difficult to codify. The Cyc knowledge base, which contains over 630,000 concepts and 38,000 types of relationships, has been in development for decades and is now commercially available. Open source projects like WordNet, which catalogs words, synsets, and senses in English, can give application developers a head start on building robust solutions with natural language capabilities.
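The relationship between words, synsets, and senses can be sketched with a small data model: a synset groups words that share one meaning, and each synset a word belongs to is one sense of that word. The Synset and Lexicon classes, the identifiers, and the glosses below are hypothetical and written for this example; this is not WordNet's data format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """A group of words sharing one meaning, with a short gloss."""
    synset_id: str
    gloss: str
    lemmas: tuple


class Lexicon:
    """Maps each word to the synsets it belongs to (one per sense)."""

    def __init__(self):
        self._senses = {}  # word -> list of Synset

    def add_synset(self, synset):
        for lemma in synset.lemmas:
            self._senses.setdefault(lemma, []).append(synset)

    def senses(self, word):
        """Return the synsets for a word; an unknown word has no senses."""
        return list(self._senses.get(word, []))

    def synonyms(self, word):
        """Return other words that share at least one sense with word."""
        return {l for s in self.senses(word) for l in s.lemmas if l != word}


# Illustrative entries only; ids and glosses are made up for this sketch.
lexicon = Lexicon()
lexicon.add_synset(Synset("sense-1", "sloping land beside a river",
                          ("bank", "riverbank")))
lexicon.add_synset(Synset("sense-2", "an institution that holds deposits",
                          ("bank", "depository")))
```

With these entries, "bank" has two senses while "riverbank" has one, which is the kind of distinction an application needs in order to interpret a word in context.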

Representative Vendors and Projects: CognitiveScale, Cycorp, and WordNet.