Cubicon Platform
Semantic Net Architecture
Radically Simplified Network Architecture
Overcoming Semantic Web Difficulties
Conclusion
Appendix A
Radically Simplified Network Architecture
The Cubicon Platform is a radically simplified network architecture that enables true semantic functionality. It can automatically classify billions of HTML Web pages to create an intelligent Semantic Net that can make search much more precise as well as provide vast new capabilities. Cubicon is a new global infrastructure poised to emerge as an overlay of the WWW, consisting of a Topic Map grid that references Web page resources.
Cubicon is a context-based architecture that uses a higher-order set of system abstractions above objects, enabling both humans and machines to engage in semantic discourse more effectively. Such a higher order of abstraction is a prerequisite for building a vibrant next-generation Internet environment. Cubicon effectively captures information semantics (meaning of things) and provides advanced systems technology to enable transformation into knowledge.
Automated classification of HTML pages is not sufficient if the classification is made at the surface level or by metadata tagging systems that fail to capture the various levels of meaning possible from content within documents. With Cubicon, classification occurs through a set of mechanisms designed to lift the semantics from the normal use of text. This systematic approach to capturing and organizing the intrinsic meaning behind words takes several steps beyond current semantic extraction and ontology methods.
Overcoming Semantic Web Difficulties
As Alex Iskold states in his Semantic Web: Difficulties with the Classic Approach article, "The original vision of the semantic web as a layer on top of the current web, annotated in a way that computers can "understand"... This vision is certainly grandiose and intriguing. Yet, for the past decade it has been a kind of academic exercise rather than a practical technology.
The classic semantic web approach in a nutshell: The idea is to represent information using mathematical graphs and logic in a way that can be processed by computers. To express meaning, the classic semantic web approach also advocates the creation of ontologies, which describe hierarchical relationships between things. ... The W3C has mapped out a set of tools and standards that are needed to make it happen, two of which are the XML-based languages RDF and OWL that are designed to be flexible and powerful."
He summarizes four technical challenges:
1. Representational Complexity
2. Natural Language Problem
3. Bottom-Up Assumption
4. Standards Issue
Cubicon effectively addresses these challenges through fundamental computer science innovations that represent a clean slate architecture for a Semantic Net. The following sections clarify Iskold's observations and describe Cubicon's innovative resolutions.
1. Representational Complexity
"The W3C RDF and OWL specifications are complicated. Even for scientists and mathematicians these graph-based languages take time to learn and for less-technical people they are nearly impossible to understand. Because the designers were shooting for flexibility and completeness, the end result is documents that are confusing, verbose and difficult to analyze."
Cubicon is based upon an iconic language that slays complexity through an automated multi-dimensional visualization of systems architecture. It combines text with color graphics, motion and sound to leverage human cognition in a direct icons-to-bits transformation. When capturing knowledge, Cubicon will directly enable domain experts (as opposed to programmers) to express and reason about complex Semantic Nets of hyperdata in a community setting: a flexible, complete, scalable and automated environment. This shift from programmer to domain expert represents a departure from current trends.
The accompanying art shows a screenshot from the CubeStudio advanced IDE. This perspective is the primary interface a domain expert will use to declare topics and their characteristics. Map generation is performed through three mechanisms.
A Topic Map schema automatically clusters around a center study topic. Any topic that has a direct association with the study topic will appear in the top focus group circle (dark blue). Additional degrees of separation appear in subsequent clockwise groups. A map (possibly consisting of thousands of topics) can be edited to display only a subset through user-defined filters. Selection of the study topic will open up a frame that will depict its association, category and occurrence characteristics.
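The clustering by degrees of separation described above can be sketched as a breadth-first traversal of the association graph. This is only an illustrative sketch: the function name `group_by_separation`, the pair-based association format, and the sample topics are assumptions made for the example, not part of CubeStudio.

```python
from collections import deque


def group_by_separation(associations, study_topic):
    """Group topics by their degree of separation from the study topic.

    `associations` is an iterable of (topic, topic) pairs (hypothetical format).
    Returns {degree: [topics]}, where degree 1 holds directly associated topics.
    """
    # Build an undirected adjacency map from the association pairs.
    neighbors = {}
    for a, b in associations:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search records the shortest association path length.
    degree = {study_topic: 0}
    queue = deque([study_topic])
    while queue:
        current = queue.popleft()
        for other in neighbors.get(current, ()):
            if other not in degree:
                degree[other] = degree[current] + 1
                queue.append(other)

    # Collect topics into groups, leaving out the study topic itself.
    groups = {}
    for topic, d in degree.items():
        if d > 0:
            groups.setdefault(d, []).append(topic)
    return {d: sorted(topics) for d, topics in sorted(groups.items())}


associations = [
    ("wine", "grape"),
    ("wine", "vineyard"),
    ("grape", "soil"),
    ("soil", "climate"),
]
groups = group_by_separation(associations, "wine")
```

In this sketch, group 1 would correspond to the top focus group circle, and higher-numbered groups to the subsequent clockwise groups.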
A shared topic always appears in two or more Topic Map contexts controlled by different entities. A foreign topic acts on behalf of a principal topic that is declared in an origin Topic Map controlled by the same entity. An alien topic acts on behalf of a principal topic declared in an origin Topic Map controlled by a different entity. Thus, either a foreign or an alien topic is a proxy that stands in for a principal topic. A proxy topic has no (resource) occurrences itself and is a portal into the origin Topic Map.
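The foreign/alien distinction reduces to comparing who controls the map holding the proxy with who controls the origin map. A minimal sketch follows, assuming a hypothetical `ProxyTopic` record; the field names and entity names are illustrative and do not come from Cubicon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProxyTopic:
    """Hypothetical record for a topic that stands in for a principal topic."""
    name: str
    map_owner: str      # entity controlling the Topic Map that holds the proxy
    origin_owner: str   # entity controlling the origin Topic Map


def proxy_kind(topic):
    """Return 'foreign' when both maps share an owner, otherwise 'alien'."""
    return "foreign" if topic.map_owner == topic.origin_owner else "alien"


same_entity = ProxyTopic("invoice", map_owner="acme", origin_owner="acme")
other_entity = ProxyTopic("invoice", map_owner="acme", origin_owner="globex")
kinds = (proxy_kind(same_entity), proxy_kind(other_entity))
```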
2. Natural Language Problem
“People argue that RDF and OWL are for machines only, so it does not matter that people might find them hard to look at. (Though as a side note, the advantage of XML representation is precisely that people can look at it, mainly for debugging purposes.) But even assuming that RDF and OWL are for machines only, the question arises: how are these documents to be created?”
The issue here is not only one of creation but also one of evolution. Any knowledge representation must constantly morph to reflect changing domain relationships and environmental characteristics.
RDF is a language that represents information about resources on the World Wide Web (WWW). OWL is built on top of RDF and is designed for machine processing of Web information. OWL can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms.
RDF and OWL have two fundamental problems that inhibit them from becoming the infrastructure for a Semantic Net (Web):
Language-based vs. value-based. RDF and OWL represent a triple relationship in a fixed string that does not separate the syntax (term/name representation) from the semantics (meaning) of a concept. The fixed nature of these representations prohibits them from updating to new terms or names. This issue has just recently been recognized by the W3C, and their researchers are currently working on a numerical (value-based) representation to define a language-independent structured knowledge exchange. Such an exchange would enable representational freedom if it were possible through extension of these specifications.
Weak semantic model. RDF and OWL are both commonly serialized in XML, whereas a Web page is represented in HTML. XML makes no clear distinction between structured datum, unstructured information strings, and concepts. HTML makes even less distinction between these abstractions. This semantic ambiguity makes it difficult for users to develop clear and concise domain statements and complicates machine processing. An RDF triple is an assertion that describes a relationship between Web resources. A triple is composed of subject, predicate and object parts, and its resources are identified by Uniform Resource Identifiers (URIs). Therefore, resources must be intertwined with the RDF triple and OWL ontology assertions around them. This complication leads to a complexity explosion when this base technology is scaled to represent global knowledge.
The Cubicon architecture draws a sharp distinction between resources, which are represented in structured datum and unstructured information strings, and concepts, which are represented in Topic Maps. A Topic Map is superimposed on top of the Web and references page occurrences through existing URI and XPointer mechanisms. In addition, a Web page can be translated into a cleartext document (See Effective Linking in this document), making resources more clearly defined.
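To make the triple structure concrete, the sketch below holds one triple as a plain tuple of URI strings and prints it in a simplified N-Triples-style line. The `example.org` addresses are placeholders, the helper `to_ntriples_line` is a name invented for this example, and character escaping is deliberately omitted; the predicate is the Dublin Core `creator` term.

```python
def to_ntriples_line(subject, predicate, obj, obj_is_uri=True):
    """Format one triple as a simplified N-Triples-style line (no escaping)."""
    obj_text = f"<{obj}>" if obj_is_uri else f'"{obj}"'
    return f"<{subject}> <{predicate}> {obj_text} ."


page = "http://example.org/pages/pinot-noir.html"     # placeholder Web page
creator = "http://purl.org/dc/elements/1.1/creator"   # Dublin Core 'creator'
author = "http://example.org/people/alice"            # placeholder resource

triple = (page, creator, author)
line = to_ntriples_line(*triple)
literal_line = to_ntriples_line(page, creator, "Alice", obj_is_uri=False)
```

Note how the term `creator` lives inside the identifier string itself, which is the kind of coupling between name and meaning that the discussion above is concerned with.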
Cubicon provides three mechanisms to generate a target Topic Map: Topic Discovery, Concept Deduction, and Machine Heuristics. These mechanisms rely upon the Context Registry hosted by Cubicon Corporation.
Context Registry
Similar to UDDI - The Context Registry is similar in concept to the UDDI (Universal Description, Discovery and Integration Specification). UDDI purports to create a standard interoperable platform that enables companies and applications to quickly, easily and dynamically find and use Web services over the Internet. While this service directory capability is desirable, it only represents a portion of the required infrastructure for a Semantic Net environment.
Resource Termspace/Namespace Identities - In contrast to the UDDI, the Context Registry maintains root entity identifiers (EIDs) on a global basis. All other context and content identifiers are built from this ownership base. Select entity and community knowledge is automatically propagated back into the Context Registry through CubeNet. CubeNet is a distributed set of repositories and a registry that maintain global context.
Component Ontology - There are 38 core component classifications. Component knowledge propagated from Community Repositories is organized for search based upon a meta ontology architecture and provides the ability to discover context resources.
Discovery and Negotiation - Entities and community context resources can be discovered through an ontology search. These advanced capabilities provide dynamic service negotiation and exchange between producer and consumer entities.
Component Trading - The application of DRM (Digital Rights Management) enables IP trading that emulates a physical supply chain where all source authors are compensated for their embedded components.
Topic Discovery. This mechanism enables the discovery of topics for the sole purpose of linking to other Topic Maps. A topic is based upon a subject declared in a particular source community genealogy. A domain expert can search other entities' Topic Maps by a topic's preferred termspace, or include synonym termspace in multiple natural languages. We introduce 'termspace' as opposed to 'namespace' to denote the distinction between concept terms in theory and component names in practice. An entity controls access to its Topic Maps. A termspace match can be exact or based upon a regular expression for sophisticated pattern matching. Further qualifiers will constrain a search to a particular topic form, term affinity or trade name. Search results will list matched Topic Maps and particular topics. The search mechanism enables a display of the subject's genealogy for comprehension of its template origin and meaning. A domain expert would have the option to declare either a shared or proxy topic within the target Topic Map. A shared topic is defined as a subject with a unique local context, whereas a proxy topic is a subject with an association to a principal topic.
Concept Deduction. A concrete topic is a reified abstract subject. Reification is to regard or treat an abstraction as if it has concrete or material existence. A topic has association, role, category and occurrence characteristics. All these concepts are declared as universal knowledge templates in genealogy schema within a community's repository. A schema represents the taxonomy of an inorganic, organic or abstract system based upon the spectrum of general to specific concepts. A prototype template is a genesis concept, or can be a copy that is linked from another community. This 'ancestor' can be specialized into 'descendant' progeny templates. Two templates can be joined to synthesize an entirely new concept.
The Context Registry will continuously crawl through CubeNet, gathering and correlating concept relationships from and between unlimited numbers of Community Repositories. A target Topic Map continuously sends identification of its underlying concepts to the Context Registry, which infers and correlates patterns from this knowledge stream. The inferred relationships are not logically derived, but are assumed from the probable manner in which other Topic Maps have related the concepts within their domains. This summary appears in the Concept Deduction Frame. Only concepts from communities of which the entity is a member, or that are open source, are returned to its CubeStudio. The domain expert can select a particular concept and either copy or link to its source template. This concept reuse capability builds cohesion into the evolving global knowledge space as more and more communities adopt the same base ideas.
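The exact and regular-expression termspace matching described above might look like the following sketch. The `TopicEntry` record, the `search_termspace` function and the sample topics are hypothetical and only illustrate the matching logic; access control and the further qualifiers (topic form, term affinity, trade name) are left out.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TopicEntry:
    """Hypothetical topic record: one preferred term plus synonyms per language."""
    topic_id: str
    preferred: str
    synonyms: dict = field(default_factory=dict)  # language code -> list of terms


def search_termspace(topics, pattern, exact=True, include_synonyms=False):
    """Return ids of topics whose termspace matches exactly or by regex."""
    if exact:
        matches = lambda term: term == pattern
    else:
        compiled = re.compile(pattern)
        matches = lambda term: compiled.search(term) is not None

    hits = []
    for topic in topics:
        terms = [topic.preferred]
        if include_synonyms:
            for language_terms in topic.synonyms.values():
                terms.extend(language_terms)
        if any(matches(term) for term in terms):
            hits.append(topic.topic_id)
    return hits


topics = [
    TopicEntry("t1", "automobile", {"en": ["car"], "de": ["Auto"]}),
    TopicEntry("t2", "carpet"),
]
exact_hits = search_termspace(topics, "car", exact=True, include_synonyms=True)
regex_hits = search_termspace(topics, r"^car", exact=False, include_synonyms=True)
```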
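One simple way to picture relationships that are assumed from how other Topic Maps relate concepts, rather than logically derived, is to count how many maps associate the same pair of concepts. The sketch below is an assumption-laden illustration: the function `probable_relations`, the `min_support` threshold and the input format are invented for the example, and the source does not say how the Context Registry actually correlates patterns.

```python
from collections import Counter


def probable_relations(topic_maps, min_support=2):
    """Rank concept pairs by the number of Topic Maps that associate them.

    `topic_maps` is a list of maps, each a list of (concept, concept) pairs.
    Pairs are unordered and counted at most once per map.
    """
    counts = Counter()
    for concept_pairs in topic_maps:
        seen_in_map = {tuple(sorted(pair)) for pair in concept_pairs}
        counts.update(seen_in_map)

    # Sort by descending support, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(pair, n) for pair, n in ranked if n >= min_support]


maps = [
    [("wine", "grape"), ("wine", "region")],
    [("grape", "wine"), ("grape", "soil")],
    [("wine", "grape"), ("soil", "grape")],
]
suggestions = probable_relations(maps)
```

Pairs that appear in only one map fall below the threshold and are not suggested, which mirrors the idea of inferring from the probable rather than the provable.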
Machine Heuristics. We expect that as the Cubicon ecology evolves, domain experts will develop heuristic behavior that will use both topic discovery and concept deduction mechanisms to self-evolve a Topic Map. These mechanisms will be made available through special operation icons that can be declared in control-flow methods. This method behavior will be embedded into a Topic Map and enable it to declare topic and characteristic concepts based upon reflective trial-and-error methods.
3. Bottom-Up Assumption
“Because there are vast amounts of existing information that need to be transformed, the classic semantic web approach is a bottom-up approach. Annotating information on the web-scale is a daunting task. If it is to be done by a centralized entity, then there will need to be Google-like semantic web crawler that takes pages and transforms them into RDF.”
Cubicon Corporation believes a pragmatic top-down approach is needed to leverage existing information found in billions of HTML Web pages and other unstructured as well as structured information resources. The central challenge will be how to make sense of these information resources relevant to the context of a Topic Map and link them as occurrences. This task will require both human- and machine-driven approaches to scale on an economic basis. This paper only covers unstructured linking mechanisms.
There are two fundamental issues with annotating existing unstructured information resources:
Effective Linking. An RDF triple mechanism references a Web page through a URL and can also resolve down to a particular element fragment within a Web page, such as a text section, an image or even a text passage, through the extended syntax of XPointer. Few developers use this resolution mechanism due to its complexity. Classifying Web pages requires an advanced level of resolution for effective semantic linking when the actual resource is embedded within a document. A similar situation can be found within billions of other unstructured document types, including MS Word files.
Overlapping Hierarchy. All of these file types face the overlapping hierarchy problem (See Appendix A in this document for details) that haunts markup languages. The markup approach treats every document as having a single natural representation as a logical hierarchy of nested text passages. But, in the real world, document representation is not that simple. This art titled Overlapping Problem of Markup Languages provides an example of this complexity. The two <paragraph> markups are disrupted by the <analytic> markup that brackets a text passage encompassing both paragraphs. This encompassing annotation is not possible using XPointer since this encoding mechanism has no straightforward way to deal with this complexity.
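The conflict can be stated precisely in terms of character ranges: two markup ranges fit in a single hierarchy only if they are disjoint or one contains the other. The sketch below checks for the remaining case, where ranges cross. The offsets are made up for illustration and use half-open `(start, end)` pairs.

```python
def crosses(a, b):
    """Return True when two half-open ranges overlap but neither contains the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    overlap = a_start < b_end and b_start < a_end
    a_in_b = b_start <= a_start and a_end <= b_end
    b_in_a = a_start <= b_start and b_end <= a_end
    return overlap and not (a_in_b or b_in_a)


# Illustrative offsets: two adjacent paragraphs and one analytic passage
# that starts in the first paragraph and ends in the second.
paragraph_1 = (0, 40)
paragraph_2 = (40, 80)
analytic = (25, 55)

conflicts = [
    crosses(paragraph_1, analytic),
    crosses(paragraph_2, analytic),
    crosses(paragraph_1, paragraph_2),
]
```

Any pair for which `crosses` is true cannot be expressed as properly nested elements in one tree, which is the situation the analytic passage creates.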
Cubicon meets complex encoding requirements and overcomes the overlapping hierarchy problem by separating script from markup information within an article. A document page can contain more than one article. An article is imaged in a region and is limited to script that appears as cleartext symbols along with only paragraph (_) and line feed (•) markup codes. A line feed is considered embedded within a paragraph.
A cleartext type is a shared community resource. It is declared with a number of 'markups' represented as facets that define syntactical and semantic dimensions about the script that meet specialized encoding requirements. A cleartext block is declared within a composite that references a common type. This common reference model enables a type to be shared by multiple cleartext articles located in different contexts.
Within an article, multiple block collections of markups (facet records) maintain associative pointers into the script either on a range or POSITION basis. Facets are classified into four forms: SYMBOL, FORMAT, EMBED and ANALYTIC. The development of only these forms came about through deep analysis of the manner in which markup can be generically classified. Markup overlays the script with no overlapping conflicts between facet record collections.
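The idea of facet records that point into the script, rather than being embedded in it, can be sketched as follows. The four form names come from the text above; the `Facet` record, its fields, the labels and the sample script are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class FacetForm(Enum):
    SYMBOL = "symbol"
    FORMAT = "format"
    EMBED = "embed"
    ANALYTIC = "analytic"


@dataclass(frozen=True)
class Facet:
    """Hypothetical facet record pointing into the script by character range."""
    form: FacetForm
    label: str
    start: int
    end: int  # exclusive


def text_of(script, facet):
    """Return the passage of script that a facet refers to."""
    return script[facet.start:facet.end]


def labels_at(facets, position):
    """Return labels of every facet whose range covers the given position."""
    return [f.label for f in facets if f.start <= position < f.end]


# The script stays untouched; all markup lives in separate records.
script = "The quick brown fox. It jumps over the dog."
facets = [
    Facet(FacetForm.FORMAT, "bold", 4, 9),
    Facet(FacetForm.FORMAT, "italic", 10, 19),
    Facet(FacetForm.ANALYTIC, "subject-and-action", 16, 29),
]
passages = [text_of(script, f) for f in facets]
covering = labels_at(facets, 17)
```

Because the italic and analytic ranges cross without either containing the other, they could not both be nested elements in one tree, yet as separate records they coexist without conflict.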
The 'facet' term replaces the traditional SGML/XML markup 'element' term and reflects the true nature of the encoding hierarchy. An element is embedded into a text and considered a component part of a document. On the other hand, a facet is an aspect of the cleartext that characterizes the encoded meaning as a superimposed layer of markup records.
A cleartext article would mirror an HTML Web page and be stored as an extension to a Topic Map. Each annotation is represented by a particular (resource) occurrence as a characteristic of a topic.
A cleartext page provides a semantic-based, behavior-driven document model that can be shared seamlessly between dynamic and static states across the Internet. It is an unstructured document model that will provide interoperability between PDF, MS Word and HTML content representations. Translation requires a parser for each of these models. Cubicon will be used to create these translator components. Alternatively, we expect that an open source project may organize to develop bridges using traditional programming languages. An example of such a project is the SourceForge.net project OpenXML/ODF Translator Add-in for Office, which supports a bridge between Microsoft and open document models.
Automatic Classification. In summary, a topic is a subject concept that can be identified by its preferred and synonym termspace. Termspace is not unique to any one particular topic, whereas a topic is unique and has global identity. A topic can associate with other topics within a map and other maps. The topic endpoints of an association can have defined roles. Association and role types are based upon concepts sourced in a particular community and can also be shared with other communities. A topic can have occurrences of both internal and external content resources.
A resource binding agent is behavior within a topic that examines termspace, association/roles and content resources for the purpose of automatically extending the occurrence of additional resources. This extension usually first requires the classification of the resource's content by matching document text with a predicate pattern. A match that fits the heuristics will generate a new analytic facet that links the annotated document passage or image back to the topic. A resource binding method can be shared with other community members and easily adapted for use in other topics.
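A very reduced sketch of such a binding step is shown below: a topic's terms are turned into a pattern, the pattern is matched against a document's text, and each match yields a record that links the passage back to the topic. Everything here is hypothetical (the `AnalyticFacet` record, the `bind_resource` function, the topic id and the URL); real heuristics would also weigh associations and roles, which this sketch ignores.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyticFacet:
    """Hypothetical link from a passage in a resource back to a topic."""
    topic_id: str
    resource_url: str
    start: int
    end: int  # exclusive


def bind_resource(topic_id, terms, resource_url, text):
    """Create one analytic facet per whole-word, case-insensitive term match."""
    if not terms:
        return []
    # Longer terms first so that they win over shorter terms they contain.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [
        AnalyticFacet(topic_id, resource_url, m.start(), m.end())
        for m in pattern.finditer(text)
    ]


text = "Pinot noir is a red wine grape. Many regions grow Pinot Noir."
facets = bind_resource(
    "topic:pinot-noir",                # hypothetical topic identifier
    ["pinot noir"],                    # preferred and synonym termspace
    "https://example.org/wine.html",   # placeholder resource address
    text,
)
matched = [text[f.start:f.end] for f in facets]
```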
4. Standards Issue
"A distributed or self-organizing approach to the problem seems the most promising, but it runs into the classic technology issue of standards or the even more ancient human problem of common language. The history of technology is full of Tower of Babel examples - separate distributed systems that do not talk to each other."
Cubicon represents systems in an iconic language that is ideal for representing knowledge on a global, natural language-neutral basis. The platform technology represents a meta-standard that can be utilized to represent standards for virtually any kind of domain discourse that can be processed within the Semantic Net. Cubicon is based upon a finite set of executable components that provide an icons-to-bits transformation spanning human discourse down through microprocessor architectures.
The strategy for protecting this intellectual property is a combination of trade secrets, copyrights and patents. We envision that Cubicon executable designs will some day be a permissible medium for patent representation and direct submission.
Cubicon Corporation must strike a balance between proprietary and open source interests. Providing the technological mechanisms that automate community development leverages the open source market phenomenon by enabling viable new software business models. Simultaneously, technology control must be maintained to provide both a substantial income stream to sustain the build-out of the Cubicon environment and user community, and a means to maintain unpremeditated compatibility between IP components. Designed Source is our designated name for this open and protected market model.
Designed Source is based upon the observation that the software industry lacks a common method of expressing system requirements. These requirements include specifications, standards, protocols and programs. Heterogeneous computing requires that individuals not necessarily involved in the coding process be involved in systems development. Calling the open source community 'open' is a misnomer since it is limited to expert-level programmers. For the vast remainder of the population, a system expressed in Java or 'C' is opaque and only captures a small portion of its original design intent. Designed Source will greatly expand the system developer population to include knowledge workers who can effectively collaborate. This advanced collaboration model will provide a more expressive infrastructure than that afforded by current symbolic programming languages.
Conclusion
Cubicon has developed an infrastructure architecture that will effectively serve as a high level backbone for a Semantic Net.
The Internet's current infrastructure, poised on top of TCP/IP, has significant deficiencies that need to be addressed before it can become a unified global communications/computing medium. Layers 4 through 7 of the OSI Reference Model are based upon a legacy stack of countless incremental technologies. Cubicon has developed a new architecture that addresses the fundamental issues of complexity management, interoperability, security, productivity, agility and robustness that are presently impeding the Internet from reaching its full potential in the emerging Knowledge Age.
This overview describes an outline of Cubicon's platform capabilities that will radically simplify network architecture and enable true semantic functionality.
Appendix A
Overlapping Hierarchy Problem
The SGML community has a long history of developing methods to mark up documents to capture semantic information embedded in strings of text. New methods to manage complexity need development as text-encoding requirements become more sophisticated. As XML has supplanted SGML, this newer text encoding technology should evolve methods to handle these sophisticated requirements for diverse communities of interest. But SGML never effectively solved the 'overlapping hierarchy problem' encountered in complex text encoding. Therefore, it is difficult to imagine how XML 2.0 can address this emerging encoding requirement with any degree of sophistication or efficiency.
Text processing theorists and standards developers of the descriptive markup approach encode fundamentally differently from the literary and linguistic encoding community. The descriptive markup approach treats each document as having a single natural representation as a logical hierarchy of objects. Which text objects may occur is a function of the document type; each type has its own set of objects and a grammar that specifies the syntactic relations between them. A descriptive markup document is given a specific type definition that, among other things, constrains all instances of that type to be hierarchical structures of text objects or elements. Examples include:
Book: front matter, back matter, body, chapter, section, paragraph, extract, list, footnote ...
Article: title, author, affiliation, abstract, section, subsection, paragraph, extract ...
Letter: sender address, recipient address, salutation, body, paragraph, close, scrivener initials, enclosure note ...
Poem: title, stanza, line ...
Script: cast list, performance history, title, stage directions, act, scene, line ...
The descriptive markup view of text objects is determined by document type or category of elements. On the other hand, the literary and linguistic encoding community view of text objects is based upon an organizational principle determined by the analytical or methodological perspective on the text. Some examples of such perspectives and typical elements they contain are:
Grammar: verb, noun, pronoun, adjective, adverb, preposition, conjunction, and interjection ...
Dramatic: act, scene, stage directions, speech ...
Prosodic: poem, verse, stanza, quatrain, couplet, line, half line, foot ...
Narrative: preparatory, villainy, insufficiency, reaction, victory (Propp) ...
Rhetorical: proem, narrative, arguments, subsidiary remarks, peroration (Korax of Syracuse) ...
Discourse: opening, check, topic changing, ending ...
Axiomatic: Primitives, axioms, definitions, theorems, proofs, counterexamples, definienda, definientes, clauses ...
Any of these perspectives has a plausible claim to be the logical structure of the text - for instance they all fit the notion of content object both as suggested by the gloss "having to do with meaning and communicative intention" and as contextually defined by the arguments given above in support of the descriptive markup view. But, because there is no single logical hierarchy that contains all of these perspectives, we can no longer claim that "text is an ordered hierarchy of content objects". Once the class of logical elements in a given text is expanded to include all of these different perspectives, overlapping objects will inevitably beset us: There is no unique hierarchy of content objects that is the text.