An Effective Parallel Programming
Architecture for the Emerging
Manycore Computing Age

The Cubicon Architecture has undergone many years of development and careful vetting at the architecture level and it is now ready for full development. The Cubicon Architecture solves the manycore programming problem.

Background
How Will Massive Parallelism Be Programmed?
Cubicon Semantic Net Environment
1. Hardware Transparency
2. Fused Hardware/Software Model
3. Semantic Computing
4. Secure Meta Architecture
5. Contextual Processing
6. Multi-paradigm Iconic Programming Language
7. Communications/Computing Convergence
Conclusion




 Background

In his seminal article The Free Lunch Is Over, Herb Sutter of Microsoft concludes that the major processor manufacturers and architectures have run out of steam. Traditional approaches to boosting CPU performance are not scalable. Instead of driving clock speeds and straight-line instruction throughput ever higher, hardware developers are turning en masse to hyperthreading, multicore and manycore architectures. This changing face of hardware is altering the way future software must be developed. The next great challenge is in managing a relentless increase in system complexity.

A View from Berkeley delineates the state of parallel computing as it exists today and where it needs to go in our manycore future. This UC Berkeley report discusses the convergence that is taking place between the embedded and HPC (High Performance Computing) markets. Once at opposite ends of the computing spectrum, embedded computing and HPC are being brought together by their common needs for energy efficiency, low-cost hardware building blocks, software reuse and high-bandwidth data.

The real concern is how massive parallelism will be programmed. Berkeley researchers believe that neither the sequential nor multicore programming models provide the right approach. A central precept for manycore programming is that the model should be independent of the number of processors. Removing the dependency between the application and the processor/core count provides for automatic application scaling as succeeding microprocessor generations increase core density. This would be a huge step in the right direction, bringing us back to the good old days when application performance automatically increased with every processor clock speed bump.
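The precept of a model that is independent of the processor count can be illustrated with a minimal Python sketch. This is not Cubicon code: the function name `parallel_sum_of_squares` and its `workers` parameter are hypothetical, chosen only for illustration. The application states what to compute, and the runtime decides how many workers to use. CPython threads show the shape of the model rather than a real speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_sum_of_squares(values, workers=None):
    """Sum x*x over values without the caller naming a core count.

    `workers` is a hypothetical override used only for testing; by default
    the pool is sized from whatever the host machine reports.
    """
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result does not depend
        # on how many workers happen to be available.
        return sum(pool.map(lambda x: x * x, values))
```

The same call, `parallel_sum_of_squares(range(10))`, gives the same answer whether the pool holds one worker or sixty-four. That is the property the Berkeley precept asks of a manycore programming model.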

A related concern is productivity. A programming model must allow the software developer to balance the competing goals of productivity and implementation efficiency. Here, unfortunately, there is no consensus. The Berkeley report does some hand waving about human-centric programming, expanding data types, and providing support for different types of parallelism, but the authors recognize that there is usually a "tradeoff" between ease of programming and runtime performance.


 -- The Problem --
How Will Massive Parallelism Be Programmed?

The recently announced, industry-funded Universal Parallel Computing Research Centers (UPCRCs) comprise two five-year efforts, one with U.C. Berkeley and the other with the University of Illinois. The distinguished David Patterson heads the project, together with an equally honored faculty who will be leading their equally brilliant graduate students. One could hope for a breakthrough in parallel programming; however, it is the opinion of CoreTalk that improvements will only be incremental. Can AMD and Intel afford to wait five years only to hear that the challenge remains unanswered?

Academia continues to see the solution space from conventional systems engineering perspectives, suggesting incremental changes to legacy software languages. These traditional approaches have stretched the ingenuity of industry practitioners and academics alike. The fundamental issue here is credibility, since no existing software language has the rigor to satisfy what is needed for parallel programming.

While the U.C. Berkeley report primarily identifies the traditional HPC patterns (scientific and engineering subtasks termed the 13 "dwarfs"), it does not adequately address the human scaling issues of managing the exploding program complexity in designing and maintaining advanced applications at Net-scale. A revolutionary software architecture is required to meet the emerging manycore computing environments needed to effectively deal with these new supercruncher applications. The cognitive dimension is clearly in play. Humans need to understand the nature of the medium through which they communicate and on which service exchanges occur. The relentless increase in program complexity must be matched with a simplification of architecture.

Traditionally, the hardware industry has relied upon increasing clock speeds and straight-line instruction throughput to solve the scaling and performance issues. The software industry has supported the hardware industry by providing the languages and tools required to support the hardware developments. The next revolution in hardware (e.g., manycore) must now be led by software that is based on an entirely new architecture.

The software industry has reached an inflection point where revolutionary architecture is required to match the emerging performance gains soon to be available in manycore architectures.


 -- The Solution --
Cubicon Semantic Net Environment

Aside from ease of programming and runtime performance demands, next-generation software architecture must effectively address coherent accelerated platform requirements. Each issue is followed with a short explanation of the manner in which Cubicon addresses these requirements.

Cubicon will dramatically change and improve how humans interact with the Internet. Semantics literally means the "meaning of things." Thus, a Semantic Net Environment is one in which the context of information and data is passed between computers for the benefit of human understanding. Known applications of great importance are the semantic desktop, social networking, intelligent routing, and multicore/manycore programming. Each one of these areas is challenged by the limitations of its existing architectures. The required simplification is achieved by an objective synthesis of lessons learned and by the maturity of cognitive science and computer technology.


 1. Hardware Transparency

Structured parallel applications must be developed independent of an underlying hardware environment consisting of many execution cores. This requirement dictates that an application must automatically deal with shared heterogeneous states. Exclusive locks and asynchronous message passing mechanisms provided by imperative languages have not proven to scale. Functional languages, such as Haskell and Erlang, manage states through a side-effect-free style of program development at a cost of obscuring a solution space into a paradigm foreign to mainstream software developers. While useful, the functional systems model has also not proven to scale for many classes of applications.

Recent functional language research has illuminated an effective means to deal with shared heterogeneous states through STM (Software Transactional Memory) architecture. STM works efficiently in an object paradigm only when data encoding mechanisms share consistent patterns. When the NeXT computer, whose operating system was the ancestor of Apple's OS X, was designed, some degree of common data pattern consistency was built into the development toolset and hardware. This work did not go far enough. STM provides a scalable mechanism for general heterogeneous parallel systems representation, but lacks the timing precision required for deterministic RTOS (Real-Time Operating System) environments. For this requirement, a synchronous reactive computational model is required, such as that demonstrated in Esterel.
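For readers unfamiliar with STM, the general idea can be sketched in a few lines of Python. This is a textbook-style optimistic scheme, not the Cubicon mechanism. The names `TVar`, `Transaction`, `atomically` and `transfer` are assumptions made for this illustration. A transaction records the version of every variable it reads, buffers its writes, and commits only if nothing it read has changed in the meantime. Otherwise it is retried.

```python
import threading


class TVar:
    """A transactional variable: a value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0


_commit_lock = threading.Lock()  # serializes the short commit phase only


class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed on first read
        self.writes = {}  # TVar -> value to install at commit

    def read(self, tvar):
        if tvar in self.writes:          # read-your-own-writes
            return self.writes[tvar]
        # Record the version before reading the value, so a concurrent
        # commit in between is caught by validation at commit time.
        self.reads.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value

    def commit(self):
        with _commit_lock:
            # Validate: every variable read must be unchanged.
            if any(tv.version != seen for tv, seen in self.reads.items()):
                return False
            for tv, value in self.writes.items():
                tv.value = value
                tv.version += 1
            return True


def atomically(action):
    """Run action(tx) repeatedly until its transaction commits."""
    while True:
        tx = Transaction()
        result = action(tx)
        if tx.commit():
            return result


def transfer(src, dst, amount):
    """Move amount between two TVars as one indivisible step."""
    def action(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(action)
```

No caller ever takes a lock on an individual account, yet `transfer` never exposes a state in which the money has left one variable and not reached the other. The sketch makes no attempt at the timing guarantees discussed above, which is exactly the gap the synchronous reactive model is meant to fill.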

Cubicon's STM Mechanism
Cubicon uses a generic STM approach that provides a cognitively transparent virtual model overlay on manycore architecture. It provides a native mechanism for developing distributed, interactive applications much like AJAX (Asynchronous JavaScript and XML) but is automated within a design framework. A session maintains "shared" object locks that trigger sharing client refresh following a write event in a highly efficient manner. The program's design style is aligned to conventional programming practices that developers currently employ except that the abstractions are iconic vs. symbolic. An iconic representation is characterized by a representation of the world in terms of mental images, whereas a symbolic representation is the way people represent their world in symbols.

Automation enables an arbitrary number of clients to interact with a shared processing server space (called a transactor module grove) where state arbitration is being performed at the core level, independent of OS thread allocation. State arbitration is mediated through a closure and according to the growth of matrix, structure and composite grove data that can directly leverage NUMA (Non-Uniform Memory Access) architecture. A grove evolves by capturing the regularity of data passage between Cubicon Context Engines, building out a projection of a closed FSM (Finite State Machine) whose generating atomic and molecular components are made universally available. Grove data is a projection from stable community genealogy based on globally unique template components. A transactor module grove evolves under the coherence of a specific community genealogy that provides a prototype/progeny model for component evolution. This advanced abstraction model overcomes the fragile base class problem that has plagued object-oriented languages since the early days of Simula and Smalltalk.

Module Grove

Cubicon Synchronous Reactive Mechanism
This mechanism provides deterministic timing of computed events at the elementary operation level. Deterministic temporal order is required for computing reliability and stability. The synchronous reactive model is a convention whereby all instructions (elementary operations) are parallel reactive processes and have equal durations based on a virtual system-wide clock. In other words, all related states transform in exactly one virtual cycle. Normalization over a finite state set is enforced by the Cubicon environment.
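The "one virtual cycle" convention can be made concrete with a small Python sketch. It is a generic illustration of synchronous reactive stepping under stated assumptions, not the Cubicon implementation. The function `tick` and the dictionary-of-reactions representation are hypothetical. Every reaction reads the same frozen snapshot of the previous cycle, and all new values become visible together.

```python
def tick(state, reactions):
    """Advance one virtual cycle.

    state:     dict mapping signal name -> current value
    reactions: dict mapping signal name -> function(snapshot) -> next value

    All reactions observe the same snapshot, so the order in which they
    are evaluated cannot influence the outcome.
    """
    snapshot = dict(state)       # frozen view of the previous cycle
    next_state = dict(state)     # signals without a reaction keep their value
    for name, react in reactions.items():
        next_state[name] = react(snapshot)
    return next_state


# Two mutually dependent signals; each uses the other's *previous* value.
reactions = {
    "a": lambda s: s["b"],
    "b": lambda s: s["a"] + s["b"],
}

state = {"a": 0, "b": 1}
for _ in range(3):
    state = tick(state, reactions)
# state is now {"a": 2, "b": 3}
```

Had `b` been allowed to see the freshly updated `a` within the same cycle, the result would depend on evaluation order. Reading only from the snapshot is what gives the model its deterministic temporal order.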


 2. Fused Hardware/Software Model

The emerging manycore environment demands a fusion between hardware and software technologies to create a new symbiotic relationship that dramatically simplifies the programming model.

The next section outlines a unifying framework based on the Cubicon contextual processing model that redefines the software/hardware contract. This model is an inverse projection and thus a unification of numerous programming and complexity management theories. The model is a direct product, combinatorial composition, of other models, followed by a simplification. The simplification occurs due to a bypass in which text-based programming techniques are replaced by direct icon-to-bits binding.

Traditional Constraint Management
At the lowest level of information transformation, a processor requires perfect dependency constraints on instruction and data streams. Both are fed into one or more EUs (Execution Units). A superscalar processor performs constraint management dynamically by hardware circuits, whereas a VLIW (Very Long Instruction Word) processor performs this task statically by a software compiler. The goal of both processor architectures is to be able to perform ILP (Instruction Level Parallelism) in order to increase overall processor throughput. The fundamental difficulty with current stream management is that architectures rely on source languages such as C++, Java or C#. They provide poor support for concurrent or parallel system constructs.
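What static constraint management involves can be shown with a toy scheduler in Python. This is a deliberately simplified sketch of the VLIW-style idea, not a description of any real compiler. The function `schedule` and the `(dest, sources)` instruction format are assumptions. It considers only true data dependencies, assumes each destination is written exactly once, and ignores limits on the number of execution units.

```python
def schedule(instructions):
    """Group instructions into bundles that may issue in the same cycle.

    instructions: list of (dest, [sources]) in program order.
    Returns a list of bundles, each a list of destination names.
    """
    level = {}    # dest -> index of the bundle that produces it
    bundles = []
    for dest, sources in instructions:
        # An instruction issues one cycle after its latest producer.
        # Sources never written by the program are inputs, ready at cycle 0.
        depth = 1 + max((level[s] for s in sources if s in level), default=-1)
        level[dest] = depth
        if depth == len(bundles):
            bundles.append([])
        bundles[depth].append(dest)
    return bundles


program = [
    ("t1", ["a", "b"]),
    ("t2", ["c", "d"]),
    ("t3", ["t1", "t2"]),
    ("t4", ["a", "c"]),
    ("t5", ["t3", "t4"]),
]
bundles = schedule(program)
# bundles == [["t1", "t2", "t4"], ["t3"], ["t5"]]
```

Five sequential instructions collapse into three issue cycles. A superscalar processor discovers the same grouping at run time in hardware, while a VLIW compiler computes it ahead of time, as here.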

Furthermore, a source program must first be transformed into intermediate compiled type objects for a particular processor's instruction set. There is a fundamental impedance mismatch between these source languages and the processor instruction sets. This mismatch dramatically increases concerns with complexity.

Cubicon Constraint Management
Cubicon applies dependency constraints to a Context Processor at the time a designer declares a system resource and not later at compile or execution time. These constraints are applied on six discrete computation layers. Explicit constraints are reflected in four parallel execution levels: service dialog, asynchronous message, process-flow and control-flow.

Cubicon Computation Layers

Molecular and atomic level data transformations are performed concurrently in an implicit manner without designer intervention. Specialized cores within the cache subsystem perform matrix, structure and composite transformations on molecular data. Separately, atomic data transformations are performed by EUs. The output of a declared system is four direct "distilled" type objects that represent optimized binary code and data that are fed directly into a Context Processor without compilation.

In summary, genealogy expression is embodied in Context Processor design principles and supports standard reuse of iconic design components as well as the recognition of variations (mutation). This provides for innovation as well as a high degree of reuse. Software reuse is optimal and can accommodate evolution in the software/hardware contract over time. Use-based utility and feedback will measure pressure on the compositional completeness, thus opening genealogy specifications only when inadequacy is determined.


 3. Semantic Computing

The next evolution in computing, Semantic Computing, is being driven by the need for computers to independently perform tedious work involved in finding, sharing and combining structured context with vastly reduced human intervention on a Net-scale basis. Also, advanced signal processing and pattern recognition capabilities must be reflected in hardware to extract, access, transform and synthesize the semantics of concepts, multimedia and unstructured context.

Structured Context Processing
The ability to consolidate knowledge enables rapid development of dialog between people and organizations, providing an efficient and secure manner to conduct transactions. This automated service infrastructure will enable heterogeneous systems to effectively communicate and initiate rapid adoption of SOA (Service-Oriented Architecture).

Unstructured Context Processing
Cubicon supports knowledge creation through structured particular data and information as well as the formation of universals expressed as concepts. These concepts come into existence in two ways:

1) The first way to create a concept is driven by stationary and traveling agent interactions between machines. Community genealogy and preexisting associations between concepts within a topic map structure these interactions. Agent heuristics dynamically generate new concepts and modify existing associations as conditions change within and between knowledge domains.

2) The second way to create a concept is through the shared cognitive responses occurring within the social sphere of a human community of practice. These universals arise from the community and engage its members in collaborations. Multiple concepts that are declared by different communities are automatically harmonized when their meaning is determined to be the same. Topic maps serve as the primary interface between participants through their individual Semantic Desktop.

A mind map is a diagram used to represent any concept - such as a subject, task or process. Although it is centuries old, the mind mapping technique has been surprisingly effective in learning, brainstorming, visual thinking and problem solving. A broad and disparate spectrum of people from accountants to zoologists use this style of mapping to assist in addressing problems and decision making, as its effectiveness is not limited to any one domain of practice.

A topic map automates the notion of a mind map to augment human intelligence to Net-scale. This high level of organization is necessary to manage the rapid exchanges of concepts between human and machines. A topic map is represented as a subject set linked concentrically around a study topic. A map provides a natural way to organize topics within a computer (as compared to a conventional operating system directory) by overlaying the rigid hierarchical file structure with taxonomic-based graph structures. Topics are linked intuitively according to their association as these relationships define their meaning within a particular map. Each topic can link to content resources located in a computer or anywhere on the Internet.
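The structure described above, topics linked by association and pointing at content resources, can be sketched as a small graph in Python. This is an illustrative data structure only, not the Cubicon representation. The class `TopicMap`, its method names and the example locators are hypothetical.

```python
from collections import defaultdict


class TopicMap:
    """Topics linked by typed associations, each pointing at resources."""

    def __init__(self):
        self.associations = defaultdict(set)  # topic -> {(kind, other topic)}
        self.occurrences = defaultdict(set)   # topic -> {resource locator}

    def associate(self, topic, kind, other):
        # Associations are stored in both directions so that either
        # topic can serve as the centre of the map.
        self.associations[topic].add((kind, other))
        self.associations[other].add((kind, topic))

    def add_occurrence(self, topic, locator):
        self.occurrences[topic].add(locator)

    def neighbours(self, topic):
        """Topics directly linked to the given study topic."""
        return sorted({other for _, other in self.associations.get(topic, ())})


tm = TopicMap()
tm.associate("manycore", "programmed-with", "STM")
tm.associate("manycore", "programmed-with", "synchronous reactive")
tm.associate("STM", "studied-in", "Haskell")
tm.add_occurrence("STM", "file:///notes/stm.txt")          # hypothetical local file
tm.add_occurrence("STM", "https://example.org/stm-intro")  # hypothetical URL
```

Unlike a directory tree, a topic here can be reached from several directions, and its resources can live on the local machine or anywhere on the Internet.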

This concept architecture is expressed using a finite number of core-object types having a formal correspondence to the axioms of geometry, while remaining open to axiomatic forcing. This openness transcends the limitations of formal systems based on complete and consistent symbolic models. Because the axiomatic span of Cubicon genealogy contains multiple symbolic programming language types, the designers' work is processed independent of any specific symbolic natural language. Concepts being expressed collectively are embodied as numeric patterns without the intervening complexity and inflexibility found in the RDF specification. Concepts sit outside of the formal specification in precisely the fashion envisioned in the Topic Map 1.0 standard. The result is a scalable infrastructure that reaches down to the core level in a stable, efficient and effective manner.

The Context Engine also natively processes an advanced form of markup called cleartext. This Cubicon innovation overcomes the complexity of analytic encoding that brings XML to its knees when multiple markups are layered on the same text source. Cleartext can represent Word, PDF and HTML as a universal document medium. It also provides transclusion, the inclusion of part of a document in another document by reference. Cleartext includes a mechanism that performs this word passage referencing transparently, independent of node location. An author maintains control over their original written work and can also receive micropayment on each citation event. This Cubicon mechanism enables authors to profit directly from their intellectual property.
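Transclusion by word passage can be illustrated with a short Python sketch. The `{{doc:start-end}}` reference syntax below is invented for this example and is not cleartext, whose actual encoding is not described here. The function `render` and the `library` dictionary are likewise hypothetical. The point is only that the quoting document stores a reference, and the quoted words are pulled from the original at display time.

```python
import re

# Hypothetical reference syntax: {{document-id:first-word-last-word}},
# with word positions counted from zero and the end position exclusive.
REF = re.compile(r"\{\{(\w+):(\d+)-(\d+)\}\}")


def render(text, library):
    """Replace each reference in text with the words it points to."""
    def resolve(match):
        doc_id = match.group(1)
        start, end = int(match.group(2)), int(match.group(3))
        words = library[doc_id].split()
        return " ".join(words[start:end])
    return REF.sub(resolve, text)


library = {"essay": "the free lunch is over"}
quoted = render("Sutter wrote: {{essay:1-3}}.", library)
# quoted == "Sutter wrote: free lunch."
```

Because the quoting document holds only the reference, the author of the original remains the single source of the passage, and each call to `resolve` is a natural place to record a citation event.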


 4. Secure Meta Architecture

"The Cubicon environment is intrinsically secure based on two fundamental precepts, core-object program constructions and community-based development."

Core-object Program Construction means that all programs are represented as a combination expressed in a finite number of concepts. These concepts range from the bottom up as unit, attribute, operation, expression, method, template, block, matrix, composite, context and framework. Instances of these components are assembled in an intentionally responsive design tree and distilled into a module grove within the Cubicon Context Engine. A design tree is expressed to the designer on seven iconic abstraction levels.

Cubicon Iconic Abstraction Levels

Data and structural regularity within and through the abstraction levels produce many benefits. For example, the Context Engine performs a garbage collection event three orders-of-magnitude faster than current commercial VMs such as the JVM (Java Virtual Machine). This efficient Memory Management architecture eliminates the need to develop "workaround code" to achieve required performance in deterministic RTOS domains. By not overriding the iconic interface, complexity management is maintained and security is assured. Designers shift reuse patterns to respond to real time events and elevate communication to genealogy management of unexpected pressure on generic expressions. Again, because the use and adjustment of design expression follows the familiar processes humans use in natural language communications, the class of individuals who may easily be involved in design processes is increased to just about everyone. Icon-to-bits binding enables a designer's intent to be expressed directly to the software Context Engine. This capability is enhanced when the Context Engine is synthesized into hardware.

Community-based Development means that core-objects can only be sourced from a Community Repository that is controlled by an entity that takes all financial and legal responsibility for their creations. A Context Engine authenticates all programs and considers any behavior sourced outside this ecology as alien and unintelligible. The combinatorial expression of core-objects is open to innovation and novelty, thus providing the necessary flexibility required of real world communication. The management of the genealogy that expresses as an ecosystem of behaviors allows for market transparency over intellectual property claims and increases the software value chain. The market becomes consistent with Adam Smith's economic theory, while creating a stratified constraint predicted by John Nash's work on a separation of community from individual specification of value.

Secure Engine Execution
Cubicon stops a wide variety of software that organizations do not want running in their networks: viruses, worms, trojans, spyware, adware, hacker tools, peer-to-peer file sharing software, games, unlicensed software, stolen software and even old versions of valid software. Instead of trying to recognize rogue software, Cubicon only executes authenticated programs sourced from a Community Repository, period. A Community Repository is an architectural space in the Cubicon ecology shared by its membership. There is no need for white, black or gray lists, and a zero day assault has no meaning in such an ecosystem. The precise instantiation and distillation of the Cubicon ecology leaves no room for malware creation and execution, in that the process is completely syntax-driven, semantically bound and based on well-understood computer science theory.
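The "execute only what is authenticated" policy can be sketched in Python using the standard `hmac` and `hashlib` modules. This illustrates the general allow-by-signature idea and is not the Cubicon authentication scheme, which is not specified here. The functions `sign` and `run_if_authentic` and the shared `repository_key` are assumptions. A production system would more likely use public-key signatures, so that the execution engine never holds a signing secret.

```python
import hashlib
import hmac


def sign(program: bytes, repository_key: bytes) -> str:
    """Signature the (hypothetical) repository attaches to a program."""
    return hmac.new(repository_key, program, hashlib.sha256).hexdigest()


def run_if_authentic(program, signature, repository_key, execute):
    """Execute program only if its signature verifies; otherwise refuse."""
    expected = sign(program, repository_key)
    # compare_digest avoids leaking how many characters matched.
    if not hmac.compare_digest(expected, signature):
        raise PermissionError("program was not sourced from the repository")
    return execute(program)


key = b"demo-repository-key"          # placeholder secret for the example
program = b"print-hello"
tag = sign(program, key)

result = run_if_authentic(program, tag, key, execute=lambda p: p.decode().upper())
# result == "PRINT-HELLO"
```

Note what is absent: there is no list of known-bad programs to maintain. Anything that fails verification, including a previously valid program that has been altered by a single byte, is simply never run.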


 5. Contextual Processing

The synthesis of the Cubicon Memory Management module is the proposed first project. This project would introduce Intel and/or AMD hardware designers to the Cubicon design methodology. Conversion productivity and execution performance measurements would provide concrete data to appraise the feasibility and cost of full-scale Context Processor implementation.

Instruction-based and Cycle-based Deployments
The Context Processor is an architecture framework represented concretely in the Cubicon language. This universal iconic specification provides the means to derive both instruction-based and cycle-based deployments through ANSI C or an HDL (Hardware Description Language) source language.

The Context Engine is an instruction-based implementation where distilled designs execute on ILP multicore and manycore processors. This software deployment would enable evolutionary processor platforms to execute Cubicon programs.

There are two possible cycle-based revolutionary device implementations: CubeMachine and CubeSystem. A CubeMachine acts much like a microprocessor in that it contains all the generic core resources needed to dynamically load and execute any Cubicon behavior. A CubeSystem acts much like an ASIC in that it contains only the resources necessary to execute a specific SoC (System-on-a-Chip) system. A CubeSystem only needs to contain the particular cores required to process a predetermined set of behaviors. Contexts are premarshalled, meaning that the binding of instructions and data is performed statically during the design synthesis development stage. In both devices, an optimized gate level netlist would drive the process synthesis in order to generate a silicon spin.

Context Processing Deployments

HDL Synthesis Design Flow
The traditional design flow with an HDL consists of RTL source code translation and logic optimization stages before process synthesis and further optimization for a particular hardware technology. Design constraints are applied in the later stages, requiring inefficient back annotation into the HDL code. This HDL code provides few constraints, thus placing the design constraint burden on the process synthesis tools. New design tools attempt to drive the constraints to a higher level in the design; however, there is a limit to this integration due to the lack of standards and the inability to capture the constraints in the HDL.

Design Synthesis Flow
Cubicon effectively moves all design and hardware constraint management into the component declaration phase. Prior to process synthesis for a CubeMachine or CubeSystem, Cubicon provides several automatic steps in CubeStudio that optimize a system in preparation for process synthesis. We refer to these steps as design synthesis.

Synthesis Design Flow Comparison: HDL vs. Cubicon

A system's behaviors and structures are marshalled into a set of executable contexts. The required cores are further optimized to only contain the required cycle process functionality demanded by the declared system's operations. The number of core replicates is determined by the intra- and inter-module relationships based on time vs. space considerations. The optimized behaviors and structures are integrated into a final type integration process that produces the four compressed types. These types drive the process synthesis. At this point, Cubicon technologies intersect with current life cycle tools.

Design Synthesis

Process Synthesis Flow
Design synthesis reduces the burden on the process synthesizer through these pre-optimization steps. The process synthesizer maps the library of available gate and leaf level cells. Innate core components are integrated into the SoC through a Cubicon standard signal port specification. This integration capability enables the mix of existing cores into the Cubicon space and will conform to SoC industry standards. The optimized gate level netlist describes a CubeMachine or CubeSystem. The circuit characteristics for a particular foundry's implementation of the Context Processor are linked back into CubeStudio as hardware constraints contained in the core-objects.

Separation of Design Concerns
Current hardware development tool methodologies do not adequately separate structural and behavioral design concerns from physical design concerns. This lack of separation creates many problems when developing complex submicron chips. Cubicon effectively separates these concerns so that a developer can initially concentrate on feature/functional tradeoffs and then, as a second concern, optimize a CubeSystem with the most effective process technology.

In summary:

Structural and Behavioral

Physical


 6. Multi-paradigm Iconic Programming Language

A multi-paradigm language is one that supports more than one programming paradigm. The theory of a multi-paradigm language is to provide a framework in which programmers can work in a variety of styles, freely intermixing constructs from different paradigms. The design goal of such languages is to allow programmers to use the best tool for a job, admitting that no one paradigm solves all problems in the easiest or most efficient way. A multi-paradigm language must also now account for heterogeneous parallel construction that allows the average developer to effectively harness the promised underlying power of manycores. Cubicon provides language mechanisms for at least ten of the programming paradigms listed on the Wikipedia reference page.

The multi-paradigm language challenge is to overcome the "kitchen sink" syndrome. Representing a multitude of program constructs while limiting the expression medium to long strings of ASCII text is extremely limiting. A text-based program must be run through a compiler before there is any reasoning about its semantic nature. Furthermore, many scholars believe that true semantics only occur when interpreted in a moment within the human mind. Thus, the user interface to programming languages becomes a glass ceiling limiting true functionality when situations are novel. The complicated nature of text-based programs obscures the need to understand the embedding complexity of real time situations.

There is scant historical evidence that extending any symbolic programming language can manage the relentless drive towards higher system complexity. Fundamental rethinking must be embraced by industry to make existing computing and communications more productive and to substantially increase the population that can envision and maintain advanced parallel systems.

Cubicon Iconic Expression Medium
Cubicon provides an automated multi-dimensional visualization of systems that is natural language, domain and methodology independent. Collective management of emerging concept representations is enabled directly without intervening influences from the computing mediums. The resulting distributed communication and computing leverages human pattern matching capability to provide high productivity and ease of learning. Cubicon is both broad and deep. It is broad in the sense that it provides multiple program constructions, aggregated from a small set of core-objects, in highly tuned standard visualizations for a wide array of language mechanisms. It is deep in the sense that it provides a high level of domain abstraction while retaining the ability to dive into bit-level manipulations that are very close to the machine, all within a few mouse clicks.


 7. Communications/Computing Convergence

Island and stovepipe computing are giving way to grid, distributed and parallel topologies. An effective semantic substrate is necessary to collapse layers 4-7 of the OSI Model and drive the data regularity supporting intelligence into normalized packet processing on the router backplane of the Internet.

CoreProtocol Processing
The W3C (World Wide Web Consortium) has produced the first draft of the EXI (Efficient XML Interchange) format specification that defines the compact representation of XML in a binary format.

The Semantic Engine natively processes compact binary services at the router protocol level without XML string parsing, thereby eliminating the need for hardware acceleration. This DPI (Deep Packet Inspection) substrate enables "secure, intelligent routing." Through the use of a multi-layered framework, a service remains transparent up through the application level, thus effectively fusing communications and computing. This architecture is amenable to synthesis and is a likely candidate to drive into future manycore functionality.

Comparison of Binary XML vs. CoreProtocol

 Conclusion

Clearly, a major crisis looms in the not-too-distant future if manycore processors are not matched up with a programming language that can fully utilize the hardware capabilities. The present technology trajectory for manycore programming does not offer hope that in five years users will be well served with vastly superior computing systems. Software must lead hardware if the industry is to resolve the conundrum it is presently in.

The Cubicon Architecture has undergone many years of development and careful vetting at the architecture level and it is now ready for full development. The Cubicon Architecture solves the manycore programming problem.



 

email: klausner@coretalk.net
Planning for a Deep Semantic Net
Contact: Sanford B. Klausner, Founder and CTO
408.621.4709


  © Copyright 1987-2008, Sanford B. Klausner