Since a while the processor market has observed a Moore’s law decline: the processing velocity cannot be duplicated each year anymore. That is why, the old and almost forgotten discipline of parallel processing have had a kind of resurrection. GPUs, DSPs, multi-cores, are names that are kicking the market recently, offering to the consumers more horse power, more multi-tasking, but not more giga-hertz per core, as it used to be.
Also an ecologist spirit has hit the chips manufacturers, embracing the green-computing concept. It fits perfectly with the Moore’s law decay: more velocity, more power consumption and more injected heat into the environment. Not good. A better solution, they say, is to have a set of specialized processors which are activated when their specific task is requested by the user: do you need extensive graphics processing? No problem, the GPU will deal with it. Do you need decode or encode high resolution multimedia? We have a DSP for you. Are you only typing in an editor? We will turn off all the other processors thus saving energy.
Even though this scenario seems quite idilic, the hardware manufacturers have placed a lot of responsability upon the software. The hardware is there, it is already wired into the main memory, but now all the heuristics and logic to control the data flow among them is a huge and still open problem, yet with multiple partial solutions currently available.
And that is the case of Texas Instrument, who, since his first generation of OMAP, has delivered to the embedded and mobile market a multicore chip, with a DSP and a high end ARM processor. But deliver a chip is not enough, so TI had had to provide the system software capable to squeeze those heterogeneous processors.
At the begining, TI only delivered to its clients a mere proof of concept, with which the client could take it as a reference for his own implementation. That was the case of dsp-gateway, developed by Nokia as an open source project.
But as the OMAP processor capacities were increasing, TI was under more pressure from his customers to deliver a general mechanism to communicate with the embedded processors.
For that reason TI started to develop a mechanism of Inter-Processors Communication (IPC), whose approach is based on the concept that the general purpose processor (GPP), the host processor in charge of the user interactions, can control and interact with the other processors as if they were just another devices in the system. Those devices are called slave processors.
Thus, this designed IPC mechanism runs in the host processor’s kernel space and it fulfills the following responsibilities:
a) It allows the exchange of messages between the host processor with the slaves.
b) It can map files from the host’s file system into the memory space of the slave processor.
c) It permits the dynamic loading of basic operating systems and programs into the slave processors.
Also, TI has developed a programming library that provides an API with which the developers could build applications that use the processing power of the slave processors.
Sadly, whilst the development in the host processor, tipically Linux environments, is open and mostly free, the development for the slave cores (DSP in general) is still close and controlled by TI’s licences. Nevertheless, TI has provided, gratis but closed, binary objects which can be loaded into the slaves cores and they do multimedia processing.
Well, actually there have been some efforts to develop a GCC backend for the C64x DSP, and also a LLVM backend, with which, least theorically, we could write programs to be loaded and executed through these IPC mechanisms. But they are not mature enough to use them seriously.
DSPLink and DSPBridge
In order to develop an general purpose IPC for his OMAP processors family, TI has designed the DSPBridge. Oriented to multi-slave systems, agnostic to operating system, handle power management demands and other industrial weight
But the DSPBridge was not ready for production until the OMAP3 came out to the market. That is why, another group inside TI, narrowed the scope of the DSPBridge’s design and slimmed its implementation, bringing out DSPLink, capable to run in OMAP2, OMAP3 and also in the DaVinci family.
DSPLink is distributed as an isolated kernel module and a programming library, along with the closed binaries that run in the DSP side. Nevertheless, the kernel module does not meet the requirements to be mainlined into the kernel tree. Also, it lacks of power management features and a dynamic MMU support.
On the other hand, DSPBridge has been brewed to be mainlined into the kernel, though it has stuck in the Greg’s staging tree for a long time. It seems that all the resources within TI are devoted to SysLink, the next generation of IPC. Nonetheless, many recent OMAP3 based devices uses this IPC mechanism for their multimedia applications.
Initially, TI offered an OpenMAX layer on top of the DSPBridge user-space API to process multimedia in the C64x DSP, but that solution was too bloated for some developers, and the project gst-dsp appeared, which reuse the codecs for the DSP available in the TI’s OpenMAX implementation, along with the DSPBridge kernel module, to provide a thin and lean interface through GStreamer framework.
SysLink and OMAP4
Then OMAP4 came to existence. It is not only anymore a DSP and a high end ARM processor. It has a DSP, a dual ARM Cortex-M3, and, as host processor, a dual ARM Cortex-A9. Five processing units in a single silicon! How in hell we will share information among all of them? DSPBridge was not designed for this scenario in mind.
The ARM Cortex-M3 has the purpose to process video and images, and for that reason a tiler-based memory allocation is proposed, where the memory buffers are already perceived as 2D, where fast operations of mirroring and rotation are available.
Regretfully, in the case of the pandaboard (OMAP4330), the available DSP has lower capacities than the one in the beagleboards OMAP3, so the published codecs, for the OMAP3 DSP, can not be reused in the pandaboard. But video codecs for the M3 cores are currently available, and they are capable to process high definition resolutions.
The answer is SysLink. Where, besides the three operation developed for DSPBridge, two more core responsibilities were added:
a) Zero-copy shared memory: ability to “pass” data buffers to other processors by simply providing its location in shared memory
b) TILER-based memory allocation: allocate 2D-buffers with mirroring and rotation options.
c) Remote function calls: one processor can invoke functions on a remote processor
The stack offered is similar than the OMAP3: in the user space we start with the SysLink API, then an OpenMAX layer, called now as DOMX, and finally the gst-openmax elements for the GStreamer framework. And again, a bloated, buzzworded, stack for multimedia.
In his spare time, Rob Clark developed a proof of concept to remove the DOMX/gst-openmax layers and provide a set of GStreamer elements that talk directly with the SysLink API: libdce/gst-ducati.
Either way, I feel more comfortable with the approach proposed by Felipe Contreras in gst-dsp: a slim and simple API to SysLink and plain GStreamer elements using that API. And because of that reason, I started to code a minimalistic API, copying the spirit of the dsp_bridge, for the SysLink interface: https://gitorious.org/vj-pandaboard/syslink/