Aura

Over the last few weeks, Miguel, Calvaris and I developed an application for the N9/N950 mobile phones, and we called it Aura.

Basically, it uses the device’s camera (either the main one or the front one) for video recording, like a normal camera application, but it also exposes a set of effects that can be applied, in real time, to the video stream.

For example, here is a video using the historical effect:

Aura is inspired by the GNOME application Cheese, and it uses many of the effects available in GNOME Video Effects.

The effects we were able to port to the N9/N950 are: dice, edge, flip, historical, hulk, mauve, noir/blanc, optical illusion, quark, radioactive, waveform, ripple, saturation, shagadelic, kung-fu, vertigo and warp.
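Under the hood these effects are plain GStreamer elements; many of them map to the effectv elements in gst-plugins-good (dicetv, edgetv, vertigotv and friends). As a rough illustration only (Aura’s real pipeline on the N9 is built around the platform camera stack, not a bare v4l2src), a sketch with era-appropriate GStreamer 0.10 element names could look like this:

    /* Illustrative sketch, not Aura's actual code: apply the "vertigo"
     * effect to a live camera stream. A generic v4l2src is assumed here;
     * element names are from GStreamer 0.10 (gst-plugins-good). */
    #include <gst/gst.h>

    int
    main (int argc, char *argv[])
    {
      GstElement *pipeline;
      GError *error = NULL;

      gst_init (&argc, &argv);

      pipeline = gst_parse_launch ("v4l2src ! ffmpegcolorspace ! vertigotv "
          "! ffmpegcolorspace ! autovideosink", &error);
      if (pipeline == NULL) {
        g_printerr ("Could not build pipeline: %s\n", error->message);
        return 1;
      }

      gst_element_set_state (pipeline, GST_STATE_PLAYING);
      g_main_loop_run (g_main_loop_new (NULL, FALSE));

      return 0;
    }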

Besides these software effects, it is possible to simultaneously apply another set of effects that the hardware is capable of, such as sepia colors. These hardware effects do not impose an extra processing cost, as the software effects do.

Because of the processing cost imposed by the software video effects, Aura uses a fixed video resolution; otherwise, the performance would make the application unusable. Also, one feature is still missing: still image capture. But hey, there is good news: Aura is fully open source. You can check out the code on GitHub, and we happily accept patches.

Honoring Cheese, the name Aura is taken from a kind of Finnish blue cheese.

We hope you enjoy this application as much as we enjoyed developing it.

Diving into SysLink v2

Continuing our SysLink saga, we will now dive into its internals.

As we stated before, SysLink is a software stack which implements Inter-Processor Communication (IPC), and whose purpose is to enable Asymmetric Multi-Processing (AMP).

Most readers are more familiar with the concept of Symmetric Multi-Processing (SMP), which is the most common approach to handling multiple processors in a computer: a single operating system controls and spreads the processing load among the available processors. Typically these processors are identical.

In contrast, AMP is designed to deal with different kinds of processors, each one running a different operating system instance, with different architectures and interfaces. The typical approach is to have a master processor with one or more slave units, all of them sharing the same memory area; hence, the operating system running on the master processor exposes the other processors as additional devices available in the system (figure 1).

Asymmetric Multi-Processing (AMP)
Figure 1

The main advantage of AMP, as we mentioned in the previous post, is that we can integrate specialized processors and delegate to them tasks that are rather expensive to execute on our general purpose processor, such as multimedia, cryptography, real-time processing, etc.

All the operating systems running in an AMP system must share at least one key component: an Inter-Processor Communication mechanism, with which they can exchange data and synchronize tasks with each other.

Basically, there are two types of IPC: shared memory and message passing. With shared memory we avoid copying the data to be processed by another unit, but we also need complicated locking protocols to avoid overlapping accesses, which would lead to data corruption. On the other hand, message passing is oriented to the transfer of small chunks of data, called messages, which are sent by one processor and received by another; they are used to notify, control and synchronize.
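To make the contrast concrete, here is a minimal single-host sketch of both styles using plain POSIX primitives; SysLink implements the same two patterns across processors with its own components, described below.

    /* Both IPC styles with plain POSIX primitives (compile with -lrt).
     * This runs on one host; SysLink plays the analogous roles between
     * processors. */
    #include <fcntl.h>
    #include <mqueue.h>
    #include <semaphore.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int
    main (void)
    {
      /* Shared memory: the payload is never copied, but every access
       * has to be guarded (the role SysLink's Gate plays). */
      int fd = shm_open ("/demo_shm", O_CREAT | O_RDWR, 0600);
      ftruncate (fd, 4096);
      char *shared = mmap (NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
      sem_t *gate = sem_open ("/demo_gate", O_CREAT, 0600, 1);

      sem_wait (gate);
      strcpy (shared, "large payload, written in place");
      sem_post (gate);

      /* Message passing: small chunks are copied into a queue, which
       * also arbitrates delivery (the role of SysLink's Notify and
       * Message components). */
      mqd_t mq = mq_open ("/demo_mq", O_CREAT | O_WRONLY, 0600, NULL);
      mq_send (mq, "start", 6, 0);

      return 0;
    }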

Therefore, SysLink version 2.0 provides a set of IPC components, covering both shared memory and message passing (figure 2):

  • MultiProc: identifies and names each available processor in the configuration.
  • SharedRegion: handles memory areas which will be shared across the different processors.
  • Gate: provides local and remote context protection, preventing preemption by another local thread and protecting memory regions from remote processors.
  • Message: supports variable-length message passing across processors.
  • Notify: registers callbacks that will be executed when a remote event is triggered.
  • HeapBuf: manages fixed-size buffers that can be used by multiple processors within the shared memory.
  • HeapMem: manages variable-size buffers that can be used by multiple processors within the shared memory.
  • List: provides a way to create, access and manipulate doubly linked lists in the shared memory.
  • NameServer: a dictionary table within the shared memory.
SysLink v2
Figure 2

All these operations are performed through ioctl calls to the /dev/syslink_ipc device.

Nonetheless, SysLink v2.0, being a master/slave configuration, must provide a way to control the slave processors: operations like starting and stopping them, loading an operating system into them, and so on. In that regard, there is a module called the Processor Manager, which is also operated through ioctl calls, both to the /dev/syslink-procmgr device and to the specific processors, which, in the case of the OMAP processors, are mapped into the device file-system as /dev/omap-rproc[0-9].
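To make the split between the two device nodes concrete, here is a purely illustrative sketch of the user-space side; the ioctl command names are hypothetical placeholders, since the real ones live in the SysLink headers.

    /* Illustrative only: the PROCMGR_CMD_* names below are invented
     * placeholders, not the real SysLink ioctl commands. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int
    main (void)
    {
      /* All the IPC components (MultiProc, SharedRegion, Notify, ...)
       * are multiplexed over a single device node. */
      int ipc_fd = open ("/dev/syslink_ipc", O_RDWR);

      /* The Processor Manager has its own node, used to load a
       * firmware image into a slave core and start it. */
      int proc_fd = open ("/dev/syslink-procmgr", O_RDWR);

      if (ipc_fd < 0 || proc_fd < 0) {
        perror ("open");
        return 1;
      }

      /* Hypothetical sequence: attach, load the slave's image, run. */
      /* ioctl (proc_fd, PROCMGR_CMD_ATTACH, &attach_args); */
      /* ioctl (proc_fd, PROCMGR_CMD_LOAD, &load_args);     */
      /* ioctl (proc_fd, PROCMGR_CMD_START, &start_args);   */

      close (proc_fd);
      close (ipc_fd);
      return 0;
    }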

Now, on top of all these IPC facilities, there must be a façade for all the remote operations that we could execute, and that façade is called Remote Command Messaging (RCM). Our application, running on the master processor, will act as the RCM client and will request remote commands from the RCM server, which runs on a slave processor. DCE (Distributed Codec Engine) is an open source RCM implementation, developed by Rob Clark, to expose distributed multimedia processing on the OMAP4.
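Conceptually, an RCM call boils down to marshalling a function index and its arguments into a message, and then waiting for the reply. The sketch below shows only the idea; none of these names come from the real RCM or libdce API.

    /* Conceptual sketch of an RCM exchange; every identifier here is
     * made up for illustration. */
    #include <string.h>

    typedef struct {
      unsigned fxn_idx;   /* index of the remote function to execute */
      unsigned result;    /* return value, filled in by the server   */
      char     args[64];  /* marshalled parameters                   */
    } rcm_msg;

    /* Hypothetical message-queue primitives provided by the IPC layer. */
    extern void msgq_put (int queue, const rcm_msg *msg);
    extern void msgq_get (int queue, rcm_msg *msg);

    static unsigned
    rcm_exec (int queue, unsigned fxn_idx, const void *args, size_t len)
    {
      rcm_msg msg = { .fxn_idx = fxn_idx };

      if (len > sizeof msg.args)
        len = sizeof msg.args;
      memcpy (msg.args, args, len);

      msgq_put (queue, &msg);   /* request travels through the IPC layer */
      msgq_get (queue, &msg);   /* block until the slave core replies    */
      return msg.result;
    }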

And I continue developing a thin API for SysLink 2.0 🙂

References

SysLink chronology

Introduction

For a while now, the processor market has observed a decline of Moore’s law: processing speed cannot be doubled every year anymore. That is why the old and almost forgotten discipline of parallel processing has had a kind of resurrection. GPUs, DSPs and multi-cores are names that have been kicking around the market recently, offering consumers more horsepower and more multi-tasking, but not more gigahertz per core, as it used to be.

Also, an ecological spirit has hit the chip manufacturers, who are embracing the green-computing concept. It fits perfectly with the decay of Moore’s law: more speed means more power consumption and more heat injected into the environment. Not good. A better solution, they say, is to have a set of specialized processors which are activated only when their specific task is requested by the user. Do you need extensive graphics processing? No problem, the GPU will deal with it. Do you need to decode or encode high resolution multimedia? We have a DSP for you. Are you only typing in an editor? We will turn off all the other processors, thus saving energy.

Even though this scenario seems quite idyllic, the hardware manufacturers have placed a lot of responsibility upon the software. The hardware is there, already wired into the main memory, but all the heuristics and logic to control the data flow among the processors remain a huge and still open problem, with multiple partial solutions currently available.

And that is the case of Texas Instruments, which, since its first generation of OMAP, has delivered to the embedded and mobile market a multicore chip with a DSP and a high-end ARM processor. But delivering a chip is not enough, so TI has had to provide system software capable of squeezing those heterogeneous processors.

At the beginning, TI delivered to its clients only a mere proof of concept, which each client could take as a reference for their own implementation. That was the case of dsp-gateway[1], developed by Nokia as an open source project.

But as the OMAP processor’s capacities increased, TI came under more pressure from its customers to deliver a general mechanism to communicate with the embedded processors.

For that reason, TI started to develop an Inter-Processor Communication (IPC) mechanism, whose approach is based on the concept that the general purpose processor (GPP), the host processor in charge of user interaction, can control and interact with the other processors as if they were just more devices in the system. Those devices are called slave processors.

Thus, this IPC mechanism runs in the host processor’s kernel space and fulfills the following responsibilities:

a) It allows the exchange of messages between the host processor and the slaves.

b) It can map files from the host’s file system into the memory space of the slave processor.

c) It permits the dynamic loading of basic operating systems and programs into the slave processors.

Also, TI has developed a programming library that provides an API with which developers can build applications that use the processing power of the slave processors.
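To illustrate how those three responsibilities might surface through such an API, here is a sketch in which every function is a hypothetical wrapper over the bridge ioctls, named only to show the flow.

    /* Hypothetical wrappers, named only to show the flow of the three
     * responsibilities listed above; the file paths are illustrative. */
    #include <stddef.h>

    extern int   bridge_attach (int proc_id);                         /* pick a slave       */
    extern int   bridge_load (int proc, const char *image);           /* (c) load a program */
    extern void *bridge_map (int proc, const char *file, size_t len); /* (b) map a file     */
    extern int   bridge_send (int proc, const void *msg, size_t len); /* (a) send a message */

    static int
    offload (void)
    {
      int dsp = bridge_attach (0);

      bridge_load (dsp, "/lib/dsp/baseimage.dof");
      bridge_map (dsp, "/tmp/input.raw", 4096);
      return bridge_send (dsp, "go", 3);
    }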

Sadly, whilst development on the host processor, typically a Linux environment, is open and mostly free, development for the slave cores (DSPs in general) is still closed and controlled by TI’s licenses. Nevertheless, TI has provided gratis, but closed, binary objects which can be loaded into the slave cores to do multimedia processing.

Well, actually there have been some efforts to develop a GCC backend[2] for the C64x DSP, and also an LLVM backend[3], with which, at least theoretically, we could write programs to be loaded and executed through these IPC mechanisms. But they are not mature enough to be used seriously.

DSPLink and DSPBridge

In order to develop a general purpose IPC for its OMAP processor family, TI designed DSPBridge[4]: oriented to multi-slave systems, operating-system agnostic, able to handle power management demands, and other industrial-weight buzzwords.

But DSPBridge was not ready for production until the OMAP3 came out on the market. That is why another group inside TI narrowed the scope of DSPBridge’s design and slimmed down its implementation, bringing out DSPLink[5], capable of running on OMAP2, OMAP3 and also on the DaVinci family.

DSPLink is distributed as an isolated kernel module and a programming library, along with the closed binaries that run on the DSP side. Nevertheless, the kernel module does not meet the requirements to be mainlined into the kernel tree. Also, it lacks power management features and dynamic MMU support.

On the other hand, DSPBridge has been brewed to be mainlined into the kernel, though it has been stuck in Greg’s staging tree for a long time. It seems that all the resources within TI are devoted to SysLink, the next generation of IPC. Nonetheless, many recent OMAP3-based devices use this IPC mechanism for their multimedia applications.

Initially, TI offered an OpenMAX layer on top of the DSPBridge user-space API to process multimedia in the C64x DSP, but that solution was too bloated for some developers, and the gst-dsp[6] project appeared, which reuses the DSP codecs available in TI’s OpenMAX implementation, along with the DSPBridge kernel module, to provide a thin and lean interface through the GStreamer framework.

SysLink and OMAP4

Then OMAP4 came into existence. It is no longer just a DSP and a high-end ARM processor: it has a DSP, a dual ARM Cortex-M3 and, as host processor, a dual ARM Cortex-A9. Five processing units on a single piece of silicon! How on earth will we share information among all of them? DSPBridge was not designed with this scenario in mind.

The ARM Cortex-M3’s purpose is to process video and images, and for that reason a TILER-based memory allocation is proposed, where memory buffers are already perceived as 2D and fast mirroring and rotation operations are available.

Regretfully, in the case of the Pandaboard (OMAP4430), the available DSP has lower capabilities than the one in the Beagleboard’s OMAP3, so the codecs published for the OMAP3 DSP cannot be reused on the Pandaboard. But video codecs for the M3 cores are currently available, and they are capable of processing high definition resolutions.

The answer is SysLink, where, besides the three operations developed for DSPBridge, three more core responsibilities were added:

a) Zero-copy shared memory: the ability to “pass” data buffers to other processors by simply providing their location in shared memory.

b) TILER-based memory allocation: allocate 2D buffers with mirroring and rotation options.

c) Remote function calls: one processor can invoke functions on a remote processor.
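A hypothetical illustration of the three additions working together; none of these names come from the real SysLink headers.

    /* Invented names, for illustration only. */
    #include <stdint.h>

    extern void    *tiler_alloc_2d (unsigned width, unsigned height,
                                    unsigned bytes_per_pixel);             /* (b) */
    extern uint32_t shared_region_ptr (void *local_addr);                  /* (a) */
    extern int      notify_remote (int proc, uint32_t sr_ptr);
    extern int      remote_call (int proc, unsigned fxn, uint32_t sr_ptr); /* (c) */

    static int
    decode_frame_remotely (int m3_core)
    {
      /* (b) A TILER buffer is addressed as a 2D surface, so rotation
       * and mirroring become cheap addressing tricks. */
      void *frame = tiler_alloc_2d (1920, 1080, 2);

      /* (a) Zero-copy: only a shared-region pointer, which every
       * processor can translate, crosses the IPC link. */
      uint32_t sr_ptr = shared_region_ptr (frame);
      notify_remote (m3_core, sr_ptr);

      /* (c) Invoke the decoder function on the remote core. */
      return remote_call (m3_core, 1 /* hypothetical fxn index */, sr_ptr);
    }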

The stack offered is similar to the OMAP3 one: in user space we start with the SysLink API, then an OpenMAX layer, now called DOMX, and finally the gst-openmax elements for the GStreamer framework. And again, a bloated, buzzword-laden stack for multimedia.

In his spare time, Rob Clark developed a proof of concept to remove the DOMX/gst-openmax layers and provide a set of GStreamer elements that talk directly with the SysLink API: libdce[7]/gst-ducati[8].

Either way, I feel more comfortable with the approach proposed by Felipe Contreras in gst-dsp: a slim and simple API to SysLink, and plain GStreamer elements using that API. For that reason, I started to code a minimalistic API for the SysLink interface, copying the spirit of dsp_bridge[9]: https://gitorious.org/vj-pandaboard/syslink/

1. http://sourceforge.net/projects/dspgateway/
2. http://www.qucosa.de/fileadmin/data/qucosa/documents/4857/data/thesis.pdf
3. https://www.studentrobotics.org/trac/wiki/Beagleboard_DSP
4. http://www.omappedia.org/wiki/DSPBridge_Project
5. http://processors.wiki.ti.com/index.php/Category:DSPLink
6. https://code.google.com/p/gst-dsp/
7. https://github.com/robclark/libdce
8. https://github.com/robclark/gst-ducati
9. https://github.com/felipec/gst-dsp/blob/HEAD/dsp_bridge.h

AAC decoder for gst-dsp

One of my goals for this year was to collaborate with the gst-dsp project. But the JPEG decoder was not enough, as there are other socket nodes released by TI which are not yet wrapped in gst-dsp, such as, in this case, the AAC decoder.

gst-dsp is a project designed with only the video decoding use case in mind, and audio streams might not be handled optimally by it. The biggest concern was the memory mapping of small buffers, which could consume more CPU than the decoding itself. Nevertheless, I decided to give it a try.

I devoted these last two weeks to pulling out the dspadec element, and it seems to perform quite well without any significant modification to the gst-dsp core. You can find the submitted patches on the gst-dsp mailing list.

These patches are still under review and perhaps they will not land in the repository. Nevertheless, I also updated marmita’s recipes in order to provide an installable image for the Beagleboard, so anybody can test this new element.

Among other goodies, this marmita snapshot brings libsoup and the souphttpsrc GStreamer element, besides all the ALSA bits for audio rendering. gdb also made it into the tarball.
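If you want to give the new element a spin, something along these lines should exercise it. The pipeline itself is an assumption on my part, since dspadec may need a parser or specific caps upstream, and the stream URL is a placeholder.

    /* Quick test for dspadec (GStreamer 0.10 API). The pipeline is an
     * assumption and the URL is a placeholder. */
    #include <gst/gst.h>

    int
    main (int argc, char *argv[])
    {
      GstElement *pipeline;
      GstBus *bus;
      GError *error = NULL;

      gst_init (&argc, &argv);

      pipeline = gst_parse_launch (
          "souphttpsrc location=http://example.com/stream.aac "
          "! dspadec ! alsasink", &error);
      if (pipeline == NULL) {
        g_printerr ("Could not build pipeline: %s\n", error->message);
        return 1;
      }

      bus = gst_element_get_bus (pipeline);
      gst_element_set_state (pipeline, GST_STATE_PLAYING);

      /* Block until the stream ends or the decoder reports an error. */
      gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
          GST_MESSAGE_ERROR | GST_MESSAGE_EOS);

      gst_element_set_state (pipeline, GST_STATE_NULL);
      return 0;
    }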

Enjoy it!

These bytes were brought to you thanks to Igalia, who sponsored this development.

Pandaboard – Chapter One

Finally, last Friday the Pandaboard arrived at the office. Thanks to all the guys at TI who generously decided to give me one, especially Jayabharath Goluguri and Rob Clark.

The next Saturday, Ryan Lortie and I came to the office to fool around with the new toy. Just one word: impressive.

Ryan set up the Ubuntu Maverick Meerkat image for the Pandaboard. At the beginning we ran into a couple of problems, but with the help of Ogra we coped with them. First lesson learnt:

The SD card for Maverick must be at least 4 GB, no less.

Second lesson learnt:

USB is unable to power the Pandaboard. Thou shalt power the board with a normal 5V power supply.

We also found that my monitor, an LG Flatron W2261VP, is not handled well by the PVR X driver, though it is still usable at 640×480. Ryan filed a bug about this issue.

It was a great start-up experience. The next stage is to play with SysLink and DOMX. Anyway, I won’t use Ubuntu for it; my plan is to go with the minimal-fs stuff.

dsp-exec landed on dsp-tools

In the DSPBridge realm, when the kernel module is loaded, it in turn loads the so-called DSP base image, which is a file that encompasses the DSP/BIOS kernel and the DSP/BIOS Bridge.

Usually, in a development cycle, you may want to test different base images, and removing and reloading the Linux bridgedriver module is not very practical. For this case, TI provides the cexec.out utility, which uses the bloated libdspbridge API to load different DSP base images at runtime.

But we all know that the cool kids use dsp_bridge instead of libdspbridge, since it is much cleaner, smaller and nicer. We also have a neat set of utilities called dsp-tools. Nevertheless, a utility like cexec.out was missing, and that is why dsp-exec was born.

Last week, my patches were committed by FelipeC into the staging repository on GitHub, and I’m enjoying them while poking around in audio decoding 🙂

jpeg decoder in gst-dsp

Do you remember this comment? Well, I took the challenge, and it has been hard to accomplish.

I started with an early implementation by FelipeC, but then I realised that I didn’t understand a bit of what was going on.

So I outlined a strategy with two milestones: first, try to clean up the TI OpenMAX IL implementation, and then add the JPEG decoder to gst-dsp.

In order to clean up TI’s OMX IL, FelipeC recommended that I rewrite libdspbridge using dsp_bridge underneath, but I didn’t see a real gain in that task. It makes sense for a progressive TI OMX IL update without breaking the ABI, but that was not my purpose, so I went a step further and decided to rewrite the LCML in terms of dsp_bridge instead.

LCML is the acronym for Linux Common Multimedia Layer. It is a shared library, loaded at run time by the OMX components, which provides the communication between the ARM-side application and the multimedia DSP socket node (SN). It is built upon libdspbridge for the interaction with the DSPBridge kernel module.
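Being loaded at run time means the OMX components pull the LCML in with dlopen(). Roughly like this, although the library name and the entry-point symbol below are illustrative, not the exact ones TI uses:

    /* Sketch of the run-time loading; library path and symbol name
     * are assumptions. */
    #include <dlfcn.h>
    #include <stdio.h>

    typedef void *(*lcml_get_handle_fn) (void);

    static void *
    load_lcml (void)
    {
      void *lib = dlopen ("libLCML.so", RTLD_NOW);
      if (lib == NULL) {
        fprintf (stderr, "dlopen: %s\n", dlerror ());
        return NULL;
      }

      /* Resolve the factory symbol and ask it for an LCML handle,
       * which then brokers all traffic between the OMX component and
       * the socket node. */
      lcml_get_handle_fn get_handle =
          (lcml_get_handle_fn) dlsym (lib, "GetHandle");
      return get_handle ? get_handle () : NULL;
    }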

My task was to rewrite the LCML, removing the libdspbridge linking dependency. You can see the result in my lcml-ng branch.

During the rewrite I came to understand the communication protocol used by the socket nodes. The clean-up was really painful because the LCML code is very messy and poorly designed. And I have to say this: Hungarian notation must be buried deep down into oblivion.

It was my intention to keep ABI compatibility, but I preferred readability, so I ended up breaking it.

With a clear idea of how the LCML library works, I took up the challenge again, with some degree of success: the SN was loaded and allocated correctly, and I also found that the input port admits only one buffer, not two as with the rest of the video decoders in gstdspvdec. But when everything looked promising and the input buffer was pushed, the SN threw a critical error event and the output buffer was never received.

I had to do more than merely understand the LCML; I had to rewrite the JPEG decoder OMX component too. But this time the code was even more obfuscated than that of the LCML.

And I had an epiphany: developing software in a community implies having clean and readable code, for the sake of peer review by people with heterogeneous backgrounds. Meanwhile, under the closed, internal development approach, QA is based on black-box testing, where cleanness of code is not a praised virtue, but rather the opposite.

I rewrote all the JPEG decoder bits, from the component to its test application, but I have not pushed that branch to gitorious yet.

Finally, I came across the missing parts: each buffer pushed into the SN must carry metadata, a structure with information about the buffer; in the case of the JPEG decoder, there were also a couple of magic numbers. The output buffer also comes with metadata, which, among other information, indicates whether the buffer was decoded correctly.
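To give an idea of the shape of that metadata, here is a hypothetical sketch; the real layout is dictated by the socket-node interface, and the magic numbers are deliberately not reproduced here.

    /* Hypothetical shape of the per-buffer metadata; the real layout
     * is defined by the socket-node interface. */
    #include <stdint.h>

    struct sn_in_meta {
      uint32_t size;        /* number of valid bytes in the buffer */
      uint32_t flags;       /* stream flags, e.g. end-of-stream    */
      /* ... decoder-specific fields and magic numbers ... */
    };

    struct sn_out_meta {
      uint32_t status;      /* whether the buffer decoded correctly */
      uint32_t bytes_used;  /* amount of data the SN produced       */
      /* ... */
    };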

Yesterday, before meeting my mates for the cinema, I emailed a couple of patches with initial support for JPEG decoding in gst-dsp.

The next task is to find a strategy for assigning the number of buffers on each port, so that it can be defined as late as possible.

my DSP related activities

When I started to play with the Beagleboard, my objective was to poke at the DSP-accelerated codecs through OpenMAX and GStreamer. But I soon realized that it would be a hard task to achieve, since the framework, developed by Texas Instruments, is partly proprietary (though gratis), and the other part is open source, but not developed with an open source community in mind.

When I started to pull it together, the first decision I had to face was choosing a cross-compiling environment. As you know, there are plenty: Scratchbox, OpenEmbedded, Buildroot, PTXdist, etc. Just because people I know at TI had begun to write recipes for Poky, I devoted some time to learning about it. After a while, though, I jumped to OpenEmbedded. The reason was Poky’s slow rate of updating against upstream (and as I don’t follow the project anymore, I’m not aware of its current state).

But BitBake and OpenEmbedded are not a magic wand that builds a complete bootable image for a device. There are a lot of things to define beforehand; just to mention one, the distribution to build. By default, OpenEmbedded offers Angstrom. But I did not want a full-featured distribution; I wanted something thin and lean, with only a serial shell to start with, something I could set up as a workbench for my multimedia experiments.

And for that reason, marmita was born.

As you may see, I mimicked the “Poky way”, making an overlay of OpenEmbedded, but as I got involved in the OE community, I realized that maybe it was not the correct decision; maybe I should push my changes into Angstrom instead.

Anyway, right now I have a steady set of recipes which allows me to build images for the Beagleboard, with the latest dspbridge kernel branch and many of the TI bytes (both proprietary and open source) required for running the DSP-accelerated codecs.

On the other hand, I have revamped the DSP how-to on the elinux.org wiki, with instructions to build a kernel with DSP/BIOS Bridge support and the means to test communication with the DSP through it.

Along the way, I became aware of TI’s problems releasing its DSP user-space stack to the open source community. Even though the kernel side is moving quite well towards mainline, the user-space bits are not doing that well; what’s more, they will soon be completely deprecated, because the kernel interface is still evolving.

At the lowest layer, on the ARM side of the DSP user-space stack, there is a library known as libbridge, which is basically an abstraction of the ioctls to the dspbridge kernel module. It offers an interface with nice semantics, but it is too aligned with the old Win16/32 API style (a bad idea, in my opinion).

But the problem does not start with the API style; it begins with locating the library inside a chaotic bunch of files, insanely bundled in a git repository along with binaries (for both Windows and Linux), tarballs, and all sorts of unrelated documentation.

Furthermore, the image-building machinery within TI is a custom set of invasive makefiles, which all the projects must include and conform to. As a result of this highly coupled build engine, extracting and isolating a project for release is a painful and error-prone process.

Given those problems, I got lost as soon as I began. So I decided to emulate Felipe Contreras’ approach: get rid of libbridge and use his minimalistic dspbridge ioctl wrapper, dsp_bridge, which he uses for gst-dsp. In order to train myself in these topics, I wrote a clone of TI’s ping application using dsp_bridge instead, and later on I wrote the DSP socket node counterpart. Both are included in the dsp-samples repository.
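To give a taste of the wrapper’s flavour, a ping round trip looks roughly like this; the prototypes are simplified for illustration and do not match dsp_bridge.h exactly.

    /* Rough shape of the ping round trip; simplified prototypes. */
    #include <stdbool.h>

    extern int  dsp_open (void);     /* open the bridge device node   */
    extern bool dsp_attach (int fd); /* simplified: attach to the DSP */
    extern bool dsp_send_message (int fd, void *node,
                                  unsigned cmd, unsigned a, unsigned b);
    extern bool dsp_get_message (int fd, void *node,
                                 unsigned *cmd, unsigned *a, unsigned *b);

    static bool
    ping_once (void *ping_node)
    {
      int fd = dsp_open ();
      unsigned cmd, a, b;

      dsp_attach (fd);
      dsp_send_message (fd, ping_node, 1, 0, 0);  /* ping the socket node */
      return dsp_get_message (fd, ping_node, &cmd, &a, &b); /* echo back */
    }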

Currently, the ping application is also included in MeeGo’s dsp-tools repository.

Nevertheless, most of the DSP multimedia codecs are exposed, out of the box, through TI’s OpenMAX IL implementation, and it depends on libbridge. For that reason, I ripped libbridge out of the userspace-dspbridge repository and pushed it into a different repository, cleaned up its building machinery, and removed other unneeded bytes too.

Finally, I had to do the same for the OpenMAX IL, which is not only entangled with the internal building machinery, but is not even released through a git repository yet, using old-fashioned tarballs instead.

Future work will be to integrate gst-openmax into marmita and to try to participate in gst-dsp development. Also, FelipeC came up with the idea of rewriting libbridge in terms of dsp_bridge, a task that I have been exploring lately.