A GStreamer Video Sink using KMS

The purpose of this blog post is to show the concepts related to GstKMSSink, a new video sink for GStreamer 1.0, which Alessandro Decina and I co-developed during my hackfest time in Igalia’s multimedia team.

One interesting thing to notice is that this element shows that it is possible to write DRI clients without the burden of the X Window System.

Brief introduction to graphics in Linux

If you want to dump images onto your screen, you can simply use the frame buffer device. It provides an abstraction for the graphics hardware and represents the frame buffer of the video hardware. This kernel device allows user applications to access the graphics hardware without knowing the low-level details [1].

In GStreamer, we have two options for displaying images using the frame buffer device (or three, if we use OMAP3): fbvideosink, fbdevsink and gst-omapfb.

Nevertheless, since the appearance of GPUs, the frame buffer device interface has not been sufficient to exploit all their capabilities. A new kernel interface had to emerge. And that was the Direct Rendering Manager (DRM).

What in the hell is DRM?

The DRM layer is intended to support the needs of complex graphics devices, usually containing programmable pipelines well suited to 3D graphics acceleration [2]. It deals with [3]:

  1. It maintains a DMA queue for graphics buffer transfers [4].
  2. It provides locks for the graphics hardware, treating it as a shared resource for simultaneous 3D applications [5].
  3. It provides secure hardware access, preventing clients from escalating privileges [6].

The DRM layer consists of two in-kernel drivers: a generic DRM driver, and another with specific support for the video hardware [7]. This is possible because the DRM engine is extensible, enabling the device-specific driver to hook in those functionalities required by the hardware. For example, in the case of Intel cards, the Linux kernel driver i915 supports the card and couples its capabilities to the DRM driver.

The device-specific driver, in particular, should cover two main kernel interfaces: Kernel Mode Setting (KMS) and the Graphics Execution Manager (GEM). Both are also exposed to user space through the DRM.

With KMS, the user can ask the kernel to enable the native resolution of the frame buffer, setting a certain display resolution and colour depth mode. One of the benefits of doing it in the kernel is that, since the kernel is in complete control of the hardware, it can switch back safely in case of failure [8].

In order to allocate command buffers, cursor memory, scanout buffers, etc., the device-specific driver should support a memory manager, and GEM is the manager with the most acceptance these days, because of its simplicity [9].

Besides graphics memory management, GEM ensures conflict-free sharing of data between applications by managing memory synchronization. This is important because modern graphics hardware is essentially a NUMA environment.

The following diagram shows the components view of the DRM layer:

Direct Rendering Infrastructure

What is the deal with KMS?

KMS is important because DRM and GEM rely on it to allocate frame buffers and to configure the display. And it is important to us because almost all of the ioctls called by the GStreamer element are part of the KMS subset.

What is more, some voices say that KMS is the future replacement for the frame buffer device [10].

To carry out its duties, KMS defines five main concepts [11,12]:

Frame buffer:
The frame buffer is just a buffer, in the video memory, that holds an image encoded as an array of pixels. As KMS configures the ring buffer in this video memory, it holds the information of this configuration, such as width, height, colour depth, bits per pixel, pixel format, and so on.
CRTC:
Stands for Cathode Ray Tube Controller. It reads the data out of the frame buffer and generates the video mode timing. The CRTC also determines which part of the frame buffer is read; e.g., when multi-head is enabled, each CRTC scans out a different part of the video memory, while in clone mode each CRTC scans out the same part of the memory. Hence, from the KMS perspective, the CRTC abstraction contains the display mode information, including resolution, depth, polarity, porch, refresh rate, etc. It also holds the information of the buffer region to display and when to change to the next frame buffer.
Overlay planes:
Overlays are treated a little like CRTCs, but without associated modes or encoder trees hanging off of them: they can be enabled with a specific frame buffer attached at a specific location, and they don’t have to worry about mode setting, though they do need an associated CRTC to actually pump their pixels out [13].
Encoder:
The encoder takes the digital bitstream from the CRTC and converts it to the appropriate format, across the connector, for the monitor.
Connector:
The connector provides the appropriate physical plug for the monitor, such as HDMI, DVI-D, VGA, S-Video, etc.

And what about this KMSSink?

KMSSink is a first approach towards a video sink as a DRI client. For now it only works on the PandaBoard with a recent kernel (I guess 3.3 would make it).

For now it only handles custom non-tiled buffers and uses an overlay plane to display them. So adding support for more hardware is on the to-do list.


[1] http://free-electrons.com/kerneldoc/latest/fb/framebuffer.txt
[2] http://free-electrons.com/kerneldoc/latest/DocBook/drm/drmIntroduction.html
[3] https://www.kernel.org/doc/readme/drivers-gpu-drm-README.drm
[4] http://dri.sourceforge.net/doc/drm_low_level.html
[5] http://dri.sourceforge.net/doc/hardware_locking_low_level.html
[6] http://dri.sourceforge.net/doc/security_low_level.html
[7] https://en.wikipedia.org/wiki/Direct_Rendering_Manager
[8] http://www.bitwiz.org.uk/s/how-dri-and-drm-work.html
[9] https://lwn.net/Articles/283798/
[10] http://phoronix.com/forums/showthread.php?23756-KMS-as-anext-gen-Framebuffer
[11] http://elinux.org/images/7/71/Elce11_dae.pdf
[12] http://www.botchco.com/agd5f/?p=51
[13] https://lwn.net/Articles/440192/

GStreamer video decoder for SysLink

This blog post is another one in our series about SysLink (1, 2): I finally came up with a usable GStreamer element for video decoding, which talks directly with the SysLink framework in the OMAP4 kernel.

As we stated before, SysLink is a set of kernel modules that enables the initialization of remote processors in a multi-core system (which might be heterogeneous), each running its own operating system in its own memory space; SysLink also enables the communication between the host processor and the remote ones. This software and hardware setup can be viewed as an Asymmetric Multi-Processing model.

TI provides a user-space library to access the SysLink services, but I find its implementation a bit clumsy, so I took on the challenge of rewriting a part of it, in a simple and straightforward fashion, as gst-dsp does for DSP/Bridge. The result is the interface syslink.h.

Simultaneously, I wrote a utility to load and monitor the operating system on the Cortex-M3 processors of the PandaBoard. This board, like all OMAP4-based SoCs, has two ARM Cortex-M3 cores as remote processors. Hence this so-called daemon.c is in charge of loading the firmware images, setting the processors into their running state, allocating the interchange memory areas, and monitoring for any error message.

In order to load the image files into the processors’ memory areas, it is required to parse the ELF headers of the files, and that is the reason why I decided to depend on libelf rather than write another ELF parser. Yes, one sad dependency for the daemon. The use of libelf is isolated in elf.h.

While developing the daemon, for debugging purposes, I needed to trace the messages generated by the images in the remote processors. For that reason I wrote tracer.c, whose only responsibility is to read and parse the ring buffer used by the images, in the remote processors, for logging.

Now, in OMAP4, the subsystem comprising the two Cortex-M3 processors is called Ducati. The first processor is used only for the exchange of notification messages between the host processor and the second M3 processor, where all the multimedia processing is done.

There are at least two images for the second Cortex-M3 processor: DOMX, which is closed source and focused, as far as I know, on the OMX IL interface; and, on the other hand, DCE, which is open source, developed by Rob Clark, and provides a simple buffer-interchange interface.

My work uses DCE, obviously.

But, please, let me go back one step in this component description: in order to send and receive commands between the host processor and a remote processor, SysLink uses a packet-based protocol, called Remote Command Messaging, or just RCM for friends. There are two types of RCM interfaces, the client and the server. The client interface is used by the applications running in the host processor to request services from the server interface, exposed by the systems running in the remote processors, which accepts the requests and returns the results.

The RCM client interface is in rcm.h.

Above the RCM client sits my dce.h interface, which is in charge of controlling the state of the video decoders and also of handling the buffers.

But these buffers are tricky. They are not memory areas allocated by a simple malloc; instead they are buffers allocated by a kernel mechanism called tiler. The purpose of this mechanism is to provide buffers on which 2D operations can be done in hardware (in other words, cheaply and quickly). These buffers are shared along the whole processing pipeline, so no copies of memory areas are needed. Of course, in order to achieve this paradise, the video renderer must handle this type of buffers too.

In my code, the interface to the tiler mechanism is in tiler.h.

And finally, the almighty GStreamer element for video decoding: gstsyslinkvdec.c! Following the spirit of gst-dsp, this element is intended to deal with all the video decoders available in the DCE image, although for now H264 decoding is the only one handled.

For now, I have only tested the decoder with fakesink, because the element pushes tiled buffers onto the source pad, and, in order to have an efficient video player, a video renderer that handles this type of tiled buffers is required. TI is developing one, pvrvideosink, but it depends on EGL, and I would like to avoid X whenever possible.

I have not yet measured the performance of this work compared with TI’s combo (syslink user-space library / memmgr / libdce / gst-ducati), but I suspect that my approach is a little more efficient, faster, and, at least, simpler 😉

The sad news is that, as in every fast-paced development, all these kernel mechanisms are already deprecated: SysLink and DMM-Tiler will never be mainlined into the kernel, but their successors, rproc/rpmsg and omapdrm, have a good chance. And both take a very different approach from their predecessors. Nevertheless, SysLink is already here and is being used widely, so this effort has a chance of being worthy.

My current task is to decide whether I should drop the 2D buffers in the video decoders or develop a video renderer for them.

Diving into SysLink v2

Following up on our SysLink saga, now we will dive into its internals.

As we stated before, SysLink is a software stack which implements Inter-Processor Communication (IPC), whose purpose is to enable Asymmetric Multi-Processing (AMP).

Most readers are more familiar with the concept of Symmetric Multi-Processing (SMP), which is the most common approach to handling multiple processors in a computer, where a single operating system controls and spreads the processing load among the available processors. Typically these processors are identical.

On the contrary, AMP is designed to deal with different kinds of processors, each one running a different instance of an operating system, with different architectures and interfaces. The typical approach is to have a master processor with one or more slave units, all of them sharing the same memory area; hence the operating system running on the master processor exposes the other processors as just more devices available in the system (figure 1).

Asymmetric Multi-Processing (AMP)
Figure 1

The main advantage of AMP, as we mentioned in the previous post, is that we can integrate specialized processors and delegate to them tasks that are rather expensive to execute on our general purpose processor, such as multimedia, cryptography, real-time processing, etc.

All the operating systems executed in an AMP system must share, at least, one key component: a mechanism of Inter-Processor Communication, because with it they will be able to interchange data and synchronize tasks among each other.

Basically there are two types of IPC: shared memory and message passing. With shared memory we avoid copying the data to be processed by another unit, but we also need complicated semaphoring protocols in order to avoid overlapping, which leads to data corruption. On the other hand, message passing is oriented to the transfer of small chunks of data, called messages, sent by one processor and received by another, and it is used to notify, control and synchronize.

Therefore, SysLink version 2.0 provides a set of components for IPC, for both shared memory and message passing (figure 2):

  • MultiProc: identifies and names each available processor in the configuration.
  • SharedRegion: handles memory areas which will be shared across the different processors.
  • Gate: provides local and remote context protection, preventing preemption by another thread locally and protects memory regions from remote processors.
  • Message: supports the variable length message passing across processors.
  • Notify: registers callbacks that will be executed when a remote event is triggered.
  • HeapBuf: manages fixed size buffers that can be used by multiple processors within the shared memory.
  • HeapMem: manages variable size buffers that can be used by multiple processors within the shared memory.
  • List: provides a way to create, access and manipulate doubly linked lists in the shared memory.
  • NameServer: a dictionary table within the shared memory.
SysLink v2
Figure 2

All these operations are done through ioctl operations to the /dev/syslink_ipc device.

Nonetheless, SysLink v2.0, being a master/slave configuration, must provide a way to control the slave processors: operations like starting and stopping them, loading an operating system into them, and so on. In that regard, there is a module called the Processor Manager, which is also operated through ioctl calls to the /dev/syslink-procmgr device, and also to the specific processors, which are mapped into the device file-system as /dev/omap-rproc[0-9] in the case of the OMAP processors.

Now, on top of all these IPC facilities, there must be a façade for all the remote operations that we could execute, and that façade is called Remote Command Messaging (RCM). Our application, which runs on the master processor, will act as an RCM client and will request remote commands from the RCM server, which runs on a slave processor. DCE (Distributed Codec Engine) is an open source RCM server, developed by Rob Clark, to expose distributed multimedia processing on the OMAP4.

And I continue developing a thin API for SysLink 2.0 🙂


SysLink chronology


For a while now, the processor market has observed a decline of Moore’s law: processing speed cannot be doubled every year anymore. That is why the old and almost forgotten discipline of parallel processing has had a kind of resurrection. GPUs, DSPs, multi-cores: these are names that have been kicking the market recently, offering consumers more horse power and more multi-tasking, but not more gigahertz per core, as it used to be.

Also, an ecologist spirit has hit the chip manufacturers, embracing the green-computing concept. It fits perfectly with the Moore’s law decay: more speed means more power consumption and more heat injected into the environment. Not good. A better solution, they say, is to have a set of specialized processors which are activated when their specific task is requested by the user: do you need extensive graphics processing? No problem, the GPU will deal with it. Do you need to decode or encode high resolution multimedia? We have a DSP for you. Are you only typing in an editor? We will turn off all the other processors, thus saving energy.

Even though this scenario seems quite idyllic, the hardware manufacturers have placed a lot of responsibility upon the software. The hardware is there, already wired into the main memory, but all the heuristics and logic to control the data flow among the processors is a huge and still open problem, with only multiple partial solutions currently available.

And that is the case of Texas Instruments, which, since its first generation of OMAP, has delivered to the embedded and mobile market a multicore chip, with a DSP and a high end ARM processor. But delivering a chip is not enough, so TI has had to provide the system software capable of squeezing those heterogeneous processors.

At the beginning, TI only delivered to its clients a mere proof of concept, which each client could take as a reference for their own implementation. That was the case of dsp-gateway[1], developed by Nokia as an open source project.

But as the OMAP processor’s capacities increased, TI came under more pressure from its customers to deliver a general mechanism to communicate with the embedded processors.

For that reason TI started to develop a mechanism of Inter-Processor Communication (IPC), whose approach is based on the concept that the general purpose processor (GPP), the host processor in charge of the user interactions, can control and interact with the other processors as if they were just other devices in the system. Those devices are called slave processors.

Thus, this designed IPC mechanism runs in the host processor’s kernel space and it fulfills the following responsibilities:

a) It allows the exchange of messages between the host processor with the slaves.

b) It can map files from the host’s file system into the memory space of the slave processor.

c) It permits the dynamic loading of basic operating systems and programs into the slave processors.

Also, TI has developed a programming library that provides an API with which the developers could build applications that use the processing power of the slave processors.

Sadly, whilst the development on the host processor, typically a Linux environment, is open and mostly free, the development for the slave cores (DSPs in general) is still closed and controlled by TI’s licences. Nevertheless, TI has provided gratis, but closed, binary objects which can be loaded into the slave cores and do multimedia processing.

Well, actually there have been some efforts to develop a GCC backend[2] for the C64x DSP, and also an LLVM backend[3], with which, at least theoretically, we could write programs to be loaded and executed through these IPC mechanisms. But they are not mature enough to be used seriously.

DSPLink and DSPBridge

In order to develop a general purpose IPC for its OMAP processor family, TI designed DSPBridge[4]: oriented to multi-slave systems, agnostic to the operating system, and able to handle power management demands and other industrial-weight requirements.

But DSPBridge was not ready for production until the OMAP3 came out to the market. That is why another group inside TI narrowed the scope of DSPBridge’s design and slimmed its implementation, bringing out DSPLink[5], capable of running on OMAP2, OMAP3 and also on the DaVinci family.

DSPLink is distributed as an isolated kernel module and a programming library, along with the closed binaries that run on the DSP side. Nevertheless, the kernel module does not meet the requirements to be mainlined into the kernel tree. Also, it lacks power management features and dynamic MMU support.

On the other hand, DSPBridge has been brewed to be mainlined into the kernel, though it has been stuck in Greg’s staging tree for a long time. It seems that all the resources within TI are devoted to SysLink, the next generation of IPC. Nonetheless, many recent OMAP3-based devices use this IPC mechanism for their multimedia applications.

Initially, TI offered an OpenMAX layer on top of the DSPBridge user-space API to process multimedia on the C64x DSP, but that solution was too bloated for some developers, and the gst-dsp[6] project appeared, which reuses the DSP codecs available in TI’s OpenMAX implementation, along with the DSPBridge kernel module, to provide a thin and lean interface through the GStreamer framework.

SysLink and OMAP4

Then OMAP4 came into existence. It is not just a DSP and a high end ARM processor anymore. It has a DSP, a dual ARM Cortex-M3, and, as host processor, a dual ARM Cortex-A9. Five processing units in a single piece of silicon! How in hell will we share information among all of them? DSPBridge was not designed with this scenario in mind.

The ARM Cortex-M3 has the purpose of processing video and images, and for that reason a tiler-based memory allocation is proposed, where the memory buffers are perceived as 2D from the start, and fast operations such as mirroring and rotation are available.

Regretfully, in the case of the PandaBoard (OMAP4430), the available DSP has lower capacities than the one in the BeagleBoard’s OMAP3, so the published codecs for the OMAP3 DSP cannot be reused on the PandaBoard. But video codecs for the M3 cores are currently available, and they are capable of processing high definition resolutions.

The answer is SysLink, where, besides the three operations developed for DSPBridge, three more core responsibilities were added:

a) Zero-copy shared memory: the ability to “pass” data buffers to other processors by simply providing their location in shared memory.

b) TILER-based memory allocation: allocating 2D buffers with mirroring and rotation options.

c) Remote function calls: one processor can invoke functions on a remote processor.

The offered stack is similar to that of the OMAP3: in user space we start with the SysLink API, then an OpenMAX layer, now called DOMX, and finally the gst-openmax elements for the GStreamer framework. And again, a bloated, buzzworded stack for multimedia.

In his spare time, Rob Clark developed a proof of concept which removes the DOMX/gst-openmax layers and provides a set of GStreamer elements that talk directly with the SysLink API: libdce[7]/gst-ducati[8].

Either way, I feel more comfortable with the approach proposed by Felipe Contreras in gst-dsp: a slim and simple API to SysLink and plain GStreamer elements using that API. For that reason, I started to code a minimalistic API, copying the spirit of dsp_bridge[9], for the SysLink interface: https://gitorious.org/vj-pandaboard/syslink/

1. http://sourceforge.net/projects/dspgateway/
2. http://www.qucosa.de/fileadmin/data/qucosa/documents/4857/data/thesis.pdf
3. https://www.studentrobotics.org/trac/wiki/Beagleboard_DSP
4. http://www.omappedia.org/wiki/DSPBridge_Project
5. http://processors.wiki.ti.com/index.php/Category:DSPLink
6. https://code.google.com/p/gst-dsp/
7. https://github.com/robclark/libdce
8. https://github.com/robclark/gst-ducati
9. https://github.com/felipec/gst-dsp/blob/HEAD/dsp_bridge.h

Custom Ubuntu root file-system for the PandaBoard

Since I started to work with the BeagleBoard, one of my main concerns has been to generate a custom image with the minimal setup required for the task: in my opinion it is important to install only the necessary pieces and not to bloat the SD card with a full featured distribution. In this way we control the dependencies, and we remove possible points of failure which would only slow down our development/testing cycle.

In the case of an OMAP3-based board, or any other board with inferior capabilities, the solution lay in build systems with cross-compilation support, such as OpenEmbedded, Buildroot, etc. Those builders use cross-compilation to generate custom images for the target hardware.

But with OMAP4 we are on the borderline between an embedded system and a PC, and because of this I needed a kind of mixed approach: a native compilation environment within a custom image, in order to set it up as a buildbot slave.

As I had attained some experience with OpenEmbedded, I started to craft a custom native-sdk image, with rather thwarted results: after several days of patches and bug fixes, I was not able to compile, on the PandaBoard, the software I was willing to test.

Dodging the OE approach, I decided to uproot it, and I started to explore new paths. Soon I stumbled upon rootstock: a shell script which generates custom images of the Ubuntu distribution for the ARM architecture.

Nonetheless, I tried it under Debian, since I don’t have any machine with Ubuntu. The script is difficult to trace and its logs are not useful on many occasions, mainly when the processing occurs in the QEMU virtual machine. Anyway, I found that the QEMU distributed by Squeeze was not functional in this particular case, so I grabbed the latest code from its repository, compiled it, and it worked like a charm.

NOTE: use the latest code of rootstock, hosted in the Launchpad bazaar repository.

In the end I came up with a script where I set the parameters for my particular use case:


export TMPDIR=$(pwd)/tmp
export PATH=$PATH:/usr/sbin/:/sbin:/opt/qemu/bin

[ -d $TMPDIR ] || mkdir -p $TMPDIR


./rootstock --fqdn panda1.local.igalia.com \
            --login user \
            --password 000 \
            --imagesize 512M \
            --dist maverick \
            --serial ttyO2 \
            --seed $BASE,$V8_DEPS,$JSC_DEPS

In this way I can generate a minimal image, in a tarball, for my panda build-slave.

MeeGo on the PandaBoard

The objective was rather simple: run MeeGo on the PandaBoard.

I found that there is a wiki page on this topic, which provides a kickstart file for generating the file system; it is more than enough as a starting point.

Note: with the purpose of keeping track of the MeeGo kickstart files available for OMAP, I set up a repository on gitorious.

Nevertheless, the kernel did not work for me. Nor does the boot.scr provide the correct kernel parameters for the video subsystem. Hence, I borrowed what currently works: the x-loader, u-boot and kernel of the Ubuntu Maverick Meerkat for the Panda.

For the x-loader and the u-boot I just copied the binaries from the Ubuntu SD. In the boot.scr I got rid of the initrd fanciness. And for the kernel, I grabbed the latest Ubuntu kernel from its git repository, using the config file also taken from the SD.

My first approach was to have GLX/EGL through the SGX kernel module, which also comes in the Ubuntu image. You can find the kernel module and the proprietary libraries in the OMAP4 PPA.

This approach failed though: the MeeGo compositor (mcompositor) could not render correctly on either of the monitors where I tested the result.

So, my alternative approach was to get rid of the compositor and launch duihome, the window manager, directly. In order to do that, you must change the file /etc/sysconfig/uxlaunch with:

+session= /usr/bin/duihome -software -show-cursor

The -software parameter requests software rendering, not accelerated. The -show-cursor parameter requests a mouse pointer. Those options are consumed by the meegotouch library. Nevertheless, each application must be launched explicitly with them.

In order to circumvent this issue, there is already an environment variable for -software: M_USE_SOFTWARE_RENDERING=1. But for -show-cursor there is none, although it is not difficult to implement (patch).

Et voilà, we have a singing and dancing MeeGo environment. But still, the UI transitions are sorely missed.

Next step: SysLink, DOMX and GStreamer.

PandaBoard – Chapter One

Finally, last Friday the PandaBoard arrived at the office. Thanks to all the guys at TI who generously decided to give me one, especially to Jayabharath Goluguri and Rob Clark.

The next Saturday, Ryan Lortie and I came to the office to fool around with the new toy. Just one word: impressive.

Ryan set up the Ubuntu Maverick Meerkat image for the PandaBoard. At the beginning we ran into a couple of problems, but with the help of Ogra we overcame them. First lesson learnt:

The SD card for Maverick must be at least a 4GB card, not less.

Second lesson learnt:

USB alone is unable to power the PandaBoard. Thou shalt power the board with a normal 5V power supply.

Also, we found that my monitor, an LG Flatron W2261VP, is not well handled by the PVR X driver, but it is still usable at 640×480. Ryan filed a bug about this issue.

It was a great start-up experience. The next stage is to play with SysLink and DOMX. Anyway, I won’t use Ubuntu for it; my plan is to go with the minimal-fs stuff.