24
Mar 14

Munich

I spent a week in Munich. I went there for two reasons: to attend the Web & TV workshop organized by the W3C and to hack along with the gst-gang at the GStreamer Hackfest 2014. All of this was sponsored by Igalia, my company.

I arrived in Munich on Tuesday evening, and when I reached the Marienplatz metro station, I ran into a crowd of Bayern Munich fans, chanting songs about the glory of their team, huddling and dancing. And a lot of police officers surrounding the tracks.

The workshop was organized by the W3C Web and TV Interest Group, and intended to spark discussions around how to integrate and standardize TV technologies and the Web.

The first slide of the workshop

On Wednesday morning, the workshop began. People from Espial and Samsung talked about HbbTV, and Japanese broadcasters talked about their Hybridcast. Both technologies seek to enhance the television experience using Internet protocols, the former in Europe and the latter in Japan. Broadcasters from China also showed their approach, using ad-hoc technologies. I have to say that Hybridcast awed me.


Afterwards, most of the workshop revolved around the problem of the companion device. People showed their solutions and proposals, in particular for device discovery, data sharing and control. All the solutions relied on WebSockets and WebRTC for data sharing between devices.


During the panels, I greatly enjoyed Jon Piesing's participation, in particular his slide summarizing the specifications used by HbbTV V2. It's like juggling specs!

Specifications used by HbbTV V2

Broadcasters are very interested in Encrypted Media Extensions and media APIs, for example the specification for a Tuner API. There is also a lot of expectation about metadata handling and the definition of common TV ontologies.

Finally, there were a couple of talks about miscellaneous technologies surrounding IPTV broadcasting.

The second stage of my visit to Bavaria's capital was the GStreamer Hackfest. It took place at the Google offices, near Marienplatz.

Christian Schaller has written a very good summary of what happened during the hackfest. On my side, I worked with Nicolas Dufresne on the v4l2 video converter for the Exynos4, which is a piece required for hardware-accelerated decoding on that platform using the v4l2 video decoder.


26
Jul 13

Composited video support in WebKitGTK+

A couple months ago we started to work on adding support for composited video in WebKitGTK+. The objective is to play video in WebKitGTK+ using the hardware accelerated path, so we could play videos at high definition resolutions (1080p).

How does WebKit paint?

Basically we can perceive a browser as an application for retrieving, presenting and traversing information on the Web.

For the composited video support, we are interested in the presentation task of the browser. More particularly, in the graphical presentation.

In WebKit, each HTML element on a web page is stored as a Node object, and the Nodes are arranged in a tree called the DOM tree.

Then, each Node that produces visual output has a corresponding RenderObject, and they are stored in another tree, called the Render Tree.

Finally, each RenderObject is associated with a RenderLayer. These RenderLayers exist so that the elements of the page are composited in the correct order to properly display overlapping content, semi-transparent elements, etc.

It is worth mentioning that there is no one-to-one correspondence between RenderObjects and RenderLayers, and that there is a RenderLayer tree as well.

Render Trees in WebKit (from GPU Accelerated Compositing in Chrome).

WebKit fundamentally renders a web page by traversing the RenderLayer tree.

What is accelerated compositing?

WebKit has two paths for rendering the contents of a web page: the software path and hardware accelerated path.

The software path is the traditional model, where all the work is done in the main CPU. In this mode, RenderObjects paint themselves into the final bitmap, compositing a final layer which is presented to the user.

In the hardware accelerated path, some of the RenderLayers get their own backing surface into which they paint. Then, all the backing surfaces are composited onto the destination bitmap, and this task is the responsibility of the compositor.

With the introduction of compositing an additional conceptual tree is added: the GraphicsLayer tree, where each RenderLayer may have its own GraphicsLayer.

In the hardware accelerated path, the GPU is used to composite some of the RenderLayer contents.

Accelerated Compositing in WebKit (from Hardware Acceleration in WebKit).

As Iago said, accelerated compositing involves offloading the compositing of the GraphicsLayers onto the GPU, since it does the compositing very fast, relieving the CPU of that burden and delivering a better and more responsive user experience.

Although there are other options, typically OpenGL is used to render computer graphics, interacting with the GPU to achieve hardware acceleration. And WebKit provides a cross-platform implementation to render with OpenGL.

How does WebKit paint using OpenGL?

Ideally, we could go from the GraphicsLayer tree directly to OpenGL, traversing it and drawing the texture-backed layers with a common WebKit implementation.

But an abstraction layer was needed because different GPUs may behave differently, they may offer different extensions, and we still want to use the software path if hardware acceleration is not available.

This abstraction layer is known as the Texture Mapper, a light-weight scene-graph implementation specially attuned for efficient usage of the GPU.

It is a combination of a specialized accelerated drawing context (TextureMapper) and a scene-graph (TextureMapperLayer):

The TextureMapper is an abstract class that provides the necessary drawing primitives for the scene-graph. Its purpose is to abstract different implementations of the drawing primitives from the scene-graph.

One of the implementations is the TextureMapperGL, which provides a GPU-accelerated implementation of the drawing primitives, using shaders compatible with GL/ES 2.0.

There is a TextureMapperLayer which may represent a GraphicsLayer node in the GPU-renderable layer tree. The TextureMapperLayer tree is equivalent to the GraphicsLayer tree.

How does WebKitGTK+ play a video?

As we stated earlier, in WebKit each HTML element on a web page is stored as a Node in the DOM tree. And WebKit provides a Node class hierarchy for all the HTML elements. In the case of the video tag, there is a parent class called HTMLMediaElement, which aggregates a common, cross-platform media player, MediaPlayer. The MediaPlayer is a decorator for a platform-specific media player known as MediaPlayerPrivate.

All of the above is shown in the next diagram.

Video in WebKit. Three layers, from top to bottom.

In the GTK+ port the audio and video decoding is done with GStreamer. In the case of video, a special GStreamer video sink injects the decoded buffers into the WebKit process. You can think of it as a special kind of GstAppSink, and it is part of the WebKitGTK+ code-base.
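
To get an intuition for that injection mechanism, here is a minimal appsink-based sketch (not the actual WebKitGTK+ sink, just an illustration of the idea): the application code, rather than a regular display sink, receives each decoded buffer and decides what to do with it.

#include <gst/gst.h>
#include <gst/app/gstappsink.h>

/* Called by appsink every time a decoded frame is ready; the WebKitGTK+
 * private sink does something conceptually similar, handing the buffer
 * over to WebKit instead of printing its size. */
static GstFlowReturn
on_new_sample (GstAppSink *sink, gpointer user_data)
{
  GstSample *sample = gst_app_sink_pull_sample (sink);
  GstBuffer *buffer = gst_sample_get_buffer (sample);

  g_print ("got decoded buffer of %" G_GSIZE_FORMAT " bytes\n",
      gst_buffer_get_size (buffer));

  gst_sample_unref (sample);
  return GST_FLOW_OK;
}

static GstElement *
make_sink (void)
{
  GstElement *sink = gst_element_factory_make ("appsink", NULL);

  g_object_set (sink, "emit-signals", TRUE, NULL);
  g_signal_connect (sink, "new-sample", G_CALLBACK (on_new_sample), NULL);
  return sink;
}

Plugging such a sink as the video-sink of a playback pipeline is enough to route decoded frames into application code.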

And we come back to the two paths for content rendering in WebKit:

In the software path the decoded video buffers are copied into a Cairo surface.

But in the hardware accelerated path, the decoded video buffers shall be uploaded into an OpenGL texture. When a new video buffer is available to be shown, a message is sent to the GraphicsLayer asking for a redraw.

Uploading video buffers into GL textures

When we are dealing with big enough buffers, such as the high definition video buffers, copying buffers is a performance killer. That is why zero-copy techniques are mandatory.

Moreover, when we are working in a multi-processor environment, such as one with a CPU and a GPU, moving buffers between processor contexts is also very expensive.

For these reasons, the video decoding and the OpenGL texture handling should happen only in the GPU, without context switches and without copying memory chunks.

The simplest approach could be for the decoder to deliver an EGLImage, so we could bind the handle to the texture. As far as I know, the gst-omx video decoder on the Raspberry Pi works this way.

GStreamer added a new API, available in version 1.2, to upload video buffers into a texture efficiently: GstVideoGLTextureUploadMeta. This API is exposed through the buffer's metadata, and ought to be implemented by elements that produce decoded video frames, most commonly the video decoder.
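
As a rough sketch of the consumer side (assuming the GStreamer 1.2 API and an already-created GL texture; this is not the WebKitGTK+ code itself), the sink checks whether the buffer carries the meta and asks the producer to perform the upload:

#include <gst/gst.h>
#include <gst/video/gstvideometa.h>

/* Sketch of a sink-side consumer: if the producer attached a
 * GstVideoGLTextureUploadMeta, ask it to upload the frame into an
 * existing GL texture (texture_id is assumed to be a valid texture
 * name in the current GL context). */
static gboolean
upload_frame_to_texture (GstBuffer *buffer, guint texture_id)
{
  GstVideoGLTextureUploadMeta *meta;
  guint ids[4] = { texture_id, 0, 0, 0 };

  meta = gst_buffer_get_video_gl_texture_upload_meta (buffer);
  if (meta == NULL)
    return FALSE;               /* fall back to a plain memory copy */

  /* The producer (e.g. the decoder) does the upload, ideally without
   * bringing the frame through system memory. */
  return gst_video_gl_texture_upload_meta_upload (meta, ids);
}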

For example, in gstreamer-vaapi there are a couple of patches (still a work in progress) in Bugzilla enabling this API. At the low level, calling gst_video_gl_texture_upload_meta_upload() ends up calling vaCopySurfaceGLX(), which does an efficient copy of the VA-API surface into a texture using a GLX extension.

Demo

This is an old demo, from when all the pieces started to fit together, and it does not reflect the current performance. Still, it shows what has been achieved:

Future work

So far, all these bits are already integrated in WebKitGTK+ and GStreamer. Nevertheless there are some open issues.

  • gstreamer-vaapi et al.:
    GStreamer 1.2 is not released yet, and its new API might change. Also, the port of gstreamer-vaapi to GStreamer 1.2 is still a work in progress, and the available patches may have rough edges. There are also many other projects that need to be updated to this new API, such as clutter-gst, and to provide more feedback to the community.
    Another important task is to have more GStreamer elements implementing these new APIs, such as the texture upload and the caps features.
  • Tearing:
    The composited video task unveiled a major problem in WebKitGTK+: it does not handle the vertical blank interval at all, causing tearing artifacts that are clearly observable in high-resolution videos with high motion. WebKitGTK+ composites the scene off-screen, using an X Composite redirected window, and then displays it in an X Damage callback, but currently GTK+ does not take care of the vertical blank interval, causing this tearing artifact in heavy compositions.
    At Igalia, we are currently researching a way to fix this issue.
  • Performance:
    There is always room for performance improvement, and we are always aiming in that direction: improving the frame rate, the CPU, GPU and memory usage, et cetera.
    So stay tuned, or even better, come and help us.

08
Apr 13

GStreamer Hackfest 2013 – Milan

Last week, from the 28th to the 31st of March, some of us gathered in Milan to hack on some bits of the GStreamer internals. For me it was a great experience to interact with great hackers such as Sebastian Dröge, Wim Taymans, Edward Hervey, Alessandro Decina and many more. We talked about GStreamer and, more particularly, we agreed on new features which I would like to discuss here.

GStreamer Hackers at Milan

For the sake of completeness, let me say that I have been interested in hardware-accelerated multimedia for a while, and just lately I started to get my feet wet with VA-API and VDPAU, and their support in our beloved GStreamer.

GstContext

The first feature that reached upstream is GstContext. Historically, in 2011, Nicolas Dufresne added GstVideoContext as an interface to share a video context (such as a display name, an X11 display, a VA-API display, etc.) among the pipeline elements and the application. But now Sebastian has generalized the interface into a container that stores and shares any kind of context between multiple elements and the application.

The first approach, which still lives in gst-plugins-bad, was merely a wrapper around a custom query to set or request a video context. But now the context sharing is part of the pipeline setup.

An element that needs a shared context must follow these actions:

  1. Check if the element already has a context.
  2. Query downstream for the context.
  3. Post a message on the bus to see if the application has one to share.
  4. If there is none, create the context, post a message and send an event letting the other elements know that this element has the context.

You can look at eglglessink as an example of how to use this feature.
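
A rough sketch of that sequence, using the GstContext API as it finally shipped in GStreamer 1.2 (the pre-release API discussed in this post differed in some details, and "gst.egl.EGLDisplay" is just an illustrative context type):

#include <gst/gst.h>

/* Sketch of the four steps above for an element that needs a shared
 * context. Error handling and the actual payload of the context
 * (e.g. the display handle) are omitted. */
static GstContext *
ensure_context (GstElement *element, GstPad *pad, GstContext *cached)
{
  GstQuery *query;
  GstContext *ctx = NULL;

  /* 1. The element may already have a context. */
  if (cached != NULL)
    return cached;

  /* 2. Query the peer elements for an existing context. */
  query = gst_query_new_context ("gst.egl.EGLDisplay");
  if (gst_pad_peer_query (pad, query)) {
    gst_query_parse_context (query, &ctx);
    if (ctx != NULL)
      ctx = gst_context_ref (ctx);
  }
  gst_query_unref (query);
  if (ctx != NULL)
    return ctx;

  /* 3. Ask the application through a bus message. */
  gst_element_post_message (element,
      gst_message_new_need_context (GST_OBJECT (element),
          "gst.egl.EGLDisplay"));

  /* 4. Nobody had one: create it, keep it and advertise it. */
  ctx = gst_context_new ("gst.egl.EGLDisplay", TRUE);
  gst_element_set_context (element, ctx);
  gst_element_post_message (element,
      gst_message_new_have_context (GST_OBJECT (element),
          gst_context_ref (ctx)));

  return ctx;
}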

GstVideoGLTextureUploadMeta

Also in 2011, Nicolas Dufresne added a helper class to upload a buffer into a surface (an OpenGL texture, a VA-API surface, a Wayland surface, etc.). This is quite important since the new video players are scene-based, using frameworks such as Clutter or OpenGL directly, where the video display is composed of various actors, such as the multimedia control widgets.

But still, this interface did not fit well into GStreamer 1.0 until now, when it was reintroduced in the form of a buffer meta, though this meta is specific to OpenGL textures. If the buffer provides this new GstVideoGLTextureUploadMeta meta, a new function, gst_video_gl_texture_upload_meta_upload(), is available to upload that buffer into an OpenGL texture specified by its numeric identifier.

Obviously, in order to use this meta, it should be proposed for allocation by the sink. Again, you can look at eglglessink as an example.
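
On the sink side, proposing the meta in the allocation query boils down to something like this (a sketch, assuming the GStreamer 1.2 API):

#include <gst/gst.h>
#include <gst/base/gstbasesink.h>
#include <gst/video/gstvideometa.h>

/* Sketch of a sink's propose_allocation(): advertise that upstream may
 * attach a GstVideoGLTextureUploadMeta to the buffers it sends us. */
static gboolean
propose_allocation (GstBaseSink *sink, GstQuery *query)
{
  gst_query_add_allocation_meta (query,
      GST_VIDEO_GL_TEXTURE_UPLOAD_META_API_TYPE, NULL);
  return TRUE;
}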

Caps Features

Caps features are a new data type for specifying an extension or requirement of the handled media.

From the practical point of view, we can say that caps structures with the same name but with a non-equal set of caps features are not compatible, and, if a pad supports multiple sets of features it has to add multiple equal structures with different feature sets to the caps.

Empty GstCapsFeatures are equivalent to GstCapsFeatures for plain system memory. Other examples would be a specific memory type or the requirement of having a specific meta on the buffer.

Again, we can see an example of caps features in eglglessink, because gst-inspect now also shows the caps features of the pads:

Pad Templates:
  SINK template: 'sink'
    Availability: Always
    Capabilities:
      video/x-raw(memory:EGLImage)
                 format: { RGBA, BGRA, ARGB, ABGR, RGBx,
                           BGRx, xRGB, xBGR, AYUV, Y444,
                           I420, YV12, NV12, NV21, Y42B,
                           Y41B, RGB, BGR, RGB16 }
                  width: [ 1, 2147483647 ]
                 height: [ 1, 2147483647 ]
              framerate: [ 0/1, 2147483647/1 ]
      video/x-raw(meta:GstVideoGLTextureUploadMeta)
                 format: { RGBA, BGRA, ARGB, ABGR, RGBx,
                           BGRx, xRGB, xBGR, AYUV, Y444,
                           I420, YV12, NV12, NV21, Y42B,
                           Y41B, RGB, BGR, RGB16 }
                  width: [ 1, 2147483647 ]
                 height: [ 1, 2147483647 ]
              framerate: [ 0/1, 2147483647/1 ]
      video/x-raw
                 format: { RGBA, BGRA, ARGB, ABGR, RGBx,
                           BGRx, xRGB, xBGR, AYUV, Y444,
                           I420, YV12, NV12, NV21, Y42B,
                           Y41B, RGB, BGR, RGB16 }
                  width: [ 1, 2147483647 ]
                 height: [ 1, 2147483647 ]
              framerate: [ 0/1, 2147483647/1 ]
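
Caps like the first entry above could be built programmatically like this (a sketch against the GStreamer 1.2 API); the same thing can also be written as the string "video/x-raw(memory:EGLImage), format=RGBA" and parsed with gst_caps_from_string():

#include <gst/gst.h>

/* Sketch: caps advertising raw RGBA video backed by EGLImage memory,
 * i.e. video/x-raw(memory:EGLImage). */
static GstCaps *
make_eglimage_caps (void)
{
  GstCaps *caps = gst_caps_new_simple ("video/x-raw",
      "format", G_TYPE_STRING, "RGBA",
      NULL);

  /* Attach the caps features to the first (and only) structure. */
  gst_caps_set_features (caps, 0,
      gst_caps_features_new ("memory:EGLImage", NULL));

  return caps;
}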

Parsers meta

This is a feature driven by Edward Hervey. The idea is that the video codec parsers (H264, MPEG, VC1) attach a meta to the buffer with a defined structure that carries the new information extracted from the coded stream.

This is particularly useful for decoders, which will not have to parse the buffer again in order to extract the information they need to decode the current buffer and the following ones.

For example, here is the H264 parser meta definition.

VDPAU

Another task driven by Edward Hervey, about which I feel excited, is the port of the VDPAU decoding elements to GStreamer 1.0.

Right now only the MPEG decoder is upstreamed, but MPEG4 and H264 are coming.

As a final note, I want to thank Collabora and Fluendo for sponsoring the dinners. A special thank you, as well, to Igalia, which covered my travel expenses and attendance at the hackfest.


26
Mar 13

GStreamer Hackfest 2013

Next Thursday I’ll be flying to Milan to attend the 2013 edition of the GStreamer Hackfest. My main interests are hardware codecs and GL integration, particularly VA API integrated with GL-based sinks.

Thanks Igalia for sponsoring my trip!



28
Jan 13

Announcing GPhone v0.10

Hi folks!

As many of you may know, lately I have been working on Ekiga and Opal. And, as usually happens to me, I started to wonder how I would rewrite that piece of software. My main ideas grew clear: craft a GObject library, à la WebKitGTK+, wrapping Opal's classes, paying attention to gobject-introspection; redirect Opal's multimedia rendering to a GStreamer player; and, finally, write the application in Vala.

The curiosity itched me hard, so I started to hack on these ideas from time to time. In mid-November last year, I had a functional prototype, which could only make phone calls through a graceless user interface. But I had my proof of concept. Nonetheless, as I usually do, I didn't drop the pet project, and continued developing more features.

And today, I am pleased to present to you the release v0.1 of GPhone.

Well, I guess a screencast is mandatory nowadays:



20
Jun 12

A GStreamer Video Sink using KMS

The purpose of this blog post is to show the concepts related to GstKMSSink, a new video sink for GStreamer 1.0, co-developed by Alessandro Decina and myself during my hack-fest time in Igalia's multimedia team.

One interesting thing to notice is that this element shows it is possible to write DRI clients without the burden of X Window.

Brief introduction to graphics in Linux

If you want to dump images onto your screen, you can simply use the frame buffer device. It provides an abstraction for the graphics hardware and represents the frame buffer of the video hardware. This kernel device allows user applications to access the graphics hardware without knowing the low-level details [1].
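
For instance, painting the whole screen through the frame buffer device can be as simple as this sketch (error handling omitted):

#include <fcntl.h>
#include <linux/fb.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch: map /dev/fb0 and fill it with a mid-gray frame. */
static void
paint_gray (void)
{
  struct fb_var_screeninfo vinfo;
  struct fb_fix_screeninfo finfo;
  int fd = open ("/dev/fb0", O_RDWR);
  size_t size;
  void *fb;

  ioctl (fd, FBIOGET_VSCREENINFO, &vinfo);   /* resolution and depth */
  ioctl (fd, FBIOGET_FSCREENINFO, &finfo);   /* stride (line_length) */
  size = finfo.line_length * vinfo.yres;

  fb = mmap (NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  memset (fb, 0x80, size);

  munmap (fb, size);
  close (fd);
}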

In GStreamer, we have two options for displaying images using the frame buffer device; or three, if we use OMAP3: fbvideosink, fbdevsink and gst-omapfb.

Nevertheless, since the appearance of GPUs, the frame buffer device interface has not been sufficient to expose all their capabilities. A new kernel interface had to emerge. And that was the Direct Rendering Manager (DRM).

What in the hell is DRM?

The DRM layer is intended to support the needs of complex graphics devices, usually containing programmable pipelines well suited to 3D graphics acceleration [2]. It deals with [3]:

  1. A DMA queue for graphics buffer transfers [4].
  2. Locks for the graphics hardware, treating it as a shared resource for simultaneous 3D applications [5].
  3. Secure hardware access, preventing clients from escalating privileges [6].

The DRM layer consists of two in-kernel drivers: a generic DRM driver, and another with specific support for the video hardware [7]. This is possible because the DRM engine is extensible, enabling the device-specific driver to hook in those functionalities that are required by the hardware. For example, in the case of Intel cards, the Linux kernel driver i915 supports them and couples their capabilities to the DRM driver.

The device-specific driver, in particular, should cover two main kernel interfaces: Kernel Mode Setting (KMS) and the Graphics Execution Manager (GEM). Both elements are also exposed to user-space through DRM.

With KMS, the user can ask the kernel to enable native resolution in the frame buffer, setting a certain display resolution and colour depth. One of the benefits of doing it in the kernel is that, since the kernel is in complete control of the hardware, it can switch back in case of failure [8].

In order to allocate command buffers, cursor memory, scanout buffers, etc., the device-specific driver should support a memory manager, and GEM is the manager with the most acceptance these days, because of its simplicity [9].

Besides the graphics memory management, GEM ensures conflict-free sharing of data between applications by managing memory synchronization. This is important because modern graphics hardware is essentially a NUMA environment.

The following diagram shows the component view of the DRM layer:

Direct Rendering Infrastructure

What is the deal with KMS?

KMS is important because GEM and DRM rely on it to allocate frame buffers and to configure the display. And it is important to us because almost all of the ioctls called by the GStreamer element are part of the KMS subset.

Moreover, some voices say that KMS is the future replacement for the frame buffer device [10].

To carry out its duties, KMS defines five main concepts [11,12], which are tied together in the sketch after this list:

Frame buffer:
The frame buffer is just a buffer, in the video memory, that has an image encoded in it as an array of pixels. As KMS configures the ring buffer in this video memory, it holds the information of this configuration, such as width, height, color depth, bits per pixel, pixel format, and so on.
CRTC:
Stands for Cathode Ray Tube Controller. It reads the data out of the frame buffer and generates the video mode timing. The CRTC also determines what part of the frame buffer is read; e.g., when multi-head is enabled, each CRTC scans out of a different part of the video memory; in clone mode, each CRTC scans out of the same part of the memory. Hence, from the KMS perspective, the CRTC abstraction contains the display mode information, including resolution, depth, polarity, porch, refresh rate, etc. It also holds the information about the buffer region to display and when to change to the next frame buffer.
Overlay planes:
Overlays are treated a little like CRTCs, but without associated modes or encoder trees hanging off of them: they can be enabled with a specific frame buffer attached at a specific location, but they don't have to worry about mode setting, though they do need an associated CRTC to actually pump their pixels out [13].
Encoder:
The encoder takes the digital bitstream from the CRTC and converts it to the appropriate format across the connector to the monitor.
Connector:
The connector provides the appropriate physical plug for the monitor to connect to, such as HDMI, DVI-D, VGA, S-Video, etc.
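
Putting those five concepts together, a minimal (and heavily simplified) modesetting sequence with libdrm looks roughly like this; the frame buffer object fb_id is assumed to have been created beforehand, e.g. with drmModeAddFB(), and all error handling is omitted:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Pick a connected connector, take the CRTC of its current encoder,
 * and scan out the given frame buffer using the connector's first
 * advertised mode (usually the preferred one). */
static void
scanout (uint32_t fb_id)
{
  int fd = open ("/dev/dri/card0", O_RDWR);
  drmModeRes *res = drmModeGetResources (fd);
  drmModeConnector *conn = NULL;
  int i;

  for (i = 0; i < res->count_connectors; i++) {
    conn = drmModeGetConnector (fd, res->connectors[i]);
    if (conn->connection == DRM_MODE_CONNECTED && conn->count_modes > 0)
      break;                       /* monitor attached, modes available */
    drmModeFreeConnector (conn);
    conn = NULL;
  }

  if (conn != NULL) {
    drmModeEncoder *enc = drmModeGetEncoder (fd, conn->encoder_id);

    /* The CRTC reads fb_id out of video memory with this mode timing. */
    drmModeSetCrtc (fd, enc->crtc_id, fb_id, 0, 0,
        &conn->connector_id, 1, &conn->modes[0]);

    drmModeFreeEncoder (enc);
    drmModeFreeConnector (conn);
  }

  drmModeFreeResources (res);
  close (fd);
}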

And what about this KMSSink?

KMSSink is a first approach towards a video sink as a DRI client. For now it only works on the PandaBoard with a recent kernel (I guess 3.3 would do).

For now it only uses custom non-tiled buffers and an overlay plane to display them. So, adding support for more hardware is on the to-do list.

Bibliography

[1] http://free-electrons.com/kerneldoc/latest/fb/framebuffer.txt
[2] http://free-electrons.com/kerneldoc/latest/DocBook/drm/drmIntroduction.html
[3] https://www.kernel.org/doc/readme/drivers-gpu-drm-README.drm
[4] http://dri.sourceforge.net/doc/drm_low_level.html
[5] http://dri.sourceforge.net/doc/hardware_locking_low_level.html
[6] http://dri.sourceforge.net/doc/security_low_level.html
[7] https://en.wikipedia.org/wiki/Direct_Rendering_Manager
[8] http://www.bitwiz.org.uk/s/how-dri-and-drm-work.html
[9] https://lwn.net/Articles/283798/
[10] http://phoronix.com/forums/showthread.php?23756-KMS-as-anext-gen-Framebuffer
[11] http://elinux.org/images/7/71/Elce11_dae.pdf
[12] http://www.botchco.com/agd5f/?p=51
[13] https://lwn.net/Articles/440192/


10
Apr 12

Aura

In the last few weeks Miguel, Calvaris and I developed an application for the N9/N950 mobile phone, and we called it Aura.

Basically, it uses the device's camera (either the main one or the front one) for video recording, like a normal camera application, but it also exposes a set of effects that can be applied, in real time, to the video stream.

For example, here is a video using the historical effect:

Aura is inspired by the GNOME application Cheese, and it uses many of the effects available in GNOME Video Effects.

The effects that it was possible to port to the N9/N950 are: dice, edge, flip, historical, hulk, mauve, noir/blanc, optical illusion, quark, radioactive, waveform, ripple, saturation, shagadelic, kung-fu, vertigo and warp.

Besides these software effects, it is possible to add, simultaneously, another set of effects that the hardware is capable of, such as sepia colors. These hardware capabilities do not impose extra processing as the software effects do.

Because of the processing cost imposed by the non-hardware video effects, Aura has a fixed video resolution; otherwise the performance would make the application unusable. Also, there is a missing feature: still image capture. But, hey! there is good news: Aura is fully open source, you can check out the code at GitHub, and we happily accept patches.

Honoring Cheese, the name of Aura is taken from a kind of Finnish blue cheese.

We hope you enjoy this application as we enjoyed developing it.


17
Jan 12

GStreamer video decoder for SysLink

This blog post is another one in our series about SysLink (1, 2): I finally came up with a usable GStreamer element for video decoding, which talks directly with the SysLink framework in the OMAP4 kernel.

As we stated before, SysLink is a set of kernel modules that enables the initialization of remote processors in a multi-core system (which might be heterogeneous), which run their own operating systems in their own memory space, and it also enables the communication between the host processor and the remote ones. This software and hardware setup can be viewed as an Asymmetric Multi-Processing model.

TI provides a user-space library to access the SysLink services, but I find its implementation a bit clumsy, so I took up the challenge of rewriting a part of it in a simple and straightforward fashion, as gst-dsp does for DSP/Bridge. The result is the interface syslink.h.

Simultaneously, I wrote a utility to load and monitor the operating system on the Cortex-M3 processors of the PandaBoard. This board, like all OMAP4-based SoCs, has two ARM Cortex-M3 cores as remote processors. Hence, this so-called daemon.c is in charge of loading the firmware images, setting the processor into its running state, allocating the interchange memory areas, and monitoring for any error messages.

In order to load the image files into the processors' memory areas, it is necessary to parse the ELF headers of the files, and that is why I decided to depend on libelf, rather than write yet another ELF parser. Yes, one sad dependency for the daemon. The use of libelf is isolated in elf.h.
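
This is not the daemon's actual code, but a minimal libelf sketch of the kind of parsing involved: walking the program headers of a firmware image to find the PT_LOAD segments that have to be copied into the remote processor's memory.

#include <fcntl.h>
#include <gelf.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: list the loadable segments of an ELF firmware image. */
static void
list_load_segments (const char *path)
{
  int fd = open (path, O_RDONLY);
  Elf *elf;
  GElf_Ehdr ehdr;
  GElf_Phdr phdr;
  int i;

  elf_version (EV_CURRENT);
  elf = elf_begin (fd, ELF_C_READ, NULL);
  gelf_getehdr (elf, &ehdr);

  for (i = 0; i < ehdr.e_phnum; i++) {
    gelf_getphdr (elf, i, &phdr);
    if (phdr.p_type == PT_LOAD)
      printf ("segment %d: paddr=%#llx size=%llu\n", i,
          (unsigned long long) phdr.p_paddr,
          (unsigned long long) phdr.p_filesz);
  }

  elf_end (elf);
  close (fd);
}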

When I was developing the daemon, for debugging purposes, I needed to trace the messages generated by the images in the remote processors. For that reason I wrote tracer.c, whose only responsibility is to read and parse the ring buffer used by the images in the remote processors for logging.

Now, in OMAP4, the subsystem comprising the two Cortex-M3 processors is called Ducati. The first processor is used only for the exchange of notification messages between the host processor and the second M3 processor, where all the multimedia processing is done.

There are at least two images for the second Cortex-M3 processor: DOMX, which is closed source and focused, as far as I know, on the OMX IL interface; and, on the other hand, DCE, which is open source, developed by Rob Clark, and provides a simple buffer-exchange interface.

My work uses DCE, obviously.

But, please, let me go back one step in this component description: in order to send and receive commands between the host processor and a remote processor, SysLink uses a packet-based protocol called Remote Command Messaging, or just RCM for friends. There are two types of RCM interfaces: the client and the server. The client interface is used by the applications running on the host processor to request services from the server interface, exposed by the systems running on the remote processors, which accepts the requests and returns results.

The RCM client interface is in rcm.h.

Above the RCM client sits my dce.h interface, which is in charge of controlling the state of the video decoders and of handling the buffers.

But these buffers are tricky. They are not memory areas allocated by a simple malloc; instead they are buffers allocated by a kernel mechanism called tiler. The purpose of this mechanism is to provide buffers capable of 2D operations in hardware (in other words, cheap and quick to compute). These buffers are shared along the whole processing pipeline, so copies of memory areas are not needed. Of course, in order to achieve this paradise, the video renderer must handle this type of buffer too.

In my code, the interface to the tiler mechanism is in tiler.h.

And finally, the almighty GStreamer element for video decoding: gstsyslinkvdec.c! Following the spirit of gst-dsp, this element is intended to deal with all the video decoders available in the DCE image, although for now H264 decoding is the only one handled.

For now, I have only tested the decoder with fakesink, because the element pushes tiled buffers onto the source pad, and, in order to have an efficient video player, a video renderer that handles this type of tiled buffer is required. TI is developing one, pvrvideosink, but it depends on EGL, and I would like to avoid X whenever possible.

I have not measured the performance of this work compared with TI's combo (syslink user-space library / memmgr / libdce / gst-ducati) either, but I suspect that my approach would be a little more efficient, faster, and, at least, simpler ;)

The sad news is that, as in every fast-paced development, all these kernel mechanisms are already deprecated: SysLink and DMM-Tiler will never be mainlined into the kernel, but their successors, rproc/rpmsg and omapdrm, have a good chance. And both take a very different approach from their predecessors. Nevertheless, SysLink is already here and is being used widely, so this effort has a chance of being worthwhile.

My current task is to decide if I should drop the 2D buffers in the video decoders or if I should develop a video renderer for them.


29
Nov 11

OpenMAX: a rant

I used to work with OpenMAX a while ago. I was exploring an approach to wrap OpenMAX components with GStreamer elements. The result was gst-goo. Its purpose was to test the OpenMAX components in complex scenarios, and it was focused only on the Texas Instruments OpenMAX implementation for the OMAP3 processor.

Some time after we started gst-goo, Felipe Contreras released gst-openmax, which had a more open development, but with a hard set of performance objectives like zero-copy. Also, only two implementations were supported at that moment: Bellagio and the TI one mentioned before.

Recently, Sebastian Dröge has been working on a redesign of gst-openmax, called gst-omx. He explained the rationale behind this new design in his talk at the GStreamer Conference 2011. If you are looking for a good summary of the problems faced when wrapping OpenMAX with GStreamer, because of their semantic impedance mismatch, you should watch his talk.

In my opinion, the key purpose of OpenMAX is to provide a common application interface to a set of different and heterogeneous multimedia components: you could take different implementations, offering hardware-accelerated codecs or other specialized ones, and build portable multimedia applications. But this objective has failed utterly: every vendor delivers an implementation that is incompatible with the others available. One of the causes, as Dröge explained, is the specification itself: it is too ambiguous and open to interpretation.

From my perspective, the problem arises from the very need for a library like OpenMAX. It is needed because the implementer wants to hide (or abstract, if you prefer) the control and buffer management of their codec entities. By hiding this, the implementer has the freedom to develop their own stack behind closed doors, without any kind of external review.

In order to explain the problem brought by the free-for-all hiding behind OpenMAX, let me narrow the scope: I will not fall into the trap of portability among different operating systems, especially non-Unix ones. Moreover, I will only focus on the ARM architecture of the Linux kernel. Thus, I will not consider software-based codecs, only hardware-accelerated ones. The reason for these constraints is that, besides PacketVideo's OpenCORE, I am not aware of any other successful set of non-Unix / software-based multimedia codecs interfaced with OpenMAX.

As new and more complex hardware appears, with its own processing units capable of off-loading the main processor, silicon vendors must also deliver the kernel drivers to operate them. This is very common among ARM vendors, where the search for added value gives the competitive advantage, and the Linux kernel has the virtues required for a fast time to market.

But these virtues have turned into a dead weight: excessive churn has been observed in the ARM architecture, along with duplicated code, board-specific data encoded in source files, and conflicts during kernel code integration. In other words, every vendor has built up their own software stack without taking care to develop common interfaces among all of them. And this has been particularly true for the hardware-accelerated multimedia components, where OpenMAX promised to user-space developers what the kernel developers could not achieve.

First we need a clear and unified kernel interface for hardware-accelerated multimedia codecs, so the user-space implementations can be straightforward, lean and clean. Those implementations could be OpenMAX, GStreamer, libav, or whatever we may want and need.

But there is hope. Recently there has been a lot of effort to bring new abstractions and common interfaces to the ARM architecture, so in the future we can expect integrated interfaces for all this new hardware, independently of the vendor.

Though, from my perspective, if we reach this point (and we will), we will have less motivation for a library like OpenMAX, because a high-level library such as GStreamer would cover a lot of hardware within a single element. Hence, it is a bit pointless to invest too much in OpenMAX or its wrappers nowadays.

Of course, if you think that I made a mistake along these reasons, I would love to read your comments.

And last but not least, Igalia celebrates its 10th anniversary! Happy igalian is happy :)


08
Jun 11

SysLink chronology

Introduction

For a while now, the processor market has observed a decline of Moore's law: processing speed cannot be doubled every year anymore. That is why the old and almost forgotten discipline of parallel processing has had a kind of resurrection. GPUs, DSPs and multi-cores are names that have been hitting the market recently, offering consumers more horsepower and more multi-tasking, but not more gigahertz per core, as it used to be.

Also, an ecological spirit has hit the chip manufacturers, embracing the green-computing concept. It fits perfectly with the decline of Moore's law: more speed means more power consumption and more heat injected into the environment. Not good. A better solution, they say, is to have a set of specialized processors which are activated when their specific task is requested by the user: do you need extensive graphics processing? No problem, the GPU will deal with it. Do you need to decode or encode high-resolution multimedia? We have a DSP for you. Are you only typing in an editor? We will turn off all the other processors, thus saving energy.

Even though this scenario seems quite idyllic, the hardware manufacturers have placed a lot of responsibility upon the software. The hardware is there, already wired into the main memory, but all the heuristics and logic to control the data flow among the processors is a huge and still open problem, with multiple partial solutions currently available.

And that is the case of Texas Instruments, which, since its first generation of OMAP, has delivered to the embedded and mobile market a multicore chip with a DSP and a high-end ARM processor. But delivering a chip is not enough, so TI has had to provide the system software capable of squeezing those heterogeneous processors.

At the beginning, TI only delivered to its clients a mere proof of concept, which the client could take as a reference for their own implementation. That was the case of dsp-gateway [1], developed by Nokia as an open source project.

But as the OMAP processor's capabilities increased, TI was under more pressure from its customers to deliver a general mechanism to communicate with the embedded processors.

For that reason, TI started to develop an Inter-Processor Communication (IPC) mechanism, whose approach is based on the concept that the general purpose processor (GPP), the host processor in charge of the user interaction, can control and interact with the other processors as if they were just other devices in the system. Those devices are called slave processors.

Thus, this IPC mechanism runs in the host processor's kernel space and fulfills the following responsibilities:

a) It allows the exchange of messages between the host processor and the slaves.

b) It can map files from the host’s file system into the memory space of the slave processor.

c) It permits the dynamic loading of basic operating systems and programs into the slave processors.

Also, TI has developed a programming library that provides an API with which the developers could build applications that use the processing power of the slave processors.

Sadly, whilst development on the host processor, typically a Linux environment, is open and mostly free, development for the slave cores (DSPs in general) is still closed and controlled by TI's licences. Nevertheless, TI has provided, gratis but closed, binary objects which can be loaded onto the slave cores to do multimedia processing.

Well, actually there have been some efforts to develop a GCC backend [2] for the C64x DSP, and also an LLVM backend [3], with which, at least theoretically, we could write programs to be loaded and executed through these IPC mechanisms. But they are not mature enough to be used seriously.

DSPLink and DSPBridge

In order to develop a general purpose IPC for its OMAP processor family, TI designed DSPBridge [4]: oriented to multi-slave systems, operating-system agnostic, able to handle power management demands, and other industrial-weight buzzwords.

But DSPBridge was not ready for production until OMAP3 came out to market. That is why another group inside TI narrowed the scope of DSPBridge's design and slimmed its implementation, bringing out DSPLink [5], capable of running on OMAP2, OMAP3 and also the DaVinci family.

DSPLink is distributed as an out-of-tree kernel module and a programming library, along with the closed binaries that run on the DSP side. Nevertheless, the kernel module does not meet the requirements to be mainlined into the kernel tree. Also, it lacks power management features and dynamic MMU support.

On the other hand, DSPBridge has been brewed to be mainlined into the kernel, though it has been stuck in Greg's staging tree for a long time. It seems that all the resources within TI are devoted to SysLink, the next generation of IPC. Nonetheless, many recent OMAP3-based devices use this IPC mechanism for their multimedia applications.

Initially, TI offered an OpenMAX layer on top of the DSPBridge user-space API to process multimedia on the C64x DSP, but that solution was too bloated for some developers, and the gst-dsp [6] project appeared, which reuses the DSP codecs available in TI's OpenMAX implementation, along with the DSPBridge kernel module, to provide a thin and lean interface through the GStreamer framework.

SysLink and OMAP4

Then OMAP4 came into existence. It is no longer just a DSP and a high-end ARM processor: it has a DSP, a dual ARM Cortex-M3 and, as host processor, a dual ARM Cortex-A9. Five processing units in a single piece of silicon! How in hell will we share information among all of them? DSPBridge was not designed with this scenario in mind.

The ARM Cortex-M3's purpose is to process video and images, and for that reason a tiler-based memory allocation is proposed, where the memory buffers are already perceived as 2D and fast mirroring and rotation operations are available.

Regretfully, in the case of the PandaBoard (OMAP4430), the available DSP has lower capabilities than the one in the BeagleBoard's OMAP3, so the codecs published for the OMAP3 DSP cannot be reused on the PandaBoard. But video codecs for the M3 cores are currently available, and they are capable of processing high-definition resolutions.

The answer is SysLink, where, besides the three operations developed for DSPBridge, three more core responsibilities were added:

a) Zero-copy shared memory: the ability to “pass” data buffers to other processors by simply providing their location in shared memory.

b) TILER-based memory allocation: allocation of 2D buffers with mirroring and rotation options.

c) Remote function calls: one processor can invoke functions on a remote processor.

The stack offered is similar to the OMAP3 one: in user space we start with the SysLink API, then an OpenMAX layer, now called DOMX, and finally the gst-openmax elements for the GStreamer framework. And again, a bloated, buzzworded stack for multimedia.

In his spare time, Rob Clark developed a proof of concept to remove the DOMX/gst-openmax layers and provide a set of GStreamer elements that talk directly with the SysLink API: libdce[7]/gst-ducati[8].

Either way, I feel more comfortable with the approach proposed by Felipe Contreras in gst-dsp: a slim and simple API to SysLink, and plain GStreamer elements using that API. For that reason, I started to code a minimalistic API for the SysLink interface, copying the spirit of dsp_bridge [9]: https://gitorious.org/vj-pandaboard/syslink/

1. http://sourceforge.net/projects/dspgateway/
2. http://www.qucosa.de/fileadmin/data/qucosa/documents/4857/data/thesis.pdf
3. https://www.studentrobotics.org/trac/wiki/Beagleboard_DSP
4. http://www.omappedia.org/wiki/DSPBridge_Project
5. http://processors.wiki.ti.com/index.php/Category:DSPLink
6. https://code.google.com/p/gst-dsp/
7. https://github.com/robclark/libdce
8. https://github.com/robclark/gst-ducati
9. https://github.com/felipec/gst-dsp/blob/HEAD/dsp_bridge.h