Composited video support in WebKitGTK+

A couple of months ago we started to work on adding support for composited video in WebKitGTK+. The objective is to play video in WebKitGTK+ using the hardware accelerated path, so that we can play videos at high definition resolutions (1080p).

How does WebKit paint?

Basically, we can think of a browser as an application for retrieving, presenting and traversing information on the Web.

For composited video support, we are interested in the presentation task of the browser; more specifically, in the graphical presentation.

In WebKit, the HTML elements of a web page are stored as a tree of Node objects, called the DOM tree.

Then, each Node that produces visual output has a corresponding RenderObject, and these are stored in another tree, called the Render Tree.

Finally, each RenderObject is associated with a RenderLayer. These RenderLayers exist so that the elements of the page are composited in the correct order to properly display overlapping content, semi-transparent elements, etc.

It is worth mentioning that there is no one-to-one correspondence between RenderObjects and RenderLayers, and that there is a RenderLayer tree as well.

Render Trees in WebKit (from GPU Accelerated Compositing in Chrome).

WebKit fundamentally renders a web page by traversing the RenderLayer tree.

What is accelerated compositing?

WebKit has two paths for rendering the contents of a web page: the software path and the hardware accelerated path.

The software path is the traditional model, where all the work is done by the CPU. In this mode, RenderObjects paint themselves into the final bitmap, compositing a final layer which is presented to the user.

In the hardware accelerated path, some of the RenderLayers get their own backing surface into which they paint. Then, all the backing surfaces are composited onto the destination bitmap, and this task is the responsibility of the compositor.

With the introduction of compositing, an additional conceptual tree appears: the GraphicsLayer tree, where each RenderLayer may have its own GraphicsLayer.

In the hardware accelerated path, the GPU is used to composite some of the RenderLayer contents.

Accelerated Compositing in WebKit (from Hardware Acceleration in WebKit).

As Iago said, accelerated compositing involves offloading the compositing of the GraphicsLayers onto the GPU, which does the compositing very fast, relieving the CPU of that burden and delivering a better and more responsive user experience.

Although there are other options, OpenGL is typically used to render computer graphics, interacting with the GPU to achieve hardware acceleration. And WebKit provides a cross-platform implementation to render with OpenGL.

How does WebKit paint using OpenGL?

Ideally, we could go from the GraphicsLayer tree directly to OpenGL, traversing it and drawing the texture-backed layers with a common WebKit implementation.

But an abstraction layer is needed, because different GPUs may behave differently, may offer different extensions, and we still want to be able to use the software path if hardware acceleration is not available.

This abstraction layer is known as the Texture Mapper: a light-weight scene-graph implementation specially attuned for efficient usage of the GPU.

It is a combination of a specialized accelerated drawing context (TextureMapper) and a scene-graph (TextureMapperLayer):

The TextureMapper is an abstract class that provides the necessary drawing primitives for the scene-graph. Its purpose is to abstract different implementations of the drawing primitives from the scene-graph.

One of the implementations is TextureMapperGL, which provides a GPU-accelerated implementation of the drawing primitives using shaders compatible with GL/ES 2.0.

There is also the TextureMapperLayer, which may represent a GraphicsLayer node in the GPU-renderable layer tree; the TextureMapperLayer tree is equivalent to the GraphicsLayer tree.
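
To give a feeling of the kind of drawing primitive TextureMapperGL builds upon, here is a minimal sketch, in plain C and not taken from WebKit, of compiling the sort of GL/ES 2.0 fragment shader used to sample a layer's backing texture (it assumes a GL/ES 2.0 context is already current):

/* Minimal sketch: compiling a GL/ES 2.0 fragment shader that samples a layer
 * texture. Illustrative only; this is not WebKit's actual shader code. */
#include <stdio.h>
#include <GLES2/gl2.h>

static const char *frag_src =
    "precision mediump float;                            \n"
    "uniform sampler2D s_texture;                        \n"
    "varying vec2 v_texcoord;                            \n"
    "void main () {                                      \n"
    "  gl_FragColor = texture2D (s_texture, v_texcoord); \n"
    "}                                                   \n";

static GLuint
compile_fragment_shader (void)
{
  GLuint shader = glCreateShader (GL_FRAGMENT_SHADER);
  GLint ok = GL_FALSE;

  glShaderSource (shader, 1, &frag_src, NULL);
  glCompileShader (shader);
  glGetShaderiv (shader, GL_COMPILE_STATUS, &ok);
  if (!ok) {
    char log[512];
    glGetShaderInfoLog (shader, sizeof (log), NULL, log);
    fprintf (stderr, "shader compilation failed: %s\n", log);
    glDeleteShader (shader);
    return 0;
  }
  return shader;
}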

How does WebKitGTK+ play a video?

As we stated earlier, in WebKit each HTML element of a web page is stored as a Node in the DOM tree, and WebKit provides a Node class hierarchy for all the HTML elements. In the case of the video tag, there is a parent class called HTMLMediaElement, which aggregates a common, cross-platform media player. The MediaPlayer is a decorator for a platform-specific media player known as MediaPlayerPrivate.

All of the above is shown in the next diagram.

Video in WebKit: three layers, from top to bottom.

In the GTK+ port, audio and video decoding is done with GStreamer. In the case of video, a special GStreamer video sink injects the decoded buffers into the WebKit process. You can think of it as a special kind of GstAppSink; it is part of the WebKitGTK+ code-base.
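
The actual sink is a custom element inside WebKit, but the idea can be sketched with a plain appsink: the application receives a callback for every decoded frame and decides what to do with the buffer. A hedged, stand-alone example (not the WebKitGTK+ code):

/* Sketch of the idea behind the WebKit video sink, using a plain appsink:
 * every decoded frame is handed to the application through a callback.
 * This is not the actual WebKitGTK+ element. */
#include <gst/gst.h>
#include <gst/app/gstappsink.h>

static GstFlowReturn
on_new_sample (GstAppSink *sink, gpointer user_data)
{
  GstSample *sample = gst_app_sink_pull_sample (sink);
  GstBuffer *buffer = gst_sample_get_buffer (sample);

  /* Here WebKit would hand the decoded buffer to the media player backend,
   * which either copies it into a Cairo surface (software path) or uploads
   * it into an OpenGL texture (hardware accelerated path). */
  g_print ("decoded buffer of %" G_GSIZE_FORMAT " bytes\n",
      gst_buffer_get_size (buffer));

  gst_sample_unref (sample);
  return GST_FLOW_OK;
}

int
main (int argc, char **argv)
{
  GstAppSinkCallbacks callbacks = { NULL, NULL, on_new_sample };
  GstElement *pipeline, *sink;
  GstBus *bus;

  gst_init (&argc, &argv);

  pipeline = gst_parse_launch (
      "videotestsrc num-buffers=100 ! videoconvert ! appsink name=sink", NULL);
  sink = gst_bin_get_by_name (GST_BIN (pipeline), "sink");
  gst_app_sink_set_callbacks (GST_APP_SINK (sink), &callbacks, NULL, NULL);

  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  bus = gst_element_get_bus (pipeline);
  gst_message_unref (gst_bus_timed_pop_filtered (bus, GST_CLOCK_TIME_NONE,
      GST_MESSAGE_EOS | GST_MESSAGE_ERROR));

  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (bus);
  gst_object_unref (sink);
  gst_object_unref (pipeline);
  return 0;
}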

And we come back to the two paths for content rendering in WebKit:

In the software path, the decoded video buffers are copied into a Cairo surface (a rough sketch of this copy follows below).

But in the hardware accelerated path, the decoded video buffers must be uploaded into an OpenGL texture. When a new video buffer is available to be shown, a message is sent to the GraphicsLayer asking for a redraw.
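
Coming back to the software path, that copy can be sketched like this (a rough example, not the WebKit code, assuming a tightly packed ARGB frame whose width and height are known from the negotiated caps):

/* Sketch of the software path: copying a decoded ARGB frame from a GstBuffer
 * into a Cairo image surface. The source is assumed to be tightly packed;
 * a real implementation must honour the negotiated strides and pixel format. */
#include <string.h>
#include <cairo.h>
#include <gst/gst.h>

static cairo_surface_t *
copy_buffer_to_cairo_surface (GstBuffer *buffer, int width, int height)
{
  cairo_surface_t *surface;
  unsigned char *pixels;
  GstMapInfo map;
  int stride, row;

  surface = cairo_image_surface_create (CAIRO_FORMAT_ARGB32, width, height);
  cairo_surface_flush (surface);
  pixels = cairo_image_surface_get_data (surface);
  stride = cairo_image_surface_get_stride (surface);

  gst_buffer_map (buffer, &map, GST_MAP_READ);
  for (row = 0; row < height; row++)
    memcpy (pixels + row * stride, map.data + row * width * 4, width * 4);
  gst_buffer_unmap (buffer, &map);

  cairo_surface_mark_dirty (surface);
  return surface;
}

Every frame implies a full copy on the CPU, which is precisely what the accelerated path tries to avoid.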

Uploading video buffers into GL textures

When we are dealing with big buffers, such as high definition video frames, copying them around is a performance killer. That is why zero-copy techniques are mandatory.

Moreover, when we are working in a multi-processor environment, such as one with a CPU and a GPU, moving buffers between the processors' contexts is also very expensive.

For these reasons, the video decoding and the OpenGL texture handling should happen only in the GPU, without context switching and without copying memory chunks.

The simplest approach would be for the decoder to deliver an EGLImage, so we could bind that handle to a texture. As far as I know, the gst-omx video decoder on the Raspberry Pi works this way.
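
For illustration, binding such a handle boils down to the GL_OES_EGL_image extension. A rough sketch, assuming the decoder already handed us an EGLImageKHR and that an EGL display and a GL/ES 2.0 context are current:

/* Rough sketch: binding an EGLImage delivered by the decoder to a GL texture
 * through the GL_OES_EGL_image extension. */
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

static GLuint
texture_from_egl_image (EGLImageKHR image)
{
  static PFNGLEGLIMAGETARGETTEXTURE2DOESPROC image_target_texture = NULL;
  GLuint texture;

  if (image_target_texture == NULL)
    image_target_texture = (PFNGLEGLIMAGETARGETTEXTURE2DOESPROC)
        eglGetProcAddress ("glEGLImageTargetTexture2DOES");

  glGenTextures (1, &texture);
  glBindTexture (GL_TEXTURE_2D, texture);
  glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
  glTexParameteri (GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

  /* No pixel data is copied here: the texture is backed by the decoder's
   * buffer, which is the whole point of the zero-copy path. */
  image_target_texture (GL_TEXTURE_2D, (GLeglImageOES) image);

  return texture;
}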

GStreamer added a new API, which will be available in version 1.2, to upload video buffers into a texture efficiently: GstVideoGLTextureUploadMeta. This API is exposed through the buffer's metadata, and ought to be implemented by the elements that produce the decoded video frames, most commonly the video decoder.
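
On the consumer side (a video sink, or in our case the WebKit media player backend), using the meta looks roughly like this sketch:

/* Sketch of the consumer side of GstVideoGLTextureUploadMeta: if the buffer
 * carries the meta, ask the producer to upload the frame into our texture;
 * otherwise we would have to map the buffer and copy it ourselves. */
#include <gst/gst.h>
#include <gst/video/gstvideometa.h>

static gboolean
upload_buffer_to_texture (GstBuffer *buffer, guint texture_id)
{
  GstVideoGLTextureUploadMeta *meta;
  guint ids[4] = { texture_id, 0, 0, 0 };

  meta = gst_buffer_get_video_gl_texture_upload_meta (buffer);
  if (meta != NULL)
    return gst_video_gl_texture_upload_meta_upload (meta, ids);

  /* Fallback: map the buffer and upload it with glTexImage2D() (a copy). */
  return FALSE;
}

The texture id array has room for up to four textures, because planar formats may need one texture per plane.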

For example, in gstreamer-vaapi there are a couple of patches (still a work in progress) in Bugzilla enabling this API. At the low level, calling gst_video_gl_texture_upload_meta_upload() ends up calling vaCopySurfaceGLX(), which does an efficient copy of the VA-API surface into a texture using a GLX extension.
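
On the producer side, the decoder (or a helper element) attaches the meta with an upload callback; in the VA-API case that callback conceptually wraps the VA/GLX copy. The following is only a hedged sketch of that shape, not the gstreamer-vaapi code: the UploadContext structure and the way the current surface is tracked are made up, while the GStreamer and <va/va_glx.h> calls are real API.

/* Hedged sketch of the producer side: attach a GstVideoGLTextureUploadMeta
 * whose upload callback copies the decoded VA-API surface into the consumer's
 * texture through VA/GLX. A real element keeps this state per buffer. */
#include <gst/video/gstvideometa.h>
#include <va/va_glx.h>

typedef struct {
  VADisplay display;
  VASurfaceID surface;             /* the decoded frame, still on the GPU */
} UploadContext;

static UploadContext upload_ctx;   /* simplification: one global context */

static gboolean
upload_to_texture (GstVideoGLTextureUploadMeta *meta, guint texture_id[4])
{
  void *gl_surface = NULL;
  VAStatus status;

  /* Associate the GL texture with a VA/GLX surface and copy into it. */
  status = vaCreateSurfaceGLX (upload_ctx.display, GL_TEXTURE_2D,
      texture_id[0], &gl_surface);
  if (status != VA_STATUS_SUCCESS)
    return FALSE;

  status = vaCopySurfaceGLX (upload_ctx.display, gl_surface,
      upload_ctx.surface, VA_FRAME_PICTURE);
  vaDestroySurfaceGLX (upload_ctx.display, gl_surface);

  return status == VA_STATUS_SUCCESS;
}

static void
attach_upload_meta (GstBuffer *buffer)
{
  GstVideoGLTextureType types[4] = { GST_VIDEO_GL_TEXTURE_TYPE_RGBA, 0, 0, 0 };

  gst_buffer_add_video_gl_texture_upload_meta (buffer,
      GST_VIDEO_GL_TEXTURE_ORIENTATION_X_NORMAL_Y_NORMAL,
      1, types, upload_to_texture, NULL, NULL, NULL);
}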

Demo

This is an old demo, recorded when all the pieces started to fit together, so it does not reflect the current performance. Still, it shows what has been achieved:

Future work

So far, all these bits are already integrated in WebKitGTK+ and GStreamer. Nevertheless, there are some open issues:

  • gstreamer-vaapi et al.:
    GStreamer 1.2 is not released yet, and its new API might still change. Also, the port of gstreamer-vaapi to GStreamer 1.2 is still a work in progress, and the available patches may have rough edges. Besides, there are many other projects that need to be updated to this new API, such as clutter-gst, which would provide more feedback to the community.
    Another important task is to have more GStreamer elements implementing these new APIs, such as the texture upload and the caps features.
  • Tearing:
    The composited video task unveiled a major problem in WebKitGTK+: it does not handle the vertical blank interval at all, causing tearing artifacts that are clearly observable in high resolution videos with high motion. WebKitGTK+ composites the scene off-screen, using an X Composite redirected window, and then displays it in an X Damage callback, but currently GTK+ does not take care of the vertical blank interval, causing this tearing artifact in heavy compositions.
    At Igalia, we are currently researching a way to fix this issue.
  • Performance:
    There is always room for performance improvement, and we are always aiming in that direction: improving the frame rate, reducing the CPU, GPU and memory usage, et cetera.
    So, stay tuned, or even better, come and help us.

OpenMAX: a rant

I used to work with OpenMAX a while ago. I was exploring an approach to wrap OpenMAX components with GStreamer elements. The result was gst-goo. Its purpose was to test the OpenMAX components in complex scenarios, and it was focused only on the Texas Instruments OpenMAX implementation for the OMAP3 processor.

Some time after we started gst-goo, Felipe Contreras released gst-openmax, which had a more open development model, but with a hard set of performance objectives, like zero-copy. Also, only two implementations were supported at that moment: Bellagio and the TI one mentioned before.

Recently, Sebastian Dröge has been working on a redesign of gst-openmax, called gst-omx. He explained the rationale behind this new design in his talk at the GStreamer Conference 2011. If you are looking for a good summary of the problems faced when wrapping OpenMAX with GStreamer, because of their semantic impedance mismatch, you should watch his talk.

In my opinion, the key purpose of OpenMAX is to provide a common application interface to a set of different and heterogeneous multimedia components: you could take different implementations, offering hardware-accelerated codecs or other specialized ones, and build portable multimedia applications on top of them. But this objective has failed utterly: every vendor delivers an implementation incompatible with the others available. One of the causes, as Dröge explained, is that the specification is too ambiguous and open to interpretation.

From my perspective, the problem arises from the very need for a library like OpenMAX. It is needed because the implementer wants to hide (or abstract, if you prefer) the control and buffer management of their codec entities. By hiding this, the implementer has the freedom to develop their own stack behind closed doors, without any kind of external review.

In order to explain the problem brought about by the debauchery behind OpenMAX, let me narrow the scope: I will not fall into the trap of portability among different operating systems, especially non-Unix ones. Moreover, I will only focus on the ARM architecture of the Linux kernel. Thus, I will not consider software-based codecs, only hardware-accelerated ones. The reason behind these constraints is that, besides PacketVideo's OpenCORE, I am not aware of any other successful set of non-Unix or software-based multimedia codecs interfaced with OpenMAX.

As new and more complex hardware appears, with its own processing units capable of off-loading the main processor, silicon vendors must also deliver the kernel drivers to operate them. This is very common among ARM vendors, where the search for added value gives the competitive advantage, and the Linux kernel has the virtues required for a fast time to market.

But these virtues have turned into a ballast: excessive churn has been observed in the ARM architecture, along with duplicated code, board-specific data encoded in source files, and conflicts at kernel code integration time. In other words, every vendor has built up their own software stack without caring to develop common interfaces with the others. And this has been particularly true for the hardware-accelerated multimedia components, where OpenMAX promised to user-space developers what the kernel developers could not achieve.

First we need a clear and unified kernel interface for hardware-accelerated multimedia codecs, so that the user-space implementations can be straightforward, lean and clean. Those implementations could be OpenMAX, GStreamer, libav, or whatever we may want and need.

But there is hope. Recently, a lot of effort has gone into bringing new abstractions and common interfaces to the ARM architecture, so in the future we can expect integrated interfaces for all this new hardware, independently of the vendor.

Though, from my perspective, once we reach that point (and we will), there will be less motivation for a library like OpenMAX, because a high-level library such as GStreamer could cover a lot of hardware within a single element. Hence, it is a bit pointless to invest too much in OpenMAX or its wrappers nowadays.

Of course, if you think I made a mistake in this reasoning, I would love to read your comments.

And last but not least, Igalia celebrates its 10th anniversary! Happy igalian is happy 🙂

SysLink chronology

Introduction

For a while now, the processor market has been observing a decline of Moore's law: processing speed cannot be doubled every year anymore. That is why the old and almost forgotten discipline of parallel processing has had a kind of resurrection. GPUs, DSPs and multi-cores are names that have been kicking around the market recently, offering consumers more horsepower and more multi-tasking, but not more gigahertz per core, as it used to be.

Also, an ecologist spirit has hit the chip manufacturers, embracing the green-computing concept. It fits perfectly with the decay of Moore's law: more speed means more power consumption and more heat injected into the environment. Not good. A better solution, they say, is to have a set of specialized processors which are activated when their specific task is requested by the user: do you need extensive graphics processing? No problem, the GPU will deal with it. Do you need to decode or encode high resolution multimedia? We have a DSP for you. Are you only typing in an editor? We will turn off all the other processors, thus saving energy.

Even though this scenario seems quite idyllic, the hardware manufacturers have placed a lot of responsibility upon the software. The hardware is there, already wired into the main memory, but all the heuristics and logic to control the data flow among the processors are a huge and still open problem, with multiple partial solutions currently available.

And that is the case of Texas Instruments, which, since its first generation of OMAP, has delivered to the embedded and mobile market a multicore chip with a DSP and a high-end ARM processor. But delivering a chip is not enough, so TI has had to provide the system software capable of squeezing the most out of those heterogeneous processors.

At the beginning, TI delivered to its clients only a mere proof of concept, which they could take as a reference for their own implementations. That was the case of dsp-gateway[1], developed by Nokia as an open source project.

But as the OMAP processor's capabilities increased, TI came under more pressure from its customers to deliver a general mechanism to communicate with the embedded processors.

For that reason, TI started to develop an Inter-Processor Communication (IPC) mechanism, whose approach is based on the concept that the general purpose processor (GPP), the host processor in charge of user interaction, can control and interact with the other processors as if they were just other devices in the system. Those devices are called slave processors.

Thus, this IPC mechanism runs in the host processor's kernel space and fulfills the following responsibilities:

a) It allows the exchange of messages between the host processor and the slaves.

b) It can map files from the host’s file system into the memory space of the slave processor.

c) It permits the dynamic loading of basic operating systems and programs into the slave processors.

Also, TI has developed a programming library that provides an API with which developers can build applications that use the processing power of the slave processors.

Sadly, whilst development on the host processor, typically a Linux environment, is open and mostly free, development for the slave cores (DSPs in general) is still closed and controlled by TI's licences. Nevertheless, TI has provided gratis, but closed, binary objects which can be loaded into the slave cores to do multimedia processing.

Well, actually there have been some efforts to develop a GCC backend[2] for the C64x DSP, and also an LLVM backend[3], with which, at least theoretically, we could write programs to be loaded and executed through these IPC mechanisms. But they are not mature enough to be used seriously.

DSPLink and DSPBridge

In order to develop a general purpose IPC mechanism for its OMAP processor family, TI designed DSPBridge[4]: oriented to multi-slave systems, operating-system agnostic, able to handle power management demands, and other industrial-weight buzzwords.

But DSPBridge was not ready for production until the OMAP3 came out to the market. That is why another group inside TI narrowed the scope of DSPBridge's design and slimmed down its implementation, bringing out DSPLink[5], capable of running on OMAP2, OMAP3 and also the DaVinci family.

DSPLink is distributed as an out-of-tree kernel module and a programming library, along with the closed binaries that run on the DSP side. Nevertheless, the kernel module does not meet the requirements to be mainlined into the kernel tree. Also, it lacks power management features and dynamic MMU support.

On the other hand, DSPBridge has been brewed to be mainlined into the kernel, though it has been stuck in Greg's staging tree for a long time. It seems that all the resources within TI are devoted to SysLink, the next generation of IPC. Nonetheless, many recent OMAP3-based devices use this IPC mechanism for their multimedia applications.

Initially, TI offered an OpenMAX layer on top of the DSPBridge user-space API to process multimedia on the C64x DSP, but that solution was too bloated for some developers, and the gst-dsp[6] project appeared, which reuses the DSP codecs available in TI's OpenMAX implementation, along with the DSPBridge kernel module, to provide a thin and lean interface through the GStreamer framework.

SysLink and OMAP4

Then OMAP4 came into existence. It is no longer only a DSP and a high-end ARM processor: it has a DSP, a dual-core ARM Cortex-M3, and, as host processor, a dual-core ARM Cortex-A9. Five processing units in a single piece of silicon! How on earth will we share information among all of them? DSPBridge was not designed with this scenario in mind.

The ARM Cortex-M3 cores are meant to process video and images, and for that reason a TILER-based memory allocation is proposed, where the memory buffers are already perceived as 2D, and fast mirroring and rotation operations are available.

Regretfully, in the case of the Pandaboard (OMAP4430), the available DSP has lower capabilities than the one in the Beagleboard's OMAP3, so the codecs published for the OMAP3 DSP cannot be reused on the Pandaboard. But video codecs for the M3 cores are currently available, and they are capable of processing high definition resolutions.

The answer is SysLink, where, besides the three operations developed for DSPBridge, three more core responsibilities were added:

a) Zero-copy shared memory: the ability to “pass” data buffers to other processors by simply providing their location in shared memory.

b) TILER-based memory allocation: allocate 2D-buffers with mirroring and rotation options.

c) Remote function calls: one processor can invoke functions on a remote processor.

The stack offered is similar to the OMAP3 one: in user space we start with the SysLink API, then an OpenMAX layer, now called DOMX, and finally the gst-openmax elements for the GStreamer framework. And again, a bloated, buzzworded stack for multimedia.

In his spare time, Rob Clark developed a proof of concept to remove the DOMX/gst-openmax layers and provide a set of GStreamer elements that talk directly with the SysLink API: libdce[7]/gst-ducati[8].

Either way, I feel more comfortable with the approach proposed by Felipe Contreras in gst-dsp: a slim and simple API on top of SysLink and plain GStreamer elements using that API. For that reason, I started to code a minimalistic API for the SysLink interface, copying the spirit of dsp_bridge[9]: https://gitorious.org/vj-pandaboard/syslink/

1. http://sourceforge.net/projects/dspgateway/
2. http://www.qucosa.de/fileadmin/data/qucosa/documents/4857/data/thesis.pdf
3. https://www.studentrobotics.org/trac/wiki/Beagleboard_DSP
4. http://www.omappedia.org/wiki/DSPBridge_Project
5. http://processors.wiki.ti.com/index.php/Category:DSPLink
6. https://code.google.com/p/gst-dsp/
7. https://github.com/robclark/libdce
8. https://github.com/robclark/gst-ducati
9. https://github.com/felipec/gst-dsp/blob/HEAD/dsp_bridge.h

My DSP-related activities

When I started to play with the Beagleboard, my objective was to poke at the DSP-accelerated codecs through OpenMAX and GStreamer. But I soon realized that it would be a hard task, since part of the framework developed by Texas Instruments is proprietary (though free of charge), and the other part is open source, but not developed with an open source community in mind.

When I started to pull it off, the first decision I had to face was choosing a cross-compiling environment. As you know, there are plenty: scratchbox, OpenEmbedded, buildroot, PTXdist, etc. Just because people I know from TI began to write recipes for Poky, I devoted some time to learning about it. Although, after a while, I jumped to OpenEmbedded. The reason was Poky's slow rate of updates against upstream (and as I don't follow the project anymore, I'm not aware of its current state).

But bitbake and OpenEmbedded are not a magic wand to build a complete image to boot on a device. There are a lot of things to define beforehand; just to mention one, the distribution to build. By default OpenEmbedded offers Angstrom, but I did not want a full-featured distribution. I wanted something thin and lean, only a serial shell to start with, something I could set up as a workbench for my multimedia experiments.

And for that reason marmita was born.

As you may see, I mimicked the “Poky way”, making an overlay of OpenEmbedded, but as I got involved in the OE community, I realized that maybe it was not the right decision; maybe I should push my changes into Angstrom instead.

Anyway, right now I have a steady set of recipes which allows me to build images for the Beagleboard, with the latest dspbridge kernel branch and many of the TI bits (both proprietary and open source) required to run the DSP-accelerated codecs.

On the other hand, I have revamped the DSP how-to on the elinux.org wiki, with instructions to build a kernel with DSP/BIOS Bridge support, and the means to test communication with the DSP through it.

Along this process I became aware of TI's problems in releasing its DSP user-space stack to the open source community. Even though the kernel side is moving quite well towards the mainline, the user-space bits are not doing that well; even worse, they will be completely deprecated soon, because the kernel interface is still evolving.

At the lowest layer, on the ARM side of the DSP user-space stack, there is a library known as libbridge, which is basically an abstraction of the ioctls to the dspbridge kernel module. It offers an interface with nice semantics, but it is too aligned with the old Win16/32 API style (a bad idea in my opinion).
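
Just to illustrate what "an abstraction of the ioctls" means here, this is the pattern, reduced to an entirely hypothetical sketch (the device node, request code and argument structure below are made up; the real ones live in the dspbridge headers):

/* Entirely hypothetical sketch of the ioctl-wrapping pattern used by libraries
 * like libbridge and dsp_bridge: open the bridge device once, then wrap each
 * kernel service in a small function that fills a structure and issues an
 * ioctl. The device node, request code and structure are invented. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct bridge_msg {                     /* hypothetical argument structure */
  unsigned int node_id;
  unsigned int cmd;
  unsigned int arg1;
  unsigned int arg2;
};

#define BRIDGE_IOCTL_SEND_MSG _IOW ('D', 0x10, struct bridge_msg)   /* made up */

static int
bridge_open (void)
{
  return open ("/dev/dspbridge", O_RDWR);   /* hypothetical device node */
}

static int
bridge_send_message (int fd, unsigned int node_id, unsigned int cmd,
    unsigned int arg1, unsigned int arg2)
{
  struct bridge_msg msg = { node_id, cmd, arg1, arg2 };
  return ioctl (fd, BRIDGE_IOCTL_SEND_MSG, &msg);
}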

But the problem does not start with the API style; it begins with locating the library inside a chaotic bunch of files, insanely bundled in a git repository along with binaries (both for Windows and Linux), tarballs, and all sorts of unrelated documentation.

Furthermore, the image-building machinery within TI is a custom set of invasive makefiles, which all the projects must include and conform to. As a result of this highly coupled build engine, extracting and isolating a project for release is a painful and error-prone process.

Given those problems, I got lost as soon as I began. So I decided to emulate Felipe Contreras' approach: get rid of libbridge and use his minimalistic dspbridge ioctl wrapper, dsp_bridge, which he uses for gst-dsp. In order to train myself in these topics, I wrote a clone of TI's ping application using dsp_bridge instead, and later on I wrote the DSP socket node counterpart. Both are included in the dsp-samples repository.

Currently, the ping application is also included in MeeGo's dsp-tools repository.

Nevertheless, most of the DSP multimedia codecs are exposed, out of the box, through TI's OpenMAX IL implementation, and it depends on libbridge. For that reason I ripped libbridge out of the userspace-dspbridge repository and pushed it into a different repository, cleaned up its building machinery and removed other unneeded bits too.

Finally, I had to do the same for the OpenMAX IL, which is not only entangled with the internal building machinery, but is also not released through a git repository yet, using old-fashioned tarballs instead.

The future work will be to integrate gst-openmax into marmita and to try to participate in gst-dsp development. Also, FelipeC came up with the idea of rewriting libbridge in terms of dsp_bridge, a task that I have been exploring lately.

libgoo & gst-goo

Back in 2007 I started to work on integrating OpenMAX IL components into the GStreamer platform.

OpenMAX is a set of programming interfaces, in the C language, for portable multimedia processing. Specifically, the Integration Layer (IL) defines the interface to communicate with multimedia codecs implemented in hardware or software.

Texas Instruments started to work on an implementation of the OpenMAX IL for their DSP-accelerated codecs on the OMAP platform.

A quick and rough view of the software architecture implemented to achieve this processing is shown in the next diagram:

+---------------------+
| OpenMAX IL          |
+---------------------+
| libdspbridge        |
+---------------------+
| Kernel (DSP Bridge) |
+---------------------+

The DSP Bridge driver is a Linux kernel device driver designed to supply a direct link between the GPP program and the assigned DSP node. Basically, the features offered by the driver are:

  • Messaging: Ability to exchange fixed size control messages with DSP
  • Dynamic memory management: Ability to dynamically map files to DSP address space
  • Dynamic loading: Ability to dynamically load new nodes on DSP at run time
  • Power Management: Static and dynamic power management for DSP

libdspbridge is part of the user-space utilities of the DSP Bridge, whose purpose is to provide GPP programs with a simple programming interface to the driver services.

On the DSP side, using the C/C++ compiler for the C64x+ and the libraries contained in the user-space utilities, it is possible to compile a DSP program and package it as a DSP node, ready to be controlled by the DSP Bridge driver. Right now TI provides a set of out-of-the-box DSP multimedia codecs for non-commercial purposes; these nodes are contained in the tiopenmax package.

So, as I said before, my job was to wrap the OpenMAX IL components delivered by TI as a GStreamer plug-in. That way, a lot of existing multimedia consumers could use the hardware-accelerated codecs. But our team also did the testing of the delivered OpenMAX components.

After trying several approaches, we came to the conclusion that we needed a new layer of software which would:

  • Facilitate great test coverage of the components without the burden of the upper framework (GStreamer in this case).
  • Improve code reuse.
  • Use object-oriented programming through GObject.
  • Facilitate the workarounds for each component's bugs and the maintenance of those workarounds.
  • Provide a playground for experimenting with features such as (OpenMAX-specific) tunneling and the (TI-specific) DSP Audio Software Framework (DASF).

For those reasons we started to develop an intermediate layer called GOO (GObject OpenMAX).

+---------------------+
| GStreamer / gst-goo |
+---------------------+
| libgoo              |
+---------------------+
| OpenMAX             |
+---------------------+

libgoo is a C library that wraps OpenMAX using GObject. The following diagram shows part of its class hierarchy.

                           +--------------+
                           | GooComponent |
                           +--------------+
                                   |
+---------------+ +---------------+ +---------------+ +---------------+
| GooTiAudioEnc | | GooTiAudioDec | | GooTiVideoDec | | GooTiVideoEnc |
+---------------+ +---------------+ +---------------+ +---------------+
        |                 |                 |                 |
 +-------------+   +-------------+  +---------------+ +---------------+
 | GooTiAACEnc |   | GooTiAACDec |  | GooTiMpeg4Dec | | GooTiMpeg4Enc |
 +-------------+   +-------------+  +---------------+ +---------------+

At the top there is GooComponent, which represents any OpenMAX component. If the OMX IL implementation were neat and clean, there should be no need to add subclasses underneath it; you would just parametrize it, and it would be ready to use with any OMX IL component. But reality, as usual, is quite different: every implementation differs from the others; and to make it worse, each component within the same implementation might behave differently, which was the case with the TI implementation.
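
Just to give an idea of what parametrizing such a component means in GObject terms, here is a hypothetical, heavily simplified sketch, written with today's GObject convenience macros; the names and members are illustrative, not the real libgoo API:

/* Hypothetical, heavily simplified sketch in the spirit of GooComponent:
 * the base class wraps an OMX handle and exposes a hook that subclasses
 * override only when a quirky implementation needs it. Not the libgoo API. */
#include <glib-object.h>

#define GOO_TYPE_COMPONENT (goo_component_get_type ())
G_DECLARE_DERIVABLE_TYPE (GooComponent, goo_component, GOO, COMPONENT, GObject)

struct _GooComponentClass {
  GObjectClass parent_class;

  /* Hook through which a quirky OMX IL component can be adapted. */
  void (*configure_ports) (GooComponent *self);
};

G_DEFINE_TYPE (GooComponent, goo_component, G_TYPE_OBJECT)

static void
goo_component_class_init (GooComponentClass *klass)
{
  /* A neat and clean OMX IL would not need any override at all. */
  klass->configure_ports = NULL;
}

static void
goo_component_init (GooComponent *self)
{
  /* Here the base class would call OMX_GetHandle() with the component name
   * received as a construction property. */
}

This is essentially why the GooTi* subclasses in the diagram above exist: each of them carries the parameters and workarounds that one particular TI component needed.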

Finally, on top of libgoo there is gst-goo, the set of GStreamer elements which use the libgoo components. GstGoo also sketched some proofs of concept, such as ghost buffers (to be used with the OpenMAX interop profile), and dasfsink and dasfsrc (TI-specific).

In those days, before I moved to the GStreamer team, an old fellow, Felipe Contreras, worked on gomx, the precursor of libgoo, before he got an opportunity at Nokia and started to code on GstOpenMAX. An interesting point is that FelipeC is now pushing boldly for a new set of GStreamer elements which ditch OpenMAX and talk directly to the kernel's DSP Bridge: gst-dsp.

What’s the future of libgoo and GstGoo? I couldn’t say. Since I moved to Igalia, I have left its development. I’ve heard that a couple of companies showed some kind of interest in it; sadly, the current developers are very constrained by their TI workload.