Enabling HuC for SKL/KBL in Debian/testing

Recently, our friend Florent complained that it was impossible to set a constant bitrate when encoding H.264 using the low-power profile with gstreamer-vaapi.

Low-power (LP) profiles are VA-API entry points, available in Intel Skylake-based processors and successors, which provide video encoding with low power consumption.
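
As a quick sanity check, vainfo (from the vainfo Debian package) lists the entry points exposed by your VA-API driver; whether the low-power ones show up depends on your hardware, driver and libva versions, so take this grep only as a hint:

$ vainfo | grep EncSliceLP
# expect lines pairing an H.264 profile with VAEntrypointEncSliceLP if
# the low-power encoder is exposed by your driver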

Later on, Ullysses and Sree pointed out that CBR in LP is only possible if HuC is enabled in the kernel.

HuC is firmware, loaded by the i915 kernel module, designed to offload some media functions from the CPU to the GPU. One of these functions is bitrate control when encoding. HuC saves unnecessary CPU-GPU synchronization.

In order to load HuC, GuC must be loaded first. GuC is another Intel firmware, designed to perform graphics workload scheduling on the various graphics parallel engines.

How can we install and configure these firmwares to enable CBR in the low-power profile, among other things, in Debian/testing?

Check i915 parameters

First we shall confirm that our kernel and our i915 kernel module are capable of handling this functionality:

$ sudo modinfo i915 | egrep -i "guc|huc|dmc"
firmware:       i915/bxt_dmc_ver1_07.bin
firmware:       i915/skl_dmc_ver1_26.bin
firmware:       i915/kbl_dmc_ver1_01.bin
firmware:       i915/kbl_guc_ver9_14.bin
firmware:       i915/bxt_guc_ver8_7.bin
firmware:       i915/skl_guc_ver6_1.bin
firmware:       i915/kbl_huc_ver02_00_1810.bin
firmware:       i915/bxt_huc_ver01_07_1398.bin
firmware:       i915/skl_huc_ver01_07_1398.bin
parm:           enable_guc_loading:Enable GuC firmware loading (-1=auto, 0=never [default], 1=if available, 2=required) (int)
parm:           enable_guc_submission:Enable GuC submission (-1=auto, 0=never [default], 1=if available, 2=required) (int)
parm:           guc_log_level:GuC firmware logging level (-1:disabled (default), 0-3:enabled) (int)
parm:           guc_firmware_path:GuC firmware path to use instead of the default one (charp)
parm:           huc_firmware_path:HuC firmware path to use instead of the default one (charp)

Install firmware

$ sudo apt install firmware-misc-nonfree

UPDATE: In order to install this Debian package, you must have the non-free apt repository enabled in your sources list.
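
For reference, a sources list with non-free enabled looks roughly like this; the mirror and suite are only an example, adjust them to your setup and run apt update afterwards:

$ cat /etc/apt/sources.list
deb http://deb.debian.org/debian testing main contrib non-free
$ sudo apt update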

Verify the firmware files are installed:

$ ls -1 /lib/firmware/i915/
bxt_dmc_ver1_07.bin
bxt_dmc_ver1.bin
bxt_guc_ver8_7.bin
bxt_huc_ver01_07_1398.bin
kbl_dmc_ver1_01.bin
kbl_dmc_ver1.bin
kbl_guc_ver9_14.bin
kbl_huc_ver02_00_1810.bin
skl_dmc_ver1_23.bin
skl_dmc_ver1_26.bin
skl_dmc_ver1.bin
skl_guc_ver1.bin
skl_guc_ver4.bin
skl_guc_ver6_1.bin
skl_guc_ver6.bin
skl_huc_ver01_07_1398.bin

Update modprobe configuration

Edit or create the configuration file /etc/modprobe.d/i915.conf

$ sudo vim /etc/modprobe.d/i915.conf
....
$ cat /etc/modprobe.d/i915.conf
options i915 enable_guc_loading=1 enable_guc_submission=1
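
If i915 is loaded from the initramfs on your system, it may also be necessary to regenerate it so the new options are honored at early boot; running this is harmless even if it turns out not to be needed:

$ sudo update-initramfs -u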

Reboot

$ sudo systemctl reboot 

Verification

Now it is possible to verify that the i915 kernel module loaded the firmwares correctly by looking at the kernel logs:

$ journalctl -b -o short-monotonic -k | egrep -i "i915|dmr|dmc|guc|huc"
[   10.303849] miau kernel: Setting dangerous option enable_guc_loading - tainting kernel
[   10.303852] miau kernel: Setting dangerous option enable_guc_submission - tainting kernel
[   10.336318] miau kernel: i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[   10.338664] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_dmc_ver1_01.bin
[   10.339635] miau kernel: [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_01.bin (v1.1)
[   10.361811] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_huc_ver02_00_1810.bin
[   10.362422] miau kernel: i915 0000:00:02.0: firmware: direct-loading firmware i915/kbl_guc_ver9_14.bin
[   10.393117] miau kernel: [drm] GuC submission enabled (firmware i915/kbl_guc_ver9_14.bin [version 9.14])
[   10.410008] miau kernel: [drm] Initialized i915 1.6.0 20170619 for 0000:00:02.0 on minor 0
[   10.559614] miau kernel: snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[   11.937413] miau kernel: i915 0000:00:02.0: fb0: inteldrmfb frame buffer device

That means that HuC and GuC firmwares were loaded successfully.
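
We can also double-check that the module picked up our options by reading its runtime parameters through sysfs; both should report the value 1 set in /etc/modprobe.d/i915.conf (reading them may require root):

$ sudo cat /sys/module/i915/parameters/enable_guc_loading
1
$ sudo cat /sys/module/i915/parameters/enable_guc_submission
1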

Now we can check the firmware load status through debugfs:

$ sudo cat /sys/kernel/debug/dri/0/i915_guc_load_status
GuC firmware status:
        path: i915/kbl_guc_ver9_14.bin
        fetch: SUCCESS
        load: SUCCESS
        version wanted: 9.14
        version found: 9.14
        header: offset is 0; size = 128
        uCode: offset is 128; size = 142272
        RSA: offset is 142400; size = 256

GuC status 0x800330ed:
        Bootrom status = 0x76
        uKernel status = 0x30
        MIA Core status = 0x3

Scratch registers:
         0:     0xf0000000
         1:     0x0
         2:     0x0
         3:     0x5f5e100
         4:     0x600
         5:     0xd5fd3
         6:     0x0
         7:     0x8
         8:     0x3
         9:     0x74240
        10:     0x0
        11:     0x0
        12:     0x0
        13:     0x0
        14:     0x0
        15:     0x0
$ sudo cat /sys/kernel/debug/dri/0/i915_huc_load_status
HuC firmware status:
        path: i915/kbl_huc_ver02_00_1810.bin
        fetch: SUCCESS
        load: SUCCESS
        version wanted: 2.0
        version found: 2.0
        header: offset is 0; size = 128
        uCode: offset is 128; size = 218304
        RSA: offset is 218432; size = 256

HuC status 0x00006080:

Test GStreamer

$ gst-launch-1.0 videotestsrc num-buffers=1000 ! video/x-raw, format=NV12, width=1920, height=1080, framerate=\(fraction\)30/1 ! vaapih264enc bitrate=8000 keyframe-period=30 tune=low-power rate-control=cbr ! mp4mux ! filesink location=test.mp4
Setting pipeline to PAUSED ...
Pipeline is PREROLLING ...
Got context from element 'vaapiencodeh264-0': gst.vaapi.Display=context, gst.vaapi.Display=(GstVaapiDisplay)"\(GstVaapiDisplayGLX\)\ vaapidisplayglx0";
Pipeline is PREROLLED ...
Setting pipeline to PLAYING ...
New clock: GstSystemClock
Got EOS from element "pipeline0".
Execution ended after 0:00:11.620036001
Setting pipeline to PAUSED ...
Setting pipeline to READY ...
Setting pipeline to NULL ...
Freeing pipeline ...
$ gst-discoverer-1.0 test.mp4 
Analyzing file:///home/vjaquez/gst/master/intel-vaapi-driver/test.mp4
Done discovering file:///home/vjaquez/test.mp4

Topology:
  container: Quicktime
    video: H.264 (High Profile)

Properties:
  Duration: 0:00:33.333333333
  Seekable: yes
  Live: no
  Tags: 
      video codec: H.264 / AVC
      bitrate: 8084005
      encoder: VA-API H264 encoder
      datetime: 2017-12-07T14:29:23Z
      container format: ISO MP4/M4A

Mission accomplished!


A GStreamer Video Sink using KMS

The purpose of this blog post is to show the concepts related to GstKMSSink, a new video sink for GStreamer 1.0, co-developed by Alessandro Decina and myself during my hack-fest time in Igalia's multimedia team.

One interesting thing to notice is that this element shows it is possible to write DRI clients without the burden of the X Window System.

Brief introduction to graphics in Linux

If you want to dump images onto your screen, you can simply use the frame buffer device. It provides an abstraction for the graphics hardware and represents the frame buffer of the video hardware. This kernel device allows user applications to access the graphics hardware without knowing the low-level details [1].
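
A quick and dirty way to see the frame buffer device in action, assuming it is exposed as /dev/fb0 and that you run this from a virtual terminal with write permission to the device, is to dump raw bytes straight into it and to query its configuration with fbset:

$ sudo sh -c 'cat /dev/urandom > /dev/fb0'   # fills the screen with colored noise
$ fbset -i                                   # prints the current geometry, depth and timings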

In GStreamer, we have two options for displaying images using the frame buffer device; or three, if we use OMAP3: fbvideosink, fbdevsink and gst-omapfb.
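
For example, a minimal test pipeline over the frame buffer device could look like the following; it assumes fbdevsink is available in your gst-plugins-bad build and that /dev/fb0 is writable by your user:

$ gst-launch-1.0 videotestsrc ! videoconvert ! fbdevsink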

Nevertheless, since the appearance of GPUs, the frame buffer device interface has not been sufficient to fulfill all their capabilities. A new kernel interface had to emerge. And that was the Direct Rendering Manager (DRM).

What in the hell is DRM?

The DRM layer is intended to support the needs of complex graphics devices, usually containing programmable pipelines well suited to 3D graphics acceleration [2]. It deals with [3]:

  1. A DMA queue for graphics buffer transfers [4].
  2. Locks for the graphics hardware, treating it as a shared resource for simultaneous 3D applications [5].
  3. Secure hardware access, preventing clients from escalating privileges [6].

The DRM layer consists of two in-kernel drivers: a generic DRM driver, and another which has specific support for the video hardware [7]. This is possible because the DRM engine is extensible, enabling the device-specific driver to hook out those functionalities that are required by the hardware. For example, in the case of the Intel cards, the Linux kernel driver i915 supports this card and couples its capabilities to the DRM driver.

The device-specific driver, in particular, should cover two main kernel interfaces: Kernel Mode Setting (KMS) and the Graphics Execution Manager (GEM). Both elements are also exposed to user-space through the DRM.

With KMS, the user can ask the kernel to enable the native resolution in the frame buffer, setting a certain display resolution and colour depth mode. One of the benefits of doing it in the kernel is that, since the kernel is in complete control of the hardware, it can switch back in the case of failure [8].

In order to allocate command buffers, cursor memory, scanout buffers, etc., the device-specific driver should support a memory manager, and GEM is the manager with the most acceptance these days, because of its simplicity [9].

Besides graphics memory management, GEM ensures conflict-free sharing of data between applications by managing memory synchronization. This is important because modern graphics hardware is essentially a NUMA environment.

The following diagram shows the component view of the DRM layer:

[Diagram: Direct Rendering Infrastructure]

What is the deal with KMS?

KMS is important because GEM and DRM rely on it to allocate frame buffers and to configure the display. And it is important to us because almost all of the ioctls called by the GStreamer element are part of the KMS subset.

Even more, there are some voices saying that KMS is the future replacement for the frame buffer device [10].

To carry out its duties, KMS identifies five main concepts [11,12] (see the modetest example after the list):

Frame buffer:
The frame buffer is just a buffer, in the video memory, that has an image encoded in it as an array of pixels. As KMS configures the ring buffer in this video memory, it holds the information of this configuration, such as width, height, color depth, bits per pixel, pixel format, and so on.
CRTC:
Stands for Cathode Ray Tube Controller. It reads the data out of the frame buffer and generates the video mode timing. The CRTC also determines what part of the frame buffer is read; e.g., when multi-head is enabled, each CRTC scans out of a different part of the video memory; in clone mode, each CRTC scans out of the same part of the memory. Hence, from the KMS perspective, the CRTC abstraction contains the display mode information, including resolution, depth, polarity, porch, refresh rate, etc. Also, it has the information of the buffer region to display and when to change to the next frame buffer.
Overlay planes:
Overlays are treated a little like CRTCs, but without associated modes or encoder trees hanging off of them: they can be enabled with a specific frame buffer attached at a specific location, but they don't have to worry about mode setting, though they do need an associated CRTC to actually pump their pixels out [13].
Encoder:
The encoder takes the digital bitstream from the CRTC and converts it to the appropriate format across the connector to the monitor.
Connector:
The connector provides the appropriate physical plug for the monitor to connect to, such as HDMI, DVI-D, VGA, S-Video, etc.
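
All these objects can be inspected from the command line with modetest, a small tool shipped with libdrm (package libdrm-tests in Debian). The invocations below are only a sketch: run them from a virtual terminal, outside X, so the tool can talk to the DRM device:

$ sudo modetest -M i915                            # list connectors, encoders, CRTCs, planes and frame buffers
$ sudo modetest -M i915 -s <connector_id>:<mode>   # set a mode and show a test pattern, using ids from the listing above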

And what about this KMSSink?

KMSSink is a first approach towards a video sink as a DRI client. For now it only works on the PandaBoard with a recent kernel (I guess 3.3 would do it).

For now it only uses custom non-tiled buffers and uses an overlay plane to display them. So adding support for more hardware is on the to-do list.
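
Assuming the element is built and registered as kmssink, and that nothing else is holding the DRM device, a minimal smoke test would be something like the line below; this is an untested sketch, since the caps that actually negotiate depend on what the overlay plane accepts:

$ gst-launch-1.0 videotestsrc ! kmssink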

Bibliography

[1] http://free-electrons.com/kerneldoc/latest/fb/framebuffer.txt
[2] http://free-electrons.com/kerneldoc/latest/DocBook/drm/drmIntroduction.html
[3] https://www.kernel.org/doc/readme/drivers-gpu-drm-README.drm
[4] http://dri.sourceforge.net/doc/drm_low_level.html
[5] http://dri.sourceforge.net/doc/hardware_locking_low_level.html
[6] http://dri.sourceforge.net/doc/security_low_level.html
[7] https://en.wikipedia.org/wiki/Direct_Rendering_Manager
[8] http://www.bitwiz.org.uk/s/how-dri-and-drm-work.html
[9] https://lwn.net/Articles/283798/
[10] http://phoronix.com/forums/showthread.php?23756-KMS-as-anext-gen-Framebuffer
[11] http://elinux.org/images/7/71/Elce11_dae.pdf
[12] http://www.botchco.com/agd5f/?p=51
[13] https://lwn.net/Articles/440192/

GStreamer video decoder for SysLink

This blog post is another one in our series about SysLink (1, 2): finally I came up with a usable GStreamer element for video decoding, which talks directly with the SysLink framework in the OMAP4 kernel.

As we stated before, SysLink is a set of kernel modules that enables the initialization of remote processors, in a multi-core system (which might be heterogeneous), that run their own operating systems in their own memory space; SysLink also enables communication between the host processor and the remote ones. This software and hardware setup can be viewed as an Asymmetric Multi-Processing model.

TI provides a user-space library to access the SysLink services, but I find its implementation a bit clumsy, so I took the challenge of rewriting part of it, in a simple and straightforward fashion, as gst-dsp does for DSP/Bridge. The result is the interface syslink.h.

Simultaneously, I wrote the utility to load and monitor the operating system on the Cortex-M3 processors of the PandaBoard. This board, like all OMAP4-based SoCs, has two ARM Cortex-M3 cores as remote processors. Hence this so-called daemon.c is in charge of loading the firmware images, setting the processor in its running state, allocating the interchange memory areas, and monitoring for any error message.

In order to load the image files into the processors' memory areas, it is required to parse the ELF header of the files, and that is why I decided to depend on libelf rather than write another ELF parser. Yes, one sad dependency for the daemon. The use of libelf is isolated in elf.h.

When I was developing the daemon, for debugging purposes, I needed to trace the messages generated by the images in the remote processors. For that reason I wrote tracer.c, whose only responsibility is to read and to parse the ring buffer used by the images, in the remote processors, for logging.

Now, in OMAP4, the subsystem comprising the two Cortex-M3 processors is called Ducati. The first processor is used only for the exchange of notification messages between the host processor and the second M3 processor, where all the multimedia processing is done.

There are at least two images for the second Cortex-M3 processor: DOMX, which is closed source and focused, as far as I know, on the OMX IL interface; and, on the other hand, DCE, which is open source, developed by Rob Clark, and provides a simple buffer-interchange interface.

My work uses DCE, obviously.

But, please, let me go back one step in this component description: in order to send and receive commands between the host processor and a remote processor, SysLink uses a packet-based protocol called Remote Command Messaging, or just RCM for friends. There are two types of RCM interfaces: the client and the server. The client interface is used by the applications running on the host processor to request services from the server interface, which is exposed by the systems running on the remote processors; it accepts the requests and returns the results.

The RCM client interface is in rcm.h.

Above the RCM client sits my dce.h interface, which is in charge of controlling the state of the video decoders and also of handling the buffers.

But these buffers are tricky. They are not memory areas allocated by a simple malloc; instead they are buffers allocated by a mechanism in the kernel called tiler. The purpose of this mechanism is to provide buffers capable of 2D operations in hardware (in other words, cheap and quick computations). These buffers are shared along the whole processing pipeline, so copies of memory areas are not needed. Of course, in order to achieve this paradise, the video renderer must handle this type of buffer too.

In my code, the interface to the tiler mechanism is in tiler.h.

And finally, the almighty GStreamer element for video decoding: gstsyslinkvdec.c! Following the spirit of gst-dsp, this element is intended to deal with all the video decoders available in the DCE image, although for now H.264 decoding is the only one handled.

For now, I have only tested the decoder with fakesink, because the element pushes tiled buffers onto its source pad, and an efficient video player requires a video renderer that handles this type of tiled buffer. TI is developing one, pvrvideosink, but it depends on EGL, and I would like to avoid X whenever possible.
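
For reference, my smoke test looks roughly like the pipeline below. Take it as a hypothetical sketch: the element name syslinkvdec is my shorthand for what gstsyslinkvdec.c registers, and the GStreamer 0.10 syntax assumes that is the version this code targets:

$ gst-launch-0.10 filesrc location=sample.mp4 ! qtdemux ! syslinkvdec ! fakesink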

I have not measured the performance of this work compared with TI's combo (syslink user-space library / memmgr / libdce / gst-ducati), but I suspect that my approach would be a little more efficient, faster, and, at least, simpler 😉

The sad news is that, as in every fast-paced development, all these kernel mechanisms are already deprecated: SysLink and DMM-Tiler will never be mainlined into the kernel, but their successors, rproc/rpmsg and omapdrm, have a good chance. And both take a very different approach from their predecessors. Nevertheless, SysLink is already here and it is being used widely, so this effort has an opportunity to be worthy.

My current task is to decide if I should drop the 2D buffers in the video decoders or if I should develop a video renderer for them.

OpenMAX: a rant

I used to work with OpenMAX a while ago. I was exploring an approach to wrap OpenMAX components with GStreamer elements. The result was gst-goo. Its purpose was to test the OpenMAX components in complex scenarios, and it was focused only on the Texas Instruments OpenMAX implementation for the OMAP3 processor.

Some time after we started gst-goo, Felipe Contreras released gst-openmax, which had a more open development, but with a hard set of performance objectives such as zero-copy. Also, only two implementations were supported at that moment: Bellagio and the TI one mentioned before.

Recently, Sebastian Dröge has been working on a redesign of gst-openmax, called gst-omx. He explained the rationale behind this new design in his talk at the GStreamer Conference 2011. If you are looking for a good summary of the problems faced when wrapping OpenMAX with GStreamer, because of their semantic impedance mismatch, you should watch his talk.

In my opinion, the key purpose of OpenMAX is to provide a common application interface to a set of different and heterogeneous multimedia components: you could take different implementations, offering hardware-accelerated codecs or other specialized ones, and build portable multimedia applications on top. But this objective has failed utterly: every vendor delivers an implementation incompatible with the others available. One of the causes, as Dröge explained, is the specification itself: it is too ambiguous and open to interpretation.

From my perspective, the problem arises from the need for a library like OpenMAX. It is needed because the implementer wants to hide (or to abstract, if you prefer) the control and buffer management of his codec entities. By hiding this, the implementer has the freedom to develop his own stack behind closed doors, without any kind of external review.

In order to explain the problem brought by the debauchery behind OpenMAX, let me narrow the scope: I will not fall into the trap of portability among different operating systems, especially non-Unix ones. Even more, I will only focus on the ARM architecture of the Linux kernel. Thus, I will not consider the software-based codecs, only the hardware-accelerated ones. The reason upholding these constraints is that, besides PacketVideo's OpenCORE, I am not aware of any other successful set of non-Unix / software-based multimedia codecs interfaced with OpenMAX.

As new and more complex hardware appears, with its own processing units capable of off-loading the main processor, silicon vendors must also deliver the kernel drivers to operate them. This problem is very recurrent among the ARM vendors, where the pursuit of added value gives the competitive advantage, and the Linux kernel has the virtues required for a fast time to market.

But these virtues have turned into a burden: excessive churn has been observed in the ARM architecture, along with duplicated code, board-specific data encoded in source files, and conflicts during kernel code integration. In other words, every vendor has built up their own software stack without taking care of developing common interfaces among all of them. And this has been particularly true for the hardware-accelerated multimedia components, where OpenMAX promised to the user-space developers what the kernel developers could not achieve.

First we need a clear and unified kernel interface for hardware-accelerated multimedia codecs, so the user-space implementations can be straight, lean and clean. Those implementations could be OpenMAX, GStreamer, libav, or whatever we possibly want and need.

But there is hope. Recently there has been a lot of effort to bring new abstractions and common interfaces to the ARM architecture, so in the future we can expect integrated interfaces for all this new hardware, independently of the vendor.

Though, from my perspective, if we reach this point (and we will), we will have less motivation for a library like OpenMAX, because a high-level library such as GStreamer would cover a lot of hardware within a single element. Hence, it is a bit pointless to invest too much in OpenMAX or its wrappers nowadays.

Of course, if you think that I made a mistake along these reasons, I would love to read your comments.

And last but not least, Igalia celebrates its 10th anniversary! Happy igalian is happy 🙂