24 May 2013

Weston on Raspberry Pi Accelerated

Raspberry Pi is a nice tiny computer with a relatively powerful VideoCore graphics processor, and an ARM core bolted on the side running Linux. Around October 2012 I was bringing Wayland to it, and in November the Weston rpi-backend was merged upstream. Unfortunately, I somehow never got around to writing about it. In spring 2013 I did a follow-on project on the rpi-backend for the Raspberry Pi Foundation, as part of my work for Collabora. We are now really pushing Wayland forward on the Raspberry Pi, and strengthening Collabora's Wayland expertise on all fronts. In the following I will explain what I did and how the new rpi-backend for Weston works, in technical terms. If you are more interested in why this was done, I refer you to the excellent post by Daniel Stone: Weston on Raspberry Pi.

Bringing Wayland to Raspberry Pi in 2012

Raspberry Pi has EGL and GL ES 2 support, so the easiest way to bring Wayland to it was to port Weston. Fortunately, unlike most Android-based devices, Raspberry Pi supports normal Linux distributions, specifically Raspbian, a variant of Debian. That means a very standard Linux desktop stack, which is easy to target. Therefore I only had to write a new Raspberry Pi specific backend for Weston. I could not use any existing backend, because the graphics stack supports neither DRM nor GBM, and running on top of an (fbdev) X server would have defeated the whole point. No other usable backends existed at the time.

The proprietary graphics API on the RPi is Dispmanx. Dispmanx basically offers a full 2D compositor, but since Weston composited with GL ES 2, I only needed enough Dispmanx to get a full-screen surface for EGL. Half of the patch was just boilerplate to support input and VT handling. All that was fairly easy, but it left the Dispmanx API largely unused and did not tap into the real performance of the VideoCore. Sure, GL ES 2 is accelerated on the VideoCore, too, but it is a much more complex API.
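
For a flavor of what that takes, here is a minimal sketch of creating the full-screen Dispmanx element and wrapping it as a native window for EGL; error handling is omitted, and the actual backend code is structured differently:

    #include <bcm_host.h>
    #include <EGL/egl.h>

    /* A minimal sketch: create a full-screen Dispmanx element and wrap
     * it as the native window for eglCreateWindowSurface(). Assumes
     * bcm_host_init() has already been called. */
    static EGL_DISPMANX_WINDOW_T
    create_fullscreen_window(uint32_t width, uint32_t height)
    {
        DISPMANX_DISPLAY_HANDLE_T display = vc_dispmanx_display_open(0);
        DISPMANX_UPDATE_HANDLE_T update = vc_dispmanx_update_start(0);
        EGL_DISPMANX_WINDOW_T window;
        VC_RECT_T dst, src;

        vc_dispmanx_rect_set(&dst, 0, 0, width, height);
        /* The source rectangle is in 16.16 fixed point. */
        vc_dispmanx_rect_set(&src, 0, 0, width << 16, height << 16);

        window.element = vc_dispmanx_element_add(update, display,
                                                 0 /* layer */,
                                                 &dst, 0 /* src resource */,
                                                 &src,
                                                 DISPMANX_PROTECTION_NONE,
                                                 NULL /* alpha */,
                                                 NULL /* clamp */,
                                                 DISPMANX_NO_ROTATE);
        window.width = width;
        window.height = height;
        vc_dispmanx_update_submit_sync(update);

        return window;
    }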

I continued by taking more advantage of the hardware compositor Dispmanx exposes. At the time, the way to do that was to implement support for Weston planes. Weston planes were developed for taking advantage of overlay hardware: a backend can take suitable surfaces out of the scenegraph and composite them directly in hardware, bypassing Weston's GL ES 2 renderer. A major motivation behind planes was to offload video display to dedicated hardware, avoiding YUV-RGB color conversion and scaling in GL shaders. Planes also allow the use of hardware cursors.

The hardware compositor on the RPi is partially firmware-based, which means it does not have a constant limit on the number of overlays. Standard PC hardware has at most a few overlays, if any, the hardware cursor included. The RPi hardware, however, offers a lot more. In fact, it is possible to assign all surfaces to overlay elements. That is what I implemented, and in the ideal case (no surface transformations) I managed to put everything into overlay elements, leaving the GL renderer with nothing to do.

The hardware compositor does have its limitations. It can do alpha blending, but it cannot rotate surfaces. It also has a limit on how many elements it can handle, but the actual number depends on many things. Therefore, I implemented an automatic fallback to the GL renderer. The Weston plane infrastructure made that very easy.

The fallback had some serious downsides, though. There was no way to synchronize all the overlay elements with the GL rendering, so switches between fallback and overlays caused glitches. What is worse, memory consumption exploded. We only support wl_shm buffers, which need to be copied into GL textures and Dispmanx resources (hardware buffers). As surfaces could jump between GL and overlays arbitrarily, and I did not want to copy each attached buffer into both a texture and a resource, I had to keep the wl_shm buffer around, so that the contents could be copied to the other destination whenever a surface jumped. That means clients are double-buffered, as they do not get a buffer back until they send a new one. In Dispmanx, the elements, too, need to be double-buffered to ensure there cannot be glitches, so each element needed two resources. In total, that is 2 wl_shm buffers, 1 GL texture, and 2 resources: 5 surface-sized buffers for every surface! But it worked.

The first project ended, and time passed. Weston got the pixman-renderer, and the renderer interfaces matured. EGL and GL were decoupled from the Weston core. This made the next project possible.

Introducing the Rpi-renderer in Spring 2013

Since Dispmanx offers a full hardware compositor, it was decided to drop the GL renderer from Weston's rpi-backend. We lose arbitrary surface transformations like rotation, but in all other aspects it is a win: memory usage, glitches, code and APIs, and presumably performance and power consumption. Dispmanx supports scaling, output transforms, and an alpha channel mixed with full-surface alpha. There are no glitches, as we no longer jump between GL and overlays. All on-screen elements can be properly synchronized. Clients are able to use single buffering. The Weston renderer API is more complete than the plane API. We do not need to manipulate complex GL state and create vertex buffers, or run the geometry decimation code; we only compute clips, positions, and sizes.

The rpi-backend's plane code already had all the essential Dispmanx bits needed to implement the rpi-renderer, so much of the code was already there. It took me less than a week to kick out the GL renderer and have the rpi-renderer show the desktop for the first time. The rest of the month was spent adding features and fixing issues.

Graphics Details

Rpi-backend

The rpi-renderer and rpi-backend are tied together, since they both need to do their part in driving the Dispmanx API. The rpi-backend does all the usual stuff like opening evdev input devices, and it initializes Dispmanx. It configures a single output and manages its updates. The repaint callback for the output starts a Dispmanx update cycle, calls into the rpi-renderer to "draw" all surfaces, and then submits the update.
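
In outline, the repaint callback looks something like this (a sketch; the struct and helper names here are illustrative, not the real ones):

    /* Sketch of the output repaint cycle. rpi_output and
     * to_rpi_output() are illustrative names. */
    static int
    rpi_output_repaint(struct weston_output *base, pixman_region32_t *damage)
    {
        struct rpi_output *output = to_rpi_output(base);
        struct weston_compositor *compositor = base->compositor;

        /* Open an update batch: all element changes made below become
         * visible atomically when the batch is submitted. */
        output->update = vc_dispmanx_update_start(0);

        /* The renderer "draws" by assigning resources to elements. */
        compositor->renderer->repaint_output(base, damage);

        /* Submit asynchronously; Dispmanx calls back from another
         * thread when the update is on screen. */
        vc_dispmanx_update_submit(output->update,
                                  rpi_flippipe_update_complete,
                                  &output->flippipe);

        return 0;
    }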

Update submission is asynchronous: Dispmanx invokes a callback in a different thread when the update has completed and is on screen, synchronized to vblank. Using a thread is slightly inconvenient, since it does not plug into Weston's event loop directly. Therefore I use a trick: rpi_flippipe is essentially a pipe, a pair of file descriptors connected together. Write something into one end, and it pops out the other end. The callback rpi_flippipe_update_complete(), which Dispmanx calls in a different thread, only records the current timestamp and writes it to the pipe. The other end of the pipe is registered with Weston's event loop, so eventually rpi_flippipe_handler() gets called in the right thread context, and we can actually handle the completion by calling rpi_output_update_complete().
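
Condensed, the trick looks like this (a sketch with details trimmed; the real code also deals with errors and cleanup):

    /* rpi_flippipe in a nutshell: ferry the completion event from the
     * Dispmanx callback thread into Weston's main event loop.
     * Needs <unistd.h> and <time.h>. */
    struct rpi_flippipe {
        int readfd;
        int writefd;
        struct wl_event_source *source;
    };

    /* Runs in a Dispmanx thread when the update is on screen. */
    static void
    rpi_flippipe_update_complete(DISPMANX_UPDATE_HANDLE_T update, void *data)
    {
        struct rpi_flippipe *pipe = data;
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        write(pipe->writefd, &ts, sizeof ts);
    }

    /* Runs in the main thread, via Weston's event loop watching readfd. */
    static int
    rpi_flippipe_handler(int fd, uint32_t mask, void *data)
    {
        struct rpi_output *output = data;
        struct timespec ts;

        if (read(fd, &ts, sizeof ts) == sizeof ts)
            rpi_output_update_complete(output, &ts);

        return 1;
    }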

Rpi-renderer

Weston's renderer API is pretty small (a sketch of the interface follows this list):
  • There are hooks for surface create and destroy, so you can track per-surface renderer private state.
  • The attach hook is called when a new buffer is committed to a surface.
  • The flush_damage hook is called only for wl_shm buffers, when the compositor is preparing to composite a surface. That is where e.g. GL texture updates happen in the GL renderer; doing the update here instead of on every commit avoids needless work when the surface is not currently on screen.
  • The surface_set_color callback informs the renderer that this surface will not be getting a buffer, but instead it must be painted with the given color. This is used for effects, like desktop fade-in and fade-out, by having a black full-screen solid color surface whose alpha channel is changed.
  • The repaint_output is the workhorse of a renderer. In Weston core, weston_output_repaint() is called for each output when the output needs to be repainted. That calls into the backend's output repaint callback, which then calls the renderer's hook. The renderer then iterates over all surfaces in a list, painting them according to their state as needed.
  • Finally, the read_pixels hook is for screen capturing.
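
Slightly simplified, and from memory of the interface as it was around Weston 1.1, the hooks bundle up like this (exact signatures have varied between Weston versions; the surface create/destroy handling is left out):

    /* The renderer interface, slightly simplified. */
    struct weston_renderer {
        int (*read_pixels)(struct weston_output *output,
                           pixman_format_code_t format, void *pixels,
                           uint32_t x, uint32_t y,
                           uint32_t width, uint32_t height);
        void (*repaint_output)(struct weston_output *output,
                               pixman_region32_t *output_damage);
        void (*flush_damage)(struct weston_surface *surface);
        void (*attach)(struct weston_surface *es, struct wl_buffer *buffer);
        void (*surface_set_color)(struct weston_surface *surface,
                                  float red, float green,
                                  float blue, float alpha);
        void (*destroy)(struct weston_compositor *ec);
    };
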
The rpi-renderer per-surface state is struct rpir_surface. Among other things, it contains a handle to a Dispmanx element (essentially an overlay) that shows this surface, and two Dispmanx resources (hardware pixel buffers); the front and the back. To show a picture, a resource is assigned to an element for scanout.
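
In outline, and with the fields abbreviated, that state looks like this:

    /* Per-surface renderer state, in outline; field names abbreviated. */
    struct rpir_surface {
        struct weston_surface *surface;
        DISPMANX_ELEMENT_HANDLE_T handle;  /* the on-screen element */
        DISPMANX_RESOURCE_HANDLE_T front;  /* resource being scanned out */
        DISPMANX_RESOURCE_HANDLE_T back;   /* resource receiving new pixels */
        struct wl_buffer *buffer;          /* wl_shm buffer, held only
                                              until flush_damage */
    };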

The attach callback basically only takes a reference to the given wl_shm buffer. When Weston core starts an output repaint cycle, it calls flush_damage, where the buffer contents are copied to the back resource. Damage is tracked so that, in theory, only the changed parts of the buffer are copied. In reality, the implementation of vc_dispmanx_resource_write_data() does not support arbitrary sub-region updates, so we are forced to copy full scanlines with the same stride the resource was created with. If the stride does not match, the resource is reallocated first. Then flush_damage drops the wl_shm buffer reference, allowing the compositor to release the buffer so the client can continue single-buffered. The pixels are saved in the back resource.
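
The core of the copy could be sketched as below, assuming a 32-bit wl_shm format; how the damage region collapses into a run of scanlines is simplified away:

    /* Sketch: copy a run of full scanlines from a wl_shm buffer into
     * the back resource. The source pitch must equal the pitch the
     * resource was created with. */
    static void
    copy_scanlines_to_resource(DISPMANX_RESOURCE_HANDLE_T back,
                               void *pixels, int pitch, int width,
                               int first_line, int n_lines)
    {
        VC_RECT_T rect;

        /* Full-width rows only; x must be 0, per the limitation
         * described above. */
        vc_dispmanx_rect_set(&rect, 0, first_line, width, n_lines);

        vc_dispmanx_resource_write_data(back, VC_IMAGE_ARGB8888,
                                        pitch, pixels, &rect);
    }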

Copying the buffer also involves another quirk. Even though the Dispmanx API allows defining an image with a pre-multiplied alpha channel and mixing that with a full-surface (element) alpha, a hardware issue causes it to produce wrong results. Therefore we cannot use pre-multiplied alpha, since we want the full-surface alpha to work. This is solved by setting the magic bit 31 of the pixel format argument, which causes vc_dispmanx_resource_write_data() to un-pre-multiply, that is, divide by, the alpha channel using the VideoCore. The contents of the resource end up not pre-multiplied, and mixing with full-surface alpha works.
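
In code, that is just a flag OR'd into the format argument of the copy above; something like:

    /* Magic bit 31: ask the VideoCore to un-pre-multiply the alpha
     * while writing the pixels into the resource. The bit has no
     * symbolic name in the headers, as far as I know. */
    #define PREMULT_ALPHA_FLAG (1u << 31)

    vc_dispmanx_resource_write_data(back,
                                    VC_IMAGE_ARGB8888 | PREMULT_ALPHA_FLAG,
                                    pitch, pixels, &rect);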

The repaint_output callback first recomputes the output transformation matrix, since Weston core computes it in the GL coordinate system while we use, more or less, framebuffer coordinates. Then the rpi-renderer iterates over all surfaces in the repaint list. If a surface is completely obscured by opaque surfaces, its Dispmanx element is removed. Otherwise, the element is created as necessary and updated to the new front resource. The element's source and destination pixel rectangles are computed from the surface state, and clipped by the resource and the output size. Output transformation is also taken into account. If the destination rectangle turns out empty, the element is removed, because every existing Dispmanx element costs VideoCore cycles, and it is best to use as few elements as possible. Finally, the new state is set on the Dispmanx element.
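
Updating one element then boils down to a couple of Dispmanx calls, roughly like this (a sketch; the clipping and change flags are simplified):

    /* Sketch: point the element at the new front resource and update
     * its geometry. The source rectangle is in 16.16 fixed point. */
    static void
    rpir_element_update(DISPMANX_UPDATE_HANDLE_T update,
                        DISPMANX_ELEMENT_HANDLE_T element,
                        DISPMANX_RESOURCE_HANDLE_T front,
                        const VC_RECT_T *src, const VC_RECT_T *dst,
                        int32_t layer)
    {
        vc_dispmanx_element_change_source(update, element, front);
        vc_dispmanx_element_change_attributes(update, element,
                                              ELEMENT_CHANGE_LAYER |
                                              ELEMENT_CHANGE_DEST_RECT |
                                              ELEMENT_CHANGE_SRC_RECT,
                                              layer, 255 /* opacity */,
                                              dst, src,
                                              0 /* mask */,
                                              DISPMANX_NO_ROTATE);
    }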

After all surfaces in the repaint list are handled, rpi_renderer_repaint_output() goes over all other Dispmanx elements on screen and removes them. This makes sure that a surface that was hidden, and therefore is not in the repaint list, really gets removed from the screen. Then execution returns to the rpi-backend, which submits the whole update in a single batch.
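
The cleanup pass itself is a simple walk over the leftover elements (the list and field names here are illustrative):

    /* Sketch: any element still on the old list was not visited during
     * repaint, so its surface is no longer visible; remove it. */
    struct rpir_surface *rs, *tmp;

    wl_list_for_each_safe(rs, tmp, &renderer->old_element_list, link) {
        vc_dispmanx_element_remove(update, rs->handle);
        rs->handle = DISPMANX_NO_HANDLE;
        wl_list_remove(&rs->link);
    }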

Once the update completes, the rpi-backend calls rpi_renderer_finish_frame(), which releases unneeded Dispmanx resources and destroys orphaned per-surface state. These operations cannot be done any earlier, since we need to be sure the related Dispmanx elements have really been updated or removed, to avoid visual glitches.

The rpi-renderer implements surface_set_color by allocating a 1×1 Dispmanx resource, writing the color into that single pixel, and then scaling it to the required size in the element. Dispmanx also offers a screen capturing function, which stores a snapshot of the output into a resource.
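
A sketch of the color resource allocation, assuming ARGB8888 and ignoring any stride padding the hardware may require:

    /* Sketch of surface_set_color: one pixel holds the color, and the
     * element scales it to the surface size. */
    static DISPMANX_RESOURCE_HANDLE_T
    create_color_resource(float r, float g, float b, float a)
    {
        DISPMANX_RESOURCE_HANDLE_T resource;
        uint32_t native_handle;
        uint32_t pixel;
        VC_RECT_T rect;

        /* Pack the color as ARGB8888, not pre-multiplied. */
        pixel = ((uint32_t)(a * 255.0f) << 24) |
                ((uint32_t)(r * 255.0f) << 16) |
                ((uint32_t)(g * 255.0f) << 8) |
                 (uint32_t)(b * 255.0f);

        resource = vc_dispmanx_resource_create(VC_IMAGE_ARGB8888,
                                               1, 1, &native_handle);
        vc_dispmanx_rect_set(&rect, 0, 0, 1, 1);
        vc_dispmanx_resource_write_data(resource, VC_IMAGE_ARGB8888,
                                        sizeof pixel /* pitch */,
                                        &pixel, &rect);
        return resource;
    }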

Conclusion

While losing some niche features, we gained a lot by pushing all compositing into the VideoCore and the firmware. Memory consumption is now down to a reasonable level of three buffers per surface, or just two if you force single-buffering of the Dispmanx elements. Two is on par with Weston's GL renderer on DRM. We leverage the 2D hardware directly for compositing, which should perform better. Glitches and jerks should be gone. You may still be able to make the compositing malfunction by opening too many windows; instead of the compositor becoming slow, you get bad output on screen, which is probably the only downside here. "Too many" is perhaps around 20 or more windows visible at the same time, depending on the circumstances.

If the user experience of Weston on the Raspberry Pi was smooth before, especially compared to X (see the video), it is even smoother now. Just try the desktop zoom (Win+MouseWheel), for instance! Also, my fellow Collaborans wrote some new desktop effects for Weston during this project. Should your company need assistance with Wayland, Collabora is here to help.

The code is available in the git branch raspberrypi-dispmanx and on the Wayland mailing list. As of May 23rd, 2013, the Raspberry Pi specific patches are already merged upstream, and the demo candy patches are waiting for review.

Further related links:
Raspberry Pi Foundation, Wayland preview
Collabora, press release

7 comments:

capisce said...

Informative write up, thanks :)

How big are the performance differences between compositing using Dispmanx and using GLES?

pq said...

Hi capisce,

I never benchmarked it, really. The only thing we did check is that we can get 60 fps with the new bouncing square demo with some static windows on screen. From what I recall, subjectively the difference was not too noticeable, but that of course says nothing about how much headroom is left for application CPU and GPU usage.

Unknown said...

Hi,

I am very interested in Wayland/Weston on Raspberry Pi. Could you give me an update on the development and present state of implementation on the Pi?

Thank you!

Henno said...

+1 for what Örs said.

pq said...

Hi Örs and Henno,

at Collabora we have lots of work going on with the Pi, which will come out in time. I do not want to spoil the PR value for our client, so I probably can't say much, but of course I can tell about the stuff that is already public.

The biggest thing waiting is probably the EGL Wayland support at https://github.com/raspberrypi/userland/pull/110 . That should allow you to take any GL ES 2 Wayland application and run it unmodified under Weston on Pi, as far as graphics is concerned.

As for the present state, before Christmas I updated and verified the Weston build guide, so you should be able to build Weston from the upstream master branch again. We are now using the real weston-launch suid-root binary to start Weston, and that should restore the VT automatically even if Weston crashes for some reason. Apart from the EGL support, little has happened on the Weston rpi backend/renderer since this blog post, IIRC, so the general upstream Weston improvements should be the other new bits.

runeks said...

Do you know if it would be possible to implement something similar for XBMC on the RPi? Currently, video playback works fine on the RPi with XBMC (since it's using the hardware decoder), but when the video is overlaid with some graphics (as happens when you exit a currently playing video and the menu is shown on top of it), it starts getting laggy. I suspect this is because the proper paths in the Pi's VideoCore aren't being used.

I don't even know if XBMC has the ability to change rendering paths based on the hardware it's running on, but it would be cool to see XBMC running completely smoothly on the Pi (including menus, both when they are overlaid on video and by themselves).

It seems like the Pi has plenty of power to render things smoothly, as long as its built-in hardware is utilized instead of the CPU.

pq said...

Runeks, I have no idea about XBMC's internal architecture, so I cannot comment on that. Right now on the RPi, hardware accelerated video decoding requires either presenting directly on the screen, or creating GLES textures from the decoded video frames, which implies a copy in GLES to present them. I have never tried, but maybe there could be a way to hack an overlay on top of the direct video output. OTOH, if you go the GLES way, you can render whatever you want on top of the video, either with GLES or (a new feature from the Wayland support work) by manually handling the presentation and using an overlay.