Performance of selection with CSS Regions in WebKit and Blink (Part II – perf profiler)
After the initial post introducing this topic and describing the Performance Tests (perftests), it is now time to explain how to analyze performance issues with a profiler in order to improve the code.
“Manual” measurements
First of all, you might think of adding some measurements to the source code to find the possible bottlenecks in an application. For example, you could use the following lines inside WebKit/Blink to measure the time required to execute a given function:
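    timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    printf("%lld.%.9ld\n", (long long)ts.tv_sec, ts.tv_nsec);

    // Call to the function you want to measure.

    timespec ts2;
    clock_gettime(CLOCK_REALTIME, &ts2);
    printf("%lld.%.9ld\n", (long long)ts2.tv_sec, ts2.tv_nsec);

    // Borrow one second when the nanosecond difference is negative.
    long long diffSec = (long long)ts2.tv_sec - (long long)ts.tv_sec;
    long diffNsec = ts2.tv_nsec - ts.tv_nsec;
    if (diffNsec < 0) {
        --diffSec;
        diffNsec += 1000000000L;
    }
    printf("Diff: %lld.%.9ld\n", diffSec, diffNsec);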
Actually this is a pretty bad idea for a number of reasons:
- Printing the times directly with printf() is wrong, because you are adding I/O that will spoil the measurements.
- You have to modify the source code for every statement you want to measure, which can be really hard in large projects like WebKit/Blink.
- There are better tools out there, called profilers, explicitly designed for this very purpose, so let’s use them.
Using perf profiler
In this case we are going to talk about how to use perf in a GNU/Linux environment to analyze the performance of the WebKitGTK+ port. You can follow these steps:
- Install it. It depends on your distro, but it should be simple. For example in Debian:
apt-get install linux-tools
- Run perf record to get the data. This will create a file called perf.data. Here you have different options:
- Run the application directly. perf follows the child processes, so this works properly with the multi-process WebKit2 architecture:
perf record -g <app>
- Connect perf to an already existing process, for example to the WebProcess:
perf record -p <process-pid> -g
- Use perf report to analyze the data gathered by perf record. Simply use the following command (where -i <file-name> is optional, as by default it reads the perf.data file):
perf report -i <file-name>
As for collecting the data, in WebKitGTK+ you can also generate the perf data files while running the perftests. Just add some arguments to the run-perf-tests script:
Tools/Scripts/run-perf-tests --platform=gtk --debug --profile
This creates a file called test.data in the WebKitBuild/Debug/layout-test-results/ folder.
Analyze profile session
perf report lists the methods in which the most time was spent, with the percentage of the total time spent in each of them. You can then expand each method to see the full backtraces of the places it is called from and what share of its time comes from each trace.
Let’s use a concrete example to illustrate it. This is a perf report output got from the WebProcess doing a big selection in a page with CSS Regions:
    - 6.26% lt-WebKitWebPro libwebkit2gtk-3.0.so.25.5.0 [.] WebCore::LayoutUnit::rawValue() const
       - WebCore::LayoutUnit::rawValue() const
          - 33.10% WebCore::operator+(WebCore::LayoutUnit const&, WebCore::LayoutUnit const&)
             - 39.54% WebCore::LayoutRect::maxX() const
                - 83.75% WebCore::LayoutRect::intersect(WebCore::LayoutRect const&)
                   - 98.69% WebCore::RenderRegion::repaintFlowThreadContentRectangle(WebCore::LayoutRect const&, bool, WebCore::LayoutRect const&, WebCore::LayoutRect const&, WebCore::LayoutPoint const&) const
                        WebCore::RenderRegion::repaintFlowThreadContent(WebCore::LayoutRect const&, bool) const
                        WebCore::RenderFlowThread::repaintRectangleInRegions(WebCore::LayoutRect const&, bool) const
                      - WebCore::RenderObject::repaintUsingContainer(WebCore::RenderLayerModelObject const*, WebCore::IntRect const&, bool) const
                         - 75.49% WebCore::RenderSelectionInfo::repaint()
                              WebCore::RenderView::setSelection(WebCore::RenderObject*, int, WebCore::RenderObject*, int, WebCore::RenderView::SelectionRepaintMode)
As you can see, 6.26% of the time is spent in the LayoutUnit::rawValue() method, which is used in lots of places. 33.10% of that time it is called from WebCore::operator+(), which is also quite generic. We should keep going deeper into the call graph until we reach methods that are interesting in our particular case.
In this case, the selection starts in RenderView::setSelection(), so we should investigate further the methods called from there. Of course, in order to do that you need some understanding of the code you are navigating, or you’ll end up completely lost.
Improve source code
Thanks to this data I realized that each RenderSelectionInfo::repaint() uses RenderObject::containerForRepaint(), which most of the time returns the parent RenderNamedFlowThread for all its children in the render tree.
As a result, RenderFlowThread::repaintRectangleInRegions() is called for every element under the RenderNamedFlowThread. Looking at this method, it loops over all the regions and forces a repaint in each one. This means that if you have 1000 regions, even if you’re just selecting in one of them, a repaint is executed in all the other regions too.
So I provided a patch that repaints only the affected regions, which gives improvements of around 12%, 18% and 73% in the Layout/RegionsSelection.html, Layout/RegionsSelectAllMixedContent.html and Layout/RegionsExtendingSelectionMixedContent.html perftests respectively.
Doing tests with more regions using this example we got the following results:
Regions | Without patch (ms) | With patch (ms) | Improvement |
---|---|---|---|
100 | 923 | 338 | 63% |
150 | 2712 | 727 | 73% |
200 | 5952 | 1285 | 78% |
500 | 81731 | 7868 | 90% |
As expected, results are better as we increase the number of regions.
Conclusions
This was just one specific example in order to explain how to use the available tools, trying to provide the required context to understand them properly.
There is certainly still plenty of work to do to improve the performance of selection with CSS Regions. However, we first have to settle on a final implementation of selection in CSS Regions before continuing with the optimization efforts; as explained in my previous post, we are working on fixing it.
This work has been done as part of the collaboration between Adobe and Igalia around CSS Regions.
At Igalia we have extensive experience in performance optimization for CSS standards like CSS Flexible Box Layout, CSS Grid Layout and CSS Regions. Please don’t hesitate to contact us if you need some help around these topics.