V8 profiling instrumentation overhead

In the last two years, I have been contributing to improving V8's profiling instrumentation in different scenarios, both for Linux (using Perf) and for Windows (using the Windows Performance Toolkit).

My work has mostly focused on the stack walking instrumentation that allows JIT-generated JavaScript code to show up in stack traces with references to the original source code, in the same way natively compiled C or C++ code can provide that information.

In this blog post I run different benchmarks to determine how much overhead this instrumentation introduces.

How can you enable profiling instrumentation?

V8 implements native system profiling instrumentation for JavaScript code, both for Linux Perf and for the Windows Performance Toolkit. However, it is disabled by default.

On Windows, the command line switch --enable-etw-stack-walking instruments the generated JavaScript code to emit the source code information of the compiled functions using ETW (Event Tracing for Windows). This gives better insight into the time spent in different functions, especially when stack profiling is enabled in the Windows Performance Toolkit.
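As an illustration, a minimal recording session could look like the following, using V8's d8 shell and the wpr tool from the Windows Performance Toolkit; the GeneralProfile recording profile and the benchmark.js script are just placeholders for whatever you actually want to record:

    rem start a system-wide recording with CPU sampling
    wpr -start GeneralProfile

    rem run the JavaScript workload with ETW stack walking instrumentation enabled
    d8 --enable-etw-stack-walking benchmark.js

    rem stop the recording and save it for analysis in Windows Performance Analyzer
    wpr -stop trace.etl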

For Linux Perf, there are several command line switches (a usage sketch follows the list):
--perf-basic-prof: emits, for any generated JavaScript code, its memory location and a descriptive name (which includes the source location of the function).
--perf-basic-prof-only-functions: same as the previous one, but only emitting functions, excluding other stubs such as regular expressions.
--perf-prof: a different way to provide the information to Linux Perf. It generates a more complex format, specified here. On top of the addresses of the functions, it also includes source code information, even with details about the exact line of code inside the function that a sample hits.
--perf-prof-annotate-wasm: extends --perf-prof to add debugging information to WebAssembly code.
--perf-prof-unwinding-info: also extends --perf-prof, providing experimental support for stack unwinding information.
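As a usage sketch, using V8's d8 shell (Node.js accepts the same V8 flags) and a placeholder benchmark.js script, a Perf session with each backend could look like this:

    # basic backend: V8 writes /tmp/perf-<pid>.map, which perf picks up directly
    perf record -g -- d8 --perf-basic-prof benchmark.js
    perf report

    # jitdump backend: V8 writes jit-<pid>.dump, which has to be injected afterwards
    perf record -k mono -g -- d8 --perf-prof benchmark.js
    perf inject -j -i perf.data -o perf.jit.data
    perf report -i perf.jit.data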

And then we have interpreted code support. V8 does not generate JIT code for everything; in many cases it runs interpreted code that calls common builtin code. So, in a stack trace, instead of seeing the JavaScript method that eventually runs through the interpreter, we see those common builtins. The solution? --interpreted-frames-native-stack, which adds extra information that identifies which JavaScript method in the stack is actually being interpreted. This is essential to correctly attribute running code, especially code that is not considered hot enough to be compiled.
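For example, to get interpreted frames attributed in a Linux Perf recording, the flag would be combined with one of the Perf switches above (again a sketch using d8 and a placeholder script):

    # interpreted frames now show up as the actual JavaScript function
    # instead of a generic interpreter entry frame
    perf record -g -- d8 --perf-basic-prof --interpreted-frames-native-stack benchmark.js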

They are disabled by default. Is that a problem?

Ideally, we would want all of this profiling support to be enabled whenever we profile code, without having to pass any command line switch, so we can profile JavaScript workloads without any additional configuration.

But all these command line switches are disabled by default. Why? There are several reasons.

What happens on Linux?

On Linux, the Perf profiling instrumentation unconditionally generates additional files, as V8 cannot know whether Perf is running or not. The --perf-basic-prof backend writes the generated information to .map files, and the --perf-prof backend writes additional information to jit-*.dump files that will be used later with perf inject -j.
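To give an idea of what the basic backend emits, each line of a .map file describes one region of generated code with its start address, size and a name; the exact name prefixes vary across V8 versions, and the entries below are made up for illustration:

    (illustrative excerpt of a perf-<pid>.map file)
    3b2e4c402000 1a0 LazyCompile:~processRequest server.js:42
    3b2e4c402800 260 LazyCompile:*processRequest server.js:42
    3b2e4c403000 80  RegExp:^[a-z]+$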

Additionally, the instrumentation requires code compaction to be disabled, because code compaction changes the memory location of compiled code. V8 generates internal CODE_MOVE events so that profiler instrumentation can track those moves, but the ETW and Linux Perf backends do not handle them, for a variety of reasons. This has an impact on memory, as there is no way to compact the space allocated for code without moving code.

So, when we enable any of the Linux Perf command line switches, we both generate extra files and use more memory.

And what about Windows?

On Windows we have the same problem with code compaction: the ETW backend does not handle CODE_MOVE events either, so we cannot enable code compaction while profiling.

And the ETW events emitted with the JIT code position information still take memory and make the profile recordings bigger.

But there is an important difference: the Windows ETW API allows applications to know when they are being profiled, and even to filter on that. V8 only emits the code location information to ETW when the application is being profiled, and code compaction is only disabled while a profiling session is ongoing.

That means the overhead for enabling --enable-etw-stack-walking is expected to be minimal.

What about --interpreted-frames-native-stack? It has also been optimized to generate the extra stack information only when a profile is being recorded.

Can we enable them by default?

For Linux, the answer is a clear no. Any V8 workload would generate profiling information files, and disabling code compaction increases memory usage. These problems happen whether you are recording a profile or not.

But Windows ETW support adds overhead only when profiling! So it looks like it could make sense to enable both --interpreted-frames-native-stack and --enable-etw-stack-walking on Windows.

Are we there yet? Overhead analysis

So first, we need to verify the actual overhead, because we do not want to introduce a regression by enabling any of these options. I also want to confirm that the actual overhead on Linux matches the expectations. To do that, I show the results of running the CSuite benchmarks included in V8, capturing both CPU and memory usage.
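From memory, the runs look roughly like the sketch below, using the csuite.py helper that ships in the V8 tree; the benchmark name, modes, d8 path and extra-arguments flag are assumptions based on how I remember the script working, so check its help output for the exact syntax:

    # assumed csuite.py usage: baseline run without extra flags (paths are illustrative)
    csuite.py octane baseline out/x64.release/d8

    # assumed csuite.py usage: comparison run passing the instrumentation flags to d8
    csuite.py octane compare out/x64.release/d8 -x="--perf-basic-prof --interpreted-frames-native-stack"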

Linux benchmarks

The results obtained on Linux are the following.

Legend:
– REC: recording a profile (Yes, No).
– No flag: running the test without any extra command line flag.
– BP: passing --perf-basic-prof.
– IFNS: passing --interpreted-frames-native-stack.
– BP+IFNS: passing both --perf-basic-prof and --interpreted-frames-native-stack.

Linux results – score:

Test       REC  Better  No flag  BP      IFNS    BP+IFNS
Octane     No   Higher  9101.2   8953.3  9077.1  9097.3
Octane     Yes  Higher  9112.8   9041.7  9004.8  9093.9
Kraken     No   Lower   4060.3   4108.9  4076.6  4119.9
Kraken     Yes  Lower   4078.1   4141.7  4083.2  4131.3
Sunspider  No   Lower   595.7    622.8   595.4   627.8
Sunspider  Yes  Lower   598.5    626.6   599.3   633.3

Linux results – memory (in kilobytes):

Test       REC  No flag   BP        IFNS      BP+IFNS
Octane     No   244040.0  249152.0  243442.7  253533.8
Octane     Yes  242234.7  245010.7  245632.9  252108.0
Kraken     No   46039.1   46169.5   46009.1   46497.1
Kraken     Yes  46002.1   46187.0   46024.4   46520.5
Sunspider  No   90267.0   90857.2   90214.6   91100.4
Sunspider  Yes  90210.3   90948.4   90195.9   91110.8

What is seen in the results?
– There is apparently no CPU or memory overhead from --interpreted-frames-native-stack alone (there is an outlier in the Octane memory usage while recording, which could be an error in the sampling process). This is expected, as the additional information is only generated while some profiling instrumentation is enabled.
– There is no clear extra memory cost for recording vs. not recording. This is also expected, as the extra information only depends on the switches being enabled; the Linux backends do not detect when a recording is ongoing.
– There is, as expected, a CPU cost when Perf is recording, in most of the cases.
– --perf-basic-prof has a CPU and memory impact, and combined with --interpreted-frames-native-stack the impact is even higher, as expected.

This is on top of the fact that files are generated. The benchmarks back not enabling --perf-basic-prof by default. But as --interpreted-frames-native-stack only adds overhead in combination with the profiling switches, it could make sense to consider enabling it on Linux.

Windows benchmarks

What about Windows results?

Legend:
– REC: recording a profile (Yes, No).
– No flag: running the test without any extra command line flag.
– ETW: passing --enable-etw-stack-walking.
– IFNS: passing --interpreted-frames-native-stack.
– ETW+IFNS: passing both --enable-etw-stack-walking and --interpreted-frames-native-stack.

Windows results – score:

Test       REC  Better  No flag  ETW     IFNS    ETW+IFNS
Octane     No   Higher  8323.8   8336.2  8308.7  8310.4
Octane     Yes  Higher  8050.9   7991.3  8068.8  7900.6
Kraken     No   Lower   4273.0   4294.4  4303.7  4279.7
Kraken     Yes  Lower   4380.2   4416.4  4433.0  4413.9
Sunspider  No   Lower   671.1    670.8   672.0   671.5
Sunspider  Yes  Lower   693.2    703.9   690.9   716.2

Windows results – memory:

Test       REC  No flag   ETW       IFNS      ETW+IFNS
Octane     No   202535.1  204944.9  200700.4  203557.8
Octane     Yes  205125.8  204801.3  206856.9  209260.4
Kraken     No   76188.2   76095.2   76102.1   76188.5
Kraken     Yes  76031.6   76265.0   76215.6   76662.0
Sunspider  No   31784.9   31888.4   31806.9   31882.7
Sunspider  Yes  31848.9   31882.1   31802.7   32339.0

What is observed?
– No memory or CPU overhead is observed when a recording is not ongoing, with --enable-etw-stack-walking and/or --interpreted-frames-native-stack.
– Even when recording, there is no clearly visible overhead from --interpreted-frames-native-stack.
– When recording, --enable-etw-stack-walking has an impact on CPU, but not on memory.
– When --enable-etw-stack-walking and --interpreted-frames-native-stack are combined while recording, both memory and CPU overheads are observed.

So, again, it should be possible to enable --interpreted-frames-native-stack by default on Windows, as it only has an impact while recording a profile with --enable-etw-stack-walking. And it should be possible to enable --enable-etw-stack-walking by default too, as its impact again only happens while a profile is being recorded.

Windows: there are still problems

Is it that simple? Is it acceptable to add overhead only when recording a profile?

One problem is that there is an impact on system-wide recordings, even when the focus is not on JavaScript execution.

Also, the CPU impact is not evenly distributed: most of it happens when a profile recording starts, because V8 emits the code position of every method in all the already existing JavaScript functions and builtins. Now, imagine a regular desktop with several V8-based browsers such as Chrome and Edge, each with all their tabs, or a server with many Node.js instances. All of them generate the information for all their methods at the same time.

So the impact of the overhead is significant, even if it only happens while recording.

Next steps

For --interpreted-frames-native-stack, it looks sensible to propose enabling it by default on both studied platforms (Windows and Linux). It significantly improves the profiling instrumentation, it only has an impact when the instrumentation is actually used, and it can still be disabled for the specific cases where recording the original builtins is preferred.

For --enable-etw-stack-walking, it could make sense to also propose enabling it by default, even with the known overhead while recording profiles, while making sure there are no surprise regressions.

But the best option is likely to reduce that overhead. First, by allowing Windows Performance Recorder to better filter which processes generate the additional information. And then by reducing the initial overhead as much as possible: a big part of it comes from emitting the builtins and the functions compiled into the V8 snapshot again and again, once per context. Ideally, we want to avoid that duplication.

Wrapping up

While on Linux enabling the profiling instrumentation by default is not an option, on Windows the instrumentation is enabled dynamically, which makes it reasonable to consider enabling ETW support by default. But there are still optimization opportunities that could be addressed first.

Thanks!

This report has been done as part of the collaboration between Bloomberg and Igalia. Thanks!