Today in Wayland compositor profiling! Turns out closing a shm pool file descriptor can result in a fat stall of up to like 6 ms with the kernel waiting on some spinlocks. Which is extra fun when you realize it covers the entire frame budget of your 165 Hz screen, and some clients are sometimes doing it every frame!
I'm trying a "dropping thread" workaround where the fd closing happens on a separate thread. Appears to work at first glance.
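Roughly what such a dropping thread could look like, as a minimal sketch and not the actual code in the compositor. `FdDropper`, `close_later` and `finish` are hypothetical names made up for illustration; the only real pieces are `std::os::fd::OwnedFd` (which closes the fd when dropped) and a standard `mpsc` channel.

```rust
use std::os::fd::OwnedFd;
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical helper: hands file descriptors to a background thread so
/// that a potentially slow close() never runs on the rendering thread.
pub struct FdDropper {
    tx: Option<Sender<OwnedFd>>,
    handle: Option<JoinHandle<usize>>,
}

impl FdDropper {
    pub fn new() -> Self {
        let (tx, rx) = mpsc::channel::<OwnedFd>();
        let handle = thread::spawn(move || {
            let mut closed = 0;
            // Dropping an OwnedFd closes it. The loop ends once every
            // sender has been dropped.
            for fd in rx {
                drop(fd);
                closed += 1;
            }
            closed
        });
        Self { tx: Some(tx), handle: Some(handle) }
    }

    /// Queue an fd for closing. Returns immediately.
    pub fn close_later(&self, fd: OwnedFd) {
        if let Some(tx) = &self.tx {
            // If the thread is gone, the fd comes back inside the error
            // and is closed right here instead.
            let _ = tx.send(fd);
        }
    }

    /// Stop the thread and return how many fds it closed.
    pub fn finish(mut self) -> usize {
        drop(self.tx.take());
        self.handle
            .take()
            .map(|h| h.join().unwrap_or(0))
            .unwrap_or(0)
    }
}
```

The idea is that the frame path only pays for a channel send, and whatever the kernel does inside close() happens off to the side.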
Dmabuf screencasting is crazy good. Here's a histogram of the screencasting overhead on my 2560×1600@165 screen: the median is 300 microseconds, and the worst across 12,669 frames was just below 1 ms. Most of that time is spent rendering the frame; perhaps something could even be further optimized in Smithay.
And yeah, if you look at the profiling timeline, I zoomed it in such a way that almost the entire width is taken by one frame, which is 6.05 ms long. Most of it is completely empty!
Thought of another thing to plot in Tracy: target presentation time offset! This is the difference between when a frame was shown on screen and the target time that we were rendering for.
Here you can see data across 17 seconds of runtime while recording with OBS. Offset on both monitors fluctuates within a few microseconds around zero, which means that our rendering lands right on time.
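The value being plotted is just a signed difference. A minimal sketch, assuming both timestamps are `Duration`s measured from the same monotonic clock epoch; the function name is hypothetical and the actual plotting call is left out.

```rust
use std::time::Duration;

/// Signed offset in nanoseconds between when a frame was actually shown
/// and the time it was rendered for. Positive means the frame was late,
/// negative means it was early.
pub fn presentation_offset_ns(presented: Duration, target: Duration) -> i64 {
    // Assumption: both timestamps fit comfortably in i64 nanoseconds,
    // which holds for monotonic clock values (about 292 years of range).
    presented.as_nanos() as i64 - target.as_nanos() as i64
}
```

With this definition, "a few microseconds around zero" means values on the order of a few thousand in either direction.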
I'm still working on niri btw (and using it myself too). Today I finally finished a window layout refactor that was due from very early on.
Now the layout always works correctly, with all the paddings, struts, fullscreen windows and animations. It's tricky because while most of the logic operates only on the "working area" (view excluding struts), fullscreen windows in particular must cover the entire view area, while otherwise acting as just another regular window column.
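The view vs. working area split can be sketched like this. It's a simplified illustration with hypothetical types (`Rect`, `Struts`) and function names, not the real layout code: regular columns are laid out against the working area, while a fullscreen column gets the whole view.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Rect {
    pub x: i32,
    pub y: i32,
    pub w: i32,
    pub h: i32,
}

#[derive(Debug, Clone, Copy)]
pub struct Struts {
    pub left: i32,
    pub right: i32,
    pub top: i32,
    pub bottom: i32,
}

/// The working area is the view shrunk by the struts on each side.
pub fn working_area(view: Rect, struts: Struts) -> Rect {
    Rect {
        x: view.x + struts.left,
        y: view.y + struts.top,
        // Clamp so oversized struts never produce a negative size.
        w: (view.w - struts.left - struts.right).max(0),
        h: (view.h - struts.top - struts.bottom).max(0),
    }
}

/// Area a column is laid out against: fullscreen columns cover the entire
/// view, everything else stays inside the working area.
pub fn column_area(view: Rect, struts: Struts, is_fullscreen: bool) -> Rect {
    if is_fullscreen {
        view
    } else {
        working_area(view, struts)
    }
}
```

The tricky part is that everything else about a fullscreen column (scrolling, focus, animations) still has to behave like any other column, so only the area lookup differs.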
It's also common to see one frame's worth of offset, like on this zoomed-out screenshot. This happens when the compositor wakes up from idling too late into the monitor refresh cycle and doesn't manage to render a new frame in time.
Decided to make a new demo video for niri, finally. The last one was so old that niri didn't even have cursors implemented; it showed an orange rectangle instead. 🫠
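A toy model of why a late wake-up costs a whole frame, assuming a fixed, non-zero refresh interval and a known render time (both simplifications; the function and its parameters are hypothetical):

```rust
use std::time::Duration;

/// How many refresh cycles late a frame lands, given how far into the
/// current refresh cycle the compositor woke up and how long rendering
/// takes. Returns 0 when the frame is ready by the next vblank.
/// Assumes `refresh_interval` is non-zero.
pub fn refresh_cycles_missed(
    time_into_cycle: Duration,
    render_time: Duration,
    refresh_interval: Duration,
) -> u32 {
    let interval = refresh_interval.as_nanos();
    let done = (time_into_cycle + render_time).as_nanos();
    // Finishing exactly on the deadline counts as making it in this model.
    if done <= interval {
        return 0;
    }
    // Equivalent to ceil(done / interval) - 1 for done > interval.
    ((done - 1) / interval) as u32
}
```

On a 165 Hz screen the interval is about 6.06 ms, so waking up 5.5 ms into the cycle with 1 ms of rendering misses the vblank and the frame shows up one refresh later.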
Very happy I've come this far writing my own compositor from scratch. Honestly thought my motivation would only last for two weeks max, but here we are. :blobcattea:
Learned a ton in the process, and now this experience helps me with Mutter & Shell profiling.
@lanodan it's a tech demo / kitchen sink compositor from Smithay: https://github.com/Smithay/smithay It's by no means fully fledged or optimized, so don't take this as a measure of Smithay's potential performance, but I just wanted to check it out of curiosity.
Next, a more interesting test: editors. For terminal editors I used Alacritty, and I've also added the fast and the slow baselines from the previous tests.
Here Neovim and Helix in text mode are the fastest, followed by nano, which has more spread for some reason. Next we have G-T-E and Builder with quite a bit of spread (@hergertme, any idea what's going on here?), then Helix and Neovim with IDE functionality, and finally VSCode.
Two years ago VSCode was better; maybe my extension setup has changed since.
Moar measurements: compositors. Since for this test the key presses are slow and there's no continuous redrawing, this should boil down to the amount of work a compositor does on screen update.
Un-vsynced X11 is obviously the fastest; thankfully work to add tearing flips to the kernel and Wayland is ongoing.
Surprised to see GNOME Shell be a bit slower than raw Mutter, especially in fullscreen, since it doesn't really do much extra there. Extra surprised GNOME X11 is faster; might be noise.
This time I wanted to look at the data more thoroughly before deciding on the thresholding approach, but it seems that the plotly frontend starts to really struggle when you feed it several seconds of data sampled multiple times per millisecond 🙃
As expected, VTE is *really slow* on big window sizes on Wayland due to Cairo and weird repaint timing logic. But what is Black Box doing to lose more than a refresh cycle?
Glad to see Alacritty still on top, but apparently Foot is a tiny bit faster on this test. Kitty loses one refresh cycle for some reason.
Now for something different: emulators! Here "New Highscore" is the work-in-progress Highscore rewrite @alice is working on, "Old Highscore" is the current latest Highscore git commit, and "GNOME Games" is the latest Games from Flathub.
It's quite interesting how RetroArch seems to have a two-frame spread rather than one; something's off in its processing. Also interesting how mGBA is one frame slower than Gambatte. For Highscore, good to see GTK 4 improving the latency.
Today I've been visited by kchibisov (Alacritty maintainer) and we've spent several hours benchmarking terminals and editors. 😴
For this test we measured a complex drawing test from vtebench[1]. A key press fills the screen with a complex pattern. I measure the latency from the key press to seeing the pattern at the end of the screen.
Foot ended up firmly ahead, followed by Kitty and Alacritty. Other terminals struggle a bit more with it.