Implemented in Swift, w/ Swift concurrency + Metal.
Each video frame uses 1 billion sample points (i.e. randomly apply the transforms 1 billion times and count how many times the point lands in each pixel).
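That inner loop is the classic chaos game. A minimal single-threaded sketch, assuming affine transforms and an attractor scaled to the unit square (`Transform` and `render` are illustrative names, not the actual code):

```swift
// Hypothetical affine transform: (x, y) -> (ax + by + e, cx + dy + f).
struct Transform {
    var a, b, c, d, e, f: Double
    func apply(_ p: SIMD2<Double>) -> SIMD2<Double> {
        SIMD2(a * p.x + b * p.y + e, c * p.x + d * p.y + f)
    }
}

// Chaos game: repeatedly apply a randomly chosen transform and
// increment the hit count of the pixel the point lands in.
func render(transforms: [Transform], width: Int, height: Int,
            samples: Int) -> [UInt32] {
    var histogram = [UInt32](repeating: 0, count: width * height)
    var p = SIMD2<Double>(0, 0)
    for _ in 0..<samples {
        p = transforms.randomElement()!.apply(p)
        // Map the unit square to pixel coordinates (assumed bounds).
        let x = Int(p.x * Double(width))
        let y = Int(p.y * Double(height))
        if (0..<width).contains(x) && (0..<height).contains(y) {
            histogram[y * width + x] += 1
        }
    }
    return histogram
}
```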
Started with a single-threaded CPU version: ~50 Mpoints/sec on my 2019 Intel MBP.
Parallel versions (all-CPU, and a CPU/GPU mix) achieved only ~1.5-2x speedup despite 8 cores. Why? Memory-bound! The cost of counting grid hits exceeds the cost of the actual transform math: the points land at effectively random pixels, so nearly every histogram increment is a cache miss.
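For illustration, one plausible shape for the all-CPU version (my reconstruction, reusing `render` and `Transform` from above), with each task accumulating into a private histogram so there is no lock or atomic contention at all. It still barely scales, because all the cores share the same memory bandwidth and the scattered increments are pure memory traffic:

```swift
// Split the samples across tasks; each task fills a private
// histogram, and the partial histograms are summed at the end.
func renderParallel(transforms: [Transform], width: Int, height: Int,
                    samples: Int, tasks: Int) async -> [UInt32] {
    let perTask = samples / tasks
    let partials = await withTaskGroup(of: [UInt32].self) { group in
        for _ in 0..<tasks {
            group.addTask {
                render(transforms: transforms, width: width,
                       height: height, samples: perTask)
            }
        }
        var results: [[UInt32]] = []
        for await histogram in group { results.append(histogram) }
        return results
    }
    // Merge: cheap compared to generating a billion points.
    var total = [UInt32](repeating: 0, count: width * height)
    for histogram in partials {
        for i in 0..<total.count { total[i] &+= histogram[i] }
    }
    return total
}
```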
All-GPU version achieved ~600 Mpoints/sec, a ~12x speedup over the single-threaded baseline. Wow!
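The GPU wins because it keeps tens of thousands of chaos-game walks in flight at once, so the memory latency of any one scattered increment is hidden by all the others. A sketch of what the Metal kernel might look like (illustrative, not the actual shader), written here as a Swift string for runtime compilation via `MTLDevice.makeLibrary(source:options:)`: each thread runs an independent walk with a cheap xorshift PRNG and bumps the shared histogram with relaxed device atomics.

```swift
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

// Must match the Swift-side buffer layout.
struct Transform { float a, b, c, d, e, f; };

// Cheap per-thread xorshift PRNG; quality is fine for sampling.
inline uint xorshift(thread uint &state) {
    state ^= state << 13; state ^= state >> 17; state ^= state << 5;
    return state;
}

kernel void chaosGame(device atomic_uint *histogram  [[buffer(0)]],
                      constant Transform *transforms [[buffer(1)]],
                      constant uint &transformCount  [[buffer(2)]],
                      constant uint2 &size           [[buffer(3)]],
                      constant uint &stepsPerThread  [[buffer(4)]],
                      uint tid [[thread_position_in_grid]]) {
    uint rng = tid ^ 0x9E3779B9u;   // distinct nonzero seed per thread
    float2 p = float2(0.0);
    for (uint i = 0; i < stepsPerThread; i++) {
        Transform t = transforms[xorshift(rng) % transformCount];
        p = float2(t.a * p.x + t.b * p.y + t.e,
                   t.c * p.x + t.d * p.y + t.f);
        // Attractor assumed scaled to the unit square.
        if (p.x >= 0.0f && p.x < 1.0f && p.y >= 0.0f && p.y < 1.0f) {
            uint2 px = uint2(p * float2(size));
            atomic_fetch_add_explicit(&histogram[px.y * size.x + px.x],
                                      1u, memory_order_relaxed);
        }
    }
}
"""
```

Relaxed memory ordering suffices here: the counts are only accumulated during the dispatch and read back after it completes, so no cross-thread ordering is needed.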