Today I made use of a tool I wrote several years ago to test the performance of my code.
A new client is using an i.MX 6 (Cortex-A9) CPU in their product, and my gcc_perf tool let them see the direct benefit of writing code that takes into account how the CPU write buffer works and how in-cache latency compares with DRAM latency: