there was a jpeg optimization paper that pointed this out. they scored images with a geometric mean of PSNR and SSIM, or something like that. they also added a second pass over the block borders so those blocky jpeg artifacts get counted in the comparison.
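rough sketch of what a combined score like that could look like. this is not the paper's actual formula (i don't remember it): capping PSNR at 50 dB to squash it into 0..1, and computing SSIM over the whole image as a single window instead of averaging local windows, are both assumptions i made up to keep it short. inputs are flat lists of 8-bit grayscale pixels, and `combined_score` is a hypothetical name.

```python
import math

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10 * math.log10(peak * peak / mse)

def global_ssim(a, b, peak=255.0):
    """SSIM with the whole image as one window (real SSIM averages local windows)."""
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * peak) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * peak) ** 2
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def combined_score(a, b, psnr_cap=50.0):
    """Geometric mean of capped, normalized PSNR and SSIM (assumed scheme)."""
    p = min(psnr(a, b), psnr_cap) / psnr_cap  # 0..1, cap is an assumption
    s = max(global_ssim(a, b), 0.0)           # SSIM can go negative, clamp it
    return math.sqrt(p * s)

original = [10, 50, 200, 120]
compressed = [12, 47, 190, 130]
print(combined_score(original, compressed))
```

the point of the geometric mean is that a bad score on either metric drags the total down, so an encoder can't win by gaming just one of them.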
mostly cause PSNR is just per-pixel error, so it only really evaluates smooth color slopes well, while SSIM compares local structure, so it does better on hard edges, and the human brain kind of does a bit of both. although things also get fucky because human eyes aren't equally sensitive to each color channel either (most sensitive to green, least to blue). tis why a lot of old games had weird shit like 5 bit reds and blues next to 6 bit greens.
the color depth of the RGB channels was sacrificed to drop the image size by a third (24 bits per pixel down to 16), and since 16 bits don't split evenly three ways, one of the channels ends up with more precision than the others.
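to make that concrete, here's a small sketch assuming the common RGB565 layout (5 bits red, 6 bits green, 5 bits blue). that's my assumption about the format, plenty of old hardware used other splits. the function names are just for illustration.

```python
def pack_rgb565(r, g, b):
    """Squeeze an 8-bit-per-channel color into 16 bits by dropping low bits."""
    return ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3)

def unpack_rgb565(v):
    """Expand back to 8 bits per channel by replicating the high bits."""
    r5 = (v >> 11) & 0x1F
    g6 = (v >> 5) & 0x3F
    b5 = v & 0x1F
    return ((r5 << 3) | (r5 >> 2), (g6 << 2) | (g6 >> 4), (b5 << 3) | (b5 >> 2))

packed = pack_rgb565(200, 100, 50)
print(hex(packed), unpack_rgb565(packed))
```

the round trip is lossy: red and blue lose their 3 lowest bits and green only loses 2, so green comes back closer to the original.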
similar stuff is done in audio. mp3 and opus are based on "psychoacoustic" models where they basically make a computer model of how a human hears most sounds and then they spend the bits on the frequency bands that matter most.
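toy version of the idea, definitely not what mp3 or opus actually do internally. it assumes you already have a signal-to-mask ratio in dB for each band (how far the signal sits above what the ear would mask), and that each extra bit buys about 6 dB of quantization noise reduction. `allocate_bits` and the example numbers are made up.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.02):
    """Greedy allocation: keep giving a bit to the band with the most audible noise."""
    bits = [0] * len(smr_db)
    need = list(smr_db)  # dB of noise still above the masking threshold, per band
    for _ in range(total_bits):
        i = max(range(len(need)), key=lambda k: need[k])
        if need[i] <= 0:
            break  # all remaining noise is already masked, stop spending bits
        bits[i] += 1
        need[i] -= db_per_bit
    return bits

# band 0 is very audible, band 1 slightly, band 2 is already masked
print(allocate_bits([20.0, 5.0, -3.0], total_bits=10))
```

the masked band gets zero bits, which is the whole trick: bits go where a listener would actually notice the noise.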