Wanderer über dem Nebelmeer (wandereruber@poa.st)'s status on Thursday, 11-Jun-2026 11:42:55 JST
update:
Running beellama (cuda) on the same config is faster than llama-cpp vulkan, which is already faster than vanilla llama-cpp cuda.
I can't use TurboQuant because it's slower for me: it needs cpu-moe = true, and apparently my cpu is NOT moe. It costs me ~15% t/s.
I have not had ANY success with the dflash drafting: it's slow. Logs show a lot of rejections, so maybe that's why.
The absolute kicker, and the reason I will keep using beellama:
A 3x increase in prompt processing speed, on top of the inference speed increase. I have no idea what they did to achieve this.
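On the drafting rejections: a rough way to see why a high rejection rate can make drafting a net loss is the textbook speculative decoding estimate. The sketch below is generic, not dflash's actual scheduler, and it assumes each drafted token is accepted independently with the same probability. The function names, the acceptance rates, and the relative draft cost are all hypothetical values picked for illustration, not numbers from my logs.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplification).
    """
    if accept_rate >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int, draft_cost: float) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is the (assumed) cost of drafting one token, expressed
    as a fraction of one target-model pass.
    """
    tokens = expected_tokens_per_step(accept_rate, draft_len)
    cost = draft_len * draft_cost + 1.0  # drafting plus one verification pass
    return tokens / cost


if __name__ == "__main__":
    # Hypothetical settings: 8 drafted tokens, draft token costs 10% of a target pass.
    for rate in (0.2, 0.5, 0.8):
        print(f"accept={rate:.1f} speedup~{estimated_speedup(rate, 8, 0.1):.2f}x")
```

Under these assumptions a low acceptance rate gives an estimate below 1x, meaning drafting is slower than not drafting at all, which would match "lots of rejections" plus "it's slow".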