@dachary Qwen 2.5 is the best of the local models in my testing for the same purpose, and I see the same gap with GPT-4o mini. The biggest thing that jumped out for me was to stay alert for taxonomy decisions that rely on intentions outside of the documents themselves (e.g. "this is for beginners"). Nice to see you’re seeing similar results!