If you’re generating code, and you’re *not* doing it with an LLM, is it reasonable to use metrics like F1 and recall to measure how well your tools are doing? This bothers me because it feels odd to apply these metrics to static analyses, build tooling frameworks, or other tools for which recall isn’t even well defined to begin with.
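
For concreteness, here is a minimal sketch of how precision, recall, and F1 would be computed for something like a static analyzer. It assumes a hand-labeled set of real issues exists to compare against, which is exactly what many of these tools lack. The function name and the sample findings are hypothetical, made up for illustration.

```python
def precision_recall_f1(reported, actual):
    """Score a tool's reported findings against a labeled ground truth.

    Both arguments are iterables of hashable items, e.g. (file, line) pairs.
    """
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)

    # Precision: what fraction of the tool's reports are real issues.
    precision = true_positives / len(reported) if reported else 0.0
    # Recall: what fraction of the real issues the tool found.
    # This needs a known ground-truth set as its denominator.
    recall = true_positives / len(actual) if actual else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: findings as (file, line) pairs.
reported = {("a.py", 10), ("a.py", 42), ("b.py", 7)}
actual = {("a.py", 10), ("b.py", 7), ("c.py", 3), ("c.py", 99)}

p, r, f1 = precision_recall_f1(reported, actual)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Without a labeled `actual` set there is no denominator for recall, which is the part that seems undefined for something like a build tool.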