kaoudis (kaoudis@infosec.exchange), 18-Sep-2024 17:56:04 JST:
If you’re generating code, and you’re *not* doing it with an LLM, is it reasonable to use metrics like F1 and recall to measure how well the tools you use are doing? This is bothering me because it feels a bit weird to apply metrics like this to static analyses, build tooling frameworks, or things that just plain don’t have any recall to begin with.
Conversation
Ryan Castellucci :nonbinary_flag: (ryanc@infosec.exchange), 18-Sep-2024 17:56:04 JST:
@kaoudis generating code, like, with build time scripts?
kaoudis (kaoudis@infosec.exchange), 18-Sep-2024 18:05:29 JST:
@ryanc Yeah, the case I was thinking about is when you want a bunch of semi-reasonable test cases for a compiler or something, and you generate a bunch of build variants.
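A minimal sketch of that kind of build-time variant generation, purely illustrative; the generated function, type list, and output directory are all made up here:

    # Hypothetical sketch: emit a small family of C source variants to use
    # as compiler test cases. Names, types, and paths are invented.
    from pathlib import Path

    def emit_variant(ctype: str, outdir: Path) -> Path:
        # One tiny translation unit per element type; a real generator would
        # vary much more (loop shapes, qualifiers, storage classes, ...).
        src = (
            "#include <stdint.h>\n"
            f"{ctype} sum(const {ctype} *xs, int n) {{\n"
            f"    {ctype} acc = 0;\n"
            "    for (int i = 0; i < n; i++) acc += xs[i];\n"
            "    return acc;\n"
            "}\n"
        )
        path = outdir / f"sum_{ctype}.c"
        path.write_text(src)
        return path

    if __name__ == "__main__":
        out = Path("variants")
        out.mkdir(exist_ok=True)
        for ctype in ("int32_t", "int64_t", "uint8_t"):
            print(emit_variant(ctype, out))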
kaoudis (kaoudis@infosec.exchange), 18-Sep-2024 18:24:15 JST:
@ryanc The thing in question is a talk I’m watching about using LLMs to figure out whether code variants are equivalent. As their baseline, they seem to have used precision, recall, and F1 to measure how well non-ML methods do at determining when code variants are equivalent.
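For reference, this is how those metrics are typically computed when a checker's equivalent/not-equivalent verdicts are scored against ground-truth labels on pairs of variants; the verdicts and labels below are made-up placeholders, not numbers from the talk:

    # Illustrative only: score a checker's equivalent/not-equivalent verdicts
    # against ground-truth equivalence labels. The example data is invented.
    def precision_recall_f1(predicted: list[bool], actual: list[bool]):
        tp = sum(p and a for p, a in zip(predicted, actual))      # true positives
        fp = sum(p and not a for p, a in zip(predicted, actual))  # false positives
        fn = sum(a and not p for p, a in zip(predicted, actual))  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # Checker said "equivalent" vs. whether the pair really was equivalent.
    verdicts = [True, True, False, True, False]
    truth    = [True, False, False, True, True]
    print(precision_recall_f1(verdicts, truth))  # (0.666..., 0.666..., 0.666...)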
Ryan Castellucci :nonbinary_flag: (ryanc@infosec.exchange), 18-Sep-2024 18:24:15 JST:
@kaoudis I have a number of personal projects that use build time code generation, some of it parameterized. Not sure if it would be useful for you to look at. The most complicated is a cryptographic hash library that generates HMAC, PBKDF2, and HKDF functions. I validate via test vectors.
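As a rough illustration of that validation style (not the actual library), here is a test-vector check against RFC 4231 test case 1, with Python's standard hmac module standing in for the generated code under test:

    # Illustrative only: check an HMAC-SHA-256 implementation against a
    # published test vector (RFC 4231, test case 1). Python's standard hmac
    # module stands in for the generated function being validated.
    import hashlib
    import hmac

    key = bytes([0x0B] * 20)
    data = b"Hi There"
    expected = bytes.fromhex(
        "b0344c61d8db38535ca8afceaf0bf12b881dc200c9833da726e9376c2e32cff7"
    )

    tag = hmac.new(key, data, hashlib.sha256).digest()
    assert hmac.compare_digest(tag, expected), "HMAC-SHA-256 test vector failed"
    print("HMAC-SHA-256 test vector passed")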