@ryanc the thing in question is a talk I’m watching about using LLMs to figure out whether code variants are equivalent. For their baseline, they seem to have used precision, recall, and F1 to measure how well non-ML methods do at determining when code variants are equivalent.