PSA: LLMs are not trained on a knowledge base, they are trained on a text corpus - a collection of strings. What you get out is another collection of strings, generated using a prompt, which is itself another text string. The output only contains "knowledge" to the extent that a human edits and curates it
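(The "strings in, strings out" point can be made concrete with a toy sketch. Everything below - the corpus, the prompt, the function names - is invented for illustration; a real LLM learns vastly richer statistics than a bigram table, but the shape of the operation is the same: it ingests text and emits more text, with no knowledge base anywhere in the loop.)

```python
import random
from collections import defaultdict

def train_bigrams(corpus: str) -> dict:
    """Count which word follows which in the training text."""
    words = corpus.split()
    follows = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        follows[prev].append(nxt)
    return follows

def generate(follows: dict, prompt: str, length: int = 5, seed: int = 0) -> str:
    """Continue the prompt by sampling from the learned word statistics."""
    rng = random.Random(seed)
    out = prompt.split()
    for _ in range(length):
        choices = follows.get(out[-1])
        if not choices:  # no data for this word: stop generating
            break
        out.append(rng.choice(choices))
    return " ".join(out)

# A made-up miniature "corpus" and "prompt":
corpus = "the cat sat on the mat the dog sat on the rug"
model = train_bigrams(corpus)
print(generate(model, "the dog"))
```

The output is a plausible-looking continuation of the prompt, assembled purely from statistical remixing of the corpus - which is the thread's point in miniature.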
Amazing how many brilliant scientists will look at an LLM "passing" a standardized test and think, "wow, this computer is very smart" and not "standardized tests are very bad at measuring intelligence"
Like, any lawyer will tell you how little what they did to pass the bar exam resembles what they do at their jobs. The bar exam is easy for a computer to pass because it is designed to be easy to grade with computer assistance.
Teachers don't use multiple choice questions because they are the best way to measure knowledge or skill, they use them because teachers are underpaid and overworked and multiple choice is easy to grade
These are people with PhDs. What's their excuse for looking at an LLM's scores on a test and thinking anything other than, "this test sucks, our entire system of deciding who gets access to higher education is horrible" OH SNAP
MAYBE the unwillingness to question the basis of the entire academic hierarchy they are perched on top of makes researchers embarrassingly credulous when it comes to the "knowledge" of computer programs that lightly remix a few hundred answers to very similar questions?
@bookwar @vaurora Jeff Schmidt's _Disciplined Minds_ got me thinking that the tests filter out people who refuse to think like machines. It's like people have to accommodate themselves to sociopaths for a while as they jump through hoops. And then if they want to get paid well for their trouble it's best not to snap out of the accommodation.
@vaurora Most of the time when I complain about tests, I get replies like: but tests are objective, and uniform, they help us fight biases and remove the human factor. That is a good thing, right?
The problem with this answer is that, even if you believe it, it doesn't matter how objective and unbiased your measurements are if they don't measure the correct thing. And tests simply don't work as a measure of understanding.
@apodoxus @vaurora a softer form of this is confirmation bias: I’m objectively a smart person, doing well on this test was important to my career & most of my peers’, therefore it must be measuring something real.
I used to support neuroscientists & one thing they’d remind everyone of is that humans are notoriously prone to believing we’re being rational when we’re constructing a story that explains the present.
@vaurora Another way to put the same thing is to say they would be discrediting themselves by doing so. Their ego and reputation rely on the fact that they passed these tests, so the tests must be good, or else their ego and reputation are less secure than they thought. Misaligned incentives here.
@vaurora Also from this perspective, an LLM is actually a good vulnerability scanner for various grading and review processes.
It is just that the reaction to it should not be "Let's ban ChatGPT from doing X" - rather, the generic rule should be: "If ChatGPT can do this job/pass this test, the job/test is not worth doing"
@vaurora I’m always amused at still having multiple choice questions for quizzes I need to do occasionally for work, some of which are made painfully obvious because they really don’t want you to fail stuff so basic.
It sometimes feels like:
A coworker has had their foot caught under the wheel of a one-ton power lifting equipment device. Do you:
@vaurora if an LLM CAN’T pass standardized tests easily I’d wonder how badly it was designed, honestly. It’s all rote, exactly where spicy autocomplete should excel.
@UlrikeHahn there is a loooong messy history of essay answer grading, filled with attempts to standardize human grading using simple heuristics and computer assistance. What we are learning is that essay questions that can be easily and uniformly graded are also often questions that can be answered by remixing the answers to similar essay questions
but the essay parts of the bar exam are *not* easy for computers to pass, they are not graded by computer, and they do actually reflect well what lawyers do in their job: take descriptions of things that happened and work out which legal rules apply to them in order to seek a legal resolution
@reagle it's genuinely cool that computers can output plausible-sounding remixes of existing essays and analysis. I'm just noting that this doesn't mean they're doing the things that humans did to produce the training text corpus
@vaurora I’m reminded of how James Randi realized that scientists were bad at examining claims of the paranormal, because the scientists weren’t expecting to be deceived, so they weren’t setting up their experiments to account for the possibility of deception.
@45xiatai what are you trying to measure and why? Because in this society, "intelligence test" means "plausible reason to steer resources to people who already have the most"
@45xiatai "can recall facts in a specific knowledge area selected by the test designer" is a thing yes. One purpose for this is "find people who can compete with each other on TV to see who is best at this skill for the purpose of selling ads to the people watching" :)
@vaurora Your reply here (for which I thank you) kind of goes in a different direction from where I thought the discussion might go when I saw your first comment to which I replied. Supposing, for example, there are a whole bunch of people who grew up with similarly privileged backgrounds, would there be any reason that someone (who?) might want to know who has the better knowledge base among those people? How would someone find out about that?
From the bird site, a professor interrogates the difference between the skills he taught and the skills he graded for: "My exams, however, rewarded discursive fluency and verbal glibness over diligent study." https://twitter.com/alfiekohn/status/1640684775873576961?s=20
I am usually cautious about generic claims that rules or hierarchy turn you into a mindless gear in The Machine.
It is not that black and white and, honestly, self-taught independent thinkers are more likely to reinvent the Perpetuum Mobile than to make a breakthrough in String Theory.
But passing tests is indeed a very specific muscle to train. And it has nothing to do with reasoning or researching - the skills we are supposedly looking for.
@acdha @apodoxus @vaurora My unified theory of the Less Wrong Rationalist AI Doomer complex is that these are people whose whole sense of self worth derives from being good test takers in school, which led them to invest a lot in the idea of intelligence as a single, innate, quantifiable thing. Source: this was almost me.
@misc @acdha @apodoxus same. If you're young and the only affirmation you receive is for your performance on intellectual tasks, it's easy to go down the IQ rabbit hole. My escape was just noticing how annoying those people were and deciding to look for another option
@denzilferreira it's no accident that performing well on the test requires the expenditure of hundreds of hours of mind-numbing practice and tutoring. Who has those resources?