Conversation

Elisabet Roselló (lisrosello@mastodon.social)'s status on Monday, 23-Dec-2024 00:52:38 JST Elisabet Roselló

What a drag, all the OpenAI fans with yet another announcement saying its model has surpassed some % score on tests, when those tests are drawn up by the company itself (which, as we already know, is neither open, nor in it for the love of science, nor free of profit motives, nor governed by a structure free of pressure to deliver financial results)...

- Elisabet Roselló (lisrosello@mastodon.social)'s status on Monday, 23-Dec-2024 00:52:37 JST Elisabet Roselló
  
  Do you know about the ImageNet competition and the Baidu scandal from a decade ago?
  ImageNet was a competition in which universities, startups, or any other group training deep learning image-recognition models could compete to show who had the best AI model.
  The idea was to train the models on a given dataset, while the final test used a different one to show how well they performed (% of correct answers when describing the central figure).
  
- Elisabet Roselló (lisrosello@mastodon.social)'s status on Monday, 23-Dec-2024 00:52:37 JST Elisabet Roselló
  
  They had rules, including not training the AI on the final test data.
  Well, it turned out that the Baidu team, which got very good results, had trained its AI by having it take and retake the test, which in effect trained it on the test, while posing as other users or teams.
  They were disqualified when they got caught.
  What people call a "benchmark" is basically a set of tests with relatively closed-ended answers; sometimes official tests designed for humans, such as the bar exam, are used.
  
- Elisabet Roselló (lisrosello@mastodon.social)'s status on Monday, 23-Dec-2024 00:52:38 JST Elisabet Roselló
  
  *not "elaborar" (to draw up): I mean that the company observes the tests itself and declares them passed.
  Then independent parties come along and re-examine the model's performance on the same tests, trying to work out what process OpenAI used to train it for the test (opaque, they don't explain it), and find that, well, it doesn't hold up.
  
- Elisabet Roselló (lisrosello@mastodon.social)'s status on Monday, 23-Dec-2024 00:52:38 JST Elisabet Roselló
  
  E.g., a little over a year ago OpenAI announced that its model passed the bar exam, and when the test was redone it did not do so well, and on tasks that went beyond the exam cases it did worse.
  Here is the peer-reviewed study and paper:
  https://link.springer.com/article/10.1007/s10506-024-09396-9
  Attachments
  1. Re-evaluating GPT-4’s bar exam performance - Artificial Intelligence and Law
    
    from Martínez, Eric
    
    Perhaps the most widely touted of GPT-4’s at-launch, zero-shot capabilities has been its reported 90th-percentile performance on the Uniform Bar Exam. This paper begins by investigating the methodological challenges in documenting and verifying the 90th-percentile claim, presenting four sets of findings that indicate that OpenAI’s estimates of GPT-4’s UBE percentile are overinflated. First, although GPT-4’s UBE score nears the 90th percentile when examining approximate conversions from February administrations of the Illinois Bar Exam, these estimates are heavily skewed towards repeat test-takers who failed the July administration and score significantly lower than the general test-taking population. Second, data from a recent July administration of the same exam suggests GPT-4’s overall UBE percentile was below the 69th percentile, and ~48th percentile on essays. Third, examining official NCBE data and using several conservative statistical assumptions, GPT-4’s performance against first-time test takers is estimated to be ~62nd percentile, including ~42nd percentile on essays. Fourth, when examining only those who passed the exam (i.e. licensed or license-pending attorneys), GPT-4’s performance is estimated to drop to ~48th percentile overall, and ~15th percentile on essays. In addition to investigating the validity of the percentile claim, the paper also investigates the validity of GPT-4’s reported scaled UBE score of 298. The paper successfully replicates the MBE score, but highlights several methodological issues in the grading of the MPT + MEE components of the exam, which call into question the validity of the reported essay score. Finally, the paper investigates the effect of different hyperparameter combinations on GPT-4’s MBE performance, finding no significant effect of adjusting temperature settings, and a significant effect of few-shot chain-of-thought prompting over basic zero-shot prompting.
    Taken together, these findings carry timely insights for the desirability and feasibility of outsourcing legally relevant tasks to AI models, as well as for the importance for AI developers to implement rigorous and transparent capabilities evaluations to help secure safe and trustworthy AI.
- fenix (librebits@masto.nobigtech.es)'s status on Monday, 23-Dec-2024 00:53:41 JST fenix
  
  @lisrosello a fascinating story ∆∆∆,
  thank you, merci
  