OCRing handwriting is a vastly more valuable use of LLMs than chatbots or image generation. I spent years of my career OCRing big corpora of text, and boy was it bad. I love the idea of a small LLM optimized for handwriting recognition. The National Archives and the Library of Congress both hold huge amounts of valuable information that’s hard for humans to read and unsearchable (and I'm sure there are lots of other such collections). It's nice to see a legitimately good LLM use case.
Waldo Jaquith (waldoj@mastodon.social)'s status on Thursday, 05-Sep-2024 09:58:11 JST
Waldo Jaquith (waldoj@mastodon.social)'s status on Thursday, 05-Sep-2024 09:58:12 JST
Over a decade ago, I worked on a presidential papers project. The audacious goal was to scan in all presidential papers, make them available for download, and extract any possible data. But until the advent of the typewriter, virtually no data *could* be extracted, other than the odd letterhead. My proposal was to collect the images, build a processing pipeline, and, once OCR of handwriting became possible, run it then.
Well, ChatGPT *nailed* this. So many handwritten documents can now be made discoverable!