Conversation
Notices
kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:08:28 JST kaia
does someone know a good PDF to JSON parser by chance? :jelpeek:
kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:09:45 JST kaia
sorry didn't want to cause you physical pain @snacks
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:10:59 JST 翠星石
@kaia That's how you summon Satan himself.
Ghostscript can do many operations on the pdf format.
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:11:37 JST snacks
@kaia pdf is mostly a display format and not really a document format and it tends to show when you try shit like this...
kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:11:44 JST kaia
@Suiseiseki
you will hate this a lot, but I want JSON so I can pass it to an AI API (proprietary)
:blobcatflower: (methyltheobromine@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:12:18 JST :blobcatflower:
@kaia @snacks kaia please don't
narcolepsy and alcoholism :flag: (hj@shigusegubu.club)'s status on Saturday, 28-Dec-2024 00:15:13 JST narcolepsy and alcoholism :flag:
@lucy @kaia @snacks
.mp4
Bricky (thatbrickster@shitposter.world)'s status on Saturday, 28-Dec-2024 00:15:22 JST Bricky
@kaia I would personally convert to a simpler format like EPUB first and then extract the text.
@Suiseiseki
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:15:37 JST snacks
@kaia you can get good results with some hand-tuned heuristics
:blank: (i@declin.eu)'s status on Saturday, 28-Dec-2024 00:15:45 JST :blank:
@kaia depends on how the pdf is compiled, try https://github.com/VikParuchuri/marker/
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:16:19 JST 翠星石
@thatbrickster @kaia >simpler
>EPUB
I don't see how a format that requires an HTML renderer to show text is simpler.
If you can extract the text well enough to correctly convert it, you can directly extract the text instead.
Phantasm (phnt@fluffytail.org)'s status on Saturday, 28-Dec-2024 00:16:49 JST Phantasm
@kaia You are playing with fire, but apparently some madman tried to do this before. It's 4-year-old Python, so it might not work at all.
https://github.com/antoinecarme/pdf_to_json
kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:38:09 JST kaia
a one-page PDF converted with `dedoc` is a 274277-character JSON object, so I call that a big success :Sheew:
ロミンちゃん (romin@shitposter.world)'s status on Saturday, 28-Dec-2024 00:39:07 JST ロミンちゃん
@kaia somebody post the perni.. oh it's already there :l_sure:
purple 👊✊💨 (purple@nya.social)'s status on Saturday, 28-Dec-2024 00:39:51 JST purple 👊✊💨
@kaia@brotka.st noted. i was about to suggest 'unstructured' which is pretty cursed. and overkill.
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:40:26 JST snacks
@kaia @Suiseiseki hey, the exact same reason i needed to work with pdfs :suicide:
kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:41:47 JST kaia
@snacks @Suiseiseki
how did you solve it????
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:42:02 JST snacks
@kaia @Suiseiseki luckily ai tends to gloss over small issues like hyphens in words
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:46:10 JST snacks
@kaia @Suiseiseki all the pdfs were created with the same .docx template, so i wrote a program that just reads all the characters with their size and tries to guess if what it's reading is a headline or not, skipped the impressum and had a bunch of other special handling. Tables were still garbled nonsense, hyphens everywhere, page numbers in the text etc but it turned out ok for embeddings.
Got my hands on the source docx files a while later and those i could actually get the text from as i wanted, after i stopped reading Microsoft's documentation for the format and just looked through them in a text editor to figure out how to parse it
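The size-based heuristic snacks describes could look roughly like the sketch below. It is an illustration, not snacks's actual program: the `Span` type, the `ratio` threshold of 1.2, and the assumption that some PDF library has already produced text runs with their font sizes are all hypothetical.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """One run of text plus its font size (hypothetical output of a PDF library)."""
    text: str
    size: float


def classify(spans, ratio=1.2):
    """Guess 'headline' or 'body' by comparing each span's size to the median size.

    The ratio of 1.2 is an assumed threshold that would need tuning per template.
    """
    body_size = median(s.size for s in spans)
    return [
        ("headline" if s.size >= body_size * ratio else "body", s.text)
        for s in spans
    ]


# A line that holds nothing but digits is treated as a stray page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def clean_body(lines):
    """Drop bare page numbers and rejoin words hyphenated across line breaks."""
    kept = [ln for ln in lines if not PAGE_NUMBER.match(ln)]
    text = "\n".join(kept)
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # "docu-\nment" becomes "document"
    return text.replace("\n", " ")
```

Calling `classify` on the spans of a page yields `(label, text)` pairs, and `clean_body` flattens the body lines into one string. The de-hyphenation step is deliberately crude: it also merges genuine compounds that happen to break at a line end, which matches the "hyphens everywhere" caveat above.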
snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:46:37 JST snacks
@kaia @Suiseiseki also, didn't need json, just plain text
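For the .docx route snacks mentions, plain text is reachable with the standard library alone, because a .docx file is a ZIP archive whose main text lives in `word/document.xml`. The sketch below is one way to do it, not necessarily what snacks wrote; the function name `docx_paragraphs` is made up for illustration, and it ignores headers, footnotes, and any special handling of tables or text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the non-empty paragraphs of a .docx file as plain strings.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):  # <w:p> is a paragraph
        # A paragraph's text is split over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Joining the result with newlines gives the "just plain text" output; paragraphs inside table cells come out as ordinary paragraphs in document order.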
Bricky (thatbrickster@shitposter.world)'s status on Saturday, 28-Dec-2024 00:51:59 JST Bricky
@Suiseiseki It was an example you absolute autist. PDF is full of stuff that needs to be stripped away to reveal plaintext, provided it's not all images.
@kaia
翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:51:59 JST 翠星石
@thatbrickster Unfortunately, I'm diagnosed not autistic.
lainy (lain@lain.com)'s status on Saturday, 28-Dec-2024 03:29:03 JST lainy
@kaia ChatGPT and Claude can both do it
di (di@fsebugoutzone.org)'s status on Saturday, 28-Dec-2024 03:54:36 JST di
@lain @kaia
This, pdf to json/html/other structured format, is a good application of neural networks. Probably there's a model built for this purpose specifically.