Conversation

kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:08:28 JST kaia

does someone know a good PDF to JSON parser by chance? :jelpeek:

In conversation about 3 months ago from brotka.st permalink
- kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:09:45 JST kaia
  in reply to
  - snacks
  sorry didn't want to cause you physical pain @snacks
  In conversation about 3 months ago permalink
  Attachments
  1. Untitled attachment
    https://media.brotka.st/media/80552078a818cee587adf2536fd1122840dadd48bea27121380b2057f83d1096.png
  snacks likes this.
- 翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:10:59 JST 翠星石
  in reply to
  
  @kaia That's how you summon Satan himself.
  
  Ghostscript can do many operations on the pdf format.
  
  In conversation about 3 months ago permalink
  
  kaia and snacks like this.
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:11:37 JST snacks
  in reply to
  
  @kaia pdf is mostly a display format and not really a document format and it tends to show when you try shit like this...
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:11:44 JST kaia
  in reply to
  - 翠星石
  @Suiseiseki
  you will hate this a lot, but I want JSON so I can pass it to an AI API (proprietary)
  
  In conversation about 3 months ago permalink
- :blobcatflower: (methyltheobromine@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:12:18 JST :blobcatflower:
  in reply to
  - snacks
  @kaia @snacks kaia please don't
  
  In conversation about 3 months ago permalink
  
  kaia and snacks like this.
- narcolepsy and alcoholism :flag: (hj@shigusegubu.club)'s status on Saturday, 28-Dec-2024 00:15:13 JST narcolepsy and alcoholism :flag:
  in reply to
  - :blobcatflower:
  - snacks
  @lucy @kaia @snacks
  .mp4
  In conversation about 3 months ago permalink
  Attachments
  1. .mp4
  kaia and snacks like this.
- Bricky (thatbrickster@shitposter.world)'s status on Saturday, 28-Dec-2024 00:15:22 JST Bricky
  in reply to
  - 翠星石
  @kaia I would personally convert to a simpler format like EPUB first and then extract the text.
  
  @Suiseiseki
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:15:37 JST snacks
  in reply to
  - snacks
  @kaia you can get good results with some hand tuned heuristics
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- :blank: (i@declin.eu)'s status on Saturday, 28-Dec-2024 00:15:45 JST :blank:
  in reply to
  
  @kaia depends on how the pdf is compiled, try https://github.com/VikParuchuri/marker/
  In conversation about 3 months ago permalink
  Attachments
  1. GitHub - VikParuchuri/marker: Convert PDF to markdown + JSON quickly with high accuracy
  kaia likes this.
- 翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:16:19 JST 翠星石
  in reply to
  - Bricky
  @thatbrickster @kaia >simpler
  >EPUB
  I don't see how a format that requires a HTML renderer to show text is simpler.
  
  If you can extract the text well enough to correctly convert it, you can directly extract the text instead.
  
  In conversation about 3 months ago permalink
- Phantasm (phnt@fluffytail.org)'s status on Saturday, 28-Dec-2024 00:16:49 JST Phantasm
  in reply to
  
  @kaia You are playing with fire, but apparently some madman tried to do this before. It's 4-year-old Python, so it might not work at all.
  
  https://github.com/antoinecarme/pdf_to_json
  In conversation about 3 months ago permalink
  Attachments
  1. GitHub - antoinecarme/pdf_to_json: Python module to Convert a PDF file to a JSON format
  kaia likes this.
- kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:38:09 JST kaia
  in reply to
  
  a one-page PDF converted with `dedoc` is a 274277-character JSON object, so I call that a big success :Sheew:
  
  In conversation about 3 months ago permalink
- ロミンちゃん (romin@shitposter.world)'s status on Saturday, 28-Dec-2024 00:39:07 JST ロミンちゃん
  in reply to
  
  @kaia somebody post the perni.. oh it's already there :l_sure:
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- purple 👊✊💨 (purple@nya.social)'s status on Saturday, 28-Dec-2024 00:39:51 JST purple 👊✊💨
  in reply to
  
  @kaia@brotka.st noted. i was about to suggest 'unstructured' which is pretty cursed. and overkill.
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:40:26 JST snacks
  in reply to
  - 翠星石
  @kaia @Suiseiseki hey, the exact same reason i needed to work with pdfs :suicide:
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- kaia (kaia@brotka.st)'s status on Saturday, 28-Dec-2024 00:41:47 JST kaia
  in reply to
  - 翠星石
  - snacks
  @snacks @Suiseiseki
  how did you solve it????
  
  In conversation about 3 months ago permalink
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:42:02 JST snacks
  in reply to
  - 翠星石
  - snacks
  @kaia @Suiseiseki luckily ai tends to gloss over small issues like hyphens in words
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
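
For cases where the hyphens do need fixing before the text goes anywhere, a minimal sketch of one possible cleanup step follows. The function name `dehyphenate` and the rule it applies are assumptions for illustration, not something anyone in the thread posted: a hyphen directly before a line break, with a lowercase letter on both sides, is treated as a line-wrap artifact and removed.

```python
import re

# Matches a hyphen at a line break that sits between two lowercase letters,
# e.g. "docu-\nment". Assumption: such hyphens are line-wrap artifacts.
# A real compound like "hand-\ntuned" would be joined as well, so this is
# a heuristic, not a correct solution for every text.
_LINEBREAK_HYPHEN = re.compile(r"(?<=[a-z])-\s*\n\s*(?=[a-z])")


def dehyphenate(text: str) -> str:
    """Join words that were split by a hyphen at the end of a line."""
    return _LINEBREAK_HYPHEN.sub("", text)
```

Hyphens inside a line ("well-known") are left alone because the pattern requires a line break.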
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:46:10 JST snacks
  in reply to
  - 翠星石
  @kaia @Suiseiseki all the pdfs were created with the same .docx template, so i wrote a program that just reads all the characters with their size and tries to guess if what it's reading is a headline or not, skipped the impressum and had a bunch of other special handling. Tables were still garbled nonsense, hyphens everywhere, page numbers in the text etc., but it turned out ok for embeddings.
  Got my hands on the source docx files a while later, and those i could actually get the text from the way i wanted, once i stopped reading Microsoft's documentation for the format and just looked through them in a text editor to figure out how to parse them
  
  In conversation about 3 months ago permalink
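
The size-based headline guess described above could look roughly like the sketch below. This is not snacks' program: the `Line` record, the 1.15 size ratio and the 120-character cap are all made-up illustration values, and the sketch assumes some PDF library has already produced lines of text with a font size attached.

```python
from collections import Counter
from typing import NamedTuple


class Line(NamedTuple):
    text: str
    size: float  # font size in points, as reported by the PDF library in use


def body_size(lines: list[Line]) -> float:
    """Return the font size that covers the most characters (assumed body text)."""
    weights: Counter = Counter()
    for line in lines:
        weights[round(line.size, 1)] += len(line.text)
    return weights.most_common(1)[0][0]


def is_headline(line: Line, body: float, ratio: float = 1.15, max_chars: int = 120) -> bool:
    """Guess: short, non-empty, and clearly larger than body text."""
    text = line.text.strip()
    return bool(text) and line.size >= body * ratio and len(text) <= max_chars


def tag_lines(lines: list[Line]) -> list[tuple[str, str]]:
    """Label every line as "headline" or "text"."""
    if not lines:
        return []
    body = body_size(lines)
    return [("headline" if is_headline(l, body) else "text", l.text) for l in lines]
```

With a shared template the thresholds can be tuned by hand against a few sample documents, which is presumably where the "special handling" comes in.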
- snacks (snacks@netzsphaere.xyz)'s status on Saturday, 28-Dec-2024 00:46:37 JST snacks
  in reply to
  - 翠星石
  - snacks
  @kaia @Suiseiseki also, didn't need json, just plain text
  
  In conversation about 3 months ago permalink
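
For the docx route, a minimal sketch of getting plain text out using only the Python standard library. A .docx file is a ZIP archive whose main body lives in `word/document.xml`, with paragraphs as `w:p` elements and text runs as `w:t` elements. The function name `docx_paragraphs` is made up here, and the sketch assumes the common (transitional) WordprocessingML namespace; it ignores headers, footers and footnotes, and text boxes nested inside a paragraph may show up twice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by ordinary .docx files (assumption:
# the file is not saved in the "Strict" variant, which uses another URI).
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file) -> list[str]:
    """Return the non-empty paragraphs of a .docx as plain strings."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{W}p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(t.text or "" for t in p.iter(f"{W}t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

Usage would be something like `"\n".join(docx_paragraphs("report.docx"))`, where `report.docx` is a placeholder file name.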
- Bricky (thatbrickster@shitposter.world)'s status on Saturday, 28-Dec-2024 00:51:59 JST Bricky
  in reply to
  - 翠星石
  @Suiseiseki It was an example you absolute autist. PDF is full of stuff that needs to be stripped away to reveal plaintext, provided it's not all images.
  
  @kaia
  
  In conversation about 3 months ago permalink
  
  snacks and mangeurdenuage :gnu: :trisquel: :gondola_head: 🌿 :abeshinzo: :ignucius: like this.
- 翠星石 (suiseiseki@freesoftwareextremist.com)'s status on Saturday, 28-Dec-2024 00:51:59 JST 翠星石
  in reply to
  - Bricky
  @thatbrickster Unfortunately, I'm diagnosed not autistic.
  
  In conversation about 3 months ago permalink
- lainy (lain@lain.com)'s status on Saturday, 28-Dec-2024 03:29:03 JST lainy
  in reply to
  
  @kaia ChatGPT and Claude can both do it
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
- di (di@fsebugoutzone.org)'s status on Saturday, 28-Dec-2024 03:54:36 JST di
  in reply to
  - lainy
  @lain @kaia
  
  This, pdf to json/html/other structured format, is a good application of neural networks. Probably there's a model built for this purpose specifically.
  
  In conversation about 3 months ago permalink
  
  kaia likes this.
