jonny (jonny@social.coop)'s status on Monday, 10-Jun-2024 17:27:15 JST jonny

More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique to each download of a PDF*. The attached image is a diff of the metadata from two downloads of the same paper. Combined with access timestamps, this lets them uniquely identify the source of any shared PDF.
Attachments
1. [A list of metadata for a PDF, the important field being two "Unknown:<long random character string>" fields that are color coded to indicate that they have been changed between versions.]
  https://social-coop-media.ams3.cdn.digitaloceanspaces.com/media_attachments/files/107/685/726/594/579/059/original/053de9506bb007c6.jpg
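A rough sketch of how the comparison in the image could be reproduced: dump the metadata of two downloads of the same paper with exiftool and diff the dumps. The function name `pdf_meta_diff` is made up for illustration, and limiting the dump to the PDF and XMP groups is an assumption about where the per-download fields live; widen it if nothing shows up.

```shell
# Compare the metadata of two downloads of the same paper.
# Hypothetical helper, not from the thread. Requires exiftool on PATH.
pdf_meta_diff() {
    if [ "$#" -ne 2 ]; then
        echo "usage: pdf_meta_diff first.pdf second.pdf" >&2
        return 2
    fi
    meta_a=$(mktemp) || return 2
    meta_b=$(mktemp) || { rm -f "$meta_a"; return 2; }
    # -a keeps duplicate tags, -u includes tags exiftool does not recognise,
    # -G1 prefixes each tag with its group. Only the PDF and XMP groups are
    # dumped (an assumption) so file names and file-system timestamps do not
    # show up as noise in the diff.
    exiftool -a -u -G1 -PDF:all -XMP:all "$1" > "$meta_a"
    exiftool -a -u -G1 -PDF:all -XMP:all "$2" > "$meta_b"
    # diff exits 0 when the dumps match and 1 when they differ.
    diff "$meta_a" "$meta_b"
    rc=$?
    rm -f "$meta_a" "$meta_b"
    return "$rc"
}
```

Calling `pdf_meta_diff download_1.pdf download_2.pdf` prints only the lines that differ between the two copies (the file names here are placeholders).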
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:30 JST jonny
  
  you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?
  
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:31 JST jonny
  
  updated the gist with correctly extracted tags, and included python code to extract your own; feel free to add them in the comments. since we don't know what they contain yet, i'm not adding other metadata. definitely patterned, not a hash, but idk yet.
  https://twitter.com/json_dirs/status/1486289288115359747?t=QwmBvbOgh2fCkjSOZSh3Fw&s=19
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:32 JST jonny
  
  of course there's smarter watermarking; the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about a PDF /OpenAction I couldn't make sense of earlier: on open, it accesses the metadata, so it may be doing something with sizes and layout...
  Attachments
  1. [top comment on HN thread] So just take pics of the pages and convert the pics back to a PDF [first sub-comment] A motivated publisher could embed codes by altering in subtle ways the differences in distances or color between adjacent characters, so that they would survive most color or grey scale conversions; a seemingly innocuous frame drawn around a photo could be either larger or smaller by say one millimeter, representing de facto a bit, therefore using enough pages they could identify a book among billions. Unfortunately there's no way to be 100% sure that a complex document doesn't contain some form of embedded code. [second sub-comment] Easier to just strip out the metadata
    https://social-coop-media.ams3.cdn.digitaloceanspaces.com/media_attachments/files/107/688/129/541/792/457/original/8eb65aca095b33e0.jpg
  2. I don't really know what I'm looking at so I can't really describe it. There's a top part that says "Suspicious elements: /OpenAction" and then when I list its properties there is an access to the metadata, some changes to a crop box, etc.
    https://social-coop-media.ams3.cdn.digitaloceanspaces.com/media_attachments/files/107/688/129/613/344/537/original/73cc376e721c9287.jpg
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:33 JST jonny
  
  this is the way to get the correct tags:
  (on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
  will follow up with dataset tomorrow.
  https://twitter.com/horsemankukka/status/1486268962119761924?s=20
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:34 JST jonny
  
  https://twitter.com/SchmiegSophie/status/1486206774159970305?t=GT8fV5QG-4SGTkLadYpCNQ&s=19
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:36 JST jonny
  
  https://twitter.com/kmagnacca/status/1486209676979032064?t=GT8fV5QG-4SGTkLadYpCNQ&s=19
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:37 JST jonny
  
  for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
  https://gist.github.com/sneakers-the-rat/6d158eb4c8836880cf03191cb5419c8f
  Attachments
  1. Elsevier PDF "hashes" (gist from sneakers-the-rat)
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:38 JST jonny
  
  The metadata appears to be preserved on papers from sci-hub. since it works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured
  https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:44 JST jonny
  
  here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory, using the exiftool and qpdf steps described in this thread. It is for mac/*nix-like computers, and you need to have qpdf and exiftool installed:
  https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6
  Attachments
  1. [Screenshot of code at URL in tweet. The script first uses "find" to locate the files, and passes them to a while loop. It creates a clean PDF at a temporary file, removing it if one exists already. Code follows] # Color Codes so that warnings/errors stick out GREEN="\e[32m" RED="\e[31m" CLEAR="\e[0m" # loop through all PDFs in first argument ($1), # or use '.' (this directory) if not given DIR="${1:-.}" echo "Cleaning PDFs in directory $DIR" # use find to locate files, pipe to while read to get the # whole line instead of space delimited # Note -- this will find pdfs recursively!! find $DIR -type f -name "*.pdf" | while read -r i do # output file as original filename with suffix _clean.pdf TMP=${i%.*}_clean.pdf # remove the temporary file if it already exists if [ -f "$TMP" ]; then rm "$TMP"; fi exiftool -q -q -all:all= "$i" -o "$TMP" qpdf --linearize --replace-input "$TMP" echo -e $(printf "${GREEN}Processed ${RED}${i} ${CLEAR}as ${GREEN}${TMP}${CLEAR}"
    https://social-coop-media.ams3.cdn.digitaloceanspaces.com/media_attachments/files/107/686/442/772/750/424/original/f5b43f49b4762cd1.jpg
  2. Gist from sneakers-the-rat:
    Strip PDF Metadata
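For readability, here is a cleaned-up sketch of the script shown in the screenshot, wrapped in a function. It is a reconstruction, not the gist itself: the function name `strip_pdf_metadata`, the tool check, and the skipping of files that already end in `_clean.pdf` are additions, and the color codes are left out for portability.

```shell
# Recursively strip metadata from PDFs under a directory (default: current).
# Reconstructed sketch of the screenshot's script. Requires exiftool and qpdf.
strip_pdf_metadata() {
    dir="${1:-.}"
    for tool in exiftool qpdf; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "error: $tool is not installed" >&2
            return 1
        fi
    done
    echo "Cleaning PDFs in directory $dir"
    # find recurses into subdirectories. Files already ending in _clean.pdf
    # are skipped (an addition, so that re-runs do not stack suffixes).
    find "$dir" -type f -name "*.pdf" ! -name "*_clean.pdf" |
    while IFS= read -r pdf; do
        # output file: original name with the suffix _clean.pdf
        clean="${pdf%.*}_clean.pdf"
        # exiftool refuses to overwrite an existing output file, so remove it
        rm -f "$clean"
        # blank every tag exiftool can write, saving the result to $clean
        if exiftool -q -q -all:all= "$pdf" -o "$clean"; then
            # rewrite the file in place so the old metadata is not left behind
            qpdf --linearize --replace-input "$clean"
            printf 'Processed %s as %s\n' "$pdf" "$clean"
        else
            printf 'Failed to process %s\n' "$pdf" >&2
        fi
    done
}
```

Run it as `strip_pdf_metadata ~/papers` (a placeholder path), or with no argument to clean the current directory. The originals are left untouched next to their `_clean.pdf` copies.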
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:45 JST jonny
  
  Links:
  exiftool: https://www.exiftool.org/
  qpdf: https://qpdf.sourceforge.io/
  dangerzone (GUI, render PDF as images, then re-OCR everything): https://dangerzone.rocks/
  mat2 (render PDF as images, don't OCR): https://0xacab.org/jvoisin/mat2
  Attachments
  1. ExifTool by Phil Harvey
    
    A command-line application and Perl library for reading and writing EXIF, GPS, IPTC, XMP, makernotes and other meta information in image, audio and video files. For Windows, MacOS, and Unix systems.
  2. QPDF: A Content-Preserving PDF Transformation System
  3. Dangerzone
    
    Take potentially dangerous PDFs, office documents, or images and convert them to a safe PDF.
  4. jvoisin / mat2 · GitLab
    
    mat2 is a metadata removal tool, supporting a wide range of commonly used file formats, written in python3: at its core, it's a library, used by an eponymous...
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:46 JST jonny
  
  Also present in the metadata are NISO tags for document status, indicating the "final published version" (VoR), and tags limiting which domains the PDF should be present on. Elsevier scans for PDFs with this metadata, so it's a good idea to strip it any time you're sharing a copy.
  Attachments
  1. Internet takedown programs Elsevier partners with a technology vendor to continuously search the Internet for unauthorized posting of our book and journal content. In accordance with the Digital Millennium Copyright Act (DMCA), we issue “takedown” notices to the operators of websites hosting such unauthorized content. To complement this automated searching, Elsevier maintains online tools for staff to report an infringed work. Our partner then works to expedite reporting, investigation, and removal of specific infringing content. If you discover, or learn about pirated content online, don’t hesitate to let your contact at Elsevier know about it; he or she can use our internal systems to make sure the problem is quickly addressed.
    https://social-coop-media.ams3.cdn.digitaloceanspaces.com/media_attachments/files/107/685/909/936/540/836/original/9040527c7e60d24d.jpg
- jonny (jonny@social.coop)'s status on Tuesday, 11-Jun-2024 04:18:47 JST jonny
  
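  You can see for yourself using exiftool.
  To remove all of the top-level metadata, you can use exiftool and qpdf:
  exiftool -all:all= <path.pdf> -o <output1.pdf>
  qpdf --linearize <output1.pdf> <output2.pdf>
  The qpdf step matters: exiftool's PDF edits are reversible because the old data stays in the file, and rewriting the file with qpdf is what drops it.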
  To remove *all* metadata, you can use dangerzone or mat2
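To check a cleaned copy, one option is to search exiftool's output for a value you noted before cleaning. This is only a sketch: `pdf_meta_contains` is a hypothetical helper, and a "not found" result only covers what exiftool can read, not watermarks hidden elsewhere in the document. Since exiftool apparently squashes whitespace in these fields, a short fragment of the value is likely a safer search string than the whole thing.

```shell
# Report whether a known value (for example a fragment of a per-download
# string noted before cleaning) still shows up in a PDF's metadata.
# Hypothetical helper, not from the thread. Requires exiftool on PATH.
pdf_meta_contains() {
    if [ "$#" -ne 2 ]; then
        echo "usage: pdf_meta_contains file.pdf value" >&2
        return 2
    fi
    # -a keeps duplicate tags, -u includes unknown tags, -G1 shows groups.
    # grep -F treats the value as a literal string, -q suppresses its output.
    if exiftool -a -u -G1 "$1" | grep -F -q -- "$2"; then
        echo "still present in $1"
        return 0
    fi
    echo "not found in $1"
    return 1
}
```

For example, `pdf_meta_contains output2.pdf "<fragment>"` exits with status 0 if the fragment survived and 1 if exiftool no longer sees it.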
  