extracting images from PDFs

pdfimages, packaged in poppler-utils in Debian, can extract images from PDFs.

Its default output format is weird: unless told otherwise, it converts everything to PPM (or PBM for monochrome images).  It would be nice to have it just extract the images, in whatever format they're in, by default.  To get close to that behaviour you need the -all option.  But even that isn't specified as "what comes in, goes out".  Instead, the man page specifies it as:

 Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files.

The reader is left hoping that all the "other images" permitted in whatever version of the PDF format you have (the PDF spec is a bigly moving target) are PNG already.  But I wouldn't bet on it.  So some obscuro format might still be getting converted to PNG.
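One way to see what -all will actually do to a given file is to look at the enc column of pdfimages -list before extracting anything. A minimal sketch, assuming the column layout shown in the poppler man page (enc as the ninth whitespace-separated field, after two header lines), which may differ between versions; the function name converted_by_all and the file foo.pdf are made up for illustration:

```shell
# Read `pdfimages -list` output on stdin and keep only the rows whose
# "enc" column is "image", i.e. images with no native encoding to preserve.
# Assumption: enc is field 9 and the listing starts with two header lines.
converted_by_all() {
    awk 'NR > 2 && $9 == "image"'
}

# Usage, assuming pdfimages is installed and foo.pdf is a real file:
#   pdfimages -list foo.pdf | converted_by_all
```

As far as I can tell from the man page, enc is "image" for plain raster data that carries no image encoding of its own, so any rows this prints are the images that -all re-wraps as PNG (or TIFF) rather than copying out byte for byte.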

The other gotcha is this from the SYNOPSIS and command-line usage summary:

pdfimages [options] PDF-file image-root

You have to read on to discover that image-root is a basename (a filename prefix) for the output files.  Not a directory.  And why not have a default, instead of making this a required argument?  So a common flow will be:

  • user tries pdfimages foo.pdf
  • it refuses, printing a usage message, because image-root is required
  • user interprets image-root as a directory, wants to use the current directory, runs pdfimages foo.pdf .
  • user looks for recently-written files with ls -ltrc
  • they're not there
  • user investigates, tries it again, etc
  • user eventually works out that the output files are hidden files in the Unix utilities sense, starting with dot, because the dot they gave was a basename for filenames, not a directory name, and that they're visible with ls -a
  • user removes the files starting with dot, probably producing some error message when the pattern they give also matches . and ..
  • user finally gives the basename they want with pdfimages foo.pdf foo.images

Such is the reality of "interface design" in the Unix world.
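
The flow above can be short-circuited with a small wrapper. A minimal sketch in POSIX shell: image_root_for and extract_images are hypothetical helper names, not part of poppler, and the only thing assumed about pdfimages is that it names its output image-root-NNN.ext. The last part simulates the stray files left by pdfimages foo.pdf . in a scratch directory and removes them with a pattern that cannot match . or ..:

```shell
# Derive a default image-root from the PDF's name: path/to/foo.pdf -> foo
# (hypothetical helper, not part of poppler).
image_root_for() (
    root=$(basename -- "$1")
    printf '%s\n' "${root%.[Pp][Dd][Ff]}"
)

# Hypothetical wrapper: like pdfimages -all, but image-root is optional.
extract_images() (
    pdf=$1
    root=${2:-$(image_root_for "$pdf")}
    pdfimages -all "$pdf" "$root"
)

# Cleaning up after `pdfimages foo.pdf .`, simulated in a scratch directory.
scratch=$(mktemp -d)
(
    cd "$scratch" || exit 1
    touch ./.-000.ppm ./.-001.ppm   # stand-ins for what pdfimages would write
    ls                              # shows nothing: the names start with a dot
    ls -a                           # shows them, along with . and ..
    rm -- .-[0-9]*                  # matches the stray files but not . or ..
)
```

With that, extract_images foo.pdf should write foo-000.* and so on into the current directory, and extract_images foo.pdf bar uses bar as the root instead.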
