extracting images from PDFs
pdfimages, packaged in poppler-utils in Debian, can extract images from PDFs.
Its default output format is weird: unless told otherwise, it converts images to PBM (monochrome) or PPM (everything else). It would be nicer if it just extracted the images, in whatever format they're in, by default. To get close to that behaviour you need the -all option. But even that isn't specified as "what comes in, goes out". Instead, the man page specifies it as:
Write JPEG, JPEG2000, JBIG2, and CCITT images in their native format. CMYK files are written as TIFF files. All other images are written as PNG files.
The reader is left hoping that "all other images" permitted in whatever version of the PDF format you have (the PDF spec is a bigly moving target) are PNG. But I wouldn't bet on it. So some obscuro format might still be getting converted to PNG.
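One way to see what is actually inside a PDF before trusting -all is the -list option, which prints a table of the images instead of writing them. Below is a rough sketch of a filter that tallies the "enc" column of that table. The function name count_encodings is made up for this post, and it assumes the layout I see from poppler's pdfimages: two header lines, then one row per image with enc as the ninth whitespace-separated field. If your version prints a different layout, adjust the field number.

```shell
# Tally the "enc" column of `pdfimages -list` output read from stdin.
# Assumption: two header lines, then enc is the 9th field of each row.
count_encodings() {
    awk 'NR > 2 && NF >= 9 { n[$9]++ } END { for (e in n) print e, n[e] }' | sort
}

# Usage (needs poppler-utils installed):
#   pdfimages -list foo.pdf | count_encodings
```

Rows whose enc is jpeg, jp2, jbig2 or ccitt should come out of -all in their native format; rows showing plain "image" are, as far as I can tell, the ones that end up as PNG or TIFF.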
The other gotcha is this from the SYNOPSIS and command-line usage summary:
pdfimages [options] PDF-file image-root
You have to read on to discover image-root is a basename for the output files. Not a directory. And why not have a default, instead of having this as a required argument? So a common flow will be:
- user tries pdfimages foo.pdf
- it doesn't do it, complaining with usage message; image-root is required
- user interprets as directory, wants to use current directory, runs pdfimages foo.pdf .
- user looks for recently-written files with ls -ltrc
- they're not there
- investigation, tries it again, etc
- user eventually works out they're hidden files in the Unix utilities sense, with names like .-000.ppm, because the dot they gave was a basename for filenames, not a directory name; they're visible with ls -a
- user removes the files starting with dot, probably producing some error message when the pattern they give also matches . and .., then gives the basename they actually want with pdfimages foo.pdf foo.images
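The flow above could be short-circuited with a small wrapper that supplies the default pdfimages doesn't. This is only a sketch: default_image_root and extract_images are names invented for this post, not part of poppler, and deriving the basename from the PDF's own name is my choice of default.

```shell
# Hypothetical helper: derive an image-root from the PDF's own name,
# e.g. /tmp/foo.pdf -> foo.images
default_image_root() {
    printf '%s.images\n' "$(basename "$1" .pdf)"
}

# Hypothetical wrapper: run pdfimages -all, writing into the current
# directory, using the derived basename unless a second argument is given.
extract_images() {
    ei_pdf=$1
    ei_root=${2:-$(default_image_root "$ei_pdf")}
    pdfimages -all "$ei_pdf" "$ei_root"
}

# Usage (needs poppler-utils installed):
#   extract_images foo.pdf          # files named foo.images-NNN.<ext>
#   extract_images foo.pdf scans    # files named scans-NNN.<ext>
#
# Cleaning up after an accidental `pdfimages foo.pdf .` without
# matching . and ..:
#   rm -- ./.-[0-9]*
```

Because the basename never starts with a dot here, the output is visible to a plain ls.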