Wednesday, July 13, 2022

Interoperating between JPGs and PDFs

I've been working on some paperwork of late that involves dealing with pictures in PDF files. I believe the process is worth documenting, at least for the sake of future reference.


Extracting pictures from a PDF file

If this is only a one-time job, there are online services that get it done. PDF24 is one such service: you upload a PDF file, wait for it to be analyzed, and download the resulting pictures. Simple as that. The quality, though, is a bit questionable: the file sizes of the pictures are about half of the originals, which suggests the pictures have been recompressed.

PDFCandy is a similar service. It even gives you the original pictures, with no compression. But the free version is limited to one file per hour. Anything above that requires a paid plan.

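Being a keyboard-over-mouse kind of guy, I Googled and found a command-line tool, pdfimages, that also does this job without recompressing the embedded pictures. The -all option instructs the tool to save pictures in their original format where it can, so JPEGs stay JPEGs. Note that the last argument is a filename prefix rather than a directory: with the command below, the pictures come out as out_dir/img-000.jpg, out_dir/img-001.jpg and so on, and out_dir must already exist.

pdfimages -all in.pdf out_dir/img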
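
When there is more than one PDF to process, the same command can be wrapped in a loop. The following is only a minimal sketch: the function name extract_all, the "_images" directory suffix and the "img" prefix are made up for illustration, and it assumes pdfimages is on the PATH.

```shell
# extract_all DIR: run pdfimages on every PDF found directly in DIR.
# Pictures from foo.pdf land in DIR/foo_images/ as img-000.*, img-001.*, ...
extract_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue        # the glob matched nothing
        out="${pdf%.pdf}_images"         # foo.pdf -> foo_images (hypothetical naming)
        mkdir -p "$out"
        # -all keeps embedded JPEGs and similar streams in their native format.
        # The last argument is a filename prefix, not a directory.
        pdfimages -all "$pdf" "$out/img"
    done
}
```

Calling extract_all ~/scans would then process every PDF in that folder. Running pdfimages -list in.pdf first shows what is embedded in a file without extracting anything.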


Trimming extra white space from a picture

My wife scanned our passports, page by page, as part of our recent project. She put the passport under a piece of A4 paper, probably to position it similarly for every page. The output pictures therefore have a large white margin, and I needed to trim the extra white space off all of them. Photoshopping them one by one is of course not acceptable. The set of tools provided by ImageMagick gets the job done nicely.

convert in.jpg -crop 1015x1400+635+5 +repage out.jpg

The command also has a -trim option that automatically detects the surrounding white space and gets rid of it. But because each scan differs slightly, the trimmed outputs could vary in dimension. A fixed -crop gives every page the same size, so it is the way to go in my case.
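
The geometry 1015x1400+635+5 reads as WIDTHxHEIGHT+X+Y: keep a region 1015 pixels wide and 1400 pixels tall whose top-left corner sits 635 pixels from the left edge and 5 pixels from the top of the scan. Since every page needs the same crop, the command can be looped over a folder. This is a rough sketch rather than a polished script: crop_all and the directory arguments are hypothetical names, and the default geometry is simply the one from my scans.

```shell
# crop_all SRC DST [GEOMETRY]: crop every .jpg in SRC and write the result to DST.
crop_all() {
    src=$1
    dst=$2
    geometry=${3:-1015x1400+635+5}   # WIDTHxHEIGHT+X+Y; adjust for your own scans
    mkdir -p "$dst"
    for img in "$src"/*.jpg; do
        [ -e "$img" ] || continue    # the glob matched nothing
        # +repage drops the leftover canvas offset after cropping.
        convert "$img" -crop "$geometry" +repage "$dst/$(basename "$img")"
    done
}
```

For example, crop_all scans cropped writes a cropped copy of each scan into cropped/ and leaves the originals alone. On ImageMagick 7 the command is named magick instead of convert.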


Combining multiple pictures into a PDF file

I still needed to combine all the cropped pictures of passport pages back into a PDF file. The convert command from above could do it, but it decodes and re-encodes the JPEGs on the way into the PDF, which results in a loss of quality even when a high quality setting is specified.
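
For reference, the convert route would look roughly like this. It is a sketch of the lossy approach described above, not a recommendation: combine_to_pdf is a made-up name and the quality value of 95 is an arbitrary assumption.

```shell
# combine_to_pdf OUT.pdf PAGE.jpg...: write the given pictures into one PDF,
# one picture per page, in the order they are passed.
combine_to_pdf() {
    out=$1
    shift
    # -quality sets the compression quality used when the pages are re-encoded
    # (95 is an assumed value); the pictures are still decoded and encoded again.
    convert "$@" -quality 95 "$out"
}
```

A call such as combine_to_pdf passport.pdf cropped/*.jpg relies on the shell expanding the glob in alphabetical order, so zero-padded file names (page-01.jpg, page-02.jpg) keep the pages in sequence.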

Working on macOS, I soon realized this is a built-in feature.