hey, pdfcpu, relax
The innards of PDF files are surprisingly complex. My heartfelt respect goes to the libraries out there that handle parsing and converting PDF files.
What rubs salt into the wound is that this complexity exists to ensure high-fidelity rendering of pages across devices, not to give the content a semantic structure. This mismatched complexity is bad news for RAG applications, but what bit me today is far smaller in scope.
I'm browsing the PDF standard spec as I write this post. Among the features defined in the international standard, there are operators for styling and formatting. There are even operators for computation (arithmetic, boolean, bitwise, conditional, stack/array). The spec defines several data types as well: integers, real numbers, booleans, and so on.
Relevant for our case study today is that boolean values are represented by two keywords: true and false.
I wanted to combine a few PDF files while conditionally arranging the pages. I asked GitHub Copilot (with Claude Sonnet 4.5) to write a Go program to do that. While reading one of the files, pdfcpu threw an error when dereferencing a malformed boolean field.
GitHub Copilot went on to try a different library (unipdf) before giving up because unipdf required a license.
It decided on its own to use Python instead. It did the job using pypdf.
That was a roundabout way to fix it. Taking a closer look, the problem involved this error message:
dereferenceBoolean: wrong type <(False)>
The issue in the source PDF file is that the boolean value is written as (False), which in PDF syntax is a literal string, when it should be the keyword false.
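To make the difference between strict and relaxed handling concrete, here is a minimal sketch in Go. It is not pdfcpu's actual code: parseBool and its relaxed flag are hypothetical names I made up for illustration, and the sketch assumes a relaxed parser would accept a parenthesized string that spells true or false in any letter case.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBool interprets a raw PDF object token as a boolean.
// Hypothetical helper for illustration only; not part of pdfcpu.
// Strict mode accepts only the keywords true and false.
// Relaxed mode (an assumption of this sketch) also accepts a literal
// string such as (False) whose content spells a boolean.
func parseBool(token string, relaxed bool) (bool, error) {
	switch token {
	case "true":
		return true, nil
	case "false":
		return false, nil
	}
	isLiteralString := len(token) >= 2 &&
		strings.HasPrefix(token, "(") && strings.HasSuffix(token, ")")
	if relaxed && isLiteralString {
		// Strip the parentheses and compare case-insensitively.
		switch strings.ToLower(token[1 : len(token)-1]) {
		case "true":
			return true, nil
		case "false":
			return false, nil
		}
	}
	return false, fmt.Errorf("parseBool: wrong type <%s>", token)
}

func main() {
	// Try the malformed token in both modes.
	for _, relaxed := range []bool{false, true} {
		v, err := parseBool("(False)", relaxed)
		fmt.Printf("relaxed=%t value=%t err=%v\n", relaxed, v, err)
	}
}
```

In strict mode the malformed token is rejected with an error shaped like the one I hit, while relaxed mode reads it as false and moves on.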
And it turned out pdfcpu could have allowed it if I (or rather Copilot) simply did this:
conf := pdfcpu.NewDefaultConfiguration()
conf.ValidationMode = pdfcpu.ValidationRelaxed
