small hallucinations

hey, pdfcpu, relax

The innards of PDF files are surprisingly complex. My heartfelt respect goes to the libraries out there that handle parsing and converting PDF files.

What rubs salt in the wound is that this complexity exists to ensure high-fidelity rendering of pages across devices, not to provide a semantic structure for the content. This mismatched complexity is bad news for RAG applications, but what bit me today is far smaller in scope.

I'm browsing the PDF standard spec as I write this post. Among the features defined in the international standard, there are operators for styling and formatting. There are even operators for computation (arithmetic, boolean, bitwise, conditional, stack/array). The spec defines several data types as well: integers, real numbers, booleans, and so on.

Relevant for our case study today is that boolean values are represented by two keywords: true and false.

I wanted to combine a few PDF files while conditionally arranging the pages. I asked GitHub Copilot (with Claude Sonnet 4.5) to write a Go program to do that. While reading one of the files, pdfcpu returned an error when dereferencing a malformed boolean field.

GitHub Copilot went on to try a different library (unipdf) before subsequently giving up because unipdf required a license.

It decided on its own to use Python instead. It did the job using pypdf.

That was a roundabout way to fix it. Taking a closer look, the problem involved this error message:

dereferenceBoolean: wrong type <(False)>

The issue in the source PDF file is that the boolean value should be false and not (False).
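In PDF syntax, parentheses delimit a literal string, so (False) is a string that happens to spell a boolean, not the keyword false. Here is a minimal Python sketch of that distinction. The function names and the `relaxed` flag are my own illustration and are not how pdfcpu is implemented:

```python
def classify_pdf_token(token: str) -> str:
    """Classify a single PDF token by its surface syntax."""
    token = token.strip()
    if token in ("true", "false"):
        return "boolean"          # the only two boolean keywords
    if token.startswith("(") and token.endswith(")"):
        return "literal string"   # parentheses delimit a string
    if token.startswith("/"):
        return "name"
    return "other"


def coerce_boolean(token: str, relaxed: bool = False) -> bool:
    """Read a boolean; in relaxed mode also accept a string spelling one."""
    kind = classify_pdf_token(token)
    if kind == "boolean":
        return token.strip() == "true"
    if relaxed and kind == "literal string":
        text = token.strip()[1:-1].strip().lower()
        if text in ("true", "false"):
            return text == "true"
    raise ValueError(f"wrong type <{token}>")
```

A strict reader rejects (False), while a relaxed one can recover the intended value.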

And it turned out pdfcpu could have allowed it if I (or rather Copilot) simply did this:

conf := pdfcpu.NewDefaultConfiguration()
conf.ValidationMode = pdfcpu.ValidationRelaxed

TIL 251014

Claude Sonnet 4 generates good Elixir code, except when it adds return at the end of a function.

You can run a .exs file from the terminal by doing this:

mix run my_script.exs "arg"

This works if you have set up your script like this:

defmodule MyModule do
  def main(args) do
    ## omitted
  end
end

MyModule.main(System.argv())

TIL 251004

I've been reading “Designing Data-Intensive Applications”.

Some interesting things I’ve learned so far:

Human error is the leading cause of outages. To quote the book:

...one study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages.

Hardware failures happen more often than I'd expected. Each piece of hardware is eventually going to fail. Two useful metrics are “mean time to failure” (if you throw it away when it fails) and “mean time between failures” (if you repair it when it fails). The values of these metrics aren’t infinite. With so many CPUs, RAM modules, GPUs, and hard drives, something will be failing all the time.

One example given in the book goes like this:

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
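A quick back-of-the-envelope check of that quote, as a sketch assuming a constant failure rate and disks that fail independently (the function name is mine):

```python
def expected_failures_per_day(disks: int, mttf_years: float) -> float:
    """Average number of disk failures per day across the whole cluster."""
    # Each disk fails on average once per MTTF, so the cluster-wide rate
    # is simply the disk count divided by the MTTF expressed in days.
    return disks / (mttf_years * 365)


for mttf in (10, 50):
    rate = expected_failures_per_day(10_000, mttf)
    print(f"MTTF {mttf} years: {rate:.2f} failures per day")
```

With 10,000 disks this gives about 2.7 failures per day at a 10-year MTTF and about 0.5 per day at 50 years, so "one disk per day" sits in the middle of that range.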

In addition to the above, since modern cloud services prioritize flexibility and elasticity over the stability of any single machine, you need to anticipate these factors when designing your software.

One of the techniques mentioned in the book for handling faults is process isolation.

In ancient times, when software ran closer to the bare metal, this concept meant one CPU process should not touch memory addresses or other resources used by another process. In our modern-day context, this concept extends to technologies like containerization.
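To make the idea concrete, here is a small sketch of my own (not an example from the book) that runs a task in a separate operating system process. The helper name `run_isolated` is hypothetical. If the child crashes, the parent only sees an exit code and keeps running:

```python
import subprocess
import sys


def run_isolated(code: str, timeout: float = 10.0) -> int:
    """Run a Python snippet in a child process and return its exit code."""
    # The child has its own memory space, so a crash there cannot
    # corrupt or terminate the parent process.
    result = subprocess.run([sys.executable, "-c", code], timeout=timeout)
    return result.returncode


crash_code = run_isolated("import os; os._exit(3)")  # simulated hard crash
ok_code = run_isolated("pass")                        # well-behaved task
print("crashed task exit code:", crash_code)
print("healthy task exit code:", ok_code)
```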

What I learned building a scraper and RSS generator

I promised a friend I'd build a tool to monitor changes on a website and convert the updated articles to an RSS feed.

At first I tried using Django, which boasts “batteries included”. There is a Django library that handles scheduled tasks. I forgot the name—it was such a long time ago after all. What I remember is that it caused circular library dependencies and required setting up migrations, since it managed tasks and their run records in the database.

Months passed before I attempted the project again.

This time I used Go. I had built small projects in Go prior to this. I could unapologetically say “I know Go,” because who doesn’t, with its syntax being so transparent?

Yet it still took me a long time to finish the project.

There were conflicting incentives. On top of building the project, I wanted to learn new things, and it’s fair (even good) to learn new things along the way. At this point I didn’t want to build too much of a UI. I read about HTMX, then opted for Alpine.js after comparing their syntaxes; both promised interactivity with minimal scripting in HTML pages. Yet after some struggling with templating in Go, I missed JSX. I also found it difficult to wrap my head around embedding data into an HTML element using a custom attribute.

Then there was mission creep. When I set out to work on this project, the initial goal was to monitor one section on one website. Then I asked myself, wouldn’t it be more useful if I allowed people to add websites to track?

In the end product, you can add websites and sections. The app monitors website changes, scrapes pages whose URLs match a pattern, and extracts the title, author, publication date, and content using CSS selectors. All the updates are displayed in the RSS feed view.
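The real app is written in Go and extracts fields with CSS selectors. The sketch below is a simplified illustration in Python using only the standard library, and it covers just two of the steps above: filtering URLs by a pattern and rendering already-extracted articles as an RSS 2.0 feed. All names, URLs, and the sample article are made up:

```python
import re
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def matching_urls(urls, pattern):
    """Keep only the URLs that match the section's regular expression."""
    regex = re.compile(pattern)
    return [u for u in urls if regex.search(u)]


def build_rss(feed_title, feed_link, articles):
    """Render a list of article dicts as an RSS 2.0 document string."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Updates from {feed_link}"
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["url"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(article["published"])
        ET.SubElement(item, "description").text = article["content"]
    return ET.tostring(rss, encoding="unicode")


candidates = ["https://example.com/news/1", "https://example.com/about"]
to_scrape = matching_urls(candidates, r"/news/\d+$")
feed = build_rss("Example feed", "https://example.com", [{
    "title": "Hello",
    "url": to_scrape[0],
    "published": datetime(2025, 1, 1, tzinfo=timezone.utc),
    "content": "Full text of the article.",
}])
```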

Then I thought, who has time to read all this word soup? So I decided to add an API call to ask OpenAI to summarize the full text for me. Now, with these added features, I moved the UI from Go templates and Alpine.js to a full-blown React project.

GitHub Copilot helped a lot during development. One shift in my mindset especially helped me accelerate the development process.

At the beginning, the questions I asked LLMs were “how do I do this?” Upon getting a response, I'd read it carefully, trying to understand the suggested approach and the reasoning behind it.

While good for learning, this significantly slowed me down. As coding agents became more capable, I soon slipped into asking “Do that for me.” Then the whole process became much faster and more pleasant.

I have a habit of taking notes and creating Anki cards. I thought conversations with LLMs were a good source of knowledge. In the end I realized most of these conversations are transient, scenario-specific, and not worth memorizing.

I’m sure there is a lot of background knowledge behind how each function is called and how each code block is structured, and such knowledge is useful for someone like me who’s relatively new to Go.

But there's a cadence to learning and building. To use a painting analogy, laying out the perspective and applying colors are both important. “How do I do it?” questions are the latter. When you let coding agents solve these problems, you can focus on the perspective part, which is more relevant to the structure of the whole picture.

In a real problem I faced, “How do I handle a nil value when I parse a row of SQL query results?” is about a detail. The fact that you need to handle the nil value is more about the whole. As long as you know you need to handle that, I figure it's fine to delegate the details to coding agents.

Fixing a 503 error caused by health probe

I was working with a container app deployed on Azure recently. This container app provides a REST endpoint that allows users to upload files for processing.

A few days ago, uploading larger files started failing repeatedly. These files weren't particularly large either. One that consistently failed to upload was only 4 MB.

I tried uploading this file both via curl and through the web UI. Both attempts failed with a 503 status code. That ruled out a CORS issue: CORS is enforced by the browser, so it would not have caused curl to fail.

Interestingly, we didn't find these POST requests in the logs. This suggested the requests never reached the container app.

By inspecting the configuration of this and other container apps deployed on Azure, I noticed a health probe setting for this app.

It turned out Azure was checking whether the service was alive every 10 seconds. When the app became temporarily unresponsive during the upload and processing, the health probe likely timed out.

Azure would have interpreted this as a sign that the container was down, and either removed it from the load balancer or tried to restart it. Either way, the request was abruptly terminated, resulting in the 503 error.
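The mechanism can be sketched as a tiny simulation. The 10-second period comes from the probe setting above. The failure threshold of 3 consecutive misses, the 60-second horizon, and the function names are assumptions for illustration, not values read from my app's configuration:

```python
def probe_outcomes(busy_start, busy_end, period=10, horizon=60):
    """Return (time, ok) pairs; a probe fails if it lands while the app is busy."""
    return [
        (t, not (busy_start <= t < busy_end))
        for t in range(0, horizon + 1, period)
    ]


def marked_unhealthy(outcomes, failure_threshold=3):
    """True once `failure_threshold` consecutive probes have failed."""
    run = 0
    for _, ok in outcomes:
        run = 0 if ok else run + 1
        if run >= failure_threshold:
            return True
    return False


# App unresponsive from second 5 to second 40: probes at 10, 20, 30 fail.
long_upload = marked_unhealthy(probe_outcomes(5, 40))
# App unresponsive from second 5 to second 12: only the probe at 10 fails.
short_upload = marked_unhealthy(probe_outcomes(5, 12))
print(long_upload, short_upload)
```

Under these assumptions a long enough blocking upload trips the threshold, while a short one does not, which matches small files working and larger ones failing.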

Sell something useless

Apparently Labubu, the cute plushie with a wicked smile, has become a thing.

Wang Ning, the founder of Pop Mart, believes in selling what is “useless”. One thought experiment he uses to illustrate this idea goes like this:

Would we sell as many Molly toys if we added a USB flash drive to them?

That would certainly give the toys some kind of “use”. But that also reminds potential buyers that they do not actually need that “use”. Who really needs another USB thingy after all?

A similar idea is being discussed in Japan’s retail industry, where businesses are said to be transitioning from selling “mono” (もの) to selling “koto” (こと). Both words translate to “things” in English. The distinction is that “mono” means a tangible object, while “koto” means something intangible, such as an event or an experience.

Fixing a `form-data` boundary error

I'm starting to use the err tag on this blog to document these small things.

I was trying out an endpoint that takes a file field. The code looked something like this.

import requests

file = {'document': open('document.pdf', 'rb')}

headers = {
    'Content-Type': 'multipart/form-data',
    'Accept': 'application/json'
}

response = requests.post(url,
    files=file,
    headers=headers)

Upon sending this request, I was greeted with a "boundary error".

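This happens because requests generates the multipart boundary itself and writes it into the Content-Type header of the POST request. If you set Content-Type manually, your value overrides the generated one and the boundary parameter goes missing, so the server cannot split the body into parts.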
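To see why the boundary matters, here is a rough standard-library sketch of what an HTTP client has to do when it encodes a file upload. This is my own simplified illustration, not the actual code inside requests, and the function name is hypothetical:

```python
import uuid


def encode_multipart(field_name, filename, payload):
    """Build a multipart/form-data body and the Content-Type that matches it."""
    boundary = uuid.uuid4().hex  # random marker separating the parts
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        payload,
        f"--{boundary}--".encode(),  # closing marker
        b"",
    ])
    # The header must carry the same boundary that appears in the body.
    content_type = f"multipart/form-data; boundary={boundary}"
    return content_type, body


content_type, body = encode_multipart("document", "document.pdf", b"fake pdf bytes")
print(content_type)
```

A bare 'multipart/form-data' header has no boundary parameter, so it can never match the body. In the snippet above, the fix is to remove the 'Content-Type' entry from headers and keep only 'Accept', letting requests fill in the header itself.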