Originally posted on June 10th 2013
As my interest in type grows, I can see what pro typographers mean about it being a disease. Sizes, fonts, line heights and column widths matter to me now more than ever. It’s nice, but it’s also a curse: today, rather than studying the horrible PDF that my history of art teacher exported straight from PowerPoint, I spent hours extracting the text from between the slides, and the text on the pictures.
For someone like my teacher, who can have such deep thoughts about a still work of art, it’s hard to imagine how he can make such bad presentations and spend so little time on the body of his course.
Still, he considers the slides that he puts up on the screen while giving the course in the auditorium to be the textbook. That is the only file he gives us: one (mvfk) huge PDF with a year’s course packed in.
The burning point is the encryption on the PDF, though. No way to extract, copy and paste the text into something else, no way to get out of the PDF view, no print-ready version. So here was my day’s work:
Extracting the text, fixing the PDF, editing it and reprinting to PDF was a dead end, so I went old school: OCR, Optical Character Recognition. Some scanners come with built-in apps that probably use some flavor of Tesseract as a foundation. They let you scan a sheet of printed text, and the OCR translates it back into some form of malleable digital text, ready for a new layout.
So in this instance, I ended up making screenshots of the PDF text pages, then running them through GOCR, which is released under the GPL licence. The simple applet accepts most file formats and outputs a lovely .txt file. One cat command later and I’m presented with a full output.txt of the PDF.
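For anyone wanting to repeat this, here is a rough sketch of the OCR-then-concatenate step in Python. It is not the exact procedure I followed (I used the applet and a plain cat), and it rests on a few assumptions: the screenshots sit in one folder, they are named with zero-padded numbers such as page-001.png so that sorting gives page order, and a gocr binary is on the PATH and can read the image format in question (the `-i` input flag is the only option used). The function names `ocr_page` and `ocr_pages` are made up for illustration.

```python
import subprocess
from pathlib import Path


def ocr_page(image_path, command=("gocr", "-i")):
    """Run the OCR command on one image and return the recognised text.

    Assumes `gocr` is installed and can read the image format directly.
    """
    result = subprocess.run(
        [*command, str(image_path)],
        capture_output=True,
        text=True,
        check=True,  # raise if the OCR tool exits with an error
    )
    return result.stdout


def ocr_pages(image_dir, pattern="page-*.png", output_name="output.txt", ocr=ocr_page):
    """OCR every matching screenshot and join the results into one text file.

    The file naming pattern is an assumption: zero-padded names keep the
    pages in reading order when sorted alphabetically.
    """
    image_dir = Path(image_dir)
    pages = sorted(image_dir.glob(pattern))
    texts = [ocr(page) for page in pages]
    output_path = image_dir / output_name
    # Equivalent of the final `cat`: one file holding every page in order.
    output_path.write_text("\n".join(texts), encoding="utf-8")
    return output_path
```

Calling `ocr_pages("screenshots")` would then leave an output.txt next to the images, assuming that folder name. The `ocr` parameter is there so a different OCR tool can be swapped in without touching the rest.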
I made this sound fast, but it was not. Mainly, the process of screenshotting close to 200 pages (manually, as the text flows over different parts of different pictures) was a pain, especially when I realized that the system-wide screen capture function would not give me enough resolution to get GOCR to work properly. There is one thing the PDF has, and that is ligatures: the teacher uses the ‘hum’ lovely Calibri font, which has a ton of TTF ligatures built in.
Anyway, procrastination is great, and I did get some work done, but this had to be done. I can republish the new fluid text versions of the textbooks to our school’s sharing platform, and claim some fame. (I probably won’t though, realistically.)