Friday, October 18, 2013

Removing yellow highlighter from color document scans

tl;dr - Just using the red channel works pretty well.

I have a lot of academic books and no shelf space, so I am gradually scanning them. (When I don't have a guillotine available to cut up books, I use 1dollarscan, which I recommend.)

Unfortunately, many of the books are highlighted (because I actually read them at the time).  I happen to have used yellow highlighter consistently, but in many cases the "yellow" was laid on too thickly, was a bit orange (different brands), or darkened a bit over time.  Dark highlighter is often quite visible in bilevel or grayscale scans, sometimes obscuring the text and interfering with OCR.

If you have a color scan of what was originally a black & white document, simply extract the red channel (using, e.g., pamchannel 0) and the result is basically a grayscale scan with the yellow or orange highlighting removed!  (Note that the cheapest option for 1dollarscan is to scan in 300dpi color.)
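To see what the red channel does to individual pixels, here is a minimal sketch in plain Python. The helper name red_channel_gray and the sample pixel values are my own assumptions for illustration (the "orange-ish" value in particular is a guess, not a measurement from a real scan); for real scans you would use an image tool such as pamchannel.

```python
# Hypothetical helper, not part of any library.
# Each pixel is an (R, G, B) tuple with values in 0..255.
def red_channel_gray(pixels):
    """Keep only the red channel and treat it as a grayscale value."""
    return [r for (r, g, b) in pixels]

# Illustrative sample pixels (assumed values, not taken from a real scan).
samples = {
    "white paper": (255, 255, 255),
    "black text": (0, 0, 0),
    "bright yellow highlighter": (255, 255, 0),
    "orange-ish highlighter": (255, 165, 0),
}

gray = dict(zip(samples, red_channel_gray(samples.values())))
for name, value in gray.items():
    print(f"{name}: {value}")
```

Both highlighter samples end up with the same gray value as the white paper, while the black text stays black.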

Why?  In RGB, "white" is strong red, green, and blue; "black" is weak red, green, and blue.  Most of your color-scanned b&w document is close to 255-255-255 (white) or 0-0-0 (black).  In RGB, "yellow" is strong red and strong green with weak blue; "orange" is strong red with moderate green and weak blue.  So bright yellow highlighting is something like 255-255-0.  If you just take the first channel (red) and treat it as grayscale, the yellow highlighting literally becomes "white"!  (Grayscale is a single channel, where white is 255 and black is 0.) This picture from Wikipedia is helpful:

[Image: color wheel, from Wikipedia]

As an added bonus, since non-acid-free paper turns yellow or yellowish-brown, this trick often removes the background color that is visible in color scans of old-ish books.  (The more brown, the less well this works.)

You can see from the color wheel that the green channel and blue channel can also be useful, depending on the range of highlighter colors you used.

(If you used several highlighter colors on your books, things get more complicated.  There are mediocre research papers on the topic of removing arbitrary colors, but the short answer is that you can often take advantage of the fact that "colors" have unequal R-G-B values whereas the original black type and white paper had roughly equal R-G-B values.  So if max(R,G,B) - min(R,G,B) > some_delta, set the pixel to white.  A generalization of the red channel hack that is easier to compute - but that is only useful for bright colors, as with highlighter - is simply taking max(R,G,B) as your grayscale value; this has no effect on gray values and strictly lightens all other colors.)
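The two multi-color ideas in the previous paragraph can be sketched per pixel as follows. This is only an illustration in plain Python: the function names, the default delta of 60, and the choice to average R, G, and B for pixels that are not whitened are all my assumptions, and the threshold would need tuning on real scans.

```python
# Hypothetical helpers; each pixel is an (R, G, B) tuple with values in 0..255.

def whiten_if_colored(pixel, delta=60):
    """Set strongly colored pixels to white; keep near-gray pixels.

    delta=60 is an arbitrary assumed threshold.
    """
    r, g, b = pixel
    if max(r, g, b) - min(r, g, b) > delta:
        return 255  # unequal channels: treat as highlighter, make it white
    # Assumption: a plain average is good enough for near-gray pixels.
    return (r + g + b) // 3

def max_channel_gray(pixel):
    """Use the brightest channel as the grayscale value."""
    return max(pixel)

# Illustrative pixels (assumed values, not taken from a real scan).
for name, px in [
    ("black text", (10, 10, 12)),
    ("white paper", (250, 250, 245)),
    ("yellow highlighter", (255, 255, 0)),
    ("pink highlighter", (255, 105, 180)),
    ("blue highlighter", (80, 160, 255)),
]:
    print(name, whiten_if_colored(px), max_channel_gray(px))
```

The threshold version can whiten dark or muddy colors too, at the cost of picking a delta; the max version needs no parameter but only helps when at least one channel of the highlighter color is close to 255.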