Have you ever been in the situation where you're sitting on a bunch of voucher codes for the Nintendo eShop, but you can't redeem them because they've been passed through JSFuck and then printed out, in a non-fixed width font?
Fear not, because this post will help you get at those codes!
Your first idea when faced with this problem is probably to just use some OCR package. Big mistake. Not only does conventional OCR rely on the premise that the text to be recognized has some sort of meaning in a natural language, but it also assumes that it's OK to get a character slightly wrong every now and then.
Getting a single character wrong means we're not getting those codes, and off-the-shelf OCR will get most of them wrong. Unacceptable. Next!
Having established that off-the-shelf OCR won't help us, we'll have to implement
our own character recognition algorithm specifically for this task.
Fortunately, the problem is very limited in scope: the only characters we need
to recognize are the six that JSFuck uses, `[]()!+`, and since we just care
about getting those codes, we're only going to handle the particular font
they're printed in.
First, though, we need to scan our document (the higher the resolution the better) and chop the resulting images into pieces, one for each character. This is relatively straightforward: we begin by splitting the image into lines, by reading rows of pixels from the image until we reach a row that's almost empty — a line that has, say, only 5-10 non-white pixels. To split the resulting lines into characters, we simply rotate them 90 degrees and do the same thing again, but this time with a lower threshold of non-white pixels for a "line" to be considered empty.
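The splitting step described above can be sketched roughly like this. It's only an illustration, not the actual jsfuck-ocr code: the image is assumed to already be binarized into a list of rows of 0/1 pixels, the names (`split_on_empty_rows`, `segment`) and the default thresholds are made up for the example, and a transpose stands in for the 90 degree rotation, since all we need is for columns to become rows.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels; 1 = ink (non-white), 0 = white


def split_on_empty_rows(img: Image, max_ink: int) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges of the bands between near-empty rows.

    A row counts as "empty" when it has at most `max_ink` ink pixels.
    """
    bands = []
    start = None
    for y, row in enumerate(img):
        empty = sum(row) <= max_ink
        if not empty and start is None:
            start = y                      # a new band begins here
        elif empty and start is not None:
            bands.append((start, y))       # band covers rows start..y-1
            start = None
    if start is not None:                  # band runs to the last row
        bands.append((start, len(img)))
    return bands


def rotate(img: Image) -> Image:
    """Transpose the image so that columns become rows."""
    return [list(col) for col in zip(*img)]


def segment(img: Image, line_max_ink: int = 5, char_max_ink: int = 1) -> List[List[Image]]:
    """Cut a page into lines, then cut each line into character images."""
    page = []
    for top, bottom in split_on_empty_rows(img, line_max_ink):
        turned = rotate(img[top:bottom])   # columns of this line, as rows
        chars = [
            rotate(turned[left:right])     # turn each piece upright again
            for left, right in split_on_empty_rows(turned, char_max_ink)
        ]
        page.append(chars)
    return page
```

The thresholds will need tuning for the scan at hand: too low and specks of dirt glue lines together, too high and thin characters get cut apart.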
Oh, and before we continue: yes, I know I could have just trained a neural network or something on the problem at this point and likely gotten fairly good results with some tinkering, but where's the fun in that?
Now that we have pictures of the individual characters, we need to somehow figure out what each of those pictures means.
One way would be to manually identify one of each unique character in our image
(there are only six of them, so this is straightforward) and use the difference
between an identified character and a character we want to recognize to guess
which character we're looking at.
If the character being scrutinized shares the largest number of pixels with
the character we identified as `+`, then we can be reasonably sure we're
looking at another plus sign.
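In code, that comparison might look something like the sketch below. It assumes every character image has already been scaled to the same size and binarized to 0/1 pixels; the function names are invented for the example.

```python
from typing import Dict, List

Image = List[List[int]]  # 1 = ink, 0 = white; all images assumed equal in size


def shared_ink(a: Image, b: Image) -> int:
    """Count the positions where both images have an ink pixel."""
    return sum(
        pa & pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def classify_by_overlap(glyph: Image, templates: Dict[str, Image]) -> str:
    """Return the label of the hand-identified template sharing the most ink."""
    return max(templates, key=lambda label: shared_ink(glyph, templates[label]))
```

Here `templates` would map each of the six characters to the one sample we identified by hand.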
Unfortunately, this seems to work rather poorly. While it is appealing to use characters from the image itself as the base for recognition, as this would make the algorithm easier to apply to documents in other fonts (yeah, I know I said we're not doing this, but...), it seems that many characters share just enough pixels that noise becomes a major factor in the recognition process. In particular, parentheses and brackets just look too alike for this to be practical, generating huge amounts of false positives. Next!
If we can't just stupidly diff pixels, we have to actually look at the
characters and identify some features unique to each one.
For instance, what makes a `+`? Well, in this character set, it
is significantly lower than all other characters.
Furthermore, it's also the only character to have a vertical bar from top to
bottom somewhere in the center, and likewise for the horizontal bar about
halfway down. A `!` is characterized by being relatively narrow and by having
a gap between the bar and the dot; a `(` has black pixels in the top and
bottom right corners and halfway down on the left edge, but none in the top
and bottom left corners or halfway down the right edge; a `]` has a vertical
bar on the right and two horizontal ones at the top and bottom respectively;
and so on.
By writing a manual predicate for each character (`isOpenParen`, etc.) and
checking each character against each predicate,
we can uniquely identify most characters.
Unfortunately, while this approach doesn't have anywhere near the false positives
problem of the previous solution, some characters still match more than one
predicate due to noise present in the picture.
So what can we do about that?
It turns out that almost all ambiguously identified characters are parentheses
or brackets; that means we can rule some of them out: if the last opening
character (`(` or `[`) was a parenthesis, we know that we will always see a
corresponding closing parenthesis before we see a closing bracket, and vice
versa. Additionally, we know that `!` can never appear in a postfix position,
as well as some other things.
To incorporate this knowledge, we add a context to our character recognition, which keeps track of the stack of currently open parentheses/brackets, as well as which characters are permitted in the current position, considering the previously recognized character. This makes a lot of the aforementioned ambiguities go away, but we still have one major hurdle left: characters that are almost completely illegible due to noise.
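A context along those lines could be sketched as follows. The class and method names are invented for the example, and it only encodes the two rules spelled out above (closers must match the innermost opener, and `!` may not directly follow a closing character); the real thing would carry more rules.

```python
from typing import List, Optional, Set

CLOSER_FOR = {"(": ")", "[": "]"}


class Context:
    """Tracks open parentheses/brackets and the previously accepted character."""

    def __init__(self) -> None:
        self.open_stack: List[str] = []
        self.previous: Optional[str] = None

    def permitted(self) -> Set[str]:
        """Characters that may legally appear at the current position."""
        allowed = {"(", "[", "+", "!"}
        if self.open_stack:
            # Only the closer matching the innermost opener is possible.
            allowed.add(CLOSER_FOR[self.open_stack[-1]])
        if self.previous in (")", "]"):
            # A "!" here would be in postfix position.
            allowed.discard("!")
        return allowed

    def narrow(self, candidates: List[str]) -> List[str]:
        """Drop the candidates that are impossible in this position."""
        ok = self.permitted()
        return [c for c in candidates if c in ok]

    def accept(self, char: str) -> None:
        """Record a recognized character and update the stack."""
        if char not in self.permitted():
            raise ValueError(f"{char!r} is not permitted here")
        if char in CLOSER_FOR:
            self.open_stack.append(char)
        elif char in (")", "]"):
            self.open_stack.pop()
        self.previous = char
```

With this in place, a glyph that matches both the `)` and the `]` predicate is no longer ambiguous as long as we know which opener is on top of the stack.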
Fortunately, the cases where the method described up until now can't find a match seem to be relatively limited: maybe 10 characters out of a total 33,000 characters per voucher code, which could be further reduced by being a bit more meticulous about the character predicates. Further tuning the predicates is a chore though, and since we're only interested in getting those codes, not making the process of getting them 100% automated, general, and foolproof, 10 instances of human interaction is fine.
This part is actually completely trivial: whenever a character is found that doesn't match any character predicate, or that matches more than one, display all information we have about the character and ask the user to identify it for us. To make this process easier, we extend the context with a trace of the most recent characters recognized, allowing the user to have a look at the program so far before making a decision.
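As a sketch, the fallback might look like the following. The names and the length of the trace shown to the user are assumptions for the example, and the question is passed in as a callback (`ask`) so the logic can be tried out without a human at the keyboard.

```python
from typing import Callable, List

AskFn = Callable[[List[str], str], str]


def resolve(candidates: List[str], trace: List[str], ask: AskFn,
            trace_len: int = 40) -> str:
    """Pick the character: automatically if unambiguous, else ask the user.

    `trace` holds the characters recognized so far and is extended in place.
    """
    if len(candidates) == 1:
        answer = candidates[0]
    else:
        # Zero or several matches: show the most recent output and ask.
        recent = "".join(trace[-trace_len:])
        answer = ask(candidates, recent)
    trace.append(answer)
    return answer


def ask_on_terminal(candidates: List[str], recent: str) -> str:
    """Interactive prompt; one possible `ask` implementation."""
    print("Program so far: ..." + recent)
    print("Possible matches:", " ".join(candidates) or "none")
    return input("Which character is this? ").strip()
```

In a real run you would also want to show the picture of the offending character itself, since that's what the user actually has to look at.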
Implementing this method in the horribly ugly and disorganized jsfuck-ocr program, I was able to easily extract the voucher codes from two separate jsfucked printouts.
If you ever find yourself in a similar situation, I hope this post helped you do the same.