I attempted the impossible. I dared the gods of mathematics and took on Dr Dobb’s Million Random Digit Challenge. I was even the subject of a small office gambling ring, with the odds set at 100:1 against. Somehow, this proved oddly motivating.
Infinity:1 would have been more accurate.
The essence of Dr Dobb’s challenge is to find a way to compress a blob of random numbers by even just a single byte.
Coming from a coding background, I just knew that somewhere in the infinite set of possible algorithms existed a way to describe this seemingly simple block of digits more succinctly. Cue showers of confetti and ticker tape parades.
It is just data after all, right?
I was not alone in this perception. I spoke with colleagues, friends, family, and online strangers. A common response was that surely we could just look at the data a different way.
Ideas ranged from the abstract:
What if instead of looking at what’s there, we look at what’s not there?
To the creative:
What if you treat it as an image and compress it with JPEG? Or as audio with FLAC?
To outright face-palmery:
Surely the numbers are just ASCII; why not convert them to binary?
Looking back over the past year, I get some amusement from my first naive attempts, some of which make the above suggestions seem somehow plausible. The code was written in scrapbook fashion, cobbling together far-flung ideas. It isn’t pretty, resembling the ravings of a madman more than the elegance of a craftsman.
Conceptually, the various dead-ends are painfully obvious in my seemingly desperate bid to find something new in this well-trodden landscape. There are threads that I started and ended on the same day.
And yet despite the ultimate failure of this fool’s errand, somehow in the process I discovered a whole lot more than I imagined and found so many concepts from school and university clicking into place.
Anatomy of a Failure
In “Divide by two”, I figured I could read the dataset as a decimal number and repeatedly divide by two. By halving the number many times, I’d eventually reach one. To get the original number back later, I’d just need to store the number of division operations and multiply by two that many times.
However there was one elementary flaw with this brilliant plan — the result of dividing an even number by two is sometimes odd. 30 / 2 = 15, for example. But again that was easily dealt with: I’d just add one whenever the current value was odd, and then divide by two.
The resulting function:
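In JavaScript, the step amounts to something like this (a reconstruction from the description above, not the original fiddle):

```javascript
// One step: halve an even n; for an odd n, add one first so the
// division lands on a whole number.
function halveStep(n) {
  return n % 2 === 0 ? n / 2 : (n + 1) / 2;
}
```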
All I needed to complete the algorithm was a playbook of operations and the order in which they were applied. The format I came up with was a binary dictionary where a zero meant “The number was even, so I just divided by 2” and a one meant “The number was odd, so I added 1 first and then divided by 2”.
If we were to start at 1000 as an example, the algorithm progressed like this:
1000 => 0 (even, just divide by 2)
500 => 0 (even again…)
250 => 0 (again…)
125 => 1 (Ooh! An odd! Let’s add one and…)
63 => 1 (Half of 126 is also odd. Okay then…)
32 => 0 (Ah, back in the calm, peaceful land of the evens.)
16 => 0
8 => 0
4 => 0
2 => 0
1 => end (We made it! No need to store an operation.)
This gave me a dictionary of 0001100000.
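Putting that walkthrough into code (a sketch of the idea, not my original script):

```javascript
// Halve n repeatedly, recording a 0 for each even value and a 1
// for each odd value (where 1 is added before halving).
function encode(n) {
  let dictionary = "";
  while (n > 1) {
    if (n % 2 === 0) {
      dictionary += "0";
      n = n / 2;
    } else {
      dictionary += "1";
      n = (n + 1) / 2;
    }
  }
  return dictionary;
}

console.log(encode(1000)); // "0001100000"
```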
The algorithm worked, kind of. I could easily read the dictionary, play it back in reverse and recover the original number.
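The reverse playback can be sketched like this: read the dictionary from the last operation back to the first, undoing each step.

```javascript
// Rebuild the original number: starting from 1, read the
// dictionary backwards, doubling at every step and subtracting
// the 1 that was added whenever the recorded bit is a 1.
function decode(dictionary) {
  let n = 1;
  for (let j = dictionary.length - 1; j >= 0; j--) {
    n = n * 2;
    if (dictionary[j] === "1") n = n - 1;
  }
  return n;
}

console.log(decode("0001100000")); // 1000
```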
It also didn’t result in a larger file, but here’s the kicker:
The dictionary was exactly the same length to the bit as the original file.
After some more experimentation, this appeared to hold true for every number I tried. If you’d like to see for yourself, play with this jsfiddle: change the value of the first variable (nInitial) and hit the play button to re-run the algorithm (be careful, large numbers might cause your browser to hang).
Perhaps this relationship to a base-2 representation is a commonly understood feature of division; without further research I can only conjecture that the tally of iterations of this function tracks the count of binary digits in the initial value of n. (Both branches replace n with ⌈n/2⌉, so the iteration count works out to ⌈log₂ n⌉, which matches the binary length for every starting value except an exact power of two, where it comes up one short.)
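A brute-force check over the first few thousand starting values (my sketch, not the jsfiddle):

```javascript
// Count the halving steps needed to reach 1, then compare the
// tally with the number's binary length.
function stoppingTime(n) {
  let count = 0;
  while (n > 1) {
    n = n % 2 === 0 ? n / 2 : (n + 1) / 2; // both branches are ceil(n / 2)
    count++;
  }
  return count;
}

for (let n = 2; n <= 5000; n++) {
  const bits = n.toString(2).length;
  const steps = stoppingTime(n);
  if (steps !== bits) {
    // The only starting values that disagree are exact powers of
    // two, which finish one step early.
    console.log(`${n}: ${steps} steps, ${bits} bits`);
  }
}
```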
Regardless, ruminating upon this odd relationship led me to question many of the assumptions I had about mathematics and it shook my faith in my ability to solve any problem with code.
For the first time, although I’d heard it said many times, I seriously entertained the prospect that some problems might actually never have an ideal solution. This made me appreciate and respect that semester on algorithmic complexity all the more.
But I Wasn’t Quite Convinced
By this point I had realised that in order to solve the compression problem, I’d have to tick three major boxes all at once:
- Reduction — A formula must be found that represents the original data with less information. There are a couple of short repeated sequences in the data when read as 8-bit words, so there’s a slim chance to squeeze a couple of bytes out. The issue was finding a way to represent those sequences using less data. By searching the block for short sequences it doesn’t contain, then using those as replacement tokens, I was able to reduce the data ever so slightly. If memory serves, I replaced a few instances of three-word sequences with two-byte tokens. Unfortunately I could not represent both the data and the token dictionary in a smaller space. And that’s before even dreaming of making the file self-assembling.
- Representation — The artifacts of the formula must be stored in an organised structure that, when combined with the transformed source data, results in a total size reduction. Such a structure is needed so the algorithm can tell the difference between data and instructions, and knows how to reassemble the original data. Having a structure that did not increase file size was the main blocker in my attempts.
- Reversibility — This one is more obvious, but needs to be included for completeness. The algorithm’s artifacts and representation need to be able to reconstruct the original data. The original challenge I read required that the decoding software itself be included in the size calculation, as a means of defeating crafty folk who’d store chunks of the original data in their code. The general solution does not stipulate this requirement, since it applies to any arbitrary random data.
TL;DR: Shrink the data so that it and its wrapping are smaller than the original, and be able to unwrap and expand it again.
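The token-replacement idea from the Reduction step can be sketched like this (the helper name is mine; my original code is long gone): scan for a two-byte pair that never occurs in the data, which can then safely stand in for a longer repeated sequence.

```javascript
// Find a 2-byte pair that never occurs in the data; such a pair
// can be used as a replacement token for a longer repeated
// sequence, since it can't be confused with real data.
function findAbsentPair(data) {
  const seen = new Set();
  for (let i = 0; i + 1 < data.length; i++) {
    seen.add(data[i] * 256 + data[i + 1]);
  }
  for (let code = 0; code < 65536; code++) {
    if (!seen.has(code)) return [code >> 8, code & 0xff];
  }
  return null; // every possible pair occurs somewhere: no token free
}
```

The catch is exactly the Representation problem above: the token dictionary itself has to be stored, and for random data its overhead swallows whatever the substitution saves.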
At most I was only ever able to satisfy two of the above requirements. Imagine my delight and surprise to discover I’d compressed the file massively, only to be swiftly knocked back down by the heavy realisation that I’d thrown away half the data due to a logic error. Yes, dividing by 3 seems like a great idea, but if the code isn’t taking into account fractional numbers with undefined parity, you’re going to have a bad time.
A few months into my hobby, I started playing around with Shannon entropy, and that’s when I truly began to realise that, barring a revolution in information theory, I was unlikely to find a solution to the general case. Even if I managed to compress the original file, whatever specific trick I found would almost certainly fail on any other set of data.
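Shannon entropy makes that verdict concrete. For a byte stream it is easy to estimate (a sketch):

```javascript
// Estimate the Shannon entropy of a byte array, in bits per byte.
// Uniformly random bytes approach the maximum of 8 bits per byte,
// leaving a lossless compressor essentially nothing to remove.
function entropyBitsPerByte(bytes) {
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}
```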
At an impasse, I sat on the problem for a few days.
Those days stretched to weeks.
Weeks became months.
Now it’s a year after I accepted the challenge, the bookies have closed, and I had to swallow some pride. But I also think that’s a small price to pay to learn some valuable lessons and gain a new respect and passion for a field I’d long disregarded.
I’ve also found a brand new problem to work on.
Perhaps I should have learned my lesson the first time, but there’s something about the allure of the intractable. Maybe I am rebelling against all those people who have ever told me “it can’t be done”. I just can’t help it.
I’m addicted to the impossible.
Onwards and Upwards!
While researching the strange coincidence between the operation count and binary length for the “divide by 2” approach, I stumbled blindly upon the Collatz Conjecture.
While not exactly the same as the function I used, it follows a similar pattern with a two-operation structure based on integer parity (oddness or evenness). It states that for any positive integer: if it’s even, simply divide by two; if it’s odd, multiply by 3 and add 1.
Eventually, the conjecture states, you’ll end up at an infinitely repeating sequence of 4, 2 then 1.
The Collatz function looks like this:
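In JavaScript, something like:

```javascript
// Collatz step: halve an even n; triple an odd n and add one.
function collatz(n) {
  return n % 2 === 0 ? n / 2 : 3 * n + 1;
}
```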
Compare the Collatz function with my failed experiment:
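Side by side as one-liners (my reconstruction), the resemblance is hard to miss:

```javascript
// Collatz:       even -> n / 2, odd -> 3n + 1
const collatzStep = (n) => (n % 2 === 0 ? n / 2 : 3 * n + 1);

// My experiment: even -> n / 2, odd -> (n + 1) / 2
const halvingStep = (n) => (n % 2 === 0 ? n / 2 : (n + 1) / 2);
```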
Both are similar in the following ways:
- They perform different operations based on the parity of n
- They divide by two if the number is even
- They also add one if the number is odd
- If repeated iteratively, they both end up at 1 for any positive integer (assuming the conjecture holds true of course)
- Both end in an infinite loop. While at first glance my function simply stops at 1, repeated applications do result in a loop of sorts; a sequence with a single member, as (1 + 1) / 2 = 1. The Collatz function’s end-loop has 3 members.
The differences are:
- I do not multiply by 3 before adding one.
- My function’s “total stopping time” (or the count of iterations prior to the first “loop”) is predictable: simply count the digits in the initial number’s binary representation. The Collatz function’s “total stopping time” is not yet solved; there is no proof (yet?) and no counterexample.
Given the similarities in our functions and the relationship between binary length of decimal numbers and recursive division operations, I feel compelled to investigate.
Maybe there is a relationship, some bridge to cross the divide.
My gut instinct, which has never failed me in the past (right?), tells me that the function I used and the Collatz function are two specific forms of a more abstract function. Finding that abstraction seems a fairly simple task of variable substitution.
Here’s the Collatz function:
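As a JavaScript sketch:

```javascript
// Collatz step: halve an even n; triple an odd n and add one.
function collatz(n) {
  return n % 2 === 0 ? n / 2 : 3 * n + 1;
}
```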
If we add a new parameter, “m” or “multiplier”:
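A sketch with the multiplier pulled out as a parameter (the function name is mine):

```javascript
// Same shape as the Collatz step, but with the multiplier m
// exposed as a parameter instead of hard-coded to 3.
function stepWithMultiplier(n, m) {
  return n % 2 === 0 ? n / 2 : m * n + 1;
}
```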
It now acts just like the binary reduction function where m = 1, or the Collatz function where m = 3.
However we can go one step further and also open up the possibility of investigating the relationship between the divisor, “d”, and the increment, “i”, and their possible effects on stopping time:
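A sketch of the fully generalised step (names are mine):

```javascript
// Generalised step: divide by d when n is divisible by d,
// otherwise multiply by m and add i.
// m = 3, d = 2, i = 1 gives the Collatz step; m = 1, d = 2, i = 1
// gives the halving experiment (with the +1 and the /2 landing on
// consecutive iterations rather than in a single step).
function generalisedStep(n, m, d, i) {
  return n % d === 0 ? n / d : m * n + i;
}
```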
With this function, it is possible to explore across a number of dimensions to seek out relationships and potentially useful patterns.
But how can we make sense of this landscape?
Welcome to the Fourth Dimension
We added a number of dimensions. We’ve introduced the possibility for fractional results, and potentially shifted the scope for additional end-loops.
We can try and look for relationships between the dimensions and the resulting count of odd vs even values of n.
In total the dimensions I am interested in correlating are:
- Divisor value
- Increment value
- Multiplier value
- Number (initial)
- Parity (initial)
- Stopping time
- Odd count
- Even count
- Loop sequence length
- Loop offset
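To populate those dimensions for a given (m, d, i, n), each run can be measured in one pass. This is a sketch (names are mine); “parity” here generalises to divisibility by d, which only divides when it goes evenly, so the values stay whole.

```javascript
// Measure one run of the generalised function: iterate until a
// value repeats (we've entered a loop) or a step cap is reached,
// tallying "even" (divisible by d) vs "odd" steps along the way.
function measure(n, m, d, i, maxSteps = 1000) {
  const seen = new Map(); // value -> step at which it first appeared
  let odd = 0, even = 0, steps = 0;
  while (!seen.has(n) && steps < maxSteps) {
    seen.set(n, steps);
    if (n % d === 0) { even++; n = n / d; }
    else             { odd++;  n = m * n + i; }
    steps++;
  }
  const loopOffset = seen.has(n) ? seen.get(n) : -1;
  return {
    stoppingTime: loopOffset >= 0 ? loopOffset : steps, // steps before the first loop member
    oddCount: odd,
    evenCount: even,
    loopLength: loopOffset >= 0 ? steps - loopOffset : 0,
    loopOffset,
  };
}
```

Sweeping m, d, i and n over ranges then produces the rows that feed the visualisation.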
But, how do you even start to visualise things in 10d?
Thankfully, this is not a new problem or I would have been lost, perhaps trying to carve intricate balsa-wood models of the data in the dim light of my tool shed as the years flip by.
While struggling to imagine some kind of space on which 10 dimensions could be projected, I happened across Parallel Coordinates, a graphing technique used for efficient visualisation of many dimensions.
This video demonstrates its power and versatility:
That Brings us to Today
Having settled on a means to visualise and explore for relationships between dimensions, many late nights and cups of coffee later, I finally had my first result!
My axes are messed up, and the color gradient isn’t working. I’m definitely a long way from rotating my very own tesseract in interdimensional space-time.
Yet by rearranging the columns and brushing over the axes, I’m starting to see some interesting things.
Like what’s this all about?
It looks to me like a visual representation of a strong relationship between the multiplier and the overall odds and evens in the sequence.
This style of graphing will certainly take some getting used to, and I’d be interested to see how binary length relates to different multipliers and divisors. I may even try feeding some of this data to IBM’s Watson to see if it can glean any meaningful relationships.
Either way I’m hooked.
What did we Learn?
Thinking back to a year ago, I had no real clue. I thought I was good with maths and computers, but this was a whole new level of difficulty.
What started off as a simple wager made in jest became a year-long obsession.
By facing down those odds I now know I have even less of a clue than I thought I had. That is a great thing for me, because it means there is still so much left to explore!
Despite my lack of authority on basically anything, my message for others who’ve been put off by maths, or daunted by any prospect that feels too hard or too not “you”, is this: yes, it IS hard, and it will change you in some way, usually for the better. That’s both the challenge and the reward.
If this latest venture turns into another red herring I’d welcome that smelly old fish with open arms. Failure is essential to progress and dealing with it is like training a muscle to improve resilience, flexibility and strength.
This only leaves one question: I wonder what the next year will bring?