During a Zoom trivia session (because you can’t let the virus win), Benford’s law came up.

This is the observation that the leading digits in many measurements in nature aren’t distributed evenly: there are many more 1s than 2s, and the digits grow less prevalent as you go on.
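For reference, the law is usually stated as: the share of leading digit d is log10(1 + 1/d). A minimal sketch that tabulates it (the function name `benford_share` is just an illustrative choice):

```python
import math

def benford_share(digit: int) -> float:
    """Expected share of leading digit `digit` (1..9) under Benford's law."""
    return math.log10(1 + 1 / digit)

# Tabulate the expected share of each leading digit as a percentage.
shares = {d: benford_share(d) for d in range(1, 10)}
for d, p in shares.items():
    print(d, f"{100 * p:.1f}%")
```

The shares run from about 30.1% for a leading 1 down to about 4.6% for a leading 9.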

I mentioned that the cool thing is that this holds up regardless of the unit of measurement used. If you measure the length of rivers in meters, inches, or furlongs, you should get approximately the same distribution of leading digits.

A friend said this makes no sense: if we measure in units of half a meter, every river whose length started with one now starts with two, and therefore there will be more twos than ones. The flaw in his logic is that only about half the numbers starting with one will start with two (those whose second digit is less than 5); the other half will start with three. Additionally, any number that began with a digit greater than or equal to 5 will now start with one.
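To make the counter-argument concrete, here is a small sketch that doubles every three-digit integer starting with a given digit and collects the leading digits that come out. The helper names and the choice of three-digit integers are mine, picked only to keep the arithmetic exact.

```python
def leading_digit(n: int) -> int:
    """Most significant decimal digit of a positive integer."""
    return int(str(n)[0])

def doubled_leading_digits(first_digit: int, width: int = 3) -> list:
    """Leading digits reachable by doubling any `width`-digit integer
    whose first digit is `first_digit`."""
    low = first_digit * 10 ** (width - 1)
    high = (first_digit + 1) * 10 ** (width - 1)
    return sorted({leading_digit(2 * n) for n in range(low, high)})

# Map each original leading digit to the leading digits it can become.
doubling_map = {d: doubled_leading_digits(d) for d in range(1, 10)}
for d, targets in doubling_map.items():
    print(d, "->", targets)
```

A leading 1 lands on 2 or 3, a leading 2 on 4 or 5, and so on, while everything from 5 up collapses onto 1. That is how the ones get replenished.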

In fact, if we take the expected distribution given by Wikipedia and see how the distribution changes when the measurement unit changes, we can see that it stays about the same^{[1]}.
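One way to sanity-check this is a quick simulation. The sketch below assumes lengths that are log-uniform over six orders of magnitude, which is a made-up stand-in for real river data, and compares the leading-digit shares before and after multiplying by a few factors. All names in it are illustrative.

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read off scientific notation."""
    return int(f"{x:e}"[0])

def digit_shares(values) -> dict:
    """Fraction of values with each leading digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

rng = random.Random(0)  # fixed seed so reruns give the same numbers
# Assumption: synthetic "lengths", log-uniform between 1 and 10**6.
lengths = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

# Multiplier 1 is the original unit; 2, 3 and 8 mimic smaller units.
shares_by_multiplier = {
    k: digit_shares([k * x for x in lengths]) for k in (1, 2, 3, 8)
}
for k, shares in shares_by_multiplier.items():
    print(k, [round(100 * shares[d], 1) for d in range(1, 10)])
```

Under that assumption, every row should come out close to the Benford shares, whichever multiplier is used.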

I use the term *multiplier* here for the factor applied to the resulting measurement: if the measurement unit is ⅛ of the original, the resulting value will be **multiplied** by 8.

Since each leading digit feeds one or more leading digits when the unit of measurement is changed to a fraction of the original, I thought it would be interesting^{[2]} to plot this in a Sankey diagram.

I threw together a short Python script^{[3]} to create the data:

    grid = [None]*10
    grid[0] = [0]*10
    grid[1] = [0, 30.1, 17.6, 12.5, 9.7, 7.9, 6.7, 5.8, 5.1, 4.6]  # Seed data from Wikipedia
    epsilon = 0.001  # avoid floating point hassle


    def msd(n) -> int:  # Most Significant Digit
        while True:
            tenth = n // 10
            if tenth == 0:
                return int(n)
            n = tenth


    for i in range(1, 10):
        print(f"All [{grid[1][i]}] 1_{i}")

    try:
        for multiplier in range(2, 10):  # 1 is the seed data
            grid[multiplier] = [0]*10
            for source_digit in range(1, 10):
                for numerator in range(multiplier):
                    digit = msd(multiplier * (source_digit + numerator/multiplier) + epsilon)
                    # Use the previous multiplier's values (this makes sense for Sankey)
                    gain = grid[multiplier-1][source_digit] / multiplier
                    grid[multiplier][digit] += gain
                    print(f"{multiplier-1}_{source_digit} [{gain}] {multiplier}_{digit}")
        for i in range(1, 10):  # Print commented out final data
            print(f"' {i}: {grid[i]} ({sum(grid[i])})")
    except Exception as ex:
        print(f"Got exception {ex}")

After plugging the output into sankeymatic.com and fiddling with the order of the nodes^{[4]}, I was able to get something that may be somewhat informative.

**Legend:** **N**_*M*, where N is the multiplication factor and M is the leading digit

If you’re looking at this and have no idea what’s going on, consider measuring all the rivers in the world in meters, then splitting them according to the leading digit. This is the idea behind Benford’s law, and it’s the leftmost part of the diagram (**All** to **1_N**). Now consider measuring in units of half a meter. Half the rivers that *had* a leading digit 1 will now have a leading digit 2 and half will have 3, so from **1_1** we split evenly to **2_2** and **2_3**. Similarly, all the rivers that were mapped to the **1_5**, **1_6**, **1_7**, **1_8** and **1_9** groups will now be mapped to **2_1**.
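The even split is a simplification. As a rough check, the sketch below computes the split exactly under the assumption that the logarithm of the leading part of each length is uniformly distributed, which is the idealized model behind Benford’s law rather than anything measured. The function `exact_flow` is a hypothetical helper of mine.

```python
import math

def exact_flow(source_digit: int, target_digit: int, multiplier: int) -> float:
    """Share of all values that move from source_digit to target_digit
    when multiplied by `multiplier` (2..9).

    Assumes log10 of the mantissa is uniform on [0, 1).
    """
    shift = math.log10(multiplier)
    lo = math.log10(source_digit) + shift
    hi = math.log10(source_digit + 1) + shift
    flow = 0.0
    for decade in (0, 1):  # the shifted interval can spill into the next decade
        t_lo = math.log10(target_digit) + decade
        t_hi = math.log10(target_digit + 1) + decade
        flow += max(0.0, min(hi, t_hi) - max(lo, t_lo))
    return flow

to_two = exact_flow(1, 2, 2)
to_three = exact_flow(1, 3, 2)
print(f"1 -> 2: {100 * to_two:.1f}%, 1 -> 3: {100 * to_three:.1f}%")
```

Under this model the 30.1% that start with 1 split into roughly 17.6% and 12.5% rather than two equal halves, and those are exactly the Benford shares for 2 and 3.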

Rinse and repeat.

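[1] I was half expecting the numbers to be exactly the same. The small drift most likely comes from the even splits in the script, which are only an approximation; beyond that, Benford’s law is more of an observation than an actual law. ↩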

[2] I was wrong. ↩

[3] I say I “*threw together*” to make you underestimate how much time I wasted on it. ↩

[4] Sankeymatic was trying to be helpful and minimize the crossing of the streams. ↩