Cool stuff. I remember that I did some similar research back then.
You limited your research to 7 different pulse lengths. Depending on the shortest pulse and the spacing, this may not be the optimal solution. E.g. for the last, conservative approach, I am pretty sure that a lower number of pulse lengths would give a better result.
Also, what sample data did you base your pattern distribution on? E.g. uncompressed data must give pretty different values depending on the content, and very different ones than compressed data does. For the latter, all bits and bit combinations should have about the same probability.
With random data, shouldn't all the probabilities be 0.5 for single bits and 0.25 for two-bit combinations? How did you get to these different values? I am sure I am missing something (again).
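(For what it's worth, that intuition does hold for genuinely random or well-compressed data: every n-bit pattern occurs with probability 0.5^n, so 0.5 per single bit and 0.25 per two-bit pattern. A quick sanity check in Python; the 100000-bit sample size is arbitrary:

import random
from collections import Counter

# Empirical check on random bits: each single-bit value should come out near 0.5
# and each two-bit pattern near 0.25.
N = 100_000
bits = "".join(random.choice("01") for _ in range(N))

for length in (1, 2):
    chunks = Counter(bits[i:i + length] for i in range(0, N - length + 1, length))
    total = sum(chunks.values())
    for value, count in sorted(chunks.items()):
        print(f"{value}: {count / total:.3f}")

The differing values in the tables that follow appear to be exactly this: 0.5 raised to the length of each bit pattern.)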
Cycles|Bits| Prob.| Cycles per bit and probability
------+----+------+-------------------------------
  100 |  0 | 0.5  | 50.00
  150 | 10 | 0.25 | 18.75
  200 | 11 | 0.25 | 25.00
------+----+------+------
Total: 93.75
Cycles|Bits | Prob. | Cycles per bit and probability
------+-----+-------+-------------------------------
  167 |  11 | 0.25  | 20.875
  217 |  01 | 0.25  | 27.125
  267 | 101 | 0.125 | 11.125
  317 | 100 | 0.125 | 13.208
  367 | 000 | 0.125 | 15.292
  417 |0011 | 0.0625|  6.516
  467 |0010 | 0.0625|  7.297
------+-----+-------+--------
Total: 101.4375
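To make the arithmetic behind the two tables explicit: each row is cycles * probability / number of bits, with the probability being 0.5^len(bits), and the totals are the plain column sums (whether that simple sum is the right figure of merit is exactly what the weighting discussion further down is about). A small sketch that reproduces the rows, with durations and probabilities copied from the tables:

# durations, bit patterns and probabilities copied from the tables above
three_pulse = [(100, "0", 0.5), (150, "10", 0.25), (200, "11", 0.25)]
seven_pulse = [(167, "11", 0.25), (217, "01", 0.25), (267, "101", 0.125),
               (317, "100", 0.125), (367, "000", 0.125),
               (417, "0011", 0.0625), (467, "0010", 0.0625)]

def table_total(rows):
    total = 0.0
    for cycles, pattern, prob in rows:
        per_row = cycles * prob / len(pattern)   # the "Cycles per bit and probability" column
        print(f"{cycles:>4} | {pattern:>4} | {prob:<6} | {per_row:.3f}")
        total += per_row
    return total

print("total:", table_total(three_pulse))   # 93.75
print("total:", table_total(seven_pulse))   # ~101.4375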
from math import log

def log2(x):
    return log(x) / log(2)

def cycles_per_bit_given_arithmetic_code(code):
    # code: list of (pulse duration in cycles, symbol probability)
    den = num = 0
    for dur, p in code:
        bits = -log2(p)      # information content of the symbol in bits
        weight = p * bits    # this is critical: weight each symbol by the bits it consumes
        num += weight * dur / bits
        den += weight
    cycles_per_bit = num / den
    return cycles_per_bit

def hc_to_ac(hc_code):
    # convert a huffman code to corresponding arithmetic code
    return [(d, 0.5 ** len(s)) for d, s in hc_code]

hccode = [
    (100, '0'),
    (150, '10'),
    (200, '11'),
]

print(cycles_per_bit_given_arithmetic_code(hc_to_ac(hccode)))
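Run as-is, that final print gives ~91.67 cycles per bit for the 100/150/200 code, i.e. the weighted average worked out in the table below rather than the 93.75 from the plain column sum above.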
+------+----+------+---------+-------+-------+
|Cycles|Bits| Prob.| w = p*b |  c/b  | c/b*w |   (b = len(Bits))
+------+----+------+---------+-------+-------+
|  100 |  0 | 0.50 |   0.5   | 100.0 |  50.0 |
|  150 | 10 | 0.25 |   0.5   |  75.0 |  37.5 |
|  200 | 11 | 0.25 |   0.5   | 100.0 |  50.0 |
+------+----+------+---------+-------+-------+
total weighted rates = 137.5
total weights        = 1.5
weighted average     = 91.667
You're making the same mistake I was a few days ago :)
Symbols that output longer bit sequences consume more of the original file when it's being converted to a pulse stream, so they need to be weighted accordingly when you're computing the expected cycles per bit.
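Put as a formula: if symbol i takes d_i cycles, encodes b_i bits, and occurs with probability p_i = 0.5^b_i in random data, the expected rate is sum(p_i * d_i) / sum(p_i * b_i) cycles per bit. For the 100/150/200 code that is 137.5 / 1.5, roughly 91.67, which is exactly what the weighted table and the script above arrive at.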
...There are lots of factors we can play with:
1. short pulse length
2. pulse gap
3. number of pulse lengths (a result of 1. + 2.)
4. ???
And what is causing the inaccuracies?
- the varying tape speed (how much? +/-5%?) (increasing pulse gaps can cope with that)
- frequencies? E.g. high frequencies have a lower amplitude than low ones, causing less reliable reads (then we would need decreasing pulse gaps to handle that)
- enthusi's graph shows some intervals of heavy distortion; what is causing those?
- varying frequencies? (e.g. a low frequency followed by a high one gives less accurate results than two high ones)
- aging?
...
I just encoded a sequence of 1000 random bits with the [0,10,11]=>[100,150,200] encoding...
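For reference, a minimal sketch of that kind of encoding run; the 1000-bit length and the 0/10/11 => 100/150/200 mapping are from the post above, while the details (greedy parsing, dropping an incomplete trailing symbol) are just one way to do it:

import random

# prefix code: bit pattern -> pulse duration in cycles (mapping from the post above)
CODE = {"0": 100, "10": 150, "11": 200}

def encode(bits, code=CODE):
    # split the bit string into codewords (unambiguous, since the code is prefix-free)
    # and emit one pulse duration per codeword
    pulses = []
    i = 0
    while i < len(bits):
        pattern = next((p for p in code if bits.startswith(p, i)), None)
        if pattern is None:
            break  # a lone trailing '1' does not complete a codeword; a real encoder would pad
        pulses.append(code[pattern])
        i += len(pattern)
    return pulses, i

bits = "".join(random.choice("01") for _ in range(1000))
pulses, used = encode(bits)
print(f"{len(pulses)} pulses, {sum(pulses)} cycles, {sum(pulses) / used:.2f} cycles per encoded bit")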
Could you show something for 4 pulse lengths, please? I did a little test in BASIC V2 with 2 kbit of random data. "1/00/01/111" was always longer than "00/01/10/11", and "0/1/000/111" was even worse.
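Not a hardware answer, but here is how such a comparison can be sketched on paper with the same weighting idea as the Python snippet earlier in the thread. Two caveats: the 100/150/200/250 cycle durations are placeholders (the real ones depend on the shortest pulse and the gap), and "1/00/01/111" and "0/1/000/111" are not prefix-free ("1" is a prefix of "111"), so the sketch uses "0/10/110/111" as the unbalanced candidate:

def expected_cycles_per_bit(code):
    # code: list of (pulse duration in cycles, bit pattern) for a prefix-free code;
    # with random data each pattern occurs with probability 0.5**len(pattern)
    cycles_per_symbol = sum(0.5 ** len(bits) * dur for dur, bits in code)
    bits_per_symbol = sum(0.5 ** len(bits) * len(bits) for dur, bits in code)
    return cycles_per_symbol / bits_per_symbol

# pulse durations are placeholders, not measured values
candidates = {
    "00/01/10/11":  [(100, "00"), (150, "01"), (200, "10"), (250, "11")],
    "0/10/110/111": [(100, "0"), (150, "10"), (200, "110"), (250, "111")],
}
for name, code in candidates.items():
    print(f"{name}: {expected_cycles_per_bit(code):.2f} cycles per bit")

Which candidate wins depends entirely on the durations that are actually achievable for the longer pulses, so the placeholder numbers say nothing about real tape behaviour.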