Experiments in Data Compression VIII (or "How Long is It") by John Kormylo E.L.F. Software I have never been very happy with the way LZA handles length codes. As shown in the Appendix of Report #5, the probability of finding matches longer than 40 bytes is close to zero. This means that it takes thousands of matches before the count array begins to reflect the true statistics. However, attempts to initialize the count array with a priori statistics have not worked well. One idea (stolen from ZIP, more or less) is to group several lengths together to equalize the a priori statistics for each group as well as the statistics within each group. The length would then require 2 codes. The first would select the group and the second would select a length from within that group. Since the counts used to generate the codes would be roughly equal, adaptation should be achieved far more rapidly. After adaptation, coding efficiency should be about the same as the current system, differing only by round-off errors and rescaling of the count arrays (if any). It seemed a good idea at this point to generate some really good a priori statistics, such as from all the files on an entire hard disk. The resulting statistics (shown in the following table) were somewhat surprising. Nearly half of the total (9292178) came from the one entry (length = 2). Lengths | Counts ---------+---------------------------------------------- 2 - 6 | 4325118 1617905 859488 511012 343780 7 - 11 | 236698 199665 140135 110728 82843 12 - 16 | 74012 65304 51805 40470 37401 17 - 21 | 29718 27220 25820 21753 16989 22 - 26 | 16375 20409 14590 13343 13938 27 - 31 | 13234 11118 10291 10264 10352 32 - 36 | 9677 8037 8219 11015 7829 37 - 41 | 6743 7696 7496 7193 6962 42 - 46 | 7115 6969 7467 7957 8416 47 - 51 | 7462 7346 6662 6131 6034 52 - 56 | 6554 7134 7152 6934 7870 57 - 61 | 7371 7499 7308 8562 11092 62 - 65 | 10851 13761 20357 65529 The following grouping was chosen in order to minimize the number of bits needed for adaptive compression for the above sample (derived in the Appendix). The optimization procedure was to first attempt to equalized the counts for each group, then compute the change in the objective function as lengths were swapped between groups. The process was repeated for different numbers of groups. Number of groups = 9 Bits saved using groups = 163.18 Lengths | Counts ---------+--------- 2 | 4325118 3 | 1617905 4 | 859488 5 | 678244 6 | 450862 7-10 | 783789 11-15 | 281958 16-26 | 208192 27-65 | 395659 Note that the first 5 "groups" contain a single length each, making a second code unnecessary. These 5 lengths comprise 82% of all codes generated, reducing the computational effort. The compression results for this method are shown in the following table as "Groups." | DEMONSTR.DOC | ELFBACK.PRG | WORDLIST.TXT | -----------+--------------+-------------+--------------+ Original | 84009 | 60823 | 1628291 | ZIP | 25134 | 32417 | 496876 | -----------+--------------+-------------+--------------+ Previous | 22179 | 32037 | 484390 | Groups | 22155 | 32007 | 484302 | -----------+--------------+-------------+--------------+ As expected the savings were small, since this primarily affects the compression at the beginning of files. The important point is that the improvement does not appear to be particularly data dependent. (The Appendices are formated for TeX.) \line{\hfil Appendix \hfil} \line{\hfil Adaptive Compression \hfil} \line{\hfil} From Shannon we know the minimum number of bits needed to code N characters from a set of M possible codes is given by $$ H = N \log_2 N - \sum_{i=1}^M n_i \log_2 n_i $$ where $n_i$ is the character count for each code. The number of bits needed for adaptive compression is given by $$ H_a = \log_2\left((N+M-1)! \over (M-1)!\right) - \sum_{i=1}^M \log_2 n_i! \qquad. $$ This is obtained by summing the bits needed for each character starting with $$ H_a(1) = \log_2 M - \log_2 1 = \log_2 M $$ through $$ H_a(N) = \log_2 (N+M-1) - \log_2 n_i $$ for whatever code $i$ comes last. No matter what order in which the characters arrive, the final sum will be the same. Now we divide the codes into L groups containing $M_i$ codes and $N_i$ characters each, where $$ N_i = \sum_{j=k_i}^{k_{i+1}-1} n_j $$ where $k_i = 1 + \sum_{j=1}^{i-1} M_i$ is the first code in each group. Note that $k_1 = 1$ and $k_{L+1}-1 = M$. The number of bits needed for the group codes is given by $$ H_0 = \log_2 \left({(N+L-1)!\over(L-1)!}\right) - \sum_{i=1}^L log_2 N_i! $$ and the number of bits needed to select one code from within a given group is given by $$ H_i = \log_2 \left({(N_i+M_i-1)!/(M_i-1)!}\right) - \sum_{j=k_i}^{k_{i+1}-1} \log_2 n_j! \qquad. $$ The total number of bits needed for adaptive compression by groups is therefore given by $$ \eqalign{ H_g &= H_0 + \sum_{i=1}^L H_i \cr &= \log_2 \left({(N+L-1)!\over(L-1)!}\right) + \sum_{i=1}^L \log_2 \left({(N_i+M_i-1)! \over(M_i-1)!\,N_i!}\right) - \sum_{i=1}^M \log_2 n_i! \qquad. } $$ Finally, subtracting the number of bits needed for normal adaptive compression gives us $$ \eqalign{ H_g - H_a &= \log_2 \left({(N+L-1)!\,(M-1)! \over(N+M-1)!\,(L-1)!}\right) + \sum_{i=1}^L \log_2 \left({(N_i+M_i-1)! \over(M_i-1)!\,N_i!}\right) \qquad . } $$ These quantities can be easily computed since almost all of the terms cancel out. \end