|InterJournal Complex Systems, 35
|Manuscript Number: |
Submission Date: 963011
|Entropy and compressibility of symbol sequences|
Category: Brief Article
The purpose of this paper is to investigate long-range correlations in symbol sequences using methods of statistical physics and nonlinear dynamics. Beside the principal interest in the analysis of correlations and fluctuations comprising many letters, our main aim is related here to the problem of sequence compression. In spite of the great progress in this field achieved in the work of Shannon, Fano, Huffman, Lempel, Ziv and others many questions still remain open. In particular one must note that since the basic work by Lempel and Ziv the improvement of the standard compression algorithms is rather slow not exceeding a few percent per decade. One the other hand several experts expressed the idea that the long range correlations, which clearly exist in texts, computer programs etc. are not sufficiently taken into account by the standard algorithms. Thus, our interest in compressibility is twofold:
First, the higher-order Shannon entropies are calculated. For repeat sequences analytic estimates are derived, which apply in some approximation to DNA sequences too. For symbolic strings obtained from special nonlinear maps and for several long texts a characteristic root law for the entropy scaling is detected. Then the compressibilities are estimated by using grammar representations and several standard computing algorithms. In particular we use the Lempel-Ziv compression algorithm. Further the mean square fluctuations of the composition with respect to the letter content and several characteristic scaling exponents are calculated. We show finally that all these measures are able to detect long-range correlations. However, as demonstrated by shuffling experiments, different measuring methods operate on different length scales. The algorithms based on entropy or compressibility operate mainly on the word and sentence level. The characteristic scaling exponents reflect mainly the long-wave fluctuations of the composition which may comprise a few hundreds or thousands of letters.
|Submit referee report/comment|