Estimating and Comparing Entropies Across Written Natural Languages Using PPM Compression

We extend previous work measuring the entropy of English to seven additional written natural languages: Arabic, Chinese, French, Japanese, Korean, Russian, and Spanish. We observe that translations of the same document compress to approximately the same size, even though their uncompressed sizes vary widely. Because the output size of a strong compressor such as PPM approximates a text's total information content, this provides further evidence for the widely held linguistic hypothesis that different natural languages have the same descriptive capacity. It also suggests a possible tool for identifying poor machine translations.
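The measurement behind this abstract can be sketched in a few lines of Python. Since the standard library ships no PPM implementation, the sketch below substitutes lzma (an LZMA compressor, not PPM) as a stand-in general-purpose compressor, and the file names in TRANSLATIONS are hypothetical placeholders for a parallel corpus, not the paper's actual data set.

    import lzma

    # Hypothetical parallel corpus: paths to translations of one document.
    # These file names are illustrative placeholders only.
    TRANSLATIONS = {
        "Arabic": "doc_ar.txt",
        "Chinese": "doc_zh.txt",
        "English": "doc_en.txt",
        "French": "doc_fr.txt",
    }

    def entropy_estimate(path: str) -> tuple[int, int, float]:
        """Return (raw bytes, compressed bytes, bits per character).

        A compressor's output size is an upper bound on a text's
        information content, so compressed bits divided by character
        count gives a per-character entropy estimate.
        """
        with open(path, "rb") as f:
            raw = f.read()
        compressed = lzma.compress(raw, preset=9)
        n_chars = len(raw.decode("utf-8"))
        return len(raw), len(compressed), 8 * len(compressed) / n_chars

    for lang, path in sorted(TRANSLATIONS.items()):
        raw_size, comp_size, bpc = entropy_estimate(path)
        print(f"{lang:10s} raw={raw_size:8d}B  "
              f"compressed={comp_size:8d}B  {bpc:.2f} bits/char")

Under the hypothesis above, the bits-per-character figures should differ across languages (a Chinese character carries more information than a Latin letter), while the absolute compressed sizes of the translations should come out approximately equal.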