Zipf–Mandelbrot law

Zipf–Mandelbrot
Parameters: N \in \{1,2,3,\ldots\} (integer), q \in [0,\infty) (real), s > 0 (real)
Support: k \in \{1,2,\ldots,N\}
pmf: \frac{1/(k+q)^s}{H_{N,q,s}}
CDF: \frac{H_{k,q,s}}{H_{N,q,s}}
Mean: \frac{H_{N,q,s-1}}{H_{N,q,s}} - q
Mode: 1
Entropy: \frac{s}{H_{N,q,s}}\sum_{k=1}^N\frac{\ln(k+q)}{(k+q)^s} + \ln(H_{N,q,s})
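
The closed-form mean and entropy in the table above can be checked numerically against direct summation over the pmf. The following Python sketch does exactly that; the parameter values and the helper H are chosen purely for illustration and are not from any standard library:

    import math

    N, q, s = 50, 2.7, 1.5   # illustrative parameter values only

    def H(n, q, s):
        # generalized harmonic number H_{n,q,s} = sum_{i=1}^{n} 1/(i+q)^s
        return sum(1.0 / (i + q) ** s for i in range(1, n + 1))

    pmf = [(1.0 / (k + q) ** s) / H(N, q, s) for k in range(1, N + 1)]

    # mean: direct expectation vs. the closed form H_{N,q,s-1}/H_{N,q,s} - q
    mean_direct = sum(k * p for k, p in zip(range(1, N + 1), pmf))
    mean_closed = H(N, q, s - 1) / H(N, q, s) - q
    assert abs(mean_direct - mean_closed) < 1e-9

    # entropy: -sum p*ln(p) vs. the closed form from the table
    entropy_direct = -sum(p * math.log(p) for p in pmf)
    entropy_closed = (s / H(N, q, s)) * sum(math.log(k + q) / (k + q) ** s
                                            for k in range(1, N + 1)) + math.log(H(N, q, s))
    assert abs(entropy_direct - entropy_closed) < 1e-9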

In probability theory and statistics, the Zipf–Mandelbrot law is a discrete probability distribution. Also known as the Pareto–Zipf law, it is a power-law distribution on ranked data, named after the linguist George Kingsley Zipf, who suggested a simpler distribution called Zipf's law, and the mathematician Benoit Mandelbrot, who subsequently generalized it.

The probability mass function is given by:

f(k;N,q,s)=\frac{1/(k+q)^s}{H_{N,q,s}}

where H_{N,q,s} is given by:

H_{N,q,s}=\sum_{i=1}^N \frac{1}{(i+q)^s}

which may be thought of as a generalization of a harmonic number. In the formula, k is the rank of the data, and q and s are parameters of the distribution. In the limit as N approaches infinity, H_{N,q,s} becomes the Hurwitz zeta function \zeta(s,q+1). For finite N and q=0, the Zipf–Mandelbrot law becomes Zipf's law. For infinite N and q=0, it becomes the zeta distribution.
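
As a minimal sketch of how these definitions translate into computation (the function names below are purely illustrative, not part of any standard library), the pmf and CDF can be evaluated by direct summation:

    def gen_harmonic(n, q, s):
        # H_{n,q,s} = sum_{i=1}^{n} 1 / (i + q)^s
        return sum(1.0 / (i + q) ** s for i in range(1, n + 1))

    def zipf_mandelbrot_pmf(k, N, q, s):
        # f(k; N, q, s) = (1 / (k + q)^s) / H_{N,q,s}
        return (1.0 / (k + q) ** s) / gen_harmonic(N, q, s)

    def zipf_mandelbrot_cdf(k, N, q, s):
        # CDF at rank k is H_{k,q,s} / H_{N,q,s}
        return gen_harmonic(k, q, s) / gen_harmonic(N, q, s)

    # With q = 0 the pmf is proportional to 1/k^s, i.e. Zipf's law.
    probs = [zipf_mandelbrot_pmf(k, N=10, q=0.0, s=1.0) for k in range(1, 11)]
    assert abs(sum(probs) - 1.0) < 1e-12   # the pmf sums to 1 over the support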

Applications

The distribution of words ranked by their frequency in a random text corpus is approximated by a power-law distribution, known as Zipf's law.

If one plots the frequency rank of words in a moderately sized corpus of text against the number of occurrences (or actual frequencies), one obtains a power-law distribution with exponent close to one (but see Powers, 1998 and Gelbukh & Sidorov, 2001). Zipf's law implicitly assumes a fixed vocabulary size, which matters because the harmonic-type normalizing sum with s=1 does not converge as the vocabulary grows, while the Zipf–Mandelbrot generalization with s>1 does. Furthermore, there is evidence that the closed class of functional words that define a language obeys a Zipf–Mandelbrot distribution with different parameters from the open classes of contentive words that vary by topic, field and register.[1]
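
A small numerical sketch of the convergence point above (the cutoff values of N are arbitrary, chosen only to show the trend): the normalizing sum with s = 1 keeps growing as the vocabulary size N increases, while for s > 1 it levels off toward a finite limit.

    def H(n, q, s):
        # normalizing sum H_{n,q,s} = sum_{i=1}^{n} 1/(i+q)^s
        return sum(1.0 / (i + q) ** s for i in range(1, n + 1))

    for n in (10**3, 10**4, 10**5, 10**6):
        # s = 1 grows roughly like ln(n); s = 1.2 approaches a finite limit
        print(n, round(H(n, q=0.0, s=1.0), 3), round(H(n, q=0.0, s=1.2), 3))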

In ecological field studies, the relative abundance distribution (i.e. the graph of the number of species observed as a function of their abundance) is often found to conform to a Zipf–Mandelbrot law.[2]

Within music, many metrics of how "pleasing" a piece of music is have been found to conform to Zipf–Mandelbrot distributions.[3]

Notes

  1. Powers, David M. W. (1998). "Applications and explanations of Zipf's law". Association for Computational Linguistics: 151–160.
  2. Mouillot, D.; Lepretre, A. (2000). "Introduction of relative abundance distribution (RAD) indices, estimated from the rank-frequency diagrams (RFD), to assess changes in community diversity". Environmental Monitoring and Assessment 63 (2): 279–295.
  3. Manaris, B.; Vaughan, D.; Wagner, C. S.; Romero, J.; Davis, R. B. "Evolutionary Music and the Zipf–Mandelbrot Law: Developing Fitness Functions for Pleasant Music". Proceedings of the 1st European Workshop on Evolutionary Music and Art (EvoMUSART 2003). 611.

References

  • Mandelbrot, Benoît (1965). "Information Theory and Psycholinguistics". In B. B. Wolman and E. Nagel (eds.). Scientific Psychology. Basic Books. Reprinted as
    • Mandelbrot, Benoît (1968) [1965]. "Information Theory and Psycholinguistics". In R. C. Oldfield and J. C. Marshall (eds.). Language. Penguin Books.
  • Powers, David M. W. (1998). "Applications and explanations of Zipf's law". Association for Computational Linguistics: 151–160.
  • Zipf, George Kingsley (1932). Selected Studies of the Principle of Relative Frequency in Language. Cambridge, MA: Harvard University Press.
