Bigram

A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of the bigrams in a string is commonly used for simple statistical analysis of text in many applications, including computational linguistics, cryptography, and speech recognition.
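As a minimal sketch of the definition above, the following Python snippet pairs each element of a sequence with its successor. The function name `bigrams` and the sample inputs are illustrative choices, not part of the original article.

```python
from collections import Counter


def bigrams(tokens):
    """Return every pair of adjacent elements, in order of appearance.

    Works on any sliceable sequence: a string yields letter bigrams,
    a list of words yields word bigrams.
    """
    return list(zip(tokens, tokens[1:]))


# Letter bigrams of a single word.
letter_pairs = bigrams("banana")

# Word bigrams of a whitespace-tokenized sentence.
word_pairs = bigrams("the cat sat on the mat".split())

# Frequency distribution of the letter bigrams.
counts = Counter(letter_pairs)
```

A sequence of length k yields k-1 bigrams, so a single-element sequence yields none.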

Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).
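One possible reading of "allowing gaps" can be sketched as follows. It assumes that a gappy bigram may skip at most `max_gap` intervening tokens; both the parameter and the function name `skip_bigrams` are hypothetical, since the text does not fix a precise definition.

```python
def skip_bigrams(tokens, max_gap):
    """Return ordered token pairs separated by at most max_gap tokens.

    With max_gap == 0 this reduces to ordinary (adjacent) bigrams.
    """
    pairs = []
    for i, first in enumerate(tokens):
        # The partner may sit up to max_gap + 1 positions to the right.
        last = min(i + 2 + max_gap, len(tokens))
        for j in range(i + 1, last):
            pairs.append((first, tokens[j]))
    return pairs


words = ["the", "big", "cat", "sat"]
adjacent_only = skip_bigrams(words, 0)
one_word_gap = skip_bigrams(words, 1)
```

With a gap of one, the pair ("the", "cat") is produced in addition to the adjacent pairs, skipping over the modifier "big".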

Head word bigrams are gappy bigrams with an explicit dependency relationship.

Bigrams can be used to estimate the conditional probability of a token given the preceding token, by applying the definition of conditional probability:

$P(W_n|W_{n-1}) = { P(W_{n-1},W_n) \over P(W_{n-1}) }$

That is, the probability of a token $W_n$ given the preceding token $W_{n-1}$ is equal to the probability of their bigram, i.e. the co-occurrence of the two tokens $P(W_{n-1},W_n)$, divided by the probability of the preceding token $P(W_{n-1})$.
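In practice the probabilities in this relation are often estimated from counts. The sketch below assumes a plain maximum-likelihood estimate without smoothing: the number of times the bigram occurs divided by the number of times the preceding token occurs as the first element of any bigram. The function name and the toy sentence are illustrative.

```python
from collections import Counter


def conditional_probability(tokens, prev, word):
    """Estimate P(word | prev) from raw counts (no smoothing)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each token only where it has a successor, so that the
    # estimates for a given prev sum to 1 over all following words.
    context_counts = Counter(tokens[:-1])
    if context_counts[prev] == 0:
        return 0.0  # prev never precedes anything in this sample
    return pair_counts[(prev, word)] / context_counts[prev]


tokens = "the cat sat on the mat".split()
p_cat_given_the = conditional_probability(tokens, "the", "cat")
```

Here "the" is followed once by "cat" and once by "mat", so the estimate for "cat" given "the" is one half. Unseen bigrams get probability zero under this estimate, which is why practical language models usually add some form of smoothing.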

Applications

Bigrams are used in one of the most successful language models for speech recognition.^[1] They are a special case of the n-gram.

Bigram frequency attacks can be used in cryptography to solve cryptograms. See frequency analysis.

Bigram frequency is one approach to statistical language identification.
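A toy illustration of this idea is sketched below. It builds a letter-bigram frequency profile per language and picks the language whose profile overlaps most with the input. The two sample sentences, the dot-product score, and all function names are assumptions made for illustration. A usable identifier would need far larger training corpora and a more careful scoring rule.

```python
from collections import Counter


def bigram_profile(text):
    """Map each letter bigram in text to its relative frequency."""
    text = text.lower()
    # Keep only pairs of adjacent letters; pairs spanning a space
    # or punctuation mark are ignored.
    pairs = [a + b for a, b in zip(text, text[1:]) if a.isalpha() and b.isalpha()]
    if not pairs:
        return {}
    counts = Counter(pairs)
    return {bg: c / len(pairs) for bg, c in counts.items()}


def score(text, profile):
    """Dot product between the text's profile and a language profile."""
    return sum(freq * profile.get(bg, 0.0) for bg, freq in bigram_profile(text).items())


def identify(text, profiles):
    """Return the language whose profile scores highest for text."""
    return max(profiles, key=lambda lang: score(text, profiles[lang]))


# Hypothetical, deliberately tiny training samples.
PROFILES = {
    "english": bigram_profile("the cat sat on the mat and then the other cat ran there"),
    "turkish": bigram_profile("bir kedi ve bir köpek evde birlikte yaşıyorlar"),
}

guess = identify("the other", PROFILES)
```

With these samples, "the other" shares the bigrams "th", "he", "ot" and "er" only with the English profile, so it is attributed to English.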

Bigram frequency in the English language

The frequencies of the most common letter bigrams in a small English corpus are:^[2]

th 1.52       en 0.55       ng 0.18
he 1.28       ed 0.53       of 0.16
in 0.94       to 0.52       al 0.09
er 0.94       it 0.50       de 0.09
an 0.82       ou 0.50       se 0.08
re 0.68       ea 0.47       le 0.08
nd 0.63       hi 0.46       sa 0.06
at 0.59       is 0.46       si 0.05
on 0.57       or 0.43       ar 0.04
nt 0.56       ti 0.34       ve 0.04
ha 0.56       as 0.33       ra 0.04
es 0.56       te 0.27       ld 0.02
st 0.55       et 0.19       ur 0.02

Complete bigram frequencies for a larger corpus are available.^[3]
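A table of this kind could be produced along the following lines. The sketch assumes that bigrams are counted within words only, that only the letters a to z are considered, and that frequencies are expressed as a percentage of all counted bigrams. The normalization behind the figures above is not stated in the source, so the numbers produced here are not directly comparable.

```python
from collections import Counter
import re


def top_letter_bigrams(text, k):
    """Return the k most common within-word letter bigrams as
    (bigram, percentage of all counted bigrams) pairs."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        counts.update(a + b for a, b in zip(word, word[1:]))
    total = sum(counts.values())
    return [(bg, 100.0 * c / total) for bg, c in counts.most_common(k)]


table = top_letter_bigrams("the then", 2)
```

For a language with letters outside a to z, such as Turkish, the regular expression would need to be widened accordingly.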

Bigram frequency in the Turkish language

The frequencies of the most common letter bigrams in Turkish are listed below:^[4]

ar 0.0192        ya 0.0098         or 0.0064
la 0.0175        di 0.0093         nı 0.0063
an 0.0173        ma 0.0091         li 0.0063
er 0.0152        nd 0.0089         me 0.0062
in 0.0151        ra 0.0086         rı 0.0061
le 0.0134        al 0.0084         ta 0.0059
en 0.0132        ak 0.0079         ne 0.0058
de 0.0126        ri 0.0077         el 0.0058
ın 0.0121        il 0.0070         am 0.0058
da 0.0116        ni 0.0067         ek 0.0057
bi 0.0114        ba 0.0065         dı 0.0057
ir 0.0110        rd 0.0065         yo 0.0055
ka 0.0103        ay 0.0064         ki 0.0054

References

[1] Michael Collins. A new statistical parser based on bigram lexical dependencies. In Proceedings of the 34th Annual Meeting of the Association of Computational Linguistics, Santa Cruz, CA. 1996. pp. 184-191.
[2] Cornell Math Explorer's Project – Substitution Ciphers
[3] Citation unavailable.
[4] Sefik Ilkin Serengil. Attacking Turkish Texts Encrypted by Homophonic Cipher. MSc thesis, Galatasaray University, 2011.
