Canterbury corpus

The Canterbury corpus is a collection of files intended for use as a benchmark for testing lossless data compression algorithms. It was created in 1997 at the University of Canterbury, New Zealand, and was designed to replace the Calgary corpus. The files were selected based on their ability to provide representative performance results.[1]

Contents

In its most commonly used form, the corpus consists of 11 files, selected as "average" documents from 11 classes of documents,[2] totaling 2,810,784 bytes, as listed below.

Size (bytes)   File name      Description
152,089        alice29.txt    English text
125,179        asyoulik.txt   Shakespeare
24,603         cp.html        HTML source
11,150         fields.c       C source
3,721          grammar.lsp    LISP source
1,029,744      kennedy.xls    Excel spreadsheet
426,754        lcet10.txt     Technical writing
481,861        plrabn12.txt   Poetry
513,216        ptt5           CCITT test set
38,240         sum            SPARC executable
4,227          xargs.1        GNU manual page
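
As an illustration of how the corpus might be used as a benchmark, the following Python sketch records the file sizes listed above, checks that they sum to the stated total, and measures compressed bits per input byte for each file. The choice of zlib (DEFLATE) as the compressor, the bits-per-byte metric, the function names, and the local directory layout are all assumptions made for this example; none of them is prescribed by the corpus.

```python
import zlib
from pathlib import Path

# File sizes in bytes, copied from the table above.
CORPUS_SIZES = {
    "alice29.txt": 152_089,
    "asyoulik.txt": 125_179,
    "cp.html": 24_603,
    "fields.c": 11_150,
    "grammar.lsp": 3_721,
    "kennedy.xls": 1_029_744,
    "lcet10.txt": 426_754,
    "plrabn12.txt": 481_861,
    "ptt5": 513_216,
    "sum": 38_240,
    "xargs.1": 4_227,
}


def total_size(sizes):
    """Return the combined size in bytes of all listed files."""
    return sum(sizes.values())


def bits_per_byte(data, level=9):
    """Compressed output bits per input byte, using zlib (an assumed choice)."""
    if not data:
        return 0.0
    return 8 * len(zlib.compress(data, level)) / len(data)


def benchmark(corpus_dir):
    """Measure every corpus file found in corpus_dir (a hypothetical local copy).

    Files that are missing are skipped; a file whose size differs from the
    table raises ValueError, since it is probably not the original file.
    """
    results = {}
    for name, expected in CORPUS_SIZES.items():
        path = Path(corpus_dir) / name
        if not path.is_file():
            continue
        data = path.read_bytes()
        if len(data) != expected:
            raise ValueError(f"{name}: expected {expected} bytes, got {len(data)}")
        results[name] = bits_per_byte(data)
    return results
```

Calling benchmark() with the path of a directory holding the unpacked corpus files would return a mapping from file name to bits per byte; lower values indicate better compression.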
