List of Text Corpora

From CLSP Wiki
Revision as of 16:10, 1 September 2017 by Cnapoles (Talk | contribs)

Unannotated English Texts

  • AP 89-90, a large monolingual English news corpus
  • ENRON corpus

Unannotated Non-English Texts

  • ECI (European Corpus Initiative Multilingual Corpus I)
  • LACE
  • Mandarin Chinese News Text (LDC95T13)
  • TIGER corpus of German newspaper text

Parallel Texts

Category:Sentence-aligned parallel corpora

Annotated Corpora

Category:Syntactically annotated corpora
Category:Part-of-speech tagged corpora

Syntactically Annotated Corpora

  • CoNLLX shared task (multilingual dependency treebanks)
  • French treebank
  • Tagged WSJ (40M+ words)

What do the different tags and nonterminals mean?