List of Text Corpora

From CLSP Wiki
Revision as of 16:10, 1 September 2017 by Cnapoles (Talk | contribs)

Unannotated English Texts

  • AP 89-90, a large monolingual English news corpus
  • ENRON corpus
/export/corpora/ENRON
/export/corpora4/wacky

Unannotated Non-English Texts

  • ECI (European Corpus Initiative Multilingual Corpus I)
  • LACE
  • Mandarin Chinese News Text (LDC95T13)
  • TIGER corpus of German newspaper text
/export/corpora/TIGER

Parallel Texts

Category:Sentence-aligned parallel corpora
/export/corpora/ACL07Shared

Annotated Corpora

Category:Syntactically annotated corpora
Category:Part-of-speech tagged corpora
Syntactically Annotated Corpora
/export/corpora/CONLL03
/export/corpora/CONLL07
  • CoNLLX shared task (multilingual dependency treebanks)
/export/corpora/CONLLX
  • French treebank
/export/corpora/PARIS7
Machine-Annotated
  • Tagged WSJ (40M+ words)
/usr/tools/lab/corpora/wsj.40M/wsj_tagged
/usr/tools/lab/corpora/wsj.40M/wsj/README

What do the different tags and nonterminals mean?