List of Text Corpora

From CLSP Wiki

This page is not an exhaustive list of the LDC corpora available at /export/corpora*/LDC. To check whether a particular LDC corpus is available, see How to find LDC corpora on the CLSP network.
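As a rough illustration of checking the /export/corpora*/LDC locations by hand, the sketch below globs for the LDC roots and looks for a directory whose name contains a given catalog ID. It assumes each corpus sits in a directory named after its catalog ID directly under an LDC root, which this page does not confirm; the function name find_ldc_corpus is hypothetical. The procedure on the linked page takes precedence.

```python
import glob
import os


def find_ldc_corpus(catalog_id, pattern="/export/corpora*/LDC"):
    """Return directories under any LDC root whose name contains catalog_id.

    Assumption: corpora live in directories named by catalog ID
    (e.g. LDC95T13) directly under each LDC root. The actual layout
    on the CLSP network may differ.
    """
    matches = []
    for root in sorted(glob.glob(pattern)):
        if not os.path.isdir(root):
            continue
        for entry in sorted(os.listdir(root)):
            # Case-insensitive substring match on the directory name.
            if catalog_id.lower() in entry.lower():
                matches.append(os.path.join(root, entry))
    return matches


if __name__ == "__main__":
    # LDC95T13 is the Mandarin Chinese News Text corpus listed on this page.
    for path in find_ldc_corpus("LDC95T13"):
        print(path)
```

An empty result only means no directory name matched under the assumed layout, not that the corpus is unavailable.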

Unannotated English Texts

/export/corpora4/wacky

Unannotated Non-English Texts

  • ECI (European Corpus Initiative Multilingual Corpus I)
  • LACE
  • Mandarin Chinese News Text (LDC95T13)
  • TIGER corpus of German newspaper text
  • WaCKy corpus (English, German, and Italian)
/export/corpora4/wacky

Parallel Texts

Category:Sentence-aligned parallel corpora
/export/corpora/ACL07Shared

Annotated Corpora

Category:Syntactically annotated corpora
Category:Part-of-speech tagged corpora
Syntactically Annotated Corpora
/export/corpora/CONLL03
/export/corpora/CONLL07
  • CoNLL-X shared task (multilingual dependency treebanks)
/export/corpora/CONLLX
Machine-Annotated
  • Tagged WSJ (40M+ words)
   /usr/tools/lab/corpora/wsj.40M/wsj_tagged
   /usr/tools/lab/corpora/wsj.40M/wsj/README

What do the different tags and nonterminals mean?