List of Text Corpora

From CLSP Wiki

This page is not an exhaustive list of the LDC corpora available at /export/corpora*/LDC. To find out whether a particular LDC corpus is available, see How to find LDC corpora on the CLSP network.
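As a rough illustration of checking for a corpus by hand, the sketch below expands the /export/corpora*/LDC pattern and looks for a directory whose name contains a given catalog ID. It assumes that corpora sit directly under each LDC directory in folders named after their catalog ID (for example LDC95T13). That layout is an assumption and is not confirmed by this page. The function name find_ldc_corpus is hypothetical, and the page linked above remains the authoritative procedure.

```python
import glob
import os


def find_ldc_corpus(catalog_id, pattern="/export/corpora*/LDC"):
    """Return directories under the LDC roots whose name contains catalog_id.

    Assumption: each corpus is a directory directly under an LDC root and
    its name includes the LDC catalog ID. Adjust if the layout differs.
    """
    matches = []
    # Expand the wildcard, e.g. /export/corpora1/LDC, /export/corpora2/LDC, ...
    for root in sorted(glob.glob(pattern)):
        if not os.path.isdir(root):
            continue
        for entry in sorted(os.listdir(root)):
            path = os.path.join(root, entry)
            # Case-insensitive substring match on the directory name.
            if catalog_id.lower() in entry.lower() and os.path.isdir(path):
                matches.append(path)
    return matches


if __name__ == "__main__":
    hits = find_ldc_corpus("LDC95T13")
    print("\n".join(hits) if hits else "not found under /export/corpora*/LDC")
```

An empty result only means that no directory name matched under the assumed layout. It does not prove that the corpus is unavailable.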

Unannotated English Texts


Unannotated Non-English Texts

  • ECI (European Corpus Initiative Multilingual Corpus I)
  • LACE
  • Mandarin Chinese News Text (LDC95T13)
  • TIGER corpus of German newspaper text
  • WaCKy corpus (English, German, and Italian)

Parallel Texts

Category:Sentence-aligned parallel corpora

Annotated Corpora

Category:Syntactically annotated corpora
Category:Part-of-speech tagged corpora

Syntactically Annotated Corpora

  • CoNLL-X shared task (multilingual dependency treebanks)
  • Tagged WSJ (40M+ words)

What do the different tags and nonterminals mean?