Chinese Treebank 5.0
Chinese Treebank 5.0 contains 507,222 words, 824,983 Hanzi, 18,782 sentences, and 890 data files.
All files are GB-encoded, and the format of Chinese Treebank 5.0 is the same as that of the Penn English Treebank. All files have been annotated at least twice: the first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). Some files were also double-blind annotated and then adjudicated to create gold-standard files.
The corpus provides four versions of the files: bracketed, raw, segmented and postagged. The raw, segmented and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages. The bracketed files are named chtb_nnnn.fid, where nnnn is a sequential file number.
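As a minimal sketch of working with the naming scheme and encoding described above, the Python helpers below extract the sequential number from a bracketed file name and decode GB-encoded bytes. The function names are hypothetical, the regular expression does not assume a fixed number of digits, and the choice of the `gb18030` codec (a superset of GB2312) is an assumption, since the description only says "GB".

```python
import re

# Assumed pattern: "chtb_" + digits + ".fid"; the digit count is left open.
FID_PATTERN = re.compile(r"^chtb_(\d+)\.fid$")


def file_number(name):
    """Return the sequential number in a bracketed file name, or None."""
    match = FID_PATTERN.match(name)
    return int(match.group(1)) if match else None


def decode_gb(data, codec="gb18030"):
    """Decode GB-encoded bytes to str.

    gb18030 is an assumed codec choice; it is backward compatible
    with GB2312, so it should also read plain GB2312 text.
    """
    return data.decode(codec)
```

For example, `file_number("chtb_0001.fid")` yields the integer file number, while a name that does not match the pattern yields `None`.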
The 5.1 update corrects errors found in the earlier version. Specifically, sentences that had more than one top-level node have been modified, and some GB-encoded whitespace characters have been converted to ASCII. The 5.1 package is available as an additional download to everyone who has licensed CTB5.0.
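To illustrate what "more than one top-level node" means in a Penn-style bracketing, the sketch below counts the constituents that open at nesting depth zero. It is not the procedure used to produce the 5.1 update. It assumes the input is the bracketing of a single sentence with no extra enclosing pair of parentheses, and that every ASCII parenthesis is structural; the function name and the sample strings are hypothetical.

```python
def count_top_level_nodes(bracketed):
    """Count bracketed constituents that open at depth 0."""
    depth = 0
    count = 0
    for ch in bracketed:
        if ch == "(":
            if depth == 0:
                count += 1  # a new constituent starts at the top level
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing parenthesis")
    if depth != 0:
        raise ValueError("unbalanced opening parenthesis")
    return count


# Hypothetical sample bracketings, not taken from the corpus.
single = "(IP (NP (NN a)) (VP (VV b)))"
double = "(IP (NN a)) (FRAG (NN b))"
```

A sentence like `double`, for which the count is greater than one, is the kind of case the update describes as having been modified.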
- Link time
- 2020-05-21 19:17:00 UTC
- Resource type
- Single study
- Art & Culture