CharacterDB:FAQ
Frequently Asked Questions (FAQ) about CharacterDB
General
What is CharacterDB?
CharacterDB is a database of facts about Chinese characters (also known as Hanzi, Kanji, or Hanja). It is maintained collaboratively and released under an open content license.
What can I do with this wiki?
As with any other wiki, you can browse the data and edit it easily at the same time. This makes it easy to add new data and to maintain and correct existing information.
The underlying system adds powerful querying capabilities that let you search for and aggregate character data, providing a direct way to make use of the content.
In addition, the wiki will serve as a foundation for publishing the gathered content to the Linked Open Data cloud.
What similar databases or sources exist?
- Unihan database
  - We derive much of our data from Unihan, which is maintained by the Unicode Consortium alongside the Unicode Standard. It offers data on pronunciation, stroke count, radicals, semantic variants, and traditional/simplified relationships.
- Unicode Ideographic Variation Database
  - The IVD registers glyph variations for encoded Han characters.
- CHISE
  - The CHISE project gathers component data.
- GlyphWiki
  - This project, also a wiki, gathers glyph images for Han characters (in particular Japanese ones).
- Wenlin Character Description Language
  - The CDL was designed to offer data on glyphs down to the stroke level and covers a very large range of Chinese Hanzi.
- KanjiVG
  - This project offers handwriting samples as well as component and stroke data for Japanese Kanji.
- Commons Stroke Order Project
  - This open-source project creates stroke order images for teaching and also gathers information on strokes.
Technical
I want to use the data. How can I get it?
A small Python script called export.py downloads certain views of the data. If you have Python installed, you can run the following, for example, to get a list of all stroke orders:
python export.py strokeorder_all
The resulting list is in the form of comma-separated values (CSV) and includes columns for the character, the stroke order of the glyph, the glyph index, and a reserved value.
For example,
"⺁","P-SP","0",""
is returned for glyph ⺁/0, which has stroke order ㇒㇓ (P-SP).
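To work with this output in your own code, the rows can be read with Python's standard csv module. The following is only a minimal sketch: the names StrokeOrderRow and parse_stroke_orders are made up for illustration and are not part of export.py, and it assumes that every row has exactly the four columns described above.

```python
import csv
import io
from typing import NamedTuple


class StrokeOrderRow(NamedTuple):
    """One row of the stroke order list (hypothetical helper type)."""
    character: str
    stroke_order: str  # kept as the raw string, e.g. "P-SP"
    glyph_index: int
    reserved: str


def parse_stroke_orders(csv_text):
    """Parse CSV text in the four-column layout described in this FAQ."""
    rows = []
    for record in csv.reader(io.StringIO(csv_text)):
        if not record:
            continue  # skip blank lines
        # Assumption: exactly four columns per row.
        character, stroke_order, glyph_index, reserved = record
        rows.append(
            StrokeOrderRow(character, stroke_order, int(glyph_index), reserved)
        )
    return rows


sample = '"⺁","P-SP","0",""\n'
row = parse_stroke_orders(sample)[0]
print(row.character, row.stroke_order, row.glyph_index)
```

The glyph index is converted to an integer so that rows for the same character can be sorted by glyph; the reserved column is passed through unchanged.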
Alternatively, you can download the list yourself, using the following URL as a pattern:
http://characterdb.cjklib.org/wiki/Special:Ask/-5B-5BCategory:Glyph-5D-5D-20-5B-5BStrokeOrder::-21-5D-5D/-3FStrokeOrder/format=csv/sep=-2C/headers=hide/limit=500/offset=0
This link downloads only 500 entries at a time, so to retrieve the full list (currently about 12,000 entries) you need to repeat the request, increasing the offset by 500 each time.
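The paging over the offset can be automated. The sketch below is one possible way to do it, not the way export.py works: build_url, fetch_page, and download_all are hypothetical names, the response is assumed to be UTF-8 encoded, and a page with fewer rows than the limit is assumed to be the last one.

```python
import csv
import io
import urllib.request

# URL pattern taken from this FAQ; only limit and offset are varied.
BASE_URL = (
    "http://characterdb.cjklib.org/wiki/Special:Ask/"
    "-5B-5BCategory:Glyph-5D-5D-20-5B-5BStrokeOrder::-21-5D-5D/"
    "-3FStrokeOrder/format=csv/sep=-2C/headers=hide"
)
PAGE_SIZE = 500


def build_url(offset, limit=PAGE_SIZE):
    """Return the query URL for one page of results."""
    return f"{BASE_URL}/limit={limit}/offset={offset}"


def fetch_page(url):
    """Download one page and return it as text (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def download_all(fetch=fetch_page, limit=PAGE_SIZE):
    """Collect all rows by increasing the offset until a short page arrives."""
    rows = []
    offset = 0
    while True:
        text = fetch(build_url(offset, limit))
        page = [record for record in csv.reader(io.StringIO(text)) if record]
        rows.extend(page)
        if len(page) < limit:
            break  # assumption: a short (or empty) page means we are done
        offset += limit
    return rows
```

Calling download_all() with no arguments performs the real download, roughly 24 requests for the current list size. The fetch parameter exists so that the paging logic can be tried out with a stand-in function instead of the network.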
I can't see all characters. What fonts do I need?
The following fonts should cover the most important character blocks:
- Chinese
  - AR PL SungtiL GB & AR PL KaitiM GB (ttf-arphic-gbsn00lp & ttf-arphic-gkai00mp, http://www.arphic.com.tw/)
  - AR PL UKai & AR PL UMing (ttf-arphic-ukai & ttf-arphic-uming, http://www.freedesktop.org/wiki/Software/CJKUnifonts/Download)
- General (fallback)
  - Unifont (ttf-unifont, http://unifoundry.com/unifont.html)
  - Han Nom A & Han Nom B (http://vietunicode.sourceforge.net/fonts/fonts_hannom.html)

