What is the thesaurus file format of Baidu Input Method?
The thesaurus file format of Baidu Input Method is bdict, a relatively simple format. A bdict file consists of header information, a thesaurus introduction, and the list of entries. The Chinese characters in the entries are encoded in Unicode.
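As a rough illustration of what "encoded in Unicode" means for an entry, the sketch below assumes the term refers to UTF-16 little-endian, which is what Windows-era tools usually mean by it. The sample word is made up for the example and the sketch does not reproduce the actual bdict byte layout.

```python
# Assumption: "Unicode" here means UTF-16LE (two bytes per common Chinese character).
word = "输入法"  # hypothetical sample entry, not taken from a real bdict file

encoded = word.encode("utf-16-le")     # bytes as they might be stored in an entry
decoded = encoded.decode("utf-16-le")  # reading the entry back

print(len(word), "characters ->", len(encoded), "bytes")
print(decoded)
```

Under this assumption every character in the sample takes two bytes, so an entry's byte length is twice its character count.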

Sogou's cell thesaurus uses the scel format, which encodes both the Chinese characters and the pinyin in Unicode. An scel file consists of header information, a thesaurus introduction, a pinyin combination list, and a word list. The entry data structure in scel is comparatively well designed: entries refer to their pinyin through pinyin pointers, so the pinyin text is not repeated inside each entry, and homophonic words are merged together to save space.
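The space-saving idea described above can be sketched in memory as follows. This is only a conceptual model, not the real scel byte layout: the function name `build_tables`, the sample entries, and the use of list indexes as "pinyin pointers" are all assumptions made for illustration.

```python
from collections import defaultdict

def build_tables(entries):
    """entries: list of (word, [syllable, ...]) pairs.

    Returns a pinyin table in which each syllable is stored once, plus a
    mapping from a tuple of table indexes (the "pinyin pointers") to the
    list of homophonic words that share that pronunciation.
    """
    pinyin_table = []           # each distinct syllable appears once
    index_of = {}               # syllable -> position in pinyin_table
    groups = defaultdict(list)  # pointer tuple -> homophonic words

    for word, syllables in entries:
        key = []
        for s in syllables:
            if s not in index_of:
                index_of[s] = len(pinyin_table)
                pinyin_table.append(s)
            key.append(index_of[s])
        groups[tuple(key)].append(word)  # merge homophones under one key
    return pinyin_table, dict(groups)

# Hypothetical sample data
entries = [
    ("事实", ["shi", "shi"]),
    ("实施", ["shi", "shi"]),
    ("时事", ["shi", "shi"]),
    ("北京", ["bei", "jing"]),
]
pinyin_table, groups = build_tables(entries)
print(pinyin_table)
print(groups)
```

In this model the three words pronounced "shi shi" share one pointer tuple, and the syllable text itself is stored only once in the table.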

QQ's categorized thesaurus uses the qpyd format, whose list of entries is zip-compressed. A qpyd file consists of header information, a thesaurus introduction, and the compressed list of entries. Because of the compression, a qpyd file is smaller than thesaurus files in the other formats when the number of entries is the same. Unlike Sogou's scel format, however, qpyd stores the pinyin alongside every entry; the words are encoded in UTF-8, while the pinyin is encoded in Unicode.
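The combination of per-entry pinyin, mixed encodings, and compression can be sketched as below. Several things here are assumptions rather than the real qpyd layout: the length-prefixed record structure is hypothetical, "Unicode" is taken to mean UTF-16LE, and Python's `zlib` (DEFLATE, the algorithm commonly used inside zip archives) stands in for the "zip compression" mentioned above.

```python
import struct
import zlib

def pack_entries(entries):
    """Pack (word, pinyin) pairs into a compressed blob.

    Hypothetical record: two little-endian uint16 byte lengths, then the
    pinyin in UTF-16LE, then the word in UTF-8.
    """
    raw = bytearray()
    for word, pinyin in entries:
        p = pinyin.encode("utf-16-le")  # pinyin: assumed UTF-16LE
        w = word.encode("utf-8")        # word: UTF-8, as stated in the text
        raw += struct.pack("<HH", len(p), len(w)) + p + w
    return zlib.compress(bytes(raw))

def unpack_entries(blob):
    """Inverse of pack_entries: decompress, then walk the records."""
    raw = zlib.decompress(blob)
    entries, pos = [], 0
    while pos < len(raw):
        p_len, w_len = struct.unpack_from("<HH", raw, pos)
        pos += 4
        pinyin = raw[pos:pos + p_len].decode("utf-16-le")
        pos += p_len
        word = raw[pos:pos + w_len].decode("utf-8")
        pos += w_len
        entries.append((word, pinyin))
    return entries

# Hypothetical sample data, repeated so the effect of compression is visible
sample = [("北京", "bei jing"), ("事实", "shi shi"), ("实施", "shi shi")] * 200
blob = pack_entries(sample)
print(len(zlib.decompress(blob)), "bytes raw ->", len(blob), "bytes compressed")
```

Storing the pinyin with every entry makes the uncompressed list redundant, which is exactly the kind of data that compresses well.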