Institute of Computing Technology, Chinese Academy IR
A Seed-Based Method for Generating Chinese Confusion Sets | |
Liu, Liangliang (1); Cao, Cungen (2) | |
2016-12-01 | |
Journal | ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING |
ISSN | 2375-4699 |
Volume | 16; Issue: 1; Pages: 16 |
Abstract | In natural language, people often misuse a word (called a "confused word") in place of other words (called "confusing words"). In misspelling correction, many approaches to finding and correcting misspelling errors are based on a simple notion called a "confusion set." The confusion set of a confused word consists of its confusing words. In this article, we propose a new method of building Chinese character confusion sets. Our method is composed of two major phases. In the first phase, we build a list of seed confusion sets for each Chinese character, based on measuring similarity in character pinyin or similarity in character shape. In this phase, all confusion sets are constructed manually, and the confusion sets are organized into a graph, called a "seed confusion graph" (SCG), in which vertices denote characters and edges are pairs of characters in the form (confused character, confusing character). In the second phase, we extend the SCG by acquiring more pairs of (confused character, confusing character) from a large Chinese corpus. For this, we use several word patterns (or simply "patterns") to generate new confusion pairs and then verify the pairs before adding them to the SCG. Comprehensive experiments show that our method of extending confusion sets is effective. We also use the confusion sets in Chinese misspelling correction to show the utility of our method. |
Keywords | Confusion set; pattern matching; context probability; pinyin similarity; shape similarity |
DOI | 10.1145/2933396 |
Indexed by | SCI |
Language | English |
Funding | National Natural Science Foundation of China[91224006] ; National Natural Science Foundation of China[61173063] ; Ministry of Science and Technology of China[201303107] |
WOS research area | Computer Science |
WOS category | Computer Science, Artificial Intelligence |
WOS accession number | WOS:000391438500005 |
Publisher | ASSOC COMPUTING MACHINERY |
Document type | Journal article |
Item identifier | http://119.78.100.204/handle/2XEOYT63/7715 |
Collection | Institute of Computing Technology, Chinese Academy of Sciences: Journal Articles (English) |
Corresponding author | Liu, Liangliang |
Author affiliations | 1. Shanghai Univ Int Business & Econ, Sch Business Informat, Shanghai 201620, Peoples R China; 2. Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China |
Recommended citation, GB/T 7714 | Liu, Liangliang, Cao, Cungen. A Seed-Based Method for Generating Chinese Confusion Sets[J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2016, 16(1): 16. |
APA | Liu, Liangliang, & Cao, Cungen. (2016). A Seed-Based Method for Generating Chinese Confusion Sets. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 16(1), 16. |
MLA | Liu, Liangliang, et al. "A Seed-Based Method for Generating Chinese Confusion Sets". ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING 16.1 (2016): 16. |
Files in this item | There are no files associated with this item. |
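
The two-phase structure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `SeedConfusionGraph`, the function `extend_graph`, and the `verify` callback are hypothetical, the example characters are common textbook confusions rather than data from the article, and the pattern-based candidate generation and corpus-based verification steps are replaced by a stand-in predicate.

```python
from collections import defaultdict


class SeedConfusionGraph:
    """Directed graph whose edges are (confused character, confusing character)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_pair(self, confused, confusing):
        # A character is never recorded as confusing with itself.
        if confused != confusing:
            self._edges[confused].add(confusing)

    def confusion_set(self, confused):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._edges.get(confused, set()))


def extend_graph(graph, candidate_pairs, verify):
    """Phase 2 sketch: add only new candidate pairs that pass verification.

    `verify` is a hypothetical callback standing in for whatever check
    is applied to pairs generated from the corpus.
    """
    added = []
    for confused, confusing in candidate_pairs:
        if confusing in graph.confusion_set(confused):
            continue  # already present in the graph
        if verify(confused, confusing):
            graph.add_pair(confused, confusing)
            added.append((confused, confusing))
    return added


# Phase 1 sketch: manually entered seeds (illustrative examples only).
scg = SeedConfusionGraph()
scg.add_pair("在", "再")  # similar pinyin (both "zai")
scg.add_pair("己", "已")  # similar shape

# Phase 2 sketch: candidates would come from corpus patterns; here they are
# hard-coded, and the verifier accepts a single pair for demonstration.
candidates = [("在", "再"), ("的", "地"), ("在", "山")]
accepted = extend_graph(scg, candidates, lambda a, b: (a, b) == ("的", "地"))
```

Storing edges in one direction only mirrors the (confused character, confusing character) form given in the abstract; a symmetric relation would need both directions added explicitly.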
Unless otherwise stated, all content in this system is protected by copyright, with all rights reserved.