Building on machine learning and deep learning, this paper proposes a method for term extraction and new word discovery in an Intangible Cultural Heritage (ICH) project corpus, builds a domain thesaurus, and explores its application in digital humanities. First, natural language processing methods are used to preprocess the ICH ceramics corpus, which is then annotated according to a domain terminology lexicon. Second, the Random-CRFs model is used to investigate how dictionary (DICT), part-of-speech (POS), radical (Radical), and pinyin (Pinyin) features influence term extraction, and four models (Random-CRFs, Random-BiLSTM, Random-BiLSTM-CRFs, and BERT-BiLSTM-CRFs) are compared on the term extraction task. Finally, a trained model is used to identify new words in the test corpus, and the extracted candidate words are evaluated manually. The resulting terminology database of 1,173 ICH ceramics terms is applied to ICH project portraits, ICH ceramics knowledge graphs, and ICH ceramics term retrieval.
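The annotation and feature-construction steps summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the BIO tag scheme, forward maximum matching, the function names `bio_annotate` and `char_features`, and the toy lexicon and lookup tables are all assumptions made for illustration. In practice the POS, radical, and pinyin values would presumably come from a segmenter and external resources rather than hand-written tables.

```python
from typing import Dict, List, Set


def bio_annotate(text: str, lexicon: Set[str]) -> List[str]:
    """Tag each character B/I/O by forward maximum matching against a lexicon.

    Assumption: a BIO scheme with a single TERM label; the abstract does not
    state which tag scheme was used.
    """
    tags = ["O"] * len(text)
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(text):
        matched = 0
        # Try the longest candidate first so shorter nested terms do not win.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                matched = n
                break
        if matched:
            tags[i] = "B-TERM"
            for j in range(i + 1, i + matched):
                tags[j] = "I-TERM"
            i += matched
        else:
            i += 1
    return tags


def char_features(
    text: str,
    dict_tags: List[str],
    pos_tags: List[str],
    radicals: Dict[str, str],
    pinyins: Dict[str, str],
) -> List[Dict[str, str]]:
    """Build one feature dict per character (DICT, POS, Radical, Pinyin).

    Characters missing from the hypothetical lookup tables get "UNK".
    """
    return [
        {
            "char": ch,
            "dict": dict_tags[i],
            "pos": pos_tags[i],
            "radical": radicals.get(ch, "UNK"),
            "pinyin": pinyins.get(ch, "UNK"),
        }
        for i, ch in enumerate(text)
    ]


# Toy data, invented for illustration only.
sentence = "青花瓷产于景德镇"
lexicon = {"青花", "青花瓷"}
tags = bio_annotate(sentence, lexicon)

pos_tags = ["n", "n", "n", "v", "p", "ns", "ns", "ns"]  # hypothetical per-character POS
radicals = {"青": "青", "花": "艹", "瓷": "瓦"}
pinyins = {"青": "qing", "花": "hua", "瓷": "ci"}
feats = char_features(sentence, tags, pos_tags, radicals, pinyins)
```

Per-character feature dictionaries of this shape are a common input format for CRF toolkits. Dropping one key at a time is one simple way to run the kind of feature comparison the abstract describes.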
Key words
intangible cultural heritage /
domain terminology /
new word discovery /
digital humanities