A Method for Creating Structural Models of Text Documents Using Neural Networks

Dmitriy V. Berezkin, Ilya A. Kozlov, Polina A. Martynyuk, Artyom M. Panfilkin


The article describes modern BERT-based neural network models and considers their application to natural language processing (NLP) tasks such as question answering and named entity recognition. It presents a method for automatically creating structural models of text documents. The proposed method is hybrid: it combines several NLP models. The method builds a structural model of a document by extracting sentences that correspond to various aspects of the document. Information is extracted with the BERT Question Answering model, using questions prepared separately for each aspect. The answers are filtered with the BERT Named Entity Recognition model and used to generate the contents of each field of the structural model. The article proposes two algorithms for generating field contents: the Exclusive answer choosing algorithm, used for short fields, and the Generalizing answer forming algorithm, used for voluminous fields. The article also describes a software implementation of the proposed method and discusses the results of experiments conducted to evaluate its quality.
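The flow outlined in the abstract (ask per-aspect questions, filter the answers by entity type, then fill each field) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `FieldSpec`, `build_structural_model`, `toy_qa`, and `toy_ner` are hypothetical, the two field-filling rules (keep the highest-scoring answer; join distinct answers) are simple stand-ins and not the article's Exclusive answer choosing or Generalizing answer forming algorithms, and the toy callables take the place of real BERT models.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

# (answer text, confidence score) as returned by a question-answering model.
Answer = Tuple[str, float]
QAModel = Callable[[str, str], Answer]  # (document, question) -> answer
NERModel = Callable[[str], Set[str]]    # text -> entity types found in it


@dataclass
class FieldSpec:
    """Hypothetical description of one field of the structural model."""
    questions: List[str]            # questions prepared for this aspect
    required_entity: Optional[str]  # entity type an answer must contain
    short: bool                     # True: one answer; False: merged answers


def build_structural_model(document: str, fields: Dict[str, FieldSpec],
                           qa: QAModel, ner: NERModel) -> Dict[str, str]:
    """Fill every field from QA answers that pass the NER filter."""
    model: Dict[str, str] = {}
    for name, spec in fields.items():
        answers = [qa(document, question) for question in spec.questions]
        # Drop empty answers and answers lacking the required entity type.
        kept = [
            (text, score) for text, score in answers
            if text and (spec.required_entity is None
                         or spec.required_entity in ner(text))
        ]
        if not kept:
            model[name] = ""
        elif spec.short:
            # Assumed rule for short fields: keep the best-scoring answer.
            model[name] = max(kept, key=lambda answer: answer[1])[0]
        else:
            # Assumed rule for long fields: join distinct answers in order.
            distinct: List[str] = []
            for text, _ in kept:
                if text not in distinct:
                    distinct.append(text)
            model[name] = " ".join(distinct)
    return model


def toy_qa(document: str, question: str) -> Answer:
    """Stand-in for a QA model: canned answers keyed by question."""
    canned = {
        "When did it happen?": ("12 March 2021", 0.9),
        "What is the date?": ("last spring", 0.6),
        "What happened?": ("The plant was closed.", 0.8),
        "What was the outcome?": ("Workers were reassigned.", 0.7),
    }
    return canned.get(question, ("", 0.0))


def toy_ner(text: str) -> Set[str]:
    """Stand-in for an NER model: any text with a digit is tagged DATE."""
    return {"DATE"} if any(ch.isdigit() for ch in text) else set()


fields = {
    "date": FieldSpec(["When did it happen?", "What is the date?"],
                      required_entity="DATE", short=True),
    "summary": FieldSpec(["What happened?", "What was the outcome?"],
                         required_entity=None, short=False),
}
result = build_structural_model("(document text)", fields, toy_qa, toy_ner)
print(result)
```

Passing the models in as callables keeps the sketch independent of any particular library; a real QA or NER model could be wrapped to match the `QAModel` and `NERModel` signatures assumed here.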

Keywords

information extraction; neural network; named entity recognition; question-answering system



DOI: http://dx.doi.org/10.14529/cmse230102