Developing Intelligent Assistants to Search for Content on Websites of a Certain Genre

Vladislav D. Rublev, Elena A. Sidorova

Abstract


This paper discusses an approach to the automatic generation of intelligent assistants that search the content of a website. The approach relies on genre models developed for a given type of resource (educational, informational, etc.); these models drive the genre structuring and the subsequent thematic clustering of the target website's content. The resulting genre structures make it possible to define more precisely the boundaries of the thematic clusters related to the topic of the user's search query. A search quality evaluation on Russian-language websites showed an F-score of 87.8% and an originality of 80.9%, which exceeds the results of the Yandex search engine by 1.1% and 9.1%, respectively. To predict user information needs, a method for refining the resulting sample is proposed. The method implicitly derives, from the current and previous queries, information about what the user found unsatisfactory in the previous search results. A model of the user's search intentions has been developed; its computational component includes a method for evaluating query closeness based on the FRiS function. Based on the proposed methods, a chatbot for searching the websites of educational institutions was created on the Telegram messenger platform. The experiments showed that a user needs an average of 1.75 clarifying questions to find the necessary information.
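The abstract names the FRiS function as the basis for evaluating query closeness but does not give its form. The sketch below is only an illustration under stated assumptions: it uses the commonly cited competitive-similarity form F = (r_rival - r_own) / (r_rival + r_own), represents queries as plain term-weight vectors, and measures distance with cosine distance. The function names `fris`, `cosine_distance`, and `query_closeness` are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def fris(dist_own: float, dist_rival: float) -> float:
    """Competitive similarity in [-1, 1].

    +1: the object coincides with its own reference,
    -1: it coincides with the rival reference,
     0: it is equally far from both.
    """
    total = dist_own + dist_rival
    if total == 0:
        # Both distances are zero: the object cannot be attributed to either.
        return 0.0
    return (dist_rival - dist_own) / total


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumed here as the query distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def query_closeness(query, previous_query, rival_query) -> float:
    """Hypothetical helper: how much closer `query` is to the previous
    query than to a competing one, on the FRiS scale."""
    return fris(
        cosine_distance(query, previous_query),
        cosine_distance(query, rival_query),
    )
```

Under these assumptions, a value near +1 would suggest that the new query continues the previous one, while a value near -1 would suggest a switch to a different topic.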

Keywords


information retrieval; intelligent assistant; website genre model; thematic analysis; information retrieval system; user search intent model





DOI: http://dx.doi.org/10.14529/cmse220404