METHODS OF SPEECH AND TEXT DATABASES DEVELOPMENT FOR QA-SYSTEMS

The paper is devoted to the problems of developing question-answer systems (QA-systems). The subject of the study is a discussion of approaches to automatically filling the database of a QA-system based on the analysis of unstructured text sources currently available in the public domain of the Internet. The analysis reveals that the following ways of implementing QA-systems are distinguished: based on inference on ontologies, on rules and syntax, and using artificial neural networks. Methods for the automatic search of question-answer pairs, based on the structure of sentences and on associative-ontological analysis, have been developed and tested in the research. The method based on the analysis of the structure of sentences is effective for texts such as lists of frequently asked questions (FAQ), as well as literary texts containing dialogs and direct speech; it relies on preliminary processing of the text and is expressed in the form of a heuristic rule. The method based on associative-ontological analysis is focused on the class of reference and dictionary texts and is based on the assumption that a descriptive text contains a sentence (or a group of sentences) carrying the main idea of the text. In this case, the title of the text can be considered a question, and this sentence (or group of sentences) is the answer. The selection of meaning-generating sentences through semantic reduction of the text needs to be automated. For this purpose, automatic abstracting algorithms are applied, based on the associative-ontological approach to the processing of texts in natural language. For the experimental verification of the possibility of creating an open QA-system based on the automatic collection of question-answer pairs from the Internet, a prototype of a collection module for the database of the QA-system has been developed.


Introduction
The task of automatic speech recognition in real conditions is still far from solved, given the variability of the source of the speech signal and the acoustic noise that masks the initial sequence of audio segments. Nevertheless, significant progress has been made in this area in recent years, and there are commercial speaker-independent applications that quite successfully recognize speech in the processing of voice commands (Google Maps, Yandex Maps), in interactive systems (Siri), and in stenographic systems [1]. The accuracy of recognition of speech units in these systems has reached the necessary threshold, so that users begin to trust automatic voice input and consider moving from the usual contact means of information input to contactless ones.
The success achieved in the field of speech recognition is associated with the development of cloud technologies, which made it possible to use: 1) "large" heterogeneous data for training a multi-level hierarchical acoustic and language model of speech; 2) crowdsourcing technologies for manual processing of a huge volume of training and recognized audio and text data; 3) distributed computing resources for servicing client voice applications.
The factors that reduce the complexity of the task are the possibility of preliminary tuning to a specific speaker and a relatively small dictionary of recognizable speech units.
Among the possible areas of research contributing to the solution of the problem are methods that apply: 1) multichannel recording and processing of audio signals using an array of microphones for filtering audio noise; 2) multi-sensory recording of the process of speech formation using different types of sensors (microphones, laryngophones, video cameras, etc.); 3) biometric analysis of the psychophysiological state of the speaker with the evaluation of speech capabilities and the choice of the most accessible communication channel.
The effectiveness of human-machine interaction is also related to the current state of the operator. The works [2,3] present a technology of personalized monitoring of the working conditions of personnel at industrial enterprises, implemented in the interests of ensuring reliable activity and health preservation, together with the general scheme of a personalized indicator of working conditions. The works [4,5] analyze domestic patents for methods and devices for diagnosing the functional state of a human operator; the analysis shows a low innovative level of the inventions, and the forecast of scientific and inventive activity indicates a decrease in the number of inventions in this branch of science and technology in the coming years.
The problem of the variability of speech in the various psychophysiological states of the speaker caused by external factors is less studied and is the most complex. Studying it requires creating the speech databases necessary for the subsequent training of the on-board speech recognition system. First of all, however, it is important to determine the hardware and software resources that can be allocated for the processing of speech audio. This will determine which generation of speech recognition systems (based on template matching, hidden Markov models, artificial neural networks, etc.) can be run on the client device.
Given the criticality of the tasks solved with the help of on-board client devices, it is difficult to record training voice databases in real operating conditions. The only option for introducing speech technologies is an iterative procedure of gradual modification of the speech training databases, recorded initially in an artificially recreated acoustic environment. The main steps in the formation of speech databases are: 1) classification and analysis of the amplitude-frequency characteristics of audio noise and the creation of appropriate databases; 2) analysis of the variability of the speaker's speech caused by audio noise.
It is probably possible to carry out the first step in conditions close to real operation. Audio recordings for the second step can be made in the laboratory by playing audio noise with the specified characteristics to the speaker through headphones. Introducing additional devices for recording audio signals in real conditions would, of course, significantly accelerate the solution of the problem of filtering noise and handling the variability of the speaker's speech.
Automatic text processing is an integral step in the formation of human-machine speech interfaces. For QA-systems, it is important that semantically equivalent questions are recognized as the same question, regardless of the words, style, syntactic relations and idioms used. To search for or generate an answer to a question, a QA-system must have access to a knowledge base that contains the information needed to formulate a response.
There are two main types of QA-systems: closed-domain, or specialized (with a limited thematic area), and open-domain (not limited to a particular subject area). Open-domain QA-systems work with information in all areas of knowledge, which makes it possible to search in related areas. An open-domain QA-system usually works with several sources of knowledge, in which it searches for answers depending on the class of the given question [6,7].
The following ways of implementing QA-systems can be distinguished: based on inference on ontologies [8], on rules and syntax [9], and using artificial neural networks [10]. It is also worth noting that there are approaches to improving the quality of QA-systems based on the user satisfaction score [11].
The system's response should be presented in the form of a phrase in natural language. In some cases, it is enough simply to find a recorded copy of a communicative act, which shows that the question has been asked before and an answer was given to it (a question-answer pair was formed).
The existing database filling technologies for QA-systems include expert filling [12], the use of crowdsourcing technologies [13], methods of procedural generation [14], and automatic filling methods using existing anthologies (text corpora).
The growth in the number of public information resources on the Internet provides, on the one hand, completeness of the terminological thesaurus within individual subject areas and, on the other hand, diversity of thematic areas. This growth has become the basis for the assumption that texts of various content can be analyzed automatically in order to detect and extract communicative acts for subsequent entry into the database of the QA-system in the form of QA-pairs.
The joint use of the voice interface and QA-systems within the framework of human-machine interaction gives the following features: 1) the use of a closed-domain QA-system within the voice interface of interaction with the operator can expand the functionality in cases where the operator's command cannot be executed directly; the phrase is then passed as a request to the QA-system to issue recommendations or provide situational help. In this case, the QA-system should be built on the extended thesaurus of the voice interface of the specific on-board system and should include the basic aspects of the functioning of such a system in the base of question-answer pairs.
2) the use of an open-domain QA-system operating in the voice assistant mode for issues not directly related to the operation of the on-board or mobile system (an analog of the assistants Siri, Cortana, Google Assistant, etc.) increases user satisfaction in communication with the system. In this case, the system can be filled from available open resources, but in addition it is necessary to take into account the variability of speech and the differences in how question phrases are constructed.

The approach to the QA-system's database development
Consider the functional features of a QA-system that creates a database of QA-pairs by extracting knowledge from publicly available Internet resources and provides a dialog question-answer interface in the form of a web service (the block diagram is shown in the Figure). As the Figure shows, the system consists of functionally independent blocks for generating the database and for using this database to respond to user requests.
To fill the database, there is a set of web crawlers and a module for collecting QA-pairs, which collect, download and analyze text documents and extract question-answer pairs from them.
To analyze search queries (the texts of questions) and choose the most relevant answer among the available question-answer pairs, there is an interface search-and-dialogue component, represented on the structural diagram by the interface module of the question-answer system.
The final answer is formulated by the response generation module (included in the interface module of the QA-system), so that the result looks syntactically natural and represents exactly what the user was looking for.
The mechanisms of decomposition of the question (user query) and of search and generation of the answer are considered, for example, in [15]. We will consider only the methods of automatic collection of documents for filling the database (DB) of the QA-system based on the analysis of texts available on the web.
The available pages from the Internet are downloaded using web crawler technology [16]. The crawler follows links in processed documents according to specified algorithms and works in conjunction with a headless browser that parses the original format of the downloaded document (PDF, HTML, MS Word, etc.) and converts it to text format. Additionally, the title of the document is retrieved. At this stage, the elements of the document containing blocks of information that are not related to the main text (text blocks, navigation bars, etc.) are filtered out.
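The conversion and filtering step for HTML pages can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the page is already downloaded, handles only HTML, and uses a hypothetical set of tag names (`SKIPPED_TAGS`) to decide what does not belong to the main text.

```python
from html.parser import HTMLParser

# Assumption: content of these tags is treated as unrelated to the main text.
SKIPPED_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class MainTextExtractor(HTMLParser):
    """Collects the <title> and the visible text outside skipped elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0    # > 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.chunks.append(text)


def extract_title_and_text(html):
    """Return (title, main text) for one HTML document."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.chunks)


page = (
    "<html><head><title>What is a crawler?</title><style>p{}</style></head>"
    "<body><nav>Home | About</nav><p>A crawler downloads pages.</p>"
    "<footer>Contacts</footer></body></html>"
)
title, text = extract_title_and_text(page)
print(title)
print(text)
```

A real collection module would also need converters for PDF and MS Word documents, which this sketch leaves out.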
Figure. The structured scheme of the QA-system with speech interface

In this work, several methods of automatic selection of question-answer pairs were developed and tested, based on the structure of sentences and on an associative-ontological approach to text analysis [16]. Before question-answer pairs are extracted by any of the developed methods, the received texts undergo preliminary processing, in this case graphematic analysis [17], which includes determining the boundaries of paragraphs, sentences and words, taking into account the structure of sentences.
Each word in the text is lemmatized, that is, normalized using the function m of morphological analysis (the m-function). In this context, normalization means obtaining the base form of the word.
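The preprocessing stage can be sketched as follows. This is only an illustration under simplifying assumptions: paragraphs are separated by blank lines, a sentence ends at «.», «!» or «?» followed by whitespace, and the m-function is replaced by a hypothetical lookup table (`LEMMA_TABLE`) instead of a real morphological analyzer.

```python
import re

# Toy stand-in for the m-function: a hypothetical lookup table,
# not a real morphological analyzer.
LEMMA_TABLE = {
    "systems": "system",
    "questions": "question",
    "answers": "answer",
    "is": "be",
    "are": "be",
}


def m(word):
    """Return the base form of a word (falls back to the lowercased word)."""
    lowered = word.lower()
    return LEMMA_TABLE.get(lowered, lowered)


def split_paragraphs(text):
    """Paragraphs are assumed to be separated by blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Simplification: a sentence ends at ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph)
    return [s.strip() for s in parts if s.strip()]


def split_words(sentence):
    """Words are runs of letters or digits, optionally joined by hyphens."""
    return re.findall(r"\w+(?:-\w+)*", sentence)


def analyze(text):
    """Return paragraphs -> sentences -> lists of lemmas."""
    return [
        [[m(w) for w in split_words(s)] for s in split_sentences(p)]
        for p in split_paragraphs(text)
    ]


sample = "QA systems answer questions. Are answers short?\n\nYes."
print(analyze(sample))
```

The nested result keeps paragraph and sentence boundaries, which the pair selection methods rely on.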

The method of QA-pairs delivering based on the sentence structure analysis
For texts such as lists of frequently asked questions (FAQs), as well as prose texts containing dialogs and direct speech, an effective method is one based on the analysis of the sentence structure obtained by preprocessing the text and expressed in the form of the following heuristic rule.
A sentence containing direct speech is a sentence satisfying any of the following conditions:
- the first symbol of the sentence is the symbol «-»;
- within the sentence, a pair of symbols occurs in sequence: the first character is an element of the set {«,», «.», «!», «?», «"»}, the second character is the «-» symbol;
- within the sentence, the symbols «:» and «"» occur in sequence.
The author's words are deleted from the sentences obtained. The author's words are a text fragment that satisfies any of the following conditions:
- the fragment is located after a pair of characters: the first character is an element of the set {«,», «.», «!», «?», «"»}, the second character is the «-» symbol;
- the fragment is delimited by the symbols «-»;
- the fragment is located before the sequence of characters «:» and «"».
Sentences that do not contain direct speech are considered in their original form, because they do not need this preprocessing.
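The direct-speech rule above can be approximated with regular expressions. The sketch below is one possible reading, not the authors' code. It assumes that en and em dashes are equivalent to «-», that whitespace may separate the two characters of a pair, and that the author's words after a punctuation-dash pair run to the end of the sentence. The middle removal condition (a fragment delimited by dashes on both sides) is deliberately not implemented.

```python
import re

# Assumption: hyphen, en dash and em dash all play the role of the «-» symbol.
DASHES = "-\u2013\u2014"
# A character from the set {, . ! ? "} followed (optionally after spaces) by a dash.
PUNCT_THEN_DASH = re.compile(r'[,.!?"]\s*[' + DASHES + r"]")
# A colon followed (optionally after spaces) by a double quote.
COLON_THEN_QUOTE = re.compile(r':\s*"')


def contains_direct_speech(sentence):
    """Check the three detection conditions of the heuristic rule."""
    s = sentence.strip()
    return bool(s) and (
        s[0] in DASHES
        or PUNCT_THEN_DASH.search(s) is not None
        or COLON_THEN_QUOTE.search(s) is not None
    )


def strip_author_words(sentence):
    """Remove author's words (simplified: two of the three conditions)."""
    s = sentence.strip()
    # Author's words located before the sequence «:» «"».
    match = COLON_THEN_QUOTE.search(s)
    if match:
        s = s[match.end():]
    # Author's words located after a punctuation-dash pair:
    # keep the punctuation mark, drop the dash and everything after it.
    match = PUNCT_THEN_DASH.search(s)
    if match:
        s = s[: match.start() + 1]
    # Drop the leading dash and the surrounding quotes of the utterance.
    s = s.lstrip(DASHES + " ")
    return s.strip().strip('"').strip()


for example in (
    "- Where are you going? - asked Ivan.",
    'He asked: "Where are you going?"',
):
    print(contains_direct_speech(example), strip_author_words(example))
```

Because ordinary hyphenated words are preceded by a letter rather than by a punctuation mark, they do not trigger the second detection condition.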
Interrogative sentences are then extracted from the text. These sentences satisfy the following condition: (the sentence contains more than two words) AND (the sentence ends with the symbol «?»). After an interrogative sentence, within the same paragraph, a sentence is selected that satisfies the conditions: (the sentence must not end with a «?» symbol) AND (the sentence contains at least one word). Such a sentence is considered the answer to the question posed. If no sentence in this paragraph satisfies these conditions, the question is considered to have no answer and is not entered into the database.
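The selection of pairs can be sketched as below. The sketch assumes one interpretation of the rule: for each question, the first later sentence of the same paragraph that satisfies the answer conditions is taken. Function names are illustrative.

```python
import re


def words(sentence):
    """Words are approximated as runs of letters or digits."""
    return re.findall(r"\w+", sentence)


def is_question(sentence):
    """More than two words AND ends with «?»."""
    return sentence.rstrip().endswith("?") and len(words(sentence)) > 2


def is_answer(sentence):
    """Does not end with «?» AND contains at least one word."""
    return not sentence.rstrip().endswith("?") and len(words(sentence)) >= 1


def extract_pairs(paragraphs):
    """paragraphs: list of paragraphs, each given as a list of sentences."""
    pairs = []
    for sentences in paragraphs:
        for i, sentence in enumerate(sentences):
            if not is_question(sentence):
                continue
            # First qualifying sentence after the question, same paragraph only.
            answer = next((s for s in sentences[i + 1:] if is_answer(s)), None)
            if answer is not None:
                pairs.append((sentence, answer))
    return pairs


faq = [
    ["How do I reset my password?", "Use the reset link on the login page."],
    ["Is the service free of charge?"],   # no answer in this paragraph
    ["Why?", "Because."],                 # question is too short
]
print(extract_pairs(faq))
```

Under this reading, several consecutive questions in one paragraph would all receive the same answer; taking only the immediately following sentence is an equally plausible variant.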
These heuristic rules can be written in the form of a generative grammar and implemented as a finite automaton.

The method of QA-pairs delivering based on associative-ontological approach
The method based on associative-ontological analysis is primarily focused on the class of reference and dictionary texts and it is based on the assumption that in the descriptive text there is a sentence (or a group of sentences) containing the main idea of the text. In this case, the title of the text (including that indicated through the meta tags of the online document) can be considered as a question, and this sentence (or a group of sentences) is the answer.
The use of abstracting algorithms based on the associative-ontological approach to the processing of texts in natural language [18] makes it possible to automate the selection of meaning-generating sentences through the semantic reduction of the text. The abstracting of texts is based on bi-grams, where a bi-gram is a pair of words found in one sentence. A pair of words that often occur in one sentence is considered associated, and the more often this bi-gram occurs, the stronger the connection. The sentences containing the concepts whose sum of connections is greatest reflect the subject area described in the text better than all others.
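The bi-gram idea can be illustrated with a small sketch. It is a simplified reading rather than the algorithm of [18]: the input is assumed to be already lemmatized with stop words removed, the strength of a connection is the number of sentences in which the pair occurs, and a sentence is scored by summing the connection totals of its distinct words.

```python
from collections import Counter
from itertools import combinations


def bigram_counts(sentences):
    """Count in how many sentences each unordered pair of lemmas co-occurs."""
    counts = Counter()
    for lemmas in sentences:
        for pair in combinations(sorted(set(lemmas)), 2):
            counts[pair] += 1
    return counts


def concept_weights(counts):
    """Sum of connection strengths for every lemma."""
    weights = Counter()
    for (first, second), strength in counts.items():
        weights[first] += strength
        weights[second] += strength
    return weights


def key_sentence_index(sentences):
    """Index of the sentence whose concepts have the largest total weight."""
    if not sentences:
        raise ValueError("at least one sentence is required")
    weights = concept_weights(bigram_counts(sentences))
    scores = [sum(weights[w] for w in set(lemmas)) for lemmas in sentences]
    return max(range(len(sentences)), key=scores.__getitem__)


lemmatized = [
    ["crawler", "download", "page"],
    ["crawler", "parse", "page", "title"],
    ["weather", "nice"],
]
print(key_sentence_index(lemmatized))
```

The document title paired with the selected sentence would then form a candidate question-answer pair. Summing weights favors longer sentences, so a practical variant might normalize the score by sentence length.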

The experiments and discussion
For the experimental verification of the possibility of creating an open-domain QA-system based on the automatic collection of question-answer pairs from the Internet, a prototype of the collection module was developed; it works in conjunction with the web crawler of the monitoring system for Internet resources [18,19]. The system processed 310,239 documents with a useful text volume of 1.92 GB (not counting the layout of the documents and media data). The analysis of the texts yielded 2,230,325 question-answer pairs; the database size is 710 MB. The quantitative results obtained during the experimental verification of the various methods are presented in the Table.

Table. The obtained QA-pairs quantity
Method | QA-pairs quantity
The method based on the sentence structure analysis without the direct speech registration | 529117
The method based on the sentence structure analysis for direct speech | 1080730
The method of QA-pairs delivering based on associative-ontological approach | 310239

Among the texts containing recorded communicative acts, the greatest contribution to the formation of the database of question-answer pairs, mainly owing to the high density of question-answer pairs within each document, was made by:
- the prose texts containing dialogues between characters (26 %);
- the sections of frequently asked questions (FAQ) (17 %);
- the reference and dictionary sources processed with the algorithm based on associative-semantic analysis (21 %);
- the user-generated content (UGC): forums, blogs, comments;
- the documentary texts and news content.
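As a cross-check, the average yield can be recomputed from the totals reported above. The sketch assumes binary units (1 GB = 1024 × 1024 KB), which is not stated in the source.

```python
DOCUMENTS = 310_239        # processed documents
QA_PAIRS = 2_230_325       # collected question-answer pairs
TEXT_GB = 1.92             # useful text volume
KB_PER_GB = 1024 * 1024    # assumption: binary units

pairs_per_document = QA_PAIRS / DOCUMENTS
kb_per_pair = TEXT_GB * KB_PER_GB / QA_PAIRS

print(f"{pairs_per_document:.2f} pairs per document")
print(f"{kb_per_pair:.2f} KB of text per pair")
```

With these totals the yield is a little over seven pairs per document and slightly less than 1 KB of text per pair.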

Conclusion
A prototype of a system for collecting question-answer pairs was developed on the basis of the actual material contained in the public domain of the Internet.
Available pages were downloaded using web crawler technology, which crawls links in conjunction with a headless browser that parses the original format of the loaded document.
Two methods were tested for identifying question-answer pairs: a method based on analysis of the structure of sentences, and a method based on an associative-ontological analysis of texts.
Based on the analysis of the results obtained by the developed methods, it can be stated that for this particular sample the average number of question-answer pairs was about 7.19 per document (2,230,325 pairs from 310,239 documents), or roughly one question-answer pair per 1 KB of text.
At the same time, an expert evaluation of the quality and completeness of the database, carried out using the interactive prototype, showed the impossibility of obtaining adequate answers for most of the specified search queries that the expert asked the search system without regard to the subject area.
This indicates the limited ability to create an open-domain (not specialized) QA-system only by directly identifying question-answer pairs from unstructured text sources currently available in the public domain of the Internet.
In conclusion, the authors are pleased to express their sincere gratitude to Professor A.V. Bogomolov for his constructive criticism and for the joint discussion of the problems of human-machine interaction in the framework of medical and biological research, and to congratulate him on his forty-fifth anniversary.