Review of spoken dialogue systems

Spoken dialogue systems are computer programs developed to interact with users employing speech in order to provide them with specific automated services. The interaction is carried out by means of dialogue turns, which many researchers aim to make as similar as possible to those between humans in terms of naturalness, intelligence and affective content. In this paper we describe the fundamentals of these systems, including the main technologies employed for their development. We also present an evolution of this technology and discuss some current applications. Moreover, we discuss development paradigms, including scripting languages and the development of conversational interfaces for mobile apps. The correct modelling of the user is a key aspect of this technology, which is why we also describe affective, personality and contextual models. Finally, we address some current research trends in terms of verbal communication, multimodal interaction and dialogue management.


INTRODUCTION
Continuous advances in the development of information technologies have made it possible to access information, web applications and services from nearly anywhere, at any time and almost instantaneously through wireless connections. Devices such as smartphones and tablets are widely used today to access the web. However, the contents are usually accessible only through web browsers, which are operated by means of traditional graphical user interfaces (GUIs).
Advanced paradigms of human-machine interaction, like the ones proposed by Ambient Intelligence and Smart Environments, emphasize greater user-friendliness, more efficient services support, user-empowerment, and support for human interactions. In this vision, people will be surrounded by intelligent and intuitive interfaces embedded in everyday objects around them, and by an environment that recognises and responds to the presence of individuals in a transparent way (Kovács & Kopacsi, 2006). This is why the systems proposed by these paradigms usually consist of a set of interconnected computing and sensing devices which surround the user pervasively in their environment and are invisible to them, providing a service that is dynamically adapted to the interaction context, so that users can interact naturally (De Silva, Morikawa, & Petra, 2012).
To ensure such a natural and intelligent interaction, it is necessary to provide an effective, easy, safe and transparent interaction between the user and the system. With this objective, as an attempt to enhance and ease human-to-computer interaction, in recent years there has been an increasing interest in simulating human-to-human communication, employing the so-called Spoken Dialogue Systems (SDSs; López-Cózar & Araki, 2005; McTear, 2004; Pieraccini, 2012). These systems have become a strong alternative to enhance computers with intelligent communicative capabilities employing speech, which is one of the most natural and flexible means of communication among humans.
SDSs can be defined as computer programs that accept speech as input and produce speech as output, engaging in a conversation with the user considering a given task. One goal of these systems is to make speech-based technologies more usable. Initially, they were used to ease interaction in simple tasks, such as provision of air travel information (Hempel, 2008). Nowadays, they are used in more complex scenarios, such as Intelligent Environments (Heinroth & Minker, 2013; Minker et al., 2006), in-car applications (Geutner, Steffens, & Manstetten, 2002), personal assistants (e.g., Siri, Google Now or Microsoft's Cortana; Janarthanam et al., 2013), smart homes (Krebber et al., 2004), and interaction with robots (Foster, Giuliani, & Isard, 2014). Another goal is to make these technologies more accessible, especially for disabled and elderly people (Beskow et al., 2009; Vipperla, Wolters, & Renals, 2012), and to build assistants that are able to hold long-term relations with their users (Andrade et al., 2014; Bouakaz et al., 2014), which implies multifaceted research questions such as engagement and user modelling.
In this paper we present a review of the state of the art of this technology, discussing its main advantages and pointing out some research trends. In Section 2 we discuss the fundamentals of how these systems work, addressing the main technologies employed. These technologies are used to implement several system modules, the characteristics of which vary depending on a number of factors, for example, the goal of the modules, the possibility of manually defining the behaviours of the modules, and the capability of automatically obtaining the modules from training samples.
In Section 3 we present an evolution of the technology, including some initial systems and research projects. Moreover, we discuss some sample applications in terms of health, education and embodied conversational agents.
In Section 4 we address current development paradigms to reduce the time and effort required in the processes of design, implementation and evaluation. More specifically, we focus on scripting languages and the development of conversational interfaces for mobile apps. The spoken dialogue industry has reached a maturity based on standards that pervade technology to provide high interoperability. This makes it possible to divide the market into a vertical structure of technology vendors, platform integrators, application developers, and hosting companies.
With regard to the evaluation of these systems, it is very difficult to define new procedures and measures that will be unanimously accepted by the scientific community (Lemon & Pietquin, 2012). This field can be considered to be in an initial phase of development. PARADISE (PARAdigm for DIalogue System Evaluation) is the most widely proposed methodology to perform a global evaluation of a dialogue system (Dybkjaer, Bernsen, & Minker, 2004; Walker, Litman, Kamm, & Abella, 1998). This methodology combines different measures regarding task success, dialogue efficiency and dialogue quality in a single function that measures the yield of the system in direct correlation with user satisfaction. The EAGLES evaluation working group (Expert Advisory Group on Language Engineering Standards) proposes different quantitative and qualitative measures (EAGLES, 1996). In the same line, the DISC project (Spoken Language Dialogue Systems and Components) (Failenschmid, Williams, Dybkjaer, & Bernsen, 1999) proposes different measures and criteria to be considered in the evaluation. More recent evaluation initiatives are focused on the assessment of usability and objective estimation of the quality of spoken dialogue interfaces (Möller, Engelbrecht, & Schleicher, 2008; Möller & Heusdens, 2013).
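As a point of reference, the single performance function of PARADISE is usually written in the form below. The symbol names are notation chosen for this sketch: \kappa is the kappa coefficient measuring task success, the c_i are the cost measures (dialogue efficiency and quality), \mathcal{N} is a z-score normalisation, and the weights \alpha and w_i are typically estimated by multiple linear regression against user satisfaction ratings.

```latex
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i),
\qquad
\mathcal{N}(x) = \frac{x - \bar{x}}{\sigma_x}
```

The normalisation places task success and the different costs on a common scale, so that the weights can be compared with each other.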
In Section 5 we discuss how to model the user to build more adaptive systems. Human speakers adapt their messages and the way they convey them to their interlocutors in a conversation, also taking into account the context in which the dialogue takes place. The systems must be able to model this behaviour and try to replicate it.
Finally, in Section 6 we discuss how specialists have recently envisioned future dialogue systems as being intelligent, adaptive, proactive, portable and multimodal. These concepts are not mutually exclusive: for example, the system's intelligence can also be involved in the degree to which it can adapt to new situations, and this adaptiveness can result in better portability for use in different environments.

FUNDAMENTS
SDSs are complex to set up because their implementation requires employing a number of technologies to process human language, which is a very complex task. Generally speaking, these systems are built employing five main technologies: Automatic Speech Recognition (ASR), Spoken Language Understanding (SLU), Dialogue Management (DM), Natural Language Generation (NLG) and Text-to-Speech synthesis (TTS). Additionally, the systems typically employ other technologies to store the dialogue history. Figure 1 shows a conceptual module structure of such systems, in which the flow of information between the modules can be observed.

Automatic Speech Recognition
The module that implements ASR is called the speech recogniser. Its goal is to receive the user's speech and generate as output a recognition hypothesis, which is the sequence of words that most likely corresponds to what the user has said (Rabiner & Huang, 1993). Unfortunately, in many cases the recognition hypothesis contains errors in the form of inserted, substituted or deleted words. For example, the user may say: "Please I want to book a flight from Boston to New York" and the speech recognition result might be: "want to book a flight from Denver to New York." Note that in this case, the words "Please" and "I" have been deleted in the speech recognition result, and that the word "Boston" has been replaced with the word "Denver." ASR errors can be due to a number of factors, including environmental conditions (e.g., noise), acoustic similarity between words, and phenomena concerned with spontaneous speech, such as false starts, filled pauses and hesitations.
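The three error types in the flight example can be made concrete with a word-level alignment between what was said and what was recognised. The following is a minimal sketch using a standard edit-distance alignment; the function name `align_errors` and the lower-casing of words are choices made for this illustration, not part of any particular recogniser.

```python
def align_errors(reference, hypothesis):
    """Align two word sequences and list substituted, deleted and inserted words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()

    # cost[i][j] = minimum number of edits needed to turn ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the end of both sequences to recover one optimal alignment.
    errors = {"substituted": [], "deleted": [], "inserted": []}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                errors["substituted"].append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            errors["deleted"].append(ref[i - 1])
            i -= 1
        else:
            errors["inserted"].append(hyp[j - 1])
            j -= 1
    for words in errors.values():
        words.reverse()
    return errors


errors = align_errors(
    "Please I want to book a flight from Boston to New York",
    "want to book a flight from Denver to New York",
)
print(errors)
```

For the example sentence pair, the alignment reports "please" and "i" as deleted and "boston" as substituted by "denver", with no insertions.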

Stochastic approach
Several approaches to ASR can be found in the literature, but the most widely used today is the stochastic one, which is based on acoustic and language models corresponding to a given language, e.g., English (Huang, Acero, & Hon, 2001).
On the one hand, the acoustic models represent the basic speech units of which the words are comprised (e.g., phonemes), and are usually modelled using Hidden Markov Models (HMMs). Mostly, Gaussian Mixture Models (GMMs) are used to determine how well each state of each HMM fits a frame or a short window of frames of coefficients that represent the acoustic input. However, there are more recent methods for carrying out the fit. For example, Hinton et al. (2012) proposed to use Artificial Neural Networks (ANNs) to take into account several frames of coefficients and produce posterior probabilities over HMM states.
On the other hand, the language models determine the sentences that are expected to be uttered by the user. In most systems available in the literature, these models are compiled automatically from an analysis of a corpus of sentences in text format regarding the system's application domain. The goal of this process is to obtain statistical information regarding the appearance of a word in a sentence, given a previous history of words. Thus, the corpus of sentences must be large enough to enable obtaining significant statistics, typically at least several thousand words. Usually, the set of all the words considered by these models is stored in the so-called dictionary. However, recent approaches to ASR employ wide-coverage models that do not require a dictionary. For example, the approach used in the Google speech API employs a knowledge graph that provides vocabulary related to several million entities, such as people, places and things.
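The idea of estimating the appearance of a word given a history of words can be illustrated with the simplest case, a history of one word (a bigram model). The sketch below uses plain maximum-likelihood counts with no smoothing, and the three-sentence corpus is invented for the example; real language models are trained on far larger corpora and use longer histories and smoothing.

```python
from collections import Counter, defaultdict


def train_bigram_model(sentences):
    """Count how often each word follows each one-word history."""
    history_counts = Counter()
    bigram_counts = defaultdict(Counter)
    for sentence in sentences:
        # <s> and </s> mark the start and the end of a sentence.
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        for previous, word in zip(words, words[1:]):
            history_counts[previous] += 1
            bigram_counts[previous][word] += 1
    return history_counts, bigram_counts


def bigram_probability(model, previous, word):
    """Maximum-likelihood estimate of P(word | previous); 0.0 for unseen histories."""
    history_counts, bigram_counts = model
    if history_counts[previous] == 0:
        return 0.0
    return bigram_counts[previous][word] / history_counts[previous]


# Invented toy corpus for the flight domain.
corpus = [
    "a flight from boston",
    "a flight to denver",
    "a ticket from boston",
]
model = train_bigram_model(corpus)
print(bigram_probability(model, "a", "flight"))
print(bigram_probability(model, "from", "denver"))
```

In this toy corpus "flight" follows "a" in two of three cases, whereas "denver" never follows "from", which is why a corpus that is too small yields unreliable statistics.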

N-best recognition
In the previous section we mentioned that the goal of the speech recogniser is to receive the user's speech and generate as output a recognition hypothesis, which is the sequence of words that most likely corresponds to what the user has said. However, many SDSs use a method called N-best recognition, in which case the recogniser generates a list of up to N recognition hypotheses, instead of just one. Typically, this list is ranked in terms of likelihood, in such a way that the first hypothesis in the list is the one with the highest likelihood, the second hypothesis is the one with the second highest likelihood, and so forth.
This method is commonly used by SDSs because sometimes the correct recognition hypothesis is not the top-ranked one, but one of the lower-ranked hypotheses. Hence, it is possible for the speech recogniser to consider additional information provided by other modules of the dialogue system to re-score the hypotheses in the N-best list, thus replacing the initially top-ranked hypothesis with a different one. For example, the recogniser can employ semantic information provided by the SLU module to discard hypotheses in the list or re-score them if they do not have a correct semantic meaning. Moreover, the recogniser can take into account contextual information to re-score the hypotheses.
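The re-scoring of an N-best list with semantic information can be sketched as follows. Everything here is an assumption made for illustration: the city set, the toy "from X to Y" check that stands in for the SLU module, the penalty factor, and the scores in the example list.

```python
import re

# Hypothetical domain knowledge standing in for the SLU module.
KNOWN_CITIES = {"boston", "denver", "new york"}
INVALID_PENALTY = 0.5  # assumed multiplicative penalty for invalid hypotheses


def is_semantically_valid(hypothesis):
    """Toy semantic check: the sentence must end in 'from X to Y' with known cities."""
    match = re.search(r"from (.+?) to (.+)$", hypothesis.lower())
    if match is None:
        return False
    return match.group(1) in KNOWN_CITIES and match.group(2) in KNOWN_CITIES


def rescore(n_best):
    """Re-rank (hypothesis, score) pairs, penalising semantically invalid ones."""
    rescored = [
        (text, score if is_semantically_valid(text) else score * INVALID_PENALTY)
        for text, score in n_best
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Invented N-best list, ranked by the recogniser's own score.
n_best = [
    ("book a flight from bossed on to new york", 0.52),
    ("book a flight from boston to new york", 0.48),
    ("look a fight from boston to new york", 0.30),
]
best_text, best_score = rescore(n_best)[0]
print(best_text)
```

Here the initially top-ranked hypothesis fails the semantic check, so the second hypothesis moves to the top of the list. Discarding invalid hypotheses outright, as the text also mentions, would correspond to a penalty of zero.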

Confidence scores
Many SDSs employ techniques to process the ASR results and obtain scores regarding the speech recogniser's confidence in the recognised words. These scores are typically real numbers in the range 0.0 to 1.0, which are attached to the words. A low value of the confidence score attached to a given word represents low confidence in the correct recognition of the word, whereas a high score denotes the opposite. These scores can be very important for the performance of an SDS, since by using them the system can decide to confirm a word if its confidence score is below a certain confidence threshold.
A method to compute the confidence scores followed by several researchers takes into account the N-best list of recognition hypotheses, and assigns a higher (or lower) score to each word considering whether the word appears in a large (or small) number of hypotheses (Cox & Cawley, 2003; Liu & Fung, 2003).
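The intuition behind this N-best-based scoring can be sketched in a few lines. This is a deliberately simplified reading, assuming that the score of a word is just the fraction of hypotheses that contain it; the cited methods are more elaborate, and the N-best list below is invented.

```python
def word_confidences(n_best):
    """Score each word of the top hypothesis by the fraction of hypotheses containing it."""
    if not n_best:
        return []
    hypotheses = [hypothesis.lower().split() for hypothesis in n_best]
    total = len(hypotheses)
    return [
        (word, sum(word in words for words in hypotheses) / total)
        for word in hypotheses[0]
    ]


# Invented N-best list; the first entry is the top-ranked hypothesis.
n_best = [
    "flight from denver to new york",
    "flight from boston to new york",
    "flight from boston to newark",
    "flight from austin to new york",
]
for word, confidence in word_confidences(n_best):
    print(f"{word} ({confidence:.2f})")
```

Words on which the hypotheses agree, such as "flight", receive a score of 1.0, while "denver" appears in only one of the four hypotheses and therefore receives a low score.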
N-best lists can also be used to store the possible outputs of the SLU module given the ASR result, which will be discussed in the next section. For example, this can be useful for dialogue state tracking, whose goal is to estimate the user's goal as the dialogue progresses (Wang & Lemon, 2013). Recent work on the processing of N-best lists and confidence scores can be found in the Belief Tracking approach embodied in the Dialogue State Tracking Challenges (DSTC; Williams, 2012).

Spoken Language Understanding
As can be observed in Figure 1, the output of the speech recogniser is the input to the Spoken Language Understanding (SLU) module. The goal of this module is to obtain a semantic representation of the input, which typically is stored in the form of one or more frames (Allen, 1995). Essentially, a frame is a kind of record comprising several fields, which are called slots. For example, an SDS developed to provide flight information and register flight bookings might use a simple frame comprised of the following slots to understand the data in the ASR results:
    speechActType
    departureCity
    destinationCity
    departureDate
    arrivalDate
    airLine
Thus, if we consider again the example on flight booking mentioned in Section 2.1, an ASR result could be as follows (confidence scores are noted within brackets):
    want (0.8676) go (0.6745) book (0.7853) a (0.7206) flight (0.6983) from (0.6205) Denver (0.3935) to (0.6874) new (0.8562) York (0.9876)
Thus, the frame obtained from the analysis of this sentence might be:
    speechActType: flightBooking (0.6745)
    departureCity: Denver (0.3935)
    destinationCity: New York (0.8562)
In this frame, the confidence scores have been attached to the values of the slots. According to the frame, the dialogue system has correctly understood that the user wants to make a flight booking, and that the destination city is New York. However, it has incorrectly understood the departure city due to an ASR error.
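A keyword-based version of this frame filling can be sketched as follows. The rules are assumptions made for the illustration: the word "book" triggers the flightBooking speech act, a known city after "from" or "to" fills the corresponding slot, and the confidence of a slot value is the minimum confidence of the words that support it. That last rule happens to reproduce the scores of the example frame, but the source does not state how those scores were derived.

```python
# Hypothetical city lexicon; multi-word names are listed first.
CITIES = [["new", "york"], ["denver"], ["boston"]]
SLOT_FOR_PREPOSITION = {"from": "departureCity", "to": "destinationCity"}


def match_city(asr_result, start):
    """Return (city, confidence) if a known city starts at position `start`, else None."""
    for city in CITIES:
        window = asr_result[start:start + len(city)]
        if [word.lower() for word, _ in window] == city:
            return " ".join(city).title(), min(conf for _, conf in window)
    return None


def fill_frame(asr_result):
    """Build a frame {slot: (value, confidence)} from (word, confidence) pairs."""
    frame = {}
    words = [word.lower() for word, _ in asr_result]
    if "book" in words:
        # Assumed rule: every word up to the trigger word supports the speech act.
        end = words.index("book") + 1
        frame["speechActType"] = ("flightBooking", min(c for _, c in asr_result[:end]))
    for position, word in enumerate(words):
        slot = SLOT_FOR_PREPOSITION.get(word)
        if slot is not None:
            city = match_city(asr_result, position + 1)
            if city is not None:
                frame[slot] = city
    return frame


asr_result = [
    ("want", 0.8676), ("go", 0.6745), ("book", 0.7853), ("a", 0.7206),
    ("flight", 0.6983), ("from", 0.6205), ("Denver", 0.3935), ("to", 0.6874),
    ("new", 0.8562), ("York", 0.9876),
]
for slot, (value, confidence) in fill_frame(asr_result).items():
    print(f"{slot}: {value} ({confidence})")
```

Slots for which the sentence provides no evidence, such as departureDate, are simply left out of the frame.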
The task to be performed by the SLU module is very challenging due to the specific difficulties inherent in the processing of natural language, such as ambiguity, anaphora and ellipsis. To carry out SLU, this module typically employs grammar rules or statistical approaches, or some combination of both (Griol, Callejas, López-Cózar, & Riccardi, 2014). Also, it can employ the information in the dialogue history module (see Figure 1), which keeps track of previous system and user turns in the current dialogue. The goal is to find out whether the user has recently provided specific words which could be considered implicit in the context and thus available for sentence understanding.
Moreover, in many cases the SLU module must deal with the errors made by the ASR module, which can make the sentences ungrammatical. To deal with these problems, a number of techniques can be employed, such as relaxing the grammars, focusing the analysis on keywords, carrying out partial analyses of the recognised sentences, and employing statistical approaches (He & Young, 2005; Lemon & Pietquin, 2012).

Dialogue Management
As can be observed in Figure 1, the output of the SLU module is the input to the module that implements the Dialogue Management (DM), which is typically termed dialogue manager. The goal of this module is to decide what the system must do next in response to the user's input (McTear, 2004), such as providing information to the user, prompting the user to confirm words that the system is uncertain of, and prompting the user to rephrase the sentence. For example, from an inspection of the frame shown in the previous section, the dialogue manager may decide to generate a confirmation request for the departure city given that its confidence score is very low (0.3935).
To provide information to the user, the dialogue manager usually queries a local database and/or looks for data on the Internet. Moreover, it takes into account information about previous dialogue turns, which is kept in the dialogue history module. This information is important to guide the decision of the dialogue manager towards accomplishing its task. For example, from the information in this module the dialogue manager can notice that all the data regarding a flight booking except the departure date have already been obtained from the user. Hence, the dialogue manager may decide to prompt the user for the missing data.
A number of approaches can be found in the literature for carrying out dialogue management, such as rule-based approaches, plan-based approaches and statistical approaches based on reinforcement learning (Frampton & Lemon, 2009).
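The decisions described in this section (confirming uncertain values, prompting for missing data, and providing information) can be sketched as a very small rule-based dialogue manager. The required slots, the threshold value of 0.5 and the action names are assumptions for this illustration, not values taken from any specific system.

```python
# Assumed configuration for a toy flight-booking task.
REQUIRED_SLOTS = ["departureCity", "destinationCity", "departureDate"]
CONFIDENCE_THRESHOLD = 0.5


def next_action(frame):
    """Choose the next system action from a frame {slot: (value, confidence)}."""
    # Rule 1: confirm any value the system is uncertain of.
    for slot, (value, confidence) in frame.items():
        if confidence < CONFIDENCE_THRESHOLD:
            return ("confirm", slot, value)
    # Rule 2: prompt the user for the first piece of missing data.
    for slot in REQUIRED_SLOTS:
        if slot not in frame:
            return ("request", slot, None)
    # Rule 3: everything is known with enough confidence, so look up the information.
    return ("query_database", None, None)


frame = {
    "speechActType": ("flightBooking", 0.6745),
    "departureCity": ("Denver", 0.3935),
    "destinationCity": ("New York", 0.8562),
}
print(next_action(frame))
```

With the frame from the SLU example, the low score of the departure city triggers a confirmation request. Once that value has been confirmed with a higher score, the same rules would lead the system to ask for the missing departure date.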

Natural Language Generation
The dialogue manager's decision about what the system must do next is the input to the module that carries out the Natural Language Generation (NLG). As the decision is represented abstractly, the goal is to transform it into one or more sentences in text format that must be grammatically and semantically correct, as well as coherent with the current status of the dialogue (Lemon, 2011; López, Eisman, Castro, & Zurita, 2012). Several approaches can be found in the literature for this purpose. Many systems typically employ the simplest one, which is called template-based, and relies on the use of a number of templates to generate a number of sentence types (Baptist & Seneff, 2000). Some parts of the templates are fixed whereas others represent gaps that must be instantiated with data provided by the dialogue manager. For example, the following template can be used to generate sentences regarding available flights connecting two cities:
    TTS_Template_1 ::= I found <flightAmount> FLIGHT_S/P from <departureCity> to <destinationCity> leaving on <departureDate>
In this template, the gaps are represented by means of angle brackets (e.g., <flightAmount>) and FLIGHT_S/P is a function that returns either the singular or the plural form for the word "flight," depending on the value of the <flightAmount> gap. For example, a sentence that this template can generate is as follows: "I found three flights from Madrid to New York leaving on Friday."
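The template above can be instantiated with a few lines of code. This is a minimal sketch: the Python names, the use of `str.format` for the gaps, and the small number-word table are choices made for the illustration.

```python
# The template from the text, with the gaps written as named format fields.
TEMPLATE_1 = (
    "I found {flightAmount} {flight_s_p} from {departureCity} "
    "to {destinationCity} leaving on {departureDate}."
)

# Small illustrative table; larger amounts fall back to digits.
NUMBER_WORDS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five"}


def flight_s_p(flight_amount):
    """Counterpart of FLIGHT_S/P: singular or plural form of the word 'flight'."""
    return "flight" if flight_amount == 1 else "flights"


def generate_flight_sentence(flight_amount, departure_city, destination_city, departure_date):
    """Fill the gaps of the template with data provided by the dialogue manager."""
    return TEMPLATE_1.format(
        flightAmount=NUMBER_WORDS.get(flight_amount, str(flight_amount)),
        flight_s_p=flight_s_p(flight_amount),
        departureCity=departure_city,
        destinationCity=destination_city,
        departureDate=departure_date,
    )


print(generate_flight_sentence(3, "Madrid", "New York", "Friday"))
# prints "I found three flights from Madrid to New York leaving on Friday."
```

Keeping the singular/plural decision in a separate function mirrors the role of FLIGHT_S/P in the template and lets other templates reuse it.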
In order to be coherent with the current status of the dialogue, the NLG module must generate sentences that consider what has already been said in the dialogue. This implies omitting some words in the sentences if these have been already mentioned (ellipsis) and using pronouns instead of nouns (anaphora). To accomplish this task, this module uses the dialogue history module, which stores recently used words. This module must also avoid redundant information in the output, as well as information that is so closely related that the user could automatically infer one piece when hearing another. The process of removing such information is called sentence aggregation (Dalianis, 1999). More sophisticated and recent approaches than the template-based one can also be found in the literature, such as statistical approaches (Dethlefs, Hastie, Cuayáhuitl, & Lemon, 2013; Rieser, Lemon, & Keizer, 2014).

Text-to-Speech synthesis
Finally, the sentences in text format generated by the NLG module are the input to the last module shown in Figure 1. This module carries out the Text-to-Speech synthesis (TTS), that is, the transformation of the sentences into the dialogue system's speech (Dutoit, 1996). As opposed to other simple methods for speech synthesis based on concatenation of pre-recorded words, the TTS process allows transforming any arbitrary text into speech, thus avoiding the need to pre-record the words in the sentences.
TTS is very complex for a number of reasons. One is the possible existence in the sentences of abbreviations (e.g., Mr., Mrs. and Ms.) and other sequences of words (e.g., numbers) that cannot be transformed into speech directly. Another reason is that the pronunciation of words is not always the same and depends on a number of factors, such as the position in the sentence (e.g., beginning vs. ending) and the type of sentence (e.g., declarative vs. interrogative). Hence, the TTS process requires two steps. The first performs a transformation of the input to replace the abbreviations and other sequences of words with the corresponding words. The second does a linguistic analysis of the transformed input to include in it marks that indicate how to pronounce the words, for example, in terms of intonation and speed.
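The first of the two steps, replacing abbreviations and numbers with the corresponding words, can be sketched as follows. The abbreviation table, the spellings chosen for the expansions, and the restriction to whole numbers below 100 are assumptions made to keep the example short; the second step (adding intonation and speed marks) is not covered.

```python
# Hypothetical expansion table for the abbreviations mentioned in the text.
ABBREVIATIONS = {"Mr.": "Mister", "Mrs.": "Missus", "Ms.": "Miz"}

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen",
    "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]


def number_to_words(number):
    """Spell out a whole number between 0 and 99."""
    if number < 20:
        return ONES[number]
    tens, ones = divmod(number, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalise(text):
    """Replace abbreviations and small numbers with words a synthesiser can pronounce."""
    tokens = []
    for token in text.split():
        if token in ABBREVIATIONS:
            tokens.append(ABBREVIATIONS[token])
        elif token.isdecimal() and int(token) < 100:
            tokens.append(number_to_words(int(token)))
        else:
            tokens.append(token)  # left unchanged in this sketch
    return " ".join(tokens)


print(normalise("Mr. Smith booked 3 flights and 25 seats"))
# prints "Mister Smith booked three flights and twenty-five seats"
```

A realistic normaliser also has to handle dates, times, currency amounts and ambiguous abbreviations, which usually requires looking at the surrounding words.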

EVOLUTION OF THE TECHNOLOGY
Human beings have always wanted to be able to communicate with artificial companions. There are many examples in cinema and literature. Some of the most ancient examples can be found in Greek and Roman mythology in which heroes could communicate with statues of goddesses or warriors. The first serious attempts at building talking systems were initiated in the eighteenth and nineteenth centuries, when the first automata were built to imitate human behaviour. These first machines were mechanical, and it was not until the end of the nineteenth century that scientists concluded that speech could be produced electrically.

Initial systems and research projects
At the beginning of the twentieth century, Stewart (1922) built a machine that could generate vocalic sounds electrically. During the 30s, the first electric systems that could produce any type of sound were built. At the same time, the first systems with very basic natural language processing capabilities for machine translation applications appeared. During the 40s, the first computers were developed and some prominent scientists like Alan Turing pointed out their potential for applications demanding intelligence (Turing, 1950).
This was the starting point that fostered the research initiatives that in the 60s yielded the first language-based systems. For example, ELIZA (Weizenbaum, 1966) was based on keyword spotting and predefined templates to transform the user input into the system's answers.
Benefiting from the incessant improvements in the fields of ASR, natural language processing and speech synthesis, the first research initiatives related to SDSs appeared in the 80s. To some extent the origin of this research area is linked to two seminal projects: the DARPA Spoken Language Systems in the USA and the Esprit SUNDIAL in Europe. These projects were a starting point for the research at MIT and CMU, where some of the most important systems in academia have been created.
The DARPA Communicator project, which included multi-domain capabilities, stands out as one of the most important research projects in the 90s. This government-funded project aimed at the development of cutting-edge speech technologies, which could employ as an input not only speech but also other modalities.
Currently, experts have proposed higher level objectives to develop SDSs, such as providing them with advanced reasoning, problem solving capabilities, adaptiveness, proactiveness, affective intelligence, multimodality and multilinguality (Heinroth & Minker, 2013). These new objectives refer to the dialogue system as a whole, and represent major trends that in practice are achieved through the joint work in different areas and different components of the system.

Health
SDSs have also proven to be useful for providing the general public with access to telemedicine services, promoting patients' involvement in their own care, assisting in health care delivery, and improving patient outcome. Bickmore and Giorgino (2006) defined these systems as being "those automated systems whose primary goal is to provide health communication with patients or consumers primarily using natural language dialogue." These systems offer an innovative mechanism for providing cost-effective healthcare services within reach of patients who live in isolated regions, have financial or scheduling constraints, or simply appreciate confidentiality and privacy. Also, as they are based on speech, they are suitable for users with a wide range of computer, reading and health literacy skills. In general, healthcare professionals can only dedicate a very limited amount of time to each patient. Thus, patients can feel intimidated about asking questions or asking for information to be rephrased, or simply feel uncomfortable providing confidential information in face-to-face interviews. Many studies have shown that patients are more honest with a computer than a human clinician when disclosing potentially stigmatizing behaviours such as alcohol consumption, depression, and HIV risk behaviour (Ahmad et al., 2009; Ghanem, Hutton, Zenilman, Zimba, & Erbelding, 2005).

Education
Education is another important application domain for SDSs. According to Roda, Angehrn, and Nabeth (2001), educative technologies should accelerate the learning process, facilitate access, personalize the learning process, and supply a richer learning environment.
These aspects can be addressed by means of multimodal conversational agents by establishing a more engaging and human-like relationship between students and systems. This is why agents of this kind have been employed to develop a number of educational systems in very different domains, including tutoring (Pon-Barry, Schultz, Bratt, Clark, & Peters, 2006), conversation practice for language learners (Fryer & Carpenter, 2006), pedagogical agents and learning companions (Cavazza, de la Camara, & Turunen, 2010), dialogues to promote reflection and metacognitive skills (Kerly, Ellis, & Bull, 2008), or role-playing actors in simulated experiential learning environments (Griol, Molina, Sanchis de Miguel, & Callejas, 2012).
They have also been used for education and training, particularly in improving phonetic and linguistic skills, including assistance and guidance to F18 aircraft personnel during maintenance tasks (Bohus & Rudnicky, 2003), training soldiers in proper procedures for requesting artillery fire missions (Roque et al., 2006), and dialogue applications for computer-aided speech therapy with different language pathologies (Rodríguez, Saz, & Lleida, 2012).

Embodied conversational agents
Some of the most demanding applications for fully natural and understandable dialogues are embodied dialogue agents and personal companions. For example, Collagen is an application for building conversational assistants and collaborative agents (Rich & Sidner, 1998). AVATALK provides natural, interactive dialogues with responsive virtual humans (Hubal & Day, 2006). COMIC is a system developed for bathroom design using speech and gesture input/output, in collaboration with an avatar with facial emotions (Catizone, Setzer, & Wilks, 2003). NICE provides embodied historical and literary characters capable of natural, fun and experientially rich communication with children and adolescents (Corradini et al., 2004).

DEVELOPMENT PARADIGMS
As can be observed in Section 2, the dialogue system domain is highly multidisciplinary and benefits from the advances in multiple directions related to different specific areas (Williams et al., 2012). Thus, current SDSs are the consequence of the work on more reliable speech recognizers, more intelligible synthesized voices and more flexible conversational behaviours, among other achievements (McTear, 2011).
Considering this multidisciplinary nature, it is no surprise that the first hallmark in the development of these systems was the appearance of modular paradigms that allowed the developers to centre on their particular areas of interest, treating the other parts as black boxes. For instance, when the first speech recognizers and synthesizers became accessible, it was a huge advance for researchers and practitioners who centred on dialogue management, as they could focus on the aspects directly related to handling the conversation without worrying about the details of how to recognize the user input or synthesize the output. Pieraccini and Huerta (2008) highlighted the importance of "reusable components" as one of the main trends for the industry of dialogue systems, as it was and still is an important aspect to build increasingly complex applications by taking advantage of already existing modules.

Scripting languages
The development of SDSs has also benefited from the appearance of scripting languages that are similar to other widespread general purpose languages. The most salient example is VoiceXML. According to Levow (2012), this introduced some advantages including availability, robustness, ease of use, platform-independence, and flexibility. Soon other languages appeared to take advantage of the visual part of the web, for example SALT and X+V. However, speech-based web interaction with these languages has gradually lost support. Although they are still used to build some desktop systems (e.g., in Microsoft Speech API), most of the industrial platforms that hosted interpreters have disappeared. Nevertheless, now there seems to be an upsurge of voice navigation, and new initiatives have appeared, for example, the Web Speech API.

Development of conversational interfaces for mobile apps
We can also appreciate a major change in the SDS community, which is flourishing due to the availability of large quantities of speech data (Williams et al., 2012) and the possibilities offered by mobile devices and their operating systems (Neustein & Markowitz, 2013).
Speech interaction with mobile assistants in smartphones is now more popular than ever, in part due to the pertinence of speech as an interaction modality with small-sized devices, the increasing accuracy of the recognizers offered to developers, and the democratization of their development.
Android and iOS offer specific libraries for ASR and speech synthesis that allow building conversational agents focusing on the interaction only (McTear & Callejas, 2013). The development is carried out in general purpose object-oriented languages (e.g., Java and C#), and thus is accessible to more developers.
Robotics is also becoming increasingly relevant in the area: the appearance of open-hardware initiatives has brought more attention to this topic, and natural interaction is central in human-robot interaction studies (Graaf & Ben Allouch, 2013; Sekmen & Challa, 2013).
It is difficult to foresee how speech interfaces will be developed in the future. However, the access of an increasingly large number of developers to the community, the advance of statistical approaches, the increasing possibilities to access and share corpora, and the opportunities to reuse implementations of different developers establish a good basis for a promising future.

MODELLING THE USER
The advances in the field of SDSs described in the previous sections have provided an excellent opportunity to build richer user models. At the beginning, the capabilities of speech recognizers were limited to very small vocabularies, and so the developed applications were very simple and took into account very little information from the users. As the technology developed, researchers began to study how to adapt the recognition vocabulary and the synthesised messages to enhance the user experience. That is, the user, rather than the application domain, became the centre of the system design.
Numerous publications provide hints for voice interaction design, including insights on how to specify the requirements of SDSs taking into account the users (Cohen, Giangola, & Balogh, 2004;Harris, 2004;Kortum, 2008).Some authors have focused on particular users, and particularize the guidelines to certain profiles, for example, age and familiarity with the new technologies (Callejas, Griol, Engelbrecht, & López-Cózar, 2014).
However, nowadays the information about the user is not only considered in design time, it is included in modules that allow the system to dynamically adapt to the users' state.Currently it is possible to obtain and manage a huge amount of information about the users, not only about what they say, but also about how they say it, where the say it and even predict why they said it and what they will say next, and these abilities will be increasingly more sophisticated in the future thanks to the multidisciplinary perspectives of different sciences including computer science, linguistics, psychology and sociology.In the next subsections we provide more details on some of these dynamic sources of information about the users.

Affective models
Affective computing deals with the recognition, management and synthesis of emotions (Picard, 2003). It is particularly relevant for SDSs, which can use it to adapt to the user's state and to provide flexible, emotionally coloured responses for different purposes (Callejas, López-Cózar, Ábalos, & Griol, 2011).
It might seem obvious that the main use of emotional information in dialogue systems is to try to avoid negative user states and foster positive ones. Some examples of such behaviour are avoiding negative user emotions caused by system errors (Callejas, Griol, & López-Cózar, 2011), favouring engagement by diminishing boredom (Baker, D'Mello, Rodrigo, & Graesser, 2010), maximizing satisfaction (Lebai Lufti, Fernández-Martínez, Lucas-Cuesta, López-Lebón, & Montero, 2013), and fostering positive emotions to encourage adherence to healthy habits (Creed & Beale, 2012). However, in some application domains it might also be useful to render or provoke negative states, for example, for emotional mirroring, or to stress the users for a specific purpose, such as the treatment of different types of anxiety (Callejas, Ravenet, Ochs, & Pelachaud, 2014; Qu, Brinkman, Ling, Wiggers, & Heynderickx, 2014).
Emotions are defined, represented and managed within SDSs in many different ways. They can be represented as points in a space (usually with two dimensions: activation and evaluation), as discrete categories, or with appraisal models that consider the cause and target of the emotional response (Hudlicka, 2014). The implementation of affective SDSs depends on the representation used. If it follows the dimensional or discrete approach, recognition is usually based on the manifestation of the user's emotion, which can be processed considering linguistic (Balahur, Mihalcea, & Montoyo, 2014) and paralinguistic cues (Schuller & Batliner, 2013). When the appraisal model is used, a more sophisticated approach must be employed in order to also consider the possible causes of the emotion (Moors, Ellsworth, Scherer, & Frijda, 2013).
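To make the difference between the dimensional and the discrete representations more concrete, the following minimal Python sketch stores an emotion as a point in the activation-evaluation space and maps it to the nearest discrete category. The category names, their prototype coordinates and the nearest-neighbour mapping are illustrative assumptions and are not taken from any of the cited works.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionalEmotion:
    """A point in the activation-evaluation space, each axis in [-1, 1]."""
    activation: float  # calm (-1) .. excited (+1)
    evaluation: float  # negative (-1) .. positive (+1)


# Hypothetical prototype positions for a few discrete categories.
PROTOTYPES = {
    "anger": DimensionalEmotion(activation=0.8, evaluation=-0.7),
    "boredom": DimensionalEmotion(activation=-0.7, evaluation=-0.4),
    "joy": DimensionalEmotion(activation=0.6, evaluation=0.8),
    "neutral": DimensionalEmotion(activation=0.0, evaluation=0.0),
}


def to_category(point: DimensionalEmotion) -> str:
    """Return the discrete label whose prototype is closest to the point."""
    def distance(label: str) -> float:
        proto = PROTOTYPES[label]
        return math.hypot(point.activation - proto.activation,
                          point.evaluation - proto.evaluation)
    return min(PROTOTYPES, key=distance)
```

An appraisal model would need additional fields, such as the cause and the target of the emotion, which this sketch deliberately leaves out.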
Once a particular emotion is recognized, there are several ways to use it to adapt the system's behaviour. The approach selected depends on the ultimate goal of the system, such as optimizing the selection of the answer, leading the user to an optimal state for the interaction, building a social interaction with the user, or a combination of these.
In the first case, the information about the user's emotional state can be employed as another source of information used to handcraft new rules or as a new input to a statistical dialogue manager (Callejas, Griol, et al., 2011). When the objective is to change the user's emotional state or build more social relations, the system must include complex models of how emotions vary over time, and of how to sustain more complex forms of affect such as engagement and trust (Acosta & Ward, 2011). These same models can be used to generate believable system behaviour and to fine-tune the natural language generation and speech synthesis modules.

Personality models
Our behaviour is determined not only by context and emotion, but is also modulated by our personality (Callejas, López-Cózar, et al., 2011). Mairesse and Walker (2011) propose tailoring the system's personality to the application domain. For example, for a tutoring system they suggest rendering extrovert and agreeable pedagogic agents, whereas it could be interesting for a psychotherapy system to be neurotic. They also point out that the personality rendered by telesales agents could match the company's brand.
Other studies focus on adapting the system's personality to match the user's personality. For example, Nass and Yen (2012) showed that users' perception of the system's intelligence and competence increases if the perceived personality of the agent matches their own. Also, having information about the user's personality makes it possible to better adapt the system's behaviour. This is very relevant for engaging users in order to attain better performance and increase likeability, credibility, acceptance and overall user satisfaction. In Callejas, Griol, and López-Cózar (2014) we provide a discussion of these topics, as well as a framework for evaluating whether the system's personality is perceived as intended by the users, and whether it matches the users' own personality.

Contextual models
Knowing the interaction context is very important for SDSs for several reasons. Firstly, it makes better system performance possible; for example, different noise models can be used to increase speech recognition rates. Secondly, location information can be used to deliver functionalities, for example, to find nearby places, or to recognize the activities being carried out by the user in order to provide adequate services (Zhu & Sheng, 2011).

RESEARCH TRENDS
Language is one of the most pervasive and complex human capabilities. Developed over thousands of years, our ability to engage in long-term conversations that involve multiple people, take place in noisy environments, integrate multiple input/output modalities and cover multiple concurrent tasks is truly remarkable. This phenomenon has been depicted in fiction, for instance, in thought-provoking films such as 2001, A.I. and Her, among others.
Despite the extensive list of techniques created and applied in the field of human-computer interaction, language is still the most common, fastest and most natural means of communication. However, the low-level connection between language and thought makes the work on natural language technologies both a critical challenge and a great opportunity for research and innovation.
SDSs constitute one of the most demanding areas of work, as they involve the majority of the language-related subfields, from ASR to speech synthesis, by way of natural language understanding, semantic representation, dialogue management, affective modelling, multimodal interfaces, etc. Nevertheless, improvements in this area have many direct social and economic impacts. A recent survey carried out by Grand View Research, Inc.3 estimated the worldwide market for intelligent virtual assistants in 2012 at USD 352 million, and forecast annual growth of 31.7% from 2013 to 2020. According to this report, the reduction of customer service operational costs is the most prominent area in which this technology will have an economic impact.
In the last few years, the integration of speech-enabled technologies in mobile platforms has become a main target.The notion of personal assistant has entered the market through widespread applications like Siri, Google Now or Microsoft's Cortana.
The additional integration of Voice Search in these platforms opens new areas of application. In this case, the speech recogniser is in charge of the transcription from speech to text (obtaining a text query), which is then used as the input to a traditional search engine. Accordingly, by integrating ASR and search engines, Voice Search can help users in simple tasks such as those exemplified by the queries "Is there any Japanese restaurant near here?" or "Show me the weather forecast for tomorrow in Paris." However, Voice Search lacks any complex dialogue capability, as it usually focuses on a single input that generates a single output.
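The single-input, single-output nature of Voice Search can be illustrated with a short Python sketch. The function `voice_search` simply chains a recogniser and a search engine; `fake_transcribe`, `fake_search` and the two indexed documents are hypothetical stand-ins for real components, introduced here only for illustration.

```python
from typing import Callable, List


def voice_search(audio: bytes,
                 transcribe: Callable[[bytes], str],
                 search: Callable[[str], List[str]]) -> List[str]:
    """One-shot voice search: speech -> text query -> search results."""
    query = transcribe(audio)  # ASR step: obtain a text query
    return search(query)       # the query is passed to the search engine


# Hypothetical stand-ins for a real recogniser and a real search engine.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio is already text


def fake_search(query: str) -> List[str]:
    index = ["Japanese restaurant Sakura, 200 m",
             "Weather forecast for Paris: sunny"]
    words = set(query.lower().split())
    # Return every document that shares at least one word with the query.
    return [doc for doc in index if words & set(doc.lower().split())]
```

Note that no state is kept between calls: each query is processed in isolation, which is precisely why this scheme does not support follow-up questions or clarification sub-dialogues.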
To sum up, research and innovation on language technologies in general and on SDSs in particular constitute a major and prominent area of interest both in the public and private sectors.
In the previous sections of this paper, we have introduced the main ideas around the notion of SDS, its components and global architecture, some common areas of application of the whole technology as well as some key user-related aspects.In this section we focus on some of the most noticeable research trends in this field.

Verbal communication
The first, and sometimes one of the most critical, components of a SDS is the speech recogniser. Accordingly, ASR errors are the first problem that a SDS must be able to cope with. Despite the undeniable improvements in the technology, irrespective of the task under consideration, there is clearly still significant room for improvement.
Some of the main lines of research at this level are: detection and cancellation of background noise; recognition of spontaneous speech, where spoken disfluencies can considerably affect the recogniser; real-time recognition, or even some kind of prediction or anticipation of the next input; the integration of affect and emotion recognition as part of ASR (Batliner, Seppi, Steidl, & Schuller, 2010); and the application of new techniques beyond HMMs, such as deep neural networks (Dahl, Yu, Deng, & Acero, 2012). Although some recognition errors do not prevent a reasonable understanding of the user input (for instance, the detection of the main intent and keywords), there are still many cases in which the ASR output leads to a complete semantic misunderstanding.

3 http://www.grandviewresearch.com/industry-analysis/intelligent-virtual-assistant-industry

Multimodal interaction
Spoken language understanding (SLU) plays a crucial role in the design and implementation of SDSs. However, natural user interaction requires not only reliable speech recognition but also the detection and analysis of additional nonverbal communication, such as facial expressions, emotional state and gestures, among others (Bui, 2006; López-Cózar & Araki, 2005).
The incorporation of multimodal interaction in Ambient Intelligence environments has become a basic goal in many research programs.For example, the first EU Call under the Horizon 2020 program in the area of language technologies (ICT-22-2014) has focused on multimodal and natural computer interaction.
Current research on multimodal SDSs includes topics such as semantic multimodal fusion (Russ et al., 2005). Additionally, some initial results show that additional channels make it possible to reduce the ASR error rate by means of multimodal disambiguation (Longé, Eyraud, & Hullfish, 2012).
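As an illustration of how a second channel can disambiguate ASR hypotheses, the following Python sketch applies a simple late-fusion scheme: each modality scores the candidate interpretations and a weighted sum selects the winner. The weighting, the scores and the gesture modality are assumptions made for this example; it is not the method of the works cited above.

```python
from typing import Dict


def fuse(asr_scores: Dict[str, float],
         gesture_scores: Dict[str, float],
         asr_weight: float = 0.6) -> str:
    """Pick the hypothesis with the highest weighted sum of modality scores."""
    candidates = set(asr_scores) | set(gesture_scores)

    def combined(hypothesis: str) -> float:
        return (asr_weight * asr_scores.get(hypothesis, 0.0)
                + (1.0 - asr_weight) * gesture_scores.get(hypothesis, 0.0))

    # Sorting first makes the result deterministic when scores are tied.
    return max(sorted(candidates), key=combined)


# Hypothetical scores: the recogniser slightly prefers "fan", but the user
# is pointing at the lamp.
asr = {"turn on the lamp": 0.48, "turn on the fan": 0.52}
gesture = {"turn on the lamp": 0.9, "turn on the fan": 0.1}
best = fuse(asr, gesture)
```

With these illustrative scores the speech channel alone would select the wrong hypothesis, whereas the fused score favours the interpretation supported by the pointing gesture.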
Multimodal recognition of emotions has recently attracted the attention of the research community (Zeng, Pantic, Roisman, & Huang, 2009). For example, Calvo and D'Mello (2010) presented a survey on the combination of physiology, face, voice, text, body language and complex multimodal characterization.

Dialogue management
While introducing the global architecture and the main functional modules of a SDS, Section 2 has presented the Dialogue Manager as the component in charge of the coordination of the human-computer interaction. Different approaches for dialogue modelling have appeared in the last decades, each assuming a specific formalisation of the notion of dialogue. Taking into account their practical and theoretical aspects, some of the most prominent dialogue management approaches are the following (Jurafsky & Martin, 2009).
Finite-state models conceive the dialogue as a sequence of steps over a state transition network. The nodes capture the implicit dialogue state and correspond to the system's utterances (answers, prompts, etc.), while the transitions between the nodes determine all the possible paths (Cohen, 1997). McTear (2002) described the Nuance automatic banking system implemented with this approach. Although simplicity can be mentioned as its main advantage, its lack of flexibility represents a crucial drawback. However, it is still a common strategy used to cope with basic operations in call centers.
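The finite-state view of dialogue can be sketched in a few lines of Python. The network below, with its state names, prompts and the wildcard transition `"*"`, is a hypothetical toy banking example written for this illustration; it does not reproduce the Nuance system mentioned above.

```python
from typing import Dict, List, Tuple

# Each state maps to (system prompt, {recognised user input: next state}).
# "*" is a wildcard that accepts any input; final states have no transitions.
DIALOGUE: Dict[str, Tuple[str, Dict[str, str]]] = {
    "ask_service": ("Balance or transfer?",
                    {"balance": "say_balance", "transfer": "ask_amount"}),
    "ask_amount": ("How much?", {"*": "confirm"}),
    "confirm": ("Shall I proceed?", {"yes": "done", "no": "ask_service"}),
    "say_balance": ("Here is your balance.", {}),
    "done": ("Transfer completed.", {}),
}


def run(user_inputs: List[str],
        start: str = "ask_service") -> Tuple[str, List[str]]:
    """Walk the network; return the final state and the prompts produced."""
    state = start
    prompts = [DIALOGUE[state][0]]
    for text in user_inputs:
        transitions = DIALOGUE[state][1]
        if not transitions:  # final state reached: ignore further input
            break
        next_state = transitions.get(text, transitions.get("*"))
        if next_state is None:  # unexpected input: repeat the same prompt
            prompts.append(DIALOGUE[state][0])
            continue
        state = next_state
        prompts.append(DIALOGUE[state][0])
    return state, prompts
```

The rigidity discussed above is visible in the sketch: the user cannot say "transfer fifty euros" in one turn, because the amount is only accepted once the network has reached the `ask_amount` state.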
Frame-based approaches have been introduced in Section 2. This dialogue management strategy is based on the idea that some components of the dialogue (called slots) often appear together and are required to complete a task. This approach adds flexibility, as the slots can be filled in an arbitrary order. It also makes the interactions more natural, as several slots can be filled in a single turn and previous slot values can be overwritten, which allows correction and repair mechanisms. The frame-based framework gave rise to several variations: schemas, agendas (used in the Carnegie Mellon Communicator system; Bohus & Rudnicky, 2003), task structure graphs, type hierarchies and blackboards (Rothkrantz, Wiggers, Flippo, Woei-A-Jin, & van Vark, 2004).
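A minimal Python sketch of frame-based management follows. The `Frame` class, its method names and the travel-booking slots are assumptions introduced for illustration; the sketch also assumes that a separate understanding module has already converted each user turn into slot-value pairs.

```python
from typing import Dict, List, Optional


class Frame:
    """A set of named slots that must all be filled to complete a task."""

    def __init__(self, slots: List[str]) -> None:
        self.values: Dict[str, Optional[str]] = {name: None for name in slots}

    def update(self, parsed_turn: Dict[str, str]) -> None:
        """Fill or overwrite every known slot mentioned in one user turn."""
        for name, value in parsed_turn.items():
            if name in self.values:
                self.values[name] = value

    def next_prompt(self) -> Optional[str]:
        """Ask for the first empty slot, or return None if the frame is full."""
        for name, value in self.values.items():
            if value is None:
                return f"Please tell me the {name}."
        return None
```

For example, a frame with the slots `origin`, `destination` and `date` can receive the destination and the date in a single turn, then ask only for the origin, and later accept a corrected date through another call to `update`.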
The Information State Update (ISU) approach models all the available information during the dialogue as an "Information State" (Larsson & Traum, 2000). Consequently, this state integrates information related to the state of all the participants in the dialogue. Basically, this state comprises all the information gathered during the previous contributions to the dialogue by the participants, and models the future actions to be taken by the dialogue manager. The ISU approach can be conceived as a declarative model of the dialogue.
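The declarative flavour of the ISU approach can be sketched in Python as an information state plus a list of update rules, each with a precondition and an effect. The field names (`beliefs`, `agenda`, `latest_move`), the two rules and the fixed rule ordering are simplifying assumptions for this example and are not taken from Larsson and Traum (2000).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class InformationState:
    beliefs: Dict[str, str] = field(default_factory=dict)  # facts gathered so far
    agenda: List[str] = field(default_factory=list)        # planned system actions
    latest_move: Optional[Dict[str, str]] = None           # last user dialogue move


@dataclass
class UpdateRule:
    name: str
    precondition: Callable[[InformationState], bool]
    effect: Callable[[InformationState], None]


def integrate_answer(state: InformationState) -> None:
    """Store the answered slot as a belief and consume the move."""
    move = state.latest_move
    state.beliefs[move["slot"]] = move["value"]
    state.latest_move = None


def plan_date_question(state: InformationState) -> None:
    """Schedule a question about the (hypothetical) 'date' slot."""
    state.agenda.append("ask(date)")


RULES = [
    UpdateRule("integrate_answer",
               lambda s: s.latest_move is not None
               and s.latest_move.get("type") == "answer",
               integrate_answer),
    UpdateRule("plan_date_question",
               lambda s: "date" not in s.beliefs and "ask(date)" not in s.agenda,
               plan_date_question),
]


def update(state: InformationState) -> List[str]:
    """Apply, in order, every rule whose precondition holds; return their names."""
    fired = []
    for rule in RULES:
        if rule.precondition(state):
            rule.effect(state)
            fired.append(rule.name)
    return fired
```

The dialogue behaviour is thus described by what holds in the state and which rules apply, rather than by an explicit path through a network.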
All the approaches described so far require an expert in computational linguistics to formalize, design and implement the dialogue scheme itself. This hand-crafted strategy has an impact on the overall costs of designing and implementing the dialogue system and, above all, on its maintainability. In order to overcome these limitations, other approaches can be found in the literature, such as the agent-based approach and those based on machine learning techniques.
The agent-based approach is particularly useful when it is necessary to execute and monitor operations in a dynamically changing application domain. It makes it possible to combine the benefits of different dialogue control models, such as finite-state based dialogue control and frame-based dialogue management (Chu, O'Neill, Hanna, & McTear, 2005). Similarly, it can benefit from alternative dialogue management strategies, such as system-initiative and mixed-initiative (Walker, Hindle, Fromer, Di Fabbrizio, & Mestel, 1997).
Recent research has applied machine learning techniques to infer dialogue systems automatically. Among these techniques, the use of MDPs and POMDPs is worth mentioning. Both the methodological motivation and the technical core of this approach rely on the possibility of inducing a statistical framework from a huge corpus of dialogues (Young, Gasic, Thomson, & Williams, 2013). Two advantages of this framework should be highlighted. Firstly, it incorporates an explicit representation of uncertainty, which makes the final system more robust with respect to verbal (speech) and non-verbal recognition than rule-based models. Secondly, its learning capability represents a significant reduction in development costs. However, the data collection and annotation tasks required to build the huge dialogue corpora may jeopardise this second advantage.
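The explicit representation of uncertainty in a POMDP-based dialogue manager is usually a belief, that is, a probability distribution over the hidden dialogue states, which is revised after every system action and every (possibly misrecognised) observation. The Python sketch below implements the standard POMDP belief update for a small discrete state space. The toy models `goal_persists` and `noisy_asr`, the state names and the probability 0.7 are assumptions chosen for illustration; practical systems rely on approximations of this update and on models learned from corpora.

```python
from typing import Callable, Dict

Belief = Dict[str, float]


def belief_update(belief: Belief,
                  action: str,
                  observation: str,
                  transition: Callable[[str, str, str], float],
                  observation_model: Callable[[str, str, str], float]) -> Belief:
    """Return b'(s') proportional to P(o | s', a) * sum_s P(s' | s, a) * b(s)."""
    unnormalised = {}
    for s_next in belief:
        predicted = sum(transition(s, action, s_next) * p
                        for s, p in belief.items())
        unnormalised[s_next] = (observation_model(observation, s_next, action)
                                * predicted)
    total = sum(unnormalised.values())
    if total == 0.0:
        return dict(belief)  # observation impossible under the model: keep belief
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical toy models: the user's goal never changes, and the recogniser
# reports the true goal with probability 0.7.
def goal_persists(s: str, action: str, s_next: str) -> float:
    return 1.0 if s == s_next else 0.0


def noisy_asr(observation: str, s_next: str, action: str) -> float:
    return 0.7 if observation == s_next else 0.3


belief = {"flight": 0.5, "hotel": 0.5}
belief = belief_update(belief, "ask_goal", "flight", goal_persists, noisy_asr)
```

Instead of committing to the recogniser's first hypothesis, the manager keeps both goals alive and becomes more confident only as consistent evidence accumulates, which is the source of the robustness mentioned above.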

Meta-cognition and incrementality
The human ability to engage in the complex interactions that make up dialogues can be considered a cognitive skill. Dialogue management is thus a technical sub-field that tends to mimic this cognitive skill using different approaches, as previously discussed. However, humans have the capability to reflect on their own behaviour and to use this reflection for improvement. The incorporation of metacognitive capabilities into the field of SDSs represents a challenging and promising research line (Alexandersson et al., 2014; EU-funded Metalogue project4).
The turn-taking mechanism of standard Interaction Management architectures is based on complete sentences. However, human communication is intrinsically incremental. Some outstanding research is currently focusing on this topic (Schlangen & Skantze, 2011; EU-funded Parlance project5).
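The contrast between sentence-based and incremental processing can be illustrated with a small Python sketch in which a partial interpretation is made available after every incoming word, instead of only at the end of the turn. The keyword-spotting strategy and the slot names are assumptions for this example; it is far simpler than the incremental architectures developed in the research cited above.

```python
from typing import Dict, Iterable, Iterator, Tuple


def incremental_understanding(
        words: Iterable[str],
        keywords: Dict[str, str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (partial utterance, slots found so far) after every word."""
    heard = []
    slots: Dict[str, str] = {}
    for word in words:
        heard.append(word)
        slot = keywords.get(word.lower())
        if slot is not None:
            slots[slot] = word
        # A copy is yielded so that later words do not alter earlier results.
        yield " ".join(heard), dict(slots)


# Hypothetical keyword table mapping recognised words to slot names.
KEYWORDS = {"paris": "destination", "tomorrow": "date"}
steps = list(incremental_understanding("a flight to Paris tomorrow".split(),
                                       KEYWORDS))
```

Because the destination is already known after the fourth word, a system built along these lines could start a database query or produce feedback before the user has finished speaking.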

CONCLUSIONS
In this paper we have presented a short study on the state of the art of spoken dialogue systems, which are computer programs developed to interact with users employing speech in order to provide them with specific automated services. A key aspect of these systems is that the interaction is carried out by means of dialogue turns, which many studies in the literature aim to make as similar as possible to those between humans in terms of naturalness, intelligence and affective content.
The field is too broad to be studied in detail in a single paper. Thus, we have addressed a limited number of aspects to provide the reader with some basic knowledge of the core technologies employed for the development of these systems. Also, we have aimed to show the technological challenges related to speech and language processing that limit the use of current systems by a wider range of potential users and applications.
In addition, we have presented an evolution of this technology and discussed some challenging applications, such as health, education and embodied conversational agents. As an outcome of the technological evolution, we have addressed the development paradigms, discussing specific scripting languages as well as the development of conversational interfaces for mobile apps.
Given that the correct modelling of the user is a key aspect of this technology, we have addressed current models for affective, personality and contextual processing.
Finally, we have discussed some current research trends in terms of verbal communication, multimodal interaction and dialogue management.

Figure 1: Module architecture of a SDS.