Building intelligent systems, known as chat bots, that can engage in human-like conversation is one of the frontiers of AI still to be tamed. While voice assistants are now commonplace, they fall short of a true conversation, as they are unable to hold the context of the conversation and use it to generate future responses.
A two-person chat, like the one below, is modeled as a sequence of utterances exchanged between two actors. In particular, we are interested in the scenario where the two actors are a chat bot and a human.
Human: Hi, I want to buy a black, fully loaded Honda Accord. Can you help me?
Bot: Sure! Have you test driven one yet?
Human: No. I have only been researching cars in that range to figure out which one I want to buy. Now I am pretty sure I want to buy an Accord and thought I would check what you have in your lot.
Bot: Great! So I don’t have a black Accord but do have one in Grey. I would suggest you drop in whenever is convenient for you and test drive it. I can guarantee you will love its feel!
Human: Sure. Can you book me in for tomorrow afternoon, say 2pm?
Bot: Can I have a contact number and name?
Human: Sure. It’s Scott Davis. (212) 3786785
Bot: Thanks Scott. You are all booked in. Look forward to seeing you tomorrow
Previous utterances by both parties are treated as context in responding to the current human utterance. Each utterance is itself modeled as a sequence of words.
Traditional approaches that represent documents as bags of words fail to take into account the sequence of utterances and words. These representations also encode text at a syntactic (word) level rather than a semantic level. Recent advances in word embeddings such as word2vec and GloVe map individual words into an n-dimensional continuous space, where the similarity of two vectors reflects the semantic similarity between the words they represent. This enables text fragments to be represented as n-dimensional vectors, allowing a more semantic comparison of one utterance with another. However, simply combining the word vectors using a weighted sum still ignores the order of words and utterances.

Hence Recurrent Neural Networks (RNNs) are typically used to encode the sequence of utterances and words. A standard recurrent network produces a representation of a sequence of words by taking two inputs at each step: the current word at time t and the hidden state from time t-1, which is a “summary” of the words preceding the current one. In doing so, it encodes the utterance while respecting the order of words within it, making the phrase “The sun rises in the west” distinct from “In the west, the sun rises”. As sequences become longer, standard RNN nodes are in practice unable to model the complete sequence effectively. While a number of alternatives have been proposed, the method empirically shown to best model long sequences uses Long Short-Term Memory (LSTM) units.
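The sketch below illustrates this pipeline, assuming PyTorch; the vocabulary, dimensions and toy sentence are purely illustrative. Each word index becomes a word vector, and an LSTM folds the vectors, in order, into a single utterance encoding.

```python
# Minimal sketch (PyTorch assumed) of encoding one utterance with an LSTM.
# The vocabulary, dimensions and toy sentence are illustrative only.
import torch
import torch.nn as nn

vocab = {"<pad>": 0, "the": 1, "sun": 2, "rises": 3, "in": 4, "west": 5}
embedding_dim, hidden_dim = 8, 16

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=embedding_dim)
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_dim, batch_first=True)

# "The sun rises in the west" as a sequence of word indices (batch of 1).
tokens = torch.tensor([[1, 2, 3, 4, 1, 5]])

# Each word index becomes an n-dimensional vector; the LSTM consumes the vectors
# in order, updating its hidden state at every step.
word_vectors = embed(tokens)                  # shape: (1, 6, embedding_dim)
outputs, (h_n, c_n) = lstm(word_vectors)      # h_n: (1, 1, hidden_dim)

# h_n is the order-sensitive "summary" of the whole utterance; reversing the
# word order would, in general, produce a different vector.
utterance_encoding = h_n.squeeze(0)
```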
Various architectures based on RNNs have been used to implement chat bots with varying degrees of success. Chat bots are typically implemented using retrieval-based or generative models. Retrieval-based chat bots require handcrafted template responses, from which the best-matching response is used to reply to the current target utterance by the human. In a dual encoder, the target utterance and the context are encoded together by the RNN. The n-dimensional vector representing the hidden state of the last unit of the RNN is an encoding of the conversation to date; a translator unit then converts this encoding into an encoding, in the same n-dimensional space, of the ideal response. The similarity of this ideal response encoding to the encodings of the response templates at the chat bot's disposal determines which template is used to generate the reply. Alternatively, the response encoding can be used to generate a response directly. Such generative models have the advantage of not requiring predefined response templates; however, they are typically more error prone because human language is complex.
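As a rough sketch of the retrieval step described above (PyTorch assumed; the translator matrix M, the dimensions and the random encodings stand in for learned components and are not a specific library's API), the predicted response encoding is compared against pre-encoded templates by cosine similarity:

```python
# Hedged sketch of the retrieval step in a dual encoder. The context is assumed
# to be already encoded into a vector; a learned "translator" matrix M maps it
# to a predicted response encoding, which is scored against template encodings.
import torch
import torch.nn.functional as F

hidden_dim, num_templates = 16, 4

context_encoding = torch.randn(hidden_dim)                    # output of the context RNN
M = torch.randn(hidden_dim, hidden_dim)                       # learned translator weights
template_encodings = torch.randn(num_templates, hidden_dim)   # pre-encoded template responses

# Predicted encoding of the "ideal" response, in the same n-dimensional space.
predicted_response = context_encoding @ M

# Cosine similarity against every template; the best match becomes the bot's reply.
scores = F.cosine_similarity(predicted_response.unsqueeze(0), template_encodings, dim=1)
best_template = torch.argmax(scores).item()
```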
The utterances within a chat and the vocabulary used are the result of a number of underlying factors. The intent of the utterance is one such factor, but there are others: the speaker's personality and state of mind, the perceived command of the language of the recipient, and situational and domain-specific constraints such as level of motivation and stage within a process (for example, in a sales process the vocabulary chosen when researching a product is very different from the vocabulary used once time or other “currency” has been invested in the purchase). As some of these factors are also likely to evolve during the course of the conversation, these implicit factors can themselves be modeled using an RNN, intertwined with the RNNs that encode the context and response.
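One plausible reading of such intertwined RNNs is a hierarchical encoder: an utterance-level LSTM summarises each utterance, and a conversation-level LSTM consumes those summaries in order, so that its hidden state can track slowly evolving factors across the chat. The sketch below is our own illustration of that idea (PyTorch assumed, dimensions arbitrary), not a specific published architecture.

```python
# Sketch of a hierarchical ("intertwined") encoder: an utterance-level LSTM
# encodes each utterance, and a conversation-level LSTM consumes the resulting
# encodings in order, tracking state that evolves across the conversation.
import torch
import torch.nn as nn

embedding_dim, utt_hidden, conv_hidden = 8, 16, 32

utterance_lstm = nn.LSTM(embedding_dim, utt_hidden, batch_first=True)
conversation_lstm = nn.LSTM(utt_hidden, conv_hidden, batch_first=True)

# Three utterances, each a (1, num_words, embedding_dim) tensor of word vectors.
utterances = [torch.randn(1, n, embedding_dim) for n in (5, 9, 4)]

# Encode each utterance, then feed the sequence of utterance encodings upward.
utterance_encodings = []
for u in utterances:
    _, (h_n, _) = utterance_lstm(u)
    utterance_encodings.append(h_n.squeeze(0))                 # (1, utt_hidden)

conversation_input = torch.stack(utterance_encodings, dim=1)   # (1, 3, utt_hidden)
_, (conv_state, _) = conversation_lstm(conversation_input)

# conv_state summarises the conversation so far and can condition the response.
```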
At Cyrano, we approach chat bot curation as a machine learning problem that learns models directly from chat logs. Semi-supervised learning and active learning are used to minimise the upfront labelling of data. Our algorithms learn to identify domain-specific entities, topics and embeddings in concert with proprietary advanced language models, ensuring that the generated responses are more effective at moving the human from one stage of the conversation towards the goal state.