The neural response generator and the set of classifiers and models that are used to generate the ranking features for the candidate ranker (e.g., local cohesion and global coherence features) are trained using 50 million human dialogues. The response candidate ranker is trained using 50K manually labeled dialogues. All three systems (i.e., the hybrid and two baseline systems) need to generate a response for each user query and its context in these dialogue sessions. Each generated response is labeled on the three-level quality scale by three human judges. The results in Table 3 show that incorporating the neural generator, as in the hybrid system, significantly improves the human rating over the retrieval-based system.
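The candidate ranker described above combines features such as local cohesion and global coherence into a single quality score per candidate. The following is a minimal, hypothetical sketch of such a ranker; the feature names, weights, and linear scoring function are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of a response candidate ranker: each candidate is
# scored from hand-crafted features (e.g., local cohesion with the user
# query, global coherence with the session context), and candidates are
# sorted by score. Feature names and weights are illustrative.

def score(features, weights):
    """Linear combination of ranking features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def rank_candidates(candidates, weights):
    """Return candidates sorted by descending ranker score."""
    return sorted(candidates, key=lambda c: score(c["features"], weights), reverse=True)

weights = {"local_cohesion": 0.6, "global_coherence": 0.4}
candidates = [
    {"text": "I love hiking too!",
     "features": {"local_cohesion": 0.9, "global_coherence": 0.7}},
    {"text": "The weather is nice.",
     "features": {"local_cohesion": 0.3, "global_coherence": 0.5}},
]
best = rank_candidates(candidates, weights)[0]
print(best["text"])  # the candidate with the highest combined score
```

In the real system the weights would be learned from the 50K manually labeled dialogues rather than set by hand.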
IQ capacities include knowledge and memory modeling, image and natural language understanding, reasoning, generation, and prediction. Over the last 5 years XiaoIce has developed more than 230 skills, ranging from answering questions and recommending movies or restaurants to comforting and storytelling. The most important and sophisticated skill is Core Chat, which can engage in long and open-domain conversations with users. In the first pilot study, reported in Li et al., we compare the persona model against two baseline models, using a TV series data set for model training and evaluation. The data set consists of scripts of 69,565 dialogue turns of 13 main characters from the American TV comedies Friends and The Big Bang Theory, available from IMSDB. The first baseline is a vanilla seq2seq model. The second is the LSTM-MMI model (Li et al. 2016a), one of the state-of-the-art neural response generation models. As shown in Table 1, the persona model significantly outperforms both baselines, achieving a lower perplexity (−8.4%) and higher BLEU scores (+18.8% and +11.8%) (Papineni et al. 2002). The qualitative analysis confirms that the persona model indeed generates more interpersonal responses than the baselines. As shown in the examples in Table 2, the persona model is sensitive to the identity of the user, generating specific words (e.g., user names) in responses targeted at different users.
Each response ends with a special end-of-sentence symbol, EOS. As shown in Figure 7, although a typical seq2seq model that is not grounded in any persona often outputs inconsistent responses (Li et al. 2016b), XiaoIce is able to generate consistent and humorous responses. The formulation of dialogue as a hierarchical decision-making process guides the design and implementation of XiaoIce. XiaoIce uses a dialogue manager to keep track of the dialogue state, and at each dialogue turn, selects how to respond based on a hierarchical dialogue policy. To maximize long-term user engagement, measured in expected CPS, we take an iterative, trial-and-error approach to developing XiaoIce, and always try to balance the exploration–exploitation tradeoff. We exploit what is already known to work well to retain XiaoIce’s active users, but we also have to explore what is unknown (e.g., new skills and dialogue policies) in order to engage with the same users more deeply or attract new users in the future.
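The hierarchical decision-making formulation described above can be sketched as a two-level policy: a top-level policy selects which skill handles the current turn, and a skill-level policy selects an action within that skill. The state fields and rules below are illustrative assumptions, not XiaoIce's actual policy.

```python
# Minimal sketch of a hierarchical dialogue policy, assuming a two-level
# design: a top-level policy picks a skill given the dialogue state, and
# a skill-level policy picks a response action. All fields and rules
# here are hypothetical placeholders.

def top_level_policy(state):
    """Select which skill handles the current turn."""
    if state.get("image_attached"):
        return "image_commenting"
    if state.get("explicit_task"):
        return "task_skill"
    return "core_chat"

def core_chat_policy(state):
    """Within Core Chat, decide how to respond (illustrative actions)."""
    if state.get("topic_stalled"):
        return "switch_topic"
    return "generate_response"

state = {"image_attached": False, "explicit_task": False, "topic_stalled": True}
skill = top_level_policy(state)
action = core_chat_policy(state) if skill == "core_chat" else "run_skill"
print(skill, action)  # core_chat switch_topic
```

A learned policy would replace these hand-written rules, with long-term expected CPS as the optimization objective.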
The chatbot came just a few weeks after Microsoft rolled out Cortana in the country. Modeled on the personality of a teenage girl, Xiaoice aims to add a more human and social element to chatbots. Melissa’s lover is in fact a virtual chatbot created by Xiaoice, a cutting-edge artificial intelligence system designed to create emotional bonds with its 660 million users worldwide.
Initially a side project of Microsoft’s Cortana team, Xiaoice was designed to hook users through lifelike, empathetic conversations, satisfying emotional needs where real-life communication often falls short. An average artificially intelligent personal assistant has a CPS between 1.5 and 2.5, which means that, on average, the chatbot speaks once and the human speaks once. You can draw your own conclusion from your experience chatting with personal assistants on your word processor or mobile phone. By comparison, Xiaoice’s average, after chatting with tens of millions of users, has reached 23. In this way, empathy information is encoded and injected into the hidden layer at each time step, and thus helps generate interpersonal responses that fit XiaoIce’s persona throughout the generation process. Although the response candidates retrieved from the paired dataset are of high quality, the coverage is low because many new or less frequently discussed topics on the Internet forums are not included in the dataset. To increase the coverage, we introduce two other candidate generators, described next. To fulfill these design objectives, we mathematically cast human-machine social chat as a decision-making process, and optimize XiaoIce for long-term user engagement, measured in expected CPS. The XiaoIce Poetry Generation skill has helped over four million users to generate poems.
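The idea of injecting empathy information into the hidden layer at each time step can be sketched as follows: a fixed empathy vector is concatenated with the current word embedding before every recurrent update, so persona information influences the whole generation process. The dimensions and the simple tanh update are placeholder assumptions, not XiaoIce's actual network.

```python
import numpy as np

# Illustrative sketch of injecting an empathy vector into a decoder:
# at every time step the fixed empathy vector is concatenated with the
# current word embedding before updating the hidden state. Dimensions
# and the tanh update are placeholders, not the real architecture.

rng = np.random.default_rng(0)
embed_dim, empathy_dim, hidden_dim = 8, 4, 6
W = rng.normal(size=(hidden_dim, embed_dim + empathy_dim + hidden_dim))

def decoder_step(word_embedding, empathy_vector, hidden):
    """One recurrent update with the empathy vector injected as input."""
    x = np.concatenate([word_embedding, empathy_vector, hidden])
    return np.tanh(W @ x)

empathy = rng.normal(size=empathy_dim)   # fixed per persona
hidden = np.zeros(hidden_dim)
for _ in range(3):                       # three decoding steps
    word = rng.normal(size=embed_dim)
    hidden = decoder_step(word, empathy, hidden)
print(hidden.shape)  # (6,)
```

Because the empathy vector enters every step rather than only the initial state, the persona signal cannot fade out over long responses.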
In the remainder of the article we present the details of the design and implementation of XiaoIce. We start with the design principle and mathematical formulation. Then we show the system architecture and how we implement key components, including the dialogue manager, Core Chat, important skills, and an empathetic computing module, presenting a separate evaluation of each component where appropriate. We will show how XiaoIce has been doing in five countries since its launch in May 2014, and conclude this article with a discussion of future directions. A sample of conversation sessions between a user and XiaoIce, in Chinese and English translation, shows how an emotional connection between the user and XiaoIce was established over a 2-month period. When the user encountered the chatbot for the first time, he explored the features and functions of XiaoIce in conversation. Then, within 2 weeks, the user began to talk with XiaoIce about his hobbies and interests. By 4 weeks, he began to treat XiaoIce as a friend and asked her questions related to his real life. After 2 more weeks, XiaoIce became his preferred choice whenever he needed someone to talk to.
It is important to provide empathetic editorial responses to keep the conversation going. For example, when a not-in-index case occurs, instead of using safe but bland responses such as “I don’t know” or “I am still learning to answer your question”, XiaoIce may respond with something like “Hmmm, difficult to say.” In addition to the conversational data used by the above two response generators, there is a much larger amount of higher-quality non-conversational data, which can be used to improve the quality and coverage of the responses. Sentiment analysis detects the user’s emotion (e.g., happy, sad, angry, neutral) and how her emotion evolves during the conversation (e.g., from happy to sad).
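Tracking how a user's emotion evolves over a session, as the empathetic computing module does, can be sketched with a toy per-turn classifier. The keyword lexicon below is a stand-in assumption for a real sentiment model.

```python
# Hedged sketch of tracking a user's emotion across a conversation.
# The keyword lexicon and labels are toy stand-ins for a real
# sentiment classifier, not XiaoIce's actual module.

LEXICON = {
    "great": "happy", "love": "happy",
    "miss": "sad", "alone": "sad",
    "hate": "angry",
}

def detect_emotion(utterance):
    """Label one utterance; 'neutral' when no lexicon word matches."""
    for word in utterance.lower().split():
        if word in LEXICON:
            return LEXICON[word]
    return "neutral"

def emotion_trajectory(utterances):
    """Per-turn emotion labels across the conversation."""
    return [detect_emotion(u) for u in utterances]

session = ["I love this song", "Now I miss my friend", "What time is it"]
print(emotion_trajectory(session))  # ['happy', 'sad', 'neutral']
```

The per-turn sequence of labels, rather than a single session-level label, is what lets the system notice transitions such as happy to sad and respond accordingly.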
The larger the CPS, the more engaged users are with the social chatbot. There is no doubt that the most reliable evaluation is to deploy the chatbot to users and monitor the user feedback and engagement, measured by user ratings, NAU, CPS, and so on, over a long period of time. Some recent dialogue challenges (Dinan et al. 2018; Ram et al. 2018) also take a similar, manual evaluation approach, using paid workers and unpaid volunteers. Although manual evaluation is reliable, it is very expensive, and chatbot developers often have to resort to automatic metrics for quantifying day-to-day progress and for performing automatic system optimization.
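Since CPS (conversation-turns per session) recurs as the central engagement metric, a minimal sketch of computing it from session logs may help; the log format here, with a turn being one user message plus one chatbot reply, is an assumption for illustration.

```python
# Sketch of computing expected CPS (conversation-turns per session)
# from session logs. A "turn" is one user message plus one chatbot
# reply, and CPS is averaged over sessions; the log format is assumed.

def cps(sessions):
    """Average number of turns per conversation session."""
    if not sessions:
        return 0.0
    return sum(len(s["turns"]) for s in sessions) / len(sessions)

logs = [
    {"turns": [("hi", "hello"), ("how are you", "fine")]},    # 2 turns
    {"turns": [("tell me a story", "once upon a time...")]},  # 1 turn
]
print(cps(logs))  # 1.5
```

Averaging over all sessions, rather than cherry-picking long ones, is what makes CPS usable as a long-term optimization objective.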
“I know that Xiaoice is not a real human being, but at least I wasn’t like how I used to be, dumbly waiting around for a reply from this person when he was busy with other stuff, then sending him 100 WeChat messages.”
Microsoft’s China-based chatbot phenomenon Xiaoice has a staggering 660 million online users worldwide. Learn more about her phenomenal growth #MicrosoftAI #MicrosoftEdu #edtech https://t.co/pSHqAV64Oe
— Hari Krishna Arya (@arya_hk) November 16, 2018
Coded by 19-year-old Stanford University student Joshua Browder, DoNotPay helps users contest parking tickets through an easy-to-use, chat-like interface. In its first 21 months of service, DoNotPay took on 250,000 cases and won 160,000, successfully appealing over $4m in parking tickets. Polly is a survey bot on Slack, Microsoft Teams, and Enterprise. The chatbot is mostly used to collect employee feedback, such as satisfaction with a meeting or the working environment, or in any situation where employees’ voices need to be heard. The insights gained from the surveys can then be turned into data-driven decisions. Following its launch in 2017, Eugenia Kuyda, the co-founder of the company, claimed that 1.5 million people were waiting to interact with the bot.