Engati - User Guide

NLP/ NLU Capabilities

1. Introduction

The Engati platform comes pre-packaged with its own proprietary NLU engine, which helps chatbots built on the platform respond accurately to user queries.

While many complex computations go into determining each response, we have worked hard to keep the platform simple for building and training chatbots. The following sections shed light on what goes on behind the scenes, from an NLP/NLU perspective, to resolve each query.

2. NLP Pipeline

The NLP pipeline is a series of steps that work together to process the query and its terms and provide the best response. Each step or component of the pipeline has a particular objective; the results are then consolidated and reconciled, and the chatbot returns the best match. The pipeline is what orchestrates all the individual components. Let us look at each of them in more detail.
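As a rough illustration of the idea of chained steps, the sketch below runs a query through a list of functions in order. It assumes the steps run strictly in sequence, which the guide does not state; `run_pipeline`, `to_lower` and `split_words` are hypothetical names, not part of the Engati platform.

```python
from typing import Callable, List

# A step takes the current working value and returns the transformed value.
Step = Callable[[object], object]


def run_pipeline(query: str, steps: List[Step]) -> object:
    """Feed the query through each step in order and return the final value."""
    value: object = query
    for step in steps:
        value = step(value)
    return value


# Hypothetical stand-in steps, for illustration only.
def to_lower(text: str) -> str:
    return text.lower()


def split_words(text: str) -> List[str]:
    return text.split()


print(run_pipeline("Show Prepaid Plans", [to_lower, split_words]))
# ['show', 'prepaid', 'plans']
```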

2.1 Normalisation

To reduce bias in the pattern recognition algorithm, tokens are transformed into a consistent format, namely lower case. This is also necessary because many users prefer to chat in lower case.

Noisy characters, e.g. punctuation, can also be removed for better downstream processing.
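A minimal sketch of this kind of normalisation might look like the following. The function name `normalise` is hypothetical, and the choice to replace punctuation with spaces (rather than delete it) is an assumption made here so that hyphenated words stay separated.

```python
import string


def normalise(text: str) -> str:
    """Lower-case the text and replace ASCII punctuation with spaces."""
    lowered = text.lower()
    # Map every ASCII punctuation character to a space.
    # Note: string.punctuation does not cover typographic quotes.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    cleaned = lowered.translate(table)
    # Collapse the runs of whitespace left behind.
    return " ".join(cleaned.split())


print(normalise("What are Pre-Paid plans, with 10GB data?"))
# what are pre paid plans with 10gb data
```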

2.2 Tokenisation

Input messages are split into sentences, and then sentences are split into tokens/words. Tokens are the lowest common denominator for further processing.
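The two-level split described above can be sketched with regular expressions. This is a simplified illustration, not the platform's tokeniser: it assumes sentences end with `.`, `!` or `?` followed by whitespace, and the function names are hypothetical.

```python
import re
from typing import List


def split_sentences(message: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", message.strip())
    return [part for part in parts if part]


def split_tokens(sentence: str) -> List[str]:
    """Return the runs of word characters in a sentence."""
    return re.findall(r"\w+", sentence)


def tokenise(message: str) -> List[List[str]]:
    """Message -> sentences -> tokens."""
    return [split_tokens(s) for s in split_sentences(message)]


print(tokenise("I need a plan. Show prepaid offers!"))
# [['I', 'need', 'a', 'plan'], ['Show', 'prepaid', 'offers']]
```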

2.3 Stop Words Removal

Stop words are frequently occurring words like ‘the’, ‘and’, ‘a’, etc. that do not contribute greatly to understanding text and are thus removed from the input message. Removing them reduces noise and improves accuracy.

Custom stop words can also be added to enhance the training of the NLP engine.
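The filtering step, including custom stop words, can be sketched as below. The default word list and the function name are illustrative assumptions; the platform's actual stop word list is not documented here.

```python
from typing import Iterable, List, Set

# Illustrative default list only; not the platform's actual list.
DEFAULT_STOP_WORDS: Set[str] = {"the", "and", "a", "an", "of", "for", "with", "is", "are"}


def remove_stop_words(tokens: Iterable[str], custom_stop_words: Iterable[str] = ()) -> List[str]:
    """Drop tokens found in the default list or in the custom list."""
    stop = DEFAULT_STOP_WORDS | {word.lower() for word in custom_stop_words}
    return [token for token in tokens if token.lower() not in stop]


tokens = ["please", "show", "the", "plans", "and", "a", "recharge"]
print(remove_stop_words(tokens))
# ['please', 'show', 'plans', 'recharge']
print(remove_stop_words(tokens, custom_stop_words=["please"]))
# ['show', 'plans', 'recharge']
```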

2.4 Spell Check

The platform includes the capability to correct misspelt words in user messages.

The spell check models continuously learn new words from the training data added by the Bot Administrator. They compute and keep track of the frequencies of words in the system in order to identify misspelt words. Suggestions are based on a word distance algorithm called “Levenshtein Distance”.

For every user query, if any word is detected as misspelt, the best suggestion is chosen using the above-mentioned word distance algorithm. The spell-check dictionary automatically rebuilds itself whenever a new statement is added to the system.

Example: Let’s say the user message is a misspelt version of “recharge offers for prepaid with 10GB data and unlimited national calls” (for instance, with “recharge” typed as “rechrge”). This user message is converted to “recharge offers for prepaid with 10GB data and unlimited national calls” before any further processing.
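To make the idea concrete, here is a minimal sketch of frequency-aware correction using Levenshtein distance. It is an illustration under stated assumptions, not Engati's implementation: the tie-breaking rule (smallest distance first, then highest frequency), the `max_distance` cut-off and all function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def build_frequencies(statements: Iterable[str]) -> Counter:
    """Count word occurrences across the training statements."""
    counts: Counter = Counter()
    for statement in statements:
        counts.update(statement.lower().split())
    return counts


def correct(word: str, frequencies: Counter, max_distance: int = 2) -> str:
    """Return the closest known word, or the word itself if nothing is close."""
    word = word.lower()
    if word in frequencies:
        return word
    best, best_key = word, None
    for candidate, freq in frequencies.items():
        distance = levenshtein(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -freq)  # assumed rule: nearest first, then most frequent
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best


freqs = build_frequencies(["recharge offers for prepaid", "recharge plans"])
print(correct("rechrge", freqs))  # recharge
```

Rebuilding `freqs` whenever a statement is added mirrors the dictionary rebuild described above.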

2.5 Stemming/ Lemmatization

The root of each word is determined by eliminating affixes. The approach differs from language to language: in some cases the system performs lemmatization to get to the root word, whereas in others it uses stemming. The two techniques are fundamentally different but serve the same purpose of distilling terms to their base form.

For example, the stem of “liking” and “liked” is determined as “like”. This helps in better semantic understanding.
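The toy stemmer below shows how suffix stripping can map “liking” and “liked” to “like”. It is a deliberately tiny sketch covering only the “-ing” and “-ed” suffixes with two repair rules borrowed loosely from Porter-style stemmers; it is not the algorithm the platform uses, and `toy_stem` is a hypothetical name.

```python
VOWELS = set("aeiou")


def _is_short_cvc(stem: str) -> bool:
    """True for three-letter consonant-vowel-consonant stems such as 'lik'."""
    return (len(stem) == 3
            and stem[0] not in VOWELS
            and stem[1] in VOWELS
            and stem[2] not in VOWELS)


def toy_stem(word: str) -> str:
    """Strip '-ing' / '-ed' and repair the stem in two common cases."""
    word = word.lower()
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # 'runn' -> 'run': undo a doubled final consonant (except l, s, z).
            if stem[-1] == stem[-2] and stem[-1] not in VOWELS and stem[-1] not in "lsz":
                return stem[:-1]
            # 'lik' -> 'like': restore a dropped final 'e' on short stems.
            if _is_short_cvc(stem):
                return stem + "e"
            return stem
    return word


for w in ("liking", "liked", "like", "running", "walked"):
    print(w, "->", toy_stem(w))
```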

2.6 Conversational Context

The NLU capabilities of the chatbot allow it to maintain conversational context while conversing with a user by keeping track of entities. This makes the conversation easier and quicker, because the user need not mention an entity again in subsequent queries that carry the same context.

The NLU engine remembers the entities used in a session by saving them as conversation context history.
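One way to picture this is a per-session store that merges the entities from each turn into the remembered context. The class below is a hypothetical sketch; the entity names and the rule that newer values override older ones are assumptions for illustration.

```python
from typing import Dict, List


class SessionContext:
    """Remembers entities mentioned earlier in a session."""

    def __init__(self) -> None:
        self.entities: Dict[str, str] = {}
        self.history: List[Dict[str, str]] = []

    def resolve(self, turn_entities: Dict[str, str]) -> Dict[str, str]:
        """Merge this turn's entities into the context and return the full set."""
        self.history.append(dict(turn_entities))
        self.entities.update(turn_entities)  # assumed: newer values win
        return dict(self.entities)


ctx = SessionContext()
ctx.resolve({"plan_type": "prepaid", "data": "10GB"})
# A follow-up such as "what about 20GB?" only mentions the data quantity.
print(ctx.resolve({"data": "20GB"}))
# {'plan_type': 'prepaid', 'data': '20GB'}
```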

Conversational context is also detailed in its own section.

2.7 Named Entity Recognition

Entities are very important for identifying and extracting useful data from natural language input.

While intents allow us to understand the motivation behind a particular user input, entities are used to pick out specific pieces of information that users have mentioned. Any important data to get from a user’s request is an entity.

There are built-in entities like Day/Date/Time/Quantity/Country and many more. Custom entities can also be added for a specific use case. For telecom plans, for example, the possible values could be “unlimited national calls”, “unlimited data plan”, “Fixed call plan”, etc.

The Bot Administrator gets to specify the entities expected in user queries and the extraction engine extracts these entities from the user query.

Example:

“What are pre-paid plans with 10GB data and unlimited national calls?”

The entity extraction engine can figure out that the user is talking about “10GB” of data and a plan of type “unlimited national calls”.
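The extraction in this example can be approximated with a pattern for data quantities plus a lookup of custom entity values. This is a simplified sketch, not the platform's extraction engine; the entity names `data_quantity` and `plan_feature` and the regular expression are assumptions.

```python
import re
from typing import Dict, List

# Custom entity values as a Bot Administrator might define them (illustrative).
CUSTOM_ENTITIES: Dict[str, List[str]] = {
    "plan_feature": ["unlimited national calls", "unlimited data plan", "fixed call plan"],
}

# Assumed pattern for a data quantity such as "10GB" or "500 MB".
DATA_PATTERN = re.compile(r"\b(\d+(?:\.\d+)?)\s?(GB|MB)\b", re.IGNORECASE)


def extract_entities(query: str) -> Dict[str, List[str]]:
    """Return the entities found in the query, grouped by entity name."""
    found: Dict[str, List[str]] = {}
    for match in DATA_PATTERN.finditer(query):
        found.setdefault("data_quantity", []).append(match.group(1) + match.group(2).upper())
    lowered = query.lower()
    for name, values in CUSTOM_ENTITIES.items():
        for value in values:
            if value in lowered:
                found.setdefault(name, []).append(value)
    return found


print(extract_entities("What are pre-paid plans with 10GB data and unlimited national calls?"))
# {'data_quantity': ['10GB'], 'plan_feature': ['unlimited national calls']}
```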

2.8 Parts of Speech Determination

The tokens/words are tagged with their parts of speech, such as nouns, verbs, adjectives, adverbs, etc. These tags are used to annotate the query and to give appropriate weightage to each part of speech in the final representation. This influences the responses determined by the NLP engine and results in more relevant matches.
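The weighting idea can be sketched as follows. Everything here is hypothetical: the tiny lookup table stands in for a real part-of-speech tagger, and the weights are invented for illustration since the guide does not publish the values used.

```python
from typing import Dict, List, Tuple

# Stand-in for a real tagger: a tiny hand-made lexicon (illustrative only).
POS_LEXICON: Dict[str, str] = {
    "show": "VERB", "cheap": "ADJ", "offers": "NOUN",
    "recharge": "NOUN", "quickly": "ADV",
}

# Invented weights; the platform's actual weightage is not documented.
POS_WEIGHTS: Dict[str, float] = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.6, "ADV": 0.4}
DEFAULT_WEIGHT = 0.1


def weight_tokens(tokens: List[str]) -> List[Tuple[str, str, float]]:
    """Attach a part-of-speech tag and a weight to every token."""
    weighted = []
    for token in tokens:
        tag = POS_LEXICON.get(token.lower(), "OTHER")
        weighted.append((token, tag, POS_WEIGHTS.get(tag, DEFAULT_WEIGHT)))
    return weighted


print(weight_tokens(["show", "cheap", "offers"]))
# [('show', 'VERB', 0.8), ('cheap', 'ADJ', 0.6), ('offers', 'NOUN', 1.0)]
```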

2.9 Synonyms

Synonyms are alternate words that denote the same object or action. For a bot, the typical use case is domain-specific synonyms that are relevant to your context. Synonyms can also be used to cover common misspellings, abbreviations, and similar variations.

You can add more than one synonym by pressing the tab/enter key after each entry. The system finds matching synonyms and adds them to your list; you can remove these if you don’t find them relevant.
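Conceptually, a synonym list maps alternate words onto one canonical term before matching. The sketch below handles single-word synonyms only; the sample entries and function names are hypothetical and not taken from the platform.

```python
from typing import Dict, List

# Canonical term -> alternate words (illustrative, single-word entries only).
SYNONYMS: Dict[str, List[str]] = {
    "recharge": ["topup", "refill"],
    "prepaid": ["pre-paid", "payg"],
}


def build_lookup(synonyms: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the synonym lists into an alternate -> canonical mapping."""
    lookup: Dict[str, str] = {}
    for canonical, alternates in synonyms.items():
        for alternate in alternates:
            lookup[alternate.lower()] = canonical
    return lookup


def apply_synonyms(tokens: List[str], lookup: Dict[str, str]) -> List[str]:
    """Replace each known alternate word with its canonical term."""
    return [lookup.get(token.lower(), token) for token in tokens]


lookup = build_lookup(SYNONYMS)
print(apply_synonyms(["topup", "offers", "for", "payg"], lookup))
# ['recharge', 'offers', 'for', 'prepaid']
```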

2.10 FAQ Disambiguation

Related Match is a capability which lets the bot provide not just the best response from the training data set but also other options which were under consideration and were close to the top match.

The number of options shown to the user depends on how many trained responses matched the query and on the distribution of their scores. The system also determines the most well-formed statement/phrase from each set of variations and presents that one to the user.
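A simple way to picture score-based selection is to keep every candidate whose score falls within a margin of the top score. The thresholds, the cap on options and the function name below are assumptions for illustration; the guide does not specify the actual rule.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (FAQ text, match score)


def related_matches(scored: List[Scored],
                    min_score: float = 0.5,
                    margin: float = 0.1,
                    max_options: int = 3) -> List[Scored]:
    """Return the top match plus any candidates scoring close to it."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    if not ranked or ranked[0][1] < min_score:
        return []  # nothing matched well enough
    top_score = ranked[0][1]
    close = [pair for pair in ranked if top_score - pair[1] <= margin]
    return close[:max_options]


candidates = [
    ("Prepaid recharge offers", 0.82),
    ("Prepaid data packs", 0.78),
    ("Postpaid bill payment", 0.40),
    ("Prepaid validity extension", 0.75),
]
for text, score in related_matches(candidates):
    print(score, text)
```

With these sample scores, the three prepaid entries are offered and the distant postpaid entry is dropped.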

2.11 Semantic Match

Semantic match is the ability to understand text semantically and then identify the best response based on that understanding. FAQs and documents uploaded into the bot are broken down into multiple logical tokens and then semantically matched for the best response. Engati’s NLP/NLU pipeline also automatically ranks the possible responses to a user query and identifies the best one.
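The ranking step can be illustrated with cosine similarity between vectors. In this sketch, plain word-count vectors stand in for whatever semantic representation the engine actually uses, which the guide does not describe, so it captures word overlap only; the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def vectorise(text: str) -> Counter:
    """Word-count vector; a stand-in for a real semantic representation."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, candidates: List[str]) -> List[Tuple[str, float]]:
    """Score every candidate against the query, best first."""
    query_vec = vectorise(query)
    scored = [(c, cosine(query_vec, vectorise(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank("prepaid recharge offers",
              ["postpaid bill payment", "prepaid recharge offers and plans"])
print(ranked[0][0])  # prepaid recharge offers and plans
```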

In case of issues, feel free to reach out to [email protected].