Jun 15, 2023, Innovations

How to leverage Voice AI for task and process automation

Adrian Głażewski .NET Developer
AI, short for artificial intelligence, is a rapidly advancing field concerned with building computer systems that process information and make decisions in ways similar to humans. With AI, we can increase productivity and efficiency in our operations, optimize processes, analyze data more effectively, improve customer service, and make our company more competitive in the market.

What is Voice AI?

Voice AI is a technology that utilizes artificial intelligence to enable devices to communicate with humans through speech. It allows for intuitive interaction with devices without the need for text input or keyboard usage, making it more accessible and inclusive.

But before we dive into the details of this technology, let’s take a look at the numbers and trends that indicate its undoubtedly growing popularity:

  • The market value of voicebots reached $2.3 billion in 2022 and is projected to exceed $30 billion by 2028 (Grand View Research).

  • 80% of companies plan to implement AI-powered voice bot solutions by the end of 2025 (Oracle).

  • Work productivity can increase by up to 40% through process automation using voicebots and artificial intelligence (Forrester).

  • 85% of businesses believe that implementing business process automation using voice AI technology leads to better financial results and increased market competitiveness (Accenture).

  • Industries leading in the utilization of artificial intelligence for process automation in voice channels include banking, insurance, and services (Deloitte).

Voice AI components

Let’s now dive into the technical aspects and take a closer look at the components of Voice AI, which encompass various technologies that enable machines and computers to process, understand, and generate human natural language.

NLP: Natural Language Processing

NLP uses machine learning techniques to train computer systems to process natural language in a way that achieves desired outcomes. The techniques encompassed by NLP include speech recognition, text recognition, machine translation, sentiment analysis, natural language understanding, and text generation. In other words, thanks to NLP, computer software can process our human way of communication and convert it into data that can be further processed in subsequent stages.

The elements that are part of NLP include:

  • Text Understanding

  • Text Generation

  • Speech-to-Text Conversion

  • Text-to-Speech Conversion

  • Speech Translation

  • Speaker Recognition

  • Emotion Recognition

  • Voice User Interface (VUI)

NLU: Natural Language Understanding

NLU focuses on understanding natural language using machine learning techniques. It is an incredibly important component because it allows artificial intelligence to grasp the context of what a person has said: to recognize the intent and extract entities.

Intent Recognition

Intent recognition involves determining the user’s intent based on their statement or query. In other words, the natural language understanding engine tries to understand what the user wants to achieve, their intention, or goal.

For example, if a user asks about the opening hours of a restaurant, the intent is to obtain information about the hours when the establishment is open.
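
To make the idea concrete, here is a minimal sketch of intent recognition based on keyword overlap. Production NLU engines rely on trained models rather than keyword lists; the intent names and keywords below are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical intents and trigger keywords, invented for this example.
INTENT_KEYWORDS = {
    "opening_hours": {"open", "close", "hours", "opening"},
    "book_table": {"book", "reserve", "reservation", "table"},
}


def recognize_intent(utterance: str) -> str:
    """Return the intent whose keywords overlap most with the utterance."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    scores = {
        intent: len(words & keywords)
        for intent, keywords in INTENT_KEYWORDS.items()
    }
    best_intent = max(scores, key=scores.get)
    # No keyword matched at all: report that the intent is unknown.
    return best_intent if scores[best_intent] > 0 else "unknown"


print(recognize_intent("What are your opening hours on Sunday?"))
```

A keyword list like this breaks down quickly on paraphrases, which is one reason real systems are trained on many example utterances per intent.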

Entity Extraction

Entity extraction is the process of identifying relevant elements in a user’s statement or query – place names, time, dates, phone numbers, etc. For instance, if a user inquires about booking a hotel room, entity extraction can help identify the date, number of guests, or length of stay.
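
As a rough illustration of the hotel example, the sketch below extracts a date, a guest count, and a length of stay with regular expressions. It assumes ISO-formatted dates and digits for numbers, which real callers rarely use; the function name and patterns are our own assumptions, not part of any specific NLU product.

```python
import re


def extract_booking_entities(utterance: str) -> dict:
    """Pull a date, guest count, and length of stay out of a booking request."""
    entities = {}
    # Assumption: dates are given as YYYY-MM-DD.
    date = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", utterance)
    guests = re.search(r"\b(\d+)\s+(?:guests?|people|persons?)\b", utterance, re.I)
    nights = re.search(r"\b(\d+)\s+nights?\b", utterance, re.I)
    if date:
        entities["date"] = date.group(1)
    if guests:
        entities["guests"] = int(guests.group(1))
    if nights:
        entities["nights"] = int(nights.group(1))
    return entities


print(extract_booking_entities(
    "I need a room for 2 guests from 2023-07-01 for 3 nights"
))
```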

There are many providers of intent recognition and entity extraction tools on the market, such as Google Cloud Natural Language, Microsoft Azure Cognitive Services, IBM Watson Natural Language Understanding, or our proprietary solution, Vosito.

Challenges associated with intent recognition:

  • Language complexity and ambiguity

  • Vocabulary and pronunciation

  • Limited training data

NLG: Natural Language Generation

NLG is the technology that generates text in natural language. The market offers both highly complex solutions, like the well-known ChatGPT, and simpler ones that generate text from templates, for example by substituting words in the appropriate places. For instance, if we are creating a taxi booking assistant, this software can generate a confirmation text that may sound like: “Thank you for your order. Your taxi from {pickup location} to {destination location} will be arranged on {date} at {time}.”
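
The template approach can be sketched in a few lines. The function and parameter names are assumptions made for this example; only the wording of the confirmation comes from the taxi scenario above.

```python
# Template with named placeholders that are filled in for each order.
CONFIRMATION_TEMPLATE = (
    "Thank you for your order. Your taxi from {pickup} to {destination} "
    "will be arranged on {date} at {time}."
)


def generate_confirmation(pickup: str, destination: str, date: str, time: str) -> str:
    """Fill the confirmation template with the details of one order."""
    return CONFIRMATION_TEMPLATE.format(
        pickup=pickup, destination=destination, date=date, time=time
    )


print(generate_confirmation("Main Street 5", "the airport", "June 20", "8:30 AM"))
```

Template-based generation is predictable and easy to review, at the cost of sounding repetitive when the same sentence is used in every conversation.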

Although NLG has become more efficient and effective along with advancements in NLP, it still faces certain challenges and difficulties:

  • Naturalness – Generating natural and human-understandable text requires NLG systems to understand the context and grammatical rules of natural language. The complexity of natural language, ambiguous words and meanings, as well as non-standard language usage by different users, pose challenges for NLG systems.

  • Conciseness and coherence – Texts generated by NLG systems need to be concise, coherent, and unambiguous to be easy to read and understand for users.

  • Content appropriateness – Texts generated by NLG systems need to be appropriate to the context in which they are used, tailored to different situations and user requirements.

Speaker Recognition

Another component of Voice AI is the Speaker Recognition system. It is a process that involves analyzing speech characteristics such as tone, tempo, modulation, and intonation, and then comparing these features to a previously registered speaker profile.

This component is becoming increasingly common and finds applications in various fields such as security, video surveillance, telecommunications, and even video games.

We can use it for two primary purposes: identification and verification. Identification is an attempt to match a speaker 1:N, indicating which person is currently speaking from a group of people. Verification, or authentication, is an attempt to match a speaker 1:1, which is used for securing access to a specific system.
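
The difference between the two modes can be sketched with cosine similarity over voice feature vectors. In practice these vectors come from a trained speaker model; the three-element vectors, the names, and the 0.8 threshold below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two non-zero feature vectors, between -1 and 1."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(sample, profile, threshold=0.8):
    """Verification (1:1): does the sample match this one registered profile?"""
    return cosine_similarity(sample, profile) >= threshold


def identify(sample, profiles, threshold=0.8):
    """Identification (1:N): return the best-matching speaker name, or None."""
    best_name = max(
        profiles, key=lambda name: cosine_similarity(sample, profiles[name])
    )
    return best_name if verify(sample, profiles[best_name], threshold) else None


# Hypothetical registered profiles and an incoming voice sample.
PROFILES = {"alice": [0.9, 0.1, 0.3], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.35]

print(identify(sample, PROFILES))
print(verify(sample, PROFILES["bob"]))
```

The threshold controls the trade-off between rejecting the right person and accepting the wrong one, so it would need tuning on real recordings.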

Where to Use NLP Techniques?

Over time, conducting business without using NLP techniques will become increasingly challenging, costly, and less effective.

So, where can we use software that implements NLP techniques?

  • Customer support in the form of chatbots or voicebots – enables conducting specific processes with customers without the need for human involvement.

  • The world of finance – qualitative analysis of financial articles.

  • The world of medicine – for example, improving medical documentation by automatically populating symptoms when registering with a doctor.

  • The world of law – assisting in the analysis of documentation for contextual search or analysis of recorded legal acts.

Of course, NLP is not a one-size-fits-all solution, and sometimes it can even become a problem itself. Therefore, in our daily encounters, we may face the following challenges:

  • People communicate in a colorful manner, using various linguistic constructions that may be incomprehensible to computers.

  • Vocabulary and pronunciation can vary significantly depending on the region, culture, and user style.

  • The effectiveness of NLU systems depends on the quality of the training data used for their learning. Insufficient or inadequate training data can negatively impact the accuracy of intent recognition and entity extraction.

  • Natural language is a dynamic medium, and people constantly introduce new words, phrases, and abbreviations. Applications for intent recognition and entity extraction need to be continuously updated to keep up with changing trends in natural language.

Categories of Voice AI Solutions

TTS - Text to Speech

TTS is a technology used to convert text into speech, utilizing advanced models that allow for the generation of human-like voices. Building such a synthesizer is highly complex as it requires extensive data and significant computational power to generate natural-sounding speech.

However, TTS-related services have become a very attractive offering worldwide, and some studies suggest that users receive audio content better than written content.

Real-Time Sound Generation

There are solutions available that provide APIs for generating real-time sound. Among the leading providers, we can distinguish Google, Amazon, IBM, Microsoft, and Nuance. It is worth mentioning that different providers have varying levels of voice synthesizer development.

One of the notable companies is ElevenLabs, which, in my opinion, has the most advanced synthesizer, with voices that are almost indistinguishable from a human voice. The manufacturer recommends its solutions for generating voices for short video materials on platforms like YouTube, for narration, advertisements, etc. Unfortunately, at the time of writing the offering does not support the Polish language.

Potential Software Solutions

What are some examples of using TTS in the software world? It could be an application that reads the text displayed on the screen and provides voice narration, a system that reads announcements at train stations, airports, or restaurants, or a GPS application. Additionally, TTS can be used in robotics to enable robots to communicate through speech.


When discussing TTS, it is essential to mention SSML (Speech Synthesis Markup Language), a markup language used to provide detailed information about the text that the synthesizer will turn into speech. SSML tags allow programmers and users to specify pronunciation, accents, intonation, pauses, or other speech elements to achieve a more natural and human-like sound. However, not all synthesizers support every tag, which can result in incorrect text processing.
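
Below is a minimal sketch of building an SSML document for a station announcement using Python's standard XML library. The elements `speak`, `say-as`, `break`, and `prosody` come from the SSML standard, but the accepted values of `interpret-as` differ between vendors, so treat "cardinal" as an assumption to check against your synthesizer's documentation. The function name and the announcement text are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_announcement(train: str, platform: str) -> str:
    """Return an SSML string announcing a train departure."""
    speak = ET.Element("speak")
    speak.text = f"Train {train} departs from platform "
    # Ask the synthesizer to read the platform as a number.
    number = ET.SubElement(speak, "say-as", {"interpret-as": "cardinal"})
    number.text = platform
    number.tail = "."
    # Insert a half-second pause before the safety message.
    ET.SubElement(speak, "break", {"time": "500ms"})
    # Read the safety message more slowly than the rest.
    slow = ET.SubElement(speak, "prosody", {"rate": "slow"})
    slow.text = "Please mind the gap."
    return ET.tostring(speak, encoding="unicode")


print(build_announcement("IC 3510", "4"))
```

Building the markup with an XML library instead of string concatenation means characters such as `&` or `<` in the announcement text are escaped automatically.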

Risks Associated with TTS

TTS does not always provide speech quality comparable to human speech – sometimes the speech synthesizer may sound artificial or be difficult to understand. TTS may also struggle to adjust the tone and intonation according to the meaning of the sentence.

ASR - Automatic Speech Recognition

Another component is ASR, a technology that attempts to convert human speech into text, utilizing various elements of artificial intelligence and machine learning.

ASR systems traditionally rely on two trained models: an acoustic model, which maps the audio signal to speech sounds, and a language model, which weighs candidate words and phrases and picks the most probable ones based on context and previous utterances.

Leading ASR providers include Amazon Web Services, Google Cloud Speech-to-Text, Microsoft Azure Speech Services, and IBM Watson Speech to Text.

Potential Challenges

Speech-to-text processing is a complex task, and it comes with several difficulties:

  • Acoustic-technical conditions, such as the environment and ambient sounds, voice reflections, or using an unsuitable microphone.

  • Words that sound very similar, such as “three” and “free” or “meet” and “meat”.

  • A language model that does not match the conversation, for example when the speaker uses complex or specialized vocabulary that the ASR struggles to transcribe accurately.

  • Speaking speed and dialect used by the speaker. Each person speaks differently, and sometimes even humans have difficulty understanding each other. This becomes an even greater challenge for accurate transcription by ASR.
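
To show how context helps with similar-sounding words such as “meet” and “meat”, here is a toy sketch that picks between candidate transcriptions by counting word pairs seen in training text. Real ASR systems use probabilistic or neural language models trained on far larger corpora; the corpus and function names below are made up for this example.

```python
from collections import Counter

# Tiny made-up corpus standing in for real language-model training text.
CORPUS = [
    "nice to meet you",
    "let us meet tomorrow",
    "i would like to meet the manager",
    "we eat meat and fish",
    "the meat is fresh",
]


def train_bigrams(sentences):
    """Count adjacent word pairs, with <s> marking the sentence start."""
    counts = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        counts.update(zip(words, words[1:]))
    return counts


def score(candidate, bigrams):
    """Sum the training counts of the candidate's adjacent word pairs."""
    words = ["<s>"] + candidate.split()
    return sum(bigrams[pair] for pair in zip(words, words[1:]))


def pick_transcription(candidates, bigrams):
    """Choose the candidate transcription that best fits the training text."""
    return max(candidates, key=lambda c: score(c, bigrams))


bigrams = train_bigrams(CORPUS)
print(pick_transcription(["nice to meat you", "nice to meet you"], bigrams))
```

The same mechanism explains the mismatched-model problem from the list above: if the training text never contained the caller's vocabulary, the correct transcription scores poorly.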

Where can ASR be Implemented?

Automatic speech recognition can be implemented in various situations, including:


  1. Automatic transcription, which encompasses several scenarios. For example, generating captions for existing audio content and displaying them together with video material, such as in video games, TV programs, and movies.

  2. Generating transcriptions after events. Because online search predominantly revolves around written content, transcribing audio material makes it easier to find and more accessible.

  3. Real-time transcription, as many ASR providers offer API interfaces, allowing for streaming audio input and receiving immediate transcriptions.

  4. Enhancing support processes through automated customer understanding. ASR systems can improve customer support processes by automatically categorizing issues or collecting basic information. They can also gather customer satisfaction feedback through word analysis.

  5. Agent monitoring, where automatic transcription enables evaluation of how call center agents convey crucial information to customers.

  6. Improved content generation, where ASR lets users dictate SMS messages or emails instead of typing them, making writing more efficient.

Voice User Interface

When talking about Voice AI, it is impossible to overlook applications in the category of NLUI (Natural Language User Interface) that combine the technologies mentioned above.

A natural language user interface is a type of computer interface in which linguistic phenomena such as verbs, phrases, and clauses act as controls for creating, selecting, and modifying data in applications.

NLUI utilizes NLP, NLU, and ASR technologies to enable users to easily and intuitively interact with devices, allowing them to give commands, ask questions, and engage in interactions using natural language instead of specialized commands and graphical interfaces.

Applications of NLUI

NLUI can be applied in a wide range of areas, and the technology is continuously evolving and subject to constant innovation. Some examples include:

  • Voice assistants – providing easy and intuitive device interaction (Siri, Alexa, and Google Assistant).

  • Home automation systems – controlling home automation systems such as lighting, air conditioning, and door locks.

  • Electronic banking systems – enabling users to easily and quickly perform transactions and check balances.

  • Medical systems – facilitating easy and intuitive usage of healthcare-related applications for patients.

Hellobot - our custom solution

Hellobot belongs to the NLUI category of applications. By integrating multiple components (ASR, TTS, and NLP), it supports and automates customer processes using voice.

The solution integrates with other data sources and applications, allowing active collaboration with platforms specified by the customer.

What is the bot deployment process like?

1. The process of building voice automation begins with a thorough analysis of the customer’s needs. Even before the final contract is signed, we engage in discussions with the customer to analyze their current business processes and propose potential solutions.

2. Next, we conduct several workshop meetings with the customer to define the detailed scope in which the bot will be deployed. We clearly define the vision, strategic goals, and outcomes after the implementation process is completed.

3. The next step is solution preparation. We utilize our own tools to meticulously build scenarios, taking into account the smallest details and nuances. This minimizes the possibility of errors or bot behavior that hinders the achievement of set goals. This stage also involves not only creating a functional scenario but also preparing interfaces for integrating the voice bot with the customer’s systems.

4. An important step is preparing our proprietary NLU solution, Vosito, for intent recognition and entity extraction from conversations. After an appropriate training process, the tool can recognize intents and entities in both Polish and English.

5. We begin the process by identifying areas where we need text understanding. We gather information about potential utterances and categorize them by the most frequently occurring intents.

6. Next, we analyze the suggested intents, which must be specific and must not overlap in meaning. Together with the customer, we sort the potential utterances that can be recognized at a given moment, and we train Vosito on them. It is worth noting that training artificial intelligence is a lengthy and continuous process. A model rarely recognizes every intent correctly from the first day of its production deployment. This stage is the best time to improve the tool. It is also important to note that we can never achieve 100% intent recognition accuracy.

7. Manual testing starts while the scenario is still being created. We have several different environments, including one that mirrors the production environment 1:1, where our testers can test scenarios during their development. Of course, this is not always possible, especially if the customer is unable to provide their own test environment. After successful testing, we initiate the deployment process. Throughout each stage, we stay in regular contact with the customer to choose the best solutions. Once the process is completed, the customer receives training on operating the system and access credentials to our panel.

Voicebot vs ChatGPT

Although ChatGPT is a highly advanced tool that understands natural language, processes it, and generates human-like responses, making it well-suited for conversational interactions, it currently cannot replace a voicebot.

First and foremost, we cannot force ChatGPT to follow a conversation flow that we define, which makes it hard to carry out a business process according to a specific plan. Additionally, for security reasons, ChatGPT cannot and should not have access to internal systems.

ChatGPT has other limitations as well. At the time of writing, it only has access to data up to 2021, and the content it generates may sound accurate while being untrue or of low quality. On the other hand, ChatGPT can serve as effective support for a voicebot.

What processes does a voicebot support?

  • A voicebot successfully supports the NPS (Net Promoter Score) process, which involves collecting feedback on experiences such as service visits or phone conversations. In this process, we can ask any number of questions, ranging from ratings on a numerical scale of 1 to 5, through yes/no responses, to open-ended questions.

  • We also handle processes related to appointment confirmation, for example, in a car service center, medical clinic, beauty salon, or government office. Hellobot can conduct a conversation to determine whether the client confirms the appointment or not. If not, subsequent stages of the scenario can involve scheduling a future appointment.

  • Another scenario we can implement is a debt collection bot that makes calls to potential debtors to provide information about their outstanding debts and gather details about potential repayment dates.

  • We can also create a scenario for ordering food, such as pizza, where the voicebot can identify the type of pizza based on the customer’s mentioned ingredients and generate the order. The pizza bot can also handle payment processing.

  • A voicebot doesn’t necessarily have to be limited to phone conversations. Conversations can also be conducted through a web browser.

  • Additionally, we can handle info kiosks, allowing voice-controlled interactions with them.
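
As an illustration of the pizza ordering scenario from the list above, the sketch below matches the ingredients a customer mentions against a menu using Jaccard similarity (shared ingredients divided by all ingredients involved). The menu and function name are hypothetical, and this is only one simple way to do the matching, not a description of how Hellobot works internally.

```python
# Hypothetical menu, invented for this example.
MENU = {
    "margherita": {"tomato", "mozzarella", "basil"},
    "capricciosa": {"tomato", "mozzarella", "ham", "mushrooms"},
    "hawaiian": {"tomato", "mozzarella", "ham", "pineapple"},
}


def identify_pizza(mentioned):
    """Return the menu pizza that best matches the mentioned ingredients."""
    mentioned = {ingredient.lower() for ingredient in mentioned}

    def jaccard(name):
        ingredients = MENU[name]
        return len(mentioned & ingredients) / len(mentioned | ingredients)

    best = max(MENU, key=jaccard)
    # If nothing on the menu shares an ingredient, there is no match.
    return best if jaccard(best) > 0 else None


print(identify_pizza(["ham", "pineapple"]))
```

When the function returns None, or when two pizzas score almost the same, a scenario would typically ask the customer a follow-up question instead of guessing.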

Summing up

Voice AI is a technology that plays an increasingly important role in business and finds applications in various industries, from customer support to education and healthcare. It is a technology with significant potential that can greatly improve business processes and customer interactions. However, the key to success lies in continuous improvement and adaptation to changing customer needs.