
Understanding the general concept behind voice assistant systems like Siri and Cortana

Voice assistant technology in smartphones, known as Siri on the iPhone and Cortana on Windows devices, is something most of us are familiar with. It has most likely captivated you at some point, showcasing just how far technology has come. Anyone who has seen the movie “Her” will understand how impactful this kind of technology can be in our lives. Behind a voice assistant, whether it is Siri or Cortana, lies a great deal of complex work that deserves appreciation. Exploring all the intricacies and technical terms requires expertise, so let me give you a brief overview of how this technology works and convey its general idea, leaving the complexities to the experts. Now, let’s focus together.



To simplify, let’s take Siri on the iPhone as an example. When you speak to Siri, your voice is immediately encoded into a compact digital form, preserving what you said as digital data consisting of 0s and 1s.
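
If you’re curious what that “compact digital form” actually looks like, here is a minimal Python sketch. It is not the codec Apple really uses; it simply quantizes a few made-up microphone readings into 16-bit samples and prints the raw 0s and 1s that would travel over the network.

```python
import struct

# Hypothetical analog microphone readings in the range -1.0 .. 1.0.
analog_samples = [0.0, 0.25, -0.5, 0.9]

# Quantize each reading to a 16-bit integer (standard PCM encoding).
pcm = [int(round(s * 32767)) for s in analog_samples]

# Pack the samples into raw little-endian bytes, then show them as bits.
packed = struct.pack("<%dh" % len(pcm), *pcm)
bits = "".join(format(byte, "08b") for byte in packed)

print(bits)  # your speech, now just a stream of 0s and 1s
```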

  • This digital signal is transmitted wirelessly from your network-connected device to the nearest cell tower, then through a series of landlines to your Internet Service Provider (ISP), and finally to a server in the cloud.
  • On the server, this digital signal is fed through a series of models designed to recognize the language you used, while at the same time your speech is segmented and evaluated locally on your device.
  • Your device has a local recognizer that works with the cloud to determine whether the command you spoke can be executed on the device itself or whether it needs a network connection.
  • For instance, if you ask it to play a song stored on your phone, that task can easily be executed locally on your device, unlike asking it to make a restaurant reservation or search for something, in which case it recognizes that it needs the network. If the local recognizer determines that the command can be handled on your device and doesn’t require the cloud, it signals that the cloud is unnecessary this time and doesn’t connect to it (a rough sketch of this routing appears after this list).
  • When your voice reaches the server, it is compared against a statistical model to estimate what was said: the vocabulary you used, the commands you gave, and the letters that make up those sounds.
  • Simultaneously, the local recognizer compares the same sounds against a compact version of that statistical model to speed things up and save time. Of the two results, the one with the higher probability of being correct is the one that gets used.
  • Your speech is interpreted as a series of vowels and consonants and passed through a language analyzer, which estimates the words you spoke (a toy illustration of this matching appears after this list).
  • The system then generates a ranked shortlist of candidate interpretations of what you said, i.e., the word sequences you most likely meant.
  • If there is enough confidence in the result, the system determines the exact task to be executed. For example, if you intend to send a text message to Ahmed Hussein, the device pulls the name “Ahmed Hussein” from your contact list, you then dictate the message you want to send, and like magic you see your message displayed on the screen without any manual effort, just your voice. If there is any ambiguity along the way, the device goes back to that point and asks, for example, whether you meant Ahmed Hussein or Ahmed Hassan (a sketch of this confidence check and follow-up question appears below).
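
To make the local-versus-cloud routing and the “highest probability wins” idea from the list above more concrete, here is a rough Python sketch. The command list, the probabilities, and the function names are all invented for illustration; this is not the real Siri or Cortana logic.

```python
# Commands assumed (for this sketch) to be executable entirely on the device.
LOCAL_COMMANDS = {"play song", "set alarm", "open app"}

def route(command: str) -> str:
    """Decide whether a recognized command runs on the device or in the cloud."""
    return "device" if command in LOCAL_COMMANDS else "cloud"

def pick_best(local_guess: tuple, cloud_guess: tuple) -> str:
    """Keep whichever recognizer's hypothesis has the higher probability."""
    text, _ = max(local_guess, cloud_guess, key=lambda guess: guess[1])
    return text

# Hypothetical outputs of the on-device and cloud recognizers for one utterance.
best = pick_best(("play song", 0.72), ("play some", 0.65))
print(best, "->", route(best))  # play song -> device: the cloud isn't needed this time
```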
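
The statistical matching step, where sounds get mapped to the most likely words in a vocabulary, can be pictured with this toy example. Real systems use acoustic and language models trained on enormous amounts of speech; the standard-library difflib matcher below is only a stand-in for the idea of ranking candidates by similarity, and both the vocabulary and the misheard input are made up.

```python
import difflib

# A tiny vocabulary the recognizer is assumed to know.
VOCABULARY = ["play", "pay", "call", "tell", "message", "reserve", "restaurant"]

# A rough rendering of the vowels and consonants the user actually produced.
heard = "plei"

# Rank vocabulary words by similarity to what was heard and keep the best ones.
candidates = difflib.get_close_matches(heard, VOCABULARY, n=3, cutoff=0.4)
print(candidates)  # ['play'] -- the closest vocabulary word wins
```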

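Finally, here is an equally simplified sketch of the last step: the recognizer hands back a ranked list of interpretations with confidence scores, and the assistant only acts when the top one is confident enough; otherwise it asks the “Ahmed Hussein or Ahmed Hassan?” question. The threshold and scores are invented, not real Siri or Cortana values.

```python
CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off for acting without asking

def handle_send_message(candidates):
    """candidates: list of (recipient name, confidence), best first."""
    name, score = candidates[0]
    if score >= CONFIDENCE_THRESHOLD:
        return f"Sending the message to {name}."
    # Not confident enough: go back to this point and ask the user.
    options = " or ".join(name for name, _ in candidates[:2])
    return f"Did you mean {options}?"

print(handle_send_message([("Ahmed Hussein", 0.93), ("Ahmed Hassan", 0.41)]))
print(handle_send_message([("Ahmed Hussein", 0.55), ("Ahmed Hassan", 0.52)]))
```
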
I didn’t mean to make this lengthy or bury you in details; I just wanted to convey the general idea simply, and I hope it came across effectively.
