Human-machine interface (HMI) technology has made great strides in the past ten to fifteen years: until the early 2000s, color displays and touch screens on embedded devices (first hand-held phones, then smartphones) were certainly not widespread, let alone reasonably priced. With improved processing performance, reduced costs, and the emergence of new communication technologies, devices have appeared that can translate what users want into a command. Until a few years ago, devices able to connect to the cloud and allow remote control of IoT devices through voice commands (like Amazon's Alexa) were just science fiction. Today, smart sensors and smart audio devices make it easy to build your own voice-controlled personal assistant. This article will guide you through choosing the best components to design your own version of Alexa.

System Overview

A digital voice assistant is an electronic device capable of performing the following steps:

- capture a voice message and convert it into an audio stream;
- process the audio stream through complex algorithms and interpret it, uniquely linking the command to an action to execute;
- run the action and produce a voice feedback message.

Behind all this lies a whole gamut of hardware and software technologies. Figure 1 shows the voice assistant block diagram.

Figure 1: Schematic diagram of the voice assistant (Source: author)

As in any other communication channel, this one includes a source signal, an acquisition and transmission system, an encoder, a processing system, a decoder, and an output emission system. In a voice assistance system, the source signal is an audio message: a mechanical wave produced by our vocal cords that propagates through the air (the communication medium) as a vibration. The vibration is captured by a microphone, which acts as a signal transducer. The signal is then conditioned and encoded for processing.
At this point, the encoded audio stream can be processed locally (by a microcontroller or microprocessor) or sent to the cloud, where voice-recognition algorithms and artificial intelligence can process it more effectively. The processing output is then handed over to the system that executes the command. For audio feedback, the path is similar but reversed: the stream is decoded and sent to an amplifier, which reproduces the sound through the loudspeaker.

System Components

Designing a voice assistance system from scratch is a complex task. Until a few decades ago, this was only possible for teams of engineers with great skills in audio design. Nowadays, we are fortunate enough to be able to rely on a series of hardware and software components that make the task much simpler. One of the most critical parts of this project is the audio acquisition and reproduction sections, which require excellent knowledge of the application field as well as analog electronics skills. To simplify the task, we can use digital transducers, which integrate the necessary analog components and transmit data that is already digitally encoded.

For the input section, we can use the INMP441 omnidirectional microphone, which is built with MEMS (microelectromechanical systems) technology and implements an I2S digital interface for data exchange. In this way, signal-conditioning problems are avoided, the interface with the processing unit is not affected by noise, and the signal to be processed is already in digital format.

For the audio output stage, we can use the MAX98357A, a 3 W Class D mono amplifier, also equipped with an I2S interface. As with the input stage, the hardware design is greatly simplified: the amplifier receives digital samples on its I2S interface, decodes them, and reproduces them as a voltage on its bridged output pins, which connect directly to the speaker.
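One practical detail worth noting: the INMP441 outputs 24-bit samples packed into 32-bit I2S frames, while a simple pipeline often works with 16-bit PCM. A minimal down-conversion sketch in plain C (the function names are illustrative, not from the article):

```c
#include <stdint.h>
#include <stddef.h>

/* Convert one raw 32-bit I2S frame from the microphone to a 16-bit PCM
 * sample. The data bits sit in the upper part of the 32-bit frame, so
 * keeping the 16 most significant bits discards only the low padding
 * and least significant audio bits. */
static int16_t mic_frame_to_pcm16(int32_t raw)
{
    return (int16_t)(raw >> 16);
}

/* Convert a buffer of n raw frames into 16-bit PCM samples. */
static void convert_buffer(const int32_t *in, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = mic_frame_to_pcm16(in[i]);
}
```

Note that the shift assumes the common left-justified frame layout; check the microphone datasheet and your driver configuration before relying on it.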
At this point, the last device to select is the microcontroller, which must be able to process (or send to the cloud) the audio stream coming from the microphone, and send the audio stream to be reproduced by the loudspeaker. The most suitable MCU for this purpose is the ESP32 module from Espressif (preferably the WROVER module, with 8 MB of RAM and up to 16 MB of flash memory). With rich connectivity (BLE and Wi-Fi) and high performance combined with a very low price, this unit is the right choice for smart home applications. It also features two independent I2S interfaces, which perfectly fit the purpose of this project. Figure 2 shows the three main components used in this project.

Figure 2: MAX98357A (left), ESP32-WROVER (middle), INMP441 (right) (Source: Web)

I2S communication bus

The I2S interface is a key point of the project, because it keeps both the hardware and the software simple, freeing designers and developers from a whole series of problems related to analog acquisition and reproduction. I2S stands for Inter-IC Sound, an electrical standard for serial interfaces created to connect digital audio devices. It is used in audio applications to transfer PCM audio samples between integrated circuits, as in our case (MCU and microphone/loudspeaker). I2S was introduced in 1986 by Philips Semiconductors (now NXP Semiconductors). The I2S bus is synchronous: it provides a clock signal on a line separate from the data line, so the receiver does not have to recover the clock from the data stream, which makes reception much simpler than with asynchronous interfaces. It also provides up to two channels (left and right) multiplexed on the same data line.
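Because both channels share one data line, a stereo capture arrives as interleaved samples that follow the word select line. A small plain-C sketch (names are illustrative) of splitting such a buffer back into left and right channels:

```c
#include <stdint.h>
#include <stddef.h>

/* An interleaved I2S capture buffer holds samples as L, R, L, R, ...
 * according to the word select line. This splits `frames` stereo
 * frames into separate left and right buffers. */
static void deinterleave_stereo(const int16_t *in, int16_t *left,
                                int16_t *right, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        left[i]  = in[2 * i];     /* WS low: left channel  */
        right[i] = in[2 * i + 1]; /* WS high: right channel */
    }
}
```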
I2S includes at least the following three lines:

- Bit clock line (BCLK): marks the bit time and allows synchronization between the connected devices.
- Word select line (WS), also called left/right clock (LRCLK): multiplexes the two channels, with the left channel transmitted when WS is low and the right channel when WS is high; it therefore appears as a square wave with a 50% duty cycle.
- Data line (SD): carries the PCM samples, multiplexed according to the WS state. The data is encoded in 2's complement.

The clock switching frequency (fck) cannot be chosen arbitrarily: it is derived from the sampling frequency of the input signal (fs), the number of channels (nch), and the number of bits per sample (nbit):

fck = fs * nbit * nch

For example, to transmit two data streams sampled at 8 kHz, where each sample contains 12 bits, we need fck = 8000 * 12 * 2 = 192,000 cycles per second. Figure 3 shows an example timing diagram for the bus lines.

Figure 3: I2S bus timing diagram (Source: hackaday.com)

Electrical connections and software snippets

Figure 4 shows the connections between the three main components of the system. The wiring is easy to follow: it is enough to connect the bit clock, word select, and data lines between the microphone/amplifier and the MCU. All analog conditioning, filtering, and amplification are integrated into the digital devices.

Figure 4: Wiring of the digital components (Source: author)

Finally, Figure 5 shows the configuration of the structures needed to use the I2S bus correctly in the ESP32's ESP-IDF environment. The configuration samples the input signal at 8 kHz with 16 bits per sample on a single channel, for both input and output (there is one microphone and a single-channel output).
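A configuration along the lines of Figure 5, written against the ESP-IDF legacy I2S driver, might look like the sketch below. Treat it as an assumption-laden illustration rather than the article's exact code: the pin numbers are arbitrary placeholders that must match your wiring, and DMA buffer sizes are typical defaults.

```c
#include "driver/i2s.h"

/* I2S RX configuration for the microphone: 8 kHz, 16 bits per sample,
 * single (left) channel, as described in the article. */
void i2s_mic_init(void)
{
    i2s_config_t cfg = {
        .mode = I2S_MODE_MASTER | I2S_MODE_RX,
        .sample_rate = 8000,
        .bits_per_sample = I2S_BITS_PER_SAMPLE_16BIT,
        .channel_format = I2S_CHANNEL_FMT_ONLY_LEFT,
        .communication_format = I2S_COMM_FORMAT_STAND_I2S,
        .intr_alloc_flags = 0,
        .dma_buf_count = 4,
        .dma_buf_len = 256,
    };
    i2s_pin_config_t pins = {
        .bck_io_num = 26,               /* BCLK: illustrative pin choice */
        .ws_io_num = 25,                /* WS / LRCLK */
        .data_out_num = I2S_PIN_NO_CHANGE,
        .data_in_num = 33,              /* SD from the microphone */
    };
    i2s_driver_install(I2S_NUM_0, &cfg, 0, NULL);
    i2s_set_pin(I2S_NUM_0, &pins);
}
```

The amplifier side is configured symmetrically on the second I2S peripheral (I2S_NUM_1) with I2S_MODE_TX and the data pin routed to the MAX98357A's DIN.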
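The bit clock relation fck = fs * nbit * nch is easy to encode and check; a small helper (illustrative, not from the article) reproducing the worked example:

```c
#include <stdint.h>

/* Bit clock frequency for an I2S link: sampling rate times bits per
 * sample times number of channels. */
static uint32_t i2s_bclk_hz(uint32_t fs, uint32_t nbit, uint32_t nch)
{
    return fs * nbit * nch;
}
```

For the article's example, `i2s_bclk_hz(8000, 12, 2)` yields 192000 Hz; for this project's single-channel, 16-bit, 8 kHz stream, it yields 128000 Hz.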
Figure 5: An example of an I2S bus configuration (Source: author)

One project, a thousand possibilities

This project is a starting point for any device that needs audio input, amplified output, and a processing and communication system: a similar platform can serve a large number of applications, such as smart home assistants, audio players, environmental alarm systems, baby monitors, and many more. What kind of application are you going to use this platform for?