Toshiba's Speech Recognition AI Technology Delivers User-specific Operation of Home Appliances

Toshiba Corp. has developed an AI technology that can bring fast recognition of speakers and keywords to all kinds of electronic products, with no need for internet connectivity or cloud resources for processing.

Home appliances integrating the technology will be able to register individual speakers with only three utterances, and to adjust operation in response to voice commands.

For example, an air-conditioner set in operation by a voice command will also adjust its temperature setting to suit the user who made the command.

Keyword detection and speaker recognition both require large numbers of calculations, which are typically executed remotely on a cloud platform or on high-end devices such as smartphones. Making such capabilities a native feature of home appliances and other devices requires high-speed AI technology that can be embedded in the devices themselves. Toshiba says that its new AI can quickly execute keyword detection and speaker recognition at the same time, without any need for network connectivity or remote processing power.

The technology has two core features.

The first feature is the use of intermediate outputs from keyword detection for effective speaker registration and recognition. The AI must first detect keywords by separating ambient noise from audio information. Its neural network does this by processing spoken input while absorbing the effects of ambient noise. Speaker registration and recognition are performed using the intermediate outputs of the same neural network, an approach that suppresses the effects of ambient noise on speaker recognition and also reduces the time required to recognize the speaker. This enables high-speed operation even with constrained resources.
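The idea of reusing intermediate outputs can be illustrated with a minimal sketch. It is not Toshiba's implementation: the layer sizes, the random weights standing in for a trained network, and the names run_utterance, make_layer and cosine are all assumptions made for illustration. The point it shows is that a single pass through the shared hidden layer yields both keyword scores and a speaker embedding.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weight matrix (n_out rows of n_in weights); a stand-in for trained weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward(layer, x):
    """One dense layer followed by tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in layer]

def run_utterance(frames, hidden_layer, keyword_layer):
    """Run every audio frame through the shared hidden layer exactly once.

    Returns (keyword_scores, speaker_embedding). The keyword scores come from
    the output head, averaged over frames. The speaker embedding is the average
    of the hidden (intermediate) activations, so it costs no extra forward pass.
    """
    hidden = [forward(hidden_layer, f) for f in frames]
    n = len(hidden)
    embedding = [sum(h[i] for h in hidden) / n for i in range(len(hidden[0]))]
    outputs = [forward(keyword_layer, h) for h in hidden]
    scores = [sum(o[k] for o in outputs) / n for k in range(len(outputs[0]))]
    return scores, embedding

def cosine(a, b):
    """Cosine similarity, a common (assumed) way to compare speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Toy input: 5 frames of 4 made-up acoustic features each.
rng = random.Random(0)
hidden_layer = make_layer(4, 8, rng)
keyword_layer = make_layer(8, 2, rng)
frames = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
scores, embedding = run_utterance(frames, hidden_layer, keyword_layer)
```

Because the embedding is a by-product of the keyword pass, speaker recognition in this sketch adds only an averaging step and a similarity comparison, which is the kind of saving that matters on an embedded device.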

The second feature is the use of a data expansion methodology in the neural network. Data expansion is a method for learning from small amounts of data, in this case spoken utterances. By randomly assigning zero weight to connections between neural network nodes, simulated voice information can be generated, as if a speaker had spoken in various ways. Successful identification of individuals depends on the AI learning from their speech samples, and this method can recognize particular speakers even when only a small number of utterances are available. Toshiba has reduced the number of required speech samples to the point where the new AI technology can complete user registration with only three utterances.
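The data expansion step described above can be sketched as follows. This is an illustrative guess at the mechanism, not Toshiba's published method: the single-layer network, the drop probability of 0.2, the ten variants per utterance, and the names drop_connections, expand_utterance and enroll are all assumptions. Each utterance is passed through several copies of the network in which a random subset of connections has been zeroed, and the resulting variants are averaged into one speaker model.

```python
import math
import random

def drop_connections(weights, drop_prob, rng):
    """Copy the weight matrix, zeroing each connection with probability drop_prob."""
    return [[0.0 if rng.random() < drop_prob else w for w in row] for row in weights]

def embed(weights, features):
    """One dense layer with tanh; stands in for the embedding part of the network."""
    return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]

def expand_utterance(weights, features, n_variants, drop_prob, rng):
    """Generate n_variants simulated embeddings of a single utterance."""
    return [embed(drop_connections(weights, drop_prob, rng), features)
            for _ in range(n_variants)]

def enroll(weights, utterances, n_variants, drop_prob, rng):
    """Average all simulated embeddings from a few utterances into one speaker model."""
    variants = []
    for features in utterances:
        variants.extend(expand_utterance(weights, features, n_variants, drop_prob, rng))
    dim = len(variants[0])
    return [sum(v[i] for v in variants) / len(variants) for i in range(dim)]

# Toy enrollment: three utterances, each reduced to 4 made-up features.
rng = random.Random(0)
weights = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
three_utterances = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
speaker_model = enroll(weights, three_utterances, n_variants=10, drop_prob=0.2, rng=rng)
```

In this sketch three real utterances produce thirty simulated embeddings, so the speaker model is built from more variation than the raw recordings alone would provide.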

A comparative evaluation based on three utterances per registered speaker found that Toshiba's method achieved an identification accuracy of 89% for 100 people, while i-vector, a commonly used method for speaker recognition, reached only 71%. As devices such as home appliances are expected to have five to 10 registered speakers at most, this level of performance is considered sufficient for practical application. Furthermore, the amount of computation and the processing speed were measured on a server, and the results confirmed that neither would be a problem, even in an embedded system.
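For readers unfamiliar with the metric, identification accuracy in a closed-set test is simply the fraction of test utterances assigned to the correct registered speaker. The sketch below shows one common way to compute it, assuming cosine similarity as the scoring rule; the article does not say how Toshiba scored its trials. The speaker names and two-dimensional embeddings are toy values invented for illustration and have no relation to the reported 89% and 71% figures.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(embedding, enrolled):
    """Return the registered speaker whose model is most similar to the embedding."""
    return max(enrolled, key=lambda name: cosine(embedding, enrolled[name]))

def identification_accuracy(trials, enrolled):
    """Fraction of (true_speaker, embedding) trials identified correctly."""
    correct = sum(1 for true_name, emb in trials if identify(emb, enrolled) == true_name)
    return correct / len(trials)

# Toy data: two registered speakers and three test utterances (hypothetical values).
enrolled = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
trials = [
    ("alice", [0.9, 0.1]),  # closest to alice: correct
    ("bob", [0.2, 0.8]),    # closest to bob: correct
    ("alice", [0.1, 0.9]),  # closest to bob: wrong
]
accuracy = identification_accuracy(trials, enrolled)  # 2 of 3 trials correct
```

With more registered speakers there are more candidates to confuse, which is why accuracy measured over 100 people is a stricter test than the five to 10 speakers a household device would typically hold.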

As its next step, Toshiba will work toward incorporating the technology in embedded systems and investigating its utility in home appliances and other use cases. The company is also reviewing the opportunity to develop new services, such as application in the communication AI "RECAIUS™" developed by Toshiba Digital Solutions Corporation.