The Intelligent Voice system is a batch-processing system designed to process hours of audio recordings as efficiently as possible. We also offer systems designed for real-time transcription, low-latency keyword spotting, IoT hub devices, embedded applications, and more; for details on running these solutions on Azure, please contact us.
Single VM for Evaluation / Lab Use
A single VM is recommended for evaluation and lab use.
For a basic installation of Intelligent Voice 6 with GPU acceleration on Azure, we recommend a single GPU instance type with 128 GB RAM and 4 CPUs. The minimum storage requirement is 500 GB.
Operating systems supported:
- Red Hat Enterprise Linux 9 (recommended) or 8
- Ubuntu 22.04 LTS (recommended) or 20.04 LTS
- Oracle Linux 8
To install on a single VM, the Standard_NC16as_T4_v3 size is recommended. Other suitable sizes include Standard_NC64as_T4_v3 and any NCv2-series, NCv3-series, or ND A100 v4-series instance.
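As a starting point, a single evaluation VM can be provisioned with the Azure CLI along these lines. This is a sketch only: the resource group, VM name, admin username, and RHEL image URN are placeholder assumptions to be replaced with your own values.

```shell
# Sketch: provision a single evaluation VM (resource group, names and
# image URN are placeholders -- substitute your own values).
az group create --name iv-eval-rg --location uksouth

az vm create \
  --resource-group iv-eval-rg \
  --name iv-eval-vm \
  --size Standard_NC16as_T4_v3 \
  --image RedHat:RHEL:9-lvm-gen2:latest \
  --os-disk-size-gb 512 \
  --admin-username azureuser \
  --generate-ssh-keys
```

The `--os-disk-size-gb 512` value satisfies the 500 GB minimum storage requirement noted above.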
Multiple VMs for Production Use
Installing on multiple VMs is recommended for production use, to improve resilience and scalability, and to reduce costs.
Production Deployment Example for 10,000+ Hours a Day
An example of a full production system deployment with autoscaling.
This system uses a single application server VM, with Virtual Machine Scale Sets deploying VM images from a private Azure Compute Gallery. The current state of the IV job queue is sent to Application Insights using Telegraf, and scaling rules are created based on the number of queued jobs.
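The scale set and autoscale wiring described above might be sketched as follows. All resource names, the gallery image ID, instance counts, and the metric name `iv_queue_jobs` are illustrative assumptions, not part of the product; the exact autoscale condition for a custom Application Insights metric will depend on how Telegraf publishes it.

```shell
# Sketch: create a scale set from a private Azure Compute Gallery image
# (all resource names and the gallery image ID are placeholders).
az vmss create \
  --resource-group iv-prod-rg \
  --name iv-asr-vmss \
  --image "/subscriptions/<sub-id>/resourceGroups/iv-prod-rg/providers/Microsoft.Compute/galleries/ivGallery/images/iv-asr/versions/latest" \
  --vm-sku Standard_NC16as_T4_v3 \
  --instance-count 0

# Attach an autoscale profile and a scale-out rule driven by the queue
# metric emitted via Telegraf ("iv_queue_jobs" is a hypothetical name).
az monitor autoscale create \
  --resource-group iv-prod-rg \
  --resource iv-asr-vmss \
  --resource-type Microsoft.Compute/virtualMachineScaleSets \
  --name iv-asr-autoscale \
  --min-count 0 --max-count 50 --count 0

az monitor autoscale rule create \
  --resource-group iv-prod-rg \
  --autoscale-name iv-asr-autoscale \
  --condition "iv_queue_jobs > 10 avg 5m" \
  --scale out 5
```

A matching `--scale in` rule with a lower threshold would normally be added so idle GPU instances are released.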
The database and file store can optionally use Azure Database for MariaDB and Azure Storage Accounts, to support high availability configurations and/or cross-region replication.
This solution is ideal for processing 10,000 - 100,000 hours of audio per day. To scale up to 1,000,000 audio hours per day or more, run multiple application servers behind a load balancer, or shard traffic across multiple IV systems.
The diagram below shows how to configure the system with 9 VM scale sets supporting all optional features.
If you don't use some of the features, they don't need to be installed. For example, a Relativity integration does not require Sentiment, Voice Biometrics or Text from Video.
Note the following dependencies:
- ASR, Diarization and Voice Biometrics require VAD
- Summarization and Sentiment require ASR
- Tagger requires ASR or VideoOCR
- Voice Biometrics requires Elasticsearch
- JumpToWeb requires Sphinxsearch (on app server)
App Server
The size of the VM should be chosen according to the required features and expected traffic. Example sizes:
- Minimal: Standard_B8ms (8 vCPUs, 32 GiB memory)
- 10,000 audio hours per day: Standard_D16ads_v5 (16 vCPUs, 64 GiB memory)
- 100,000 audio hours per day: Standard_D64ads_v5 (64 vCPUs, 256 GiB memory)
Azure images: Red Hat Enterprise Linux (version 9 recommended, version 8 supported) or Ubuntu (22.04 LTS recommended, 20.04 LTS supported)
OS Disk: 500 GB+ Premium SSD LRS. The disk must be partitioned to leave sufficient space under /var/ for container images (320 GB if all features are installed).
Mount a data disk, share or container on the filesystem under /data (see the Microsoft guides "Mount SMB Azure file share on Linux" or "How to mount Blob storage as a file system with BlobFuse").
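For the SMB file share option, the mount might look like the sketch below, following the Microsoft guide referenced above. The storage account name, share name, and credentials file path are placeholders.

```shell
# Sketch: mount an Azure file share at /data (storage account, share name
# and credentials file are placeholders).
sudo mkdir -p /data
sudo mount -t cifs //mystorageacct.file.core.windows.net/ivshare /data \
  -o credentials=/etc/smbcredentials/mystorageacct.cred,serverino,nosharesock,actimeo=30

# Or add an /etc/fstab entry so the mount survives reboots:
# //mystorageacct.file.core.windows.net/ivshare /data cifs nofail,credentials=/etc/smbcredentials/mystorageacct.cred,serverino,nosharesock,actimeo=30 0 2
```

Keeping the credentials in a root-readable file rather than on the mount command line avoids exposing the storage key in the process list.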
Azure Database for MariaDB
Memory optimised tier, Compute Gen 5, with 2 vCores and 20 GB RAM.
This is suitable for 10,000-50,000 hours per day; use larger instance sizes to scale up.
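A server of this size could be created with the Azure CLI as sketched below. The server name (which must be globally unique), admin credentials, and resource group are placeholders; the SKU name `MO_Gen5_2` follows the Azure Database for MariaDB convention (Memory Optimized, Gen 5, 2 vCores).

```shell
# Sketch: create the managed MariaDB server (names and password are
# placeholders; MO_Gen5_2 = Memory Optimized, Gen 5, 2 vCores / 20 GB RAM).
az mariadb server create \
  --resource-group iv-prod-rg \
  --name iv-mariadb \
  --location uksouth \
  --admin-user ivadmin \
  --admin-password '<strong-password>' \
  --sku-name MO_Gen5_2
```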
ASR, Diarization and Sentiment Images
These should use the lowest spot-price T4 instance type if available in your region, or otherwise the lowest spot-price instance from the other compatible GPU types listed above.
Tagger Images
Tagger VMs should all be the lowest full-utilization (not burst) x86-64 spot price instances meeting the minimum requirement of 2 vCPUs and 16 GB RAM. Typically this will be:
Standard_E2as_v5
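A spot-priced scale set for workers like Tagger might be created as in this sketch. The resource names and gallery image ID are placeholders; `--max-price -1` caps the spot price at the pay-as-you-go rate so instances are evicted on capacity rather than price.

```shell
# Sketch: worker scale set on spot instances (names and image ID are
# placeholders).
az vmss create \
  --resource-group iv-prod-rg \
  --name iv-tagger-vmss \
  --image "<gallery-image-id>" \
  --vm-sku Standard_E2as_v5 \
  --priority Spot \
  --eviction-policy Delete \
  --max-price -1 \
  --instance-count 0
```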
Elasticsearch
Always on; only required for the Voice Biometrics feature.
Lowest-cost x86-64 spot price instances meeting the minimum requirement of 2 vCPUs and 16 GB RAM. Typically this will be:
Standard_E2as_v5
All other images
VAD, Voice Biometrics, Video OCR, LM Builder and LexiQal Credibility images should all be the lowest compute-optimised or general purpose (not burst) x86-64 spot price instances meeting the minimum requirement of 2 vCPUs and 4 GB RAM. Typically this will be:
Standard_F2s_v2
or
Standard_D2as_v5
In some regions other instance types might have lower spot prices.
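Current spot prices per region can be checked programmatically against the public Azure Retail Prices API, which needs no authentication. The region and SKU below are example values.

```shell
# Sketch: query current Linux spot prices for one size in one region via
# the Azure Retail Prices API (region and SKU are example values).
curl -sG "https://prices.azure.com/api/retail/prices" \
  --data-urlencode "\$filter=armRegionName eq 'uksouth' and armSkuName eq 'Standard_F2s_v2' and priceType eq 'Consumption' and contains(meterName, 'Spot')"
```

Comparing the returned `retailPrice` values across candidate SKUs makes it easy to confirm which instance type is actually cheapest in your region before committing to one.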
Azure Reference Deployment Architecture and Best Practices