Voice Connections
Voice connections operate in a similar fashion to the Gateway connection. However, they use a different set of payloads and a separate UDP-based connection for RTC data transmission. Because UDP is generally used for both receiving and transmitting RTC data, your client must be able to receive UDP packets, even through a firewall or NAT (see UDP Hole Punching for more information). The Discord voice servers implement functionality (see IP Discovery) for discovering the local machine's remote UDP IP/Port, which can assist in some network configurations. If you cannot support a UDP connection, you may implement a WebRTC connection instead.
Audio and video from a "Go Live" stream require a separate connection to another voice server. Only microphone and camera data are sent over the normal connection.
Voice Gateway
To ensure that you have the most up-to-date information, please use version 9. Otherwise, the events and commands documented here may not reflect what you receive over the socket. Video is only fully supported on Gateway v5 and above.
Gateway Versions
| Version | Status | Change |
|---|---|---|
| 9 | Recommended | Added channel_id to Opcode 0 Identify and Opcode 7 Resume |
| 8 | Recommended | Added buffered resuming |
| 7 | Available | Added Opcode 17 Channel Options Update |
| 6 | Available | Added Opcode 16 Voice Backend Version |
| 5 | Available | Added Opcode 15 Media Sink Wants |
| 4 | Available | Changed speaking status from boolean to bitmask |
| 3 | Deprecated | Added video functionality, consolidated Opcode 8 Hello payload |
| 2 | Deprecated | Changed Gateway heartbeat reply to Opcode 6 Heartbeat ACK |
| 1 | Deprecated | Initial version |
Gateway Commands
| Name | Description |
|---|---|
| Identify | Start a new voice connection |
| Resume | Resume a dropped connection |
| Heartbeat | Maintain an active WebSocket connection |
| Media Sink Wants | Indicate the desired media stream quality |
| Select Protocol | Select the voice protocol and mode |
| Session Update | Indicate the client's supported codecs |
| Speaking | Indicate the user's speaking state |
| Voice Backend Version | Request the current voice backend version |
Gateway Events
| Name | Description |
|---|---|
| Hello | Defines the heartbeat interval |
| Heartbeat ACK | Acknowledges a received client heartbeat |
| Clients Connect | A user connected to voice, also sent on initial connection to inform the client of existing users |
| Client Flags | Contains the flags of a user that connected to voice, also sent on initial connection for each existing user |
| Client Platform | Contains the platform type of a user that connected to voice, also sent on initial connection for each existing user |
| Client Disconnect | A user disconnected from voice |
| Media Sink Wants | Requested media stream quality updated |
| Ready | Contains SSRC, IP/Port, experiment, and encryption mode information |
| Resumed | Acknowledges a successful connection resume |
| Session Description | Acknowledges a successful protocol selection and contains the information needed to send/receive RTC data |
| Session Update | Client session description changed |
| Speaking | User speaking state updated |
| Voice Backend Version | Current voice backend version information, as requested by the client |
Connecting to Voice
Retrieving Voice Server Information
The first step in connecting to a voice server (and in turn, a guild's voice channel or private channel) is formulating a request that can be sent to the Gateway, which will return information about the voice server we will connect to. Because Discord's voice platform is widely distributed, users should never cache or save the results of this call. To inform the Gateway of our intent to establish voice connectivity, we first send an Update Voice State payload.
If our request succeeded, the Gateway will respond with two events—a Voice State Update event and a Voice Server Update event—meaning you must properly wait for both events before continuing. The first will contain a new key, session_id, and the second will provide voice server information we can use to establish a new voice connection.
With this information, we can move on to establishing a voice WebSocket connection.
When changing channels within the same guild, it is possible to receive a Voice Server Update with the same endpoint as the existing session. However, the token will be changed and you cannot re-use the previous session during a channel change, even if the endpoint remains the same.
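As a minimal sketch of waiting for both events, the hypothetical `PendingVoiceInfo` class below collects the three values and reports when they are all present. The class and method names are illustrative and not part of any Discord library; the `session_id`, `token`, and `endpoint` keys are the ones described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingVoiceInfo:
    """Collects the fields needed before opening the voice WebSocket."""

    session_id: Optional[str] = None
    token: Optional[str] = None
    endpoint: Optional[str] = None

    def on_voice_state_update(self, data: dict) -> None:
        # The Voice State Update event carries the session_id.
        self.session_id = data["session_id"]

    def on_voice_server_update(self, data: dict) -> None:
        # Always replace the token, even if the endpoint is unchanged:
        # a previous session cannot be re-used after a channel change.
        self.token = data["token"]
        self.endpoint = data["endpoint"]

    def ready(self) -> bool:
        # Both events must have arrived before connecting.
        return None not in (self.session_id, self.token, self.endpoint)
```

Because the two events can arrive in either order, checking `ready()` after each handler avoids assuming a particular sequence.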
Establishing a Voice WebSocket Connection
Once we retrieve a session_id, token, and endpoint information, we can connect and handshake with the voice server over another secure WebSocket.
Unlike the Gateway endpoint we receive in a Get Gateway request, the endpoint received from our Voice Server Update payload does not contain a URL protocol, so some libraries may require manually prepending it with wss:// before connecting. Once connected to the voice WebSocket endpoint, we can immediately send an Opcode 0 Identify payload:
Identify Structure
| Field | Type | Description |
|---|---|---|
| server_id | snowflake | The ID of the guild, private channel, stream, or lobby being connected to |
| channel_id 1 | snowflake | The ID of the channel being connected to |
| user_id | snowflake | The ID of the current user |
| session_id | string | The session ID of the current session |
| token | string | The voice token for the current session |
| video? | boolean | Whether this connection supports video (default false) |
| streams? | array[stream object] | Simulcast streams to send |
| max_dave_protocol_version? | integer | The maximum DAVE protocol version supported by the client (default 0) |
1 Only required for Gateway v9 and above.
Stream Structure
| Field | Type | Description |
|---|---|---|
| type 1 | string | The type of media stream to send |
| rid | string | The RTP stream ID |
| quality? | integer | The media quality to send (0-100, default 0) |
| active? | boolean | Whether the stream is active (default false) |
| max_bitrate? | integer | The maximum bitrate to send in bps |
| max_framerate? | integer | The maximum framerate to send in fps |
| max_resolution? | stream resolution object | The maximum resolution to send |
| ssrc? | integer | The SSRC of the stream |
| rtx_ssrc? | integer | The SSRC of the retransmission stream |
1 Currently, this field is ignored and always set to video.
Media Type
| Value | Description |
|---|---|
| audio | Audio |
| video | Video |
| screen | Screenshare |
| test | Speed test |
Stream Resolution Structure
| Field | Type | Description |
|---|---|---|
| type | string | The resolution type to use |
| width | number | The fixed resolution width, or 0 for source |
| height | number | The fixed resolution height, or 0 for source |
Resolution Type
| Value | Description |
|---|---|
| fixed | Fixed resolution |
| source | Source resolution |
Example Identify
{"op": 0,"d": {"server_id": "41771983423143937","user_id": "104694319306248192","session_id": "30f32c5d54ae86130fc4a215c7474263","token": "66d29164ee8cd919","video": true,"streams": [{ "type": "video", "rid": "100", "quality": 100 },{ "type": "video", "rid": "50", "quality": 50 }],"max_dave_protocol_version": 1}}
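The connection URL and Identify payload could be assembled roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and selecting the Gateway version through a `v` query parameter is assumed rather than stated in this section. `channel_id` is included because it is required for Gateway v9 and above.

```python
def voice_gateway_url(endpoint: str, version: int = 9) -> str:
    # The endpoint has no URL protocol, so prepend wss:// when it is missing.
    if not endpoint.startswith("wss://"):
        endpoint = "wss://" + endpoint
    # Assumption: the Gateway version is chosen with a "v" query parameter.
    return f"{endpoint.rstrip('/')}/?v={version}"


def identify_payload(server_id, channel_id, user_id, session_id, token,
                     max_dave_protocol_version=0):
    # Opcode 0 Identify; 0 for max_dave_protocol_version means no DAVE support.
    return {
        "op": 0,
        "d": {
            "server_id": server_id,
            "channel_id": channel_id,
            "user_id": user_id,
            "session_id": session_id,
            "token": token,
            "max_dave_protocol_version": max_dave_protocol_version,
        },
    }
```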
The voice server should respond with an Opcode 2 Ready payload, which informs us of the SSRC, connection IP/port, supported transport encryption modes, and experiments the voice server supports:
Ready Structure
| Field | Type | Description |
|---|---|---|
| ssrc | integer | The SSRC of the user's voice connection |
| ip | string | The IP address of the voice server |
| port | integer | The port of the voice server |
| modes | array[string] | Supported transport encryption modes |
| experiments | array[string] | Available voice experiments |
| streams | array[stream object] | Populated simulcast streams |
Example Ready
{"op": 2,"d": {"ssrc": 12871,"ip": "127.0.0.1","port": 1234,"modes": ["aead_aes256_gcm_rtpsize","aead_aes256_gcm","aead_xchacha20_poly1305_rtpsize","xsalsa20_poly1305_lite_rtpsize","xsalsa20_poly1305_lite","xsalsa20_poly1305_suffix","xsalsa20_poly1305"],"experiments": ["fixed_keyframe_interval"],"streams": [{"type": "video","ssrc": 12872,"rtx_ssrc": 12873,"rid": "50","quality": 50,"active": false},{"type": "video","ssrc": 12874,"rtx_ssrc": 12875,"rid": "100","quality": 100,"active": false}]}}
Establishing a Voice Connection
Once we receive the properties of a voice server from our Ready payload, we can proceed to the final step of voice connections, which entails establishing and handshaking a connection for RTC data. First, we establish either a UDP connection using the Ready payload data, or prepare a WebRTC SDP. We then send an Opcode 1 Select Protocol with details about our connection:
Select Protocol Structure
| Field | Type | Description |
|---|---|---|
| protocol | string | The voice protocol to use |
| data | ?protocol data \| string | The voice connection data or WebRTC SDP |
| rtc_connection_id? | string | The UUID RTC connection ID, used for analytics |
| codecs? | array[codec object] | The supported audio/video codecs |
| experiments? | array[string] | The received voice experiments to enable |
Protocol Type
| Value | Description |
|---|---|
| udp | Standard UDP voice connection |
| webrtc | WebRTC voice connection |
Protocol Data Structure
| Field | Type | Description |
|---|---|---|
| address 1 | string | The discovered IP address of the client |
| port 1 | integer | The discovered UDP port of the client |
| mode | string | The transport encryption mode to use |
1 These fields are only used to receive RTC data. If you only wish to send frames and do not care about receiving, you can randomize these values.
Codec Structure
| Field | Type | Description |
|---|---|---|
| name | string | The name of the codec |
| type | string | The type of codec |
| priority 1 | integer | The preferred priority of the codec as a multiple of 1000 (unique per type) |
| payload_type 2 | integer | The dynamic RTP payload type of the codec |
| rtx_payload_type? | integer | The dynamic RTP payload type of the retransmission codec (video-only) |
| encode? | boolean | Whether the client supports encoding this codec (default true) |
| decode? | boolean | Whether the client supports decoding this codec (default true) |
1 For audio, Opus is the only available codec and should be priority 1000.
2 No payload type should be set to 96, as it is reserved for probe packets.
Supported Codecs
Providing codecs is optional due to backwards compatibility with old clients and bots that do not handle video.
If the client does not provide any codecs, the server assumes an Opus audio codec with a payload type of 120 and no specific video codec.
If no clients with specified video codecs are connected, the server defaults to H264.
| Type | Name | Status |
|---|---|---|
| audio | opus | Required |
| video | AV1 | Preferred |
| video | H265 | Preferred |
| video | H264 | Default |
| video | VP8 | Available |
| video | VP9 | Available |
Example Select Protocol
{"op": 1,"d": {"protocol": "udp","data": {"address": "127.0.0.1","port": 1337,"mode": "aead_aes256_gcm_rtpsize"},"codecs": [{"name": "opus","type": "audio","priority": 1000,"payload_type": 120},{"name": "AV1","type": "video","priority": 1000,"payload_type": 101,"rtx_payload_type": 102,"encode": false,"decode": true},{"name": "H264","type": "video","priority": 2000,"payload_type": 103,"rtx_payload_type": 104,"encode": true,"decode": true}],"rtc_connection_id": "d6b92f64-40df-48eb-8bce-7facb043149a","experiments": ["fixed_keyframe_interval"]}}
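Before sending Select Protocol, a client may want to check its codec list against the constraints in the footnotes above (priorities are multiples of 1000 and unique per type, and no payload type is 96). The `validate_codecs` helper below is a hypothetical sketch of such a check, not something the voice server requires you to run.

```python
RESERVED_PAYLOAD_TYPE = 96  # reserved for probe packets


def validate_codecs(codecs: list) -> list:
    """Return a list of human-readable problems; empty means the list looks fine."""
    problems = []
    seen = set()
    for codec in codecs:
        name, kind, priority = codec["name"], codec["type"], codec["priority"]
        if priority % 1000 != 0:
            problems.append(f"{name}: priority {priority} is not a multiple of 1000")
        if (kind, priority) in seen:
            problems.append(f"{name}: priority {priority} is repeated for type {kind}")
        seen.add((kind, priority))
        # Check both the main and the retransmission payload type.
        for key in ("payload_type", "rtx_payload_type"):
            if codec.get(key) == RESERVED_PAYLOAD_TYPE:
                problems.append(f"{name}: {key} must not be {RESERVED_PAYLOAD_TYPE}")
    return problems
```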
Transport Encryption Mode
The RTP size variants determine the unencrypted size of the RTP header in the same way as SRTP, which considers CSRCs and (optionally) the extension preamble to be part of the unencrypted header. The deprecated variants use a fixed size unencrypted header for RTP.
The Gateway will report what encryption modes are available in Opcode 2 Ready.
Compatible modes will always include aead_xchacha20_poly1305_rtpsize but may not include aead_aes256_gcm_rtpsize depending on the underlying hardware. You must support aead_xchacha20_poly1305_rtpsize. You should prefer to use aead_aes256_gcm_rtpsize when it is available.
| Value | Name | Nonce | Status |
|---|---|---|---|
| aead_aes256_gcm_rtpsize | AEAD AES256 GCM (RTP Size) | 32-bit incremental integer value appended to payload | Preferred |
| aead_xchacha20_poly1305_rtpsize | AEAD XChaCha20 Poly1305 (RTP Size) | 32-bit incremental integer value appended to payload | Required |
| xsalsa20_poly1305_lite_rtpsize | XSalsa20 Poly1305 Lite (RTP Size) | 32-bit incremental integer value appended to payload | Deprecated |
| aead_aes256_gcm | AEAD AES256-GCM | 32-bit incremental integer value appended to payload | Deprecated |
| xsalsa20_poly1305 | XSalsa20 Poly1305 | Copy of RTP header | Deprecated |
| xsalsa20_poly1305_suffix | XSalsa20 Poly1305 (Suffix) | 24 random bytes | Deprecated |
| xsalsa20_poly1305_lite | XSalsa20 Poly1305 (Lite) | 32-bit incremental integer value, appended to payload | Deprecated |
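The preference rule above (use `aead_aes256_gcm_rtpsize` when offered, otherwise fall back to the required `aead_xchacha20_poly1305_rtpsize`) can be sketched as a small selection function. The function name is hypothetical; `ready_modes` stands for the `modes` array from the Ready payload.

```python
# Ordered from most to least preferred; deprecated modes are left out on purpose.
PREFERRED_MODES = (
    "aead_aes256_gcm_rtpsize",
    "aead_xchacha20_poly1305_rtpsize",
)


def choose_mode(ready_modes) -> str:
    for mode in PREFERRED_MODES:
        if mode in ready_modes:
            return mode
    raise ValueError("no supported transport encryption mode was offered")
```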
Finally, the voice server will respond with an Opcode 4 Session Description that includes the mode and secret_key, a 32 byte array used for sending and receiving RTC data:
Session Description Structure
| Field | Type | Description |
|---|---|---|
| audio_codec | string | The audio codec to use |
| video_codec | string | The video codec to use |
| media_session_id | string | The media session ID, used for analytics |
| mode? | string | The transport encryption mode to use, not applicable to WebRTC |
| secret_key? | array[integer] | The 32 byte secret key used for encryption, not applicable to WebRTC |
| sdp? | string | The WebRTC session description protocol |
| keyframe_interval? | integer | The keyframe interval in milliseconds |
| dave_protocol_version | integer | The DAVE protocol version to use, where 0 indicates no DAVE support |
Example Session Description
{"op": 4,"d": {"audio_codec": "opus","media_session_id": "89f1d62f166b948746f7646713d39dbb","mode": "aead_aes256_gcm_rtpsize","secret_key": [ ... ],"video_codec": "H264","dave_protocol_version": 1}}
We can now start sending and receiving RTC data over the previously established UDP or WebRTC connection.
Session Updates
At any time, the client may update the codecs they support using an Opcode 14 Session Update.
If a user joins who does not support the current codecs, or a user indicates that they no longer support the current codecs, the voice server will send an Opcode 14 Session Update.
This may also be sent to update the current media_session_id or keyframe_interval.
Session Update Structure (Send)
| Field | Type | Description |
|---|---|---|
| codecs | array[codec object] | The supported audio/video codecs |
Session Update Structure (Receive)
| Field | Type | Description |
|---|---|---|
| audio_codec? | string | The new audio codec to use |
| video_codec? | string | The new video codec to use |
| media_session_id? | string | The new media session ID, used for analytics |
| keyframe_interval? | integer | The keyframe interval in milliseconds |
End-to-End Encryption
Since September 2024, Discord has been migrating voice and video in private channels, voice channels, and streams to end-to-end encryption (E2EE) through the DAVE protocol. When any DAVE protocol version is enabled for a call, the full contents of media frames sent and received by call participants are end-to-end encrypted.
This section is a high-level overview of how to support Discord's audio & video end-to-end encryption (DAVE) protocol, centered around the Gateway opcodes necessary for the protocol. The most thorough documentation on the DAVE protocol is found in the protocol whitepaper. You may additionally be able to leverage or refer to Discord's open-source library libdave to assist your implementation. The exact format of the DAVE protocol opcodes is detailed in the opcodes section of the protocol whitepaper.
When a call is E2EE, all members of the call exchange keys via a Messaging Layer Security (MLS) group. This group is used to derive per-sender ratcheted media keys (known only to the participants of the group) to encrypt/decrypt media frames sent in the call.
Binary Websocket Messages
To reduce overhead, some of the new DAVE protocol opcodes are sent as binary instead of JSON text. See the format column in voice opcodes to identify them. Binary websocket messages have the following format:
| Field | Type | Description | Size |
|---|---|---|---|
| Sequence 1 | Unsigned short (big endian) | Sequence number | 2 bytes |
| Opcode | Unsigned integer (big endian) | Unsigned integer opcode value | 1 byte |
| Payload | Binary data | Format defined by opcode | Variable bytes |
1 Sequence numbers are only sent from the server to the client, and all server-sent binary opcodes require the sequence number. See Buffered Resume for further details on how sequence numbers are used when present.
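The layout above can be read with Python's `struct` module. This is a minimal sketch with hypothetical function names: server-sent messages start with the 2-byte sequence number, while client-sent messages omit it and start with the opcode.

```python
import struct


def parse_server_binary(message: bytes):
    """Split a server-sent binary message into (sequence, opcode, payload)."""
    if len(message) < 3:
        raise ValueError("binary message is shorter than its 3-byte header")
    # ">HB": big-endian unsigned short (sequence), then one unsigned byte (opcode).
    sequence, opcode = struct.unpack_from(">HB", message, 0)
    return sequence, opcode, message[3:]


def build_client_binary(opcode: int, payload: bytes) -> bytes:
    # Sequence numbers are only sent by the server, so none is written here.
    return struct.pack(">B", opcode) + payload
```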
Indicating DAVE Protocol Support
Include the highest DAVE protocol version you support in Opcode 0 Identify as max_dave_protocol_version. Sending version 0, or omitting the max_dave_protocol_version field, indicates no DAVE protocol support.
The voice Gateway specifies the initial protocol version in Opcode 4 Session Description under dave_protocol_version. This may be any non-discontinued protocol version equal to or less than your supported protocol version.
Protocol Transitions
The voice server negotiates protocol version and MLS group transitions to ensure the continuity of media being sent for the call. This can occur when the call is upgrading/downgrading to/from E2EE (in the initial transition phase), changing protocol versions, or when the MLS group is changing.
Some opcodes include a transition ID. After preparing local state necessary to perform the transition, send Opcode 23 DAVE Protocol Transition Ready to indicate to the Gateway that you are ready to execute the transition. When all participants are ready or when a timeout has been reached, the Gateway dispatches Opcode 22 DAVE Protocol Execute Transition to confirm execution of the transition. The transition execution is what indicates to media senders that they can begin sending media with the new protocol context (e.g. without E2EE after a downgrade, with a new protocol version after a protocol version change, or using a new key ratchet after a group participant change).
Downgrade
Downgrades to protocol version 0 are announced via Opcode 21 DAVE Protocol Prepare Transition. This can occur during the transition phase when a client that does not support the protocol joins the call. When this transition is executed, senders should stop sending media using the protocol format.
Version Change & Upgrade
Protocol version transitions (including upgrades from protocol version 0) are announced via Opcode 24 DAVE Protocol Prepare Epoch. In addition to the transition_id, this opcode includes the epoch for the upcoming MLS epoch.
Receiving Opcode 24 DAVE Protocol Prepare Epoch with epoch = 1 indicates that a new MLS group is being created. Participants must:
- Prepare a local MLS group with the parameters appropriate for the DAVE protocol version
- Generate and send Opcode 26 MLS Key Package to deliver a new MLS key package to the Gateway
When the epoch is greater than 1, the protocol version of the existing MLS group is changing.
When the transition is executed, senders must start sending media using the new protocol context (e.g. formatted for the new protocol version or using a new key ratchet).
MLS Group Changes
When the participants of the MLS group must change, existing participants receive an Opcode 29 MLS Announce Commit Transition, whereas new members being added to the group receive Opcode 30 MLS Welcome. Both opcodes include the transition ID and binary MLS Commit or MLS Welcome message.
To prepare for the protocol transition, existing group members must apply the commit to progress their local MLS group to the correct next state. Opcode 23 DAVE Protocol Transition Ready is sent when the MLS commit has been processed.
Welcomed members send Opcode 23 DAVE Protocol Transition Ready after successfully joining the group received in the MLS Welcome message.
External Sender
The voice server must be an external sender of the MLS group, so that it can send external MLS proposals to add and remove call participants when appropriate (i.e. proposing the addition of new members when they connect and the removal of previous members when they disconnect).
DAVE protocol participants only process proposals which arrive from the external sender, and not from any other group members. The external sender only sends Add or Remove proposals.
The Gateway uses Opcode 25 MLS External Sender Package to provide the external sender public key and credential to MLS group participants. This message may be sent immediately on Gateway connect or at a later time when the call is upgrading to use the DAVE protocol.
Group creators must include the external sender they receive from the Gateway in their MLS group extensions when creating the group. Welcomed group members ensure that the expected external sender extension is present in the group they are about to join.
Joining the MLS Group
Except for the initial creation of the first group for the call, joining the MLS group always occurs after receiving Opcode 30 MLS Welcome.
Key Packages
To be proposed for addition to the MLS group, pending members must send an MLS key package via Opcode 26 MLS Key Package. Key packages are only used one time, and a new key package must be generated each time a pending member is waiting to be added or re-added to the group.
Identity Public Key
MLS participants use an asymmetric keypair for MLS message signatures and authentication. The public key of this keypair is included in the key package and MLS tree. It is known to other participants in the call and is leveraged for out-of-band identity verification.
You can choose to generate a new ephemeral keypair for every protocol call or use the same persistent keypair at all times. Keys can be uploaded and verified using Upload Voice Public Key and Verify Voice Public Key respectively.
Initial Group
When there is not yet an MLS group (e.g. a transport-only encrypted call is upgrading or two members have just joined a new call), all pending group members create a local group using the MLS parameters defined by the DAVE protocol version and including the voice server external sender received via Opcode 25 MLS External Sender Package. Every pending member of the group has the chance to produce the initial commit that creates the MLS group with epoch = 1.
Pending group members receive add proposals for every other pending group member from the Gateway. If an additional pending member joins while there is not yet an MLS group, they receive all in-flight proposal messages.
Proposal and commit handling follows the same process whether or not there is an established group. See Proposals and Commits.
Welcome
Pending group members receive a welcome message from another group member which adds them to the MLS group. This is dispatched from the Gateway via Opcode 30 MLS Welcome.
Invalid Group
If the group received in an Opcode 30 MLS Welcome or Opcode 29 MLS Announce Commit Transition is unprocessable, the member receiving the unprocessable message sends Opcode 31 MLS Invalid Commit Welcome to the Gateway. Additionally, the local group state is reset and a new key package is generated and sent to the Gateway via Opcode 26 MLS Key Package.
This causes the Gateway to propose the removal and re-addition of the requesting member.
Proposals and Commits
The Gateway dispatches proposals which must be appended or revoked via Opcode 27 MLS Proposals. All members of the established or pending MLS group must append or revoke the proposals they receive, and then produce an MLS commit message and optionally an MLS welcome message (when committing add proposals which add new members) which they send to the Gateway via Opcode 28 MLS Commit Welcome.
In each epoch, the Gateway dispatches the "winning" commit via Opcode 29 MLS Announce Commit Transition and optionally the associated welcome messages via Opcode 30 MLS Welcome. The Gateway broadcasts the first valid commit and welcome(s) it sees in the given epoch, and drops any commits later received for the out-of-date epoch. All dispatched unrevoked proposals in the epoch must be included in the commit for it to be valid. All members added in the epoch must be welcomed for the welcome to be valid.
Payload Format
Some fields in the protocol frame payload use ULEB128 encoding. This is a variable-length code compression to represent arbitrarily large unsigned integers in a small number of bytes.
| Field | Type | Description | Size |
|---|---|---|---|
| Media Frame | Binary data | Interleaved unencrypted and encrypted media frame | Variable bytes |
| Authentication Tag | Binary data | Truncated AES128-GCM AEAD Authentication Tag | 8 bytes |
| Nonce | ULEB128 | Truncated synchronization nonce | Variable bytes |
| Unencrypted Ranges | ULEB128 | Unencrypted range offset and length pairs | Variable bytes |
| Supplemental Data Size | Unsigned integer (big endian) | Byte size of supplemental data | 1 byte |
| Magic Marker | Binary data | 0xFAFA marker to assist with protocol frame identification | 2 bytes |
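For reference, ULEB128 stores an unsigned integer seven bits at a time, least significant group first, with the high bit of each byte set when more bytes follow. A minimal encoder and decoder might look like this (the function names are illustrative):

```python
def uleb128_encode(value: int) -> bytes:
    if value < 0:
        raise ValueError("ULEB128 only encodes unsigned integers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more groups follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)


def uleb128_decode(data: bytes, offset: int = 0):
    """Return (value, next_offset); raises IndexError if the data is truncated."""
    result = 0
    shift = 0
    while True:
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7
```

Small values such as short offsets and early nonces therefore cost a single byte, which is why the format is used for the nonce and unencrypted ranges.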
Media Frame
The encrypted frame transformer is codec-aware and processes incoming encoded frames from WebRTC to determine which ranges must be left unencrypted so that they can pass through the WebRTC packetizer and depacketizer.
All of the (potentially discontiguous) encrypted ranges are joined together, in their order in the original frame, to be encrypted as one block of plaintext, using the AES128-GCM AEAD encryption described below.
All of the (potentially discontiguous) unencrypted ranges from the frame are joined together and included as additional data to be authenticated by the AEAD ciphersuite. This ensures the SFU is unable to include or replace content in user media frames.
In the resulting interleaved protocol media frame, the unencrypted ranges remain unmodified in their original location from the incoming frame. Encrypted ranges are replaced by their associated ciphertext range. The encrypting frame transformer may mutate the encoded frame it receives to ensure it can pass through the packetizer and depacketizer in an expected and reproducible manner.
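The splitting and re-interleaving described above can be sketched with plain byte slicing. The helper names are hypothetical, ranges are `(offset, size)` pairs in ascending offset order, and the sketch assumes the ciphertext has the same length as the plaintext (the authentication tag is carried separately). The actual encryption step is left out.

```python
def split_for_aead(frame: bytes, unencrypted_ranges):
    """Return (additional_data, plaintext) for one encoded frame."""
    additional_data = bytearray()
    plaintext = bytearray()
    cursor = 0
    for offset, size in unencrypted_ranges:
        plaintext += frame[cursor:offset]               # bytes to be encrypted
        additional_data += frame[offset:offset + size]  # bytes left in the clear
        cursor = offset + size
    plaintext += frame[cursor:]
    return bytes(additional_data), bytes(plaintext)


def interleave(unencrypted_ranges, additional_data: bytes, ciphertext: bytes) -> bytes:
    """Put unencrypted ranges back at their offsets, with ciphertext in between."""
    out = bytearray()
    used_aad = used_ct = cursor = 0
    for offset, size in unencrypted_ranges:
        gap = offset - cursor
        out += ciphertext[used_ct:used_ct + gap]
        used_ct += gap
        out += additional_data[used_aad:used_aad + size]
        used_aad += size
        cursor = offset + size
    out += ciphertext[used_ct:]
    return bytes(out)
```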
Authentication Tag
The authentication tag is an 8-byte truncated version of the authentication tag resulting from the AEAD encryption.
Nonce
The ULEB128 nonce is a variable length representation of the nonce used for encryption/decryption.
Unencrypted Ranges
The unencrypted ranges identify which portions of the interleaved protocol media frame are plaintext and which are ciphertext. Each included range is represented as a byte offset and byte size pair, with both encoded using ULEB128. Unencrypted ranges are ordered by their ascending byte offset. The encrypting frame transformer is codec-aware, and processes each incoming encoded frame to determine the unencrypted ranges for the frame. The decrypted frame transformer deserializes the unencrypted ranges from the protocol supplemental data, and reconstructs the merged additional data and ciphertext necessary for decryption.
Supplemental Data Size
The supplemental data size is the sum of bytes required for:
- 8-byte authentication tag
- Variable length ULEB128 nonce
- Variable length ULEB128 unencrypted ranges
- 1 byte supplemental data size
- 2 byte magic marker
Magic Marker
The magic marker is a constant 2-byte value 0xFAFA. This is used by media receivers to detect protocol frames as well as by the SFU to avoid sending protocol frames to non-protocol-supporting receivers during transition periods.
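Putting the trailer fields together, a receiver could detect a protocol frame and pull apart its supplemental data roughly as follows. This is a sketch only: `split_protocol_frame` is a hypothetical name, and it performs no decryption or codec-specific handling.

```python
MAGIC_MARKER = b"\xfa\xfa"
TAG_SIZE = 8
# Tag + at least one nonce byte + size byte + marker.
MIN_SUPPLEMENTAL_SIZE = TAG_SIZE + 1 + 1 + 2


def _read_uleb128(data: bytes, offset: int, end: int):
    result = shift = 0
    while True:
        if offset >= end:
            raise ValueError("truncated ULEB128 value")
        byte = data[offset]
        offset += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, offset
        shift += 7


def split_protocol_frame(frame: bytes):
    """Return (media, tag, nonce, ranges), or None if this is not a protocol frame."""
    if len(frame) < MIN_SUPPLEMENTAL_SIZE or frame[-2:] != MAGIC_MARKER:
        return None
    size = frame[-3]  # 1-byte supplemental data size, just before the marker
    if not MIN_SUPPLEMENTAL_SIZE <= size <= len(frame):
        return None
    supplemental = frame[len(frame) - size:]
    body_end = size - 3  # stop before the size byte and the marker
    tag = supplemental[:TAG_SIZE]
    nonce, pos = _read_uleb128(supplemental, TAG_SIZE, body_end)
    ranges = []
    while pos < body_end:
        offset, pos = _read_uleb128(supplemental, pos, body_end)
        length, pos = _read_uleb128(supplemental, pos, body_end)
        ranges.append((offset, length))
    return frame[:len(frame) - size], tag, nonce, ranges
```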
Payload Encryption
Media frames are encrypted for E2EE using AES128-GCM. Depending on the protocol, some bytes may be left unencrypted to allow for packetization and depacketization of frames. For more detail, see the codec handling section of the protocol whitepaper.
Sender Key Derivation
Each media sender has a ratcheted per-sender key. There is a new per-sender ratchet created in each MLS group epoch. The initial secret for each sender's ratchet is an exported 16-byte secret from the MLS group. Keys are retrieved from the ratchet via a generation counter derived from the most-significant byte of the 4-byte nonce.
For very long lived epochs, the nonce wrap-around must be handled so the generation does not also wrap back around to 0.
See the sender key derivation section of the protocol whitepaper for the detailed process.
Authentication Tag
The authentication tag resulting from the AES128-GCM encryption is truncated to 8 bytes. Some implementations may provide the desired tag length as a parameter, whereas some may always return the full 16-byte tag, from which the 8 least significant bytes should be removed.
Nonce
The nonce passed to the AES128-GCM encryption and decryption functions is a full 12-byte nonce, but the protocol uses at most 4 bytes. The 12-byte nonce can be expanded from a 4-byte truncated nonce by setting the 8 most significant bytes of the nonce to zero, with the 4 least significant bytes carrying the value of the truncated nonce.
The generation used for the sender's key ratchet is retrieved from the most-significant byte of the 4-byte nonce (i.e. the 4th least significant byte of the full 12-byte nonce).
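The expansion and generation lookup can be sketched as below. Note the assumption: the text fixes which bytes are significant but not how the 12-byte buffer is laid out in memory, so the big-endian serialization used here is illustrative and should be checked against the protocol whitepaper or libdave before use. Both function names are hypothetical.

```python
def expand_nonce(truncated_nonce: int) -> bytes:
    """Expand a 4-byte truncated nonce to 12 bytes (upper 8 bytes zero)."""
    if not 0 <= truncated_nonce < 2 ** 32:
        raise ValueError("truncated nonce must fit in 4 bytes")
    # Assumption: big-endian layout, so the zero bytes come first.
    return truncated_nonce.to_bytes(12, "big")


def ratchet_generation(truncated_nonce: int) -> int:
    # The generation is the most significant byte of the 4-byte nonce.
    return (truncated_nonce >> 24) & 0xFF
```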
AEAD Additional Data
The additional data passed to the AEAD encryption and decryption functions is the concatenation of all unencrypted ranges from the frame. This ensures that the SFU cannot modify any unencrypted content in the frame without being detected by receivers.
Heartbeating
In order to maintain your WebSocket connection, you need to continuously send heartbeats at the interval determined in Opcode 8 Hello.
This is sent at the start of the connection. Be warned that the Opcode 8 Hello structure differs by Gateway version.
Versions below v3 follow a flat structure without op or d fields, including only a single heartbeat_interval field. Be sure to expect this different format based on your version.
This heartbeat interval is the longest you should wait between heartbeats. You can heartbeat more frequently if you wish.
For example, the web client uses a heartbeat interval of min(heartbeat_interval, 5000) if the Gateway version is v4 or above, and heartbeat_interval * 0.1 otherwise. The desktop client uses the provided heartbeat interval if the Gateway version is v4 or above, and heartbeat_interval * 0.25 otherwise.
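Expressed as code, the two client behaviors described above look like this. The function names are hypothetical, and intervals are in milliseconds.

```python
def web_client_interval(heartbeat_interval: float, gateway_version: int) -> float:
    if gateway_version >= 4:
        return min(heartbeat_interval, 5000)
    return heartbeat_interval * 0.1


def desktop_client_interval(heartbeat_interval: float, gateway_version: int) -> float:
    if gateway_version >= 4:
        return heartbeat_interval
    return heartbeat_interval * 0.25
```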
Hello Structure
| Field | Type | Description |
|---|---|---|
| v | integer | The voice server version |
| heartbeat_interval | integer | The minimum interval (in milliseconds) the client should heartbeat at |
Example Hello
{"op": 8,"d": {"v": 8,"heartbeat_interval": 41250}}
The Gateway may request a heartbeat from the client in some situations by sending an Opcode 3 Heartbeat. When this occurs, the client should immediately send an Opcode 3 Heartbeat without waiting the remainder of the current interval.
After receiving Opcode 8 Hello, you should send Opcode 3 Heartbeat—which contains an integer nonce—every elapsed interval:
Heartbeat Structure
| Field | Type | Description |
|---|---|---|
| t | integer | A unique integer nonce (e.g. the current unix timestamp) |
| seq_ack? | integer | The last received sequence number |
Example Heartbeat
{"op": 3,"d": {"t": 1501184119561,"seq_ack": 10}}
Since Gateway v8, heartbeat messages must include seq_ack which contains the sequence number of the last numbered message received from the gateway. See Buffered Resume for more information.
Previous versions follow a flat structure, with the d field representing the t field in both the Heartbeat and Heartbeat ACK structure.
In return, you will be sent back an Opcode 6 Heartbeat ACK that contains the previously sent nonce:
Example Heartbeat ACK
{"op": 6,"d": {"t": 1501184119561}}
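A hedged sketch of building a heartbeat and checking its acknowledgement is shown here. The helper names are hypothetical, and the version cut-off assumes that "previous versions" means Gateway versions below v8, where `d` carries the nonce directly.

```python
import time
from typing import Any, Dict, Optional


def build_heartbeat(gateway_version: int, seq_ack: Optional[int] = None,
                    nonce: Optional[int] = None) -> Dict[str, Any]:
    """Build an Opcode 3 Heartbeat for the given Gateway version."""
    if nonce is None:
        # Any unique integer works; the current unix time in ms is one option.
        nonce = int(time.time() * 1000)
    if gateway_version < 8:
        # Assumed flat form: d is the nonce itself.
        return {"op": 3, "d": nonce}
    d: Dict[str, Any] = {"t": nonce}
    if seq_ack is not None:
        d["seq_ack"] = seq_ack
    return {"op": 3, "d": d}


def ack_matches(ack: Dict[str, Any], nonce: int) -> bool:
    """Return True if an Opcode 6 Heartbeat ACK echoes the nonce we sent."""
    if ack.get("op") != 6:
        return False
    d = ack["d"]
    echoed = d["t"] if isinstance(d, dict) else d
    return echoed == nonce
```

A client could record the nonce of each heartbeat it sends and treat a missing or mismatched ACK as a sign that the connection is unhealthy.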
UDP Connections
UDP is the protocol most clients will use. First, we open a UDP connection to the IP and port provided in the Ready payload. If required, we can now perform an IP Discovery using this connection. Once we've discovered our external IP and UDP port, we can tell the voice WebSocket what they are by sending a Select Protocol payload as outlined above, and then receive our Session Description to begin sending and receiving RTC data.
IP Discovery
Routers on the Internet generally mask or obfuscate UDP ports through a process called NAT. Most users who implement voice will want to use IP discovery to find their external IP and port, which will then be used for receiving voice communications. To retrieve your external IP and port, send the following UDP packet to your voice port (all numeric fields are big endian):
| Field | Type | Description | Size |
|---|---|---|---|
| Type | Unsigned short (big endian) | Values 0x1 and 0x2 indicate request and response, respectively | 2 bytes |
| Length | Unsigned short (big endian) | Message length excluding Type and Length fields (value 70) | 2 bytes |
| SSRC | Unsigned integer (big endian) | The SSRC of the user | 4 bytes |
| Address | Null-terminated string | The external IP address of the user | 64 bytes |
| Port | Unsigned short (big endian) | The external port number of the user | 2 bytes |
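The packet layout above maps directly onto a fixed 74-byte structure. The sketch below packs a request and parses a response with Python's `struct` module; the function names are hypothetical, and it assumes the request leaves the Address and Port fields zeroed.

```python
import struct
from typing import Tuple

# Type (2) + Length (2) + SSRC (4) + Address (64) + Port (2) = 74 bytes, big endian.
IP_DISCOVERY_FORMAT = ">HHI64sH"
REQUEST, RESPONSE = 0x1, 0x2
BODY_LENGTH = 70  # message length excluding the Type and Length fields


def build_ip_discovery_request(ssrc: int) -> bytes:
    """Pack a discovery request; address and port are left empty (assumption)."""
    return struct.pack(IP_DISCOVERY_FORMAT, REQUEST, BODY_LENGTH, ssrc, b"", 0)


def parse_ip_discovery_response(packet: bytes) -> Tuple[str, int]:
    """Return (external_ip, external_port) from a discovery response."""
    kind, length, _ssrc, address, port = struct.unpack(IP_DISCOVERY_FORMAT, packet)
    if kind != RESPONSE or length != BODY_LENGTH:
        raise ValueError("not an IP discovery response")
    # The address is a null-terminated string inside a 64-byte field.
    ip = address.split(b"\x00", 1)[0].decode("ascii")
    return ip, port
```

The request would be sent over the UDP socket opened to the voice server, and the returned IP and port are the values to pass along in Select Protocol.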
Sending and Receiving Voice
Voice data sent to and received from Discord should be encoded or decoded with Opus, using two channels (stereo) and a sample rate of 48kHz.
Video data should be encoded or decoded using the RFCs relevant to the codec being used. Data is sent using an RTP header, followed by encrypted Opus audio data or video data. Encryption uses the key passed in Session Description and, if required, a nonce formed from the 12-byte header with 12 null bytes appended. Discord encrypts with the libsodium encryption library.
Transport encryption between the client and the selective forwarding unit (SFU) is still used even in E2EE calls.
When receiving data, the user who sent the packet is identified by caching the SSRC and user IDs received from Speaking events. At least one Speaking event for the user is received before any frames are received, so the user ID should always be available.
RTP Packet Structure
| Field | Type | Description | Size |
|---|---|---|---|
| Version + Flags 1 | Unsigned byte | The RTP version and flags (always 0x80 for voice) | 1 byte |
| Payload Type 2 | Unsigned byte | The type of payload (0x78 with the default Opus configuration) | 1 byte |
| Sequence | Unsigned short (big endian) | The sequence number of the packet | 2 bytes |
| Timestamp | Unsigned integer (big endian) | The RTC timestamp of the packet | 4 bytes |
| SSRC | Unsigned integer (big endian) | The SSRC of the user | 4 bytes |
| Payload | Binary data | Encrypted audio/video data | n bytes |
1 If sending an RTP header extension, the flags should have the extension bit (1 << 4) set (e.g. 0x80 becomes 0x90).
2 When sending a final video frame, the payload type should have the M bit (1 << 7) set (e.g. 0x78 becomes 0xF8).
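As an illustration of the header layout and the two footnotes, the following sketch packs the 12-byte RTP header and derives the header-based nonce. The function names are hypothetical, and encryption itself is left out because it depends on the negotiated mode and a libsodium binding.

```python
import struct

RTP_VERSION_FLAGS = 0x80  # always 0x80 for voice
OPUS_PAYLOAD_TYPE = 0x78  # default Opus configuration


def build_rtp_header(sequence: int, timestamp: int, ssrc: int,
                     payload_type: int = OPUS_PAYLOAD_TYPE,
                     extension: bool = False, marker: bool = False) -> bytes:
    """Pack the fixed 12-byte header that precedes the encrypted payload."""
    first = RTP_VERSION_FLAGS | ((1 << 4) if extension else 0)  # 0x80 -> 0x90
    second = payload_type | ((1 << 7) if marker else 0)         # 0x78 -> 0xF8
    # Sequence and timestamp wrap around at their field widths.
    return struct.pack(">BBHII", first, second,
                       sequence & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)


def header_nonce(header: bytes) -> bytes:
    """Nonce formed from the 12-byte header with 12 null bytes appended."""
    return header + bytes(12)
```

Whether this nonce is used at all depends on the encryption mode selected during Select Protocol, so treat `header_nonce` as applicable only where the mode calls for it.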
Quality of Service
Discord utilizes RTCP packets to monitor the quality of the connection. Sending and parsing these packets is not required, but is recommended to aid in monitoring the connection and synchronizing audio and video streams. The client should send an RTCP Sender Report roughly every 5 seconds (without padding or reception report blocks) to inform the server of the current state of the connection. Likewise, Discord will send RTCP Receiver Reports to the client to provide feedback on the quality of the connection.
WebRTC Connections
WebRTC allows for direct peer-to-peer voice connections, and is most commonly used in browsers. To use WebRTC, you must first send a Select Protocol payload as outlined above, with the protocol field set to webrtc, and data set to the client's WebRTC SDP. The voice server will respond with a Session Description payload, with the sdp field set to the server's WebRTC SDP. The client can then use this SDP to establish a WebRTC connection.
Speaking
To notify the voice server that you are speaking or have stopped speaking, send an Opcode 5 Speaking payload:
Speaking Structure
| Field | Type | Description |
|---|---|---|
| speaking 1 | integer | The speaking flags |
| ssrc | integer | The SSRC of the speaking user |
| user_id 2 | snowflake | The user ID of the speaking user |
| delay? 3 | integer | The speaking packet delay |
1 For Gateway v3 and below, this field is a boolean.
2 Only sent by the voice server.
3 Not sent by the voice server.
Speaking Flags
| Value | Name | Description |
|---|---|---|
| 1 << 0 | VOICE | Normal transmission of voice audio |
| 1 << 1 | SOUNDSHARE | Transmission of context audio for video, no speaking indicator |
| 1 << 2 | PRIORITY | Priority speaker, lowering audio of other speakers |
Example Speaking (Send)
{"op": 5,"d": {"speaking": 5,"delay": 0,"ssrc": 1}}
When a different user's speaking state is updated, and for each user with a speaking state at connection start, the voice server will send an Opcode 5 Speaking payload:
Example Speaking (Receive)
{"op": 5,"d": {"speaking": 5,"ssrc": 2,"user_id": "852892297661906993"}}
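The speaking bitmask and the SSRC cache used to attribute received packets can be sketched as below. `decode_speaking`, `on_speaking`, and the module-level `ssrc_to_user` dictionary are hypothetical names; treating a boolean `true` as `VOICE` for Gateway v3 and below is an assumption.

```python
from enum import IntFlag
from typing import Any, Dict, Union


class SpeakingFlags(IntFlag):
    VOICE = 1 << 0
    SOUNDSHARE = 1 << 1
    PRIORITY = 1 << 2


def decode_speaking(value: Union[bool, int]) -> SpeakingFlags:
    """Decode the speaking field, which is a boolean on Gateway v3 and below."""
    if isinstance(value, bool):  # must be checked before int, since bool is an int
        return SpeakingFlags.VOICE if value else SpeakingFlags(0)
    return SpeakingFlags(value)


# Cache of SSRC -> user ID, filled from received Speaking events.
ssrc_to_user: Dict[int, str] = {}


def on_speaking(d: Dict[str, Any]) -> SpeakingFlags:
    """Handle the d object of a received Opcode 5 Speaking payload."""
    ssrc_to_user[d["ssrc"]] = d["user_id"]
    return decode_speaking(d["speaking"])
```

With the receive example above, `on_speaking` yields `VOICE | PRIORITY` and records that SSRC 2 belongs to user 852892297661906993.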
Video
To notify the voice server that you are sending video, send an Opcode 12 Video payload:
Video Structure
| Field | Type | Description |
|---|---|---|
| audio_ssrc | integer | The SSRC of the audio stream |
| video_ssrc | integer | The SSRC of the video stream |
| rtx_ssrc 1 | integer | The SSRC of the retransmission stream |
| streams | array[stream object] | Simulcast streams to send |
| user_id 2 | snowflake | The user ID of the video user |
1 Not sent by the voice server.
2 Only sent by the voice server.
Example Video (Send)
{"op": 12,"d": {"audio_ssrc": 13959,"video_ssrc": 13960,"rtx_ssrc": 13961,"streams": [{"type": "video","rid": "100","ssrc": 13960,"active": true,"quality": 100,"rtx_ssrc": 13961,"max_bitrate": 9000000,"max_framerate": 60,"max_resolution": {"type": "source","width": 0,"height": 0}}]}}
When a different user's video state is updated, and for each user with a video state at connection start, the voice server will send an Opcode 12 Video payload:
Example Video (Receive)
{"op": 12,"d": {"user_id": "852892297661906993","audio_ssrc": 13959,"video_ssrc": 13960,"streams": [{"ssrc": 13960,"rtx_ssrc": 13961,"rid": "100","quality": 100,"max_resolution": {"width": 0,"type": "source","height": 0},"max_framerate": 60,"active": true}]}}
Voice Data Interpolation
When there's a break in the sent data, the packet transmission shouldn't simply stop. Instead, send five frames of silence (0xF8, 0xFF, 0xFE) before stopping to avoid unintended Opus interpolation with subsequent transmissions.
Likewise, when you receive these five frames of silence, you know that the user has stopped speaking.
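One way to guarantee the trailing silence is to wrap the outgoing frame source, as in this minimal sketch. The generator name is hypothetical, and the frames are assumed to be Opus payloads that have not yet been encrypted or wrapped in RTP.

```python
from typing import Iterable, Iterator

SILENCE_FRAME = bytes([0xF8, 0xFF, 0xFE])
SILENCE_FRAME_COUNT = 5


def with_trailing_silence(frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield every Opus frame, then five silence frames before stopping."""
    yield from frames
    for _ in range(SILENCE_FRAME_COUNT):
        yield SILENCE_FRAME
```

Each yielded frame would then go through the usual RTP and encryption path, so the silence frames advance the sequence number and timestamp like any other frame.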
Resuming Voice Connection
When your client detects that its connection has been severed, it should open a new WebSocket connection. Once the new connection has been opened, your client should send an Opcode 7 Resume payload:
Resume Structure
| Field | Type | Description |
|---|---|---|
| server_id | snowflake | The ID of the guild or private channel being connected to |
| channel_id 2 | snowflake | The ID of the channel being connected to |
| session_id | string | The session ID of the current session |
| token | string | The voice token for the current session |
| seq_ack? 1 | integer | The last received sequence number |
1 Only available on Gateway v8 and above.
2 Only required for Gateway v9 and above.
Example Resume
{"op": 7,"d": {"server_id": "41771983423143937","session_id": "30f32c5d54ae86130fc4a215c7474263","token": "66d29164ee8cd919"}}
If successful, the voice server will respond with an Opcode 9 Resumed to signal that your client is now resumed:
Example Resumed
{"op": 9,"d": null}
If the resume is unsuccessful—for example, due to an invalid session—the WebSocket connection will close with the appropriate close code. You should then follow the Connecting flow to reconnect.
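The version-dependent fields of the Resume payload can be assembled as in the sketch below. `build_resume` is a hypothetical helper; omitting `channel_id` below v9 and `seq_ack` below v8 is an assumption drawn from the footnotes on the structure table.

```python
from typing import Any, Dict, Optional


def build_resume(server_id: str, session_id: str, token: str,
                 gateway_version: int, channel_id: Optional[str] = None,
                 seq_ack: Optional[int] = None) -> Dict[str, Any]:
    """Build an Opcode 7 Resume payload for the given Gateway version."""
    d: Dict[str, Any] = {
        "server_id": server_id,
        "session_id": session_id,
        "token": token,
    }
    if gateway_version >= 9:
        # channel_id is required from Gateway v9 onwards.
        if channel_id is None:
            raise ValueError("channel_id is required for Gateway v9 and above")
        d["channel_id"] = channel_id
    if gateway_version >= 8 and seq_ack is not None:
        # seq_ack is only available on Gateway v8 and above.
        d["seq_ack"] = seq_ack
    return {"op": 7, "d": d}
```

The returned dictionary would be serialized as JSON and sent as the first message on the newly opened WebSocket.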
Buffered Resume
Since version 8, the Gateway can resend buffered messages that have been lost upon resume. To support this, the Gateway includes a sequence number with all messages that may need to be re-sent.
Example Message With Sequence Number
{"op": 5,"d": {"speaking": 0,"delay": 0,"ssrc": 110},"seq": 10}
A client using Gateway v8 or above must include the last sequence number it received as seq_ack under the data d key in both the Opcode 3 Heartbeat and Opcode 7 Resume payloads.
If no sequence numbered messages have been received, seq_ack can be omitted or included with a value of -1.
The Gateway uses a fixed bit length sequence number and handles wrapping the sequence number around. Since Gateway messages will always arrive in order, a client only needs to retain the last sequence number they have seen.
If the session is successfully resumed, the Gateway will respond with an Opcode 9 Resumed and will re-send any messages that the client did not receive.
The resume may be unsuccessful if the buffer for the session no longer contains a message that has been missed. In this case the session will be closed and you should then follow the Connecting flow to reconnect.
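Because messages arrive in order, tracking the sequence number reduces to remembering the most recent `seq` value. The class below is a minimal sketch with hypothetical names; it reports -1 when no numbered message has been seen, which is one of the two options described above.

```python
from typing import Any, Dict, Optional


class SequenceTracker:
    """Remember the last sequence number seen on the voice WebSocket."""

    def __init__(self) -> None:
        self.last_seq: Optional[int] = None

    def observe(self, message: Dict[str, Any]) -> None:
        """Record seq from a received message, if it carries one."""
        if "seq" in message:
            # Overwrite rather than take the maximum: the Gateway wraps the
            # number around, so the latest value is the one to acknowledge.
            self.last_seq = message["seq"]

    @property
    def seq_ack(self) -> int:
        """Value to send as seq_ack in Heartbeat and Resume payloads."""
        return -1 if self.last_seq is None else self.last_seq
```

Calling `observe` on every decoded message before dispatching it keeps `seq_ack` current for the next heartbeat or resume attempt.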
Connected Clients
Client Connections
At connection start, and when a client thereafter connects to voice, the voice server will send a series of events. This includes an Opcode 11 Clients Connect containing every connected user, as well as individual Opcode 18 Client Flags and Opcode 20 Client Platform for each user.
These events are meant to inform a new client of all existing clients and their flags/platform, and inform existing clients of a newly-connected client.
Clients Connect Structure
| Field | Type | Description |
|---|---|---|
| user_ids | array[snowflake] | The IDs of the users that connected |
Example Clients Connect
{"op": 11,"d": {"user_ids": ["852892297661906993"]}}
Client Flags Structure
| Field | Type | Description |
|---|---|---|
| user_id | snowflake | The ID of the user that connected |
| flags | ?integer | The user's voice flags |
Voice Flags
| Value | Name | Description |
|---|---|---|
| 1 << 0 | CLIPS_ENABLED | User has clips enabled |
| 1 << 1 | ALLOW_VOICE_RECORDING | User has allowed their voice to be recorded in another user's clips |
| 1 << 2 | ALLOW_ANY_VIEWER_CLIPS | User has allowed stream viewers to clip them |
Example Client Flags
{"op": 18,"d": {"user_id": "852892297661906993","flags": 3}}
Client Platform Structure
| Field | Type | Description |
|---|---|---|
| user_id | snowflake | The ID of the user that connected |
| platform | ?integer | The user's voice platform |
Voice Platform
| Value | Name | Description |
|---|---|---|
| 0 | DESKTOP | Desktop-based client |
| 1 | MOBILE | Mobile client |
| 2 | XBOX | Xbox integration |
| 3 | PLAYSTATION | PlayStation integration |
Example Client Platform
{"op": 20,"d": {"user_id": "852892297661906993","platform": 0}}
Client Disconnections
When a user disconnects from voice, the voice server will send an Opcode 13 Client Disconnect.
When received, the SSRC of the user should be discarded.
Client Disconnect Structure
| Field | Type | Description |
|---|---|---|
| user_id | snowflake | The ID of the user that disconnected |
Example Client Disconnect
{"op": 13,"d": {"user_id": "852892297661906993"}}
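The connect, flags, platform, and disconnect events can be folded into a single registry of connected clients, as in this sketch. All class and method names are hypothetical, and storing the SSRC from Opcode 5 Speaking on the same record is an assumption made so that removing the client also discards its SSRC.

```python
from dataclasses import dataclass
from enum import IntEnum, IntFlag
from typing import Any, Dict, Optional


class VoiceFlags(IntFlag):
    CLIPS_ENABLED = 1 << 0
    ALLOW_VOICE_RECORDING = 1 << 1
    ALLOW_ANY_VIEWER_CLIPS = 1 << 2


class VoicePlatform(IntEnum):
    DESKTOP = 0
    MOBILE = 1
    XBOX = 2
    PLAYSTATION = 3


@dataclass
class ConnectedClient:
    flags: Optional[VoiceFlags] = None
    platform: Optional[VoicePlatform] = None
    ssrc: Optional[int] = None


class ClientRegistry:
    def __init__(self) -> None:
        self.clients: Dict[str, ConnectedClient] = {}

    def _get(self, user_id: str) -> ConnectedClient:
        return self.clients.setdefault(user_id, ConnectedClient())

    def handle(self, message: Dict[str, Any]) -> None:
        op, d = message["op"], message["d"]
        if op == 11:  # Clients Connect
            for user_id in d["user_ids"]:
                self._get(user_id)
        elif op == 18:  # Client Flags (nullable)
            flags = d["flags"]
            self._get(d["user_id"]).flags = None if flags is None else VoiceFlags(flags)
        elif op == 20:  # Client Platform (nullable)
            platform = d["platform"]
            self._get(d["user_id"]).platform = (
                None if platform is None else VoicePlatform(platform))
        elif op == 5 and "user_id" in d:  # Speaking, received from the server
            self._get(d["user_id"]).ssrc = d["ssrc"]
        elif op == 13:  # Client Disconnect: drop the user and their SSRC
            self.clients.pop(d["user_id"], None)
```

Note that `VoicePlatform(platform)` raises `ValueError` for values outside the table, so a production client may prefer to keep unknown platforms as plain integers.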
Simulcasting
The voice server supports simulcasting, allowing clients to send multiple video streams of different qualities and adjust the quality of the video stream they receive to fit bandwidth constraints. This can be used to lower the quality of a received video stream when the user is not in focus, or to disable the transmission of a voice or video stream entirely when a user is off-screen or a client has muted them.
A media stream specified by a given SSRC can be requested at a quality level between 0 and 100, with 0 disabling it entirely and 100 being the highest quality.
Additionally, if the user offers multiple streams for a given media type, the client can request a specific stream by setting its quality level to 100 and the others to 0.
A special SSRC value of any can be used to request a quality level for all streams.
Clients may request the media quality they want per SSRC by sending an Opcode 15 Media Sink Wants payload with a mapping of SSRCs to quality levels.
Clients may also specify a pixelCounts field to indicate the preferred resolution of the video stream for each SSRC, which can be used by the voice server to determine the best quality level to send based on the client's capabilities and preferences.
Likewise, the voice server may send an Opcode 15 Media Sink Wants payload to inform the client of the quality levels it should be sending for each SSRC.
Example Media Sink Wants
{"op": 15,"d": {"8964": 100,"pixelCounts": {"8964": 1189844.5769597634}}}
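A Media Sink Wants payload is a flat mapping of SSRCs to quality levels, optionally accompanied by `pixelCounts`. The helpers below are a hypothetical sketch of building one; serializing SSRC keys as strings follows the example above, and the range check reflects the 0 to 100 quality scale.

```python
from typing import Any, Dict, Iterable, Optional, Union

Ssrc = Union[int, str]  # an integer SSRC, or the special value "any"


def build_media_sink_wants(qualities: Dict[Ssrc, int],
                           pixel_counts: Optional[Dict[Ssrc, float]] = None
                           ) -> Dict[str, Any]:
    """Build an Opcode 15 Media Sink Wants payload."""
    d: Dict[str, Any] = {}
    for ssrc, quality in qualities.items():
        if not 0 <= quality <= 100:
            raise ValueError(f"quality for {ssrc} must be between 0 and 100")
        d[str(ssrc)] = quality
    if pixel_counts:
        d["pixelCounts"] = {str(s): count for s, count in pixel_counts.items()}
    return {"op": 15, "d": d}


def select_stream(wanted: int, offered: Iterable[int]) -> Dict[Ssrc, int]:
    """Request one of several offered streams: 100 for it, 0 for the rest."""
    return {ssrc: (100 if ssrc == wanted else 0) for ssrc in offered}
```

For instance, `build_media_sink_wants(select_stream(13960, [13960, 13962]))` would request only the stream with SSRC 13960, where 13962 is a made-up SSRC for a second simulcast stream.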
Voice Backend Version
For analytics, the client may want to receive information about the voice backend's current version. To do so, send an Opcode 16 Voice Backend Version with an empty payload:
Voice Backend Version Structure
| Field | Type | Description |
|---|---|---|
| voice | string | The voice backend's version |
| rtc_worker | string | The WebRTC worker's version |
Example Voice Backend Version (Send)
{"op": 16,"d": {}}
In response, the voice server will send an Opcode 16 Voice Backend Version payload with the versions:
Example Voice Backend Version (Receive)
{"op": 16,"d": {"voice": "0.9.1","rtc_worker": "0.3.35"}}
Streams
Stream connections operate in a similar fashion to regular voice connections. In fact, on the protocol side, they are identical and use all of the payloads and processes described above. The main differences are within the Gateway protocol, as streams are started and joined differently to regular voice connections.
Connecting to Streams
To start or join a stream, the client must first be connected to the voice instance that the stream is hosted on. Then, send a Create Stream or Watch Stream payload to the Gateway.
If the request succeeds, as with voice, you must wait for the Gateway to respond with two events: a Stream Create event and a Stream Server Update event.
You can then use the information provided in these events to establish a connection to the stream server as outlined in Connecting to Voice. Note that the server_id used when identifying will be provided in the Stream Create event.
If joining a stream fails, the Gateway will instead respond with a Stream Delete event, which contains the reason for the failure.