Speech Recognition and Graph Transformer Network II

$$\gdef \sam #1 {\mathrm{softargmax}(#1)}$$ $$\gdef \vect #1 {\boldsymbol{#1}} $$ $$\gdef \matr #1 {\boldsymbol{#1}} $$ $$\gdef \E {\mathbb{E}} $$ $$\gdef \V {\mathbb{V}} $$ $$\gdef \R {\mathbb{R}} $$ $$\gdef \N {\mathbb{N}} $$ $$\gdef \relu #1 {\texttt{ReLU}(#1)} $$ $$\gdef \D {\,\mathrm{d}} $$ $$\gdef \deriv #1 #2 {\frac{\D #1}{\D #2}}$$ $$\gdef \pd #1 #2 {\frac{\partial #1}{\partial #2}}$$ $$\gdef \set #1 {\left\lbrace #1 \right\rbrace} $$ % My colours $$\gdef \aqua #1 {\textcolor{8dd3c7}{#1}} $$ $$\gdef \yellow #1 {\textcolor{ffffb3}{#1}} $$ $$\gdef \lavender #1 {\textcolor{bebada}{#1}} $$ $$\gdef \red #1 {\textcolor{fb8072}{#1}} $$ $$\gdef \blue #1 {\textcolor{80b1d3}{#1}} $$ $$\gdef \orange #1 {\textcolor{fdb462}{#1}} $$ $$\gdef \green #1 {\textcolor{b3de69}{#1}} $$ $$\gdef \pink #1 {\textcolor{fccde5}{#1}} $$ $$\gdef \vgrey #1 {\textcolor{d9d9d9}{#1}} $$ $$\gdef \violet #1 {\textcolor{bc80bd}{#1}} $$ $$\gdef \unka #1 {\textcolor{ccebc5}{#1}} $$ $$\gdef \unkb #1 {\textcolor{ffed6f}{#1}} $$ % Vectors $$\gdef \vx {\pink{\vect{x }}} $$ $$\gdef \vy {\blue{\vect{y }}} $$ $$\gdef \vb {\vect{b}} $$ $$\gdef \vz {\orange{\vect{z }}} $$ $$\gdef \vtheta {\vect{\theta }} $$ $$\gdef \vh {\green{\vect{h }}} $$ $$\gdef \vq {\aqua{\vect{q }}} $$ $$\gdef \vk {\yellow{\vect{k }}} $$ $$\gdef \vv {\green{\vect{v }}} $$ $$\gdef \vytilde {\violet{\tilde{\vect{y}}}} $$ $$\gdef \vyhat {\red{\hat{\vect{y}}}} $$ $$\gdef \vycheck {\blue{\check{\vect{y}}}} $$ $$\gdef \vzcheck {\blue{\check{\vect{z}}}} $$ $$\gdef \vztilde {\green{\tilde{\vect{z}}}} $$ $$\gdef \vmu {\green{\vect{\mu}}} $$ $$\gdef \vu {\orange{\vect{u}}} $$ % Matrices $$\gdef \mW {\matr{W}} $$ $$\gdef \mA {\matr{A}} $$ $$\gdef \mX {\pink{\matr{X}}} $$ $$\gdef \mY {\blue{\matr{Y}}} $$ $$\gdef \mQ {\aqua{\matr{Q }}} $$ $$\gdef \mK {\yellow{\matr{K }}} $$ $$\gdef \mV {\lavender{\matr{V }}} $$ $$\gdef \mH {\green{\matr{H }}} $$ % Coloured math $$\gdef \cx {\pink{x}} $$ $$\gdef \ctheta {\orange{\theta}} $$ $$\gdef \cz {\orange{z}} $$ $$\gdef \Enc {\lavender{\text{Enc}}} $$ $$\gdef \Dec {\aqua{\text{Dec}}}$$
🎙️ Awni Hannun

Inference Time

The inference of a transcription from a given audio signal can be formulated using 2 distributions:

  • The acoustic model (audio to transcription), represented as $P(\mY \mid \mX)$
  • The language model, $P(\mY)$

The final inference is obtained by taking sum of the log probabilities of the above two, i.e.

\[\vect{\mY*} = \underset{\mY}{\operatorname{argmax}} \log P(\mY \mid \mX) + \log P(\mY)\]

We use the additional term to ensure the transcription is consistent with the rules of the language. Else we may get grammatically wrong transcription.

While returning an output sequence, we can follow the greedy approach, where we take the maximum value of $P(\vy_{t} \mid \vy_{t-1}…\vy_{1})$

However, we can end up missing out a good sequence, which may not have the maximum value of $P(\vy_{t} \mid …)$, as illustrated by the example below.

Figure 1: Greedy Approach : Less Optimal Solution

To remedy this, we employ beam search. Essentially, we consider maximum $k$ tokens in terms of probability at each step $t$ of the sequence. For each of these n-grams, we proceed further and find out the maximum.

The illustration below shows how Beam Search can lead to a better sequence.

Figure 2: Beam Search : Stage 1

Figure 3: Beam Search : Stage 2

Figure 4: Beam Search : Stage 3

Graph Transformer Networks

We have previously seen Weighted Finite State Automata (WFSA) being used to represent the alignment graphs, as shown before.

Graph Transformer Networks (GTNs) are basically WSFA with automatic differentiation.

Lets look at key differences between Neural Networks (NNs) and GTNs

Core Data Structure Tensor Graph/WFSA
  Matrix Multiplication Compose
Core Operations Reduction operations (Sum, Product) Shortest Distance (Viterbi,Forward)
  Negate, Add, Subtract, … Closure, Union, Concatenate, …

Why Differentiable WSFAs?

  • Encode Priors: easy to encode priors into a graph
  • End-to-end: Bridge the training and test process
  • Facilitate research: When we separate the data from the code it makes it easier to explore different ideas. In our case graph is the data and the code is operations allows us to explore different ideas by changing the graph.

Sequence Criteria with WFSAs

It turns out that lots of the sequence criteria like connection temporal classification (CTC) can be specified as the difference between two forward scores.

  • Graphs A: function of input \mX (speech) and target “output” \mY (transcription).
  • Graph Z: function of input \mX, which servers as normalization.
  • Loss: $\log P(\mY \mid \mX) = \text{forwardScore}(A_{\mX,\mY}) - \text{forwardScore}(Z_{\mX})$

Other common loss functions in ASR:

  • Automatic Segmentations Criterion (ASG)
  • Connectionist TEmporal Classification (CTC)
  • Lattice Free MMI (LF-MMI)

Lines of code for CTC: Custom vs GTN

Warp-CTC 9,742
wav2letter 2,859
PyTorch 1,161
GTN 30*

*Same graph code works for decoding.

Different Types of WFS

Weighted Finite-State Acceptor (WFSA)

The bold circle is the starting state and the accepting state is the one with concentric circles.


A simple WFSA which recognizes aa or ab

  • The score of aa is 0 + 2 = 2
  • The score of ba is 1 + 2 = 3

Figure 5: Three Nodes Weighted Finite-State Acceptor (WFSA)

Weighted Finite-State Transducer (WFST)

These graphs are called transducers because they map input sequences to output sequences.


A simple WFST that recognizes ab to xz and bb to yz.

  • The score of ab $\to$ xz is 1.1 + 3.3 = 4.4
  • The score of bb $\to$ yz is 2.0 + 3.3 = 5.3

Figure 6: Three Nodes Weighted Finite-State Transducer (WFST)

More WFSAs and WFSTs

  • Cycles and self-loops are allowed
  • Multiple start and accept nodes are allowed
  • $\epsilon$ “nothing” transitions are allowed in WFSAs and WFSTs.



The union accepts a sequence if it is accepted by any of the input graphs.

Figure 7: Union Operation Between Three Graphs

Kleene Closure

Accepts any accepted sequence by input graph repeated 0 or more times.

Figure 8: Kleene Closure Operation


Similar to matrix multiplication or convolution. This operation is probably the most important in GTNs.

  1. Any path that is accepted by both WFSAs is accepted by the product of the intersection.
  2. The score of the intersected path is the sum of the scores of the paths in the input graphs.

Figure 9: Intersection Operation Between Two Graphs


It is basically the same thing as intersection but for transducers instead of acceptors.

Figure 10: Compose Operation Between Two Graphs

The main thing about composition is that it let’s map from different domains. For example, if we have a graph that maps from letters to words and another graph that maps from words to sentences. When we compose these two graphs we will map from letters to sentences.

Forward Score

Assumes the graph is DAG and an efficient dynamic programming algorithm.

Figure 11: Forward Score is the Softmax of the Paths

The graphs accepts three paths:

  • aca with score = 1.1 + 1.4 + 2.1
  • ba with score = 3.2 + 2.1
  • ca with score 1.4 + 2.1

ForwardScore(g) is the actual-softmax of the paths.

Sequence Criteria with WFSTs

Target graph Y:

Figure 12: Target Graph \mY

Emissions graph E:

Think of it between two nodes we have logits.

Figure 13: Emission Graph E

Target contrained graph A by intersection(Y, E):

Figure 14: Target Constrained Graph


\[\ell = - (\text{forwardScore}(A) - \text{forwardScore}(E))\]

This is not CTC but approaches CTC.

Code examples:

Make the target graph

Similar to figure 8.

import gtn

# Make the graph 
target = gtn.Graph(calc_grad=False)
# Add nodes:
# Add arcs:
target.add_arc(src_node=0, dst_node=1, label=0)
target.add_arc(src_node=1, dst_node=1, label=0)
target.add_arc(src_node=1, dst_node=2, label=1)
target.add_arc(src_node=2, dst_node=2, label=1)
# Draw the graph
label_map = {0: 'a', 1: 'b'}
gtn.draw(target, "target.pdf", label_map)

Make the emissions graph

Similar to figure 9.

import gtn

# Emissions array (logits)
emissions_array = np.random.randn(4, 3)
# Make the graph 
emissions = gtn.linear_graph(4, 3, calc_grad=True)
# Set the weights 



  1. Compute the graphs
  2. Compute the loss
  3. Automatic gradients
from gtn import *

def ASG(emissions, target):
    # Step 1
    # Compute constrained and normalization graphs:
    A = intersect(target, emissions)
    Z = emissions

    # Step 2: 
    # Forward both.graphs: 
    A_score = forward(A)
    Z_score = forward(Z)
    # Compute loss:
    loss = negate(subtract(A_score, Z_score))
    # Step 3: 
    # Clear previous gradients:
    # Compute gradients:
    backward(loss, retain_graph=False)
    return loss.item(), emissions.grad()


from gtn import *

def CTC(emissions, "target"):

The only difference is the target which makes it very easy to try different algorithms and the GTN framework is supposed to be similar to PyTorch with different data structure and operations.

Further readings:

📝 Gyanesh Gupta, Abed Qaddoumi
14th April 2021