Speech-to-speech translation (STST) can generally be seen as the combination of three sub-tasks: (i) transcribing speech to text in a
source language (ASR), (ii) translating text from a source to a target language (MT), and (iii) generating speech from text in a target
language (TTS). Significant progress has recently been made in each of these three tasks, as well as in their joint combination.
Remarkably, the state of the art in both pipelined and end-to-end STST systems is achieved by deep learning models that have
fundamental characteristics in common: they are all sequence-to-sequence (seq2seq) models with an encoder, a decoder and an attention network.
This course will start with a general introduction to deep learning, and will then focus on sequence-to-sequence models for MT, ASR and
TTS, including variations and combinations of them.
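
As a rough illustration of the attention component mentioned above, the following minimal sketch computes attention weights and a context vector in plain Python. It assumes dot-product scoring, which is only one of several possible scoring functions and is not specified by this course description; the names softmax, attend, encoder_states and query are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Dot-product attention (an assumed scoring function).

    Scores each encoder state against the decoder query, normalises the
    scores with a softmax, and returns the weights together with the
    weighted average of the encoder states (the context vector).
    """
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy example: three encoder states of dimension 2 and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(query, encoder_states)
```

In this toy case the first and third encoder states score equally against the query and therefore receive the same weight, while the second state, which is orthogonal to the query, receives the smallest weight.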
Synopsis (tentative)
1. Fundamentals of neural networks: feed-forward networks, activation functions, loss functions, training criteria, stochastic gradient descent,
   learning rate policies, computation graph, back-propagation algorithm.
2. Recurrent neural networks: time-unfolded representation, back-propagation through time, vanishing and exploding gradient problems,
   long short-term memory units, gated recurrent units.
3. Recurrent neural MT: encoder-decoder architecture, attention model, beam search, model variations, large vocabulary methods,
   ensemble decoding.
4. Non-recurrent neural MT: sequence-to-sequence models, convolutional networks, transformer models, universal transformer.
5. Training criteria: cross-entropy, data as demonstrator, reinforcement learning, bandit learning, minimum risk training,
   curriculum learning, adversarial learning.
6. Neural ASR and TTS: task definition, training data, seq-to-seq models, performance.
7. Neural end-to-end models: task definition, training data, architectures, performance.