Deep Learning

Deep Learning Paper Recap - Diffusion and Transformer Models

This week’s Deep Learning Paper Reviews are Diffusion-LM Improves Controllable Text Generation and Sparsifying Transformer Models with Trainable Representation Pooling.

Diffusion-LM Improves Controllable Text Generation

What’s Exciting About this Paper

This paper marks the first time that continuous diffusion models have been applied to controllable NLG (Natural Language Generation), allowing gradient-based methods to be used for control.

Innovative “embedding” and “rounding” steps are added at the end of the Markov chain, casting the discrete text problem as a continuous one.
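
To make this concrete, here is a minimal sketch of what those two extra steps could look like, assuming a nearest-neighbour rounding rule and standard PyTorch; the names and noise scale are our own simplification, not the paper’s exact parametrization (the paper learns its rounding distribution).

```python
import torch
import torch.nn as nn

# Illustrative sketch only, not the paper's implementation.
vocab_size, dim = 50_000, 16
embedding = nn.Embedding(vocab_size, dim)

def embed_step(tokens: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
    """Embedding step: map discrete token ids (seq_len,) to noisy continuous latents x_0."""
    return embedding(tokens) + sigma * torch.randn(tokens.shape[0], dim)

def rounding_step(x0: torch.Tensor) -> torch.Tensor:
    """Rounding step: map continuous latents (seq_len, dim) back to the nearest token ids."""
    dists = torch.cdist(x0, embedding.weight)  # (seq_len, vocab_size)
    return dists.argmin(dim=-1)                # closest embedding per position

tokens = torch.tensor([17, 2048, 5])
recovered = rounding_step(embed_step(tokens, sigma=0.0))  # recovers the original ids
```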

Key Findings

The authors outperform existing methods such as PPLM and FUDGE across various controllable text generation tasks: conditioning on semantic content, parts of speech, syntax trees, syntax spans, length, and left/right context.

Diffusion-LM does not match per-task LM (Language Model) fine-tuning, but its LM stays frozen, so the same model can be reused across tasks.
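
Because the latents are continuous, control can be imposed with plain gradient steps on a classifier’s log-probability during decoding. Below is a minimal sketch of one such update, assuming a user-supplied `classifier` that returns log p(control | x_t); it is a simplified stand-in for the paper’s plug-and-play control, not its exact update rule (which, as we understand it, also balances a fluency term from the diffusion model).

```python
import torch

def control_step(x_t: torch.Tensor, classifier, step_size: float = 0.1) -> torch.Tensor:
    """One illustrative gradient step nudging a continuous latent toward a control target."""
    x_t = x_t.detach().requires_grad_(True)
    log_prob = classifier(x_t).sum()             # assumed to return log p(control | x_t)
    (grad,) = torch.autograd.grad(log_prob, x_t)
    return (x_t + step_size * grad).detach()     # move the latent toward the control target
```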

Source

Our Takeaways

Further research is necessary, but it’s exciting to see diffusion modeling techniques being applied to NLG, given their success in the image and audio synthesis domains.

However, a major bottleneck that must be addressed is slow decoding: Diffusion-LM is ~7x slower than autoregressive text generation even with a minimal number of diffusion steps (200, whereas ~2,000 is more typical).

Sparsifying Transformer Models with Trainable Representation Pooling

What’s Exciting About this Paper

This paper proposes a novel method to sparsify transformer architectures that achieves sublinear time and memory complexity. The method, called representation pooling, learns to select the most useful representations based on the advantage they give on a downstream task.
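
As a rough sketch of the idea, assume a learned linear scorer and a hard top-k selection, with the common multiply-by-score trick so gradients still reach the scorer; the paper instead relies on its differentiable Successive Halving Top-k operator, described next. The class below is our own simplification, not the paper’s exact mechanism.

```python
import torch
import torch.nn as nn

class ScoredTopKPooling(nn.Module):
    """Illustrative representation pooling: keep only the k highest-scoring tokens."""

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # trainable scoring function
        self.k = k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim)
        scores = self.scorer(hidden).squeeze(-1)              # (batch, seq_len)
        topk = scores.topk(self.k, dim=-1)
        selected = hidden.gather(
            1, topk.indices.unsqueeze(-1).expand(-1, -1, hidden.size(-1))
        )                                                     # (batch, k, dim)
        # Weight by the (squashed) scores so the scorer receives a gradient.
        return selected * torch.sigmoid(topk.values).unsqueeze(-1)
```

Only the k surviving representations are passed on to the subsequent layers, which is what shrinks the attention cost on long inputs.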

This paper also presents a Successive Halving Top-k operator that outperforms previous approaches in terms of approximation quality and speed.

The authors provide a detailed analysis of the operator’s differentiability and show that the whole model remains trainable in an end-to-end manner.
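
The paper’s operator is differentiable so that the scorer can be trained end to end; the hard, tournament-style variant below only illustrates the successive-halving idea of repeatedly discarding half of the candidates rather than fully sorting the scores (it assumes the number of scores is k times a power of two).

```python
import torch

def successive_halving_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Approximate top-k by repeatedly keeping the winner of neighbouring pairs.

    Illustrative, non-differentiable variant; assumes len(scores) == k * 2**m."""
    idx = torch.arange(scores.shape[0])
    while idx.shape[0] > k:
        pairs = idx.view(-1, 2)                  # neighbouring candidates
        left, right = pairs[:, 0], pairs[:, 1]
        idx = torch.where(scores[left] >= scores[right], left, right)
    return idx

scores = torch.tensor([0.1, 0.9, 0.4, 0.3, 0.8, 0.2, 0.7, 0.6])
successive_halving_topk(scores, k=2)  # tensor([1, 4]) here, matching the true top-2
```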

Key Findings

Experiments with trainable representation pooling achieved a 1.8× speedup during training and a 4.5× speedup during inference.

Results on the long-document summarization task showed that even a simple baseline performs comparably to the current SOTA.

Source

Our Takeaways

Representation pooling coupled with a trainable scoring function can help reduce the time and memory complexity of transformer architectures while maintaining top-quality performance on long-document summarization.

This technique, however, might not be as useful for tasks where the input sequences are short, such as sentence-level translation.