
Attention, Transformers, and Transfer Learning

Cornell University

Sequence Models

Recurrent Neural Network

An RNN resembles a finite state machine, which has a state transition function

$$ s_{i+1} = \delta(s_i, x_i) $$

except that the state space an RNN operates on is continuous and real-valued rather than finite.
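
As a minimal sketch of this analogy, one step of a vanilla RNN implements the transition function δ. The tanh nonlinearity, weight shapes, and toy dimensions below are illustrative assumptions, not something fixed by the notes:

```python
import numpy as np

def rnn_step(s, x, W_s, W_x, b):
    """One application of the transition function delta:
    s_{i+1} = tanh(W_s @ s_i + W_x @ x_i + b)."""
    return np.tanh(W_s @ s + W_x @ x + b)

# Toy dimensions (hypothetical): 4-dim state, 3-dim input tokens.
rng = np.random.default_rng(0)
state_dim, input_dim = 4, 3
W_s = rng.normal(size=(state_dim, state_dim))
W_x = rng.normal(size=(state_dim, input_dim))
b = np.zeros(state_dim)

# Scan the transition over a sequence of inputs, exactly like an
# FSM consuming symbols, but with real-valued states.
s = np.zeros(state_dim)                    # initial state s_0
for x in rng.normal(size=(5, input_dim)):  # 5 input tokens
    s = rnn_step(s, x, W_s, W_x, b)
print(s)
```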

Attention

Attention lets tokens interact not through their ordering or position (as in an RNN), but through how similar they are to one another in the embedding space.

You can think of them as a differentiable relaxation of the concept of a “dictionary” -- Christian Szegedy
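
A hard dictionary returns the value whose key exactly matches the query; attention relaxes this to a weighted average over all values, with weights given by query-key similarity. Here is a minimal sketch of scaled dot-product attention under that reading; the names Q, K, V and the toy sizes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention as a soft dictionary lookup:
    each query is compared to every key, and the softmax turns the
    similarity scores into weights for a convex combination of values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # similarity in embedding space
    return softmax(scores, axis=-1) @ V

# Hypothetical toy example: 2 queries against a 3-entry "dictionary".
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 8))
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
print(attention(Q, K, V).shape)  # (2, 8)
```

If one query matched one key exactly and all other scores were very negative, the softmax weights would approach one-hot and the lookup would reduce to an ordinary dictionary access, which is what makes the relaxation differentiable.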