Graph Networks
These notes follow the very detailed and clear intro to Graph Networks by DeepMind ("Relational inductive biases, deep learning, and graph networks", Battaglia et al., 2018); the section and box numbers below refer to that paper.
Graph Definition (Box 3 & 3.2.1)
We define a graph to have node, edge, and global attributes. The global attribute u, for example, can be the gravitational field in a three-body problem setting.
The global attribute exists to give every local node/edge a chance to know what is happening from a global perspective (mostly in places far away from it). If we exclude the global u (which aggregates information from across the nodes and edges), the information that a node has access to after m steps of propagation is determined by the set of nodes and edges that are at most m hops away (Figure 7). This can be interpreted as breaking down a complex computation into smaller elementary steps.
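As a concrete picture, here is a minimal Python sketch of such a graph (the container and names are my own, not from the paper): node, edge, and global attributes, with each edge storing the indices of its sender and receiver nodes.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Graph:
    nodes: np.ndarray      # [num_nodes, node_dim]   V = {v_i}
    edges: np.ndarray      # [num_edges, edge_dim]   E = {e_k}
    senders: np.ndarray    # [num_edges]             s_k, sender node index of edge k
    receivers: np.ndarray  # [num_edges]             r_k, receiver node index of edge k
    globals_: np.ndarray   # [global_dim]            u

# A toy three-body graph: 3 nodes (bodies), fully connected by 6 directed edges,
# plus a global attribute (e.g., the strength of an external gravitational field).
graph = Graph(
    nodes=np.random.randn(3, 4),
    edges=np.random.randn(6, 2),
    senders=np.array([0, 0, 1, 1, 2, 2]),
    receivers=np.array([1, 2, 2, 0, 0, 1]),
    globals_=np.array([9.8]),
)
```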
Graph Update (3.2.2 & 3.2.3 & 4.2)

The whole GN consists of the following 6 functions:
$$
\begin{array}{ll}
\mathbf{e}_{k}^{\prime}=\phi^{e}\left(\mathbf{e}_{k},\mathbf{v}_{r_{k}},\mathbf{v}_{s_{k}},\mathbf{u}\right) & \bar{\mathbf{e}}_{i}^{\prime}=\rho^{e\to v}\left(E_{i}^{\prime}\right)\\
\mathbf{v}_{i}^{\prime}=\phi^{v}\left(\bar{\mathbf{e}}_{i}^{\prime},\mathbf{v}_{i},\mathbf{u}\right) & \bar{\mathbf{e}}^{\prime}=\rho^{e\to u}\left(E^{\prime}\right)\\
\mathbf{u}^{\prime}=\phi^{u}\left(\bar{\mathbf{e}}^{\prime},\bar{\mathbf{v}}^{\prime},\mathbf{u}\right) & \bar{\mathbf{v}}^{\prime}=\rho^{v\to u}\left(V^{\prime}\right)
\end{array}
$$

ϕ^e is an edge-specific update function applied to every edge; the same goes for the node and global update functions. ϕ can be any function; in practice people usually use neural networks.
On the other hand, ρ takes a set as input, so it needs to be an aggregation function that is invariant to permutations, like sum, mean, max, min, etc.: its output is not affected by the order in which its inputs arrive.
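Putting the six functions together, here is a minimal sketch of one GN block in NumPy, reusing the Graph container from above (my own illustrative code, not the paper's reference implementation; the ϕ arguments stand in for what would usually be small neural networks, and every ρ here is a sum):

```python
def gn_block(graph, phi_e, phi_v, phi_u):
    """One GN block: update edges, then nodes, then the global attribute."""
    num_nodes, num_edges = len(graph.nodes), len(graph.edges)

    # 1. Per-edge update: e'_k = phi_e(e_k, v_{r_k}, v_{s_k}, u)
    u_e = np.tile(graph.globals_, (num_edges, 1))
    edges_new = phi_e(np.concatenate(
        [graph.edges, graph.nodes[graph.receivers],
         graph.nodes[graph.senders], u_e], axis=-1))

    # 2. Aggregate updated edges per receiving node: e-bar'_i = rho_{e->v}(E'_i)
    agg_e_per_node = np.zeros((num_nodes, edges_new.shape[-1]))
    np.add.at(agg_e_per_node, graph.receivers, edges_new)

    # 3. Per-node update: v'_i = phi_v(e-bar'_i, v_i, u)
    u_v = np.tile(graph.globals_, (num_nodes, 1))
    nodes_new = phi_v(np.concatenate([agg_e_per_node, graph.nodes, u_v], axis=-1))

    # 4. Global aggregations: e-bar' = rho_{e->u}(E'), v-bar' = rho_{v->u}(V')
    agg_e, agg_v = edges_new.sum(axis=0), nodes_new.sum(axis=0)

    # 5. Global update: u' = phi_u(e-bar', v-bar', u)
    globals_new = phi_u(np.concatenate([agg_e, agg_v, graph.globals_]))

    return Graph(nodes_new, edges_new, graph.senders, graph.receivers, globals_new)

# Identity phis are enough to check that the shapes and wiring are consistent.
out = gn_block(graph, phi_e=lambda x: x, phi_v=lambda x: x, phi_u=lambda x: x)
```

In a real model each ϕ would be an MLP with its own weights, and the concatenated inputs would be projected to fixed output dimensions.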
Composing Multi Graph Block Architecture (4.3)
A GN is simply three aggregation functions and three update functions, for edges, nodes, and the global attribute respectively. It always takes a graph comprised of edge, node, and global attributes as input and returns a graph comprised of edge, node, and global attributes. A GN operates on any graph regardless of its size, so it is extremely easy to stack multiple GN blocks together.
Figure 6 shows three common ways of stacking GN blocks.
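Stacking is then just function composition. Here is a sketch (again my own, reusing gn_block from above) of two of the patterns: plain sequential composition and a shared "core" applied repeatedly:

```python
def compose(graph, blocks):
    """Sequential composition GN_M o ... o GN_1, each block with its own functions."""
    for phi_e, phi_v, phi_u in blocks:
        graph = gn_block(graph, phi_e, phi_v, phi_u)
    return graph

def shared_core(graph, phi_e, phi_v, phi_u, num_steps):
    """The same GN core applied num_steps times (recurrent-style stacking);
    after m steps a node has seen information from nodes at most m hops away."""
    for _ in range(num_steps):
        graph = gn_block(graph, phi_e, phi_v, phi_u)
    return graph
```

The third pattern, encode-process-decode, is exactly the structure GraphCast uses, discussed below.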
Graph Networks Advantage (3.2.4)
- First, graphs can express arbitrary relationships among entities, so the way inputs interact is not limited by a fixed model architecture.
- Graphs represent entities and their relations as sets, which are invariant to permutations. However, we can still impose an ordering by encoding indices in the node or edge attributes (see the sketch after this list).
- A GN’s per-edge and per-node functions are reused across all edges and nodes, respectively. Therefore, GN can easily operate on graphs of any size.
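A quick toy check of the permutation point (my own example, not from the paper): with a sum aggregation, shuffling the edge list leaves the per-node aggregate unchanged, and an ordering can still be injected by appending each node's index to its attributes.

```python
import numpy as np

edges = np.random.randn(6, 2)
receivers = np.array([1, 2, 2, 0, 0, 1])
perm = np.random.permutation(6)

# Sum-aggregate edge features onto receiver nodes, in two different edge orders.
agg_a = np.zeros((3, 2)); np.add.at(agg_a, receivers, edges)
agg_b = np.zeros((3, 2)); np.add.at(agg_b, receivers[perm], edges[perm])
assert np.allclose(agg_a, agg_b)  # the aggregate ignores edge ordering

# To impose an ordering anyway, encode the index as part of the node attribute.
nodes = np.random.randn(3, 4)
nodes_with_index = np.concatenate([nodes, np.arange(3)[:, None]], axis=-1)
```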
Limitations (5.2)
Notions like recursion, control flow, and conditional iteration are not straightforward to represent with graphs.
GraphCast
GraphCast by DeepMind is a perfect example for understanding the encode-process-decode GN structure, and it is clearly written. Note that the model architecture is in the supplementary material. The following notes mostly come from Section 3, "GraphCast model".
They defined two spaces for this task. One is the grid space based on Earth's latitude and longitude, where the atmospheric information comes from. The other is a "mesh space" based on a regular sphere, where the actual processing happens. This design was motivated by the observation that "Using the latitude-longitude grid is not an advisable representation due to its spatial inhomogeneity, and high resolution at the poles which demands disproportionate compute resources."

As for the mesh space, it is not just one mesh, but 7 meshes layered together. The first mesh (M0) is a regular icosahedron (12 nodes and 20 faces). The second (M1) is obtained by splitting each triangular face of M0 into 4 smaller faces; the third (M2) refines M1 in the same way, and so on up to M6. The M0 layer is the base and the coarsest one: among all layers, it has the fewest points and the greatest distance between points. On the other hand, the M6 layer has the finest resolution, the most points, and neighboring points very close to each other.
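A small sanity check of the refinement arithmetic (my own sketch, assuming the standard 4-way triangle split, where each refinement adds one new node per existing edge):

```python
# On a triangulated sphere, Euler's formula V - E + F = 2 gives E = 3F/2;
# each refinement quadruples the faces and adds one new node per edge.
nodes, faces = 12, 20   # M0: regular icosahedron
for level in range(7):  # M0 .. M6
    edges = 3 * faces // 2
    print(f"M{level}: {nodes:6d} nodes, {edges:7d} edges, {faces:6d} faces")
    nodes, faces = nodes + edges, 4 * faces
```

Running this, M6 comes out at 40,962 nodes, the finest mesh the processing happens on.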
When we update a particular point on the mesh, we look at each layer: M0, M1, …, M6. For each layer Mi, we look at this point's neighbors on that mesh and use those edges to update the point's value. On M0, the neighbors are far away, so you get information from far away. On M6, the neighbors are very close, so you get information from nearby points. This is exactly what is depicted in the paper's Fig. 1e): every point receives input from different distances.
What’s most important is that since each finer layer is obtained by subdividing the coarser one, every point of a coarser layer is also a point of the finer layer. Therefore, in memory we only have to store a single layer, M6: its node set, plus the union of edges from all layers (the coarse edges are what provide the long-range connections).
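In code, this observation boils down to keeping the finest node set and taking the union of edge sets across levels. A sketch under that assumption, where `refine` is a hypothetical helper implementing the 4-way split:

```python
def build_multimesh(nodes, faces, num_refinements, refine):
    """Collect the union of (undirected) edges from levels M0..M_num_refinements
    over the node set of the finest level. `refine(nodes, faces)` is a hypothetical
    helper: it splits every triangle into 4, appending the new midpoint nodes at the
    end so that coarse node indices remain valid in the refined mesh."""
    multimesh_edges = set()
    for level in range(num_refinements + 1):
        for a, b, c in faces:  # add this level's edges
            multimesh_edges |= {tuple(sorted(e)) for e in ((a, b), (b, c), (c, a))}
        if level < num_refinements:
            nodes, faces = refine(nodes, faces)
    return nodes, multimesh_edges  # finest node set + edges from every level
```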
You might ask: what if a point on M6 is not originally on M0? Will it still have neighbors on M0 properly defined? Yes (I think), because you can think of the mesh construction process in reverse: if you start with an arbitrary point on the finest mesh and make the resolution coarser, you will always be able to get back to some regular icosahedron with the same structure as the M0 we started with (though not the exact same one, because the point you started with might indeed not be on the M0 we built the multi-mesh upon).
The encoder is a graph mapping grid nodes to mesh nodes. The processor operates on mesh nodes only. The decoder maps mesh nodes back to the grid nodes.
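Put together, a step of the model looks roughly like this (my own schematic, with hypothetical grid2mesh / mesh2grid bipartite graphs and encoder / processor / decoder callables standing in for GN blocks):

```python
def graphcast_step(grid_features, grid2mesh, multimesh, mesh2grid,
                   encoder, processor_blocks, decoder):
    # Encode: one GN pass over the bipartite grid->mesh graph moves the
    # atmospheric state from grid nodes onto mesh nodes.
    mesh_features = encoder(grid_features, grid2mesh)

    # Process: a stack of GN blocks over the multi-mesh; coarse-level edges carry
    # information over long distances, fine-level edges keep local detail.
    for block in processor_blocks:
        mesh_features = block(mesh_features, multimesh)

    # Decode: one GN pass over the bipartite mesh->grid graph maps the processed
    # representation back onto the lat-lon grid.
    return decoder(mesh_features, mesh2grid)
```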