Flexible Diffusion Modeling of Long Videos

Long videos sampled on GQN-Mazes and MineRL by iterated application of our Hierarchy-2 sampling scheme, and a CARLA Town01 video sampled with an autoregressive sampling scheme.

Arrays of 30- to 1000-second videos

CARLA Town01. Blocks of sampled completions with (1) FDM with Autoreg, (2) FDM with Hierarchy-2, and (3) CWVAE. Within each block, each column shows completions for a different test video. The top row is the ground truth and the second row contains sampled completions.

MineRL. Blocks of sampled completions with (1) FDM with Autoreg, (2) FDM with Hierarchy-2, and (3) CWVAE. Within each block, each column shows completions for a different test video. The top row is the ground truth and all other rows are sampled completions.

GQN-Mazes. Blocks of sampled completions with (1) FDM with Autoreg, (2) FDM with Hierarchy-2, and (3) CWVAE. Within each block, each column shows completions for a different test video. The top row is the ground truth and all other rows are sampled completions.

Unconditional samples (i.e., not conditioned on the first few frames)