Video Generation with Learned Priors
visual prediction task Given n frames of video x_{1}, …, x_{n}, predict T subsequent frames x_{n+1}… x_{n+T}.
Action Conditioned Prediction Network Visual prediction task. Two inputs:
taken action image Predict the next t frame. Challenge! What is the action? Predicting the action is not super easy.
RAFI Conditional Flow Matching with Video Generation.
https://arxiv.org/pdf/2406.14436