Sarah Laszlo and Kathy Meier-Hellstern have greatly helped us incorporate important responsible AI practices into this project, for which we are immensely grateful. Alex Rizkowsky was very helpful in keeping things organized, while Erica Moreira and Victor Gomes ensured smooth resourcing for the project. Jason Baldridge was instrumental in bouncing ideas around. Tim Salimans and Chitwan Saharia helped us with brainstorming and coming up with shared benchmarks. We are grateful to Evan Rapoport, Douglas Eck and Zoubin Ghahramani for supporting this work in a variety of ways. We appreciate the efforts of Kevin Murphy and David Fleet in advising the project and providing feedback throughout. Special thanks to Gabriel Bender and Thang Luong for reviewing the paper and providing constructive feedback. We also want to thank Niki Parmar for initial discussions. Thanks to our artist friends Irina Blok and Alonso Martinez for their extensive creative exploration of the system and for using Phenaki to generate some of the videos showcased here. We give special thanks to the Imagen Video team for their collaboration and for providing their system to do super resolution. Finally, Blake Hechtman and Anselm Levskaya were generous in helping us debug a number of JAX issues.

One of the prompt sequences (a time-variable story) used to generate the videos showcased here:

“An alien spaceship arrives to the futuristic city.”
“The camera gets inside the alien spaceship.”
“The camera moves forward until showing an astronaut in the blue room.”
“The astronaut is typing in the keyboard.”
“The camera moves away from the astronaut.”
“The astronaut leaves the keyboard and walks to the left.”
“The astronaut leaves the keyboard and walks away.”
“The camera moves beyond the astronaut and looks at the screen.”
“The screen behind the astronaut displays fish swimming in the sea.”
“We follow the blue fish as it swims in the dark ocean.”
“The camera points up to the sky through the water.”
“The ocean and the coastline of a futuristic city.”
“Crash zoom towards a futuristic skyscraper.”
“The camera zooms into one of the many windows.”
“We are in an office room with empty desks.”
“A lion runs on top of the office desks.”
“The camera zooms into the lion's face, inside the office.”
“Zoom out to the lion wearing a dark suit in an office room.”
“The lion wearing looks at the camera and smiles.”
“The camera zooms out slowly to the skyscraper exterior.”
“Timelapse of sunset in the modern city.”

To the best of our knowledge, this is the first time a paper studies generating videos from such time-variable prompts. Furthermore, we observed that our video encoder-decoder outperformed all per-frame baselines currently used in the literature on both spatio-temporal quality and number of tokens per video.
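To illustrate what conditioning on such a time-variable prompt sequence could look like, here is a minimal Python sketch, not Phenaki's actual implementation: a list of prompts drives segment-by-segment generation, and each new segment is conditioned on the last few frames of the previous one so the video can grow arbitrarily long. `generate_segment`, the frame counts and the overlap length are hypothetical placeholders.

```python
# A minimal sketch of story-driven generation (illustrative only, not
# Phenaki's actual implementation). `generate_segment` is a hypothetical
# stand-in for a text-to-video model that takes a prompt plus a few
# conditioning frames and returns the next chunk of frames.
import numpy as np

def generate_segment(prompt, context_frames, num_frames=24, size=(128, 128, 3)):
    # Placeholder: a real model would synthesize frames from the prompt and
    # the conditioning frames; here we simply return deterministic noise.
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.random((num_frames, *size), dtype=np.float32)

def generate_story(prompts, overlap=5):
    """Chain prompts into one long video: each segment is conditioned on the
    last `overlap` frames of the previous segment, so length is unbounded."""
    segments, context = [], None
    for prompt in prompts:
        segment = generate_segment(prompt, context)
        segments.append(segment)
        context = segment[-overlap:]  # carry the tail forward as conditioning
    return np.concatenate(segments, axis=0)

story = [
    "An alien spaceship arrives to the futuristic city.",
    "The camera gets inside the alien spaceship.",
    "The camera moves forward until showing an astronaut in the blue room.",
]
video = generate_story(story)
print(video.shape)  # (3 * 24, 128, 128, 3)
```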
We present Phenaki, a model that can synthesize realistic videos from textual prompt sequences. Generating videos from text is particularly challenging due to various factors, such as high computational cost, variable video lengths, and limited availability of high-quality text-video data. To address the first two issues, Phenaki leverages its two main components:

An encoder-decoder model that compresses videos to discrete embeddings, or tokens, with a tokenizer that can work with variable-length videos thanks to its use of causal attention in time.

A transformer model that translates text embeddings to video tokens: we use a bi-directional masked transformer conditioned on pre-computed text tokens to generate video tokens from text, which are subsequently de-tokenized to create the actual video.

To address the data issues, we demonstrate that joint training on a large corpus of image-text pairs and a smaller number of video-text examples can result in generalization beyond what is available in the video datasets alone. When compared to prior video generation methods, we observed that Phenaki could generate arbitrarily long videos conditioned on an open-domain sequence of prompts in the form of time-variable text, or a story.
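To make the division of labor between these two components concrete, here is a small, self-contained NumPy sketch; it is illustrative only, not the released Phenaki code. The tokenizer, de-tokenizer, and transformer bodies are random stand-ins, and the codebook size, token-grid shape, and confidence-based unmasking schedule are assumptions chosen for readability rather than the paper's exact settings.

```python
# An illustrative sketch of the two-stage pipeline described above (not the
# released Phenaki code). The encoder, decoder, and transformer bodies are
# random stand-ins; the point is the interfaces and the iterative,
# confidence-based unmasking loop that turns text into video tokens.
import numpy as np

VOCAB_SIZE = 8192      # assumed codebook size, for the sketch only
MASK_ID = VOCAB_SIZE   # extra id meaning "this token is not generated yet"

def encode_video(frames):
    # Stand-in for the causal-in-time tokenizer: frames -> discrete tokens.
    return np.random.randint(0, VOCAB_SIZE, size=(frames.shape[0] // 2, 8, 8))

def decode_tokens(tokens):
    # Stand-in for the de-tokenizer: discrete tokens -> RGB frames.
    t, h, w = tokens.shape
    return np.zeros((t * 2, h * 16, w * 16, 3), dtype=np.float32)

def masked_transformer(tokens, text_embedding):
    # Stand-in for the bi-directional transformer conditioned on text:
    # one logit vector over the codebook for every token position.
    return np.random.randn(*tokens.shape, VOCAB_SIZE)

def generate_video_tokens(text_embedding, shape=(4, 8, 8), steps=8):
    """Start from an all-masked token grid and, at each step, commit the most
    confident predictions among the still-masked positions."""
    tokens = np.full(shape, MASK_ID, dtype=np.int64)
    total = tokens.size
    for step in range(1, steps + 1):
        logits = masked_transformer(tokens, text_embedding)
        pred = logits.argmax(-1)          # most likely code per position
        conf = logits.max(-1)             # its (unnormalized) confidence
        masked = tokens == MASK_ID
        if not masked.any():
            break
        known = total - int(masked.sum())
        num_to_commit = max(int(total * step / steps) - known, 1)
        flat_conf = np.where(masked, conf, -np.inf).reshape(-1)
        commit = np.argsort(flat_conf)[-num_to_commit:]  # top-confidence masked slots
        np.put(tokens, commit, np.take(pred, commit))    # fix those tokens in place
    return tokens

text_embedding = np.random.randn(512)   # pre-computed text embedding (placeholder)
tokens = generate_video_tokens(text_embedding)
video = decode_tokens(tokens)
print(video.shape)                      # (8, 128, 128, 3)
roundtrip = encode_video(video)         # tokenizer direction, shown for symmetry
print(roundtrip.shape)                  # (4, 8, 8)
```

Because the transformer is bi-directional and fills in masked positions in parallel, the number of sampling steps stays small and fixed regardless of how many video tokens are generated, in contrast to purely autoregressive, token-by-token decoding.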