Realtime models
After seeing conversational realtime agents from OpenAI and Google, I’ve been fascinated by the idea of models that can interrupt the user.
The current realtime models from these providers are not fully realtime and cannot interrupt the user. What do I mean by this? They wait until the user stops talking before issuing a response.
The input tokens might look like:
<video frame>
Hi (audio)
<video frame>
what (audio)
is (audio)
this (audio)
<video frame>
object? (audio)
These input tokens are streamed to the inference provider, but the model is only told to process them once microphone input is low and a response is wanted. Instead, models need to run continuously and be able to respond at any time, mixing input tokens from the outside environment with their own output tokens in realtime. Qwen Omni might do this (need to look more into it).
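To make the difference concrete, here's a minimal sketch of what continuous decoding could look like. Everything here is an assumption: `model.next_token`, the `LISTEN` token, and the token format are stand-ins, not a real API.

```python
import queue

LISTEN = "<listen>"  # hypothetical special token: the model stays silent this tick


def realtime_loop(model, input_tokens: queue.Queue, steps: int):
    """Run `steps` decoding ticks; return the tokens the model actually spoke."""
    context, spoken = [], []
    for _ in range(steps):
        # Drain whatever the environment has produced since the last tick
        # (video frames, audio tokens) straight into the context.
        while not input_tokens.empty():
            context.append(input_tokens.get_nowait())
        # Sample on every tick, not only when the microphone goes quiet --
        # this is what would let the model interrupt the user.
        tok = model.next_token(context)
        context.append(tok)
        if tok != LISTEN:
            spoken.append(tok)  # the model can speak at any moment
    return spoken
```

The key design point is that input and output tokens share one interleaved context, and the model is sampled on every tick rather than gated on silence detection.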
For models to become truly “realtime,” they also need extremely low latency.
Realtime models already exist: self-driving cars, for example. When a self-driving car drives, it has a classic set of inputs: camera video, speed, and direction. On a single run, it produces vehicle movements for the steering wheel and throttle as the next action. That action is applied, the real-world environment reacts, and the model’s inputs change accordingly (video, speed, and direction). This creates a feedback loop, and models have been shown to successfully exploit it.
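That feedback loop can be sketched in a few lines. This is a toy illustration, not a real driving stack: `policy` and `env` are hypothetical stand-ins.

```python
def control_loop(policy, env, steps: int):
    """Toy observe -> act -> observe feedback loop."""
    obs = env.observe()            # e.g. video, speed, direction
    for _ in range(steps):
        action = policy(obs)       # steering + throttle for this tick
        env.apply(action)          # the real world reacts...
        obs = env.observe()        # ...and the changed inputs close the loop
    return obs
```

Even this toy version shows the "exploit the loop" behavior: a policy that acts on the difference between target and observation converges the environment toward its goal.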
Realtime use cases could include:
- Say you want to sing to a model and have it harmonize with you. Maybe of use for music producers.
- Live translation. Already being done by Qwen-3 Livetranslate.
- Playing games. If a model with visual input can produce keyboard commands extremely quickly, it could reliably play games that require fast reaction time. It could make jumps in video game worlds.
- Make responsive humanoid robots. Integration with full joint movement of the entire body in realtime is possible, but hard.
- Do high frequency trading.
- Conversational video AI (see below)
Video/world models
Sora 2 was a mindblower for me. The addition of audio completely changed the game. With this feature available, the possibilities for future models feel endless.
What if Sora 2 took audio input, ran in realtime, and had a long, slowly changing context window?
- With this you could talk to AI people in realtime.
- They could be from the past, say a dead grandparent or historical figure!
- Or you could generate an AI partner to talk with over video. They could be a fake person, or a real one like a celebrity/kpop idol/politician. This could create serious attachment/social issues. Why go to kpop fansigns when you can talk to your kpop idol AI boyfriend/girlfriend anytime on a video call?
- They would be able to react visually to your voice through facial emotions. Or if you provided video input too, they could read your facial emotions.
- Your AI partner could play video games with you and talk in realtime. You could play Valorant with your AI girlfriend and they would talk to you over comms. Your AI girlfriend could be a porn source, generating whatever video you would like to see.
- 360 AI video generation. You could generate 360 videos for YouTube to then rewatch in VR (for non-realtime use cases). For realtime cases, you would only need to generate the video for the current viewpoint: tell the model which point in space the user is looking from and in which direction, and have it react/generate the video accordingly (see next).
- Realtime maneuverable videos. Basically generating a FPS video game screen. You hook this up into a monitor or VR headset and you could generate worlds to explore in and look around anywhere.
- For both maneuverable and 360 videos, you could generate 3D versions of them too, outputting a video stream for each eye. 360 3D porn?! How much more real can you get?!
- Now take maneuverable videos a step further and imagine the model generating multiple viewpoints at once, given their positions in space. You could collaborate with someone or watch the same thing from various points of view in a world. Say you’re directing a movie: you could work with someone else to build the scene, walk around it, and make changes to it, like asking the model through audio to edit the scene the way nanobanna can for images. Keeping long context might be difficult, though. That’s why object generation (next section) will be key. Or you could play multiplayer video games together. Or you could do war simulation training, with multiple soldiers wearing headsets to simulate working on the battlefield together.
Honestly makes The Matrix 100% possible. If you’re able to hook up a model to the human body that’s able to generate inputs to the entire nervous system (visual cortex (video), hearing (audio), touch, smell, taste), you could literally create a Matrix simulation.
Object generation
To solve the data-loss problem that generation models have due to their limited context windows, the key will be storing data by generating files that persist. These could be useful to engineers designing infrastructure, movie makers, and game designers.
Say you work in Fusion 360: automatically generating a CAD model and then being able to edit it further would be useful. This can already be done, but typically only on a mesh basis; constrained/parametric CAD designs are in progress, e.g. from https://zoo.dev/
Or mesh-based characters and 3D scene designs if you’re a movie maker. Models that can create and edit these would be amazing: you can edit them whenever you want to change them, and they can be used as context in your prompts for video generation. The same design flow happens in video game design, so it would be useful there too.
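The persistent-object idea above can be sketched with a plain file format. Everything here is made up for illustration: the scene schema, object IDs, and `edit_object` are hypothetical, but they show how an edit becomes a durable file change instead of a context-window mutation that eventually falls off the end.

```python
import json

# Hypothetical scene file the model could generate and later re-read as context.
scene = {
    "objects": [
        {"id": "chair_1", "mesh": "chair.obj", "position": [0.0, 0.0, 0.0]},
        {"id": "lamp_1", "mesh": "lamp.obj", "position": [1.5, 0.0, 0.2]},
    ]
}


def edit_object(scene, obj_id, **changes):
    """Apply an edit (e.g. parsed from an audio command) to one object."""
    for obj in scene["objects"]:
        if obj["id"] == obj_id:
            obj.update(changes)
    return scene


# "Move the lamp next to the chair" becomes a file edit that survives
# any context-window limit.
edit_object(scene, "lamp_1", position=[0.5, 0.0, 0.0])
saved = json.dumps(scene)  # persisted; can be reloaded into a future prompt
```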
Take movie making a step further and you could command AI actors, in movie-set scenes rendered by the model, to act a certain way. Basically you’ve replaced the entire set with a virtual world. No longer do you need to hire actors: you tell the AI actors how to act, and they do it.
Human actors would lose their jobs, replaced with AI equivalents much like modern-day animated movie characters/IP. Companies would build IP around characters that are loved, even though they aren’t real. These would out-compete movies with real actors on cost by a significant margin.
Current possible models
- An AI partner that can play a video game that already has an AI built for it, and then talk to you over comms. For example, Dota 2, StarCraft, or League of Legends.
- AI replications of those choose-your-own-adventure girlfriend games. The model would just generate new video based on the previous output. Likely very successful, because these are just 30-second cutscenes that need to be generated.
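The choose-your-own-adventure idea is a simple autoregressive loop: each new clip is conditioned on the previous output plus the player's choice. A minimal sketch, where `generate_clip` is a hypothetical stand-in for the video model:

```python
def adventure_loop(generate_clip, choices):
    """Generate a chain of ~30-second clips, each conditioned on the last."""
    clips = []
    prev = None  # no prior clip before the opening scene
    for choice in choices:
        prev = generate_clip(context=prev, choice=choice)
        clips.append(prev)
    return clips
```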
World/video gen model + training model feedback loop
World models, as they are being used right now, are really good for test-case generation for training data. For example, say a self-driving car failed in a situation in real life. You can recreate that situation in your world model, verify that the driving model fails it, and then train the driving model to succeed. The two work in tandem. To make things even more useful, you can ask the world model to generate different but similar test cases. For example, if the original case was a cat running across the road, other cases could be the cat crossing at a different angle, or in a city, or in a desert. How this could be extrapolated beyond self-driving models to other training domains could be very interesting.
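The loop above can be sketched as a small harness. This is a toy illustration under stated assumptions: `vary` stands in for the world model generating scenario variants, and `policy_fails` stands in for evaluating the driving model in simulation.

```python
def harden(policy_fails, vary, base_case, rounds: int):
    """Collect variant scenarios the current policy still fails on.

    The returned failures would become new training data; after retraining,
    the loop runs again with the improved policy.
    """
    failures = []
    for _ in range(rounds):
        scenario = vary(base_case)      # e.g. cat crossing at a new angle/setting
        if policy_fails(scenario):
            failures.append(scenario)   # failed case -> training set
    return failures
```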
How likely are these models?
Very. Really, the only thing holding them back right now is available hardware, both for training and inference. The data and model factors of the AI equation are already available (AI = hardware + model + data). We would need a pretty good hardware improvement to run these models in realtime, but existing realtime deep-learning models for video games and self-driving already show that they’re possible. Hardware, hardware, hardware! Let’s invest in that! What a future AI hardware has ahead of it!