All of this started when I saw the “millennial speech” with Tom Bilyeu somewhere on Facebook Watch in ~2017. I remember sharing it on our Communication Skills course Moodle group. I got hooked on finding my “why” statement. From that time onwards, it was absolutely essential for me to know why I was working on whatever I was working on at the time.
In 2021, Kushal (Engineering @ Dubverse) and I were building SimpleSounds.net, a text-to-speech platform for game developers. The problem statement was simple: games need hundreds of dialogues pre-recorded, and making revisions takes a lot of time and energy. Our solution was a simple editor where anyone could type their dialogue, select an emotion, and export it as audio files. We were on a mission to replace voiceover artists.
First Interaction with Varshul
Later that year, I got a text from Varshul (the cofounder of Dubverse) on Shipping Friday, a founder community on Discord. We had frequent long phone calls and discussed synthetic media at large. He was in the early days of building Dubverse.
I’m not sure how it came up, but on one of our calls he told me about MetaHumans: ultra-realistic 3D models anyone could build and use in their games. It was a pity I didn’t know about them, given I was working with game devs at the time.
I spent a couple of days exploring Unreal Engine + MetaHumans. To simplify it for you, here is what I found out:
- You can customize MetaHumans to look like you
- They had a Live Link app, which used the front camera of your iPhone (the Face ID one) so you could drive a MetaHuman the way you drive a Memoji
- Unreal Engine let you build a simple virtual studio (just import a bunch of 3D objects), set up lighting, software-defined cameras, etc., and start recording
It didn’t take much time for me to realize how powerful this was. Used properly, it could be built into a product that creates content without you ever showing your real face or stepping in front of a camera. You could power the avatars with the AI voices I had created. They had perfect lip sync, very high definition, and the lighting just worked perfectly. Add motion capture and bam! Your avatar can now move around and do stuff. Don’t believe me? Check out this video.
Video Dubbing as an Aha-Moment
On the next call with Varshul, I tried to get a sense of the dubbing market. It took me some time to realize how massive the opportunity really was: everyone is building for the next wave of people coming onto the internet. Obviously, they don’t speak English, or even Hindi for that matter; they might be familiar only with their regional language. He was kind enough to let me in on a user call.
So one of the big questions is: how do they consume video content? Well, someone might make content exclusively for them, but that is going to take time; a creator needs the market to be big enough before they start serving its needs. What if you could dub existing content instead? Audio dubbing is simple: take any video, rewrite the script in another language, and re-do the voiceover. Movies have been doing this for years. So why would anyone build a startup around it? Or a better question: why would anyone need a startup around it?
The answer has two parts. First, manual dubbing is neither scalable nor on-demand. Second, what do you do if you don’t have a person to rewrite the script?
This is where Dubverse came in: it enabled people to consume English content in Assamese, even though none of the target audience spoke English. I saw it firsthand working with one of our users. They wanted to educate women in small towns and villages on how to use a smartphone: camera, notes, mail, Meet, WhatsApp, etc.
This was genuinely revolutionary. But wait, we could make it better. What if we added lip sync? It would literally look like the actor is speaking the native language, without re-recording the entire video. This is extremely hard to pull off manually unless the actors are multilingual, and it is something we are actively working on at Dubverse. Do check out our job board.
Do you see it coming together? Can you see how we are working on replacing cameras?
What if Google copies you?
One more thing: people often ask me what makes us more capable than the big shots of the tech industry to pull this off. The answer lies in our tech and our go-to-market strategy.
Before I explain how the tech works, here is a short primer on ML, or Machine Learning. ML is basically building models that start out dumb and get better as they see more data.
Say you search “giants” on Google, you’ll get back images of big people, right? Not quite. San Francisco has a team called the SF Giants, and New York has the NY Giants. If you search “giants” from those locations, you’ll get your team’s results first. Now think: did some engineer at Google hard-code the location into the search codebase? Something like “if the query is giants and the location is New York, give NY Giants; else if the location is San Francisco, give SF Giants; else give back images of giant people”?
Google uses ML to solve this. The process is simple: whenever you search for “giants”, you see generic results that match the query. Say you are in SF, you’ll scroll down to the point where you see the SF Giants (what you need) and click on it. Bam! A data point. As more and more people from SF do this, the search engine automatically learns that for the SF location, it should show the SF Giants result first.
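The feedback loop above can be sketched in a few lines. This is a deliberately toy version (nothing like Google's actual ranking stack): it just counts clicks per (query, location, result) and reranks accordingly.

```python
from collections import defaultdict

class ToyRanker:
    """Toy search ranker: reranks results by observed clicks per (query, location)."""

    def __init__(self):
        # (query, location, result) -> number of clicks seen so far
        self.clicks = defaultdict(int)

    def record_click(self, query, location, result):
        """A user clicked `result` for this query+location: one data point."""
        self.clicks[(query, location, result)] += 1

    def rank(self, query, location, candidates):
        # Most-clicked results for this (query, location) come first;
        # unseen results keep their original order (Python's sort is stable).
        return sorted(candidates, key=lambda r: -self.clicks[(query, location, r)])

ranker = ToyRanker()
candidates = ["images of giant people", "SF Giants", "NY Giants"]

# Users in SF keep clicking "SF Giants" for the query "giants"...
for _ in range(3):
    ranker.record_click("giants", "SF", "SF Giants")

# ...so the ranker learns to surface it first for SF, with no hard-coded rule.
print(ranker.rank("giants", "SF", candidates))
# ['SF Giants', 'images of giant people', 'NY Giants']
print(ranker.rank("giants", "NY", candidates))
# original order: no NY data yet
```

The point is that the "if location is SF..." logic is never written down anywhere; it emerges from the click data.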
Now, if we talk specifically about the transcripts editor on Dubverse, it works the same way. We translate the text to the best of our ability and let a reviewer (usually the owner of the content) tell us whether it was correct. This gives us data points to improve the product. But there’s more! We know what type of content each account produces: sports, educational, cooking, explainers, etc., which allows us to build domain-specific models that keep getting better with time.
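Here is a minimal sketch of that review loop, purely illustrative (the function names, domain tags, and schema are my own, not Dubverse's actual pipeline): every reviewer edit becomes a (source, corrected translation) pair, bucketed by domain for later fine-tuning.

```python
from collections import defaultdict

# domain -> list of {"source": ..., "target": ...} training pairs
corrections = defaultdict(list)

def log_review(domain, source_text, machine_translation, reviewer_text):
    """Record a data point only when the reviewer actually changed something.

    An accepted translation confirms the model; an edit teaches it.
    """
    if reviewer_text != machine_translation:
        corrections[domain].append(
            {"source": source_text, "target": reviewer_text}
        )

# Reviewer fixes a bad cooking translation -> stored as a training pair
log_review("cooking", "Whisk the eggs", "mix the eggs together", "whisk the eggs briskly")

# Reviewer accepts the sports translation unchanged -> nothing stored
log_review("sports", "He scores!", "he scores a goal!", "he scores a goal!")

print(len(corrections["cooking"]))  # 1 edited pair, ready for the cooking model
print(len(corrections["sports"]))   # 0
```

Per-domain buckets are what make the "domain-specific models" possible: the cooking model only ever trains on cooking corrections.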
But how is this better than Google Translate? Simple: you never give feedback (i.e., data points) for Google Translate to improve (;
Now comes the GTM part. We are extremely focused on getting enterprise users who have a lot of videos they want translated. This should be obvious by now: we are a data company.
So unless those big shots do exactly what we do, day-in day-out with the same obsession, it’s going to be extremely hard to beat us at our own game.
So, yes, I am proud to be a founding team member at Dubverse and to be on the front lines of making content accessible for the next billion people who come on the internet!
Lastly, I wanted to include some cool links to the cool stuff people are doing out in the world that is soon going to be a part of our reality.
- Keep a close eye on lucidrains; he writes usable code for big closed-source projects. He wrote huge parts of Stable Diffusion by StabilityAI
- SpongeBob SquarePants singing the Ben 10 Theme song, all AI generated.
- AI Generated Music Video
- Make Obama speak in any language (sample only)
- Video generation using hands + Stable Diffusion
- Generate ~30 second videos from a single image