No, I can't imagine... the video is rather silly and pointless.
That's a bit rude. Maybe the OP did not like the video either but was more interested in the technique for other projects.
As for how it is done, it is not very hard, but it is very time consuming, and you will need lots of clips.
I do not use Sony Vegas or Premiere Pro enough to give you instructions, but in After Effects it can be achieved in just a few steps, which you then repeat many times.
It helps if you can get clips of people saying letters and words similar to those in the audio you want to dub over them.
Then, once you have several clips, make your audio track. My advice would be to have 2 tracks: 1 of just a cappella/spoken words and 1 for the music.
Then expand the view so you can see the waveform, copy and paste your video clips to the peaks of the waveform, and then do the same for the audio.
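If you would rather get a rough list of peak times before lining things up by eye, here is a minimal Python sketch of the idea. It is not an After Effects feature and not how the video was made. The function names, the 20 ms window, the 0.3 threshold and the synthetic test signal are all assumptions made up for illustration. It expects the samples as plain floats between -1 and 1.

```python
import math

def rms_envelope(samples, sample_rate, window_s=0.02):
    """Return (time_seconds, rms) pairs, one per fixed-size window."""
    size = max(1, int(sample_rate * window_s))
    envelope = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
        envelope.append((start / sample_rate, rms))
    return envelope

def find_peaks(envelope, threshold=0.3, min_gap_s=0.15):
    """Pick local maxima above threshold that are at least min_gap_s apart."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        t, level = envelope[i]
        is_local_max = envelope[i - 1][1] < level >= envelope[i + 1][1]
        if level >= threshold and is_local_max:
            if not peaks or t - peaks[-1] >= min_gap_s:
                peaks.append(t)
    return peaks

def synthetic_track(sample_rate=8000, duration_s=2.0, hits=(0.5, 1.0, 1.5)):
    """Hypothetical test signal: a short decaying 220 Hz burst at each hit time."""
    samples = []
    for n in range(int(sample_rate * duration_s)):
        t = n / sample_rate
        value = 0.0
        for hit in hits:
            if 0 <= t - hit < 0.1:
                value += math.exp(-30 * (t - hit)) * math.sin(2 * math.pi * 220 * t)
        samples.append(value)
    return samples

if __name__ == "__main__":
    track = synthetic_track()
    for t in find_peaks(rms_envelope(track, 8000)):
        print(f"place a clip at {t:.2f}s")
```

On real speech you would have to tune the threshold and the minimum gap by ear, and you will still want to check every cut against the waveform.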
I warn you though, it is not just a 10 min job. I had to do a large motion typography piece for a commercial, and although it was only 30 seconds long it took forever to do, and text is easier than trying to do video.
If I were you, I would go and make a few motion typography pieces synced to audio first, as there are plenty of tutorials out there. Then, when you can do that confidently, you can move on to video, which comes with a lot more syncing and changing the speed of, or looping, the available clips.
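To give an idea of the arithmetic behind the speed changes and looping, here is a small Python sketch. The function name `fit_clip` and the 2x speed limit are assumptions picked for the example, not settings from any editor.

```python
import math

def fit_clip(clip_s, slot_s, max_speed=2.0):
    """Return (speed, loops) so a clip of clip_s seconds fills a slot of slot_s seconds.

    speed > 1 means play faster; loops is how many times the clip repeats.
    max_speed is an arbitrary limit chosen for this sketch.
    """
    if clip_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    if clip_s >= slot_s:
        # Clip is too long: speed it up, but not past max_speed.
        # If the limit is hit, the clip still overruns and needs trimming.
        return min(clip_s / slot_s, max_speed), 1
    # Clip is too short: repeat it enough times to cover the slot,
    # then speed the loops up slightly so they end exactly on time.
    loops = math.ceil(slot_s / clip_s)
    return (clip_s * loops) / slot_s, loops
```

For example, `fit_clip(1.0, 0.5)` asks for double speed with no looping, while a 0.4 second clip in a 1 second slot comes out as 3 loops at roughly 1.2x speed. Rounding the loop count up means the clip is always sped up rather than slowed down, which is just one possible choice.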
Look up these kinds of tutorials:
audio to keyframes
syncing text to audio
motion typography
And maybe you will find some actual tutorials on syncing video to audio, but I doubt there are many out there as it takes so long to do.
Also look up Bad Lip Reading, as that gives you an idea of how easily spoken words can be misinterpreted as other similar-looking words. It is not the sound they make, it is the mouth movement that will sell the mashed-up audio.
Here is a tutorial to get you started