The music was created to fit the theme, but the video had to be edited to sync that well, using something in AE or a plugin.
In other words, you could take any song and use the same video; he has a technique for syncing the video to the beat regardless of the music used.
I did message the person, but they are really popular right now, as you might imagine.
That's a bit rude. Maybe the OP did not like the video either but was more interested in the technique for other projects.
As for how it is done, it is not very hard, but it is very time consuming, and you will need lots of clips.
I do not use Sony Vegas or Premiere Pro enough to give you instructions, but in After Effects it can be achieved in just a few steps that are then repeated a lot.
It does help if you can get clips of people saying letters and words similar to the ones in the audio you would like to replace theirs with.
Then, once you have several clips, make your audio track. My advice would be to have 2 tracks: 1 of just a cappella/spoken words and 1 for the music.
Then expand the view so you can see the waveform, copy and paste your video clips to the peaks of the waveform, and then do the same for the audio.
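If you would rather get a list of candidate times than eyeball the waveform, here is a rough Python sketch of the same idea. It is not how After Effects does anything internally; it simply marks the moments where the level first rises above a threshold, as a stand-in for the visible peaks. The function names, the 10 ms window, the 0.5 threshold and the 0.2 s minimum gap are all my own assumptions, and `synthetic_hits` only exists to make a test signal.

```python
import math

def envelope(samples, window):
    """Largest absolute amplitude in each consecutive window of samples."""
    return [max(abs(s) for s in samples[i:i + window])
            for i in range(0, len(samples), window)]

def peak_times(samples, sample_rate, window_seconds=0.01,
               threshold=0.5, min_gap=0.2):
    """Times (in seconds) where the envelope first rises above threshold.

    Hits closer together than min_gap seconds are merged into the first one.
    All default values are arbitrary starting points, not recommendations.
    """
    window = max(1, int(sample_rate * window_seconds))
    times = []
    previous_level = 0.0
    for i, level in enumerate(envelope(samples, window)):
        t = i * window / sample_rate
        rising = level >= threshold and previous_level < threshold
        if rising and (not times or t - times[-1] >= min_gap):
            times.append(t)
        previous_level = level
    return times

def synthetic_hits(hit_times, sample_rate=44100, length_seconds=2.0,
                   burst_seconds=0.05, amplitude=0.9):
    """Hypothetical test signal: silence with short 440 Hz bursts at hit_times."""
    starts = [int(t * sample_rate) for t in hit_times]
    burst_len = int(burst_seconds * sample_rate)
    samples = []
    for n in range(int(length_seconds * sample_rate)):
        in_burst = any(s <= n < s + burst_len for s in starts)
        value = amplitude * math.sin(2 * math.pi * 440 * n / sample_rate)
        samples.append(value if in_burst else 0.0)
    return samples

if __name__ == "__main__":
    audio = synthetic_hits([0.5, 1.0, 1.5])
    for t in peak_times(audio, 44100):
        print(f"place a clip at {t:.2f} s")
```

You would still drop the clips on the timeline by hand; the list just tells you where to look. On real music you will probably need to play with the threshold quite a bit.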
I warn you, though, it is not just a 10 min job. I had to do a large motion typography piece for a commercial, and although it was only 30 seconds long it took forever to do, and text is easier than trying to do video.
If I were you I would go and make a few motion typography pieces synced to audio first, as there are plenty of tutorials out there. Then, when you can do that confidently, you can move on to video, which comes with a lot more syncing and changing the speed of/looping the available clips.
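To give a feel for the speed/looping part, here is a small Python sketch of the arithmetic for making a clip fill the gap between two hits. The function name `fit_clip` and the 50% to 200% speed limits are assumptions of mine, not settings from any editor: if the clip would have to be slowed below the minimum it is looped instead, and if it would have to be sped up past the maximum it is trimmed.

```python
import math

def fit_clip(clip_seconds, slot_seconds, min_speed=0.5, max_speed=2.0):
    """Work out how to make a clip fill a slot between two hits.

    speed is a playback multiplier (2.0 = twice as fast). The limits are
    arbitrary placeholders; pick whatever still looks acceptable to you.
    """
    if clip_seconds <= 0 or slot_seconds <= 0:
        raise ValueError("durations must be positive")

    loops = 1
    speed = clip_seconds / slot_seconds
    if speed < min_speed:
        # Clip is too short to stretch that far, so repeat it.
        loops = math.ceil(min_speed * slot_seconds / clip_seconds)
        speed = loops * clip_seconds / slot_seconds

    source_seconds = clip_seconds
    if speed > max_speed:
        # Clip is too long to squeeze in, so only use the start of it.
        speed = max_speed
        source_seconds = slot_seconds * max_speed

    return {"speed": speed, "loops": loops, "source_seconds": source_seconds}

if __name__ == "__main__":
    print(fit_clip(1.0, 0.5))   # long clip, short slot: speed it up
    print(fit_clip(0.2, 1.0))   # short clip, long slot: loop it
    print(fit_clip(3.0, 1.0))   # far too long: trim and cap the speed
```

Multiply that by every clip in a 30 second piece and you can see where the time goes.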
Look up these kinds of tutorials:
Audio to keyframe
Syncing text to audio
Maybe you will find some actual tutorials on syncing video to audio, but I doubt there are many out there as it takes so long to do.
Also look up Bad Lip Reading, as that gives you an idea of how easily spoken words can be misinterpreted as other, similar-looking words. It is not the sound they make but the mouth movement that will sell the mashed-up audio.
In years past, in the studios that I've worked in, they often used an Eventide Harmonizer (or another similar product, e.g., the H8000) to do MIDI-controlled time compression/expansion (without pitch shifting) of the audio tracks so that the cues/hits line up perfectly. If they didn't need the cues to be absolutely perfect, I've seen guys just route the MIDI from one of the control wheels on a synth to the MIDI input of the Eventide and line up the cues amazingly well just by ear.
Compared to splicing segments or even loops in or out (where this is musically acceptable), as long as the required degree of time shifting is not excessive, the changes in tempo with the Harmonizer method won't be noticed, and it's usually a much faster way to work. An intermediate approach is to splice segments in/out to get close, and then do the fine tempo adjustments with the Harmonizer or with equivalent software.
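For anyone doing the fine adjustment in software, the arithmetic behind it is simple enough to sketch in Python. This is not the Eventide's interface or any particular plugin's API, just the ratio calculation: for each stretch of music between two cues, divide the length it needs to be by the length it currently is. The function name, the example cue times and the 5% "too much" limit are all my own placeholders; what counts as excessive depends on the material.

```python
def stretch_plan(cues_now, cues_target, max_change=0.05):
    """Per-segment time-stretch ratios that move cues_now onto cues_target.

    Both lists hold cue times in seconds, in order, starting from the same
    point. ratio > 1 means the segment must be lengthened (tempo drops).
    max_change is an arbitrary placeholder for "not excessive".
    """
    if len(cues_now) != len(cues_target) or len(cues_now) < 2:
        raise ValueError("need two cue lists of equal length, 2+ entries each")

    plan = []
    for i in range(1, len(cues_now)):
        current_len = cues_now[i] - cues_now[i - 1]
        target_len = cues_target[i] - cues_target[i - 1]
        if current_len <= 0 or target_len <= 0:
            raise ValueError("cue times must be strictly increasing")
        ratio = target_len / current_len
        plan.append({
            "segment": i,
            "ratio": ratio,
            "tempo_change_percent": (1 / ratio - 1) * 100,
            # Segments flagged here are candidates for a splice first.
            "needs_splice": abs(ratio - 1) > max_change,
        })
    return plan

if __name__ == "__main__":
    # Music hits at 10.0 s and 22.0 s; picture wants them at 10.4 s and 21.8 s.
    for step in stretch_plan([0.0, 10.0, 22.0], [0.0, 10.4, 21.8]):
        print(step)
```

A segment that comes back flagged is where the intermediate approach comes in: splice to get close, then run the numbers again for the fine adjustment.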