January 18, 2024

134 - Vocal Clone

I ran another funny experiment yesterday to test something I strongly suspected. My AI Vocal Clone, trained on a half-hour of my terrible karaoke, is indeed a significantly better singer than I am.

This is possible because of how such a system works, and it is not true of "AI" in general. In vocal replacement via cloning, the vocal pattern of one individual is paired with a template of precisely what you want from that clone, such as an existing piece of music whose singer you want to replace. This effectively means that you can take a voice and superimpose musical talent onto that voice, even if the original person has none.
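
The pairing described above can be shown with a toy sketch. It is not how any real voice-cloning tool is implemented: real systems work on learned audio features, while this sketch reduces a performance to a list of frames with a pitch and a made-up "timbre" tuple. The names `Frame` and `convert`, and all the numbers, are hypothetical. It only illustrates that the template supplies what is sung while the clone supplies how it sounds.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    pitch_hz: float  # what is sung; comes from the template performance
    timbre: tuple    # how the voice sounds; hypothetical harmonic weights


def convert(template, target_timbre):
    """Keep every frame's pitch from the template, swap in the target timbre."""
    return [Frame(frame.pitch_hz, target_timbre) for frame in template]


# Hypothetical data: a skilled singer's performance and an unskilled singer's timbre.
skilled_timbre = (1.0, 0.5, 0.25)
my_timbre = (1.0, 0.8, 0.1)
template = [Frame(pitch, skilled_timbre) for pitch in (220.0, 246.94, 261.63)]

cloned = convert(template, my_timbre)
print([frame.pitch_hz for frame in cloned])  # pitches are exactly the template's
```

Whatever the target voice sounds like, the output pitches are copied from the template, which is why the clone's singing ability is not limited by the training karaoke in this simplified picture.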

While this opens the door to all kinds of funny and stupid things I could do, like delivering on my suggestion that I take the voices of my co-founder and me to make a cover of "The Other Side", my main point today is to correct a common mistake people make.

"Generative AI" systems fundamentally CANNOT do this, because when you're "generating" something there is no template for precisely what you desire. You can improve the quality of human content in systems where you replace something like one vocal pattern with another because even conversion errors are still grounded in that target pattern. In generative AI there is no target pattern, just next-token prediction and the data-blender of regurgitation, spewing out "plausible, but wrong" material.

RAG (retrieval-augmented generation) doesn't fix this, nor does CoT (chain-of-thought), MoE (mixture of experts), or anything else you may choose to apply to an LLM or image generator. Even when you ask such a system to do something like "summarize", it is still generating content without a pattern, just with stronger bounding than other methods.

The type of data in audio also helps, as it is mathematical, not semantic, meaning that one note symbolizes only that one note and can't be semantically confused with another based on context or culture. There are different musical scales in different cultures, but that has more to do with how music is composed and perceived than with how it sounds.
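
As a small illustration of a note being mathematical, here is a minimal sketch that maps a note number to a frequency. It assumes 12-tone equal temperament with A4 tuned to 440 Hz and uses MIDI note numbers (A4 = 69); those are conventions chosen for the example, not something the experiment above depends on, and other tuning systems use different formulas.

```python
A4_MIDI = 69    # MIDI note number of A4 (assumed convention)
A4_HZ = 440.0   # assumed reference tuning


def note_to_hz(midi_note: int) -> float:
    """Frequency of a MIDI note in 12-tone equal temperament."""
    return A4_HZ * 2 ** ((midi_note - A4_MIDI) / 12)


for name, midi_note in [("A3", 57), ("C4", 60), ("A4", 69)]:
    print(f"{name}: {note_to_hz(midi_note):.2f} Hz")
# A3: 220.00 Hz
# C4: 261.63 Hz
# A4: 440.00 Hz
```

Under these assumptions, each note number maps to exactly one frequency, with no dependence on context.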

To "generate" anything of reliably higher quality requires a working cognitive architecture, and precisely one such system has been demonstrated to date. Absent a human-like motivational system, human-like memory, and the working cognitive architecture to process that information, nothing can be learned. If nothing is truly learned, it cannot be improved. If nothing is improved, it cannot become better than the data it was fed.

Now to drive this point home, my AI Vocal Clone has "...Gotta make you understand..."