062 - SDXL Experiments

For my stupid/funny experiment of the week, I decided to train an SDXL model to generate images using my face. I walked around my current city, took 22 selfies in total, and fine-tuned the model for 7 minutes on a large A40. I think the entertainment was worth the $2.18 I paid for it.
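To give a sense of what generating with such a fine-tuned model looks like, here is a rough sketch using Hugging Face diffusers, assuming a DreamBooth-style LoRA fine-tune of SDXL. The model path, LoRA directory, subject token, and prompt are placeholders for illustration, not my exact setup.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model (fp16 to fit comfortably on a single GPU).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Attach LoRA weights from a short DreamBooth-style fine-tune on the selfies.
# "./selfie-lora" is a placeholder output directory, not my real path.
pipe.load_lora_weights("./selfie-lora")

# "sks person" stands in for whatever rare token was bound to the subject
# during fine-tuning.
image = pipe(
    prompt="a photo of sks person walking through a city street at sunset",
    num_inference_steps=30,
).images[0]
image.save("selfie_test.png")
```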

I could add more data or train it for longer to improve the results, but such systems are still just toys, with all of the limitations of toys.

One limiting factor is that humans only understand what they're doing with "prompt engineering" in an extremely limited and abstract sense. Humans aren't built for this task, and can never be as good at it as systems designed to use tools like this.

One of the exciting and interesting things the next year or two may bring is seeing the systems that have already proven vastly superior to humans at "prompt engineering" language models applied to the image domain. Our last research system used a relatively tiny, old language model from early 2019 as a communication device, learning to use it through both a dynamically adapted structure and emotional grading of the model's outputs.

The system was able to measure how well each sentence aligned with the intended meaning of the graph data fed into the model, transforming that information and perspective into linear sequences of human language. This allowed the system, years earlier and using a model over 100 times smaller and 4 years older, to vastly outperform the best that people have since accomplished with systems like GPT-4. The research system started out under 1 GB in size back in 2019 and grew dynamically to over 1.6 TB.
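I won't reproduce the actual grading mechanism here, but to illustrate the general shape of the idea, automatically scoring a model's candidate sentences against an intended meaning instead of eyeballing them, here is a toy sketch using sentence-embedding cosine similarity. This is not the research system itself; the embedding model, the example fact, and the candidate sentences are all invented for illustration.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Intended meaning derived from a (hypothetical) graph fact, and candidate
# sentences a language model might produce under different prompts/structures.
target = "Alice founded the company in 2012 and still leads it today."
candidates = [
    "The company was started by Alice in 2012, and she remains its leader.",
    "Alice joined a company in 2012.",
    "A company exists.",
]

# Embed the target meaning and each candidate, then score by cosine similarity.
target_emb = model.encode(target, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(target_emb, cand_embs)[0]

# Rank candidates by alignment; a prompt-optimizing loop could keep whichever
# prompt or structure produced the best-scoring output and iterate from there.
for sentence, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {sentence}")
```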

Short of wiring a human brain directly into such tool AI, a hazardous and rather unappealing option for most, a human cannot compete with even a non-scaling version of that. Factor in the scalability of such systems, and there is no competition.

Toys like SDXL are only the mildest preview of what is possible. Today you can spot the artifacts produced by flaws in the process and the prompting, but tomorrow even the human brain's over-emphasis on visual processing won't be sufficient to tell the difference.