Nick Winder

Software & AI Developer

We're Entering the Age of Software Distillation, and Nobody Knows If It's Legal

AI models are distilling software knowledge in ways that blur copyright lines. What does it mean when an LLM writes its own library from its training data?


Something clicked for me recently while playing around with AI-assisted development. We're not just in the age of AI coding assistants. We're in the age of software distillation, and I don't think we've really reckoned with what that means yet.

Here's the thing that got me thinking: I recall reading — and I'll be honest, I can't pin down the exact source — that OpenAI ran into trouble getting their model to effectively utilise some open-source multi-threading libraries. Rather than spending more time wrangling compatibility, they apparently just asked the model to write its own version. And it did. A working library, shaped by everything it had absorbed during training, conjured from scratch.

That's wild when you sit with it for a second. The model wasn't connecting to the internet. It wasn't reading the source code in real-time. It was synthesising a new implementation from its own internal representation of how multi-threading libraries work — knowledge that was effectively distilled from all the open-source code, documentation, blog posts, and Stack Overflow answers it had ingested. The library didn't exist before, but it's clearly a product of all those libraries that do.

What Software Distillation Actually Is

I'm using the term "distillation" deliberately here, and not just in the casual "AI knows stuff" sense. In the machine learning world, knowledge distillation has a specific meaning: you take a large, capable model (the teacher) and train a smaller model (the student) to reproduce its behaviour. The student doesn't just memorise the teacher's answers — it internalises the underlying patterns well enough to generalise to new situations.
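To make the mechanics concrete, here's a toy sketch of the standard distillation objective in plain Python: the student is trained to minimise the KL divergence between its softened output distribution and the teacher's. The specific logits and the temperature value are illustrative, not tied to any real model.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature: higher T yields a softer distribution,
    exposing more of the teacher's 'dark knowledge' about wrong answers."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the softened teacher distribution to the
    student's. This is the core knowledge-distillation objective:
    zero when the student exactly reproduces the teacher."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student whose logits match the teacher's incurs zero loss;
# a mismatched student is penalised and its gradients would pull
# it toward the teacher's distribution.
teacher = [3.0, 1.0, 0.2]
print(distillation_loss(teacher, teacher))          # ~0.0
print(distillation_loss(teacher, [0.1, 2.5, 0.4]))  # > 0
```

The point of the softened targets is exactly the generalisation described above: the student learns not just the teacher's top answer but the shape of its uncertainty, which is why it can handle inputs the teacher was never queried on.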

What's happening with software is essentially the same thing, just with an extra step. The original library authors write code. That code gets published. It gets discussed, forked, blogged about, and documented. An LLM trains on all of it. Now the model has distilled the essence of that software — its design patterns, its API conventions, the problems it was solving — into its weights. When you ask it to write a similar library, you get something that reflects all of that accumulated knowledge, even if it doesn't share a single line of code with the original.

This is genuinely different from copying. It's closer to how a developer absorbs patterns from years of reading other people's code and then writes their own implementations. Except the model has "read" vastly more code than any human ever could.

The Proprietary Problem

The interesting (and legally murky) extension of this is what happens when you move beyond open-source software.

Open-source code at least operates under known licenses. MIT, Apache, GPL — there are rules about what you can do with it, and those rules mostly apply to the code itself. But what happens when an LLM trains on proprietary software? You can't distribute someone's closed-source code without a contract, but can you train a model on the output of that software? There's no copying of source code there — just feeding in results and letting the model learn what the software knows how to do. And once that knowledge is inside the model's weights, what then?

I'm not a lawyer and I won't pretend to know where the lines are, but intuitively it feels like there's at least a question worth asking here. The model doesn't copy the code — it absorbs patterns and generates something new. The output might not share a single line with the original, yet could still clearly reflect the design decisions of proprietary software that someone put a lot of work into protecting. Whether that constitutes infringement, I genuinely have no idea.

What does seem clear is that our existing frameworks weren't really written with this scenario in mind. Copyright law was built around copying — you could point at the copied thing. When the "copying," if it even is that, happens inside billions of parameters during a training run, the usual ways of thinking about it start to feel a bit wobbly.

The AI Copyright Wars Are Already Happening

Here's where it gets almost satirical: AI companies have been vocal about protecting their own models from distillation, even while training those same models on scraped internet data and distilling that data into their own weights.

OpenAI's terms of service explicitly prohibit using their model outputs to train competing models. When DeepSeek released their R1 model at the start of 2025, there were immediate accusations that it had been distilled from OpenAI's models — that the training process had essentially reverse-engineered OpenAI's capabilities by learning from its outputs rather than from scratch. OpenAI, to put it diplomatically, was not pleased.

The fascinating irony is that the mechanism DeepSeek is accused of using — learning the essence of something from its outputs rather than having direct access to its internals — is structurally similar to what happened during the training of large language models in the first place. Open-source software was absorbed. Books were read. Code repositories were processed. All of it shaped what these models know and how they think.

So we have AI companies arguing simultaneously that:

  1. Training on publicly available data (including copyrighted works) is transformative use and therefore acceptable
  2. Training on their model outputs is theft

Both of those positions might even be legally defensible. But holding both at the same time requires a fairly creative reading of "fair."

What This Actually Means for Software

The practical implications of software distillation are interesting regardless of how the legal questions resolve.

On one hand, it represents something genuinely useful. If a model can distil the purpose of a library — the problem it solves, the API design philosophy, the edge cases it handles — and then generate fresh implementations, that's a massive productivity unlock. You don't need perfect compatibility with every existing library. You need something that works well for your specific use case, informed by the best thinking in the ecosystem.

The scale of what's now possible is getting hard to ignore. Anthropic recently published a write-up on building a C compiler using sixteen parallel Claude agents. Not a toy compiler. A 100,000-line Rust implementation that compiles the Linux kernel across x86, ARM, and RISC-V, and can build real-world projects like QEMU, FFmpeg, PostgreSQL, and Redis. It even runs Doom. The project took two weeks and cost just under $20,000 in API calls, with the agents working autonomously after a human set up the test harnesses and evaluation systems. That's software distillation working at full tilt: decades of compiler theory, language specifications, and prior implementations all absorbed into the model's weights and then expressed as something new and functional. GCC wasn't forked. The C standard wasn't lifted verbatim. But everything those systems represented fed into what Claude produced.

On the other hand, it creates a feedback loop that's hard to reason about. If models distil knowledge from existing software and then that output becomes part of the next training corpus, you could end up with successively refined implementations that drift further from their origins while still carrying the "fingerprints" of the original design decisions. Ideas propagate, evolve, and sometimes end up in places their originators never imagined or consented to.

For open-source maintainers, this is a bit existential. The entire point of open source is that you share your work, others build on it, and the ecosystem improves collectively. Software distillation kind of does that, but in a way that's less visible and doesn't obviously follow the attribution norms that open-source culture has built up. Nobody's issuing a pull request when a model internalises your design patterns.

Nobody's Really Got an Answer Yet

I genuinely don't know how this resolves. Copyright law has struggled to keep pace with software in general — the debates around API copyrightability took a decade to even get a Supreme Court ruling, and that ruling (in Google v. Oracle) was pretty fact-specific and hasn't exactly settled the broader questions.

Software distillation adds another layer of ambiguity. We're asking courts and regulators to reason about whether knowledge absorbed during training and expressed as a novel implementation infringes on the original. That requires a pretty sophisticated understanding of how these models actually work, and legal institutions are still getting up to speed on the basics.

My guess — and it's only a guess — is that we'll see some test cases emerge from particularly clear-cut scenarios, probably involving proprietary software where a model's output is unusually close to the original. That'll establish some precedent. The general case of "model trained on open-source code generates an implementation that reflects common patterns" is probably going to stay murky for a long time.

In the meantime, we're all just building with these tools while the legal frameworks try to catch up. Which, honestly, is pretty much how software has always worked.


The thing that keeps nagging at me is that we've been here before with a different technology. When search engines started scraping the entire web, there was a decade of legal wrangling about what they could cache, index, and display. We got through it with a messy patchwork of case law, terms of service, and robots.txt files that nobody fully respects anyway.

Software distillation is probably going to go the same way — not cleanly resolved, just eventually normalised with some guardrails around the edges. Whether that's good or bad probably depends on which side of the distillation you're on.