
Inflated titles in tech
I've had a love-hate relationship with AI for a while. Having a bit of a background in computational statistics, I've always seen the machine-learning part of AI as the uneasy periphery between computational statistics and AI, between data engineering and the oddly named discipline of data science. I used to say that data science is basically data engineering with a disregard for the core principles of both engineering and basic statistics; overfitting as an art form, rather than anything resembling a "science". But I guess it's just part of the whole American title inflation thing. People without even a bachelor's degree in science or engineering refer to themselves as software engineers because they went to a JavaScript and React bootcamp. Teachers without either a PhD or tenure refer to themselves as Professor. And now data engineers who neither have a PhD nor are doing any actual science are branded as data scientists for some reason, and the only difference with us lowly data engineers seems to be that they made overfitting into something to embrace rather than to avoid.

As such the boundary, while part of my previous day job (I moved from data/software engineering to a more research-oriented role), has always been a bit uneasy. Machine learning was part of my work, and I've used it, but classical stats always felt like the cleaner way to do things. Machine learning is like the hammer you use to hit ten thousand screws into wood when all you have is a hammer and a manual screwdriver. Through machine learning, and just to keep up with everything data engineering, I always kept up with developments in AI: reading papers, running some tests for educational purposes, and so on. But I wouldn't call myself an AI expert, even if many with a much lower relevant skill set seem to call themselves AI experts today. Again, title inflation, and I'm not participating in any of that. But seriously, how does talking to an LLM from code and hacking something together on top of some API make you an expert at AI when you don't even understand the basics of how these models actually work, or even that they work by the grace of relentless overfitting? If you don't understand even the basics of classical statistics, even if it's only basic college-level frequentist stats, you cannot possibly be anything close to an AI expert. But enough on inflated titles, on to the core of my love-hate relationship with AI.
Art and fiction
Next to data and software engineering, I have been writing fiction for a while. Something over a decade right now. In the prologue of my novel Ragnarok Conspiracy, which takes place in 2027 but which I wrote back in 2017, I extrapolated from my insights into AI and painted a picture of a world with AI that is pretty close to the world we live in right now. In that prologue, a young hacker and scammer from the Philippines uses fake AI-generated single Filipina women to scam horny westerners out of some of their crypto. Have a look at the chapter if you doubt my early insights into where things were going with AI.
I always understood that AI and copyright for fiction and art could become a problem. As a long-term contributor to open source (close to 30 years now), I considered open source safe because of the legally strong copyright statements being used and their established status, but fiction, especially when hosted on digital book platforms, usually lacks the level of legal precision that software has. Next to my fear of AI making use of the legal loopholes that this ambiguity in fiction copyright leaves, I've always had a bit of beef with the ownership model of DRM as used in ebooks. It is ownership without ownership. No lending to a friend, no giving your book away; even switching platforms is like buying a new book cabinet and needing to buy every single book you ever bought again, or keeping the old cabinet around just so you don't lose the books. We need paper-like ownership, and if the platform doesn't give it, the copyright should.
In order to address both, in 2023, when AI was already starting to get a bit out of hand, I put together a bit of a framework for open source and partially open source fiction: the Open World licence framework. Read it if you care about AI usage of fiction and about ownership models of digital work.
Provenance
As I said, my day job used to live on the edge between data engineering and software engineering, mostly in infosec, crypto and forensics related applications. Doing software engineering, I was ignorant enough to think it was absolutely OK to use AI (LLMs in particular) as a coding assistant. My trust in the strength and unambiguity of established open source licences and attribution requirements made me believe that no AI cloud provider would be reckless enough to risk infringing on the copyrights of open source software. And boy, did I turn out to be wrong. I used code assistance for way too long and I should have known better. Using it, I already ran into the problems that many users run into and try to work around: the problem of provenance with AI code assistance tools. First of all, there is the way these tools don't allow for convenient provenance that integrates with the industry-standard provenance tool of the entire software development world right now: git.
There is much more to it, but traditionally git blame has been the backbone of handling provenance in code. Who is to "blame" (for lack of a better word) for every single line of code in a codebase, how is that line related to other lines, possibly in other files (the git commit), and what was the reason for the collection of changes anyway (the commit message)? But here is the problem: the programmer and the AI assistant helping them are two distinct entities, and they should be treated as such in terms of provenance.
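Just to make concrete how much provenance git already gives you for free, here is a small sketch that tallies git blame attributions per author for a file. If an assistant committed under its own identity, it would simply show up as a separate entry; the file path and usage are of course hypothetical.

```python
#!/usr/bin/env python3
"""Tally per-author line attribution for a file using git blame.

A small illustration of the provenance git already gives you for free;
an AI assistant committing under its own author would get its own count.
"""
import subprocess
import sys
from collections import Counter

def blame_counts(path: str) -> Counter:
    # --line-porcelain repeats the full metadata for every line, so we
    # can simply count the "author " header lines.
    out = subprocess.run(
        ["git", "blame", "--line-porcelain", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(line[len("author "):]
                   for line in out.splitlines()
                   if line.startswith("author "))

if __name__ == "__main__":
    # Usage: python blame_tally.py src/some_module.py
    for author, lines in blame_counts(sys.argv[1]).most_common():
        print(f"{lines:6d}  {author}")
```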
But the provenance problem doesn't start there. It starts at the AI company whose API our code assistance tools are talking to. Having looked at machine learning in computer forensics, I know enough about the importance of provenance in machine learning, about why it takes more resources than ignoring it, especially for basic machine learning, but also about why the relative overhead of adding provenance to the models drops as the models get bigger and the cost of compute starts to dominate. Don't get me wrong, the overhead is still large in terms of storage and lookups; it's just that with growing models the compute part of the cost starts to greatly dominate the overall cost. Hence skipping provenance from the start benefits the big AI companies far less than we are often made to believe.
Such provenance would close the open wound that currently exists with code regurgitation outside of the most data-intensive parts of the training set. When building a basically standard CRUD web app with Python and JavaScript, Django and React, the chances of getting a licence-erased, pirated piece of open source code handed to you by one of the code assistance tools are pretty slim. Things change a lot, though, when you prompt the assistant with niche prompts, as I did a few months back, when I got something close to my own BSD-licenced code back, without copyrights and without attribution. Provenance could fix this. Currently none of the big AI companies offering code-assistance models are doing this, and none of the code assistance environments would surface it even if they started to.
So in short, provenance for code with AI is lacking. The AI companies are making us steal IP by not including provenance, and the code assistance tools at our disposal don't integrate with git in a way that distinguishes between a developer messing up their own code and a developer failing to properly vet the proposals that the AI assistant is making.
LLMs for code reviews
While code assistance tools come with this double provenance problem when generating code, and require subscriptions costing hundreds of euros a month before a full-time developer can do anything useful with them, there is not much wrong with using LLMs to do a quick code review of a piece of code. Just use the standard text chat interface, post the code in a ``` code block and ask for a code review: quick, free, and often useful.
I currently use the following three:
- DeepSeek: Quick shallow list of pointers to look at, best one to start with to do some quick fixes before diving in deep
- Gemini: Decent deep dive into what it finds, and what it finds is usually accurate and in need of attention. Its suggestions aren't too useful though, but you have a mind of your own.
- Grok: Finds more problems than Gemini, but many of them don't actually exist. It will confidently call perfectly valid code "completely broken" because it fails to understand it. Great addition, but don't trust it.
All in all, LLMs for code quality through basic code reviews are useful. Absolutely not a full replacement for human eyes, but still pretty much a net productivity boost.
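For what it's worth, here is a minimal sketch of how I tend to package code for such a chat review. The file name and the exact review question are just my own habits, nothing official; it simply wraps a source file in a ``` block with a review question on top, ready to paste into any of the chat interfaces above.

```python
#!/usr/bin/env python3
"""Print a ready-to-paste code review prompt for a chat LLM.

Just a convenience sketch: it wraps a source file in a code block and
prepends the kind of review question I tend to ask.
"""
import sys
from pathlib import Path

FENCE = "`" * 3  # triple backticks, built up so this snippet stays paste-safe

REVIEW_QUESTION = (
    "Please do a code review of the code below. Focus on correctness, "
    "error handling and readability, and list concrete issues rather "
    "than general praise."
)

def build_prompt(source_path: str) -> str:
    code = Path(source_path).read_text(encoding="utf-8")
    # The fenced block makes the chat interface treat it as code.
    return f"{REVIEW_QUESTION}\n\n{FENCE}\n{code}\n{FENCE}"

if __name__ == "__main__":
    # Usage: python review_prompt.py some_module.py
    print(build_prompt(sys.argv[1]))
```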
Vibespiration
Next to basic code reviews, there is another use case where LLMs excel in helping: inspiration. Just talk to the LLM, no subscriptions needed, just basic chat; talk to it about your current challenges and ideas, just give it your brain dumps. Often it will spit out useless slop answers, but maybe once in every three or four times it will actually come up with something inspirational that actually helps solve the problem. When it does, press it; it's possibly in the zone, so engage with it in a mutual interactive game of inspiration. I believe strongly that, until AI companies address the provenance problem with training and AI code assistance tools address the provenance problems with git workflow integration, this vibespiration is the sweet spot for using LLMs for code in 2025. No integration and for sure no vibecoding, but inspiration. Vibespiration, so to speak.
Can we trust New Tortuga with our prompts?
Cloud AI is pretty much a black box. We already know the big AI companies aren't doing any provenance that would allow us to trust there is no copyright infringement in the code the LLM spits out at us. We know licence erasure of open source code is real and not uncommon. We know from recent lawsuits that big AI companies are training their models on art and fiction based on ideas about the American legal concept of fair use that don't stand up in court, not even in the US. So when AI companies are not being ethical with IP when it comes to open source, art or fiction, do we have reason to believe they are going to handle our prompts in an ethical way? When we are building with code assistance tools or sharing our code for code reviews, are we setting ourselves up for IP theft of our own work in progress? We basically know that right now the US AI industry is making China look like a mere pickpocketing street urchin when it comes to IP theft; the US AI industry is the new Tortuga of IP buccaneering. So should we trust these pirates with our own IP just because we are paying a subscription? Or even if, for some naive trusting reason, we would, should we trust them with it when we are non-paying users of the free chat-oriented version of their product?
I think very few people who have actually thought this through will think they deserve any such trust. So even when going for vibespiration and code reviews, consider LLMs to be after your IP. Maybe not directly, but consider that your IP is likely going to end up as part of their training data; training data that might very well get regurgitated if in the future a competitor ends up feeding the model the right prompt. Be very careful with your IP when talking to LLMs, no matter if they are American or Chinese. Our IP isn't safe with them, period. And that, I think, is a fact whose impact most of us underestimate.
Ethical SLMs
While Large Language Models are pretty advanced when they aren't hallucinating or making paying users accessory to IP theft by licence-erasing the shit out of open source code, when you just need some basic refactoring, some suitable skeleton template code to start from, or even something slightly more advanced, Small Language Models that you run on your own hardware are actually quite a reasonable option. You can run some of the smaller versions of less generic models on quite modest hardware, like the GPU of a higher-end laptop or something like the Jetson Orin Nano Super that you can get for under €300. Spend ten times that on a workstation setup with a high-end GPU, and you can comfortably run the bigger versions of one or even multiple of these Small Language Models on your own hardware, at a slightly increased energy bill that is still many times lower than an LLM subscription. Sure, the one-time investment sets you back the equivalent of a number of months up to a year or so of the subscription model, but the lifespan of a workstation with a decent GPU is quite a bit more than a year.
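To give an idea of how low the barrier is, here is a minimal sketch of running a small code model locally with the Hugging Face transformers library (with accelerate installed for the device mapping). The model name and the generation settings are just an example of the kind of BigCode-sized model I mean; pick whatever fits your hardware.

```python
# Minimal sketch: run a ~3B parameter code model on your own GPU.
# Model choice and settings are illustrative, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL = "bigcode/starcoder2-3b"  # example of a small, permissively trained model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16,   # fits comfortably on a modest GPU
    device_map="auto",            # put it on the GPU if one is available
)

prompt = "# Python function that parses an ISO-8601 date string\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```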
No running out of tokens on a $200 subscription that you thought would last you the entire month but that hits the ceiling after one and a half weeks, no fear of your own IP ending up in the training set for the next version of the models, and, if you pick an ethical SLM like the ones from BigCode, a much reduced risk of problematic regurgitation. Ethical models at least limit their training set to permissively licenced open source software. This doesn't mean you are immediately off the hook, but as the training set is actually public, you can clone it, and at night, when your hardware isn't doing anything more urgent, you can run similarity tests of your new commits against the training body. It takes a bit of work, and a bit of storage, but it is absolutely doable.
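As an illustration of what such a nightly check could look like, here is a rough sketch that compares your newly added code against a local clone of the training corpus. The corpus path, the file suffixes and the similarity threshold are all assumptions you would tune for your own setup, and the whole-file comparison is deliberately crude.

```python
#!/usr/bin/env python3
"""Rough overnight similarity check of new commits against a local
clone of an SLM's public training corpus. Paths, suffixes and the
threshold are assumptions; treat this as a starting point, not a tool."""
import difflib
import subprocess
from pathlib import Path

CORPUS_DIR = Path("/data/slm-training-corpus")  # hypothetical local clone
THRESHOLD = 0.85                                # flag anything this similar

def added_lines_since(rev: str) -> list[str]:
    """Collect lines added to the repo since the given revision."""
    diff = subprocess.run(
        ["git", "diff", "--unified=0", f"{rev}..HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line[1:] for line in diff.splitlines()
            if line.startswith("+") and not line.startswith("+++")]

def scan(rev: str = "origin/main") -> None:
    new_code = "\n".join(added_lines_since(rev))
    for path in CORPUS_DIR.rglob("*.py"):          # extend to other suffixes
        # quick_ratio() is a cheap upper bound; escalate to ratio() on hits.
        ratio = difflib.SequenceMatcher(
            None, new_code, path.read_text(errors="ignore")
        ).quick_ratio()
        if ratio >= THRESHOLD:
            print(f"possible overlap ({ratio:.2f}): {path}")

if __name__ == "__main__":
    scan()
```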
A local SLM won't give you the full power of a cloud-based LLM, especially not when you go for budget hardware with a 3B model, but it won't give you any of the productivity-killing IP headaches either. You can achieve imperfect, duct-tape provenance, with a delay, on the model side.
Hooksy SLM agents
Now for the provenance problem with code assistance tools. Ask yourself: do you really want to use the kind of AI integration into your IDE that messes up your provenance and merges you with your AI assistant? I think that if you have a mindset anywhere close to professional, then you don't. Imagine there is another way. A gitsy way, so to speak. Consider this workflow:
- Create a topic branch.
- Add special prompting comments to code files
- Commit to the topic branch
- Push the topic branch to remote
- Wait for a DM
- Pull the topic branch and evaluate the outcome of the prompts; treat it like you would treat a merge request from a co-worker
- Either fix and merge, or roll back and try again, or do it yourself.
The result? You and the SLM aren't recorded in the git provenance as being the same entity.
So how do we achieve this? It is simpler than you may think. Many simple git setups allow you to add hooks that get triggered when you do a push. You can run such a simple git server on your Orin Nano, or if you went for that €3000 workstation, you can run GitLab and set up full-scale CI/CD with Docker and all the professional stuff. But no matter whether you choose cheap and minimal or expensive and professional, the end result is that you are basically creating a simple git-hook-driven AI agent that acts like an actual co-worker connected to git. You do a commit with special comments, the agent converts these comments into prompts, updates the code and basically provides you with a merge request for its contribution that you can do with as you please, with all provenance taken care of.
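As a sketch of what that could look like on the server side, here is roughly the shape of post-receive hook I have in mind, written in Python. The prompt-comment marker, the branch naming and the ask_slm call are my own conventions and placeholders, not anything standardized; the important part is that the agent's commit carries its own author, so git blame never confuses you with your assistant.

```python
#!/usr/bin/env python3
"""Sketch of a post-receive hook that turns pushed prompt comments into
an SLM-authored proposal branch. Marker syntax, branch naming and the
SLM call are placeholders; the point is the provenance-friendly flow."""
import subprocess
import sys
from pathlib import Path

MARKER = "# SLM:"                              # my own prompt-comment convention
AGENT = "SLM Agent <slm-agent@localhost>"      # distinct author for git blame
WORKTREE = Path("/srv/slm-worktree")           # a clone the hook operates on

def git(*args: str) -> str:
    return subprocess.run(["git", "-C", str(WORKTREE), *args],
                          capture_output=True, text=True, check=True).stdout

def ask_slm(prompts: list[str], source: str) -> str:
    """Call your local model here (for example an HTTP endpoint of whatever
    local runner you use) and return the rewritten file. Left abstract."""
    raise NotImplementedError

def handle_branch(branch: str) -> None:
    git("fetch", "origin", branch)
    git("checkout", "-B", f"{branch}-slm", f"origin/{branch}")
    changed = False
    for path in WORKTREE.rglob("*.py"):        # extend to other languages
        text = path.read_text()
        prompts = [line.split(MARKER, 1)[1].strip()
                   for line in text.splitlines() if MARKER in line]
        if prompts:
            path.write_text(ask_slm(prompts, text))
            changed = True
    if changed:
        git("commit", "-a", f"--author={AGENT}",
            "-m", f"SLM proposal for {branch}")
        git("push", "origin", f"{branch}-slm")  # then notify / DM the developer

if __name__ == "__main__":
    # post-receive gets "<old> <new> <ref>" lines on stdin for each pushed ref
    for line in sys.stdin:
        ref = line.split()[2]
        if ref.startswith("refs/heads/topic/"):
            handle_branch(ref.removeprefix("refs/heads/"))
```

The result is exactly the merge-request-from-a-co-worker feel from the workflow above: the proposal lands on its own branch, under its own author, and you decide what to merge.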
Pensions, not jobs, to fear for
If you spend some time on social media, there is a lot of buzz about AI taking jobs. There are reports of mass layoffs because companies are replacing workers with AI. Programming is often named as the place where things will go the fastest, and memes all over the place are suggesting software engineers move to jobs like welder and electrician.
The problem: none of this seems anywhere close to the truth. In fact, the opposite might be true. Looking deeper into the mass layoffs, it turns out very few (if any) of the people being made redundant are actually getting replaced with AI, and for the ones that are, the AI in question isn't Artificial Intelligence; in their case AI stands for Another Indian. Most downsizing seems to be just good old mass layoffs, only this time, to please the stakeholders, it is marketed as AI replacement, to make companies look modern and with the times when in fact they are struggling and downsizing.
Meanwhile, productivity in software engineering is increasing with AI, but not with hundreds of percentage points like vibe coding advocates and AI company acolytes like to preach; no, even tens of percentage points are mostly anecdotal. In code, productivity gains tend to be single-digit or barely double-digit percentage points. I believe this is, among other things, because of the provenance issues both in the training data and in the git workflow.
In the meantime, China is taking the lead in industrial AI, showing real, consistent efficiency and productivity boosts deep into the double-digit percentage points, while at the same time robotics is leaving the factories. So if anything, it's jobs like welder and electrician that are the most at risk, not because of LLMs but because of advances in robotics and industrial AI, mainly from China.
Add to this the fact that NVIDIA currently has a market cap of 3.8 trillion euros, leading the top ten of companies with the highest market cap in the world, a top ten that includes Microsoft (big OpenAI stakeholder), Alphabet (Google/Gemini), Amazon (big Anthropic investor) and Meta (Llama). That is five US cloud-LLM-investment-heavy companies with a total market cap of 13.7 trillion euros; for contrast, the total M2 money supply of the USD, the world's reserve currency, is 19.1 trillion euros. If the bubble grows much bigger, there will be more money in American LLM-dependent stock than there exists in actual USD.
Combine this with the massive problems with IP and lack of provenance that we established, problems the big players don't seem to be making any effort to fix, and it becomes clear that the LLM part of AI might be a huge bubble, with at least five of the ten most valuable companies heavily exposed to a bubble burst.
The problem is, these companies feature massively in many European pension funds, as well as in US 401(k)s. So while AI isn't really going to take away people's jobs in LLM-adjacent work like software engineering, our pensions aren't actually so safe. The LLM part of the AI industry currently is the new Tortuga, and once the world realizes that, once customers and investors wake up to that reality (something that I think has already started with the most recent court rulings against OpenAI in favor of authors), large chunks of our pensions might evaporate.
And it isn't just an ethics and IP thing. I started off explaining that for a large part AI is based on overfitting. LLMs are currently the greatest and biggest overfitting machines you can imagine. Investors go all in on the promise of AGI, Artificial General Intelligence, or even the dream of ASI, Artificial Super Intelligence. When I was learning about AI, decades ago, the dream of AGI was clear: reasoning from first principles, the ability to apply knowledge from a field it is trained on to another, unlearned field and not be a complete idiot. Those ideas have all been replaced by benchmarks. The industry keeps moving the goalposts from concepts to benchmarks and then sets their LLMs loose on the benchmarks.
And what do LLMs do when confronted with benchmarks? What they do with everything: they overfit to it. LLM companies are thus creating a Hollywood version of AGI. We get Benedict Cumberbatch while we are told we are getting Alan Turing, and the script, the benchmarks, gives us quite a convincing Alan Turing too, but it's not Alan Turing, it's not AGI, it's just convincing overfitting against the benchmarks.
Investors will not remain ignorant of these facts forever.
Inspiration, not integration
I'm not anti-AI, and I'm not pro-AI. I don't believe AGI is anywhere close, or that AI is really going to take the jobs of many people for a long time. And when it does, it will start in the factories, and then, as robotics moves out of the factories, move on to the tradesmen: welders, electricians, etc. Eventually maybe programmers too, but that, like AGI, should be decades away. In the meantime LLMs are useful as a tool, but they are dangerous in many ways. Provenance is key, and until provenance gets addressed, both in training and in workflows for things like coding, cloud-based LLMs at least will be and will remain the buccaneers of the new Tortuga. Use these LLMs for inspiration, not for integration, and be careful with your IP when you use them for things like code reviews. And if you are a developer and need more, have a look at local SLMs and at hooksy agents that integrate into a high-provenance git workflow.