A team at METR Research recently did a study of the effectiveness of AI in software development and got some very surprising results. At the risk of oversimplifying, they report that the developers they studied were significantly slower when using AI, and moreover, tended not to recognize that fact.
METR’s summary of the paper runs to multiple pages–the full paper is longer–so a couple of lines here can’t really do it justice, but the developers on average had expected AI to speed them up by about 24%, and estimated after doing the tasks that using AI had sped them up by about 20%. In fact, on average, it had taken them about 19% longer to solve problems using AI than it did using just their unadorned skills.
My statistics days are behind me, but the work looked to be of high quality, and the people at METR seem to know how to do a study. Surprising though the result may be, the numbers are the numbers.
Yikes! How Can That Be?
Superficially, METR’s results seem wildly at odds with my experience using AI for coding. I used it for a month, and felt that it vastly increased my output, not by a double digit percentage, but by a large factor. Four times? Five times? More? It’s impossible to know for sure, but a lot.
Yet, after reading the study more carefully, not only did it begin to seem like a dog-bites-man story, but I also came to feel that their results strongly validate my own experience and understanding of AI. If you take the message to be “AI doesn’t make developers faster,” you are probably missing the point.
(Don’t take my one-paragraph summary too seriously. METR’s summary can be seen here. The full study can also be reached from within that page.)
Giving AI Coding a Fair Trial
I wrote elsewhere in this blog about recently doing a significant development project specifically in order to understand AI in software development. I used Cursor, but I doubt the specific tool matters much.
The Cursor IDE/Editor is simply awful, like something dug up out of the 1990’s and reanimated by a virus, but the AI’s capabilities were amazing. I can’t say how it compares to other tools, but I was blown away.
The subject was a fairly complex program to analyze the Twitter firehose to find new subjects in real time. Here is a simple block diagram for orientation. The project, when it became stable, was about 30,000 lines, mostly in Go, but also Python, Bash, HTML, JavaScript, and SQL.
It was a good subject for an assessment of AI coding because of its diversity. It had lots of disk reading and writing, parsing JSON and CSV data, writing JSON and CSV, string processing, regex, etc. It also used an SQL database, interacted with an LLM, and employed a lot of concurrency and some distribution across machines, as well as Web services and a UI. The components built were:
- A main processor to turn the firehose of Tweets into a low-volume stream of real-time data on all the new things people are talking about. (A rough sketch of its shape follows this list.)
- A Web service and browser to explore the news-ticker of new subjects.
- A loader to put the output into a relational database
- A loader to take the data from the database subject by subject and send it to an Ollama AI for analysis, then store the results back in the database.
- A Web service and browser to explore the AI-processed view of the subjects.
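To make the first of those components concrete, here is a minimal sketch of the general shape I have in mind: a goroutine draining a high-volume channel of tweets and emitting a much lower-volume stream of subjects. The types, the threshold rule, and the term extraction are placeholders I invented for illustration, not the project’s actual code.

```go
// A minimal sketch of the main processor's overall shape, not the project's
// actual code. Tweet, Subject, the threshold rule, and the term extraction
// are all placeholders invented for illustration.
package main

import (
	"fmt"
	"strings"
)

type Tweet struct{ Text string }

type Subject struct {
	Term  string
	Count int
}

// process drains the high-volume firehose channel and emits a much
// lower-volume stream of subjects that have just crossed a threshold.
func process(firehose <-chan Tweet, out chan<- Subject, threshold int) {
	counts := map[string]int{}
	for t := range firehose {
		for _, term := range strings.Fields(t.Text) {
			counts[term]++
			if counts[term] == threshold {
				out <- Subject{Term: term, Count: counts[term]}
			}
		}
	}
	close(out)
}

func main() {
	firehose := make(chan Tweet)
	out := make(chan Subject)

	// Stand-in for the real firehose reader.
	go func() {
		for _, s := range []string{"go go go", "ai ai ai", "go again"} {
			firehose <- Tweet{Text: s}
		}
		close(firehose)
	}()

	go process(firehose, out, 3)

	for subj := range out {
		fmt.Printf("new subject: %s (seen %d times)\n", subj.Term, subj.Count)
	}
}
```

The real program is far more elaborate, but structurally it is this: channels in, channels out, with the interesting decisions in the middle.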
Skills
I can say without a shred of false modesty that I am not a Go expert. When I started this, my Go experience was limited to a few months on a project six years ago. I’m OK at SQL for someone who does not pretend to be a database person, and I have never used JavaScript at all. I have only a passing familiarity with HTML.
I do have a lot of general programming knowledge accumulated over 35 years, and a solid understanding of distributed computing, concurrency, etc, but I haven’t actually been a day-to-day developer in a while. For the last five years or so I’ve done enterprise architecture and big-data, but only incidental development.
What Was The Verdict?
30,000 lines of code would be a staggering amount of deployable code for one person to write in a month even if they were an expert in all of the languages used. Productivity varies a lot, but a typical human programmer produces more like a thousand lines a month. I’m still amazed at the scale of what I got done in a few weeks.
Is the code great? It’s a mix. At the small scale, the code Cursor generates is very good. At the level of individual functions, it is certainly better than I could have done with my rudimentary Go skills. Cursor also wrote a couple of Web servers from scratch with virtually no input from me other than defining what goes in and what comes out. You have to explicitly manage the writing of the computational components used in your Cursor-generated Web server, but Cursor will spit out the entire framework automatically, generate the browser pages, hook them together, etc.
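For a sense of how stereotyped that kind of server is, here is a minimal sketch in the spirit of what Cursor generated; the endpoint, port, and Subject fields are placeholders of mine, not the project’s real ones.

```go
// A minimal sketch of the stereotyped server shape described above, not the
// project's actual code; the endpoint, port, and Subject fields are
// placeholders.
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

type Subject struct {
	Term  string `json:"term"`
	Count int    `json:"count"`
}

func main() {
	// One handler per view; the real servers had several, plus the HTML
	// pages that call them.
	http.HandleFunc("/subjects", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the computational component you still have to
		// manage yourself (in the real program, a database query).
		subjects := []Subject{{Term: "example", Count: 42}}

		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(subjects); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Every line of that is formulaic, which is exactly why Cursor is so good at it.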
On a larger scale, it made some hilariously stupid mistakes that I didn’t catch until it had become very hard to correct them.
The Giant Caveat
Cursor is a coding tool, not a robot developer. It does some things really well, for instance it can:
- Write almost any function you can coherently describe.
- Generate entire programs such as Web servers, so long as they are highly stereotyped designs. It will build such a framework, and with guidance, it will generate each of the functions you need to stick into the framework to do useful work. If the Web service is for a browser, it will generate the other half, too, and make them work together.
- Integrate diverse pieces and get them bootstrapped from a verbal description. To start this project I typed, “I need a program to read lines of CSV from disk, and write them to Rabbit MQ. And I need another program to read the lines of CSV from Rabbit MQ and print them on the console.” and I was off to the races. (A rough sketch of the first of those two programs follows this list.)
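For illustration, here is a rough sketch of what the first of those two bootstrap programs might look like, in the spirit of what that prompt produces; it is not the project’s actual code. It assumes the github.com/rabbitmq/amqp091-go client, and the file name, queue name, and broker URL are placeholders.

```go
// A rough sketch of the first bootstrap program described above, not the
// project's actual code. Assumes the github.com/rabbitmq/amqp091-go client;
// the file name, queue name, and broker URL are placeholders.
package main

import (
	"bufio"
	"log"
	"os"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	// Connect to a local RabbitMQ broker (placeholder credentials).
	conn, err := amqp.Dial("amqp://guest:guest@localhost:5672/")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}
	defer ch.Close()

	// Declare the queue the companion consumer program will read from.
	q, err := ch.QueueDeclare("csv_lines", true, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	f, err := os.Open("input.csv") // placeholder input file
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Publish each line of the CSV file as one message.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		err := ch.Publish("", q.Name, false, false, amqp.Publishing{
			ContentType: "text/plain",
			Body:        []byte(scanner.Text()),
		})
		if err != nil {
			log.Fatal(err)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```

The consumer side is the mirror image: the same connection and channel setup, then a ch.Consume on the queue and a loop printing each delivery’s body to the console.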
It’s weird to live in a world where you have to expressly say that a machine doesn’t “understand” but of course it doesn’t–it’s a machine, duh. If it understood things it wouldn’t be just a machine anymore. Cursor can’t do anything that requires understanding. A Web server supporting a browser might be complicated, but it is highly stereotyped code. The AI has looked at an uncountable number of such programs and can generate one formulaically.
The designers of programs like Cursor and GPT go to great lengths to have their programs give the appearance of being an intelligence, but it’s showmanship. There’s nobody in there.
To build a program that is not a variation on a standard formula requires an understanding of the program’s function, and that is not something AI does. Cursor will gamely try to build whatever you describe, but it does incredibly stupid things every time if what you ask doesn’t fit into a well known template.
If I had to single out the one core skill in using AI to code effectively, it is the ability to distinguish between tasks that require understanding and tasks that do not.
Structuring the programming work that requires understanding is 100% up to you. AI can create the Lego blocks, but it’s your design.
As a rule of thumb, if something non-standard in your design is complicated enough that it would be easier to explain with a whiteboard, you might as well ask a poodle to do it.
Interestingly, Cursor seems to be incapable of recognizing its own limitation. It will gamely try to do absolutely anything you ask it to even if the task is manifestly beyond its capability. You’d think an AI could be trained to recognize when it’s in over its head.
But blaming Cursor for those problems would be like blaming your car for speeding tickets. I made this mistake repeatedly before I learned better, and the resulting problems were completely my fault for not understanding the nature of AI programming.
So Really, How Much Faster Was It?
It’s hard to say how much faster it made me, but it was a big multiple. I feel like I wrote the core Go program four or five times faster than I could have written it unaided. That is the difference between walking speed and riding a bike as fast as you can.
There are so many imponderables that it’s impossible to give a better estimate. For one thing, I was learning how to use it on the fly, alone. For another, I was relying on it for almost all of the coding, which is probably unusual; I wrote virtually nothing by hand. My hands-on participation was mostly analysis of the code to tell Cursor where it was going wrong. For a third thing, I probably understood what I was building better than is typical because I wrote essentially the same program a dozen years ago.
Another way to look at it is that the speedup was infinite, because I would never have put the effort into becoming sufficiently expert in Go to do this project. It’s not just a matter of the languages. Nothing is hard once you know how, but I wouldn’t have been able to extemporaneously type out a Web server from scratch because, honestly, I didn’t really understand interactions between the browser and the services well enough. I still don’t, really. It would have been a minor research project because I’ve never been a front-end guy. Yet there are two cool looking browser based viewers running on my machine right now.
Bottom line, as a practical matter, I couldn’t have done the project on my own in a reasonable amount of time, so I’d say the speedup was somewhere between 4x and infinity.
METR’s Results v My Experience
This brings us to why METR’s results and my experience sound so different.
The study subjects in the METR study were all senior programmers with extensive track records as committers on well-known large open source projects.
If you don’t already know, being a committer on a well-known open source project is a major flex for a developer. Those projects are where the big kids go to impress each other. It’s about bragging rights, but it’s more; it commands respect that is worth major bucks when it comes to salary. To programmers, “open source committer” is what a row of medals on your chest is to an army officer. I have no idea who runs METR, but I’d bet cash money that there’s a gaggle of open source committers on staff.
And this is the critical point. The subjects were experts in the code they were working on, solving non-trivial coding problems in the context of million-line projects. Problems worth a couple of hours of effort by top-dog programmers on their own projects don’t have canned solutions. The programmer generally has to understand the program and the problem deeply.
Such problems are almost the opposite of what AI does best. Indeed, in my experience, they are the opposite of anything it does at all.
As I said, the work that AI makes easy is the dumb 85%. Those thorny problems the test subjects were working on were part of the other 15%.
These developers were working on projects that already had a million lines. AI might suck at the class of problems the test cases were drawn from, but I’d be very confident in saying that, starting from a blank page on day one, AI could have gotten those million-line open source programs off the ground faster than any unassisted human could. AI is fantastically effective at generating all those thousands of lines of bread-and-butter code that can be described succinctly and implemented as function calls. Richard Stallman himself could not keep up with Cursor on that kind of coding, because even Richard Stallman can’t type out a description of what he wants significantly faster than a mere mortal can.
Frankly, I’m surprised that AI only slowed them down by 20%, because they were doing the part of the work that requires at least half a brain.
Conclusion
AI is just a tool, at least so far. An awesome tool, but a tool. The difference between coding with and without AI is like the difference between coding in assembler with a text editor and coding in Java or Python with a modern IDE from JetBrains.
But like other fancy modern coding tools, it attacks only what Fred Brooks called the accidental complexity of programming. Tools are extremely effective against accidental complexity. But they add relatively little to the problem of essential complexity: the difficulty of understanding and expressing the complexity that is rooted in the problem itself, and not in the code. This is the part where human understanding is as necessary as ever.
And this is why it’s a dog-bites-man story and not a man-bites-dog story. Of course it didn’t help.
* The kind of people who read articles about AI and programming will be annoyed at me calling AI an acronym–it’s an initialism.
