Uncategorized

Is This Bonkers!?

Galen Hunt of Microsoft is assembling a team right now to rewrite their entire product suite, which is currently written in C and C++, by 2030. The stated goal is to code at the rate of a million lines per month per engineer.

Don’t laugh—the goal may be extremely ambitious, but it isn’t obviously insane. There are quantifiable reasons why Mr. Hunt might not be bonkers.

The reasons take some explaining.

Uncategorized

Unabashed Self-Promotion!

This post has nothing at all to do with technology. It’s shameless self-promotion.

I’m not just writing about data and tech anymore. I’m now officially a novelist! My debut novel, a psychological thriller, “And Never Memory of You”, is now available on Amazon/KDP/Kindle as an ebook, in paperback, or for free on Kindle Unlimited.

“When fate hands Lucy Bentley the chance to deal with a lowlife that the law won’t do anything about, the small town waitress seizes the moment. Who knew cleaning up after closing time could be so satisfying?”

It’s a story of justice, love, betrayal, and revenge.

The Amazon/KDP page is: https://www.amazon.com/dp/B0GNT5JDNL

If you get curious and read it, a review on Amazon and/or Goodreads would be tremendously helpful. Reviews are easy to do, but there are some minor gotchas, like never having two people on the same account review it. (Both will get discarded.) Details can be found on the book website: https://petercoatesauthor.com/how-to-leave-a-review/

algorithms, data science, not-hadoop, twitter, Uncategorized

AI Needs a Better Acronym*

A team at METR Research recently did a study of the effectiveness of AI in software development and got some very surprising results. At the risk of oversimplifying, they report that the developers they studied were significantly slower when using AI, and moreover, tended not to recognize that fact.

METR’s summary of the paper runs to multiple pages–the full paper is longer–so a couple of lines here can’t really do it justice, but the developers on average had expected AI to speed them up by about 24%, and estimated after doing the tasks that using AI had sped them up by about 20%. In fact, on average, it had taken them about 19% longer to solve problems using AI than it did using just their unadorned skills.

My statistics days are behind me, but the work looked to be of high quality, and the people at METR seem to know how to do a study. Surprising though the result may be, the numbers are the numbers.

Yikes! How Can That Be?

Superficially, METR’s results seem wildly at odds with my experience using AI for coding. I used it for a month, and felt that it vastly increased my output, not by a double-digit percentage, but by a large factor. Four times? Five times? More? It’s impossible to know for sure, but a lot.

Yet, after reading the study more carefully, not only did it begin to seem like a dog-bites-man story, I came to feel that their results strongly validate my own experience and understanding of AI. If you take the message to be, “AI doesn’t make developers faster,” you are probably missing the point.

(Don’t take my one-paragraph summary too seriously. METR’s summary can be seen here. The full study can also be reached from within that page.)

Uncategorized

AI Programming Part 2: The Particulars

Part one of this article is about AI programming at a high level and where it might take us. This second part is about the practical lessons I learned in using it. Read this piece, take all of my advice as gospel, and by the end of the article, you’ll be an expert too.¹

If you didn’t catch part one, I used AI to rewrite in Go a program that I wrote by hand in Java more than ten years ago. The project comprises about 15,000 lines of Go, plus another 5,000 lines of Python, JavaScript, and shell scripts. It took two weeks and change to write using AI, a small fraction of the time it took to hand-write the original.

The program does a simple thing very fast: it reads the X (formerly Twitter) firehose and extracts the newly emerging subjects. It does this on up to about 8,000 Tweets/second when reading from RabbitMQ, and at about 50,000 Tweets/second reading directly from disk files.

This program makes a good test project for using AI because of its diversity. It’s a high-performance program with lots of concurrency, lots of disk reads and writes, text parsing, tokenizing, some intense algorithmic processing, some esoteric graphing libraries, and a simple Web front end.

Uncategorized

AI Programming, Part 1

This is the first of two essays about programming with AI. Part two can be found here.

I recently used Cursor to rewrite in Golang a substantial program that I first wrote in Java more than a decade ago. It took two weeks to write it using AI, and since then I’ve messed with it casually, adding some features and making minor changes.

The effort went very well. Cursor informs me that the new version comprises 15,460 lines of Go, 992 lines of Python, 1,584 lines of shell scripts, and 3,842 lines of HTML and JavaScript. That’s 20,557 lines written in a long couple of weeks, plus documentation and a manual.

That’s a lot of code in a very short time, and the throughput of the resulting program is an order of magnitude greater than the original.

Given that level of productivity, it’s easy to see why undergraduates are voting with their feet, running from Computer Science programs like they are plague wards. Unsurprisingly, the old Unix geezers snort at the idea that AI could ever program anything meaningful, goddammit. But so what–those guys snort at everything.

I think both opinions are wrong. The truth is more complex and more hopeful.

Uncategorized

A Pilgrim’s Progress #4: Panda Series

This is the fourth in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #3: NumPy.

I’m trying something new out here. These posts are coded in Jupyter, which is an extremely handy way to intermingle text and executable code. It comes with Anaconda, which is the best way to get everything going if you’re starting out. For the first couple I cut-and-pasted the material over to WordPress. This time I downloaded the Jupyter file as HTML and pasted it in. That is about 100x faster, but the result is painful to edit once pasted in, so it’s far from a perfect solution. Any ideas?

Pandas are insanely versatile and capable of far more than I’ve covered in this already excessively long set of notes. At best this is a way to get an idea of how they work and a quick tour of what they look like in use.

Uncategorized

A Pilgrim’s Progress #3: NumPy

This is the third in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #2: The Data Science Tool Kit.

What Is NumPy?

NumPy is a library of high-performance arrays for Python. After this I’m going to mostly call it numpy because that’s the name of the package you import. Whatever we call it, numpy supports creating and manipulating arrays of any number of dimensions and the ability to easily reshape them and slice them in complex ways on the fly.

The elements of any numpy array can be accessed in a variety of ways. You can access single elements, of course, but there is a powerful syntax for accessing all sorts of rectilinear slices in one or more dimensions. We’ll look at some of that below.
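
As a rough illustration of the reshaping and slicing described above, here is a minimal sketch. The array contents and the variable names (grid, block, and so on) are made up for this example and are not taken from the full post.

```python
import numpy as np

# Hypothetical data: the integers 0..11 reshaped into 3 rows by 4 columns.
grid = np.arange(12).reshape(3, 4)

single = grid[1, 2]          # one element: row 1, column 2
row = grid[0, :]             # the entire first row
column = grid[:, 1]          # the entire second column
block = grid[0:2, 1:3]       # a 2x2 rectangular slice
every_other = grid[:, ::2]   # every other column, all rows

print(single)
print(block)
```

Each slice is expressed in a single indexing expression, with one index or range per dimension, rather than in a loop.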

As the name implies, numpy is designed to support mathematical computing, and is thus packed with convenient features for operating on data as an array or matrix.

Every programmer is used to iterating over the elements of an array using a loop or an iterator, which is a concept that is easily extended to using nested loops to iterate over multi-dimensional structures. Numpy takes a higher-level approach, emphasizing applying operations to an entire array, rather than merely using an array as a repository for data that will be explicitly operated on by loops in your code. Functionally, the two approaches are of equal power–there’s still a loop going on within numpy, but in practice, applying functions to data structures results in simpler, cleaner code that’s easier to understand. The way I look at it is, code you don’t have to write has the fewest bugs, so the less code the better.
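
To make the contrast concrete, here is a small sketch that computes the same result both ways. The data and the names (values, result_loop, result_array) are invented for illustration only.

```python
import numpy as np

# Hypothetical input data.
values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop style: the array is only a container, and our code visits each element.
result_loop = np.empty_like(values)
for i in range(len(values)):
    result_loop[i] = values[i] ** 2 + 1.0

# Whole-array style: one expression, with the looping done inside numpy.
result_array = values ** 2 + 1.0

# Both approaches produce the same numbers.
print(np.array_equal(result_loop, result_array))
```

The second form has no index variable and no bounds to get wrong, which is the "less code, fewer bugs" point in practice.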

algorithms, data science, data science career, Hadoop, machine learning, Uncategorized

A Pilgrim’s Progress #2: The Data Science Tool Kit

This is the second post about becoming a computer scientist after a career in software engineering. The first part may be found here.

Only a student would think that software developers mostly write computer programs. Coding is a blast–it’s why you get into the field–but the great majority of professional programming time isn’t spent coding. It goes into the processes and tools that allow humans to work together, such as version control and Agile procedures; maintenance; testing and bug fixing; requirements gathering; and documentation. It’s hard to say where writing interesting code is on that list. Probably not third place. Fourth or fifth perhaps?

Linear PCA vs. nonlinear Principal Manifolds (image: Андрей Зиновьев / Andrei Zinovyev)

Fred Brooks famously showed that the human time that goes into a line of code is inversely-quadratic in the size of the project (I’m paraphrasing outrageously). Coding gets the glory, but virtually all of the significant advances in software engineering since Brooks wrote in the mid-1970s have actually been in the technology and management techniques for orchestrating the efforts of dozens or even hundreds of people to cooperatively write a mass of code that might have the size and complexity of War and Peace. That is, if War and Peace were first released as a few chapters, and had to continue to make sense as the remaining 361 chapters came out over a period of many months or even years. Actually, War and Peace runs about half a million words, or 50,000 lines, which would make it quite a modest piece of software. In comparison, the latest Fedora Linux release has 206 million lines of code. A typical modern car might have 150 million. MacOS has 85 million. In the 1970s four million lines was an immense program.

Uncategorized

A Pilgrim’s Progress #1: Starting Data Science

This is the first of what I hope will be a series of many posts documenting a pilgrim’s progress from programming to data science.

First of all, let’s talk about the name. It is almost a rule that anything called <something>-science isn’t a science and that will hold here. Science is defined as “an intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.”

Nothing about data science fits that definition. It’s in the same boat with disciplines like library science, political science, management science, rocket science, and computer science that use mathematics and/or science to do interesting things but aren’t science themselves.

While the sciences study the world itself, data science studies the techniques for understanding the world through data. Data science is applied to some concrete field, be it science, politics, or advertising, but you wouldn’t say it’s “advertising science.” Of course not–it is its own thing. Trying to fit it in under the heading of science is what philosophers call a category error, like considering the manufacturing of firearms to be a branch of wildlife management.

Uncategorized

Not A Review: System76

I don’t usually do product reviews. Not that I’m against them, but this isn’t that kind of blog. I don’t buy or use a wide enough variety of computing equipment to have a valuable opinion. The truth is, as long as my personal computer is fast enough, I don’t have much reason to care about the nuances of processor tradeoffs, bus speeds, and the subtleties of graphics cards. Developing code actually isn’t very demanding in terms of hardware and when the code I write is deployed, it’s usually on swarms of anonymous generic servers managed by people I’ve never met.

What does matter at all levels is the operating system. The OS is the real computer. From inside a computer program you normally cannot see the hardware (unless you’re in a very esoteric field of programming). All your code sees is the pretty face the OS puts on it. Still less can a user see the hardware. As long as there’s plenty of CPU and disk, the main thing you are aware of is the windowing system and the terminals. You occasionally have to do things that look like they involve hardware, like mounting disks, but even then, what you see is a layers-deep idealization provided by the operating system.

hadoopoopadoop

Big Data with Hortonworks Hadoop

Author Archives: Peter Coates
