
A Pilgrim’s Progress #2: The Data Science Tool Kit

This is the second post about becoming a computer scientist after a career in software engineering. The first part may be found here.

Only a student would think that software developers mostly write computer programs. Coding is a blast, and it’s why you get into the field, but the great majority of professional programming time isn’t spent coding. It goes into the processes and tools that allow humans to work together, such as version control and Agile procedures; maintenance; testing and bug fixing; requirements gathering; and documentation. It’s hard to say where writing interesting code is on that list. Probably not third place. Fourth or fifth, perhaps?

Figure: Linear PCA vs. nonlinear principal manifolds (image credit: Andrei Zinovyev)

Fred Brooks famously showed that the human time that goes into a line of code grows roughly quadratically with the size of the project (I’m paraphrasing outrageously). Coding gets the glory, but virtually all of the significant advances in software engineering since Brooks wrote in the mid-1970’s have actually been in the technology and management techniques for orchestrating the efforts of dozens or even hundreds of people to cooperatively write a mass of code that might have the size and complexity of War and Peace. That is, if War and Peace were first released as a few chapters, and had to continue to make sense as the remaining 361 chapters came out over a period of many months or even years. Actually, War and Peace runs about half a million words, or 50,000 lines, which would make it quite a modest piece of software. In comparison, the latest Fedora Linux release has 206 million lines of code. A typical modern car might have 150 million. MacOS has 85 million. In the 1970’s, four million lines was an immense program.

One is drawn to software engineering because of the sheer excitement of understanding a problem and encoding a solution, but the industry rarely lets top programmers code indefinitely. It doesn’t make economic sense to waste them on writing code. They are needed for the hard part, which is wrangling the humans that write the code, and reducing the wishes and dreams of other humans to a problem sufficiently well defined that software can satisfy it. These things have their own joy, but it’s probably not what entranced you when you chose the field as a sophomore.

Data science is more like what computers once seemed to promise: that pure effort to use algorithmic means to make something magic happen, to find meaning or understanding that could not be seen except through an algorithmic lens. Fred Brooks’s formula works backwards as well as forwards: the ratio of benefit to hours expended can be huge for the tiny programs that data scientists write.

The Tools: A Developing List

Data scientists still develop code, so they don’t escape Git and Agile meetings altogether, but the teams are small and more importantly, they aren’t usually in the main integration and release path. This keeps the massive overhead of coordination from becoming the main event. Modulo my obvious prejudices, what’s left tends to be the fun part.

I’m going to refine and build this list over time. It’s everything that the real-life data science people I know actually use, plus what seems to be covered in all the courses, and the general computer literacy that every serious user needs. Will you use it all? Unlikely. Do you need it all on your first day? No way. But I think that one needs to be aware of the existence of all of these things and at least familiar with a lot of them. They run from the most pedestrian office software to algorithmic exotica.

The list is far from complete.

The Basic Mathematical Tools

  • Statistics and Probability
    • Descriptive statistics
    • Inferential statistics
    • Probability and Bayesian methods.
  • Linear Algebra: Did you take this in school and wonder in what life you would ever care? Well, here you go! Fortunately, nobody does it by hand. However, it’s the conceptual underpinnings of many important algorithms. It’s a way of looking at the world.
  • Basic math: Algebra, Calculus. Familiarity with differential equations. Again, your work day won’t be proving theorems, but you really need it to understand what a lot of the algorithms are doing.
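
To make the statistics and probability items a little more concrete, here is a minimal sketch using only Python’s standard library. The sample values and the test probabilities are made up for illustration; nothing here comes from a real data set.

```python
import statistics
from fractions import Fraction

# Descriptive statistics: summarize a small, made-up sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(sample)
spread = statistics.pstdev(sample)  # population standard deviation

# Bayesian reasoning: probability of a condition given a positive test.
# All three inputs are hypothetical numbers chosen for the example.
prior = Fraction(1, 100)            # P(condition)
sensitivity = Fraction(9, 10)       # P(positive | condition)
false_positive = Fraction(5, 100)   # P(positive | no condition)

# Bayes' rule: P(condition | positive) = P(positive | condition) P(condition) / P(positive)
evidence = sensitivity * prior + false_positive * (1 - prior)
posterior = sensitivity * prior / evidence

print(mean, spread, posterior, float(posterior))
```

With these assumed inputs, a positive result still leaves the condition fairly unlikely, which is the kind of intuition the probability background is there to supply.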


In principle, you can code anything in any language. In practice, data scientists do most of their work in just a few languages, notably Python and R, that are relatively simple to code in because they don’t have a lot of the conceptual overhead that makes languages like Java and C++ more suitable for large software systems.

  • Python: the all-purpose Swiss army knife. Python has a huge user base and an extremely diverse and widely used set of libraries for data science, math, and technical computing in general. If an algorithm exists, it probably exists in Python. It is also simple enough to use for many purposes one might otherwise use shell programming (e.g. Bash) or Perl for. Bash is powerful, but quirky to say the least, and Perl would be my immediate go-to nominee for ugliest language ever. Python isn’t the fastest language in terms of execution, but it’s intuitive, easy to code in, and easy to get help in. With the help of Google you can be productive on day #2. Also, when platforms like Hadoop or Spark support multiple languages, Python is almost always one of them. The key libraries everyone needs are:
    • Numpy: Stands for numerical Python. Pronounced num-pie by most, but some pronounce it to rhyme with lumpy, often to the annoyance of those who prefer num-pie.
      • It is a widely used library that provides high-performance, multi-dimensional array data structures and complementary numerical processing support.
      • Provides simplified mechanics for reading data and writing it back to disk.
      • From the programmer’s point of view, numpy is just a Python library, but under the covers, it is written in C, which makes it run an order of magnitude faster than a naive implementation using native Python.
    • Pandas: data analysis and manipulation. Python trivia: the name is short for “panel data” which is a term I’ve heard in no other context.
      • This library is built on numpy and provides a richer environment for manipulating tabular data and time-series data.
      • The most commonly used data structure is a DataFrame, which is a spreadsheet-like table that is extremely handy. Among other things, it is easily indexed either numerically or by row and column labels.
    • Scipy: Pronounced psi-pie. Not skippy, unfortunately, as it would go better with numpy.
      • It is a vast set of libraries for scientific and technical computing.
      • It has modules for statistics, optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, differential equations and more.
  • The R language is written specifically with statistics and data science in mind.
    • It’s somewhat like Python, very expressive, and it provides some extremely convenient and powerful facilities for the mechanics of reading in data, cleaning it up, and general data wrangling and munging.
    • Statistics people love it, but it’s not well adapted to large data sets. In that respect it’s closer to tools like Matlab.
    • It tends not to be as well supported in big-data environments.
  • SQL: A relational-database query language but countless non-relational databases use it too.
    • SQL is almost purely declarative, as opposed to procedural. In other words, you tell SQL what you want, not how to get it. In the relational model, data is understood to be stored in flat tables (relations) and SQL provides a way to express in a relational algebra complex subsets of the data drawn from multiple tables.
    • If you hear about No-SQL databases and take this to mean that SQL is going away, think again. Most of the world’s data is stored either in relational databases or in big-data systems that are not relational, but nevertheless understand SQL.
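
As a rough illustration of the NumPy and pandas points above, the sketch below compares a plain Python loop with a vectorized NumPy expression, then indexes a small DataFrame by label and by position. The city names and temperatures are invented for the example, and it assumes `numpy` and `pandas` are installed.

```python
import numpy as np
import pandas as pd

# NumPy: one vectorized expression does the work of an explicit Python loop.
squares_loop = sum(v * v for v in range(1000))
squares_vec = int(np.sum(np.arange(1000, dtype=np.int64) ** 2))

# pandas: a DataFrame is a labeled, spreadsheet-like table (made-up data).
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Oslo", "Lima"],
     "temp_c": [3.0, 19.0, 5.0, 21.0]},
    index=["mon", "tue", "wed", "thu"],
)

by_label = df.loc["tue", "temp_c"]   # index by row and column labels
by_position = df.iloc[1, 1]          # index the same cell numerically

# A typical one-line aggregation: average temperature per city.
mean_by_city = df.groupby("city")["temp_c"].mean()

print(squares_loop == squares_vec, by_label, by_position)
print(mean_by_city)
```

On arrays this small the speed difference is invisible; the order-of-magnitude gain shows up once the arrays hold millions of elements.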

The Relational Model

Relational DB’s: AKA RDBMS, for Relational Database Management System. A general name for a large class of databases. The largest variety of data that does not fall under the heading of “big data” is stored relationally. See SQL above.

  • OLTP (online transaction processing): Class of RDBMS optimized to support many users simultaneously querying, inserting, and deleting data. The quintessential operational data store for businesses. A billion rows is on the large side for OLTP.
  • OLAP (online analytical processing): Class of RDBMS geared towards reporting and analytics rather than operational use. Data warehouses are OLAP systems and still have a relational database architecture underneath. A few billion rows would be reasonable.
  • MPP’s: Massively Parallel Processing systems may have hundreds of CPU’s. These usually present a SQL query interface, but underneath have something closer to a big-data, shared-nothing architecture. Netezza and Greenplum are examples. Many billions of rows.
  • Hadoop-like systems: The upper size limit is astronomical. Multi-trillion row, multi-petabyte databases are common. The data is usually accessed through a SQL-like query layer such as Hive or Impala, but the underlying storage is entirely unrelated to relations.

It is important for data scientists who will be interacting with relational data (most of them) to understand something about the underlying relational model, how data is indexed, what normalization means, etc. Do you need to be able to design a SQL database? No. And you don’t necessarily need to be the world’s best SQL programmer. Those are different jobs. But it’s hard to know what data is available to you if you can’t write some decent SQL.
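
For a feel of what “some decent SQL” looks like, here is a small sketch using Python’s built-in `sqlite3` module and an in-memory database. The `customers` and `orders` tables and their contents are hypothetical; the point is the declarative style, where the query says what is wanted (a total per customer drawn from two tables) and leaves the how to the database.

```python
import sqlite3

# An in-memory SQLite database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# Declarative query: join the tables, total the orders per customer.
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
conn.close()

for name, total in rows:
    print(name, total)
```

Storing customers and orders in separate tables linked by `customer_id` is a simple instance of normalization: each customer’s name is recorded once rather than on every order.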

Data Visualization

This is a huge area that covers many approaches. In my experience, most companies pipe information for quantitative display through the commercial systems that they use for reporting, while the data scientists themselves use Python or R libraries while they’re researching problems and to show interim results in meetings, etc.

  • Language libraries available with Python and R, such as Python’s Matplotlib, PLPlot, and Bokeh
  • Simple display capabilities of Excel
  • Industrial strength products like Tableau and Looker.
  • Fancy custom graphics
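
The kind of quick plot a data scientist might make while researching a problem can be sketched with Matplotlib as follows. The curves are placeholder data, the example assumes `matplotlib` and `numpy` are installed, and it renders to an in-memory PNG so that it runs without a display.

```python
import io

import matplotlib
matplotlib.use("Agg")  # render off-screen; no GUI needed
import matplotlib.pyplot as plt
import numpy as np

# Placeholder data: two curves to compare on one set of axes.
x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)")
ax.set_xlabel("x")
ax.set_ylabel("value")
ax.set_title("Quick look at interim results")
ax.legend()

# Write the figure as PNG bytes; pass a filename instead to save to disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
plt.close(fig)
```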

The Algorithms

These are literally too numerous to write here. Broad categories include those below. We’ll be adding to these.

  • Supervised learning looks at labeled data to create a model for predicting results for new data: E.g., Linear Regression, Logistic Regression, CART, Naive-Bayes, K-Nearest Neighbor, etc.
  • Unsupervised learning doesn’t use known outputs that correspond to known inputs, but attempts to model the underlying structure of the data. Association, clustering, and dimensionality reduction are common goals. E.g., Apriori, K-Means, PCA.
  • Ensemble learning: Techniques for combining other models, for example Random Forests, XGBoost, etc.
  • Reinforcement learning proceeds through trial and error. Algorithms typically start with random guesses and move toward a better estimate iteratively. Self-driving algorithms, natural language processing, and gaming often use these algorithms.
  • NLP algorithms for both understanding and generating natural language.
  • Neural Nets
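
To show the difference between the first two categories, here is a minimal sketch using only NumPy: a linear regression fitted to labeled data (supervised), and PCA computed from unlabeled points via a singular value decomposition (unsupervised). The data is synthetic and noise-free, chosen so the results are easy to check by eye; real work would more likely use a dedicated machine learning library.

```python
import numpy as np

# Supervised: labeled pairs (x, y) generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Least-squares fit of y ~ slope * x + intercept.
design = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
prediction = slope * 5.0 + intercept  # predict the label for a new input

# Unsupervised: PCA on the same points, treated as unlabeled 2-D data.
points = np.column_stack([x, y])
centered = points - points.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

# Share of total variance captured by each principal component.
explained = singular_values**2 / np.sum(singular_values**2)

print(slope, intercept, prediction)
print(components[0], explained)
```

Because the points lie exactly on a line, the first principal component captures essentially all of the variance, which is the dimensionality-reduction idea in miniature.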

Humble Excel

Yes, or its step-sisters OpenOffice and LibreOffice. You won’t use it as your main analytic tool, but:

  • An incredible amount of the world’s data is maintained via Excel, so you’ll often be receiving data in Excel formats and people may want results in Excel files. It is used for surprisingly heavy-duty processing such as Monte Carlo simulations. Much data available in no other form exists in Excel.
  • A single worksheet holds just over a million rows (1,048,576), and the built-in Data Model (Power Pivot) can handle many millions more.
  • It has a built-in query language for Access and SQL Server which can be useful in gathering data.

Big Data

Many big data systems either support data science processing directly or output data used by data scientists and the applications they write.

  • Hadoop-like systems: Conceptually batch-oriented even when they present a query-interface.
  • NoSQL: distributed query-oriented systems.
  • Spark, Flink, and similar platforms support running distributed concurrent programs, often in Python, over Hadoop-scale datasets.

The practical tools

  • Linux: Data scientists work on Mac, Linux, and even Windows, but familiarity with Linux is essential as it is the most common deployment platform and the platform from which data is most likely to be sourced.
  • Bash: The most common Linux shell, i.e., command-line interface. It’s what runs on the terminal.
    • Bash is a powerful, if rather eccentric computer language that is designed to be a home base for the vast array of Unix utilities for manipulating your computer: kicking off computer programs, processing and moving data, all sorts of filtering, stream editing, copying data between machines, etc.
    • In many computing environments, bash shell scripts drive nearly all of the scheduled processing such as pulling data from SQL. It runs on every Mac, Linux or Unix machine.
  • Git: This is by far the dominant source-code control tool. Git allows multiple programmers to work in parallel without stepping on each other’s code and organizes the code bases that constitute a release. Every project will typically live in a “git repository” from which remote users can clone a copy to work on, and move changes back to the main repository in a disciplined manner.
  • Agile: This is a family of methodologies for managing projects, particularly software projects.
    • Almost every modern company uses some version of this system of assigning and tracking tasks, monitoring progress, and keeping team members in sync.
    • It’s ubiquitous in software development and usually spills over to data science, product, and similar teams that are close to the software organization.
  • Docker: This is a “container” for executing code.
    • Docker containers are mini environments that from the point of view of a computer program look just like a computer running some particular operating system configured with all the usual software, services, connections, disk space, etc.
    • In fact they are just containers, somewhat like virtual machines, that could be running on any operating system, be it in the cloud, in the data-center, or on your laptop.
    • Container-based deployment is extremely convenient because:
      • A developer or data scientist can develop code locally in exactly the same container that will run in production, say, in the cloud.
      • Containers are usually deployed by systems that manage keeping them running. The ideal Docker program is written so that it can pick up from anywhere or run in any number of copies.
    • Virtually any operating system and any of the tools that run on it can run in Docker.

