A Pilgrim’s Progress #3: NumPy

This is the third in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #2: The Data Science Tool Kit.

What Is NumPy?

NumPy is a library of high-performance arrays for Python. After this I’m going to mostly call it numpy because that’s the name of the package you import. Whatever we call it, numpy supports creating and manipulating arrays of any number of dimensions, and makes it easy to reshape them and slice them in complex ways on the fly.

The elements of any numpy array can be accessed in a variety of ways. You can access single elements, of course, but there is a powerful syntax for accessing all sorts of rectilinear slices in one or more dimensions. We’ll look at some of that below.

As the name implies, numpy is designed to support mathematical computing, and is thus packed with convenient features for operating on data as an array or matrix.

Every programmer is used to iterating over the elements of an array using a loop or an iterator, a concept that is easily extended to nested loops over multi-dimensional structures. Numpy takes a higher-level approach, emphasizing applying operations to an entire array, rather than merely using an array as a repository for data that will be explicitly operated on by loops in your code. Functionally, the two approaches are of equal power, since there’s still a loop going on within numpy, but in practice, applying functions to data structures results in simpler, cleaner code that’s easier to understand. The way I look at it is, code you don’t have to write has the fewest bugs, so the less code the better.
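To make the contrast concrete, here is a minimal sketch of the same calculation done both ways. The array contents and the variable names are my own, chosen only for illustration.

```python
import numpy as np

values = np.arange(1, 6)          # [1 2 3 4 5]

# Explicit loop: we manage the iteration and the accumulator ourselves.
total_loop = 0
for v in values:
    total_loop += v * v

# Whole-array style: square every element, then aggregate the result.
total_array = int((values ** 2).sum())

print(total_loop, total_array)    # 55 55
```

Both versions compute the sum of squares. The second simply has less code in which to make a mistake.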

There is a large library of math functions and other operations that can be applied to arrays in various ways:

  • Element-wise across the array.
  • Aggregations that produce a single number.
  • Functions that accumulate values such as cumulative sum.
  • Operations on pairs of arrays.
  • You can also apply any custom function designed to operate on a single value to an entire array of values.

If You’re Still New To Python

Python has some functional flavor, but it’s not a functional language. In particular, both Python lists and numpy arrays are mutable, i.e., you can change the values after they are created.

This isn’t a Python lesson but it’s worth reviewing “calling conventions” to be sure you’re clear on how Python works. Languages have different conventions for what it means to pass an object to a function. Some languages pass a reference to the argument. This convention is known as pass-by-reference. Other languages pass in a copy of the argument, which is called pass-by-value. In a very literal-minded sense, Python is pass-by-value but there’s a huge gotcha. It means almost the opposite of what you might think.

Say you create an array called my_array=np.array([3,7]) and pass it to your function as foo(my_array).

Inside the function, the thing you passed in will have some arbitrary name. Let’s call it arg1.

The function foo() can modify the array and the modifications will show up in the original my_array outside of your function. For instance, if foo() has the line arg1[0]=1, then outside of your function, my_array will be [1,7]. It can’t be pass-by-value, because if it were, my_array would continue to look like [3,7], right? Wrong! What got passed by value wasn’t the array [3,7]. It was the value of the variable my_array itself, which is actually just a pointer to the array [3,7]. Python did pass the argument by value, but you have to remember the address is the value, not the thing it points to.

Is this splitting hairs? Not at all. If you wrote your address on a piece of paper and gave it to me, that’s pass-by-value. If at some later time I looked at the paper, went to the indicated address, and threw a rock at your window, you would have to call the glass repair guy because it would happen to your actual house, not to a separate copy. But, on the other hand, if I took out the paper and erased your address or erased it and wrote a different address there, nothing would happen to your house. The number on the door wouldn’t change; you wouldn’t be rendered homeless; the house wouldn’t sink into the ground like Carrie’s house.

Inside the function, arg1 is a copy of the value of my_array, not a reference to it, and definitely not a copy of the object. You can modify the object that arg1 points to, and you can assign something else to arg1, but nothing you do affects the value of my_array or changes what it points to.
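Here is a small sketch of the rock-through-the-window case and the erase-the-paper case side by side. The function names mutate() and rebind() are hypothetical, invented just for this illustration.

```python
import numpy as np

def mutate(arg1):
    # Throwing the rock: changes the object that both names point to.
    arg1[0] = 1

def rebind(arg1):
    # Rewriting the paper: the local name now points at a new array,
    # but the caller's variable still points at the old one.
    arg1 = np.array([100, 200])
    return arg1

my_array = np.array([3, 7])

mutate(my_array)
after_mutate = my_array.tolist()

rebind(my_array)
after_rebind = my_array.tolist()

print(after_mutate)   # [1, 7]
print(after_rebind)   # [1, 7]
```

The first call changes my_array because it modifies the shared object. The second call leaves it alone because it only changes what the local name arg1 refers to.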

NumPy v Pandas v Plain Vanilla Python

We’ll be getting into Pandas in a later post, but a comparison is worth mentioning here.

There is nothing in either Pandas or NumPy that you can’t do yourself in Python. True, but not helpful. As a general rule, it’s wise to default to not using fancy libraries for simple things. For simple code that’s not too demanding in terms of size, speed, or intricate reshaping of data, I favor plain Python. It’s less for me to know, and less for whoever comes after me to know.

But if you’re going to apply the code to large data sets or do scientific computing on data of any size, you’re in Numpy/Pandas/SciPy territory.

NumPy and Pandas cover similar areas. Much of what you can do with one you can do with the other. I suspect for most people, the deciding factor will be which you are more familiar with and which is favored by your co-workers.

NumPy is extremely powerful but simpler.

  • Less to learn in order to be useful.
  • A great choice for higher dimensional data, very large data sets.
  • Has a very wide selection of math functions.
  • Less specifically adapted to tasks of data-wrangling/munging than Pandas.
  • Significantly faster than Pandas and screamingly faster than plain Python.
  • The clear choice if you anticipate needing C bindings. Usually this will not be required because your algorithm is easier to implement in C than in Python. More likely you have some kind of complex code that has already been implemented in C and you want to use it in Python. There are vast amounts of C code for scientific computing and much Fortran77 code that can be converted to C by f2c or via manual transliteration.

Pandas provides more features than NumPy.

  • Particularly strong with 1-D data and time series.
  • Widely used in the financial and econometric domain.
  • Has extensive domain-specific libraries for technical computing in many fields.
  • Better data importing features.
  • Better for interacting with SQL.
  • Notably slower than numpy.
  • Better for complex data-wrangling.

One reason that Pandas and Numpy seem similar is that Pandas is built on top of NumPy.

The last Pandas strength mentioned above can be very important. Probably half or more of a typical data scientist’s time goes into wrestling with data. The sheer dumb mechanics: dealing with missing values, converting data types, changing the shape, taking subsets, and all of the myriad tasks that go under the heading of Extract-Transform-Load (ETL). Pandas has many higher-level features for much of this.

All that being said, Pandas being implemented on top of NumPy means that it’s usually quite easy to go back and forth. Your pandas DataFrame can become a numpy array in a blink, and vice-versa.

Under The Covers

That’s all great stuff that gives us much to explore, but the underlying numpy implementation is worth mentioning for a couple of reasons.

Speed and Space

Speed is a paradoxical subject in modern computing. On the one hand, modern machines are so incredibly fast that performance hardly matters for most code. That’s why we can have wonderfully expressive and easy-to-program languages like Python. The pokey little 2.5 GHz CPU on your laptop executes about a quarter of a billion machine instructions in the time it takes to blink your eye (approximately 1/10 of a second). That’s the speed of each core, of which you probably have several. At this speed range, ten times as fast or 1/10 as fast makes no detectable difference for most code.

The trouble is, numerical algorithms work the small subset of your program that isn’t “most code” like a rented mule, applying mathematical functions to very large data sets, often inside nested loops or other constructs that multiply the amount of computation per element. Many computations converge iteratively to refine a result. What’s worse, mere multiplication by the size of the dataset or the number of iterations isn’t necessarily the end of it. That’s only true for algorithms with “linear” complexity. The run-time of many important algorithms grows disproportionately to the size of the data set.

Multi-dimensional arrays are by far the most important data structure for technical computing. Numpy, which looks like just another Python library from the outside, optimizes this relatively small amount of critical code using the kinds of techniques that languages like Python are designed to shield us from. The internals are written in highly tuned C code that greatly reduces the amount of memory required along with much of the indirection and piecemeal memory allocation that would be necessary if it were implemented directly in Python.

The result is that core numerical computing code can be more efficient by an order of magnitude than the equivalent functionality would be in pure Python, while not breaking the Python idiom from the programmer’s perspective. It’s just faster–you don’t really have to consider why.
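The memory side of this is easy to get a rough feel for. The sketch below is only an approximation under my own assumptions: it counts a Python list’s pointer storage plus the size of each integer object as reported by sys.getsizeof, and compares that with the raw buffer of an int64 numpy array. Exact numbers vary by Python version and platform.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The list holds n pointers, and each integer is a separate Python object.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# The array holds n raw 8-byte integers in one contiguous buffer.
array_bytes = np_array.nbytes

print('list (approx.): {} bytes'.format(list_bytes))
print('numpy array data: {} bytes'.format(array_bytes))
```

On a typical CPython build the list comes out several times larger, which is the indirection and piecemeal allocation mentioned above.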

View v Reality

You do want to be aware of the underlying implementation of numpy’s shape-shifting features. Transformations of the shape of multi-dimensional numpy arrays are implemented in such a way as to be much less expensive than one might imagine. Numpy tries to avoid generating a new copy of the same data whenever possible.

If a program needs to somehow reshape a data set, say, to turn a flat array into a two or three-dimensional array, numpy doesn’t rewrite the one-D array as an array of arrays. It doesn’t have to because its normal lookup code uses the values in the tuple that defines its shape to compute the actual location(s) of the elements. Converting one shape to another is usually just a matter of adjusting the contents of the tuple. This means that shape changes usually cost practically nothing.
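A quick sketch of what that bookkeeping looks like. Alongside the shape tuple, numpy also keeps a strides tuple (the number of bytes to step in each dimension), which the paragraph above doesn’t mention but which is part of the same lookup arithmetic. I fix the dtype to int64 here so the stride values are predictable.

```python
import numpy as np

flat = np.arange(12, dtype=np.int64)
grid = flat.reshape(3, 4)

print(flat.shape, flat.strides)        # (12,) (8,)
print(grid.shape, grid.strides)        # (3, 4) (32, 8)

# Both arrays look at the same block of memory; nothing was copied.
print(np.shares_memory(flat, grid))    # True

# Writing through the reshaped array shows up in the flat one.
grid[0, 0] = 99
print(flat[0])                         # 99
```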

We’ll take a look at slicing data in more detail below but for the moment, it’s just what it sounds like. Numpy lets you slice out sets of rows, columns and combinations from an array of any number of dimensions.

When slicing data, numpy tries to present the programmer with a view, so as to avoid copying and reshuffling data. See in the example below what happens when we assign a four-row, four-column shape to what was originally defined as a flat array of 16 elements. There is more than one way to do this, but here we simply assign the tuple (4,4) to the array’s shape attribute. That shows how superficial the ‘shape’ of a numpy array really is.

Slices Are Usually Views Too

Next, we slice off a single column, assign it to a variable and set all the elements in the slice equal to 99.
We see by printing out the original 4×4 matrix that the column-zero values all changed. The slice was just a view, not an independent dataset!

import numpy as np

array = np.arange(16)
print('The original 16 element array: \n{}'.format(array))
array.shape=(4,4)
print('The original 4x4 array: \n{}'.format(array))
slice = array[:,0]
print('The first column, a 4x1 slice of the array: \n{}'.format(slice))

# Test that slice.base is the same as the original array--it's a view.
print('base: \n{}'.format(slice.base is array))

for i in range(4):
    slice[i]=99
print('The original 4x4 array is changed! \n{}'.format(array))

## Note you can go the other way, too. This is defined from a list of lists
## but numpy stores it as contiguous values and uses the implied shape.
##
array2 = np.array([[1,2,3,4], [5,6,7,8],[9,10,11,12],[13,14,15,16]])
print('The shape is \n{}'.format(array2.shape))
print('The array: \n{}'.format(array2))
array2.shape=16 
print('The array: \n{}'.format(array2))
 
# If the base is anything other than None, it is a view on something.
print('The base of the slice: \n{}'.format(slice.base))

# The base  will be identical with whatever it is a view of.
print('base is identical with original array: \n{}'.format(slice.base is array))

# If you need a copy instead of a view, you can make one.
print('base for a copy: \n{}'.format((np.copy(slice)).base))

# Note that it can be confusing if you test for equality to the object you sliced.
# It can be safer to test whether the base is anything other than None.
ss = slice[0:2]
print('base for a slice of a slice: \n{}'.format(ss.base))
print('Is ss.base the same as slice? : \n{}!'.format(ss.base is slice))
print('Is it a slice of something: \n{}'.format(ss.base is not None))

The original 16 element array: 
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
The original 4x4 array: 
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
The first column, a 4x1 slice of the array: 
[ 0  4  8 12]
base: 
True
The original 4x4 array is changed! 
[[99  1  2  3]
 [99  5  6  7]
 [99  9 10 11]
 [99 13 14 15]]
The shape is 
(4, 4)
The array: 
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]]
The array: 
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16]
The base of the slice: 
[[99  1  2  3]
 [99  5  6  7]
 [99  9 10 11]
 [99 13 14 15]]
base is identical with original array: 
True
base for a copy: 
None
base for a slice of a slice: 
[[99  1  2  3]
 [99  5  6  7]
 [99  9 10 11]
 [99 13 14 15]]
Is ss.base the same as slice? : 
False!
Is it a slice of something: 
True

Sometimes you just want to use a slice of values that you know you won’t change, but sometimes you want a separate copy, for instance, to modify or because you can then discard the rest of the data.

You can find out which you’ve gotten by checking the array’s base attribute (see above). If your object owns its own data, as a copy does, base will be None. Otherwise, it will be the base object. As a matter of defensive programming, it often makes sense to test for any base rather than a specific base. Note above that we take a slice of a slice, and its base object isn’t the thing we sliced, but the original array.

One Thing That Can Trip You Up

For numpy to be able to provide you with a view instead of a copy, a requested slice has to be consistent with the way the data is stored in memory.

This isn’t always possible. For instance, numpy “fancy” indexing is a shorthand way to access a possibly multi-dimensional range of an array by slicing using arrays of indices.

This kind of access is not always logically compatible with returning views because (for at least one reason) the shape of the result obtained with fancy indexing is determined by the shape of the array of indices rather than by the shape of the array being indexed.

For instance, you can apply a 2×2 array of indexes to a one dimensional array in order to get a 2×2 result. This is cross-grained to the way the data is laid out in memory. Therefore, fancy indexing returns copies, not views.
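The 2×2 case described above can be sketched in a few lines. The array values are arbitrary ones I picked for illustration.

```python
import numpy as np

data = np.arange(10, 15)               # [10 11 12 13 14]
idx = np.array([[0, 1], [3, 4]])       # a 2x2 array of indices

picked = data[idx]                     # fancy indexing: result takes idx's shape
print(picked.shape)                    # (2, 2)
print(picked)                          # [[10 11]
                                       #  [13 14]]

picked[0, 0] = -1                      # modify the fancy-indexed result...
print(data[0])                         # ...and the original is untouched: 10

view = data[1:3]                       # an ordinary slice, by contrast, is a view
view[0] = -1
print(data[1])                         # -1
```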

Creating NumPy Arrays

NumPy is a library, so you need to import it as seen below. It’s a common convention to always use “import numpy as np” for consistency.

Below, we create a python list, and cast it to a numpy array. We then create some other arrays using numpy functions that produce arrays of zeros, ones, evenly spaced values, etc.

There are convenience functions for creating arrays of all ones or all zeros. Notice the “dtype” parameter, which is often useful so that your default values will match the type of the values you will later set explicitly.

import numpy as np
# An ordinary Python List
a = [1,2,3,4,5,6,7,8,9]

# Cast it to a numpy array
npa = np.array(a)
print('An array from a list: {}'.format(npa))

# The same size array 1:9 created two other ways!
npar = np.arange(1,10)
print('An array created from a range: {}'.format(npar))

# num is the number of points, so this asks for nine evenly spaced values from 1 to 9.
npl = np.linspace(start=1, stop=9, num=9, dtype='int')
print('An array of integers via linspace: {}'.format(npl))

npo = np.ones(9)
print('An array of ones with default type: {}'.format(npo))

npo = np.ones(9,dtype='int')
print('An array of ones specifying type=int: {}'.format(npo))

npz = np.zeros(9)
print('An array of zeros: {}'.format(npz))

npz = np.zeros(9,dtype='int')
print('An array of zeros: {}'.format(npz))

An array from a list: [1 2 3 4 5 6 7 8 9] 
An array created from a range: [1 2 3 4 5 6 7 8 9] 
An array of integers via linspace: [1 2 3 4 5 6 7 8 9] 
An array of ones with default type: [1. 1. 1. 1. 1. 1. 1. 1. 1.] 
An array of ones specifying type=int: [1 1 1 1 1 1 1 1 1] 
An array of zeros: [0. 0. 0. 0. 0. 0. 0. 0. 0.] 
An array of zeros: [0 0 0 0 0 0 0 0 0]
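One reason the dtype of your starting array matters: numpy stores whatever you assign in the array’s existing type. A minimal sketch, with values of my own choosing:

```python
import numpy as np

ints = np.zeros(3, dtype=int)
floats = np.zeros(3)          # the default dtype is float64

ints[0] = 2.7                 # stored in an integer array: the fraction is dropped
floats[0] = 2.7

print(ints)                   # [2 0 0]
print(floats)                 # [2.7 0.  0. ]
```

No error is raised for the integer case, which is why it pays to pick the dtype up front.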

Manipulating Shapes

Numpy can turn any 1-D array with N elements into any valid shape so long as the requested shape doesn’t imply a change in the number of elements. Remember, the shape is just a view on the underlying data, so it makes no sense to ask for a shape the data can’t conform to.

You can reshape the N=9 elements of this array [1 2 3 4 5 6 7 8 9] into a 3×3 matrix with npa.reshape(3,3) but you can’t reshape it to 3×2 or 3×4.

print('Our original 9-element array: \n{}'.format(npa))
rnpa = npa.reshape(3,3)

print('Here it is reshaped to a three-row array 3x3 \n{}'.format(rnpa))

# Note that the following commented-out code would blow up with a ValueError
# rnpa = npa.reshape(3,2)
# 
# So would
# rnpa = npa.reshape(3,4)

# Another way to change the shape: assigning 9 to the shape attribute
# turns it back into a one-D array of 9 elements.
rnpa.shape=9
print('Here it is a 1x9 array again \n{}'.format(rnpa))

# An appropriate n-tuple giving the values for each dimension sets it back again.
rnpa.shape=(3,3)
print('Here it is a 3x3 array again \n{}'.format(rnpa))

Our original 9-element array: 
[1 2 3 4 5 6 7 8 9]
Here it is reshaped to a three-row array 3x3 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
Here it is a 1x9 array again 
[1 2 3 4 5 6 7 8 9]
Here it is a 3x3 array again 
[[1 2 3]
 [4 5 6]
 [7 8 9]]
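If you want to see the failure without crashing your notebook, you can catch it. The sketch below also uses -1 as one of the dimensions, which asks numpy to work out that dimension from the number of elements. That feature isn’t covered above, so treat it as an aside.

```python
import numpy as np

npa = np.arange(1, 10)        # nine elements

failed = False
try:
    npa.reshape(3, 2)         # 3*2 = 6, which doesn't match 9 elements
except ValueError as err:
    failed = True
    print('reshape(3, 2) failed: {}'.format(err))

# -1 means "infer this dimension": 9 elements in 3 rows gives 3 columns.
inferred = npa.reshape(3, -1)
print(inferred.shape)         # (3, 3)
```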

Arithmetic And Similar Operations

You can do basic arithmetic operations on an array and a scalar. Given an array A and a number n, A+n adds n to each element of A. The operation creates a new array; it does not modify A.

If two arrays have the same shape and size you can do element-wise operations on them, such as add the two arrays. If A and B are arrays of the same size, A + B creates an array of equal size where each element is the sum of the corresponding elements of A and B.

You can do the same with multiplication by using the * operator. Beware that this gives you what is called the Hadamard product, which is an array of the same shape as the originals where each element is the product of the corresponding elements in the original arrays.

You can also do dot-products, matrix multiplication and various elements of linear algebra, but these are accomplished with explicit function calls such as dot() (or, for matrix multiplication, the @ operator) rather than with the ordinary arithmetic operators. There’s a lot you can do in this area using numpy, but in almost all cases I can think of, you’d probably want to use SciPy for these calculations. Numpy provides operations on arrays, but it’s not really geared to linear algebra or other domain-specific computing the way SciPy is.

Some Basic Matrix-Like Array Operations

a1 = np.array([1,2,3])
a3 = a1 * 2
print('Product of array and scalar: {}'.format(a3))

a3 = a1 + 2
print('Sum of array and scalar: {}'.format(a3))

## We make a second array
a2 = np.array([4,5,6])

a3 = a1 + a2
print('Element-wise sum of two arrays: {}'.format(a3))

a3 = a1 * a2
print('Element-wise product of two arrays: {}'.format(a3))

adotb = a1.dot(a2)
print('Dot product of a and b: {}'.format(adotb))

Product of array and scalar: [2 4 6] 
Sum of array and scalar: [3 4 5] 
Element-wise sum of two arrays: [5 7 9] 
Element-wise product of two arrays: [ 4 10 18] 
Dot product of a and b: 32 
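With 1-D arrays the difference between the Hadamard product and a true matrix product is easy to miss, so here is a small 2-D sketch. The matrices are arbitrary examples of my own.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

hadamard = A * B      # element-wise: each entry is A[i,j] * B[i,j]
matmul = A @ B        # matrix product: rows of A times columns of B

print(hadamard)       # [[ 5 12]
                      #  [21 32]]
print(matmul)         # [[19 22]
                      #  [43 50]]
```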
 

Some Of The Available Built-In Functions

There are a tremendous number of mathematical operations available. Some are applied to each element, others pairwise to corresponding elements, as well as aggregations over an array.

  • All standard trigonometry functions and degree/radian conversions
  • Rounding, floor, ceiling, truncation, etc.
  • Numerous logarithmic and exponential functions in various bases
  • Mathematical functions applied elementwise among pairs of arrays including arithmetic, powers, boolean and comparisons, modulus, quotients, remainders, etc.
  • Complex number operations, conjugation, etc.
  • Roots, powers, absolute values, signs, interpolation, etc.
  • Aggregation operations such as mean, median, stdev, sum, product, cumulative sums, min, max, all, any, etc.

Example of applying a comparison operation element-wise to two arrays.

Given a1 and a2, arrays of equal size, np.greater(a1, a2) results in an array of True/False values, one per position, saying whether each element of a1 is greater than its counterpart in a2.

from numpy import random

# Get 10 consecutive integers
a = np.arange(1,11)
print('Ten consecutive numbers 1:10: \n{}'.format(a))

# Get 10 random integers; randint(10) draws from 0 through 9
r = random.randint(10,size=10)
print('Ten random integers in range [0:9]: \n{}'.format(r))

# There is also greater_equal, less, less_equal.
b = np.greater(a,r)
print('An element-wise comparison a>r: \n{}'.format(b))

Ten consecutive numbers 1:10: 
[ 1  2  3  4  5  6  7  8  9 10]
Ten random integers in range [0:9]: 
[3 2 6 8 2 4 8 7 0 4]
An element-wise comparison a>r: 
[False False False False  True  True False  True  True  True]
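The comparison functions also have operator spellings, and the boolean array they return can be used to count or select elements. A short sketch, reusing the random values from the sample output above as a fixed array so the result is repeatable:

```python
import numpy as np

a = np.arange(1, 11)
r = np.array([3, 2, 6, 8, 2, 4, 8, 7, 0, 4])   # fixed copy of the sample values

mask = a > r                                    # same result as np.greater(a, r)
same = np.array_equal(mask, np.greater(a, r))

count = int(mask.sum())                         # True counts as 1
selected = a[mask]                              # keep only elements where mask is True

print(same)       # True
print(count)      # 5
print(selected)   # [ 5  6  8  9 10]
```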

Applying a simple math function to each element of an array

The first two of these are the more common kind of per-element operation, simply applying the function to each element independently.

The third example is slightly different, because it accumulates a sum, returning an 𝑛-element result for an 𝑛-element input. There are a few other functions that work like this.

print('Square root of first 10 elements of an array: \n{}'.format(np.sqrt(r[0:10])))
# a[1:10] skips element 0, so this squares the nine values 2 through 10.
print('Square of elements 1 through 9 of an array: \n{}'.format(np.square(a[1:10])))

print('Cumulative sum:{}'.format(r[0:10].cumsum()))

Square root of first 10 elements of an array: 
[1.73205081 1.41421356 2.44948974 2.82842712 1.41421356 2.
 2.82842712 2.64575131 0.         2.        ]
Square of elements 1 through 9 of an array: 
[  4   9  16  25  36  49  64  81 100]
Cumulative sum:[ 3  5 11 19 21 25 33 40 40 44]
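A few of the other accumulating operations, sketched on a small array of my own so the results are easy to check by hand:

```python
import numpy as np

x = np.array([3, 1, 4, 1, 5])

running_sum = np.cumsum(x)               # running total
running_product = np.cumprod(x)          # running product
running_max = np.maximum.accumulate(x)   # largest value seen so far

print(running_sum)       # [ 3  4  8  9 14]
print(running_product)   # [ 3  3 12 12 60]
print(running_max)       # [3 3 4 4 5]
```

Each returns an array the same length as its input, just as cumsum does above.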

Apply a custom function to each element of an array

The numpy math functions understand an array used as an argument. A plain vanilla Python math function usually does not. If you try the same thing with, say, math.sqrt(nparray) it will blow up with a TypeError because it doesn’t mean anything to take the square root of an array. The same thing goes for custom functions you write yourself.

Despair not. Numpy has a function, np.vectorize(a_function) that wraps your function with the mechanism to handle the elements individually. Given a function as an argument, it returns a wrapper function that can take an array as input and apply your function to the elements.

Below, we import Python’s math package and construct a function that we’d like to apply to entire arrays.

import math

## math.cos only accepts a single number, so this function works on one value at a time.
def wacky_func(x):
    return math.cos(x) * x

# The following fails with a TypeError
# print('Apply a custom function to each element: \n{}'.format(wacky_func(a[1:5])))

wacky_vec = np.vectorize(wacky_func)

print('The array we want to apply our function to: \n{}'.format(a[1:5]))

print('Apply a custom function to each element: \n{}'.format(wacky_vec(a[1:5])))


The array we want to apply our function to: 
[2 3 4 5]
Apply a custom function to each element: 
[-0.83229367 -2.96997749 -2.61457448  1.41831093]
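It’s worth knowing that np.vectorize is a convenience wrapper. The NumPy documentation describes it as essentially a for loop, so it doesn’t give you the speed of numpy’s built-in functions. When your function can be written with numpy’s own operations, that version is usually the better choice. A sketch comparing the two for the same function:

```python
import math
import numpy as np

def wacky_func(x):
    # Scalar-only version, as above.
    return math.cos(x) * x

a = np.arange(1, 11)
subset = a[1:5]                                    # [2 3 4 5]

via_vectorize = np.vectorize(wacky_func)(subset)   # wrapper loops over the elements
via_numpy = np.cos(subset) * subset                # numpy's own element-wise functions

print(np.allclose(via_vectorize, via_numpy))       # True
```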

Aggregation Functions

You don’t always want to apply a function to 𝑛 elements to get 𝑛 results. Sometimes you are looking for a single aggregated result, such as an average.

Numpy has numerous aggregating functions for sum, mean, std, etc.

print('Some aggregations')
a = np.arange(1,101)
r = random.randint(100,size=100)
print('Sum of consecutive:{} Sum of random {}'.format(a.sum(), r.sum()))
print('Consecutive array min:{} max:{} mean:{} stdev:{}'.format(a.min(), a.max(), a.mean(), a.std()))
print('Random array min:{} max:{} mean:{} stdev:{}'.format(r.min(), r.max(), r.mean(), r.std()))

Some aggregations
Sum of consecutive:5050 Sum of random 4993
Consecutive array min:1 max:100 mean:50.5 stdev:28.86607004772212
Random array min:1 max:96 mean:49.93 stdev:26.23556936679667
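On a multi-dimensional array, the same aggregations can also be applied along one dimension at a time by passing an axis argument. A minimal sketch on a 2×3 array of my own:

```python
import numpy as np

m = np.arange(1, 7).reshape(2, 3)   # [[1 2 3]
                                    #  [4 5 6]]

total = m.sum()                     # everything collapsed to one number
col_sums = m.sum(axis=0)            # collapse the rows: one sum per column
row_means = m.mean(axis=1)          # collapse the columns: one mean per row

print(total)       # 21
print(col_sums)    # [5 7 9]
print(row_means)   # [2. 5.]
```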

Indexing

Indexing is a big subject in numpy. It’s a few blog posts all by itself so I won’t try to cover it all. Numpy provides a variety of ways to access subsets of an array. We can look at individual elements, rows, columns, or combinations. Indexing into higher-dimension arrays is similar to indexing into a 2-D array, so I’ll just hit some high points about 2-D arrays.

The arrays are in row-major order, so in a 2-D array, the first index selects the row and the next selects the column.

There are two ways to name a specific element of a two-D array. a[r,c] and a[r][c] mean the same value, i.e., the value at row r and column c.

A colon inside the square brackets separates the first and last elements you want. If there is nothing on the left of the colon, it means “starting at 0”. If there is nothing on the right it means “up to the end”. If there is a number to the right, think of it as “up to but not including”, so the end value will be one bigger than the largest index you want.

Remember, the : separates beginning and end in one dimension, but you can have many dimensions and each can use this notation. This means you can extract rectilinear slices in any number of dimensions. My own ability to visualize higher dimensional shapes starts to drop with one dimension and plunges precipitously after two.
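Before the longer walk-through, here is a compact sketch of slicing in two dimensions at once, using a comma inside one pair of brackets. It also shows one place where chained brackets and the comma form part ways, which is an observation of my own rather than something stated above.

```python
import numpy as np

a = np.arange(1, 10).reshape(3, 3)   # [[1 2 3]
                                     #  [4 5 6]
                                     #  [7 8 9]]

first_two_rows = a[:2, :]    # rows 0 and 1, all columns
middle_column = a[:, 1]      # all rows, column 1
lower_right = a[1:, 1:]      # rows 1..end, columns 1..end

# Chained brackets apply one after the other: a[:] is the whole array,
# so a[:][1] is row 1, not column 1.
chained = a[:][1]

print(first_two_rows)   # [[1 2 3]
                        #  [4 5 6]]
print(middle_column)    # [2 5 8]
print(lower_right)      # [[5 6]
                        #  [8 9]]
print(chained)          # [4 5 6]
```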

a = np.array([1,2,3,4,5,6,7,8,9])
a.shape=(3,3)
print('The raw array: \n{}'.format(a))

## The element at the 0th row and 0th column.
print('\nAn element indexed by [0,0]:{}'.format(a[0,0]))
print('Same element indexed by [0][0]:{}'.format(a[0][0]))

## all of the columns in the second row, i.e., the 1th row, as we number from zero.
print('\nAll columns of the second row of the array: \n{}'.format(a[1][:]))

## All of the columns in the rows from 0 to 1, inclusive.
##  We could drop the 0 without changing the returned slice.
print('\nall columns in rows 0 and 1: \n{}'.format(a[0:2][:]))
print('all columns in rows 0 and 1: \n{}'.format(a[:2][:]))

print('\nThere is a difference between an element, and an element alone in an array.')
t = a[0][0]
print('[0][0] yields a scalar value:{}  type:{}'.format(t,type(t)))
t = a[0][0:1]
print('[0][0:1] yields a one dimensonal array with one value:{}  type:{}'.format(t,type(t)))
 
print('\nLast two elements in the first row: \n{}'.format(a[0][1:3]))

biga = np.arange(36)
biga.shape=(3,3,4)
print('\nA 3-D array shaped:{}, an array of arrays of arrays \n{}'.format(biga.shape,biga))
sub = biga[2,2,3]
print('\nThe far corner, a[2,2,3] : {}'.format(sub))
print('Notice, each index is one less than the size.')

print('\nWe have a 3-D array, so a[0], the first element in dimension 1 is 2-D array')
sub = biga[0]
print('{}'.format(sub))
print('The first element of first element of dimension 1, a[0,0] is a 1-D array')
sub = biga[0,0]
print('\nfirst row: \n{}'.format(sub))
sub = biga[0,0,0]
print('\nThe first element of the first element of the first element, a[0,0,0] is {}'
     .format(sub)) 

The raw array: 
[[1 2 3]
 [4 5 6]
 [7 8 9]]

An element indexed by [0,0]:1
Same element indexed by [0][0]:1

All columns of the second row of the array: 
[4 5 6]

all columns in rows 0 and 1: 
[[1 2 3]
 [4 5 6]]
all columns in rows 0 and 1: 
[[1 2 3]
 [4 5 6]]

There is a difference between an element, and an element alone in an array.
[0][0] yields a scalar value:1  type:<class 'numpy.int64'>
[0][0:1] yields a one dimensonal array with one value:[1]  type:<class 'numpy.ndarray'>

Last two elements in the first row: 
[2 3]

A 3-D array shaped:(3, 3, 4), an array of arrays of arrays 
[[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]

 [[12 13 14 15]
  [16 17 18 19]
  [20 21 22 23]]

 [[24 25 26 27]
  [28 29 30 31]
  [32 33 34 35]]]

The far corner, a[2,2,3] : 35
Notice, each index is one less than the size.

We have a 3-D array, so a[0], the first element in dimension 1 is 2-D array
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
The first element of first element of dimension 1, a[0,0] is a 1-D array

first row: 
[0 1 2 3]

The first element of the first element of the first element, a[0,0,0] is 0

Fancy Indexing

We touched on fancy indexing above. The things to remember are:

  • Fancy indexing uses arrays of values instead of ranges. The values can be either literal indexes or booleans in matching positions.
  • Fancy indexing returns copies of the data, rather than views.
  • The shape of the output is determined by the indexing, not by the underlying data. A two-D array can result from fancy-indexing on a one-D array.

Fancy indexing takes some practice, but it can make it almost trivial to extract complex subsets of structured data that would be laborious to encode by hand.
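
The three points above can be sketched in a few lines. The array data and the index values are invented for this illustration.

```python
import numpy as np

data = np.arange(10, 20)            # hypothetical 1-D array holding 10..19

# Literal indexes: any order, repeats allowed.
picked = data[[0, 3, 3, 9]]

# Booleans in matching positions: keep the elements where the mask is True.
mask = data % 2 == 0
evens = data[mask]

# The output takes the shape of the index array, so a 1-D source
# can yield a 2-D result.
idx = np.array([[0, 1], [8, 9]])
grid = data[idx]

# Fancy indexing returns a copy: writing to it leaves data untouched.
copied = data[[0, 1, 2]]
copied[0] = 99
after_copy_write = data[0]

# A plain slice returns a view: writing to it changes data.
window = data[0:3]
window[0] = 99
after_view_write = data[0]

print('picked: {}'.format(picked))
print('evens:  {}'.format(evens))
print('grid shape {}:\n{}'.format(grid.shape, grid))
print('data[0] after writing to the copy: {}'.format(after_copy_write))
print('data[0] after writing to the view: {}'.format(after_view_write))
```

The copy-versus-view difference is the one most likely to surprise you: modifying the result of a slice changes the original array, while modifying the result of fancy indexing does not.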

That’s All The NumPy That Fits!

Numpy is a large subject. We’re just scratching the surface here. There are more gaps than coverage in every topic in this post. But time spent getting fluent with NumPy will be well rewarded.

Maybe next time we can take a quick look at displaying numpy results graphically.
