
A Pilgrim’s Progress #4: Panda Series

This is the fourth in a series of posts charting the progress of a programmer starting out in data science. The first post is A Pilgrim’s Progress #1: Starting Data Science. The previous post is A Pilgrim’s Progress #3: NumPy.

I’m trying something new out here. These posts are coded in Jupyter, which is an extremely handy way to intermingle text and executable code. It comes with Anaconda, which is the best way to get everything going if you’re starting out. For the first couple of posts I cut-and-pasted the material over to WordPress. This time I downloaded the Jupyter notebook as HTML and pasted it in. It’s painful to edit once pasted, so it’s far from a perfect solution, but it’s 100x faster. Any ideas?

Pandas are insanely versatile and capable of far more than I’ve covered in this already excessively long set of notes. At best this is a way to get an idea of how they work and a quick tour of what they look like in use.

Pandas Part One: Series

The Pandas library is one of the most widely used in the data science world. We’ll look at the two most popular Pandas data structures, the one-D Series and the two-D DataFrame and leave its lesser-used three-D and four-D data structures for another time. Series and DataFrame are higher-level constructs built on the NumPy library. The Pandas library is just a tool but Series and DataFrame are so widely used that anyone working in the field needs to be fully conversant with them. We’ll look mostly at Pandas Series in this post and the DataFrame next time.

Series and DataFrame provide much convenience when working with one and two dimensional data. These are by far the most common data dimensionalities, so the library covers a lot of data science turf. Both data structures are used to hold the data that fuels algorithms, but they also offer numerous features for the other half of a data scientist’s life: the mundane, time-consuming, and endlessly diverse task of wrangling data:

  • Data Wrangling
  • Data Aggregation and Transformation
  • Reading and Writing to disk or SQL
  • Data Alignment
  • Merging and Joining Data Sets

Why Not Just Use NumPy?

You could. People do. But Pandas are higher-level, providing many conveniences for things that you’d otherwise have to code yourself. Every time you wrote a data-based program you’d basically be rewriting things that are already in Pandas, and probably in a less robust and more buggy way, because Pandas have been beaten on by experts for years.

One of the key features of Pandas data structures is known as intrinsic data alignment. Both Series and DataFrame objects have indices that remain attached to their corresponding entries until they are explicitly changed. This is true even for slices of the original data structure. They carry their labels with them. Positional access is also available, but index-based access means that data from multiple sets can be merged easily. “Alignment” means that pair-wise operations will apply to pairs of similarly indexed values in the data, with orderly handling of cases where an entry in one data structure has no match in the other.
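A minimal sketch of alignment in action (assuming pandas is imported as pd, as in the code cells below):

```python
import pandas as pd

# Two series that share labels 'b' and 'c'; 'a' and 'd' each appear in only one.
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition pairs up values by label, not by position.
total = s1 + s2
print(total)  # 'b' -> 12, 'c' -> 23; unmatched 'a' and 'd' come out NaN
```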

Almost anything you can think of in the way of handling data is already there. All you have to do is think about the problem you’re trying to solve.

The Pandas Series

Series is a one-dimensional data storage structure that is useful for time series and similar applications. Most of the basic concepts will be similar for DataFrames as well, but we’ll look at those in the next post. It’s good to know about Series even if all your data is two-D, because the columns of a DataFrame are not only functionally similar to Series objects, they literally are Series objects under the covers.

For most purposes, you should think of a series as being of uniform type, e.g., all integers, or all floats, or all strings. You actually can create a series with a mix of, say, integers and strings, but many functions won’t work. Technically, even once you mix types the series still has a single uniform type, because all the elements will be promoted to object type rather than remaining strings and integers, but mixing the underlying types will still defeat many operations you might wish to do. The series’ dtype attribute will tell you the data type of its elements.
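A quick check of that promotion, using the dtype attribute:

```python
import pandas as pd

homogeneous = pd.Series([1, 2, 3])
mixed = pd.Series([1, 'two', 3])  # mixing types promotes everything to object

print(homogeneous.dtype)  # int64
print(mixed.dtype)        # object
```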

One place that the keep-it-homogeneous rule might be violated is when converting the elements of a series to some new type. The easiest way to do this is by creating a new series from the old one, but if the data sets are huge, you might want to do it in place. If so, while the conversion is going on, your series will host two different types. For instance, you might have a set of test scores between 0.0 and 100.0. You could write a function that buckets integer or floating-point values into categories A, B, …, F or into P/F.
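As a sketch, here’s one way such a bucketing function might look, using apply() to build a new series (the cutoffs are made up for illustration):

```python
import pandas as pd

def letter_grade(score):
    # Illustrative cutoffs only: 90/80/70/60; anything below 60 is an F.
    for cutoff, grade in [(90, 'A'), (80, 'B'), (70, 'C'), (60, 'D')]:
        if score >= cutoff:
            return grade
    return 'F'

scores = pd.Series([95.0, 72.5, 88.0, 54.0])
grades = scores.apply(letter_grade)  # a new series; the original is untouched
print(grades)
```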

Creating a Series

A series can be created from pretty much anything that has the nature of being a row of something, be it an ndarray, a dict, or a plain list. You can even make one from a scalar plus the number of elements you want.

If no index is specified, the series is automatically indexed with consecutive integers starting with zero. This can trip you up at first because the positions and the indices will happen to be the same. That’s really just a fluke. The indices could have started anywhere and been in any order. They don’t even have to be unique, or even numbers. If you choose to specify the index, its entries must be a hashable type, usually string labels or integers. You can’t change individual index entries, but you can replace the index entirely simply by swapping in a new list of hashables.

The following are some ways to create a series.

In [652]:
import numpy as np
import pandas as pd
import random as rand
# Creation from just a list
lst = rand.sample(range(0, 100), 6)
ser = pd.Series(lst)
print('Series from just a list: \n{}'.format(ser))
index = ['A','B','C','D','E','F']
ser = pd.Series(lst, index=index)
print('\nSeries from a list and list of labels: \n{}'.format(ser))
altindex = ['Q','R','S','T','U','V']
ser.index=altindex
print('\nThe same series with index replaced after creation: \n{}'.format(ser))

# Creation from a dict 
vals = [65,66,67,68,69,79]
dict = {index[i]: vals[i] for i in range(len(index))} 
print('\nA dictionary: \n{}'.format(dict))
ser = pd.Series(dict)
print('\nSeries from dictionary: \n{}'.format(ser))

# Creation from a numpy array
nda = np.array(lst)
ser = pd.Series(nda)
print('\nSeries from numpy array: \n{}'.format(ser))
ser = pd.Series(nda, index=index)
print('\nSeries from numpy array + index: \n{}'.format(ser))
  
ser = pd.Series(0,index=index)
ser['D']=1
print('\nSeries from a scalar and index array. one element has been hand-set: \n{}'.format(ser))
Series from just a list: 
0    26
1    64
2    21
3    66
4    11
5    37
dtype: int64

Series from a list and list of labels: 
A    26
B    64
C    21
D    66
E    11
F    37
dtype: int64

The same series with index replaced after creation: 
Q    26
R    64
S    21
T    66
U    11
V    37
dtype: int64

A dictionary: 
{'A': 65, 'B': 66, 'C': 67, 'D': 68, 'E': 69, 'F': 79}

Series from dictionary: 
A    65
B    66
C    67
D    68
E    69
F    79
dtype: int64

Series from numpy array: 
0    26
1    64
2    21
3    66
4    11
5    37
dtype: int64

Series from numpy array + index: 
A    26
B    64
C    21
D    66
E    11
F    37
dtype: int64

Series from a scalar and index array. one element has been hand-set: 
A    0
B    0
C    0
D    1
E    0
F    0
dtype: int64

Accessing and Slicing a Series

There are quite a few ways to access the data in a Series. Below are some of the ways you can index by label.

  • The first access is a single element.
  • The second through fourth accesses produce slices.
  • The fifth is accessed through “fancy indexing”, i.e., using an array of indices.
In [653]:
index = ['A','B','C','D','E','F']
ser = pd.Series(lst, index=index)
print('Our list: \n{}'.format(ser))
print('\nElement C: {}'.format(ser['C']))
print('\nSlice from C to the end: \n{}'.format(ser['C':]))
print('\nSlice from beginning to C: \n{}'.format(ser[:'C']))
print('\nSlice from B to C: \n{}'.format(ser['B':'C']))

fi_list = ['B','C','F']
print('\nFancy indexing: \n{}'.format(ser[fi_list]))
Our list: 
A    26
B    64
C    21
D    66
E    11
F    37
dtype: int64

Element C: 21

Slice from C to the end: 
C    21
D    66
E    11
F    37
dtype: int64

Slice from beginning to C: 
A    26
B    64
C    21
dtype: int64

Slice from B to C: 
B    64
C    21
dtype: int64

Fancy indexing: 
B    64
C    21
F    37
dtype: int64

Slices Are Views

We can modify a series. Notice the series created from a scalar and an array of indices–one of the values was manually set to one.

Below we take a slice running from index ‘C’ to the end and set the value at index ‘D’ to 1000. It does what you’d expect: when we print the slice we see that the value at ‘D’ is 1000.

But notice that the value for ‘D’ in the original series is changed to 1000 as well. That’s because the slice is just a view on the series, not an independent copy. This is just as you’d see with a one-dimensional NumPy array (because Pandas data structures are built on NumPy data structures).

Fancy Index Access Yields a Copy

Let’s try the same thing with a similar slice obtained via fancy indexing, in which we supply a list of indices instead of a range. (Note the cute trick for setting all the values at the same time.) It does what you’d expect to the slice, but when we print the original series we see that it’s unchanged. That’s because fancy indexing gives you a copy, not a view. This isn’t an oversight by the Pandas developers. In this case, the array of indices happened to correspond to the way the values were laid out in memory, but it didn’t have to. In general, fancy index access is not consistent with the underlying array structure. In the example after that, we see fancy indexing used to obtain a series with the values out of order.

In [654]:
slice = ser['C':]
print('Slice from C to the end: \n{}'.format(slice))
slice['D']=1000
print('Slice from C to the end: \n{}'.format(slice))
print('The original series: \n{}'.format(ser))


fi_list = ['C','D','E','F']
fi_slice = ser[fi_list]
print('Same slice obtained with fancy indexing: \n{}'.format(fi_slice))
fi_slice[:]=0 
print('All values in the fancy-index slice changed: \n{}'.format(fi_slice))
print('The original series is unchanged: \n{}'.format(ser))


fi_list = ['A','F','C','E','B','D']
fi_slice=ser[fi_list]
print('A non-slice obtained with fancy indexing: \n{}'.format(fi_slice))

slice = ser[0:3]
print('A slice obtained with numeric indices: \n{}'.format(slice)) 
slice = ser[[1,3,5]]
print('Fancy indexing with numeric indices: \n{}'.format(slice)) 
Slice from C to the end: 
C    21
D    66
E    11
F    37
dtype: int64
Slice from C to the end: 
C      21
D    1000
E      11
F      37
dtype: int64
The original series: 
A      26
B      64
C      21
D    1000
E      11
F      37
dtype: int64
Same slice obtained with fancy indexing: 
C      21
D    1000
E      11
F      37
dtype: int64
All values in the fancy-index slice changed: 
C    0
D    0
E    0
F    0
dtype: int64
The original series is unchanged: 
A      26
B      64
C      21
D    1000
E      11
F      37
dtype: int64
A non-slice obtained with fancy indexing: 
A      26
F      37
C      21
E      11
B      64
D    1000
dtype: int64
A slice obtained with numeric indices: 
A    26
B    64
C    21
dtype: int64
Fancy indexing with numeric indices: 
B      64
D    1000
F      37
dtype: int64

Series Elements Are Mutable

You can modify the values in a series by assigning to them using any of the methods you might use for directly accessing or slicing the series.

You can also extend a series. It’s as simple as assigning to a label that doesn’t exist yet.

In [655]:
ser = pd.Series(rand.sample(range(0, 100), 3))
print('The series: \n{}'.format(ser))
ser[1]=0
print('The series after setting ser[1]=0: \n{}'.format(ser))
ser[3]=0
print('The series after setting a value for a non-existent index ser[3]=0: \n{}'.format(ser))
ser[-1]=0
print('They really aren\'t positions. The index -1 is just another label: \n{}'.format(ser))
The series: 
0    71
1    57
2    87
dtype: int64
The series after setting ser[1]=0: 
0    71
1     0
2    87
dtype: int64
The series after setting a value for a non-existent index ser[3]=0: 
0    71
1     0
2    87
3     0
dtype: int64
They really aren't positions. The index -1 is just another label: 
 0    71
 1     0
 2    87
 3     0
-1     0
dtype: int64

Positions and Indices Aren’t The Same Thing

We saw above that when we indexed by strings the results still had integer positions that you could use for access.

Here, we use the default pandas-generated sequential indices, which look like array indices but aren’t. Whether you take a slice via a range, or access a set of values via fancy indexing, the results retain the numeric labels.

We’ve seen that changing the values in a slice obtained via a range changes the values in the underlying original series. Note however that the index of the slice is an attribute of the slice, not of the underlying data. It defaults to the index of the original series but you can change it without affecting the original.

We also reset the index on the new Series obtained via fancy indexing.

Don’t forget–even if you obtain a view via a slice, you can always make a copy of it if it suits your needs.

In [656]:
lst = rand.sample(range(0, 100), 6)
ser = pd.Series(lst)
print('Series from just a list: \n{}'.format(ser))

slice = ser[3:]
print('A slice (view) from 3 to the end: \n{}'.format(slice))
slice.index=[0,1,2]
print('Reset index on slice to make it consecutive from zero: \n{}'.format(slice))
print('The reset leaves the original series unchanged: \n{}'.format(ser))

fi_index = [5,3,1]
slice = ser[fi_index]
print('A new series via fancy indexing retains the numeric labels: \n{}'.format(slice)) 
 
slice.index=[0,1,2]
print('The new series re-indexed: \n{}'.format(slice))
Series from just a list: 
0    59
1    74
2    40
3    33
4    22
5    77
dtype: int64
A slice (view) from 3 to the end: 
3    33
4    22
5    77
dtype: int64
Reset index on slice to make it consecutive from zero: 
0    33
1    22
2    77
dtype: int64
The reset leaves the original series unchanged: 
0    59
1    74
2    40
3    33
4    22
5    77
dtype: int64
A new series via fancy indexing retains the numeric labels: 
5    77
3    33
1    74
dtype: int64
The new series re-indexed: 
0    77
1    33
2    74
dtype: int64
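The point about making a copy deserves a quick sketch: calling .copy() on a slice gives you an independent series, so later writes no longer reach back into the original.

```python
import pandas as pd

ser = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

independent = ser['B':'D'].copy()  # an independent copy, not a view
independent['C'] = 999             # modifies only the copy

print(ser['C'])          # still 30
print(independent['C'])  # 999
```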

Indices Aren’t Always Unique

While we’re on the subject of indexing, beware that indexes don’t have to be unique. There are occasions when non-unique indices are a desired feature of your program, but often you create them when you’re concatenating series objects.

Below, we concatenate s, with four elements, and s2, with three. The result is a seven-element series with two pairs of identical indices (3 and 4).

To remedy this we simply replace the index with a list generated using range(0,cat.size).

In [657]:
s = pd.Series([23,72,78,11], index=[1,2,3,4])
s2 = pd.Series([1,2,3], index=[3,4,5])
cat = pd.concat([s,s2])
print('Some indices appear twice: \n{}\n'.format(cat))
cat.index=list(range(0,cat.size))
print('The re-indexed list: \n{}'.format(cat))
Some indices appear twice: 
1    23
2    72
3    78
4    11
3     1
4     2
5     3
dtype: int64

The re-indexed list: 
0    23
1    72
2    78
3    11
4     1
5     2
6     3
dtype: int64
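pandas can also build the replacement index for you: reset_index(drop=True) swaps in a fresh RangeIndex (drop=True discards the old labels instead of keeping them as data).

```python
import pandas as pd

s = pd.Series([23, 72, 78, 11], index=[1, 2, 3, 4])
s2 = pd.Series([1, 2, 3], index=[3, 4, 5])
cat = pd.concat([s, s2])            # labels 3 and 4 each appear twice

clean = cat.reset_index(drop=True)  # re-labeled 0..6, values unchanged
print(clean)
```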

Loc and ILoc

You can also access the elements of a series with the loc and iloc mechanisms, as below. The loc[…] mechanism lets you access individual elements or slices via labels, while the iloc[…] mechanism ignores the labels and goes by position. Even if you use numeric indexes, iloc will still ignore them and use positions.

In [658]:
lst = rand.sample(range(0, 100), 6)
index = ['A','B','C','D','E','F']
ser = pd.Series(lst, index=index)
print('Our list: \n{}'.format(ser))
v = ser.loc['A']
print('\nser.loc[\'A\'] \n{}'.format(v))

v = ser.loc['A':'D']
print('\nser.loc[\'A\':\'D\'] \n{}'.format(v))

v = ser.iloc[1:4]
print('\nser.iloc[1:4] \n{}'.format(v))
 
# Same array of values, but different numeric indices
# Notice that when we use iloc, it doesn't matter whether the indices are numbers--it's 
# the positions that matter. The index numbers are really just labels.
#
# Using iloc[0:3] gets the first three positions regardless of the numeric index  
index = [4,5,6,7,8,9]
ser = pd.Series(lst, index=index)
v = ser.iloc[0:3]
print('\nser.iloc[0:3] \n{}'.format(v))
 
ser = pd.Series(lst)
v = ser.iloc[0:3]
print('\nser.iloc[0:3] \n{}'.format(v))
Our list: 
A    96
B    12
C    36
D    43
E    34
F    13
dtype: int64

ser.loc['A'] 
96

ser.loc['A':'D'] 
A    96
B    12
C    36
D    43
dtype: int64

ser.iloc[1:4] 
B    12
C    36
D    43
dtype: int64

ser.iloc[0:3] 
4    96
5    12
6    36
dtype: int64

ser.iloc[0:3] 
0    96
1    12
2    36
dtype: int64

Filtering a Series With a Function

We create two Boolean-valued lambda functions to apply to a series to filter out a subset of values.

In the first case, we apply the lambda directly in brackets.

In the second example we create a function to simplify the syntax.

In [659]:
lst = list(rand.sample(range(0, 8), 8))
ser = pd.Series(lst)
limit=3 

print('ser is an 8-element series of consecutive values randomly reordered: \n{}'.format(ser)) 
print('\nThe series index: {}'.format(ser.index)) 

print('\nHere we have a lambda function inlined in the brackets to extract the range we want.')
new = ser[(lambda x,y: x>y)(ser,limit)]
print('ser[(lambda x,y: x>y)(ser,limit)] gets values > {}: \n{}'.format(limit,new))

print('\nIt\'s the same thing but more readable if you declare the lambda function separately.')
f = lambda x,y: x>y
print('Testing a boolean lambda function that tests 5>3:{}'.format(f(5,3))) 

new = ser[f(ser,limit)]
print('Apply f() ser[f(ser,{})]to get values > {}: \n{}'.format(limit,limit,new))

print('\nIt\'s even more readable if you hide the syntax behind a function call.')

def lt_filter(series, function, limit):
    return series[function(series, limit)]

new = lt_filter(ser,f,limit)
print('Apply lt_filter(ser,f,limit) to get values > {}: \n{}'.format(limit,new))
ser is an 8-element series of consecutive values randomly reordered: 
0    6
1    1
2    3
3    5
4    0
5    2
6    4
7    7
dtype: int64

The series index: RangeIndex(start=0, stop=8, step=1)

Here we have a lambda function inlined in the brackets to extract the range we want.
ser[(lambda x,y: x>y)(ser,limit)] gets values > 3: 
0    6
3    5
6    4
7    7
dtype: int64

It's the same thing but more readable if you declare the lambda function separately.
Testing a boolean lambda function that tests 5>3:True
Apply f() ser[f(ser,3)]to get values > 3: 
0    6
3    5
6    4
7    7
dtype: int64

It's even more readable if you hide the syntax behind a function call.
Apply lt_filter(ser,f,limit) to get values > 3: 
0    6
3    5
6    4
7    7
dtype: int64

Operations

Vectorized Operations

Vectorized operations pair up corresponding elements of two series by index, whether the indices are strings or numbers. Matching is not positional, as you might expect based on the model of an array or vector.

The two series below, v1 and v2, are initially indexed by the same sequence [0,1,2,3,4]. When we do vector addition, the indices just happen to match the positions. Look what happens when we assign a different index to one of the series.

In [660]:
index = [0,1,2,3,4]
v1 = pd.Series([1,2,3,4,5],index=index)
v2 = pd.Series([6,7,8,9,10],index=index)
print('v1: \n{}'.format(v1))
print('v2: \n{}'.format(v2))
vsum = v1 + v2
print('\nThe sum of v1+v2 has each added to the value with the same index: \n{}'.format(vsum)) 
print('Therefore the result has five values. ')
rindex = [1,4,3,2,0] 
v2.index=rindex  
print('\nWe reorder one index and leave the other the same. v2: \n{}'.format(v2))
vsum = v1 + v2
print('\nWe still get a five element result but different sums: \n{}'.format(vsum))

print('\nRe-indexing the second series to include two indices that are not in the first series')
rindex = [0,1,2,11,12] 
v2.index=rindex  
print('v2: \n{}'.format(v2))
vsum = v1 + v2
print('\nThe vector sum now has NaN\'s for elements that do not match: \n{}'.format(vsum))
v1: 
0    1
1    2
2    3
3    4
4    5
dtype: int64
v2: 
0     6
1     7
2     8
3     9
4    10
dtype: int64

The sum of v1+v2 has each added to the value with the same index: 
0     7
1     9
2    11
3    13
4    15
dtype: int64
Therefore the result has five values. 

We reorder one index and leave the other the same. v2: 
1     6
4     7
3     8
2     9
0    10
dtype: int64

We still get a five element result but different sums: 
0    11
1     8
2    12
3    12
4    12
dtype: int64

Re-indexing the second series to include two indices that are not in the first series
v2: 
0      6
1      7
2      8
11     9
12    10
dtype: int64

The vector sum now has NaN's for elements that do not match: 
0      7.0
1      9.0
2     11.0
3      NaN
4      NaN
11     NaN
12     NaN
dtype: float64

Vectorized Operations With Indices That Are Not Unique

We saw that if the same set of distinct indices appears in two different series objects that are operands of a vectorized operation, the resulting series is of the same length as the operands.

We also saw that if the two operands each have distinct indices but the indices differ between the two series, then the result will have the cardinality of the union of the two sets of indices. Each matching pair will contribute one element and each unmatched element in each series will create one NaN.

But there is no requirement that the indices of a series be distinct, and for many reasons it is not unusual to have duplicates. What happens if you do a vectorized operation on series objects with duplicate indices?

Below we define two series with the same values and the same indices, but one of the indices (4) appears three times in each series. What happens is that each element in one series gets matched with every possible match in the other series. So here we get three one-to-one matches, plus nine matches for the duplicated index that appears three times in each series: twelve elements in total.

In [661]:
s1 = pd.Series([1,2,3,4,5,6], index=[1,2,3,4,4,4])
s2 = pd.Series([1,2,3,4,5,6], index=[4,4,4,1,2,3])
print(s1)
print(s2)
sum = s1+s2
print(sum) 
1    1
2    2
3    3
4    4
4    5
4    6
dtype: int64
4    1
4    2
4    3
1    4
2    5
3    6
dtype: int64
1    5
2    7
3    9
4    5
4    6
4    7
4    6
4    7
4    8
4    7
4    8
4    9
dtype: int64

Various Ways Of Accessing A Series

A series is an array of values coupled with a parallel set of labels called the index. Because the values are in an array, they also have a position relative to the start of the series. You have to be careful to keep the two concepts distinct.

Notice that when we create the series below from a Python list alone, we still have an index. By default, that index is consecutive integers starting with zero. This means by default it lines up with the positions of the respective elements, but that needn’t be, and often is not, the case. We could have specified the index when we created the series or changed it later. Moreover, if we take a slice of the series, the indices of the elements won’t change to match the positions relative to the slice. Pandas indices are sticky.

Access via get(k)

The get() function is somewhat like the get() function for a hash table. It interprets its argument as a key and returns the corresponding value. If no such index is found, get() returns None, unless you have set a different default return value such as 0, as might be appropriate for a numeric series. See the examples below.

The indexing operator []

The indexing operator [] can be somewhat confusing. The series below consists of the numbers 0 through 4, inclusive, but the index runs from 1 through 5, inclusive (purely for convenience in the example; it could have started with 11 or 1000).
If we try to access ser[0], we get a KeyError, because the [] operator uses the label, not the position, and there is no label 0.
Had we not explicitly declared the index, there would have been such a label and the access would not have failed.

You might think that using the [] operator with a range would fail too if we asked for a range that does not exist. Nope. Notice that the range [0:4] does not fail. The reason is that although a single integer in [] is treated as a label, a slice of integers is treated positionally, with ordinary Python slice semantics: ser[0:4] returns the first four elements, and out-of-range bounds are simply clipped rather than raising an error. That’s why [0:6], for which neither bound exists as a position, returns all five elements, and why a range entirely beyond the data returns an empty series instead of failing.

In [662]:
ser = pd.Series([0,1,2,3,4],index=[1,2,3,4,5])
print('The series \n{}\n'.format(ser))

print('Here we use get() for one index that exists and two that do not.')
print('get(4)  {}'.format(ser.get(4))) 
print('get(99) {}'.format(ser.get(99))) 
print('get(\'Bob\') {}\n'.format(ser.get('Bob'))) 



print('ser[2] fetches one value: {}'.format(ser[2]))
print('ser[99] blows up with a key error: {}'.format('FAIL!'))
print('ser[\'Bob\'] blows up with a key error: {}\n'.format('FAIL!'))

print('ser[0:4]\n{}'.format(ser[0:4])) 
print('ser[0:6]\n{}'.format(ser[0:6]))

print('\nNow we create the series again with the index backwards, [5,4,3,2,1]')
ser = pd.Series([0,1,2,3,4],index=[5,4,3,2,1])
print('\nser[0:4] still works \n{}'.format(ser[0:4])) 
print('\nser[0:6] still works \n{}'.format(ser[0:6]))
print('\nser[100:200] still returns an empty series\n{}'.format(ser[100:200]))
The series 
1    0
2    1
3    2
4    3
5    4
dtype: int64

Here we use get() for one index that exists and two that do not.
get(4)  3
get(99) None
get('Bob') None

ser[2] fetches one value: 1
ser[99] blows up with a key error: FAIL!
ser['Bob'] blows up with a key error: FAIL!

ser[0:4]
1    0
2    1
3    2
4    3
dtype: int64
ser[0:6]
1    0
2    1
3    2
4    3
5    4
dtype: int64

Now we create the series again with the index backwards, [5,4,3,2,1]

ser[0:4] still works 
5    0
4    1
3    2
2    3
dtype: int64

ser[0:6] still works 
5    0
4    1
3    2
2    3
1    4
dtype: int64

ser[100:200] still returns an empty series
Series([], dtype: int64)
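The default return value mentioned above is just get()’s optional second argument:

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3, 4], index=[1, 2, 3, 4, 5])

print(ser.get(4))      # 3    -- the label exists
print(ser.get(99))     # None -- missing label, no default
print(ser.get(99, 0))  # 0    -- missing label, default supplied
```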

Accessing a Series Using loc[]

loc, like the indexing operator, uses labels, not positions. Like the indexing operator, but unlike get(), it fails when you attempt access with a label that does not exist. As with the indexing operator, as long as you’re using a range it’s OK for no matching key to be found.

In [663]:
ser = pd.Series([0,1,2,3,4],index=[1,2,3,4,5])
print('The series \n{}\n'.format(ser))
print('\nser.loc[1] {}'.format(ser.loc[1]))
print('\nser.loc[0] {}'.format('FAILS!'))
print('\nser.loc[6] {}'.format('FAILS!'))
print('\nser.loc[3:100] gets the elements with labels 3 through 5.\n{}'.format(ser.loc[3:100]))
print('\nser.loc[\'a\':\'x\'] succeeds but returns nothing.\n{}'.format(ser.loc['a':'x']))

 
The series 
1    0
2    1
3    2
4    3
5    4
dtype: int64


ser.loc[1] 0

ser.loc[0] FAILS!

ser.loc[6] FAILS!

ser.loc[3:100] gets the elements with labels 3 through 5.
3    2
4    3
5    4

ser.loc['a':'x'] succeeds but returns nothing.
Series([], dtype: int64)

loc doesn’t add much to simply using the index operator [] directly when manipulating a series, but it comes into its own with DataFrame objects.

Accessing A Series with iloc[]

We’ve seen that indices are bound to the values they index, which is not always what you want. The iloc[] accessor is for use when you want to access your Series (or DataFrame) object by position rather than by index.

Both loc and iloc are used with either Series or DataFrame objects, but while loc doesn’t add very much to Series access, iloc does, allowing you to do things that would otherwise be quite cumbersome.

We see below that ser.iloc[0] gets the value at position 0, which is 0. To get this with loc you’d need to know its index. The rest is quite similar to loc, except that you’re always talking about a position regardless of the index. Below we repeat the same accesses on a series that is identical to the first except that the indices are letters. We get the same values.

In [664]:
ser = pd.Series([0,1,2,3,4],index=[1,2,3,4,5])
print('The series \n{}\n'.format(ser))
print('\nser.iloc[0] {}'.format(ser.iloc[0]))
print('\nser.iloc[2:] \n{}'.format(ser.iloc[2:]))
ser = pd.Series([0,1,2,3,4],index=['a','b','c','d','e'])
print('\nThe series \n{}\n'.format(ser))
print('\nser.iloc[0] {}'.format(ser.iloc[0]))
print('\nser.iloc[2:] \n{}'.format(ser.iloc[2:]))
The series 
1    0
2    1
3    2
4    3
5    4
dtype: int64


ser.iloc[0] 0

ser.iloc[2:] 
3    2
4    3
5    4
dtype: int64
The series 
a    0
b    1
c    2
d    3
e    4
dtype: int64


ser.iloc[0] 0

ser.iloc[2:] 
c    2
d    3
e    4
dtype: int64

Some Boolean Access Methods for Series

A simple Boolean expression can be placed in [], with or without the loc[] method, to filter a series.

You can also pass loc[] an array of True/False values so long as it has the same length as the series. There is an example below.

More usefully, you can pass in an arbitrary function (technically a “callable”) that takes the series as its one argument and returns a Boolean result. If you pass a function foo as ser.loc[foo], it will be applied to the values (not the keys). Note that callables for DataFrames differ slightly.

This allows complex filtering criteria to be used. In the example, it’s a lambda function declared inline, but it could have been a function created in the usual way.

Should you want to do the same trick on the keys (not sure that would be good program design), you could accomplish it by computing a Boolean array, applying an appropriate Boolean-valued function to the index.

In [665]:
ser = pd.Series([0,1,2,3,4],index=[1,2,3,4,5])
print('\nfiltering on a simple boolean expression with [ser>2]: \n{}'.format(ser[ser>2]))
print('\nfiltering on a simple boolean expression with ser.loc[ser>2]: \n{}'.format(ser.loc[ser>2]))

boolarray = [True,True,False,False,False]
print('\nser.loc[boolean-array]\n{}'.format(ser.loc[boolarray]))  
print('\nser.loc[lambda x : x%2==0]\n{}'.format(ser.loc[lambda x : x%2==0]))
filtering on a simple boolean expression with [ser>2]: 
4    3
5    4
dtype: int64

filtering on a simple boolean expression with ser.loc[ser>2]: 
4    3
5    4
dtype: int64

ser.loc[boolean-array]
1    0
2    1
dtype: int64

ser.loc[lambda x : x%2==0]
1    0
3    2
5    4
dtype: int64
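That last idea, filtering on the keys, can be sketched by computing a Boolean array from the index itself and handing it to loc[]:

```python
import pandas as pd

ser = pd.Series([0, 1, 2, 3, 4], index=[1, 2, 3, 4, 5])

even_keys = ser.index % 2 == 0  # Boolean array computed from the labels, not the values
print(ser.loc[even_keys])       # the values at labels 2 and 4
```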

Enlargement of a Series

A series can have more elements tacked onto the end simply by assigning to an index label that does not already exist. If ser has n elements, assigning a value under a brand-new label will increase its size to n+1.
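A minimal sketch of enlargement:

```python
import pandas as pd

ser = pd.Series([10, 20, 30])  # default integer labels 0, 1, 2
ser[3] = 40      # label 3 doesn't exist yet, so the series grows
ser['x'] = 50    # any new hashable label works the same way

print(ser.size)  # 5
```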

The at[] and iat[] Operations

The plain [] and loc[] methods of access do many things, so they carry a certain amount of overhead that is unnecessary for simple accesses. The at[] operation cuts directly to the chase and simply looks up the requested value by label. Likewise, the iat[] operation looks up a value positionally.
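A quick sketch of both accessors:

```python
import pandas as pd

ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(ser.at['b'])   # 20 -- scalar lookup by label
print(ser.iat[1])    # 20 -- scalar lookup by position

ser.at['b'] = 99     # scalar assignment works the same way
print(ser.iat[1])    # 99
```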

Dealing With Missing Data

Pandas supply a number of functions for fixing data.

Imagine you have two Series datasets that you’d like to add together element-wise. Many of the indices appear in both datasets, but some are unique to one or the other, as in series1 and series2 below. If you just add them, sum = series1 + series2, you’ll end up with a lot of NaN values, because adding a number to a missing value yields a missing value.

  • clean=sum.dropna() will discard all the NaN values. This is sometimes what you want, but other times it makes no sense to discard one value simply because you don’t have a second.
  • sum.fillna(n) supplies a default value (n) to all the NaN entries. You’re not discarding the entries, but on the other hand you’re throwing away whatever meaning the one value you do have carried.
  • series1.add(series2, fill_value=0) actually fixes the data by supplying a default for the missing value rather than patching the defective result. The identity for addition, 0, is the most useful value here.
  • series1.multiply(series2, fill_value=1) works much like the add() function, with the multiplicative identity, 1, as the natural default.
In [666]:
series1 = pd.Series([1,2,3,4,5], index=[1,2,3,4,5])
series2 = pd.Series([3,4,5,6,7], index=[3,4,5,6,7])
sum = series1 + series2
print('Sum of two series:\n{}'.format(sum)) 
print('Dropping the NaN entries:\n{}'.format(sum.dropna())) 
print('Replacing NaN entries with default:\n{}'.format(sum.fillna(0)))  
print('Fixing with add():\n{}'.format(series1.add(series2,fill_value=0))) 
print('Fixing with multiply():\n{}'.format(series1.multiply(series2,fill_value=1))) 
Sum of two series:
1     NaN
2     NaN
3     6.0
4     8.0
5    10.0
6     NaN
7     NaN
dtype: float64
Dropping the NaN entries:
3     6.0
4     8.0
5    10.0
dtype: float64
Replacing NaN entries with default:
1     0.0
2     0.0
3     6.0
4     8.0
5    10.0
6     0.0
7     0.0
dtype: float64
Fixing with add():
1     1.0
2     2.0
3     6.0
4     8.0
5    10.0
6     6.0
7     7.0
dtype: float64
Fixing with multiply():
1     1.0
2     2.0
3     9.0
4    16.0
5    25.0
6     6.0
7     7.0
dtype: float64

Transforming the Values of a Series

In [667]:
s = pd.Series([23,72,78,94,99,65])
print('Series with grades expressed as numbers:\n{}'.format(s)) 
s.loc[:] = s.apply(lambda x: 'P' if x>75 else 'F')
print('Same series converted to pass/fail:\n{}'.format(s)) 

s = pd.Series([23,72,78,94,99,65])
for i in range(s.size):
    s.iat[i] = 1 if s.iat[i]>75 else 0 
print('Same series converted to 1/0:\n{}'.format(s)) 
Series with grades expressed as numbers:
0    23
1    72
2    78
3    94
4    99
5    65
dtype: int64
Same series converted to pass/fail:
0    F
1    F
2    P
3    P
4    P
5    F
dtype: object
Same series converted to 1/0:
0    0
1    0
2    1
3    1
4    1
5    0
dtype: int64

The End

This is just a quick summary of the things you can do with the Pandas Series. We’ll take a look at the two-D DataFrames in the next post.
