Python for Data Analysis

  • the book is available here

2

  • The Python Cookbook will be a good read to make up for gaps in my knowledge.
  • Python is an interpreted language.
    • vs. a compiled language, which needs a compiler to convert all instructions to machine code before they can be run
    • what difference does this make?
  • This is a fun tidbit: The pandas name itself is derived from panel data, an econometrics term for multidimensional structured datasets, and a play on the phrase Python data analysis.
  • Need to check out the statsmodels library.
  • Introspection
    • I love the use of this word in this context
    • In IPython or Jupyter, typing some_function? or ?some_function shows the docstring
    • Might be useful in my own work to prevent scrolling or opening up tabs
  • Reminder that lists are mutable: if you assign (bind) 2 variables to the same list, a change made through one of them is visible through the other.
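  • A minimal sketch of the aliasing reminder above (the names a, b and c are illustrative, not from the book):

```python
# Two names bound to the same list object
a = [1, 2, 3]
b = a              # no copy is made; b refers to the same list as a
b.append(4)

print(a)           # [1, 2, 3, 4]: the change made through b shows up in a
print(a is b)      # True: one object, two names

# An explicit copy gives an independent list
c = a.copy()       # list(a) or a[:] would also make a shallow copy
c.append(5)
print(a)           # [1, 2, 3, 4]: unchanged by the append to c
print(c)           # [1, 2, 3, 4, 5]
```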

3: Built-In

Data Structures & Sequences

  • Tuples
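    • Interesting thing about tuples: *rest in unpacking collects the leftover values into a list, much like *args (and **kwargs for keyword arguments) collects extras in a function signature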
values = 1, 2, 3, 4, 5
a, b, *rest = values
print(a)
print(b)
print(*rest)
1
2
3 4 5
  • Lists
    • Some methods: append, insert, pop, remove, extend, sort
    • Quote: insert is computationally expensive compared with append, because references to subsequent elements have to be shifted internally to make room for the new element. If you need to insert elements at both the beginning and end of a sequence, you may wish to explore collections.deque, a double-ended queue, which is optimized for this purpose and found in the Python Standard Library.
    • x in list, x not in list syntax
      • Quote: Checking whether a list contains a value is a lot slower than doing so with dictionaries and sets, as Python makes a linear scan across the values of the list, whereas it can check the others (based on hash tables) in constant time.
    • Quote: Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable.
    • sort has a key argument:
b = ["saw", "small", "He", "foxes", "six"]
b.sort(key=len)
b
['He', 'saw', 'six', 'small', 'foxes']
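  • A rough sketch of the three quoted performance notes on lists (deque, extend, membership tests). The data and variable names are made up for illustration:

```python
from collections import deque

# deque supports cheap appends and pops at both ends
d = deque([2, 3])
d.appendleft(1)    # add at the front without shifting the other elements
d.append(4)        # add at the end
print(list(d))     # [1, 2, 3, 4]

# extend grows the existing list in place; + would build a new list each time
chunks = [[1, 2], [3, 4], [5]]
everything = []
for chunk in chunks:
    everything.extend(chunk)
print(everything)  # [1, 2, 3, 4, 5]

# membership: a list is scanned linearly, a set is looked up by hash
values_list = list(range(1000))
values_set = set(values_list)
print(999 in values_list, 999 in values_set)  # True True
```

    • Both membership tests give the same answer; the difference is only in how much work Python does to get it.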

  • Dictionaries
    • aka hash map or associative array
    • del, pop, update (merge 2 dicts)
    • creating dicts from sequences:
      • Way 1: iterate over 2 sequences
      • Way 2: Using tuples
        • ‘a dictionary is essentially a collection of 2-tuples’
# Way 1
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
# Way 2
tuples = zip(range(5), reversed(range(5)))
mapping = dict(tuples)
mapping
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0}
  • Dictionaries contd.
    • setdefault is a thing
      • as is defaultdict from collections module
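    • dictionary keys can be tuples as well as scalars (int, string, float)
      • basically any hashable object, which covers the immutable built-in types (a tuple only qualifies if all of its elements are hashable too)
      • this means you can convert a list into a tuple and use it as a key, which is kinda neat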
  • Sets
    • have some interesting methods e.g. a.difference_update(b) & a.symmetric_difference(b)
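  • A small sketch of the dict and set features noted above: setdefault, defaultdict, tuple keys, and the two set methods. The sample words and numbers are placeholders:

```python
from collections import defaultdict

words = ["apple", "bat", "bar", "atom", "book"]

# setdefault: create the list for a key the first time that key is seen
by_letter = {}
for word in words:
    by_letter.setdefault(word[0], []).append(word)
print(by_letter)

# defaultdict: the same grouping, with the default supplied by a factory
by_letter_dd = defaultdict(list)
for word in words:
    by_letter_dd[word[0]].append(word)
print(dict(by_letter_dd))

# a list is unhashable, but its tuple form can be used as a key
coords = [1, 2]
lookup = {tuple(coords): "point"}
print(lookup[(1, 2)])

# set methods
a = {1, 2, 3, 4}
b = {3, 4, 5}
sym = a.symmetric_difference(b)  # new set: elements in exactly one of a, b
a.difference_update(b)           # in place: drop from a anything found in b
print(sym, a)
```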

  • zip pairs up elements of sequences (lists, tuples) into tuples; it returns an iterator, so wrap it in list to get a list of tuples:
s1 = ['a', 'b', 'c']
s2 = [1, 2, 3]
list(zip(s1, s2))
[('a', 1), ('b', 2), ('c', 3)]
  • In addition to list comprehension, there’s also set and dict comprehension:
strings = ["a", "as", "bat", "car", "dove", "python"]
# list
print([x.upper() for x in strings if len(x) > 2])
# set
print({len(x) for x in strings})
# dict
print({value: index for index, value in enumerate(strings)})
['BAT', 'CAR', 'DOVE', 'PYTHON']
{1, 2, 3, 4, 6}
{'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}
  • nested comprehension exists, but it isn’t as fun
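  • For reference, a minimal sketch of what a nested comprehension looks like, next to the equivalent loops (sample data is made up):

```python
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# flatten: the for clauses read left to right, in the same order as nested loops
flattened = [x for tup in some_tuples for x in tup]

# the same thing written as explicit loops
flattened_loop = []
for tup in some_tuples:
    for x in tup:
        flattened_loop.append(x)

# a comprehension inside a comprehension keeps the nesting instead
doubled = [[x * 2 for x in tup] for tup in some_tuples]

print(flattened)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(doubled)    # [[2, 4, 6], [8, 10, 12], [14, 16, 18]]
```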

Functions

As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

  • Scopes
    • Functions are assigned local namespaces
      • the local namespace is destroyed after the function finishes running
    • Functions can also read variables from enclosing or global namespaces
    • the global and nonlocal keywords let a function assign to a variable in the global or enclosing namespace instead of creating a new local one
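  • A short sketch of the scope rules above. The function names here are hypothetical and only meant to show global and nonlocal in action:

```python
counter = 0

def increment_global():
    global counter      # assign to the module-level name, not a new local
    counter += 1

def make_counter():
    count = 0           # lives in make_counter's namespace
    def step():
        nonlocal count  # assign to the enclosing function's variable
        count += 1
        return count
    return step

def local_only():
    tmp = [1, 2, 3]     # tmp exists only while local_only is running
    return len(tmp)

increment_global()
step = make_counter()
first = step()
second = step()
print(counter, first, second, local_only())  # 1 1 2 3
```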
  • Jeez, just look at the following example:
    • Works because functions are objects
import re

def clean_strings(strings):
    result = []
    for value in strings:
        value = value.strip()
        value = re.sub("[!#?]", "", value)
        value = value.title()
        result.append(value)
    return result

states = ["   Alabama ", "Georgia!", "Georgia", 
          "georgia", "FlOrIda","south   carolina##", 
          "West virginia?"]
clean_strings(states)
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']
# alternative
def remove_punctuation(value):
    return re.sub("[!#?]", "", value)

# all the ops you do on a string added to this list
clean_ops = [str.strip, remove_punctuation, str.title]

def clean_strings_alt(strings, ops):
    result = []
    for value in strings:
        for func in ops:
            # apply those functions on your value
            value = func(value)
        result.append(value)
    return result

clean_strings_alt(states, clean_ops)
['Alabama',
 'Georgia',
 'Georgia',
 'Georgia',
 'Florida',
 'South   Carolina',
 'West Virginia']
  • map is interesting
    • map applies a function to each element of a sequence and returns an iterator (hence the set and list calls below)
    • this example also shows that you can pass functions to other functions as inputs
print(strings)
set(map(len, strings))
['a', 'as', 'bat', 'car', 'dove', 'python']
{1, 2, 3, 4, 6}
print(states)
list(map(remove_punctuation, states))
['   Alabama ', 'Georgia!', 'Georgia', 'georgia', 'FlOrIda', 'south   carolina##', 'West virginia?']
['   Alabama ',
 'Georgia',
 'Georgia',
 'georgia',
 'FlOrIda',
 'south   carolina',
 'West virginia']
  • Anonymous (lambda) functions
    • functions consisting of a single expression
    • the result of that expression is the return value
def apply_to_list(some_list, f):
    return [f(x) for x in some_list]

ints = [4, 0, 1, 5, 6]

apply_to_list(ints, lambda x: x * 2)
[8, 0, 2, 10, 12]
strings = ["foo", "card", "bar", "aaaa", "abab"]
strings.sort()
strings
['aaaa', 'abab', 'bar', 'card', 'foo']
strings = ["foo", "card", "bar", "aaaa", "abab"]
strings.sort(key=lambda x: len(set(x))) #sort by num of unique chars in str
strings
['aaaa', 'foo', 'abab', 'bar', 'card']
  • Generators
    • itertools module
      • chain(*iterables), combinations(iterable, k), permutations(iterable, k), groupby(iterable[, keyfunc]), product(*iterables, repeat=1)
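  • A hedged sketch of a generator plus the itertools functions listed above. The squares function and the sample names are illustrative only:

```python
import itertools

def squares(n):
    """Yield the squares of 1..n one at a time."""
    for i in range(1, n + 1):
        yield i ** 2

gen = squares(4)      # nothing is computed until the generator is consumed
print(list(gen))      # [1, 4, 9, 16]

# generator expression: like a comprehension, but lazy
total = sum(x ** 2 for x in range(1, 5))

# groupby groups *consecutive* elements that share a key
names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"]
groups = [(letter, list(group))
          for letter, group in itertools.groupby(names, key=lambda s: s[0])]

chained = list(itertools.chain([1, 2], (3, 4)))
pairs = list(itertools.combinations("abc", 2))
perms = list(itertools.permutations("ab", 2))
grid = list(itertools.product([0, 1], repeat=2))
print(groups)
```

    • Note that "Albert" ends up in its own group: groupby does not sort, so the input needs sorting by the key first if you want one group per key.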
  • Error & exception handling
    • try/except is for when you want things to fail gracefully
float("something")
ValueError: could not convert string to float: 'something'
def attempt_float(x):
    try:
        return float(x)
    except:
        return x

attempt_float("something")
'something'
# there are other types of error
# e.g. type error
float((1, 2))
TypeError: float() argument must be a string or a real number, not 'tuple'
# might want to specify which error type to except, therefore:
def attempt_float(x):
    try:
        return float(x)
    except ValueError:
        return x

# or a tuple of errors
def attempt_float(x):
    try:
        return float(x)
    except (TypeError, ValueError):
        return x
  • finally is for cleanup that must run whether or not the try block raises:
f = open(path, mode="w")

try:
    write_to_file(f)
finally:
    f.close()
  • else is for what happens if try succeeded:
f = open(path, mode="w")

try:
    write_to_file(f)
except:
    print("Failed")
else:
    print("Succeeded")
finally:
    f.close()
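  • The snippets above leave path and write_to_file undefined. A runnable sketch of the same try/except/else/finally flow, assuming a temporary file and a stand-in write_to_file helper (both are my placeholders, not from the book):

```python
import os
import tempfile

def write_to_file(f):
    # hypothetical helper: stands in for whatever writing you need to do
    f.write("hello\n")

def save(path):
    f = open(path, mode="w")
    try:
        write_to_file(f)
    except OSError:
        status = "Failed"       # runs only if the try block raised OSError
    else:
        status = "Succeeded"    # runs only if the try block did not raise
    finally:
        f.close()               # runs in both cases
    return status, f.closed

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
status, closed = save(path)
print(status, closed)           # Succeeded True

# read the file back; a with block closes the file automatically
with open(path) as f:
    content = f.read()
```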

5: Pandas

import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
# this exists: loc accepts lists of row labels and column labels
frame.loc[["a", "c"], ["California", "Texas"]]
   California  Texas
a           2      1
c           5      4