Three quick Python nuggets for beginner data scientists

This blog post gives an overview of three core Python features and utilities that you can adopt in your scripts to write more concise, easier-to-understand code:

  1. List comprehensions
  2. Slicing
  3. collections.Counter

This post presents several examples that will help you refactor existing Python code using these features.

List comprehensions

List comprehensions are a great tool when you need to build up a list of data. They are especially useful when each element of the resulting list is computed by applying an operation to the elements of another list. Take this example, which calculates the Euclidean distance from a list of (a, b) coordinate pairs. It uses a traditional imperative style, accumulating new elements into a result list inside a loop.

import math
def get_distance(points):
    diffs_squared_distance = []
    for a, b in points:
        diffs_squared_distance.append(pow(a - b, 2))
    return math.sqrt(sum(diffs_squared_distance))

By using list comprehensions you can refactor this code in a more concise and declarative form:

def get_distance(points):
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))
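
As a quick check, here is a small sketch that calls the refactored function on made-up data. It assumes each (a, b) pair holds the matching coordinates of two points; zip is used here only to build those pairs and is not part of the original example.

```python
import math


def get_distance(points):
    diffs_squared_distance = [pow(a - b, 2) for (a, b) in points]
    return math.sqrt(sum(diffs_squared_distance))


# Illustrative data: two 2-D points, paired coordinate by coordinate.
p = (1, 2)
q = (4, 6)
pairs = list(zip(p, q))  # [(1, 4), (2, 6)]

print(get_distance(pairs))  # 5.0
```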

The list comprehension is the expression [pow(a - b, 2) for (a, b) in points]. It contains three parts:

  1. An input. Here it is the list of points.
  2. A variable representing the elements in the list. In this case it is a tuple (a, b).
  3. An output expression producing the elements of the output list. In this case it is pow(a - b, 2).

In other words, the code above says that given a list of points, you will produce pow(a - b, 2) for each (a, b) in the list. The result is itself a list. Note that a list comprehension can also include an optional condition that filters which elements are used. You can find more information in the Python documentation.
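
As a minimal sketch of the optional condition, the following example skips pairs whose two values are equal before squaring the difference. The function name squared_nonzero_diffs and the sample data are made up for illustration.

```python
def squared_nonzero_diffs(points):
    # Hypothetical helper: square the difference of each (a, b) pair,
    # skipping pairs where a and b are equal (the `if` clause is the condition).
    return [pow(a - b, 2) for (a, b) in points if a != b]


# Illustrative data, not taken from the original post.
sample_points = [(1, 4), (2, 2), (5, 1)]
print(squared_nonzero_diffs(sample_points))  # [9, 16]
```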

Slicing

Python provides a built-in feature that lets you select a subset of elements in a list. This is called slicing, and you may find it useful when you are manipulating a data set whose samples are stored as elements of a list.

For example, without slicing you might copy the elements one at a time:

def slice_list(data, start_index, end_index):
    sliced_data = []
    for i in range(start_index, end_index):
        sliced_data.append(data[i])
    return sliced_data

slice_list(data, 1, 4)

In Python you can do exactly that with the built-in slicing syntax: data[1:4]

You can even omit the end index to indicate that you want all the remaining elements in the list:

data[1:] is equivalent to data[1:len(data)]

You may also wish to produce a list without the last element. You can do this by using -1 as the end index:

data[:-1] is equivalent to data[0:len(data) - 1]
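
To check these equivalences, here is a small sketch that compares each slice with the hand-written slice_list function from earlier. The sample data list is an assumption for illustration.

```python
def slice_list(data, start_index, end_index):
    # Manual version from the text: copy elements one at a time.
    sliced_data = []
    for i in range(start_index, end_index):
        sliced_data.append(data[i])
    return sliced_data


# Illustrative data, not taken from the original post.
data = [10, 20, 30, 40, 50]

print(data[1:4])   # [20, 30, 40]
print(data[1:])    # [20, 30, 40, 50]
print(data[:-1])   # [10, 20, 30, 40]

# Each slice matches the manual loop with explicit indices.
assert data[1:4] == slice_list(data, 1, 4)
assert data[1:] == slice_list(data, 1, len(data))
assert data[:-1] == slice_list(data, 0, len(data) - 1)
```

Note that a slice returns a new list, so the original data is left unchanged.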

collections.Counter

Summarizing a data set is a frequent task. For example, you may wish to calculate the frequency of symbols in a data set because the symbols represent different categories in your data. Traditionally, you might implement this yourself. For example:

import collections
symbols = ['o', 'x', 'o', 'o', 'x', '-', '-', '-']
count = collections.defaultdict(int)
for s in symbols:
    count[s] += 1
print(count)

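This will produce output such as defaultdict(int, {'-': 3, 'o': 3, 'x': 2}). The exact formatting and key order depend on your Python version and on whether the value is printed or displayed by IPython.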

Using the collections.Counter utility class you can simply do:

collections.Counter(symbols)

which will return the following Counter object, a dictionary subclass: Counter({'-': 3, 'o': 3, 'x': 2})

Beyond counting, the Counter utility provides more flexible operations. For example, you may wish to find the two most common elements:

count = collections.Counter(symbols)
count.most_common(2)

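which produces [('-', 3), ('o', 3)] (when counts are tied, as for '-' and 'o' here, the order of the tied elements can vary between Python versions).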
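
As a further sketch of this flexibility, a Counter returns 0 for symbols it has never seen, and it can be updated with more data. The '?' symbol and the extra 'x' values below are made up for illustration.

```python
import collections

symbols = ['o', 'x', 'o', 'o', 'x', '-', '-', '-']
count = collections.Counter(symbols)

# Missing keys return 0 instead of raising KeyError.
print(count['?'])  # 0

# update() adds counts from another iterable (illustrative extra data).
count.update(['x', 'x'])
print(count['x'])  # 4

# 'x' is now the single most common symbol, so there is no tie.
print(count.most_common(1))  # [('x', 4)]
```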

You can find the IPython notebook with the examples here.