This page covers a common problem when loading data into Pandas: Pandas can get confused about whether the values in a column are text or numbers.

## An example

In [None]:
import numpy as np
import pandas as pd
pd.set_option('mode.chained_assignment','raise')

We return to the example data file that you may have seen in the [text encoding](https://matthew-brett.github.io/cfd2019/chapters/07/text_encoding) page.

You can download the data file from [imdblet_latin.csv](https://matthew-brett.github.io/cfd2019/data/imdblet_latin.csv).

In [None]:
films = pd.read_csv('imdblet_latin.csv', encoding='latin1')
films.head()

Now imagine we are interested in the average rating across these films:

In [None]:
ratings = films['Rating'].copy()
ratings.mean()

## The problem

The problem is that we were expecting our ratings to be numbers, but in fact, they are strings.

We can see what type of thing Pandas has stored by looking at the `dtype`
attribute of a Series, or the `dtypes` attribute of a data frame.

In [None]:
films.dtypes

In [None]:
ratings.dtype

Both attributes tell us the same thing: the 'Rating' column stores its values
in the "object" or "O" type. This is a general type that can store any Python
value. It is the standard type that Pandas uses when storing text.

Why does Pandas use text for the 'Rating' column?

A quick look at the first rows gives the answer:

In [None]:
ratings.head()

The film "Paris, Texas (1984)" has a value "N/K" for the rating. This can't be a number, so Pandas stored this column in a format that allows it to store "N/K" as text.
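
To see this effect in isolation, here is a minimal sketch using two small CSV texts. The titles and ratings are made up for illustration and are not taken from the data file; `clean_csv` and `messy_csv` are hypothetical names. The two texts differ only in that one rating is "N/K".

```python
import io
import pandas as pd

# Made-up CSV text for illustration; not the real film data.
clean_csv = "Title,Rating\nFilm A,8.3\nFilm B,7.1\n"
messy_csv = "Title,Rating\nFilm A,8.3\nFilm B,N/K\n"

clean = pd.read_csv(io.StringIO(clean_csv))
messy = pd.read_csv(io.StringIO(messy_csv))

# Every rating looks like a number, so Pandas stores floats.
print(clean['Rating'].dtype)
# One rating is not a number, so Pandas keeps all the ratings as text.
print(messy['Rating'].dtype)
print(type(messy['Rating'].iloc[0]))
```

Notice that a single text value is enough to change how Pandas stores the whole column, including values such as "8.3" that could have been numbers.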

If the problem value was not obvious from the first rows, another way of finding it is to `apply` the function `float` to the column values.

When we `apply` a function to a Series, it does this:

* For each value in the Series it:
  * Calls the function, with the value as the single argument.
  * Collects the new value returned from the function, and appends it to a new
    Series.
* Returns the new Series.

The result is a Series that is the same length as the original series, but
where each value in the new series is the result of calling the function on the
original value.
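
The steps above can be sketched as an ordinary loop. This is only an illustration of the idea, not how Pandas implements `apply` internally, and the values in the Series are made up.

```python
import pandas as pd

# A small made-up Series of text values.
values = pd.Series(['1.5', '2.0', '10'])

# Spell out the steps of apply using a loop.
results = []
for v in values:
    # Call the function with the value as the single argument,
    # and collect the returned value.
    results.append(float(v))
by_loop = pd.Series(results, index=values.index)

# The same calculation using apply.
by_apply = values.apply(float)

# Check whether the two Series have the same values.
print(by_loop.equals(by_apply))
```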

Recall that the `float` function converts the thing you pass into a floating
point value:

In [None]:
v = float('3.14')
v

In [None]:
type(v)

Now we try applying `float` to the problematic column. We expect an error, because `float` cannot convert the text "N/K" to a number:

In [None]:
ratings.apply(float)
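
Applying `float` stops at the first value it cannot convert, so it does not tell us whether there are other problem values further down. One possible way to list all of them is to apply a small checking function instead. This is a sketch on made-up values; `can_be_float` is a hypothetical helper name, not part of Pandas.

```python
import pandas as pd

def can_be_float(v):
    """ Return True if `float` can convert `v`, False otherwise. """
    try:
        float(v)
    except ValueError:
        return False
    return True

# Made-up ratings for illustration.
example = pd.Series(['8.3', 'N/K', '7.1', 'unknown'])
# True where the value can be converted, False otherwise.
ok = example.apply(can_be_float)
# Select the values for which the conversion failed.
bad_values = example[~ok]
print(bad_values)
```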

One way of dealing with this problem is to make a *recoding* function.

A recoding function is a function that we `apply` to a Series. Pandas calls the function once for every value in the Series, passing that value as the argument. The function returns the corresponding value for the new Series.

In [None]:
def recode_ratings(v):
    if v == 'N/K':  # Return missing value for 'N/K'
        return np.nan
    # Otherwise make text value into a float
    return float(v)

We test our function:

In [None]:
recode_ratings('8.3')

In [None]:
recode_ratings('N/K')

We make a new Series by calling the recode function:

In [None]:
new_ratings = ratings.apply(recode_ratings)
new_ratings.head()
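
With the values recoded as floats, the calculation we originally wanted should now work. Here is a self-contained sketch on made-up ratings, repeating the `recode_ratings` function from above so that the example runs on its own. It relies on the fact that `mean` skips missing values by default.

```python
import numpy as np
import pandas as pd

def recode_ratings(v):
    if v == 'N/K':  # Return missing value for 'N/K'
        return np.nan
    # Otherwise make text value into a float
    return float(v)

# Made-up ratings for illustration.
example = pd.Series(['8.0', 'N/K', '7.0'])
recoded = example.apply(recode_ratings)
# The recoded Series holds floats, with NaN for the missing value.
print(recoded.dtype)
# mean skips the missing value: (8.0 + 7.0) / 2
print(recoded.mean())
```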

We can insert this back into a copy of the original data frame:

In [None]:
films_fixed = films.copy()
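# Assign the column directly, so it takes the float dtype of new_ratings.
# (In recent Pandas, `.loc[:, 'Rating'] = ...` may keep the original dtype.)
films_fixed['Rating'] = new_ratings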
films_fixed.head()
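
The recoding function is one way to deal with the problem. As an alternative, and not the method used on this page, Pandas provides `pd.to_numeric`, whose `errors='coerce'` option turns any value it cannot convert into a missing value. Here is a sketch on made-up ratings.

```python
import pandas as pd

# Made-up ratings for illustration.
example = pd.Series(['8.3', 'N/K', '7.1'])
# Values that cannot be converted to numbers become missing values.
converted = pd.to_numeric(example, errors='coerce')
print(converted)
```

One difference to keep in mind is that `errors='coerce'` converts every unreadable value to a missing value, not only "N/K". The recoding function is stricter: it still gives an error for any unexpected text, which can be useful for spotting problems in the data.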