01 Introduction to Python: EDA 101

!pip install numpy pandas seaborn matplotlib

import numpy as np
import pandas as pd

Pandas

The most useful and commonly used library for tabular data.

url = 'https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv'
titanic = pd.read_csv(url)
titanic.info()

titanic

titanic.describe()

titanic.sort_values(by='age', ascending=False).head(5)

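Indexing can be tricky: square brackets with a list of names select columns, while iloc selects rows and columns by integer position.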

titanic[['age', 'name']].head(5)

titanic.iloc[[2, 5, 6], 2:5]
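To make the difference between the selection styles concrete, here is a small sketch on a hand-made frame. The `toy` DataFrame and its values are invented for illustration and are not part of the Titanic data. It contrasts label-based `loc`, position-based `iloc`, and boolean masks.

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1
toy = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "age": [29.0, 2.0, 30.0, 25.0]},
    index=[10, 20, 30, 40],
)

# Label-based: rows whose index label is 10 or 30
by_label = toy.loc[[10, 30], "age"]

# Position-based: first and third rows, second column
by_position = toy.iloc[[0, 2], 1]

# Boolean mask: rows where a condition holds
adults = toy[toy["age"] >= 18]

# Label slices include the end point, position slices do not
label_slice = toy.loc[10:30]
position_slice = toy.iloc[0:2]
```

Here `by_label` and `by_position` pick the same two rows, but only because label 10 happens to sit at position 0 and label 30 at position 2.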

type(titanic)

You can extract a NumPy array from a DataFrame or from a single column.

type(titanic.values)  # older attribute; to_numpy() is the recommended way
type(titanic.to_numpy())

ages = titanic.age.to_numpy()
ages.shape, ages.dtype
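If a column contains missing values (the `dropna` step later in this notebook suggests this dataset has some), they survive the conversion as `NaN`, which affects plain NumPy reductions. A minimal sketch with an invented `ages_toy` Series standing in for the real column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an age column with one missing value
ages_toy = pd.Series([22.0, None, 35.0], name="age")

arr = ages_toy.to_numpy()

# NaN propagates through ordinary NumPy reductions
plain_mean = arr.mean()

# np.nanmean ignores NaN entries
nan_aware_mean = np.nanmean(arr)

# pandas skips NaN by default
pandas_mean = ages_toy.mean()
```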

For more details, see 10 Minutes to pandas (in practice it takes much longer than that):

http://pandas.pydata.org/pandas-docs/stable/10min.html

Matplotlib

A workhorse of scientific visualization in Python.

from matplotlib import pyplot as plt

[deprecated] Render figures inline in the notebook (no pop-up window). Recent Jupyter versions do this by default.

# %matplotlib inline

Seaborn

A high-level library for visualization and exploratory data analysis.

!pip install seaborn

import seaborn as sns

# sns.set() applies seaborn's default theme, which gives plots a more attractive look
sns.set()

sns.catplot(x="pclass", kind="count", data=titanic)

sns.catplot(data=titanic, x="pclass", hue="sex", kind="count")

fg = sns.FacetGrid(titanic, hue="sex", aspect=3)
fg.map(sns.kdeplot, "age", fill=True)
fg.set(xlim=(0, 80));

fg = sns.FacetGrid(titanic, col="sex", row="pclass", hue="sex", height=2.5, aspect=2.5)
fg.map(sns.kdeplot, "age", fill=True)
fg.map(sns.rugplot, "age")
sns.despine(left=True)
fg.set(xlim=(0, 80));

See more examples of Seaborn visualizations for the Titanic dataset here:

https://gist.github.com/mwaskom/8224591

Hands-on

Load the data from the CSV file
Check column names
Look for dependencies between features and the target vector
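One possible way to start on these steps is sketched below. The CSV text is a tiny invented sample held in a string so that the snippet runs offline; `sample_csv` and its rows are placeholders, and with a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Invented sample rows standing in for a CSV file on disk
sample_csv = io.StringIO(
    "survived,pclass,sex,age\n"
    "1,1,female,29\n"
    "0,3,male,22\n"
    "1,2,female,40\n"
    "0,3,male,35\n"
)

df = pd.read_csv(sample_csv)

# Check column names
columns = list(df.columns)

# Mean of the target per category: a first look at a dependency
rate_by_sex = df.groupby("sex")["survived"].mean()

# Correlation of the numeric features with the target
corr_with_target = df[["pclass", "age", "survived"]].corr()["survived"]
```

Group means suit categorical features and correlations suit numeric ones; the seaborn plots shown earlier are a visual alternative for both.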

Scikit-learn

A machine learning library

!pip install scikit-learn

from sklearn.neighbors import KNeighborsClassifier

Let’s do a little bit of processing to create some variables that are easier to plot and to feed into a model. Since this notebook is focused on visualization, we’re going to do this without much comment.

titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
titanic["sex"] = titanic.sex.map({"male":0, "female":1})
titanic = pd.get_dummies(titanic, dummy_na=True, columns=['embarked',])
titanic.head(6)
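To see what the encoding step does in isolation, here is a small sketch on an invented frame (`toy` and its values are illustrative only). `map` turns the two string labels into integers, and `get_dummies` with `dummy_na=True` adds one indicator column per category plus one for missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing port of embarkation
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", np.nan],
})

# Replace string labels with integer codes
toy["sex"] = toy["sex"].map({"male": 0, "female": 1})

# One indicator column per category, plus one flagging missing values
encoded = pd.get_dummies(toy, dummy_na=True, columns=["embarked"])
encoded_columns = list(encoded.columns)
```

Because missing categories get their own indicator column, a later `dropna` call removes rows only for missing values in the remaining columns.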

titanic.count()

titanic.dropna(inplace=True)
titanic.head(6)

titanic.count()

# extract X - features & y - targets
X = titanic.drop('survived', axis=1)
y = titanic.survived

Now it’s time to build a model

# initialize a classifier
clf = KNeighborsClassifier()

# train the classifier
clf.fit(X, y)

# calculate predictions
y_predicted = clf.predict(X)

# estimate accuracy
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))

# you can also specify some parameters during initialization
clf = KNeighborsClassifier(n_neighbors=10)

clf.fit(X, y)
y_predicted = clf.predict(X)
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
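The accuracies above are measured on the same rows the classifier was fitted on, which tends to overstate how well it generalizes. A minimal sketch of a held-out evaluation follows, using synthetic data so that it runs on its own. `X_demo`, `y_demo`, the 25% test share and `random_state=0` are all illustrative choices, not part of the original notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic features and a target that depends on the first two of them
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Hold out a quarter of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf_demo = KNeighborsClassifier(n_neighbors=10)
clf_demo.fit(X_train, y_train)

# score() returns mean accuracy on the given data
train_accuracy = clf_demo.score(X_train, y_train)
test_accuracy = clf_demo.score(X_test, y_test)
```

The same pattern should carry over to the Titanic features by passing `X` and `y` to `train_test_split` in place of the synthetic arrays.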

# you can also predict probabilities of belonging to a particular class
proba = clf.predict_proba(X)
proba_df = pd.DataFrame(proba, index=y.index, columns=[0, 1])
proba_df['true'] = y
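The column order of `predict_proba` follows `clf.classes_`, so it can be safer to read the labels from the fitted model than to hard-code them. A small sketch on invented data (`X_demo` and `y_demo` are placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups on a single feature
X_demo = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

clf_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Label the probability columns with the classes the model actually learned
proba_demo = pd.DataFrame(
    clf_demo.predict_proba(X_demo),
    columns=clf_demo.classes_,
)

# Each row is a probability distribution over the classes
row_sums = proba_demo.sum(axis=1)
```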

fg = sns.FacetGrid(proba_df, hue="true", aspect=3)
# column 1 holds the predicted probability of class 1 (survived)
fg.map(sns.kdeplot, 1, fill=True)
plt.xlabel('Predicted probability of survival')
plt.legend(['survived=0', 'survived=1'])