!pip install numpy pandas seaborn matplotlib
01 Introduction to Python: EDA 101
import numpy as np
import pandas as pd
Pandas
The most useful and commonly used library for tabular data.
url = 'https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv'
titanic = pd.read_csv(url)
titanic.info()
titanic
titanic.describe()
titanic.sort_values(by='age', ascending=False).head(5)
Indexing can be tricky.
titanic[['age', 'name']].head(5)
titanic.iloc[[2, 5, 6], 2:5]
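Why indexing is tricky becomes clearer on a small frame where index labels and row positions differ. The sketch below uses a hypothetical toy DataFrame (the names and values are made up for illustration, not taken from the Titanic data) to contrast label-based `loc`, position-based `iloc`, and boolean masks.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, so labels != positions.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cid", "Dee"], "age": [29, 41, 35, 52], "pclass": [1, 3, 2, 3]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[[10, 30], ["name", "age"]]  # rows with index labels 10 and 30
by_position = df.iloc[[0, 2], 0:2]            # rows at positions 0 and 2, first two columns
older = df[df["age"] > 40]                    # boolean mask keeps rows where age > 40

# Label slices include the end point, position slices do not.
label_slice = df.loc[10:30]
position_slice = df.iloc[0:2]

print(by_label.equals(by_position))
print(older["name"].tolist())
print(len(label_slice), len(position_slice))
```

Here `loc` and `iloc` select the same cells only because labels 10 and 30 happen to sit at positions 0 and 2; after sorting or filtering, the two generally diverge.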
type(titanic)
You can extract a numpy array
type(titanic.values) # older accessor; to_numpy() is the recommended way
type(titanic.to_numpy())
ages = titanic.age.to_numpy()
ages.shape, ages.dtype
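The dtype of the extracted array depends on what is being converted. As a small sketch on a hypothetical toy frame (values made up for illustration): a single numeric column becomes a float array with missing entries as NaN, while a frame mixing numbers and strings falls back to a generic `object` array.

```python
import pandas as pd

# Hypothetical toy data: one numeric column with a missing value, one text column.
toy = pd.DataFrame({"age": [22.0, 38.0, None], "name": ["A", "B", "C"]})

ages_toy = toy["age"].to_numpy()  # 1-D float array, the missing value becomes NaN
mixed = toy.to_numpy()            # 2-D array; mixed column types give dtype object

print(ages_toy.shape, ages_toy.dtype)
print(mixed.shape, mixed.dtype)
```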
See more details here: 10 Minutes to pandas (actually it requires much more)
http://pandas.pydata.org/pandas-docs/stable/10min.html
Matplotlib
A workhorse of scientific visualization in Python.
from matplotlib import pyplot as plt
[deprecated] Set figure appearance in notebook (no pop up).
# %matplotlib inline
Seaborn
A high-level library for visualization and exploratory data analysis.
!pip install seaborn
import seaborn as sns
# sns.set() applies seaborn's default theme, which gives plots a more attractive color scheme
sns.set()
sns.catplot(x="pclass", kind="count", data=titanic)
sns.catplot(data=titanic, x="pclass", hue="sex", kind="count")
fg = sns.FacetGrid(titanic, hue="sex", aspect=3)
fg.map(sns.kdeplot, "age", fill=True)
fg.set(xlim=(0, 80));
fg = sns.FacetGrid(titanic, col="sex", row="pclass", hue="sex", height=2.5, aspect=2.5)
fg.map(sns.kdeplot, "age", fill=True)
fg.map(sns.rugplot, "age")
sns.despine(left=True)
fg.set(xlim=(0, 80));
See more examples of Seaborn visualizations for the Titanic dataset here
https://gist.github.com/mwaskom/8224591
Hands-on
- Upload data from the csv file
- Check column names
- Look for dependencies between features and the target vector
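One possible way to approach the three steps above is sketched below. It assumes a tiny inline CSV with made-up rows (read through `io.StringIO` so the example runs without a file); with real data, `pd.read_csv` would be given the file path instead. Group-wise means of the target are only one simple way to look for dependencies.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """survived,sex,pclass
1,female,1
0,male,3
1,female,2
0,male,3
0,male,1
1,female,1
1,male,2
0,female,3
"""

toy = pd.read_csv(io.StringIO(csv_text))  # step 1: load the data
print(list(toy.columns))                  # step 2: check column names

# Step 3: compare the mean of the target across values of each feature.
rate_by_sex = toy.groupby("sex")["survived"].mean()
rate_by_class = toy.groupby("pclass")["survived"].mean()
print(rate_by_sex)
print(rate_by_class)
```

Large differences between groups suggest that a feature carries information about the target; the seaborn count plots shown earlier give a visual version of the same comparison.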
Scikit-learn
A machine learning library
!pip install scikit-learn
from sklearn.neighbors import KNeighborsClassifier
Let’s do a little bit of processing to create some variables that might be more interesting to plot. Since this notebook is focused on visualization, we’re going to do this without much comment.
titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
titanic["sex"] = titanic.sex.map({"male":0, "female":1})
titanic = pd.get_dummies(titanic, dummy_na=True, columns=['embarked',])
titanic.head(6)
titanic.count()
titanic.dropna(inplace=True)
titanic.head(6)
titanic.count()
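To see what these preprocessing calls do, here is a minimal sketch on a hypothetical three-row frame (made-up values, same column names as above). It shows that `get_dummies` with `dummy_na=True` adds an extra indicator column for missing categories, and that `dropna` then removes only rows with missing values in the remaining columns.

```python
import pandas as pd

# Hypothetical toy data: the second row is missing both age and embarked.
toy = pd.DataFrame({
    "age": [22.0, None, 35.0],
    "sex": ["male", "female", "female"],
    "embarked": ["S", None, "C"],
})

toy["sex"] = toy["sex"].map({"male": 0, "female": 1})           # encode sex as 0/1
toy = pd.get_dummies(toy, dummy_na=True, columns=["embarked"])  # one indicator column per port, plus one for NaN
print(list(toy.columns))

clean = toy.dropna()  # the row with a missing age is dropped
print(len(toy), len(clean))
```

The missing `embarked` value no longer counts as missing after encoding, because it is represented by the `embarked_nan` indicator; the second row is dropped only because of its missing `age`.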
# extract X - features & y - targets
X = titanic.drop('survived', axis=1)
y = titanic.survived
Now it’s time to build a model
# initialize a classifier
clf = KNeighborsClassifier()
# train the classifier
clf.fit(X, y)
# calculate predictions
y_predicted = clf.predict(X)
# estimate accuracy
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
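Note that the accuracy above is measured on the same rows the classifier was trained on, so it tends to be optimistic. The sketch below illustrates this on synthetic data (random features generated here, not the Titanic table): a 1-nearest-neighbor model scores perfectly on its own training set, because each point is its own nearest neighbor, while a held-out split made with `train_test_split` gives a more realistic estimate.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (an assumption for illustration): the label depends noisily on the first feature.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Hold out a quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)

clf_toy = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
train_acc = np.mean(clf_toy.predict(X_train) == y_train)  # evaluated on the training rows
test_acc = np.mean(clf_toy.predict(X_test) == y_test)     # evaluated on unseen rows

print('Training accuracy: {}'.format(train_acc))
print('Held-out accuracy: {}'.format(test_acc))
```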
# you can also specify some parameters during initialization
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X, y)
y_predicted = clf.predict(X)
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
# you can also predict probabilities of belonging to a particular class
proba = clf.predict_proba(X)
proba_df = pd.DataFrame(proba, index=y.index, columns=[0, 1])
proba_df['true'] = y
fg = sns.FacetGrid(proba_df, hue="true", aspect=3)
# column 0 holds the predicted probability of class 0 (did not survive)
fg.map(sns.kdeplot, 0, fill=True)
plt.xlabel('Predicted probability of not surviving (class 0)')
plt.legend(['survived=0', 'survived=1'])
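When reading the output of `predict_proba`, it helps to know that its columns follow the order of `clf.classes_` and that each row sums to 1. A minimal sketch on hypothetical one-dimensional toy data (made-up points, not the Titanic features): for k-nearest neighbors with default uniform weights, each probability is the fraction of the k neighbors belonging to that class.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: three points of class 0 on the left, three of class 1 on the right.
X_toy = np.array([[0.0], [0.2], [0.4], [1.0], [1.2], [1.4]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_toy = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# Query one point deep inside class 0 and one near the boundary.
proba_toy = clf_toy.predict_proba(np.array([[0.1], [0.75]]))

print(clf_toy.classes_)       # column order of proba_toy
print(proba_toy)
print(proba_toy.sum(axis=1))  # each row sums to 1
```

For the query at 0.75, the three nearest training points are 1.0, 0.4 and 1.2, so two of three neighbors belong to class 1.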