!pip install numpy pandas seaborn matplotlib
01 Introduction to Python: EDA 101
import numpy as np
import pandas as pd
Pandas
The most useful and commonly used library for tabular data.
url = 'https://raw.github.com/mattdelhey/kaggle-titanic/master/Data/train.csv'
titanic = pd.read_csv(url)
titanic.info()
titanic
titanic.describe()
titanic.sort_values(by='age', ascending=False).head(5)
Indexing can be tricky.
titanic[['age', 'name']].head(5)
titanic.iloc[[2, 5, 6], 2:5]
type(titanic)
You can extract a numpy array
type(titanic.values)  # older accessor; pandas recommends to_numpy() instead
type(titanic.to_numpy())
ages = titanic.age.to_numpy()
ages.shape, ages.dtype
See more details here: 10 Minutes to pandas (actually it requires much more)
http://pandas.pydata.org/pandas-docs/stable/10min.html
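The remark above that indexing can be tricky mostly comes down to labels versus positions. Here is a minimal sketch on a made-up toy frame (the name `df` and its values are hypothetical, not taken from the Titanic data) where the index labels deliberately differ from the row positions:

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1
df = pd.DataFrame(
    {"age": [22.0, 38.0, 26.0, 35.0], "pclass": [3, 1, 3, 1]},
    index=[10, 20, 30, 40],
)

by_label = df.loc[20, "age"]      # .loc selects by index label and column name
by_position = df.iloc[1, 0]       # .iloc selects by integer position
label_slice = df.loc[10:30]       # label slices include the end label
position_slice = df.iloc[0:2]     # positional slices exclude the end position

print(by_label, by_position, len(label_slice), len(position_slice))
```

Both lookups reach the same cell, but the two slices return a different number of rows, which is a common source of off-by-one surprises.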
Matplotlib
A workhorse of scientific visualization in Python.
from matplotlib import pyplot as plt
[deprecated] Set figure appearance in notebook (no pop up).
# %matplotlib inline
Seaborn
A high-level library for visualization and exploratory data analysis.
!pip install seaborn
import seaborn as sns
# sns.set() switches plots to seaborn's more attractive default theme
sns.set()
sns.catplot(x="pclass", kind="count", data=titanic)
sns.catplot(data=titanic, x="pclass", hue="sex", kind="count")
fg = sns.FacetGrid(titanic, hue="sex", aspect=3)
fg.map(sns.kdeplot, "age", fill=True)
fg.set(xlim=(0, 80));
fg = sns.FacetGrid(titanic, col="sex", row="pclass", hue="sex", height=2.5, aspect=2.5)
fg.map(sns.kdeplot, "age", fill=True)
fg.map(sns.rugplot, "age")
sns.despine(left=True)
fg.set(xlim=(0, 80));
See more examples of Seaborn visualizations for the Titanic dataset here
https://gist.github.com/mwaskom/8224591
Hands-on
- Load data from the CSV file
- Check column names
- Look for dependencies between features and the target vector
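One possible way to walk through these three steps is sketched below. It assumes a tiny inline CSV (the `csv_text` string and the `passengers` name are hypothetical stand-ins for a real file) so that it runs without any download:

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk
csv_text = """survived,pclass,sex,age
1,1,female,38
0,3,male,22
1,3,female,26
0,1,male,54
0,3,male,35
1,2,female,27
"""

# Step 1: load the data
passengers = pd.read_csv(io.StringIO(csv_text))

# Step 2: check column names
column_names = list(passengers.columns)
print(column_names)

# Step 3: look for dependencies between features and the target
# Categorical feature: compare the mean of the target per group
survival_by_sex = passengers.groupby("sex")["survived"].mean()
print(survival_by_sex)

# Numeric features: linear correlation with the target
target_corr = passengers[["pclass", "age", "survived"]].corr()["survived"]
print(target_corr)
```

Group means and correlations are only a first look; plots such as the `catplot` calls above usually show the same dependencies more clearly.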
Scikit-learn
A machine learning library.
!pip install scikit-learn
from sklearn.neighbors import KNeighborsClassifier
Let’s do a little bit of processing to make some different variables that might be more interesting to plot. Since this notebook is focused on visualization, we’re going to do this without much comment.
titanic = titanic.drop(["name", "ticket", "cabin"], axis=1)
titanic["sex"] = titanic.sex.map({"male":0, "female":1})
titanic = pd.get_dummies(titanic, dummy_na=True, columns=['embarked',])
titanic.head(6)
titanic.count()
titanic.dropna(inplace=True)
titanic.head(6)
titanic.count()
# extract X - features & y - targets
X = titanic.drop('survived', axis=1)
y = titanic.survived
Now it’s time to build a model
# initialize a classifier
clf = KNeighborsClassifier()
# train the classifier
clf.fit(X, y)
# calculate predictions
y_predicted = clf.predict(X)
# estimate accuracy
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
# you can also specify some parameters during initialization
clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X, y)
y_predicted = clf.predict(X)
print('Accuracy of prediction is {}'.format(np.mean(y == y_predicted)))
# you can also predict probabilities of belonging to a particular class
proba = clf.predict_proba(X)
proba_df = pd.DataFrame(proba, index=y.index, columns=[0, 1])
proba_df['true'] = y
fg = sns.FacetGrid(proba_df, hue="true", aspect=3)
fg.map(sns.kdeplot, 0, fill=True)
plt.xlabel('Predicted probability of not surviving (class 0)')
plt.legend(['survived=0', 'survived=1'])
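The accuracies printed above are measured on the same rows the classifier was fitted on, so they are likely to be optimistic. A minimal sketch of a held-out check follows, assuming synthetic two-class data rather than the Titanic table (the names `X_demo`, `y_demo`, and `knn` are hypothetical). It also shows that the columns of `predict_proba` follow the order of `classes_`, so column 0 is the probability of class 0:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (hypothetical, not the Titanic table)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 2)),   # class 0 cloud
    rng.normal(2.0, 1.0, size=(100, 2)),   # class 1 cloud
])
y_demo = np.array([0] * 100 + [1] * 100)

# Keep 25% of the rows aside; the classifier never sees them during fit
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# With one neighbor, every training point is its own nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

train_acc = knn.score(X_train, y_train)   # accuracy on rows used for fitting
test_acc = knn.score(X_test, y_test)      # accuracy on held-out rows
print(f"train accuracy: {train_acc:.2f}, held-out accuracy: {test_acc:.2f}")

# Columns of predict_proba are ordered like knn.classes_
proba_demo = knn.predict_proba(X_test)
print(knn.classes_, proba_demo.shape)
```

The single-neighbor setting is chosen only to make the gap easy to see; with the default `n_neighbors` the training accuracy is less extreme, but the held-out number is still the more honest estimate.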