6 Preprocessing data
6.1 Joining / merging separate tables
import pandas as pd
merged_df = pd.merge(df1, df2, how="inner", on="reference_column")
More info: pandas.pydata.org
6.2 Missing & wrong data
Some algorithms assume that all features of all samples have numerical values. In these cases missing values have to be imputed (i.e. inferred) or (if affordable) the samples with missing feature values can be deleted from the data set.
Iterative imputer by sklearn
For features with missing values, this imputer infers the missing values by modelling each feature using the existing values of the other features. It runs several iterations until the results converge.
! This method scales with O(nd^3), where n is the number of samples and d is the number of features.
from sklearn.experimental import enable_iterative_imputer # necessary since the imputer is still experimental
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
rf_estimator = RandomForestRegressor(n_estimators=8, max_depth=6, bootstrap=True)
imputer = IterativeImputer(random_state=0, estimator=rf_estimator, max_iter=25)
imputer.fit_transform(X)
More info: scikit-learn.org
Median / average imputation
Simply replace missing values with the median or average of the feature:
import pandas as pd
"feature"] = df["feature"].fillna(df["feature"].median()) df[
dataset[, co_i] = ifelse(is.na(dataset[, co_i]),
                         ave(dataset[, co_i], FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset[, co_i])
Deleting missing values
import pandas as pd
="any") # how="all" would delete a sample if all values were missing df.dropna(how
More info: pandas.pydata.org
Deleting duplicate entries
Duplicate entries need to be removed (exception: time series) to avoid over-representation and leakage into the test set.
import pandas as pd
df.drop_duplicates(keep=False)
Replacing data
import pandas as pd
df.Col.apply(lambda x: 0 if x == 'zero' else 1)
Filtering out data
import pandas as pd
= df[(df["Feature1"] == 0) & (df["Feature2"] != 0)] df
6.3 Continuous data
Polynomial transform
A polynomial transform spreads out small and large values of a feature, helping the algorithm to distinguish cases. It can also combine two features to represent mutually reinforcing (interaction) effects.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
poly.fit_transform(df[["feature1", "feature2"]])
Reduce skew
Heavy skew in a distribution can be a problem for many models (outlier effects). To reduce it you can use a power transform to map the data to a Gaussian distribution…
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
pt.fit_transform(df[["skew_feature"]]) # expects 2D input, hence the double brackets
More info: scikit-learn.org
… or a quantile transform to map the data to a uniform (or Gaussian) distribution
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform") # alternative distribution: "normal"
qt.fit_transform(df[["skew_feature"]])
More info: scikit-learn.org
6.4 Categorical data
There are multiple ways to encode categorical data, especially non-vectorized data, to make it suitable for machine learning algorithms. The string values (e.g. “male”, “female”) of categorical features have to be converted into integers. This can be done by two methods:
Ordinal encoding
An integer is assigned to each category (e.g. “male”=0, “female”=1)
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(min_frequency=0.05)
# multiple columns can be transformed at once
ord_enc.fit(X)
X_transf = ord_enc.transform(X)
More info: scikit-learn.org
dataset$col = factor(dataset$col,
                     labels = c(1, 2, 3))
This method is useful when the categories have an ordered relationship (e.g. “bad”, “medium”, “good”). If this is not the case (e.g. “dog”, “cat”, “bunny”), ordinal encoding should be avoided, since the algorithm might deduce an ordered relationship where there is none. For these cases, one-hot encoding should be used.
For encoding the label of classification tasks, you can also use scikit-learn’s LabelEncoder. More info: scikit-learn.org
One-hot encoding / dummy variables
One-hot encoding assigns a separate feature column to each category and encodes it with binary values (e.g. if the sample is a dog, it has a 1 in the dog column and 0 in the cat and bunny columns).
from sklearn.preprocessing import OneHotEncoder
onehot_enc = OneHotEncoder(handle_unknown='ignore')
onehot_enc.fit(X)
onehot_enc.transform(X)
More info: scikit-learn.org
import pandas as pd
= ["Sex", "Type"], drop_first=True) pd.get_dummies(X, columns
More info: pandas.pydata.org
Discretizing / binning data
You can discretize features and targets from continuous to discrete/categorical (e.g. age in years to child, teenager, adult, elderly).
"Age"], bins=[0,12, 20, 65, 150], labels =["child", "teenager", "adult", "elderly"]) pd.cut(df[
More info: pandas.pydata.org
Pros:
It makes sense for the specific problem (e.g. targeting groups for marketing).
Improved signal-to-noise ratio (bins work like regularization).
A possibly highly non-linear relationship between the continuous feature and the target is hard for the model to learn.
Better interpretability of features, results and model.
Can be used to incorporate domain knowledge and make learning easier.
Cons:
Your model and results lose information.
Senseless cut-offs between bins can create “artificial noise” and make learning harder.
More info: stackexchange.com
See also: wikipedia: Sampling (signal processing).
Combining rare categories
Rare categories can lead to noise in the data and blow up the number of features when using one-hot encoding. These categories should be combined when there are only a few occurrences (e.g. when analysing page visits, combine the categories “blackberry”, “jolla”, “windows phone” into the category “other”).
import pandas as pd
import numpy as np
counts_ser = pd.value_counts(df["feature"])
categories_to_mask = counts_ser[(counts_ser/counts_ser.sum()).lt(0.05)].index # using 5% cut-off
df["feature"] = np.where(df["feature"].isin(categories_to_mask), 'other', df["feature"])
More info: stackoverflow
In sklearn, rare categories can be grouped into an infrequent category when one-hot encoding the feature, using the parameter min_frequency.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore', min_frequency=0.05)
enc.fit_transform(df[["feature"]])
More info: scikit-learn.org
In PyCaret, use the parameter rare_to_value of the setup function.
from pycaret.classification import ClassificationExperiment # or use another Experiment type
exp = ClassificationExperiment()
exp.setup(train_df, target="Sales", rare_to_value=0.05)
More info: PyCaret Docs
6.5 Date- and time-data
You can convert to the datetime format as follows:
import pandas as pd
pd.to_datetime(df.date_col, infer_datetime_format=True)
You create columns for year, month, day like this:
import pandas as pd
df['year'] = df.Date.dt.year
df['month'] = df.Date.dt.month
df['day'] = df.Date.dt.day
6.6 Graph representation of data
The similarity/distance between points can be represented in graphs. The data points are represented as nodes, the distances/similarities as edges.
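As an illustrative sketch, such a k-nearest-neighbour graph can be built with scikit-learn’s kneighbors_graph (the dummy data X and the choice of 3 neighbours are arbitrary assumptions):
from sklearn.neighbors import kneighbors_graph
import numpy as np

X = np.random.rand(20, 3) # 20 samples with 3 features (dummy data, assumption)
adjacency = kneighbors_graph(X, n_neighbors=3, mode="distance") # sparse adjacency matrix of the graph
print(adjacency.toarray()) # entry (i, j) holds the distance from node i to node j (0 = no edge)
The resulting sparse adjacency matrix can be fed to graph-based algorithms (e.g. spectral clustering) or converted into a graph object with a library such as networkx.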
6.7 Text data
These are the common steps of pre-processing text data:
Cleaning text data
The aim is to remove errors, parts that are irrelevant for the task and to standardize.
The clean-text package only requires one command for several cleaning tasks:
Install the package:
pip install clean-text
Usage (see steps in parameters):
from cleantext import clean
clean("some input",
      fix_unicode=True,               # fix various unicode errors
      to_ascii=True,                  # transliterate to closest ASCII representation
      lower=True,                     # lowercase text
      no_line_breaks=False,           # fully strip line breaks as opposed to only normalizing them
      no_urls=False,                  # replace all URLs with a special token
      no_emails=False,                # replace all email addresses with a special token
      no_phone_numbers=False,         # replace all phone numbers with a special token
      no_numbers=False,               # replace all numbers with a special token
      no_digits=False,                # replace all digits with a special token
      no_currency_symbols=False,      # replace all currency symbols with a special token
      no_punct=False,                 # remove punctuations
      replace_with_punct="",          # instead of removing punctuations you may replace them
      replace_with_url="<URL>",
      replace_with_email="<EMAIL>",
      replace_with_phone_number="<PHONE>",
      replace_with_number="<NUMBER>",
      replace_with_digit="0",
      replace_with_currency_symbol="<CUR>",
      lang="en"                       # set to 'de' for German special handling
)
# or simply:
clean("some input", all=True)
# use within pandas:
import cleantext
df["text"] = df["text"].apply(lambda txt: cleantext.clean_words(txt))
The command clean_words additionally returns the words as a list.
More info:
aim - Guide to CleanText
GitHub - clean-text repo
import pandas as pd
import re
"text"] = df["text"].str.lower() # make all words lowercase
df["text"] = df["text"].str.replace('ü', 'u') # replace characters
df["text"] = df["text"].str.replace(r"https?:\/\/.\S+","", regex = True) # remove URLs
df["text"] = df["text"].str.replace(r"<.*?>","", regex = True) # remove html-tags
df[
# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

df["text"] = df["text"].apply(lambda text: remove_emoji(text))
"text"] = df["text"].str.strip() # strip away leading and trailing spaces
df["text"] = df["text"].str.replace(r"[^\w\s]", "", regex = True) # remove punctuation
df[
# Rarely used
df["text"] = df["text"].str.lstrip("123456789") # strip away leading numbers (use rstrip for trailing numbers; any combination of the given characters will be stripped)
df["text"] = df["text"].str.replace(r"\(.*?\)", "", regex=True) # remove everything between brackets
df["year"] = df["year"].str.extract(r'^(\d{4})', expand=False) # extract year numbers
Tokenization
Tokenization is the act of splitting a text into sentences or words (i.e. tokens).
Word-Tokenization
Split the text into words:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
words = word_tokenize(cleaned_text)
SpaCy uses a sophisticated text annotation method.
- Download trained English linguistic annotation model:
!python -m spacy download en_core_web_sm
- Tokenize text:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text_doc)
tokens = [(token.text, token.pos_, token.dep_) for token in doc]
Attributes:
pos_: Part-of-speech (e.g. noun, adjective, punctuation)
dep_: Syntactic dependency relation (e.g. “Does … have” \rightarrow Does (auxiliary verb), have (root verb))
More info:
SpaCy - Features
Sentence Tokenization
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(sentences_text)
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(text_doc)
sentences = [sent for sent in doc.sents]
More info: Tutorial on SpaCy Sentencer
Vectorization
Transform a sequence of tokens into a numerical vector that can be processed by models.
Word count encoding
This is part of the bag-of-words method. It works as follows:
Create a vocabulary / corpus of all words in the training data.
Each word in the vocabulary becomes its own feature.
For each document, count how many times the word occurs.
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
word_counts = count_vect.fit_transform(doc_array)
More info: sklearn - extract features from text
Pros:
- Simple and easily interpretable.
Cons:
Order and relation between words are lost.
Sparse representation is not easily usable for many models. (Large vocabularies make it worse \rightarrow use stemming, see the sketch below.)
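A minimal stemming sketch with NLTK's PorterStemmer (the example words are arbitrary assumptions):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer() # rule-based stemmer for English
words = ["running", "runs", "runner", "easily"] # example words (assumption)
stems = [stemmer.stem(w) for w in words] # "running" and "runs" both map to the stem "run", shrinking the vocabulary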
Term frequency-inverse document frequency (tf-idf)
This measure reflects the importance of a word to a document:
Term frequency: how frequent the word is within this document.
Inverse document frequency: how rare the word is across all documents.
Thus, terms that occur a lot in one document but rarely in others get a higher value.
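As a hedged illustration, a common textbook formulation of the score for term t in document d, with N documents in total (note that scikit-learn's TfidfTransformer additionally applies smoothing and normalization by default), is:
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\left(\frac{N}{\mathrm{df}(t)}\right)
where \mathrm{tf}(t, d) is the count of term t in document d and \mathrm{df}(t) is the number of documents containing t.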
from sklearn.feature_extraction.text import TfidfTransformer
tf_transformer = TfidfTransformer()
word_tf_idfs = tf_transformer.fit_transform(word_counts) # uses word counts from the CountVectorizer
More info: sklearn - extract features from text
Padding
Since some sequences are shorter than others, we fill up the remaining parts of the shorter ones with zeros, so that all sequences have the same length. First we need to make an ordinal encoding and create word sequences.
from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
# Convert text to sequences
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(train_texts)
X_train_sequences = tokenizer.texts_to_sequences(train_texts)
# Pad the sequences
train_texts_padded = pad_sequences(X_train_sequences, padding='post', maxlen=int(max_sequence_length * 1.5))
Embedding
Embedding is the mapping of words from the sparse one-hot-encoded space into a dense space that should reflect the meaning of the words (i.e. similar words are close together).
This is done in neural networks via an embedding layer:
from keras import Sequential, layers

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=sequence_length))
# ... add further layers ...
model.compile()
You can reuse trained embeddings for other tasks (see transfer learning). More info: Google Machine Learning - Prepare Your Data
6.8 Image data
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(samplewise_std_normalization=True,
                             rotation_range=180,
                             shear_range=20,
                             zoom_range=0.1,
                             horizontal_flip=True,
                             vertical_flip=True,
                             validation_split=0.7)
imgs_train = datagen.flow_from_directory(directory="data/dir",
                                         target_size=(256, 256),
                                         batch_size=32,
                                         class_mode="categorical", # classes will be determined from subdirectory names
                                         subset="training")
imgs_test = datagen.flow_from_directory(directory="data/dir",
                                        target_size=(256, 256),
                                        batch_size=32,
                                        class_mode="categorical", # classes will be determined from subdirectory names
                                        subset="validation")
More info: keras.io
Load image dataset
from tensorflow.keras.utils import image_dataset_from_directory

imgs_train, imgs_test = image_dataset_from_directory(directory="path/tofolder",
                                                     labels="inferred",
                                                     label_mode="int",
                                                     color_mode="rgb",
                                                     image_size=(256, 256),
                                                     validation_split=0.7,
                                                     subset="both")
Load a single image
from tensorflow.keras.utils import load_img, img_to_array

img = load_img(path="path/toimg.png", grayscale=False, color_mode="rgb", target_size=(256, 256))
img = img_to_array(img)
More info: keras.io
Augmentation
from tensorflow.keras import layers, Sequential
import numpy as np

data_augmentation = Sequential([
    layers.Rescaling(1./255),
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
])

images = []
for idx in range(10):
    augmented_image = data_augmentation(img)
    images.append(augmented_image)
img_ar = np.array(images)
More info: keras.io
# Adapted from https://github.com/bnsreenu
import os
import numpy as np
import glob # To go through folders
import cv2

train_split = 0.7
img_size = 256

images_train = []
images_test = []
labels_train = []
labels_test = []

# img_dir is assumed to be a list of class sub-directories, e.g. glob.glob("data/dir/*")
for dir_path in img_dir:
    label = dir_path.split("/")[-1]
    print(label)
    img_paths = glob.glob(os.path.join(dir_path, "*.jpeg"))
    for img_idx, img_path in enumerate(img_paths):
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        img = cv2.resize(img, (img_size, img_size))
        img = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
        img = img/(255/2) - 1 # scales 0 to 255 range to -1 to 1 (more or less zero-centered)
        if img_idx < train_split * len(img_paths):
            images_train.append(img)
            labels_train.append(label)
            # flip image horizontally:
            images_train.append(cv2.flip(img, 1))
            labels_train.append(label)
            # flip image vertically:
            images_train.append(cv2.flip(img, 0))
            labels_train.append(label)
        else:
            images_test.append(img)
            labels_test.append(label)

images_train = np.array(images_train)
labels_train = np.array(labels_train)
images_test = np.array(images_test)
labels_test = np.array(labels_test)
7 Standardization
Many machine learning models assume that the features are centered around 0 and that all have a similar variance. Therefore the data has to be centered and scaled to unit variance before training and prediction.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(input_df)
More info: scikit-learn.org
library(caret)
preProcessor = preProcess(training_set, method = c("center", "scale"))
train_set_transformed = predict(preProcessor, training_set)
test_set_transformed = predict(preProcessor, test_set)
More info: caret documentation: centering and scaling
Another option for scaling is normalization. This is used when the values have to fall strictly between a minimum and a maximum value.
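A minimal sketch, assuming min-max scaling to a fixed range is the kind of normalization meant here (the range (0, 1) is an assumption; input_df follows the example above):
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1)) # scaled values fall between 0 and 1
scaler.fit_transform(input_df)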
More info: scikit-learn.org
8 Splitting in training- and test-data
You need to split your data into training and test samples. The algorithm uses the training samples with the known label/target value for fitting the parameters. The test set is used to determine whether the trained algorithm also performs well on new samples. You need to give special consideration to the following points:
Avoid leaking data or other information from the training set into the test set.
Validate whether the predictive performance deteriorates over time (i.e. whether the algorithm performs worse on newer samples). This is especially important for models that make predictions for future events (see the chronological split sketch below).
Conversely, sample the test and training sets randomly to avoid introducing bias into the two sets.
# assuming you already imported the data and separated the label column:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
More info: scikit-learn.org
library(caTools)
set.seed(42)
split = sample.split(dataset$label_col, SplitRatio = 0.8) # generates a vector with TRUE and FALSE entries
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
9 Feature selection
Usually the label does not depend on all available features. To detect causal features, remove noisy ones, and reduce the training and running costs of the algorithm, we reduce the number of features to the relevant ones. This can be done a priori (before training) or using wrapper methods (integrated with the prediction algorithm to be used).
! There are methods that have feature selection already built-in, such as decision trees.
9.1 A priori feature selection
A cheap method is to remove all features with variance below a certain threshold.
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1)
selector.fit_transform(X)
More info: scikit-learn.org
Mutual information score
This method works by choosing the features that have the highest mutual dependency with the label:
I(X; Y) = D_{KL}\left( P_{(X,Y)} \,\|\, P_X \otimes P_Y \right) = \sum_{y \in Y} \sum_{x \in X} P(X=x, Y=y) \log\left(\frac{P(X=x, Y=y)}{P(X=x)\,P(Y=y)}\right)
where D_{KL} is the Kullback–Leibler divergence (a measure of how much one distribution differs from another). The \log-term quantifies how different the joint distribution is from the product of the marginal distributions.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_classif # for regression use mutual_info_regression
X_new = SelectKBest(mutual_info_classif, k=8).fit_transform(X, y)
More info: scikit-learn.org
wikipedia.org/wiki/Mutual_information
9.2 Wrapper methods
Using greedy feature selection as a wrapper method, one commonly starts with zero features and repeatedly adds the feature that yields the highest score with the chosen classifier.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
selector = SequentialFeatureSelector(classifier, n_features_to_select=8)
selector.fit_transform(X, y)
More info: scikit-learn.org