14  Checklists

15 Tips for machine learning projects

15.1 General advice

General advice for machine learning from Pedro Domingos (“A Few Useful Things to Know About Machine Learning”):

  • Let your knowledge about the problem guide your choice of candidate algorithms. E.g. if you know by which rules comparing samples makes the most sense, choose instance-based learners; if you know that statistical dependencies are relevant, choose graphical models.

  • Don’t underestimate the impact of feature engineering: a few good domain-specific features can boost accuracy considerably.

  • Get more samples and candidate features (instead of focusing only on the algorithm).

  • Don’t confuse correlation with causation. Just because your model can predict something does not mean that the features cause the target, so you cannot easily derive a clear action from the prediction.

15.2 Common mistakes

Be aware: This list will never capture everything that can go wrong. ;-)

  • Data Leakage: Information from samples in your test data has leaked into your training data. Typical causes:

    • You have not deleted duplicates beforehand
    • You falsely assumed that your samples were drawn independently and sampled the training set randomly (e.g. multiple samples from the same patient, time series data).
    • You have the class label encoded in the training features in a way that you will not find in “Nature”.
    • You just used the wrong training / test set while programming.
    • You did feature engineering, such as extracting n-grams or computing the max/min of a feature, using your test-set data.
    • Remedy: Careful preliminary data analysis, deduplication, and splits that respect dependencies between samples; see the sketch below.
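
A minimal sketch of such a leakage-aware split, assuming scikit-learn and pandas; the DataFrame and its `patient_id` column are made up for illustration:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: several samples per patient (hypothetical column names).
df = pd.DataFrame({
    "patient_id": [1, 1, 2, 2, 3, 3, 4, 4],
    "feature":    [0.1, 0.2, 0.9, 0.8, 0.4, 0.5, 0.3, 0.7],
    "label":      [0, 0, 1, 1, 0, 0, 1, 1],
}).drop_duplicates()  # exact duplicates would otherwise leak across splits

# Keep all samples of one patient in the same split instead of
# sampling rows independently.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
assert set(train["patient_id"]).isdisjoint(set(test["patient_id"]))
```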
  • Using the wrong quality measures on unbalanced data: E.g. accuracy on unbalanced data is not a reasonable quality measure, because a model that always predicts the majority class already scores high; the sketch below illustrates this.
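
A minimal sketch with scikit-learn: on 90:10 data, a “model” that always predicts the majority class reaches 90% accuracy, while balanced accuracy and F1 expose that it is useless.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 90:10 class imbalance; the "model" always predicts class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy_score(y_true, y_pred))             # 0.90 -- looks great
print(balanced_accuracy_score(y_true, y_pred))    # 0.50 -- chance level
print(f1_score(y_true, y_pred, zero_division=0))  # 0.00 -- useless on class 1
```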

  • Inconsistent preprocessing: If you preprocess your training data in a certain way, you have to do the same with the test and prediction data.

    • Remedy: Use one preprocessing pipeline that you can use for training, testing and prediction, as in the sketch below.
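
A minimal sketch of such a pipeline with scikit-learn; the dataset and model are placeholders. The scaler is fit on the training data only, and the identical fitted transformation is then applied to the test data inside the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# fit() fits the scaler on the training data only; score() reuses the
# same fitted scaler on the test data -- no inconsistent preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```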
  • Curse of dimensionality:

    • You use too many features for the number of samples that you have
    • Your distance measure is not suitable for high-dimensional spaces (e.g. Hamming distance, Euclidean distance)
    • Remedy: Use a lower-dimensional mapping or feature selection, as sketched below.
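
A minimal sketch of both remedies with scikit-learn, on made-up data with far more features than samples:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical setting: 100 samples but 500 features.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=10, random_state=0)

X_pca = PCA(n_components=10).fit_transform(X)             # lower-dimensional mapping
X_sel = SelectKBest(f_classif, k=10).fit_transform(X, y)  # feature selection

print(X_pca.shape, X_sel.shape)  # (100, 10) (100, 10)
# Note: in a real project, fit these on the training data only
# (inside a pipeline) to avoid the data leakage described above.
```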
  • Overfitting:

    • You use an algorithm that is too complex (too many degrees of freedom) for the amount of data you have
    • You have too many features
    • Remedy: Get more samples, reduce the dimensionality, use feature selection, regularization, bagging, boosting or stacking; see the sketch below.
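
A minimal sketch of regularization with scikit-learn: a degree-15 polynomial fit on 30 noisy samples of a made-up sine curve, once almost unregularized and once with an L2 penalty.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)

# Degree 15 gives far too many degrees of freedom for 30 samples;
# the regularized model generalizes better in cross-validation.
for alpha in (1e-6, 1.0):
    model = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=alpha))
    print(alpha, cross_val_score(model, X, y, cv=5).mean())
```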
  • Bad Data:

    • Your data is not representative of what you would find in the “real world” (skewed population, outdated data, data from only specific sensors or locations, …).
    • You have many missing values among your features.
    • The data that you have is only remotely linked to the target that you want to predict.
    • There are erroneous entries in your data.
    • Remedy: Clean data at the source, impute missing values (as sketched below), clean data during preprocessing, get more representative data, limit the scope of application.
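
A minimal sketch of imputation with scikit-learn, on a made-up array with missing values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

# Replace missing values with the per-column mean; in a real project,
# fit the imputer on the training data only.
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(X))
```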

16 Data Import Checklist

  • Was the data-set correctly imported?

    • The column names (header) have not been imported as the first row of values.

    • A trailing comment has not been imported as the last row of values.

  • Are the sample values what you expect?

  • Are the columns stored with correct and efficient data types?

    • Has there been a shift of data between columns / rows?

    • Are there strings in a column for numerical values?

  • Is the range what you expect?

    • Are there heavy outliers?

    • Is the data biased towards certain values?

  • How many empty values are there?
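
Most of these checks can be run with a few pandas calls; a minimal sketch, with made-up CSV content standing in for a real file (note the “sixty” entry that silently turns `income` into a string column):

```python
import io
import pandas as pd

csv = io.StringIO("age,income\n42,50000\n38,sixty\n,70000\n")
df = pd.read_csv(csv)

print(df.head())        # sample values as expected? shifted columns?
print(df.dtypes)        # strings in a numeric column show up as 'object'
print(df.describe())    # ranges, heavy outliers, biased values
print(df.isna().sum())  # number of empty values per column
```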

17 Feature selection & engineering checklist

Selected advice from the paper “An Introduction to Variable and Feature Selection” by Guyon and Elisseeff:

  • If you have domain knowledge: Use it.

  • Are your features commensurate (on comparable scales)? If not, normalize them.

  • Do you suspect interdependent features? Construct conjunctive features or products of features, as sketched below.
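
A minimal sketch of both points with scikit-learn, on made-up data: normalize the features first, then add their pairwise products as conjunctive features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])

X_norm = StandardScaler().fit_transform(X)  # make the features commensurate

# interaction_only=True adds only products of distinct features (x1*x2, ...).
X_prod = PolynomialFeatures(degree=2, interaction_only=True,
                            include_bias=False).fit_transform(X_norm)
print(X_prod.shape)  # the original features plus their pairwise product
```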

Other advice:

  • Features that are useless on their own can be useful in combination with other features.

  • Using multiple redundant variables can be useful to reduce noise.

  • There are also models (e.g. lasso regression, decision trees) that have feature selection built in, e.g. by allowing only a certain number of features to be used or by penalizing the use of additional features; the sketch below shows this for the lasso.
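
A minimal sketch of this built-in selection for lasso regression with scikit-learn, on made-up data where only a few features carry signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 50 candidate features, of which only 5 are informative.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# The L1 penalty drives most coefficients to exactly zero,
# i.e. the model selects features while fitting.
lasso = Lasso(alpha=1.0).fit(X, y)
print(np.sum(lasso.coef_ != 0), "of", len(lasso.coef_), "features kept")
```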