Automated machine learning (AutoML) automates the task of applying machine learning to problems. I have decided to create my own automated learning system, and to begin this journey I have started working on feature engineering. In this Kaggle notebook I went through creating generic functions to handle the feature engineering of a dataset: https://www.kaggle.com/taranmarley/automl-from-scratch-1
In traditional machine learning, a lot of the heavy lifting is done by the person analyzing the data. There is nothing wrong with this approach, and a human’s knowledge of context can be of great benefit. That said, the same person can also introduce bias into the analysis based on their existing beliefs. This means automated learning provides a benefit even to experts by allowing an unbiased baseline to be created.
The linked notebook first goes through how to detect NaN values in the dataset. Many machine learning algorithms cannot handle these values, so we need to detect them before we can deal with them. I created a generic function for this in the notebook.
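The notebook has the full version. As a rough illustration only, a minimal sketch of a NaN detection function might look like the following, assuming the data is held in a pandas DataFrame. The function name `find_nan_columns` and the sample data are hypothetical and are not taken from the notebook.

```python
import numpy as np
import pandas as pd


def find_nan_columns(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column.

    Only columns that contain at least one missing value are returned.
    (Hypothetical helper, written for illustration.)
    """
    counts = df.isna().sum()
    return counts[counts > 0]


# Made-up sample data: "age" and "name" each have one missing entry.
example = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0],
        "name": ["Ann", "Bob", None],
        "fare": [7.25, 8.05, 10.5],
    }
)

nan_counts = find_nan_columns(example)
print(nan_counts)
```

Returning counts rather than a plain list of column names keeps the information about how much of each column is missing, which may matter when deciding how to handle it later.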
The NaN detection function is an example of the sort of documented generic function that makes up my initial work on this future AutoML library. My next step is to deal with these NaN values by first creating new columns that indicate where the dataset originally had no value in a row, and then replacing the missing value with zero. Picture a table with an [age] column: where there is no entry for age, the value is replaced with 0, and a new column [age_was_null] is created containing values of 0 and 1. Where no age was recorded, this column will show 1.
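The [age] and [age_was_null] example can be sketched in a few lines of pandas. This is a minimal illustration under my own assumptions, not the notebook's code: the function name `flag_and_fill_nans` and its `fill_value` parameter are hypothetical.

```python
import numpy as np
import pandas as pd


def flag_and_fill_nans(df: pd.DataFrame, fill_value=0) -> pd.DataFrame:
    """Add a <column>_was_null indicator for each column with missing
    values, then replace the missing values with fill_value.

    (Hypothetical helper, written for illustration.)
    """
    out = df.copy()
    for col in df.columns:
        missing = df[col].isna()
        if missing.any():
            # 1 where the original value was missing, 0 otherwise.
            out[f"{col}_was_null"] = missing.astype(int)
            out[col] = df[col].fillna(fill_value)
    return out


people = pd.DataFrame({"age": [22.0, np.nan, 35.0]})
filled = flag_and_fill_nans(people)
print(filled)
```

Keeping the indicator column means the model can still tell a real 0 apart from a value that was filled in.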
A similar process is followed to get rid of duplicates. Here, id columns need to be excluded from the comparison because they are always unique, and the remaining columns are then checked for duplicate rows. If desired, these duplicates can be removed automatically so that only the first instance of each is kept.
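As a sketch of that idea, assuming the id columns are known in advance and passed in by name, duplicate handling could look something like this. The function name `drop_duplicate_rows` and its parameters are hypothetical.

```python
import pandas as pd


def drop_duplicate_rows(df: pd.DataFrame, id_columns=(), drop=True):
    """Find rows that are duplicates once the id columns are ignored.

    Returns the (optionally de-duplicated) frame and the number of
    duplicate rows found. (Hypothetical helper, written for illustration.)
    """
    compare_cols = [c for c in df.columns if c not in id_columns]
    if not compare_cols:
        raise ValueError("No columns left to compare after removing ids.")
    # keep="first" marks every repeat after the first occurrence.
    is_dup = df.duplicated(subset=compare_cols, keep="first")
    n_dups = int(is_dup.sum())
    if drop:
        return df[~is_dup].reset_index(drop=True), n_dups
    return df, n_dups


records = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["a", "b", "a"],
        "age": [30, 40, 30],
    }
)

deduped, n_dups = drop_duplicate_rows(records, id_columns=["id"])
print(n_dups)
print(deduped)
```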
The next step was breaking up columns by string. This matters for columns like Name, where splitting the value can isolate the family name and so yield extra information. Take my name: Taran Marley. The first part isn’t generally useful, but the last name can give family information.
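A minimal sketch of splitting a string column, assuming a single separator and pandas string methods, might look like this. The function name `split_string_column` and the `_part_N` column naming are my own assumptions for illustration.

```python
import pandas as pd


def split_string_column(df: pd.DataFrame, column: str, separator: str = " ") -> pd.DataFrame:
    """Split a string column on a separator and append one new column
    per resulting part. (Hypothetical helper, written for illustration.)
    """
    parts = df[column].str.split(separator, expand=True)
    parts.columns = [f"{column}_part_{i}" for i in range(parts.shape[1])]
    return pd.concat([df, parts], axis=1)


names = pd.DataFrame({"Name": ["Taran Marley", "Alex Marley"]})
split_names = split_string_column(names, "Name")
print(split_names)
```

Rows with fewer parts than the longest row end up with missing values in the extra columns, so this step pairs naturally with the NaN handling described earlier.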
The above are examples of what can be achieved, but there is much more in the notebook. It also goes through label encoding, where we turn the object or string columns of our data into numerical ones that machine learning methods can understand. In this way, names might be changed into categorical numbers.
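Label encoding can be sketched with `pd.factorize`, which assigns an integer code to each distinct value in order of first appearance. This is only one way to do it and is not necessarily what the notebook uses; the function name `label_encode_columns` is hypothetical, and treating every non-numeric column as categorical is an assumption made to keep the example short.

```python
import pandas as pd


def label_encode_columns(df: pd.DataFrame):
    """Replace each non-numeric column with integer codes.

    Returns the encoded frame and a dict mapping each encoded column to
    its list of original values (the position in the list is the code).
    (Hypothetical helper, written for illustration.)
    """
    out = df.copy()
    mappings = {}
    for col in df.columns:
        # Assumption: anything that is not numeric is treated as a category.
        if not pd.api.types.is_numeric_dtype(df[col]):
            codes, uniques = pd.factorize(df[col])
            out[col] = codes  # missing values receive the code -1
            mappings[col] = list(uniques)
    return out, mappings


passengers = pd.DataFrame(
    {"surname": ["Marley", "Smith", "Marley"], "age": [30, 40, 50]}
)

encoded, mappings = label_encode_columns(passengers)
print(encoded)
print(mappings)
```

Keeping the mappings around makes it possible to apply the same codes to new data or to translate predictions back into the original labels.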
I also transform the feature data into uniform distributions. This can help reduce the ability of a machine learning algorithm to latch onto statistical noise and end up guessing based on information that isn’t really there or doesn’t apply to data in the wider world. It must be remembered that the goal of machine learning isn’t simply to create a model that works well on the data you already know, but to create a general solution that maintains its accuracy on unseen data.
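One simple way to map a numeric feature onto a roughly uniform distribution is to replace each value with its percentile rank. The sketch below does that with pandas; it is an illustration of the general idea rather than the notebook's method, and the function name `to_uniform` is hypothetical.

```python
import pandas as pd


def to_uniform(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Replace numeric values with their percentile rank in (0, 1].

    Ties share the average of their ranks. (Hypothetical helper,
    written for illustration.)
    """
    out = df.copy()
    if columns is None:
        columns = out.select_dtypes(include="number").columns
    for col in columns:
        out[col] = out[col].rank(method="average", pct=True)
    return out


# Made-up data with one extreme value.
skewed = pd.DataFrame({"fare": [1.0, 2.0, 4.0, 1000.0]})
uniform = to_uniform(skewed)
print(uniform)
```

After the transform the extreme value sits only one step above its neighbour, so it no longer dominates the scale. A rank computed this way depends on the rows it was computed from, so for unseen data a fitted transformer such as scikit-learn's `QuantileTransformer` is likely a better fit.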
I had a lot of fun working on the feature engineering part of this project and look forward to doing more work on it. In future updates I want to take the feature engineering further by moving on to a custom boosting model and generating interactions. I can also move on to the two other sections of the AutoML system I would like to build: Exploratory Data Analysis and Model Training.