Car Auction Algorithm

In any market where goods are bought and sold, there is always a chance that a purchased item will be returned. Returns reduce trust in the organisation that sold the item and add logistics costs that the seller has to cover. If too many items are returned, the cost to the organisation becomes too high. In this project, I looked into factors that may lead to cars bought at auction being returned, and then built a machine learning model to predict which cars would be a bad buy.

Car-Auction: Text

Machine learning process

Data content

The dataset covers about 48,000 cars. For each car it records the make, model, country of manufacture, auction location, year of production, vehicle age, whether the sale was online, the price at the time of sale, the location of the buyer the car was sent to, and whether the car was returned after being bought.

Data cleaning and wrangling

The data was loaded into a pandas DataFrame. The mean, median and mode of the numerical columns were checked, along with the unique values of the categorical variables.
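The checks described above can be sketched as follows. This is a minimal illustration on made-up data: the column names (`vehicle_age`, `price`, `make`, `is_returned`) and the values are assumptions, not the real dataset's schema.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
cars = pd.DataFrame({
    "vehicle_age": [3, 5, 5, 7, 2],
    "price": [7000.0, 5500.0, 5200.0, 3900.0, 8800.0],
    "make": ["FORD", "DODGE", "FORD", "CHEVROLET", "FORD"],
    "is_returned": [0, 0, 1, 1, 0],
})

numeric_cols = ["vehicle_age", "price"]
categorical_cols = ["make"]

# Mean and median of each numerical column.
summary = cars[numeric_cols].agg(["mean", "median"])

# mode() can return several rows when values tie; keep the first row.
modes = cars[numeric_cols].mode().iloc[0]

# Unique values of each categorical column.
unique_values = {col: cars[col].unique().tolist() for col in categorical_cols}

print(summary)
print(modes)
print(unique_values)
```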

Rows with missing data were removed, as were duplicate rows. Columns that hold categorical information but were stored as numbers were converted to a categorical type.
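A minimal sketch of these cleaning steps, again on made-up data. The column `wheel_type_id` stands in for any numeric code that is really a category; the names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one duplicate row.
raw = pd.DataFrame({
    "vehicle_age": [3, 5, 5, np.nan, 2],
    "wheel_type_id": [1, 2, 2, 1, 3],  # numeric codes that are really categories
    "is_returned": [0, 1, 1, 0, 0],
})

cleaned = (
    raw.dropna()            # drop rows with any missing value
       .drop_duplicates()   # drop exact duplicate rows
       .reset_index(drop=True)
)

# Treat the numeric code as a categorical variable.
cleaned["wheel_type_id"] = cleaned["wheel_type_id"].astype("category")

print(cleaned.dtypes)
```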

Data analysis and visualisation

Using the matplotlib and pandas packages for Python, I plotted the percentage of cars returned against vehicle age and against country of manufacture, and looked for relationships between the other variables and the likelihood of a car being returned.
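As an illustration of this kind of plot, the sketch below computes the percentage of cars returned for each vehicle age and draws it as a bar chart. The data and column names are invented for the example, and the bar chart is one reasonable choice rather than a record of the plots actually produced.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data; 1 means the car was returned.
cars = pd.DataFrame({
    "vehicle_age": [2, 2, 4, 4, 4, 6, 6, 6, 6, 2],
    "is_returned": [0, 0, 0, 1, 0, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction returned; multiply by 100 for a percentage.
pct_returned = cars.groupby("vehicle_age")["is_returned"].mean() * 100

fig, ax = plt.subplots()
ax.bar(pct_returned.index.astype(str), pct_returned.values)
ax.set_xlabel("Vehicle age (years)")
ax.set_ylabel("Cars returned (%)")
ax.set_title("Return rate by vehicle age")
# Call fig.savefig("returned_by_age.png") or plt.show() to view the chart.
```

The same `groupby` pattern works for country of manufacture or any other categorical column.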

Preprocessing and model building

The data was separated into input columns and a target column, the target being the column that records whether each car was returned. It was then split into a training set and a validation set in a 70:30 ratio. Because of missing values in the numerical columns, SimpleImputer was used to fill them with the mean of the column. The numerical columns were then rescaled so that larger numbers do not get an edge in the algorithm. Note that scaling to the 0 to 1 range corresponds to scikit-learn's MinMaxScaler, whereas StandardScaler rescales each column to zero mean and unit variance. Lastly, since machine learning algorithms cannot work directly with string values, the categorical variables were encoded into 1s and 0s using OneHotEncoder.
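The preprocessing steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions: the data and column names are invented, `MinMaxScaler` is used because it matches the 0 to 1 description (swap in `StandardScaler` for zero mean and unit variance), and wrapping the steps in a `ColumnTransformer` is a design choice made here, not necessarily how the project was structured.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data with missing numerical values.
cars = pd.DataFrame({
    "vehicle_age": [2, 3, np.nan, 5, 6, 7, 4, 3, 8, 1],
    "price": [9000.0, 8000.0, 7500.0, np.nan, 5000.0,
              4000.0, 6500.0, 7800.0, 3500.0, 9900.0],
    "make": ["FORD", "DODGE", "FORD", "CHEVROLET", "DODGE",
             "FORD", "CHEVROLET", "FORD", "DODGE", "FORD"],
    "is_returned": [0, 0, 0, 1, 1, 1, 0, 0, 1, 0],
})

inputs = cars.drop(columns="is_returned")
target = cars["is_returned"]

# 70:30 split into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    inputs, target, test_size=0.3, random_state=42
)

numeric_cols = ["vehicle_age", "price"]
categorical_cols = ["make"]

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fill gaps with the column mean
    ("scale", MinMaxScaler()),                   # rescale to the 0 to 1 range
])

preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training set only, then apply the same transform to validation data.
X_train_prepared = preprocess.fit_transform(X_train)
X_val_prepared = preprocess.transform(X_val)

print(X_train_prepared.shape, X_val_prepared.shape)
```

Fitting the imputer, scaler and encoder on the training set alone keeps information from the validation set out of the preprocessing.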

The model was then built using logistic regression, and it achieved an accuracy score of over 90%.
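A minimal sketch of fitting and scoring a logistic regression model with scikit-learn. The inputs are synthetic stand-ins for already-preprocessed features, so the accuracy printed here says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, already-scaled inputs (assumption: two features in the 0 to 1 range).
rng = np.random.default_rng(0)
X = rng.random((200, 2))
# Synthetic target driven mostly by the first feature, plus some noise.
y = (X[:, 0] + 0.1 * rng.standard_normal(200) > 0.5).astype(int)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy is the share of validation rows predicted correctly.
val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.2%}")
```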

One of the most important features for prediction turned out to be whether a car was missing a wheel type ID.
