Jacksons Machine Learning vs Internally Projected Sales

---------------------------------------------------------

Members:

Abstract:

---------------------------------------------------------

Our project is to evaluate Jackson's/ExtraMile stores' internal sales projections and actual sales,
then compare them to our machine learning sales predictions. We will evaluate the stores based on
the attributes each store possesses, drawn from source files provided by Jackson's. We will use
Jupyter Notebooks and a variety of algorithms to evaluate the data for the project.

All of the code will be written in Python (Anaconda 3) directly in a Jupyter Notebook file.
The process will begin by cleaning the provided data, which involves resolving columns and fields
that contain no data (null fields). Once that is resolved, we will feature engineer the data so it
is better suited for analysis and model training. Feature engineering involves creating or removing
columns to better tailor the data to the algorithm and its training, which helps ensure the
predictions are more accurate for Jackson's.
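
As a rough illustration of this cleaning and feature engineering step, here is a minimal sketch using hypothetical file and column names (the real Jackson's source files may use a different schema):

```python
import pandas as pd

# "sales.csv", "SaleDate", and "StoreType" are hypothetical names used only
# to illustrate the idea; the real Jackson's source files may differ.
df = pd.read_csv("sales.csv", parse_dates=["SaleDate"])

# Drop columns that are mostly null rather than trying to impute them.
df = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# Derive date-based features so the model can pick up seasonal patterns.
df["Year"] = df["SaleDate"].dt.year
df["Month"] = df["SaleDate"].dt.month
df["DayOfWeek"] = df["SaleDate"].dt.dayofweek

# Encode a categorical store attribute as numeric columns for the model.
df = pd.get_dummies(df, columns=["StoreType"], drop_first=True)
```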

This will allow Jackson's to use the Jupyter Notebook with different CSV files to get new predictions.
With the data they will be able to see which store attributes impact sales, both positively and
negatively, allowing Jackson's/ExtraMile to adjust accordingly to improve sales.

Project Description

---------------------------------------------------------

The team initially started with three algorithms and eventually narrowed the selection down to XGBoost.
XGBoost handles time series forecasting well, which is ideal when working with data heavily tied to
dates and time frames. The algorithm builds a series of trees (the number is user defined); after each
tree is finished, a new tree is created that makes a minor improvement on the errors of the previous
ones. When all trees are created, their outputs are combined to produce the final prediction.
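
For illustration, a boosted-tree regressor of this kind might be configured as follows with the XGBoost scikit-learn API; the parameter values here are placeholders, not the ones used in the project:

```python
import xgboost as xgb

# n_estimators is the user-defined number of trees. Each new tree is fit to
# the errors of the ensemble built so far, and the final prediction is the
# combined output of all the trees.
model = xgb.XGBRegressor(
    n_estimators=500,       # number of boosting rounds (trees)
    learning_rate=0.05,     # how strongly each new tree corrects the previous ones
    max_depth=6,            # depth of each individual tree
    objective="reg:squarederror",
)
```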

The end evaluations were successful, with the predictions reaching 75%-80% accuracy under a 15% error
threshold. The few stores that failed to meet expectations were typically stores where we had limited
data, meaning less than a few months' worth of history. First, we used a cleaning step to ensure the
store square footage had proper values, as that is an important characteristic of the stores.
This was done by implementing the cleaning script below, which fills in the null fields with an
appropriate value depending on the store size.

Cleaning Script
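
The original script is not reproduced here; a minimal sketch of the idea, using hypothetical `StoreSize` and `SquareFootage` column names, might look like this:

```python
import pandas as pd

def fill_square_footage(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing square footage with the median value for each store size class.

    "StoreSize" and "SquareFootage" are hypothetical column names used for
    illustration; the actual Jackson's source files may name them differently.
    """
    df = df.copy()
    df["SquareFootage"] = df.groupby("StoreSize")["SquareFootage"].transform(
        lambda s: s.fillna(s.median())
    )
    return df
```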

The majority of the other null fields were handled by the algorithm itself, as XGBoost is, in a manner
of speaking, self cleaning. For instance, the internally projected sales of a store were blank for the
vast majority of the stores. Many were blank for business reasons: some stores are owned by Chevron,
while many are owned by individuals with different marketing techniques, which changes the values and
is often not disclosed. XGBoost handled this for us, since it deals with missing values natively, in
effect substituting a sensible default learned from the other stores.
Below is a visual of the raw predictions before they were further refined and improved.

Raw Predictions
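
As a small demonstration of that behaviour, XGBoost accepts features that still contain NaN values and learns a default branch for them at each split, so no manual fill is required (the numbers below are made up):

```python
import numpy as np
import xgboost as xgb

# Toy data: the second feature (e.g. an internally projected sales figure)
# is missing for half of the stores.
X = np.array([[1200.0, np.nan],
              [1500.0, 40000.0],
              [1800.0, np.nan],
              [2100.0, 55000.0]])
y = np.array([35000.0, 42000.0, 47000.0, 56000.0])

# XGBoost routes missing values down a learned default branch at each split.
model = xgb.XGBRegressor(n_estimators=50, max_depth=2)
model.fit(X, y)
print(model.predict(X))
```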

The next step was to train the model on the data to improve the predictions and allow us to answer the
questions at hand. The training was done with a 4/5 to 1/5 split: 4/5 of the data for training and 1/5
for the prediction comparison.

Training Set
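
A minimal sketch of that split, again with hypothetical file and column names:

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# "cleaned_sales.csv" and "ActualSales" are hypothetical names for the
# cleaned feature table and the target column.
df = pd.read_csv("cleaned_sales.csv")
X = df.drop(columns=["ActualSales"])
y = df["ActualSales"]

# 4/5 of the rows for training, 1/5 held back for the prediction comparison.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = xgb.XGBRegressor(n_estimators=500, learning_rate=0.05, max_depth=6)
model.fit(X_train, y_train)
```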

As noted above, the predictions were fairly accurate, with 70-75% accuracy on most stores.
We did not have the time to fully isolate the characteristics that were directly responsible for
deviations between actual sales and the predictions. We were able to reduce the RMSE from
around 975 to 120, which took the accuracy from roughly 50% to the current 75%.

Prediction
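
For reference, a sketch of how predictions could be scored against an RMSE and the 15% threshold mentioned above, assuming the held-out set from the training sketch:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def evaluate(y_true, y_pred, threshold=0.15):
    """Return RMSE and the share of predictions within `threshold` of actual sales."""
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    within = np.abs(y_pred - y_true) <= threshold * np.abs(y_true)
    return rmse, within.mean()

# Example usage with the held-out set from the training sketch:
# rmse, hit_rate = evaluate(y_test.to_numpy(), model.predict(X_test))
# print(f"RMSE: {rmse:.1f}, within 15%: {hit_rate:.0%}")
```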