Udacity Data Science Nanodegree — Capstone Project
** All code for this project is included in this Github repository **
Starbucks Corporation is an American multinational chain of coffeehouses and roastery reserves headquartered in Seattle, Washington. As the world’s largest coffeehouse chain, Starbucks operates more than 30,000 locations in more than 70 countries. To connect with its customers, Starbucks uses a rewards mobile application to send advertisements and offers.
In this project, I analyze data from this app to find trends and relationships between user information and offer data. Finally, I build a machine learning model to predict whether a user will respond to an offer.
Each user on the application has an account that can include demographic information on the user. A user can make a purchase, receive an offer, view an offer or complete an offer.
There are three types of offers that can be sent: buy-one-get-one (BOGO), discount, and informational.
- BOGO: a user needs to spend a certain amount to get a reward equal to that threshold amount.
- Discount: a user gains a reward equal to a fraction of the amount spent.
- Informational: mere advertisement for a drink
There are three datasets available:
- portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
The question we are trying to answer is: how does a customer respond when an offer is sent to them?
The strategy that we will be following is:
- Data preprocessing and cleaning: we will look deeper at the data to understand its content. The data will then be cleaned of anomalies, null values, and duplicates.
- Data analysis and visualization: data will be further analyzed and visualized to answer more detailed questions relating to our problem.
- Data modelling: we will build a machine learning model that predicts whether a user will complete an offer or not. The model is evaluated using the F1 score.
Data Preprocessing and Cleaning:
1- portfolio.json (10 x 6):
id (string): offer id.
offer_type (string): type of offer i.e. BOGO, discount, informational.
difficulty (int): minimum required spend to complete an offer.
reward (int): reward given for completing an offer.
duration (int): time for the offer to be open, in days.
channels (list of strings): channels include mobile, web, email and/or social.
To clean this dataset we did the following:
- Expand “channels” into binary columns of all different channels in the dataset (email, web, mobile, social)
- Expand “offer_type” into binary columns for all different offer types.
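The two portfolio steps above can be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual code: the two sample rows are made up, and `clean_portfolio` is a hypothetical helper name.

```python
import pandas as pd

# Two made-up rows shaped like portfolio.json (illustrative, not real data).
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "informational"],
    "difficulty": [10, 0],
    "reward": [10, 0],
    "duration": [7, 3],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})


def clean_portfolio(df):
    """Return a copy with 'channels' and 'offer_type' expanded into 0/1 columns."""
    out = df.copy()
    # One binary column per channel: 1 if the offer was sent on that channel.
    for channel in ["email", "web", "mobile", "social"]:
        out[channel] = out["channels"].apply(lambda chans: int(channel in chans))
    # One binary column per offer type present in the data.
    type_dummies = pd.get_dummies(out["offer_type"]).astype(int)
    out = pd.concat([out.drop(columns=["channels", "offer_type"]), type_dummies], axis=1)
    return out


portfolio_clean = clean_portfolio(portfolio)
print(portfolio_clean)
```

Keeping the channels as separate 0/1 columns makes it easy to feed them directly to a classifier later on.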
2- profile.json (17000 x 5)
age (int) — age of the customer
became_member_on (int) — date when customer created an app account
gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
id (str) — customer id
income (float) — customer’s income
For this dataset, we did the following:
- Remove NaN and invalid values (e.g. age = 118).
2,175 users did not enter any information, so they were removed from the dataset. Every user with age = 118 was among these 2,175 users, which suggests that 118 is an automatic placeholder for a missing age.
- Format the ‘became_member_on’ column as Y-M-D.
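A rough sketch of the profile cleaning, assuming that `became_member_on` is stored as an integer of the form YYYYMMDD (the write-up only says it is an int, so this is an assumption). The sample rows and the helper name `clean_profile` are hypothetical.

```python
import pandas as pd

# Made-up rows shaped like profile.json; the second row mimics a user
# who entered no information (age placeholder 118, missing gender and income).
profile = pd.DataFrame({
    "id": ["u1", "u2", "u3"],
    "age": [55, 118, 34],
    "gender": ["F", None, "M"],
    "income": [72000.0, None, 51000.0],
    "became_member_on": [20170715, 20180101, 20160309],
})


def clean_profile(df):
    """Drop placeholder/missing rows and parse the membership date."""
    out = df[df["age"] != 118].dropna(subset=["gender", "income"]).copy()
    # Assumption: the integer encodes the date as YYYYMMDD.
    out["became_member_on"] = pd.to_datetime(
        out["became_member_on"].astype(str), format="%Y%m%d"
    )
    return out.reset_index(drop=True)


profile_clean = clean_profile(profile)
print(profile_clean)
```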
3- transcript.json (306534 x 4)
event (str) — record description (ie transaction, offer received, offer viewed, etc.)
person (str) — customer id
time (int) — time in hours since start of test. The data begins at time t=0
value — (dict of strings) — either an offer id or transaction amount depending on the record
For this dataset, we did the following:
- Expand “event” into binary columns of all different events in the dataset
- Edit the ‘value’ column and expand it into (offer_id, amount, reward) columns.
Offer_id, reward — for offer events
amount — for transaction events
- 374 duplicate entries were removed.
Finally, for simplicity, the hash values of user_id and offer_id were all mapped to integers.
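The transcript steps might look something like the sketch below. The dictionary keys inside `value` (`offer_id`, `amount`, `reward`) are assumed to match the target column names, which may not hold exactly for the raw file; the sample rows and the helper `clean_transcript` are made up for illustration.

```python
import pandas as pd

# Made-up rows shaped like transcript.json; rows 1 and 2 are exact duplicates.
transcript = pd.DataFrame({
    "person": ["a1f", "a1f", "a1f", "9bc"],
    "event": ["offer received", "transaction", "transaction", "offer completed"],
    "time": [0, 6, 6, 12],
    "value": [
        {"offer_id": "x9"},
        {"amount": 4.5},
        {"amount": 4.5},
        {"offer_id": "x9", "reward": 5},
    ],
})


def clean_transcript(df):
    """Expand 'value', drop duplicates, one-hot encode 'event', map ids to ints."""
    out = df.copy()
    # Pull each assumed key out of the dict; missing keys become None/NaN.
    for key in ["offer_id", "amount", "reward"]:
        out[key] = out["value"].apply(lambda v: v.get(key))
    # Dicts are unhashable, so drop 'value' before looking for duplicates.
    out = out.drop(columns="value").drop_duplicates().reset_index(drop=True)
    event_dummies = pd.get_dummies(out["event"]).astype(int)
    out = pd.concat([out, event_dummies], axis=1)
    # Replace hashed user ids with small integers (order of first appearance).
    out["person"] = pd.factorize(out["person"])[0]
    return out


transcript_clean = clean_transcript(transcript)
print(transcript_clean)
```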
Data analysis and visualization:
In this section, we will answer more detailed questions relating to the problem statement.
Q1) What are the users' age and gender distributions in the dataset?
We see that the majority of users are of ages 50–70 and males are 57% of the entire dataset. 1% of users didn't identify themselves as females or males.
Q2) How is the data distributed between different events?
The dataset contains 45% transaction events and 55% offer-related events. As expected, not all received offers were viewed and not all received offers were completed. In fact, only 11% of all offers sent were completed by users.
Q3. What are the percentages of each offer type sent?
In order to dive deeper and explore the offers in relation to the users and the offer types, we will create a new dataset that merges the three datasets.
An almost equal number of BOGO and discount offers were sent. Around 14,000 of all offers sent were purely informational.
Now let’s look at successfully completed offers and see which factors influence offer completion.
Successfully completed offers are offers that a user received, viewed, and then completed during the offer period.
To identify them, we will create a new column called ‘time_expire’ that holds the time, in hours, at which an offer expires.
time_expire = time_received + duration * 24
Then we will check the following conditions:
1. time_received ≤ time_viewed
2. time_viewed ≤ time_completed
3. time_completed ≤ time_expire
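The three conditions can be expressed as a single chained comparison. The function below is a small sketch; `is_successful` and its argument names are hypothetical, and a missing view or completion time is represented here by `None`.

```python
def is_successful(time_received, time_viewed, time_completed, duration_days):
    """Return True if the offer was viewed and then completed before it expired.

    Times are in hours since the start of the test; duration is in days.
    """
    if time_viewed is None or time_completed is None:
        return False
    time_expire = time_received + duration_days * 24
    return time_received <= time_viewed <= time_completed <= time_expire


# Received at t=0, viewed at t=6, completed at t=30, valid for 7 days (168 h).
print(is_successful(0, 6, 30, 7))
# Completed before it was viewed: not counted as a success.
print(is_successful(0, 40, 30, 7))
```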
Q4. Which offer type had the highest completion rate?
Although nearly equal numbers of discount and BOGO offers were sent, discount offers appear more likely to be completed. One possible explanation is that you don’t need to spend a certain amount to get a discount offer, in contrast to BOGO, where you are required to meet a certain amount of spending.
Q5. What is the relation between user demographics (age, gender) and offer completion?
It looks like males are slightly more inclined to complete offers, especially discount offers.
It also looks like age has little effect on offer completion, since the age distribution of completed offers is similar to the age distribution of the entire dataset.
Finally, if we look closely at the last graph, we see that females are more attracted to the $10 offer than males.
Q6. What is the average amount of spending by age?
To complete the picture, we will look at the factors that affect spending, since users who spend more are of higher value to Starbucks.
As the age of the user increases, the average spending increases.
Now that we have analyzed the dataset, we will proceed by creating a model that would predict whether a user will respond to an offer or not. There are 4 scenarios that can happen:
- A user will view and complete the offer.
- A user will just view the offer.
- A user will not view the offer, but will complete it anyway (without prior knowledge of the offer existence)
- A user will not view the offer and will not complete it.
Since Starbucks is targeting users who will view an offer and then complete it, our prediction is a binary value:
1: User will view and complete the offer
0: Any of the other three scenarios
Three classifier algorithms will be used
- Decision Tree Classifier
- k-nearest neighbors
To evaluate a model, we will look at the F1 score. The F1 score is the harmonic mean of precision and recall, so it conveys the balance between them.
Looking at precision alone would ignore false negatives and would make us miss valuable customers who could potentially complete an offer.
Similarly, looking at recall alone would ignore false positives, which could lead us to send offers to everyone and flood users with offers they are not interested in.
For that reason, the F1 score is the best choice in this case, as it balances the two.
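To make the trade-off concrete, here is a small worked example that computes precision, recall, and F1 from raw counts. The counts are invented for illustration and are not results from this project; `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts: 60 offers correctly predicted as completed,
# 20 predicted as completed but not, 40 completed offers that were missed.
precision, recall, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot score well on F1 by maximizing only precision or only recall.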
In order to proceed with the prediction, we will need to create a new dataframe that will include the targeted features and the prediction column. The features that will be analyzed are:
A new column will be created “offer_success” to show whether a user will successfully view and complete the offer.
Offer types BOGO and discount have clear criteria for completion: a successful offer can be found by looking for the value “offer completed” in the event column and then double-checking that the times of viewing, completion, and offer expiration are consistent.
However, since informational offers are advertisements that don’t have a completion criterion, we will need to define what makes them successful.
One way is to look at all transactions and check whether a transaction occurred during an informational offer’s period. Such transactions are considered to be influenced by the offer, and thus the informational offer is considered successful. This is, of course, under the condition that the user received and viewed the informational offer and then proceeded to make a transaction.
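One possible reading of this rule, sketched as a function: an informational offer counts as successful if the user viewed it within the offer period and made at least one transaction between viewing it and its expiry. The function name and arguments are hypothetical, and the exact boundaries (inclusive or exclusive) are an assumption.

```python
def informational_success(time_received, time_viewed, duration_days, transaction_times):
    """Return True if a transaction happened after the view and before expiry.

    Times are in hours; transaction_times is a list of the user's transaction times.
    """
    if time_viewed is None:
        return False
    time_expire = time_received + duration_days * 24
    # The view itself must fall inside the offer period.
    if not (time_received <= time_viewed <= time_expire):
        return False
    return any(time_viewed <= t <= time_expire for t in transaction_times)


# Offer received at t=0, valid 3 days (72 h), viewed at t=12.
print(informational_success(0, 12, 3, [6, 48]))   # transaction at t=48 counts
print(informational_success(0, 12, 3, [6, 100]))  # no transaction in the window
```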
After data was prepared, GridSearch was used to tune the parameters of each classifier algorithm.
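As an illustration of this tuning step, the sketch below runs scikit-learn's `GridSearchCV` over the number of neighbors for a `KNeighborsClassifier`, scored by F1. It uses synthetic data from `make_classification` rather than the Starbucks features, and the 5-fold setting is an assumption; only the candidate values 5, 10, and 30 come from this write-up.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the prepared feature matrix and offer_success labels.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Try each candidate number of neighbors with cross-validation, scored by F1.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [5, 10, 30]},
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)

# Evaluate the best model on held-out test data.
test_f1 = f1_score(y_test, search.predict(X_test))
print(search.best_params_, round(test_f1, 2))
```

Using the same held-out test split for every classifier keeps the comparison between models fair.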
Screenshots of the results are shown below, along with the hyperparameter values that were evaluated for each algorithm and the best values found using cross-validation:
A summary report of the above results is shown in this table, using common test data for all models.
Looking at the results above, KNeighborsClassifier performed the best, with an f1-score of 0.64 for users who will not complete an offer and 0.77 for users who will complete an offer. The average accuracy is 0.71.
KNeighborsClassifier achieved this result using the following best parameters:
In the above analysis, we looked at Starbucks data and saw how offer features and user demographics affect how a user responds to an offer.
After doing some deeper analysis and visualization, we saw that some features, such as gender, affected a user's response to an offer. Males were more likely to complete an offer, especially discount offers, whereas women preferred high-reward offers ($10). On the other hand, we saw that the age of the user had no effect in that matter.
We discovered some trends in user spending and saw that older users (in their 50’s or 60’s) are likely to spend more.
We also looked at how the data is spread and saw that only 11% of the offers sent were actually completed and 19% were viewed. The remaining offers were not touched.
To wrap things up, we built a classification model to predict whether a user would complete an offer or not. We saw that KNeighborsClassifier had the best results. The model was better at classifying completed offers, and it had an average accuracy of 71%, which is reasonable given the limited dataset and the small number of effective features. Neighbor counts of 5, 10, and 30 were tested with grid search to obtain this result.
Information was missing for many users in the dataset (2,175 of 17,000), and therefore their data was discarded. The experiment was also run for a short period of time, which doesn’t give enough insight into how users react to the offers.
One improvement for the model was made after normalizing the income feature as it had very large values compared to the rest of the data.
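The write-up does not say which normalization was used, so the sketch below shows min-max scaling as one common option; the helper `min_max_scale` and the income values are made up. Distance-based models such as k-nearest neighbors are sensitive to feature scale, which is why a feature with values in the tens of thousands can dominate the others.

```python
import pandas as pd


def min_max_scale(series):
    """Rescale a numeric Series to the range [0, 1]."""
    lo, hi = series.min(), series.max()
    if hi == lo:
        # Constant column: return zeros to avoid dividing by zero.
        return series * 0.0
    return (series - lo) / (hi - lo)


# Made-up incomes for illustration.
income = pd.Series([30000.0, 75000.0, 120000.0])
income_scaled = min_max_scale(income)
print(income_scaled.tolist())
```

In practice the minimum and maximum would be taken from the training split only and then reused for the test split, so that no information from the test data leaks into the model.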
In conclusion, we went through several steps with the analysis to answer our problem statement.
The steps can be summarized as follows:
- Understand the dataset and all its features.
- Clean and modify the data to prepare it for visualization and modelling.
* Null and duplicate values were removed
* For categorical values such as offer_type and event, separate columns were generated to replace them.
* Invalid data such as missing user information was handled.
- Visualize and explore data.
* The distributions of age, gender, event, and offer types were examined.
* The relation between user demographics and response to offers was examined.
- Identify successfully completed offers from unsuccessful offers and represent them properly.
- Predict user response to offers using ML classification models.
- Evaluate results and choose the most appropriate model.
In this project, I learned a great deal and explored different areas of data science.
My favorite step was step 3, visualizing the data. My most challenging step was step 4, as it was difficult to deal with the fact that users can complete an offer without viewing it and that an offer can be sent to a user multiple times.
One thing that can be done is to increase the duration of the experiment. A period of one month is short and is not sufficient to describe user behavior, as users' spending can differ from month to month. A period of 3 or 6 months would be more suitable.
We can also create an A/B experiment that splits users into groups to check how well our prediction model works.