Exploratory Data Analysis and Visualization of Olist (Brazilian E-commerce) Public Dataset
- This project was carried out as part of Group M Final Project at RevoU Full Stack Data Analytics Program (Batch 1–2021)
Introduction
Along with the rapid growth of the E-commerce business around the world, it made me and my team in Group M curious to did an analysis project of an E-commerce from Brazil, called Olist.
*We got the publicly released Olist Dataset (2016–2018) from https://www.kaggle.com/olistbr/brazilian-ecommerce/version/2.
Olist provides real growth opportunities for Small, Medium, and Large business. Olist has more than 4.5 million registered products and support for more than 100 thousand stores.
During 2011–2017, Olist had a growing number of sales. We can see promising future for Olist. As the market is getting bigger, it is important to win the competition. Findings in this report will help stakeholders by giving insight to make business decisions.
Scope of Problems
As the Ecommerce sales is massively growing in Brazil year to year, there are findings that Olist still hasn’t reached all of its potential customers and maintain the customer to keep buying. This report was created to show Olist’s performance through 2016–2018 and we can see which rooms to improve.
We relied on descriptive statistics to understand the possible optimization of transactions for each region, delivery time, customer satisfaction, category sales, customer retention rate, and RFM Customer Segmentation. We hope that from this we can apply the right steps to optimize the Olist’s business.
#Here is how we worked on this project
Data Preparation and Cleaning
We started the process by using Google Sheets, we performed these following steps:
- Preparation- Joining olist_orders_dataset, olist_order_items_dataset, olist_products_dataset, olist_customers_dataset, olist_order_payments_dataset, olist_geolocation_dataset, and olist_order_reviews_dataset
- Cleaning- Remove Non “Delivered” data at ‘order_status’ column
- Cleaning- Remove Duplicates at ‘order_id’ column
After doing some of the steps above, then we used Python to perform these following steps:
- Preparation- Import all the libraries to be used in cleaning and Exploratory Data Analysis (EDA)
import pandas as pdimport datetime as dtimport numpy as npimport matplotlib.pyplot as plt
2. Preparation- Reading csv file from url
# reading csv file from urlsheet_url = 'https://docs.google.com/spreadsheets/d/1saBGAYO7fz0_w2NE1b3mEofc2b9HqzbDeq0ebg0IsLo/edit#gid=1400750999'sheet_url_trf = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')df = pd.read_csv(sheet_url_trf)df
3. Preparation- Checking dataset info
df.info()
4. Preparation- Checking nulls data
pd.isnull(df).sum()
5. Cleaning- Delete null rows
df = df.dropna(subset=['order_approved_at'])df = df.dropna(subset=['order_delivered_carrier_date'])df = df.dropna(subset=['order_delivered_customer_date'])df = df.dropna(subset=['product_category_name'])
6. Cleaning- Remove outliers data of ‘payment_value’ column
- Creating boxplot to visualize the ‘payment_value’ data
df.boxplot(column='payment_value', fontsize=10, rot=0, grid=False, figsize=(10,10))
- Finding Interquartile from the ‘payment_value’ data
p_q1 = df['payment_value'].quantile(0.25)p_q3 = df['payment_value'].quantile(0.75)p_iqr = p_q3 - p_q1boxplot_p_min = p_q1 - 1.5 * p_iqrboxplot_p_max = p_q3 + 1.5 * p_iqrprint('pv_Q1 :', p_q1)print('pv_Q3 :', p_q3)print('pv_IQR :', p_iqr)print('pv_Min :', boxplot_p_min)print('pv_Max :', boxplot_p_max)
- Finding length of dataframe before removing outliers from the ‘payment_value’ data
len(df)
- Removing outliers from the ‘payment_value’ data
df_clean = df_clean[(df_clean[‘Clean Rooms’] >= boxplot_cr_min) & (df_clean[‘Clean Rooms’] < boxplot_cr_max)]len(df_clean)
7. Cleaning- Remove outliers data of ‘price’ column and ‘freight_value’ column by mirroring step 6 and adjusting the code to the data used
8. Cleaning- Remove “0” value of ‘product_photos_qty’
df = df[df.product_photos_qty !=0]
9. Cleaning- Remove “0” value of ‘product_weight_g’, ‘product_length_cm’, ‘product_height_cm’, and ‘product_width_cm’ by mirroring step 8 and adjusting the code to the data used
*Before we move on to the next chapter, we will do Recency, Frequency and Monetary (RFM) Customer Segmentation using Python
Here we show the codes:
1.
df_recency = df.groupby(by='customer_unique_id',as_index=False)['order_purchase_timestamp'].max()df_recency.columns = ['customer_unique_id', 'order_purchase_timestamp']recent_date = df_recency['order_purchase_timestamp'].max()df_recency['Recency'] = df_recency['order_purchase_timestamp'].apply(lambda x: (recent_date - x).days)df_recency.head()
2.
frequency_df = df.groupby(by=['customer_unique_id'], as_index=False)['order_purchase_timestamp'].count()frequency_df.columns = ['customer_unique_id', 'Frequency']frequency_df.head()
3.
df['Total'] = df['payment_value']monetary_df = df.groupby(by='customer_unique_id', as_index=False)['Total'].sum()monetary_df.columns = ['customer_unique_id', 'Monetary']monetary_df.head()
4.
rf_df = df_recency.merge(frequency_df, on=’customer_unique_id’)rfm_df = rf_df.merge(monetary_df, on=’customer_unique_id’).drop(columns=’order_purchase_timestamp’)rfm_df.head()
5.
rfm_df[‘R_rank’] = rfm_df[‘Recency’].rank(ascending=False)rfm_df[‘F_rank’] = rfm_df[‘Frequency’].rank(ascending=True)rfm_df[‘M_rank’] = rfm_df[‘Monetary’].rank(ascending=True)
6.
# normalizing the rank of the customersrfm_df[‘R_rank_norm’] = (rfm_df[‘R_rank’]/rfm_df[‘R_rank’].max())*100rfm_df[‘F_rank_norm’] = (rfm_df[‘F_rank’]/rfm_df[‘F_rank’].max())*100rfm_df[‘M_rank_norm’] = (rfm_df[‘F_rank’]/rfm_df[‘M_rank’].max())*100rfm_df.drop(columns=[‘R_rank’, ‘F_rank’, ‘M_rank’], inplace=True)rfm_df.head()
7.
rfm_df[‘RFM_Score’] = 0.15*rfm_df[‘R_rank_norm’]+0.28 * \rfm_df[‘F_rank_norm’]+0.57*rfm_df[‘M_rank_norm’]rfm_df[‘RFM_Score’] *= 0.05rfm_df = rfm_df.round(2)rfm_df[[‘customer_unique_id’, ‘RFM_Score’]].head(7)
8.
rfm_df[“Customer_segment”] = np.where(rfm_df[‘RFM_Score’] >4.5, “Top Customers”,(np.where(rfm_df[‘RFM_Score’] > 4,“High value Customer”,(np.where(rfm_df[‘RFM_Score’] > 3,“Medium Value Customer”,np.where(rfm_df[‘RFM_Score’] > 1.6,‘Low Value Customers’, ‘Lost Customers’))))))rfm_df[[‘customer_unique_id’, ‘RFM_Score’, ‘Customer_segment’]].head(20)
9.
plt.pie(rfm_df.Customer_segment.value_counts(),labels=rfm_df.Customer_segment.value_counts().index,autopct=’%.0f%%’)plt.show()
10.
rfm_df.head(10)
11. Export data to excel, so we can move forward to the EDA and Visualization process by using Tableau
rfm_df.to_excel(‘data_olist_RFM.xlsx’, index=False )
Exploratory Data Analysis and Visualization
Total Transaction Gap between States is Very High and Avg Freight Value is more Expensive outside Sao Paulo
- The top sales is up to 34.567 in Sao Paulo, but the second sales was only up to 9.840 (71% less)
- It could be caused by the high freight value in other states
- The Ebit | Nielsen survey in Brazil showed 70% respondents pointed to freight cost as an opportunity for improvement
Relationship between Delivery Time and Customer Rating, showed the longer the delivery time, the smaller rating score we will get
- We found that in the box chart, when the customer gave a rating score of 1, the customer received a longer delivery time.
- We use 11 days as a benchmark. If it’s more than 11 days, it exceeds the acceptable delivery time for the customer.
Pareto Chart for Overview sales between Q4 2016 — Q3 2018
- Approximately 80% of sales generated came from six group category; namely, Household needs, Sport & Hobby, Beauty, Computer & Laptop, and Mom & baby.
Pareto category sales decline when the market is growth
- Computer & Laptop growth significantly in Q1 2018 but drop significantly in Q2 — Q3 2018.
- Sales trend for the Pareto category in Q2-Q3 2018 tends to decrease. Against market growth in 2018.
New Customers and Repeat Customers Order on 2016–2018
- There is significant new customer orders increase from 2017–2018
- Customers repeat orders in this graphic are very low
- Need to retain and grow existing customers
Cohort Analysis for customer retention rate on 12 months
- Cohort Analysis Showed Customer Retention Rate is very Low
- Monthly Customer retention rate is below 1% rate
- Need to improve the frequency and the touchpoints with those customers
RFM Customer Segmentation dominated by Low Value Customers
- Low value Customer indicated customers who stop purchase long time for some reasons
- Only 1,69% customers spend high amount transaction and still active purchase
- Re-engage with existing customers to drive greater revenue and keep them longer
Characteristic and Marketing Strategy based on RFM Customer Segmentation
Top Customers
Characteristic: Likes engaging with us
Marketing Strategy:
- Provide loyalty & referral reward programs
- Can apply new product testing to this segment, because they are possible to help in giving us valuable feedback and recommending our product to others
- Can be targeted for our most expensive products
High Value Customers
Characteristic: Spending less money than the Top Customers, but they are still our valued customers
Marketing Strategy:
- Personalize products by recommending products based on previous purchases
- Implement the review programs (executed in apps / social media)
- Due to high recency avg, survey and more analysis are needed to find the causes
- Give special promos/prices for a limited time
Low Value Customers
Characteristic: Shopping infrequently and spending less money than others
Marketing Strategy:
- Focus on increasing monetization and frequency
- Personalize products by recommending products based on previous purchases
- Increase the brand awareness to increase shopping frequency
Summary & Our Recommendations
Low sales in states outside San Paulo:
- Summary- There’s an indication that high freight values caused the gap of total transaction between Sao Paulo and other state is very high
- Recommendation- Developing new strategy to lower the freight cost for customers
The longer the delivery time, the smaller rating score we will get:
- Summary- Data shows acceptable delivery time for customers by looking at the review score.
- Recommendation- (1) Improve the delay situation and customer satisfaction, (2) Also delivery time control can be improved by having olist logistics center.
Pareto category sales declined while the market is growing:
- Summary- Pareto chart shows some importance product category sales for Olist
- Recommendation- Olist can boost its sales by focus on Household need and fashion categories
Low Customer Retention rate and Value Customer:
- Summary- New customers is increase from 2017–2018 but retention rate is very low and from RFM, we know that our customer is dominated by Low Value Customer
- Recommendation- (1) Customers need more attention with product recommendation and price incentives, (2) Analyze customer leaving behavior
We present all of the above visualizations combined into multiple dashboards
Reference
— — — — — — — — —
The Group M Team : Rona Fauzan Noer, Felix Sulistio, Dwi Mariastuti, and Chandra Agung Wijaya
Mentor : Andrian T
— — —
Specifically, I contributed to the following work:
1. Cleaning data outliers, ‘0’ data, and data nulls using Python
2. Crafting visualization using Tableau and analyzing data in the “Relationship between Delivery Time and Customer Rating” part
3. Analyzing data on “Characteristic and Marketing Strategy based on RFM Customer Segmentation” part
4. Finalizing all the visualizations that my teammates have worked on with Tableau
5. Crafting Dashboard from data that has been analyzed by my team using Tableau
— — —
Since this is my team’s first analysis project, maybe the results that we provide are still far from perfect. In this post, I am very open and appreciate if you have feedback and insight about this.
Connect with me on LinkedIn
Thank you!