Exploratory Data Analysis and Visualization of Olist (Brazilian E-commerce) Public Dataset

Rona Fauzan Noer
Feb 1, 2022


  • This project was carried out as part of the Group M Final Project at the RevoU Full Stack Data Analytics Program (Batch 1–2021)
Photo: Olist

Introduction

Along with the rapid growth of the e-commerce business around the world, my team in Group M and I became curious to carry out an analysis project on a Brazilian e-commerce company called Olist.

*We got the publicly released Olist Dataset (2016–2018) from https://www.kaggle.com/olistbr/brazilian-ecommerce/version/2.

Olist provides real growth opportunities for small, medium, and large businesses. Olist has more than 4.5 million registered products and supports more than 100 thousand stores.

During 2011–2017, Olist recorded a growing number of sales, so we can see a promising future for the company. As the market gets bigger, it is important to win the competition. The findings in this report will help stakeholders make business decisions by giving them insight.

Scope of Problems

As e-commerce sales in Brazil grow massively year over year, there are findings that Olist still has not reached all of its potential customers or kept existing customers buying. This report was created to show Olist’s performance through 2016–2018 and to see where there is room to improve.

We relied on descriptive statistics to understand possible optimizations in transactions per region, delivery time, customer satisfaction, category sales, customer retention rate, and RFM customer segmentation. We hope that from this we can apply the right steps to optimize Olist’s business.

Here is how we worked on this project:

Data Preparation and Cleaning

We started the process in Google Sheets, where we performed the following steps:

  1. Preparation- Joining olist_orders_dataset, olist_order_items_dataset, olist_products_dataset, olist_customers_dataset, olist_order_payments_dataset, olist_geolocation_dataset, and olist_order_reviews_dataset
  2. Cleaning- Removing rows that are not “Delivered” in the ‘order_status’ column
  3. Cleaning- Removing duplicates in the ‘order_id’ column (a rough pandas equivalent of these Google Sheets steps is sketched below)
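
For readers who prefer to stay in Python, here is a rough pandas equivalent of the Google Sheets preparation above. It is only a sketch: it assumes the raw Kaggle CSV file names, and it leaves out olist_geolocation_dataset, which joins on the zip code prefix rather than a shared id.

import pandas as pd

# load the raw Kaggle tables (file names assumed)
orders = pd.read_csv('olist_orders_dataset.csv')
items = pd.read_csv('olist_order_items_dataset.csv')
products = pd.read_csv('olist_products_dataset.csv')
customers = pd.read_csv('olist_customers_dataset.csv')
payments = pd.read_csv('olist_order_payments_dataset.csv')
reviews = pd.read_csv('olist_order_reviews_dataset.csv')

# join the tables on their shared keys
joined = (orders
          .merge(items, on='order_id')
          .merge(products, on='product_id')
          .merge(customers, on='customer_id')
          .merge(payments, on='order_id')
          .merge(reviews, on='order_id'))

# keep only delivered orders and drop duplicated order ids
joined = joined[joined['order_status'] == 'delivered']
joined = joined.drop_duplicates(subset='order_id')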

After the steps above, we used Python to perform the following steps:

  1. Preparation- Import all the libraries to be used in cleaning and Exploratory Data Analysis (EDA)
import pandas as pd
import datetime as dt
import numpy as np
import matplotlib.pyplot as plt

2. Preparation- Reading csv file from url

# reading csv file from url
sheet_url = 'https://docs.google.com/spreadsheets/d/1saBGAYO7fz0_w2NE1b3mEofc2b9HqzbDeq0ebg0IsLo/edit#gid=1400750999'
sheet_url_trf = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df = pd.read_csv(sheet_url_trf)
df

3. Preparation- Checking dataset info

df.info()

4. Preparation- Checking nulls data

pd.isnull(df).sum()

5. Cleaning- Delete null rows

df = df.dropna(subset=['order_approved_at'])
df = df.dropna(subset=['order_delivered_carrier_date'])
df = df.dropna(subset=['order_delivered_customer_date'])
df = df.dropna(subset=['product_category_name'])

6. Cleaning- Remove outliers data of ‘payment_value’ column

  • Creating boxplot to visualize the ‘payment_value’ data
df.boxplot(column='payment_value', fontsize=10, rot=0, grid=False, figsize=(10,10))
Sample Visualization Result
  • Finding Interquartile from the ‘payment_value’ data
p_q1 = df['payment_value'].quantile(0.25)
p_q3 = df['payment_value'].quantile(0.75)
p_iqr = p_q3 - p_q1
boxplot_p_min = p_q1 - 1.5 * p_iqr
boxplot_p_max = p_q3 + 1.5 * p_iqr
print('pv_Q1 :', p_q1)
print('pv_Q3 :', p_q3)
print('pv_IQR :', p_iqr)
print('pv_Min :', boxplot_p_min)
print('pv_Max :', boxplot_p_max)
  • Finding length of dataframe before removing outliers from the ‘payment_value’ data
len(df)
  • Removing outliers from the ‘payment_value’ data
df = df[(df['payment_value'] >= boxplot_p_min) & (df['payment_value'] <= boxplot_p_max)]
len(df)

7. Cleaning- Remove outlier data in the ‘price’ and ‘freight_value’ columns by mirroring step 6 and adjusting the code to the columns used (a sketch is shown below)
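
One possible way to mirror the IQR logic from step 6 for both columns at once (a sketch, not necessarily the exact code we ran) is:

# mirror step 6 (IQR fences) for the 'price' and 'freight_value' columns
for col in ['price', 'freight_value']:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    df = df[(df[col] >= lower) & (df[col] <= upper)]

# check the remaining number of rows
len(df)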

8. Cleaning- Remove “0” value of ‘product_photos_qty’

df = df[df.product_photos_qty !=0]

9. Cleaning- Remove “0” values in ‘product_weight_g’, ‘product_length_cm’, ‘product_height_cm’, and ‘product_width_cm’ by mirroring step 8 and adjusting the code to the columns used (a sketch is shown below)
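
Mirroring step 8 for the remaining columns can be written as a short loop (again a sketch with the column names adjusted):

# drop rows where any product dimension is recorded as 0
for col in ['product_weight_g', 'product_length_cm',
            'product_height_cm', 'product_width_cm']:
    df = df[df[col] != 0]

len(df)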

*Before we move on to the next chapter, we perform Recency, Frequency, and Monetary (RFM) customer segmentation using Python

Here are the code steps:

1. Calculating Recency (days since each customer’s most recent purchase)

# make sure the purchase timestamp is parsed as a datetime before computing day differences
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])

df_recency = df.groupby(by='customer_unique_id', as_index=False)['order_purchase_timestamp'].max()
df_recency.columns = ['customer_unique_id', 'order_purchase_timestamp']
recent_date = df_recency['order_purchase_timestamp'].max()
df_recency['Recency'] = df_recency['order_purchase_timestamp'].apply(lambda x: (recent_date - x).days)
df_recency.head()

2. Calculating Frequency (number of orders per customer)

frequency_df = df.groupby(by=['customer_unique_id'], as_index=False)['order_purchase_timestamp'].count()
frequency_df.columns = ['customer_unique_id', 'Frequency']
frequency_df.head()

3. Calculating Monetary (total payment value per customer)

df['Total'] = df['payment_value']
monetary_df = df.groupby(by='customer_unique_id', as_index=False)['Total'].sum()
monetary_df.columns = ['customer_unique_id', 'Monetary']
monetary_df.head()

4. Merging the Recency, Frequency, and Monetary tables into one RFM table

rf_df = df_recency.merge(frequency_df, on='customer_unique_id')
rfm_df = rf_df.merge(monetary_df, on='customer_unique_id').drop(columns='order_purchase_timestamp')
rfm_df.head()

5. Ranking customers on each of the R, F, and M attributes

rfm_df['R_rank'] = rfm_df['Recency'].rank(ascending=False)
rfm_df['F_rank'] = rfm_df['Frequency'].rank(ascending=True)
rfm_df['M_rank'] = rfm_df['Monetary'].rank(ascending=True)

6. Normalizing the ranks to a 0–100 scale

# normalizing the rank of the customers
rfm_df['R_rank_norm'] = (rfm_df['R_rank'] / rfm_df['R_rank'].max()) * 100
rfm_df['F_rank_norm'] = (rfm_df['F_rank'] / rfm_df['F_rank'].max()) * 100
rfm_df['M_rank_norm'] = (rfm_df['M_rank'] / rfm_df['M_rank'].max()) * 100
rfm_df.drop(columns=['R_rank', 'F_rank', 'M_rank'], inplace=True)
rfm_df.head()

7. Calculating the weighted RFM score

rfm_df['RFM_Score'] = 0.15 * rfm_df['R_rank_norm'] + 0.28 * \
    rfm_df['F_rank_norm'] + 0.57 * rfm_df['M_rank_norm']
rfm_df['RFM_Score'] *= 0.05
rfm_df = rfm_df.round(2)
rfm_df[['customer_unique_id', 'RFM_Score']].head(7)

8. Segmenting customers based on the RFM score

rfm_df['Customer_segment'] = np.where(
    rfm_df['RFM_Score'] > 4.5, 'Top Customers',
    np.where(rfm_df['RFM_Score'] > 4, 'High value Customer',
             np.where(rfm_df['RFM_Score'] > 3, 'Medium Value Customer',
                      np.where(rfm_df['RFM_Score'] > 1.6, 'Low Value Customers', 'Lost Customers'))))
rfm_df[['customer_unique_id', 'RFM_Score', 'Customer_segment']].head(20)

9. Visualizing the segment distribution with a pie chart

plt.pie(rfm_df.Customer_segment.value_counts(),
        labels=rfm_df.Customer_segment.value_counts().index,
        autopct='%.0f%%')
plt.show()

10. Checking the final RFM table

rfm_df.head(10)

11. Exporting the data to Excel, so we can move on to the EDA and visualization process using Tableau

rfm_df.to_excel('data_olist_RFM.xlsx', index=False)

Exploratory Data Analysis and Visualization

Total Transaction Gap between States Is Very High and Average Freight Value Is More Expensive outside Sao Paulo

  • The top state, Sao Paulo, recorded up to 34,567 transactions, while the second-highest state recorded only about 9,840 (71% less)
  • This gap could be caused by the high freight value in other states
  • The Ebit | Nielsen survey in Brazil showed that 70% of respondents pointed to freight cost as an opportunity for improvement (a pandas sketch of this state-level view is shown below)
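
The chart itself was built in Tableau, but the same state-level view can be cross-checked in pandas. This sketch assumes the joined dataset contains the 'customer_state', 'order_id', and 'freight_value' columns:

# total orders and average freight value per state
state_summary = (df.groupby('customer_state')
                   .agg(total_orders=('order_id', 'nunique'),
                        avg_freight=('freight_value', 'mean'))
                   .sort_values('total_orders', ascending=False))
print(state_summary.head(10))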

Relationship between Delivery Time and Customer Rating: the Longer the Delivery Time, the Lower the Rating Score

  • The box chart shows that customers who gave a rating score of 1 generally experienced longer delivery times.
  • We use 11 days as a benchmark: if delivery takes more than 11 days, it exceeds the delivery time customers find acceptable. A pandas sketch of this comparison is shown below.
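
As a rough way to reproduce this comparison outside Tableau, the sketch below computes the delivery time in days per order and summarizes it for each review score (it assumes the joined dataset contains 'review_score' and the two timestamp columns):

import pandas as pd
import matplotlib.pyplot as plt

# make sure both timestamps are parsed as datetimes
df['order_purchase_timestamp'] = pd.to_datetime(df['order_purchase_timestamp'])
df['order_delivered_customer_date'] = pd.to_datetime(df['order_delivered_customer_date'])

# delivery time in days for each order
df['delivery_days'] = (df['order_delivered_customer_date']
                       - df['order_purchase_timestamp']).dt.days

# distribution of delivery time for each review score
print(df.groupby('review_score')['delivery_days'].describe())
df.boxplot(column='delivery_days', by='review_score', grid=False, figsize=(10, 6))
plt.show()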

Pareto Chart for an Overview of Sales between Q4 2016 and Q3 2018

  • Approximately 80% of sales came from six category groups, including Household Needs, Sport & Hobby, Beauty, Computer & Laptop, and Mom & Baby (a sketch of the underlying cumulative-share calculation is shown below)
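
The category grouping (Household Needs, Sport & Hobby, and so on) was mapped manually, so the sketch below runs the Pareto calculation on the raw Olist category names instead; the idea is the same: sort categories by sales and look at the cumulative share.

# cumulative share of payment value by product category
category_sales = (df.groupby('product_category_name')['payment_value']
                    .sum()
                    .sort_values(ascending=False))
cumulative_share = category_sales.cumsum() / category_sales.sum() * 100
print(cumulative_share.head(15))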

Pareto Category Sales Declined while the Market Was Growing

  • Computer & Laptop grew significantly in Q1 2018 but dropped significantly in Q2–Q3 2018.
  • The sales trend for the Pareto categories in Q2–Q3 2018 tends to decrease, going against the market growth in 2018.

New Customer and Repeat Customer Orders in 2016–2018

  • There is a significant increase in new customer orders from 2017 to 2018
  • Repeat orders in this graphic are very low
  • Olist needs to retain and grow its existing customers (a sketch of the new-vs-repeat split is shown below)
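
A simple way to approximate the new-vs-repeat split in pandas is to rank each customer’s orders by purchase date and label everything after the first order as a repeat. This sketch assumes one row per order (after the earlier de-duplication) and parsed timestamps:

import numpy as np

# label each order as the customer's first order (New) or a later one (Repeat)
df = df.sort_values('order_purchase_timestamp')
df['order_rank'] = df.groupby('customer_unique_id').cumcount()
df['customer_type'] = np.where(df['order_rank'] == 0, 'New', 'Repeat')

# count unique orders per year for each customer type
df['year'] = df['order_purchase_timestamp'].dt.year
print(df.groupby(['year', 'customer_type'])['order_id'].nunique().unstack())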

Cohort Analysis of the Customer Retention Rate over 12 Months

  • The cohort analysis shows that the customer retention rate is very low
  • The monthly customer retention rate is below 1%
  • Olist needs to improve its purchase frequency and touchpoints with these customers (a sketch of the cohort table is shown below)
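
The cohort table was built in Tableau, but a similar monthly retention matrix can be sketched in pandas as follows (assuming parsed timestamps and one row per order):

# assign each order to a calendar month and each customer to a cohort month
df['order_month'] = df['order_purchase_timestamp'].dt.to_period('M')
df['cohort_month'] = df.groupby('customer_unique_id')['order_month'].transform('min')

# months elapsed since the customer's first purchase
df['cohort_index'] = ((df['order_month'].dt.year - df['cohort_month'].dt.year) * 12
                      + (df['order_month'].dt.month - df['cohort_month'].dt.month))

# unique active customers per cohort and month, as a % of the cohort size
cohort_counts = (df.groupby(['cohort_month', 'cohort_index'])['customer_unique_id']
                   .nunique()
                   .unstack())
retention = cohort_counts.divide(cohort_counts.iloc[:, 0], axis=0) * 100
print(retention.round(2))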

RFM Customer Segmentation Is Dominated by Low Value Customers

  • Low Value Customers are customers who stopped purchasing a long time ago for various reasons
  • Only 1.69% of customers spend a high transaction amount and are still actively purchasing
  • Re-engage existing customers to drive greater revenue and keep them longer

Characteristic and Marketing Strategy based on RFM Customer Segmentation


Top Customers

Characteristic: Likes engaging with us

Marketing Strategy:

  • Provide loyalty & referral reward programs
  • We can test new products with this segment, because they are likely to give us valuable feedback and recommend our products to others
  • Can be targeted for our most expensive products

High Value Customers

Characteristic: Spending less money than the Top Customers, but they are still our valued customers

Marketing Strategy:

  • Personalize products by recommending products based on previous purchases
  • Implement the review programs (executed in apps / social media)
  • Due to the high average recency, surveys and further analysis are needed to find the causes
  • Give special promos/prices for a limited time

Low Value Customers

Characteristic: Shopping infrequently and spending less money than others

Marketing Strategy:

  • Focus on increasing monetization and frequency
  • Personalize products by recommending products based on previous purchases
  • Increase the brand awareness to increase shopping frequency

Summary & Our Recommendations

Low sales in states outside Sao Paulo:

  • Summary- There is an indication that high freight values cause the very large gap in total transactions between Sao Paulo and the other states
  • Recommendation- Develop a new strategy to lower the freight cost for customers

The longer the delivery time, the lower the rating score we will get:

  • Summary- The data shows the delivery time customers find acceptable, judging by the review scores.
  • Recommendation- (1) Improve the delay situation and customer satisfaction; (2) delivery time control can also be improved by having an Olist logistics center.

Pareto category sales declined while the market was growing:

  • Summary- The Pareto chart shows the product categories that matter most for Olist’s sales
  • Recommendation- Olist can boost its sales by focusing on the Household Needs and Fashion categories

Low customer retention rate and customer value:

  • Summary- New customers increased from 2017 to 2018, but the retention rate is very low, and from the RFM analysis we know that our customer base is dominated by Low Value Customers
  • Recommendation- (1) Give customers more attention through product recommendations and price incentives; (2) analyze customer churn behavior

We present all of the above visualizations combined into multiple dashboards

Dashboard 1

Dashboard 2

Dashboard 3


— — — — — — — — —

The Group M Team : Rona Fauzan Noer, Felix Sulistio, Dwi Mariastuti, and Chandra Agung Wijaya

Mentor : Andrian T

— — —

Specifically, I contributed to the following work:

1. Cleaning outliers, ‘0’ values, and null data using Python

2. Crafting visualization using Tableau and analyzing data in the “Relationship between Delivery Time and Customer Rating” part

3. Analyzing data on “Characteristic and Marketing Strategy based on RFM Customer Segmentation” part

4. Finalizing all the visualizations that my teammates have worked on with Tableau

5. Crafting Dashboard from data that has been analyzed by my team using Tableau

— — —

Since this is my team’s first analysis project, the results we provide may still be far from perfect. I am very open to, and would appreciate, any feedback and insight about this post.

Connect with me on LinkedIn

Thank you!
