Exploring Data

Analyzing the behaviors of online shoppers

Our dataset, online_shoppers_intention_explore.csv, stores data on the behaviors of visitors to an online shopping website. Our goal is to study this behavior using a variety of Nimble features in order to extract insights about the users.

In this example we will explore the data through Nimble’s calculate module, data object methods, and plotting.

Getting Started

[2]:
import nimble

bucket = 'https://storage.googleapis.com/nimble/datasets/'
visits = nimble.data(bucket + 'online_shoppers_intention_explore.csv',
                     returnType="Matrix")
featureNames = visits.features.getNames()
print('Features in dataset: ')
print(featureNames)
Features in dataset:
['Administrative', 'Admin_Duration', 'Informational', 'Info_Duration', 'ProductRelated', 'Product_Duration', 'BounceRates', 'ExitRates', 'PageValues', 'SpecialDay', 'Month', 'OperatingSystems', 'Browser', 'Region', 'TrafficType', 'NewVisitor', 'Weekend', 'Purchase']

Data Overview

This dataset has 18 features, too many to show at one time. Let’s begin exploring our data by looking at groups of similar features.

This online shopping website is composed of three different types of webpages (Administrative, Informational, and Product Related). Our first 6 features record the page counts and the time spent on each page type for each visit.

[3]:
visits.show('Page activity features', points=7, features=range(0,6))
Page activity features
12212pt x 18ft
        Administrative  Admin_Duration  Informational  Info_Duration  ProductRelated  Product_Duration
      ┌───────────────────────────────────────────────────────────────────────────────────────────────
    0 │       0              0.000            0            0.000             1              0.000
    1 │       0              0.000            0            0.000             2             64.000
    2 │       0              0.000            0            0.000             1              0.000
    │ │       │                │              │              │               │               │
12209 │       0              0.000            0            0.000             6            184.250
12210 │       4             75.000            0            0.000            15            346.000
12211 │       0              0.000            0            0.000             3             21.250

The next 3 features are website analytics collected during the visit.

[4]:
visits.show('Website analytic features', points=7, features=range(6,9))
Website analytic features
12212pt x 18ft
        BounceRates  ExitRates  PageValues
      ┌───────────────────────────────────
    0 │    0.200       0.200      0.000
    1 │    0.000       0.100      0.000
    2 │    0.200       0.200      0.000
    │ │      │           │          │
12209 │    0.083       0.087      0.000
12210 │    0.000       0.021      0.000
12211 │    0.000       0.067      0.000

The last 9 features are details about the visit or visitor.

[5]:
visits.show('Visit detail features', points=7, features=range(9,18))
Visit detail features
12212pt x 18ft
        SpecialDay  Month  OperatingSystems  Browser  Region  TrafficType  NewVisitor  Weekend  Purchase
      ┌─────────────────────────────────────────────────────────────────────────────────────────────────
    0 │   0.000       0           1             1       1           1        False      False    False
    1 │   0.000       0           2             2       1           2        False      False    False
    2 │   0.000       0           4             1       9           3        False      False    False
    │ │     │         │           │             │       │          │           │          │        │
12209 │   0.000       7           3             2       1          13        False       True    False
12210 │   0.000       7           2             2       3          11        False      False    False
12211 │   0.000       7           3             2       1           2         True       True    False

Now that we have a better understanding of our data, let’s see what we can learn from it.

Exploring data through Nimble’s calculate module

Reaching product-related pages is important for maximizing the chance that a purchase is made. This site categorizes its pages into three types (“Administrative”, “Informational”, and “ProductRelated”). Let’s calculate the mean and median counts for each page type and find out whether most visitors are reaching a product-related page.

[6]:
for ft in ['Administrative', 'Informational', 'ProductRelated']:
    mean = nimble.calculate.mean(visits[:, ft])
    print('Mean', ft, 'hits per visit', mean)
    median = nimble.calculate.median(visits[:, ft])
    print('Median', ft, 'hits per visit', median)

noProduct = nimble.calculate.proportionZero(visits.features['ProductRelated'])
print('Proportion of visitors that view a product page:', 1 - noProduct)
Mean Administrative hits per visit 2.27104487389453
Median Administrative hits per visit 1.0
Mean Informational hits per visit 0.4758434326891582
Median Informational hits per visit 0.0
Mean ProductRelated hits per visit 30.687684245004913
Median ProductRelated hits per visit 18.0
Proportion of visitors that view a product page: 0.9968883065836882

We see that the mean values are consistently higher than the median values. Since the mean is sensitive to outliers, this indicates that some visitors view a very high number of pages. We are also happy to see that nearly every visitor interacts with at least one product-related page during their visit.
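
To illustrate why a few heavy visitors pull the mean above the median, here is a minimal sketch using only Python's standard library. The page_views list is hypothetical toy data, not taken from the dataset, and the zero-share calculation reflects what we take proportionZero to report based on its use above.

```python
from statistics import mean, median

# Hypothetical page-view counts for ten visits; the last visit is an outlier.
page_views = [0, 1, 1, 2, 2, 3, 3, 4, 5, 120]

# The single large value pulls the mean upward.
print("Mean:  ", mean(page_views))
# The median depends only on the middle values, not on the outlier's size.
print("Median:", median(page_views))

# Share of visits with zero page views (assumed to match proportionZero).
proportion_zero = sum(1 for v in page_views if v == 0) / len(page_views)
print("Proportion with at least one view:", 1 - proportion_zero)
```

Replacing 120 with an even larger number changes the mean but leaves the median untouched, which is the pattern we see in the real page counts.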

Exploring data through data object methods

Now that we know visitors are typically viewing product pages, let’s focus on the Purchase feature. Purchase is a boolean feature indicating whether a purchase was made. Let’s count the number of times the Purchase feature was true, and divide by the number of visits to see the proportion that result in a purchase.

[7]:
purchases = visits.points.count(lambda pt: bool(pt['Purchase']))
print('Proportion of visits with a purchase:', purchases / len(visits.points))
Proportion of visits with a purchase: 0.15263675073698002

Now let’s check how Purchase correlates with our other features.

[8]:
correlations = visits.features.similarities('correlation')
correlations[:, 'Purchase'].show('Feature correlations with Purchase')
Feature correlations with Purchase
18pt x 1ft
                   Purchase
                 ┌─────────
  Administrative │   0.139
  Admin_Duration │   0.106
   Informational │   0.100
   Info_Duration │   0.089
  ProductRelated │   0.165
Product_Duration │   0.172
     BounceRates │  -0.150
       ExitRates │  -0.206
      PageValues │   0.538
      SpecialDay │  -0.082
           Month │   0.104
OperatingSystems │  -0.017
         Browser │   0.022
          Region │  -0.012
     TrafficType │  -0.006
      NewVisitor │   0.103
         Weekend │   0.030
        Purchase │   1.000
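
For readers who want to see what a single value in this table represents, the sketch below computes a correlation by hand. It assumes that the 'correlation' similarity is the Pearson correlation coefficient, which this example does not state explicitly. The pearson helper and the toy values are hypothetical and are not part of Nimble or the dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (covariance times n).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (variances times n).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy data: a numeric feature and a boolean outcome.
page_values = [0.0, 0.0, 5.0, 0.0, 20.0, 35.0]
purchase = [False, False, False, False, True, True]

# Booleans are converted to 0/1 so they can be correlated with numbers.
r = pearson(page_values, [int(p) for p in purchase])
print(f"Correlation: {r:.3f}")
```

A value near 1 means the two features rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.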

The SpecialDay feature ranges from 0 to 1, indicating proximity to a special day. Most days have a value of 0 but, for example, a visit three days before Mother’s Day could have a value of 0.4, a visit the day before Mother’s Day would have a higher value of 0.8, and a visit on Mother’s Day would have a value of 1. We might expect a special day to increase purchases made on the site, but we see above that SpecialDay has a negative correlation with Purchase. Let’s investigate.

First, we will find what percent of visits were near a special day. The ‘copy’ function accepts a variety of arguments, including a ‘query’. In Nimble, a query is a string expression that evaluates to a boolean value, in this case once for each point in the data. The boolean result determines whether a point is copied. See Nimble’s QueryString documentation for more details.

[9]:
special = visits.points.copy('SpecialDay > 0')
visitPercent = len(special.points) / len(visits.points) * 100
print(f'{visitPercent:.2f}% of all visits were near a special day')
10.19% of all visits were near a special day

Now, let’s see what percent of purchases occurred near a special day.

[10]:
specialPurchases = special.points.count(lambda pt: pt['Purchase'])
purchasePercent = specialPurchases / purchases * 100
print(f'{purchasePercent:.2f}% of all purchases were near a special day')
4.08% of all purchases were near a special day

Visits near a special day represent over 10% of visits, but only about 4% of purchases. It appears that these days attract more visitors to the site, but these visitors are less likely to make a purchase.
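
The two percentages can be turned into purchase rates for each group. The arithmetic below is a rough sketch that uses only the rounded figures printed earlier in this example, so the results are approximate. The variable names are ours.

```python
# Figures reported earlier in this example (rounded as printed).
overall_rate = 0.1526       # proportion of all visits with a purchase
special_visits = 0.1019     # share of visits near a special day
special_purchases = 0.0408  # share of purchases near a special day

# Purchase rate within a group = group's purchases / group's visits.
# Both are expressed here as fractions of the overall totals.
rate_special = special_purchases * overall_rate / special_visits
rate_other = (1 - special_purchases) * overall_rate / (1 - special_visits)

print(f"Purchase rate near a special day: {rate_special:.1%}")
print(f"Purchase rate on other days:      {rate_other:.1%}")
```

With these inputs, the purchase rate near a special day comes out at roughly 6%, compared with roughly 16% on other days.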

Exploring data through plotting

We saw above that visits near a special day lead to fewer purchases. Now let’s explore the impact of location on purchases. The location of each visit is classified into one of nine regions, so let’s see the distribution of visits by region.

[11]:
visits.plotFeatureDistribution('Region')
[Figure: distribution of visits by Region]

We see above that region 1 provides the most visits to the website and that regions 1 and 3 combine for over 50% of website traffic. Now, let’s check if some regions are more likely to make a purchase. We can use plotFeatureGroupStatistics to do this. Since this function groups by the Region feature, the regions on the x-axis will be in order of appearance in the data. To keep them in ascending numeric order, we will first sort our data by Region. Once sorted, plotFeatureGroupStatistics will find the count of values in the Purchase feature for each Region. Then, it will further subdivide each count bar based on the values in Purchase (True or False). Now we can see if any regions are noticeably better or worse at providing visits with a purchase.

[12]:
visits.points.sort('Region')
visits.plotFeatureGroupStatistics(nimble.calculate.count, 'Purchase', 'Region',
                                  subgroupFeature='Purchase',
                                  color=['red', 'blue'])
[Figure: count of visits for each Region, subdivided by Purchase (True or False)]
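
The grouping behind this plot can be sketched in plain Python. This is only an illustration of counting by group and subgroup, not how Nimble implements plotFeatureGroupStatistics. The visits_sample pairs are hypothetical toy data.

```python
from collections import Counter

# Hypothetical (Region, Purchase) pairs standing in for the real visits.
visits_sample = [(3, False), (1, True), (1, False), (2, False),
                 (1, False), (3, True), (2, False), (1, True)]

# Count visits for every (region, purchase) combination.
counts = Counter(visits_sample)

# Visit regions in ascending order, mirroring the sort applied before plotting.
regions = sorted({region for region, _ in visits_sample})
for region in regions:
    bought = counts[(region, True)]       # a missing combination counts as 0
    not_bought = counts[(region, False)]
    rate = bought / (bought + not_bought)
    print(f"Region {region}: {bought} with purchase, "
          f"{not_bought} without, rate {rate:.2f}")
```

Comparing the rate column across regions is the numeric counterpart of comparing how each bar is split in the plot.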

It does not appear that any region is making disproportionately more or fewer purchases than the others. We have learned a lot about our website data through this exploration. Next, see how we can use Nimble to extract more insight from this dataset using machine learning in our Unsupervised Learning example.

References:

Sakar, C.O., Polat, S.O., Katircioglu, M. et al. Neural Comput & Applic (2018). [https://doi.org/10.1007/s00521-018-3523-0]

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

Link to original dataset: https://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset