SAS: Amazon Analytics
- Brandon Maddox
- May 3, 2021
- 11 min read
Amazon is arguably one of the most successful companies in history. Prime shipping is magic, the ordering process is simple and convenient, and the additional prime benefits have nurtured the dependencies of millions of people around the world. You can order health supplements, books for school, new clothes, and birthday gifts; all while on the couch in your pajamas, and it will usually be at your door within a few days.
This isn't meant to be an ad for Amazon, though. I recently discovered that you can download a report of every item you've ever purchased from Amazon. If you'd like to access the data yourself and are currently logged in to your Amazon account, this link should take you directly to the order history reports page. For this deep dive into my Amazon purchase history, I'll be using SAS 9.4 software. I know…SAS may not be as sexy as Python or R, but it’s a great piece of statistical software and I recently passed my Base Programmer Certification exam using SAS 9.4 so this should be good practice for me.
A technique that I find helpful is first skimming over the raw data and writing down as many questions that come to mind. Ideating for a few moments before digging in helps to direct my focus. Let me just say, I'm really interested to know what my purchase history looks like, but afraid to find out how much of my money I've spent with them. I'm well aware of how dependent I am, so I shouldn't be surprised when I see the grand total.
So how much money have I spent with Amazon since I created my account? What is the average price that I pay per item? How has my purchase behavior changed over time? What can Amazon expect from me as a customer over the next 50 years? Stay tuned, and let's find out.
Getting the data in
First, I import the .csv file from Amazon and since I've looked at the raw data, I've decided to establish a format for the month variable that I'll create later on.
proc import datafile='C:\...\Amazon Order History 1_23_21.csv'
out=mysas.Amazon
dbms=csv
replace;
getnames=yes;
guessingrows=100;
run;
proc format;
value month_name
1 = 'JAN' 2 = 'FEB' 3 = 'MAR'
4 = 'APR' 5 = 'MAY' 6 = 'JUN'
7 = 'JUL' 8 = 'AUG' 9 = 'SEP'
10 = 'OCT' 11 = 'NOV' 12 = 'DEC';
run;Next, I'll create a data step to clean and transform the dataset into what I'll use for the analysis. The data step below essentially makes the following modifications to the original dataset:
Exports month and year values from the order date into new variables
Cleans variables so that they are uniform across the entire dataset
Separates the carrier name and tracking number into their own columns
Calculates Total_Savings by subtracting the purchase price from the list price
Creates Days_to_Ship as the delta between the order date and the ship date
Removes observations that do not add value to my analysis of the dataset
Formats multiple currency values as well as our month variable from before
data mysas.Amazon(drop=Carrier_Name___Tracking_Number Tracking_Number Release_Date Shipping_Address_Street_1 Shipping_Address_Street_2 Buyer_Name Shipping_Address_Name Order_Status Payment_Instrument_Type);
set Mysas.Amazon;
Year=Year(Order_Date);
Month=Month(Order_Date);
Shipping_Address_City=upcase(Shipping_Address_City);
Shipping_Address_State=upcase(Shipping_Address_State);
Carrier_Name = substr(Carrier_Name___Tracking_Number, 1,
(index(Carrier_Name___Tracking_Number, '(') - 1));
if Carrier_Name = 'FEDEX' then Carrier_Name = 'FedEx';
Tracking_Number = substr(Carrier_Name___Tracking_Number,
index(Carrier_Name___Tracking_Number, '(') + 1);
Tracking = substr(Tracking_Number, 1, index(Tracking_Number, ")")-1);
if List_Price_Per_Unit=0 then
List_Price_Per_Unit=Purchase_Price_Per_Unit;
Total_Savings=(List_Price_Per_Unit-Purchase_Price_Per_Unit)*Quantity;
Days_to_Ship=DATDIF(Order_Date, Shipment_Date, 'ACT/ACT');
Shipping_Address_Zip=substr(Shipping_Address_Zip, 1, 5);
if substr(Payment_Instrument_Type, 1, 4) = "Amer" then
Payment='American Express';
else if substr(Payment_Instrument_Type, 1, 4) = "Visa" then
Payment='Visa';
else if substr(Payment_Instrument_Type, 1, 4) = "Gift" then
Payment='Gift Card';
else if substr(Payment_Instrument_Type, 1, 4) = "Disc" then
Payment='Discover';
else if substr(Payment_Instrument_Type, 1, 4) = "Mast" then
Payment='MasterCard';
if Category ne 'ABIS_GIFT_CARD' and Item_Total ne 0 and year ne 2021;
format Item_Total dollar10.2
List_Price_Per_Unit dollar10.2
Purchase_Price_Per_Unit dollar10.2
Item_Subtotal dollar10.2
Item_Subtotal_Tax dollar10.2
Total_Savings dollar10.2
Month month_name.;
run;proc contents data=mysas.amazon varnum;
run;

Alright so our dataset has 726 observations and 23 variables. Let's do some exploring on the variables above.
What does my purchase history look like?
proc print data=mysas.amazon(obs=1);
var Order_Date Title Item_Total;
run;
I have been an Amazon customer since December 25th, 2009. That means that 12 years ago on Christmas Day, I decided to create an Amazon account and purchase a Criminal Minds: Season 2 DVD Box Set for $48.99. That's correct, I paid close to FIFTY DOLLARS for a single season of a TV show that today we can watch all 15 seasons on Netflix for pennies.
proc print data=mysas.amazon;
sum Quantity Total_Savings Item_Subtotal_Tax Item_Total;
run;Over the course of my customer history, I have purchased 726 unique items from Amazon. It's embarrassing, but my grand total is $19,231.56 or roughly the cost of a brand new car, $864.42 of which has paid as sales tax. On each product listing, Amazon advertises a list price against the purchase price. I created the total savings variable by taking the delta of the two, then multiplying by the quantity purchased. Total savings sum to $8,958.33, but I have a feeling Amazon inflates those list prices on purpose to make you feel like you're getting a great deal, so I'll take this with a grain of salt.
I wonder what types of items I buy most often...
title1 "Purchase Category Frequencies";
proc freq data=mysas.amazon order=freq;
table Category / out=cat_freq;
run;
proc contents data=cat_freq varnum;
run;
proc print data=cat_freq (obs = 5);
run;
title1;
Evidently I use Amazon for most of my reading. Amazon originated as a bookstore in Bezos' garage in Bellevue Washington, so it makes sense that the highest frequency category would be books. Amazon has a few generic category labels, but some are oddly specific with 255 unique categories in my purchase history. I wonder what criteria they use to distinguish whether an item falls into a specific category, like "tent", or a more generic one like "outdoor living".
Can we talk about the 12 pairs of headphones? What am I doing with my life to possibly need that many sets of headphones?
What's are the highest and lowest priced items I've purchased on Amazon?
proc sort data=mysas.amazon out=amazon_price;
by descending Item_Total;
run;Sorting the dataset turned up a gift card reload as well as six items that show an item total of $0. Amazon counts gift card balance reloads in addition to all purchases paid using gift card balances, and the zero dollar items appear to be replacements for items that arrived damaged. I have removed all observations that fall under the category of ABIS_GIFT_CARD or have a value of $0.00 using the initial data step.
title 'Highest Priced Amazon Purchases';
proc print data=amazon_price (obs=10);
var Order_ID Title Category Order_Date Item_Total;
format Title $char78.;
run;
title;
title 'Lowest Priced Amazon Purchases';
proc print data=amazon_price (firstobs=722 obs=726);
var Order_ID Title Category Order_Date Item_Total;
format Title $char98.;
run;
title;
7 of the 10 highest priced items were textbooks for school; none of which I still own. The highest priced item, Intermediate Accounting with Annual Report, put me back $245.37 in August of 2012. For reference, the cost of that book was about half of an entire semesters tuition at the community college I was attending. Most of the expensive textbook purchases were made in 2012 so I must have realized soon afterward that renting textbooks was a better choice.

My lowest priced purchases include a contour brush, an eyeshadow brush, and a crease brush at $1.07 each. Based on the order date, I assume these were a birthday gift for my wife. See how these are another example of Amazon's odd categorization rules? Two of the brushes are labeled as "Cosmetic Brush" while the third is labeled as "Beauty". In February of 2019, I paid $1.97 for a used copy of Extreme Ownership by Jocko Willink; shipping was probably more expensive than the book itself.
I'm also interested to know how many items I purchase per year, and how that number has changed over time.
title 'Number of Items Purchased by Year';
proc freq data=mysas.amazon noprint;
table year / out=year_freq outcum;
run;
data year_freq;
set year_freq;
format PCT_CHNG 4.3;
if first.count then PCT_CHNG = 0;
PCT_CHNG = ((COUNT- LAG(COUNT)) / LAG(COUNT));
run;
proc print data=year_freq noobs;
run;
title;
Looks like I really started to hit my stride as an Amazon customer sometime in 2015. I have made approximately 82.5% of my purchases since 2016, and the percent change from year to year is generally positive. I anticipate that the total number of items I purchase each year will continue to rise while the percent change in the number of items purchased per year will flatten. Unless Amazon can figure out a way to become my sole source for groceries, glucose management equipment, and fitness apparel, my purchases in their marketplace are bound to level out at some point.
title 'Number of Items Purchased per Month';
proc freq data=mysas.amazon;
tables Year * Month / norow nocol nopercent;
run;
title;
March of 2016 was the last time I went an entire month without purchasing anything from Amazon, and I have only become more dependent since then. My purchases appear to be relatively constant throughout the year except for the moneymaker…December. 26.3% of all my Amazon purchases are made in the month of December. Evidently, Amazon is a crutch of mine for last minute Christmas shopping.
A few other frequency tables I found interesting...
title "Number of Items Purchased by Seller";
proc freq data=mysas.amazon order=freq;
table seller / nopercent norow nocol;
run;
title;
title 'Items Shipped per Carrier per Year';
proc freq data=mysas.amazon;
tables Carrier_Name*Year / norow nocol nopercent;
where Carrier_Name ne 'China Post' and Carrier_Name ne 'Yun Express';
run;
title;
title "Number of Items by Payment Type";
proc freq data=mysas.amazon order=freq;
table Payment;
run;
title;
I can see that Amazon started their personal delivery service in 2016 and now does the majority of their shipping internally, rather than relying on third party carriers. The shipping carrier and seller frequency tables make me wonder if there is a significant difference in the number of days it takes to ship an item based on whether or not the item is sold by or shipped by Amazon. I'll make note of this and see if I can run a regression on those variables a little later.
How have my purchases changed over the past five years?
The bulk of my purchases are between 2016 and 2020, and since that creates a clean 5 year set, I am going to create a smaller subset of the data that I will use for future analysis. I'm also going to include some dummy variables that indicate whether or not an item was shipped or sold by Amazon, so that I can run an analysis against those two variables.
data mysas.amazon_analysis(drop=Shipping_Address_Street_1);
set mysas.amazon;
if seller = 'Amazon.com' then sold_by_amazon = 1;
else sold_by_amazon = 0;
if Carrier_Name = 'AMZN_US' then shipped_by_amazon = 1;
else shipped_by_amazon = 0;
if year ge 2016 and year le 2020;
run;So what is the average price I pay per item?
proc means data=mysas.amazon maxdec=2 n mean stddev min max;
var Item_Total;
run;
OK, so on average I spend $26.49 per item with Amazon, all items inclusive. What about broken up per year?
title "Average Item Price per Year";
proc means data=mysas.amazon_analysis maxdec=2 nonobs n mean stddev min max;
var Item_Total;
Class Year;
output out=average_price;
run;
title;
And how does this number change over time?
data average_price(drop=_type_ _stat_ _freq_);
set average_price;
where _type_ = 1 and _stat_='MEAN';
PCT_CHNG = ((Item_Total - LAG(Item_Total)) / LAG(Item_Total));
run;
title 'Average Item Price by Year';
proc print data=average_price noobs;
run;
title;
So it looks like the average price I pay per item has been steadily increasing over the last four years. Now I'm interested to know the total I spent for each of the last five years, but I want the output to be granular enough to view the amount spent each month out of every year.
data amazon_temp;
set mysas.amazon_analysis;
by year month;
if first.month then monthly_total=0;
monthly_total + item_total;
if last.month;
run;
proc means data=amazon_temp mean noprint;
class year month;
var monthly_total;
output out=amazon_monthly mean=monthly_total / autoname autolabel;
run;
data amazon_monthly(drop=_type_ _freq_);
set amazon_monthly;
where _type_ = 3;
format monthly_total dollar10.2;
run;
proc transpose data=amazon_monthly out=monthly prefix=Yr;
by Year month;
id Year;
var monthly_total;
run;
data Yr2016(keep=month Yr2016);
set monthly;
where Year=2016;
run;
data Yr2017(keep=month Yr2017);
set monthly;
where Year=2017;
run;
data Yr2018(keep=month Yr2018);
set monthly;
where Year=2018;
run;
data Yr2019(keep=month Yr2019);
set monthly;
where Year=2019;
run;
data Yr2020(keep=month Yr2020);
set monthly;
where Year=2020;
run;
data monthly_merge;
merge Yr2016 Yr2017 Yr2018 Yr2019 Yr2020;
by month;
if Yr2016 = . then Yr2016=0;
format Yr2016 Yr2017 Yr2018 Yr2019 Yr2020 dollar10.2;
run;
title "Yearly Spending by Month";
proc print data=monthly_merge noobs;
sum Yr2016 Yr2017 Yr2018 Yr2019 Yr2020;
run;
title;

Days to Ship - Amazon vs. Non-Amazon
I was interested to know if there is a significant difference in the number of days it takes to ship an item based on whether or not the item is sold by or shipped by Amazon. To find out, I'll run two regressions using days to ship as the dependent variable, and the dummy variables sold_by_amazon, and shipped_by_amazon as the independent variable. Fun stat fact alert: running a regression using a dummy variable is the same as conducting a two sample pooled t-test, and to prove it I've included both procedures and output below.
Do items sold by Amazon ship faster than items sold by other sellers?
proc reg data=mysas.amazon_analysis;
model days_to_ship=sold_by_amazon;
run; quit;
proc ttest data= mysas.amazon_analysis;
class sold_by_amazon;
var days_to_ship;
run; quit;

It appears that items sold by Amazon actually ship slower than third party sellers. On average, items that are NOT sold by Amazon ship out 0.84 days at the order date, and items that are sold by Amazon ship (0.83702 + 0.57227) = 1.4093 days days after the order date. The results of the t-test back up this argument and the parameter estimates from the regression are significant which indicate that the probability of seeing the differences in the means of the two samples, given the variance of each sample, is relatively low.
Do items shipped with Amazon's delivery service ship faster than other carriers?
proc reg data=mysas.amazon_analysis;
model days_to_ship=shipped_by_amazon;
run; quit;
proc ttest data= mysas.amazon_analysis;
class shipped_by_amazon;
var days_to_ship;
run; quit;

Amazon appears to have the upper hand when shipping through their own carrier services over third party carriers like USPS, UPS, and FedEx. The parameter estimates of the regression and the mean values for the two sample t-test both show that items shipped with third party carriers take an average of 1.36 days to ship, whereas the effect of using Amazon's in house delivery service decreases the average days to ship to 0.83529.
What is my overall lifetime value to Amazon?
I don't have access to profit margin data, but I can at least calculate my lifetime revenue generation capacity as my customer lifetime value (CLV). Amazon surely has a value that they'll expect me to spend over the course of my life. Assuming that I will be a lifelong customer, and based on my current age, that is approximately an additional 50 years of online shopping.
Using Average Costs:
CLV = Avg. Item Cost x Avg. Yearly Purchase Frequency x Retention Period in Years
$26.49 x 136 x 50 = $180,132.00
CLV = Avg. Monthly Expenditure x Retention Period in Months
$300.07 x (600) = $180,042.00
OK, so both these numbers are really close which is promising, but I actually think my lifetime revenue generation capacity is a larger number. We've already seen that number of items I purchase per year, as well as the average price I pay per item show a positive trend. Typically, our purchasing power improves as we grow older. So what does my revenue generation capacity look like keeping in mind this forecast of future growth?
Using 2020 Values:
CLV = Avg. Item Cost x Avg. Yearly Purchase Frequency x Retention Period in Years
29.41 x 165 x 50 = $242,632.50
I can find the average percent change in the number of items I purchase from year to year, and the average percent change in the price I pay per item using the means procedure.
data total_count_merged;
merge year_freq average_price;
by year;
drop percent cum_freq cum_pct pct_chng;
count_change = (count - lag(count)) / lag(count);
item_total_change = (item_total - lag(item_total)) / lag(item_total);
run;
proc means data=total_count_merged;
var item_total_change count_change;
where year > 2016;
run;
Using an average purchase price increase of 5.15 percent per year and an average increase in the number of items purchased at 17.34 percent per year, I can more confidently estimate a value for my lifetime expenditure at Amazon. Since I believe that both will plateau at some point in the future, both will remain steady after the three year mark.
data CLV;
item=29.41;
number=165;
do year=1 to 3;
item=round(item * 1.0515, .01);
number=round(number*1.1734, 1);
value+item*number;
output;
end;
do year=4 to 50;
value+item*number;
output;
end;
format value dollar13.2;
format item dollar10.2;
run;
proc print data=CLV (firstobs = 50 obs = 50);
run;
And there it is, a lifetime revenue generation capacity of approximately $453,102 assuming that Amazon is around for the long haul. Even over the course of 60 years, shy of half a million dollars is a lot of money to spend at a single company. The true amount I'll spend probably exists somewhere between the formula using existing averages and average growth formula.
Conclusion
It has been enlightening to find out just how much I rely on Amazon, and to see how that dependence has grown over the past 5 - 10 years. As easy as they make it to purchase something, seeing the numbers in front of me helps me feel accountable for the amount I spend. Hopefully this experience will empower me to make better decisions the next time I'm face to face with the "Buy Now" button, but I doubt it.

Comments