Assignment 3: Niveditha Achanta

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type food form price_per_lb yield lb_per_cup price_per_cup
fruit watermelon Fresh1 0.333412 0.52 0.330693 0.212033
fruit cantaloupe Fresh1 0.535874 0.51 0.374786 0.3938
vegetables onions Fresh1 1.03811 0.9 0.35274 0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

Ask questions and search the documentation/web to find the functions you need.

In [3]:
#Import modules 
import os
import pandas as pd
# Import modules for plots
import matplotlib.pyplot as plt
import seaborn as sns
In [4]:
#Set directory
directory = 'C:\\Users\\Niveditha_2\\Documents\\STA141B\\assignment3\\fruit'
files=[] #Initialize a list 
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"): #pull all the Excel files
        paths = (os.path.join(directory, filename))
        files.append(paths) #Put the filepaths in the list  
In [5]:
df = pd.DataFrame() #Initialize a dataframe 

for i in files:
    if "pineapple" in i: #Because pineapple has 9 columns if you look at its dimensions 
        continue
    path = i.split("\\") #Split the filepath by \\ so we can use its elements in the dataframe later 
   
    data = pd.read_excel(i, skiprows=[0]).dropna() #Ignore the first row for every spreadsheet
    data['Type'] = path[-2] #Adds the second to last element of the filepath, in this case, "fruit"
    data['Food'] = path[-1][:-5] #Adds the last element of the filepath, minus the ".xlsx"
    #print data.axes
    
    newRow = data[data.Form == "Fresh1"] 
    df = df.append(data[data.Form == "Fresh1"])
fruit = df.drop(df.columns[[2,5]], axis=1) #Drop the columns with unnecessary information 
fruit.columns=["Form", "Avg Price per Lb", "Yield", "Lb per Cup", "Avg Price per Cup", "Type", "Food"]    
fruit = fruit.sort_index(axis=1) #Sort by column names to maintain the same order as the veg df below
fruit
Out[5]:
Avg Price per Cup Avg Price per Lb Food Form Lb per Cup Type Yield
1 0.422373 1.56752 apples Fresh1 0.242508 fruit 0.9
1 1.1891 3.04007 apricots Fresh1 0.363763 fruit 0.93
1 0.292965 0.566983 bananas Fresh1 0.330693 fruit 0.64
1 1.59318 4.73462 blueberries Fresh1 0.31967 fruit 0.95
1 0.3938 0.535874 cantaloupe Fresh1 0.374786 fruit 0.51
1 1.33455 3.59299 cherries Fresh1 0.341717 fruit 0.92
1 0.848278 0.897802 grapefruit Fresh1 0.462971 fruit 0.49
1 0.721266 2.09383 grapes Fresh1 0.330693 fruit 0.96
1 0.649077 0.796656 honeydew Fresh1 0.374786 fruit 0.46
1 1.03797 2.04468 kiwi Fresh1 0.385809 fruit 0.76
1 0.705783 1.37756 mangoes Fresh1 0.363763 fruit 0.71
1 0.618667 1.76115 nectarines Fresh1 0.31967 fruit 0.91
1 0.578357 1.03517 oranges Fresh1 0.407855 fruit 0.73
1 0.646174 1.29801 papaya Fresh1 0.308647 fruit 0.62
1 0.56639 1.59119 peaches Fresh1 0.341717 fruit 0.96
1 0.59074 1.46157 pears Fresh1 0.363763 fruit 0.9
1 0.707176 1.82742 plums Fresh1 0.363763 fruit 0.94
1 1.32634 2.17359 pomegranate Fresh1 0.341717 fruit 0.56
1 2.32287 6.97581 raspberries Fresh1 0.31967 fruit 0.96
1 0.802171 2.35881 strawberries Fresh1 0.31967 fruit 0.94
1 0.759471 1.37796 tangerines Fresh1 0.407855 fruit 0.74
1 0.212033 0.333412 watermelon Fresh1 0.330693 fruit 0.52

Comments: You will notice that the above table includes just fruits tagged with "Fresh1" to provide exact matches. However, fruits such as blackberries and pineapple are excluded. This may be because the dataframe is only taking elements that have 9 columns. Pineapple, for example, has extraneous columns that get excluded in the analysis. Additionally, fruits might be missing because the above code only provides exact matches for "Fresh1", and the reason for this is that in cases like this, it is more important to have an accurate dataset than one that is comprehensive but may contain misinformation. We want to use an exact match here so that it returns consistent data throughout.

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

In [7]:
#Set directory
directory = 'C:\\Users\\Niveditha_2\\Documents\\STA141B\\assignment3\\vegetables'
files=[] #Initialize a list 
for filename in os.listdir(directory):
    if filename.endswith(".xlsx"): #pull all the Excel files
        paths = (os.path.join(directory, filename))
        files.append(paths) #Put the filepaths in the list    
df=pd.DataFrame() #Initialize a dataframe 

for i in files:
    path = i.split("\\") #Split the filepath by \\ so we can use its elements in the dataframe later 
    data=pd.read_excel(i, skiprows=[0]).dropna() #Ignore the first row for every spreadsheet
    data['Type'] = path[-2] #Adds the second to last element of the filepath, in this case, "fruit"
    data['Food'] = path[-1][:-5] #Adds the last element of the filepath, minus the ".xlsx"

    df = df.append(data[data.Form == "Fresh1"])
    
veg = df.drop(df.columns[[7, 8,9, 10]], axis=1)
veg.columns = ["Avg Price per Cup", "Avg Price per Lb", "Food", "Form", "Yield", "Lb per Cup",  "Type"]
veg = veg.sort_index(axis=1)
veg
Out[7]:
Avg Price per Cup Avg Price per Lb Food Form Lb per Cup Type Yield
1 1.15536 1.172248 acorn_squash Fresh1 0.451948 vegetables 0.458554
1 2.27497 2.213050 artichoke Fresh1 0.385809 vegetables 0.375309
1 2.58227 3.213494 asparagus Fresh1 0.396832 vegetables 0.493835
1 0.964886 2.235874 avocados Fresh1 0.31967 vegetables 0.740753
1 0.890898 2.763553 brussels_sprouts Fresh1 0.341717 vegetables 1.06
1 0.787893 1.244737 butternut_squash Fresh1 0.451948 vegetables 0.714
1 0.650001 2.630838 collard_greens Fresh1 0.286601 vegetables 1.16
1 1.8125 2.690623 corn_sweet Fresh1 0.363763 vegetables 0.54
1 0.696606 2.139972 green_beans Fresh1 0.275578 vegetables 0.846575
1 0.455022 1.410363 green_peppers Fresh1 0.264555 vegetables 0.82
1 0.766262 2.807302 kale Fresh1 0.286601 vegetables 1.05
1 0.309655 1.213039 lettuce_iceberg Fresh1 0.242508 vegetables 0.95
1 0.944032 2.569235 mustard_greens Fresh1 0.308647 vegetables 0.84
1 1.47315 3.213552 okra Fresh1 0.35274 vegetables 0.769474
1 0.406868 1.038107 onions Fresh1 0.35274 vegetables 0.9
1 0.184017 0.564320 potatoes Fresh1 0.264555 vegetables 0.811301
1 0.401618 1.311629 radish Fresh1 0.275578 vegetables 0.9
1 0.734926 2.277940 red_peppers Fresh1 0.264555 vegetables 0.82
1 0.84548 1.639477 summer_squash Fresh1 0.396832 vegetables 0.7695
1 0.4994 0.918897 sweet_potatoes Fresh1 0.440925 vegetables 0.811301
1 1.05353 2.471749 turnip_greens Fresh1 0.31967 vegetables 0.75

Comments: No, it doesn't produce any price for tomatoes. This is because the tomatoes' price is split into various categories: grape and cherry, roma, and beefsteak. The code similarly does not return values for vegetables that have different subcategories-- for example, spinach is split into boiled and raw, so there isn't one consolidated price value for it.

The code itself is designed to return exact matches for "Fresh1", so if something has a subcategory, it will not return it. This avoids errors and problems that may arise from data mismatches.

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

In [8]:
df3 = fruit.append(veg) #Append veg to fruit 
df3 = df3.apply(lambda x: pd.to_numeric(x, errors = 'ignore')) #Make sure all columns with numbers are actually numeric
df3
Out[8]:
Avg Price per Cup Avg Price per Lb Food Form Lb per Cup Type Yield
1 0.422373 1.567515 apples Fresh1 0.242508 fruit 0.900000
1 1.189102 3.040072 apricots Fresh1 0.363763 fruit 0.930000
1 0.292965 0.566983 bananas Fresh1 0.330693 fruit 0.640000
1 1.593177 4.734622 blueberries Fresh1 0.319670 fruit 0.950000
1 0.393800 0.535874 cantaloupe Fresh1 0.374786 fruit 0.510000
1 1.334548 3.592990 cherries Fresh1 0.341717 fruit 0.920000
1 0.848278 0.897802 grapefruit Fresh1 0.462971 fruit 0.490000
1 0.721266 2.093827 grapes Fresh1 0.330693 fruit 0.960000
1 0.649077 0.796656 honeydew Fresh1 0.374786 fruit 0.460000
1 1.037970 2.044683 kiwi Fresh1 0.385809 fruit 0.760000
1 0.705783 1.377563 mangoes Fresh1 0.363763 fruit 0.710000
1 0.618667 1.761148 nectarines Fresh1 0.319670 fruit 0.910000
1 0.578357 1.035173 oranges Fresh1 0.407855 fruit 0.730000
1 0.646174 1.298012 papaya Fresh1 0.308647 fruit 0.620000
1 0.566390 1.591187 peaches Fresh1 0.341717 fruit 0.960000
1 0.590740 1.461575 pears Fresh1 0.363763 fruit 0.900000
1 0.707176 1.827416 plums Fresh1 0.363763 fruit 0.940000
1 1.326342 2.173590 pomegranate Fresh1 0.341717 fruit 0.560000
1 2.322874 6.975811 raspberries Fresh1 0.319670 fruit 0.960000
1 0.802171 2.358808 strawberries Fresh1 0.319670 fruit 0.940000
1 0.759471 1.377962 tangerines Fresh1 0.407855 fruit 0.740000
1 0.212033 0.333412 watermelon Fresh1 0.330693 fruit 0.520000
1 1.155360 1.172248 acorn_squash Fresh1 0.451948 vegetables 0.458554
1 2.274967 2.213050 artichoke Fresh1 0.385809 vegetables 0.375309
1 2.582272 3.213494 asparagus Fresh1 0.396832 vegetables 0.493835
1 0.964886 2.235874 avocados Fresh1 0.319670 vegetables 0.740753
1 0.890898 2.763553 brussels_sprouts Fresh1 0.341717 vegetables 1.060000
1 0.787893 1.244737 butternut_squash Fresh1 0.451948 vegetables 0.714000
1 0.650001 2.630838 collard_greens Fresh1 0.286601 vegetables 1.160000
1 1.812497 2.690623 corn_sweet Fresh1 0.363763 vegetables 0.540000
1 0.696606 2.139972 green_beans Fresh1 0.275578 vegetables 0.846575
1 0.455022 1.410363 green_peppers Fresh1 0.264555 vegetables 0.820000
1 0.766262 2.807302 kale Fresh1 0.286601 vegetables 1.050000
1 0.309655 1.213039 lettuce_iceberg Fresh1 0.242508 vegetables 0.950000
1 0.944032 2.569235 mustard_greens Fresh1 0.308647 vegetables 0.840000
1 1.473146 3.213552 okra Fresh1 0.352740 vegetables 0.769474
1 0.406868 1.038107 onions Fresh1 0.352740 vegetables 0.900000
1 0.184017 0.564320 potatoes Fresh1 0.264555 vegetables 0.811301
1 0.401618 1.311629 radish Fresh1 0.275578 vegetables 0.900000
1 0.734926 2.277940 red_peppers Fresh1 0.264555 vegetables 0.820000
1 0.845480 1.639477 summer_squash Fresh1 0.396832 vegetables 0.769500
1 0.499400 0.918897 sweet_potatoes Fresh1 0.440925 vegetables 0.811301
1 1.053526 2.471749 turnip_greens Fresh1 0.319670 vegetables 0.750000

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

  • What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
  • How do the price distributions compare for fruit and vegetables?
  • Which foods are the best value for the price?
  • What's something surprising about this data set?
  • Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
In [21]:
#Most expensive foods per pound: compare "Avg Price per Lb" for fruits 
exp = sns.factorplot("Food", "Avg Price per Lb", data = fruit, kind = "bar") #data=veg to see vegetable bar plot
exp.set_xticklabels(rotation=90) #Rotate axis labels so they don't overlap
exp.fig.suptitle('Fruit Prices in Dollars')
plt.show(exp)

Comments: Raspberries are the most expensive fruit and watermelons are the cheapest per pound.

In [16]:
#Distributions
dists=sns.boxplot("Avg Price per Lb", "Type", data = df3)
plt.show(dists)

Comments: Other than the existence of two outliers (raspberries and blueberries, as can be seen from the first plot), the fruit prices are less skewed. They also have a wider range than the vegetables, which have no outliers. The median for the vegetables group is slightly above \$2, and is higher than the median for fruits, which is a little higher than \'1.50.

In [22]:
#Best value for price
#Divide lb per cup by avg price per cup to get lb per price 
df3['Value'] = df3['Avg Price per Cup']/df3['Yield']
df3=df3.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') 
#ascending=False for least value for money
vals=df3[:5:] #Only first five rows
valplot=sns.factorplot("Food", "Value", data = vals, kind = "bar") #Barplot
valplot.set_xticklabels(rotation=30) #Rotate axis labels for readability
valplot.fig.suptitle('Food Price (in Dollars) per Yield')
plt.show(valplot)

Comments: The five foods with the overall best value for money, by comparing price per cup to yield, are potatoes, iceberg lettuce, watermelon, radish, and onions. If we want to find out the five fruits and five vegetables with the best value for money, we can do that as well, and this is pictured below. The two graphs below show us that the fruits with the most value for money (in order of most to least valuable) are watermelon, bananas, apples, peaches, and pears. The vegetables results are potatoes, iceberg lettuce, radish, onions, and green peppers.

In [19]:
#Top 5 for fruit
fruit['Value'] = fruit['Avg Price per Cup']/fruit['Yield']
fruit = fruit.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
vals = fruit[:5:]
valplot = sns.factorplot("Food", "Value", data = vals, kind = "bar")
valplot.set_xticklabels(rotation=30)
valplot.fig.suptitle('Fruit Price (in Dollars) per Yield')
plt.show(valplot)

#Top 5 for veg
veg['Value'] = veg['Avg Price per Cup']/veg['Yield']
veg = veg.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
vals = veg[:5:]
valplot=sns.factorplot("Food", "Value", data = vals, kind = "bar")
valplot.set_xticklabels(rotation=30)
valplot.fig.suptitle('Vegetable Price (in Dollars) per Yield')
plt.show(valplot)

Comments: I'm suprised by the fact that fruits similar to watermelon, such as canteloupe and honeydew, have less value for money than apples, peaches, and pears, especially because they're also some of the cheapest fruits in the fruit group (this can be verified visually by looking at the first barplot). Additionally, I would expect strawberries to be as expensive, if not more, than raspberries and blueberries, which is not the case. Even after ranking the fruits to see which one has the least value for money, strawberries does not make that list, which means it is not as relatively expensive (i.e. least valuable) as some of the other fruit. Also, when the fruits and vegetables together are ranked from least to greatest value from money (which can be done by changing ascending=False in the code above, the list results are artichoke, asparagus, sweet corn, acorn squash, and raspberries. This is interesting because out of the five most relatively expensive items, four of them are vegetables and only one of them is a fruit, even though the fruit price distribution shows that there are two outliers with two fruits being extremely expensive per pound, as well as fruit having a higher price range than vegetables.

Comments: Since the spreadsheets have not recorded information on nutrition, based on intuition, I would assume that all the fruit (maybe not watermelon, since it has the highest water content as compared to actual nutrition value), as well as radishes and green peppers, would be the most beneficial in terms of price, yield, and nutrition. I came to that conclusion by looking at the highest value for money for both fruits and vegetables, and then decided that most fruits are heralded to be high in nutrition, and potatoes, iceberg lettuce, and onions would not necessarily provide nutritional value because potatoes are high in starch content, iceberg lettuce (like watermelon) has a high water content and not much else, and I've never really heard of onions providing high nutritional value.

Notes: Some input from Ricky Safran, Chad Pickering, and Hannah Kosinovsky, even though I'm pretty sure it was just moral support this time around. Everything else is courtesy of various Stackoverflow threads.