The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip
file contains the fruit and vegetable files in the directories fruit
and vegetables
, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
---|---|---|---|---|---|---|
fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str
methodsos
os.path
read_excel()
, concat()
, .fillna()
, .str
, plotting methodsAsk questions and search the documentation/web to find the functions you need.
#Import modules
import os
import pandas as pd
# Import modules for plots
import matplotlib.pyplot as plt
import seaborn as sns
#Set directory
directory = 'C:\\Users\\Niveditha_2\\Documents\\STA141B\\assignment3\\fruit'
files=[] #Initialize a list
for filename in os.listdir(directory):
if filename.endswith(".xlsx"): #pull all the Excel files
paths = (os.path.join(directory, filename))
files.append(paths) #Put the filepaths in the list
df = pd.DataFrame() #Initialize a dataframe
for i in files:
if "pineapple" in i: #Because pineapple has 9 columns if you look at its dimensions
continue
path = i.split("\\") #Split the filepath by \\ so we can use its elements in the dataframe later
data = pd.read_excel(i, skiprows=[0]).dropna() #Ignore the first row for every spreadsheet
data['Type'] = path[-2] #Adds the second to last element of the filepath, in this case, "fruit"
data['Food'] = path[-1][:-5] #Adds the last element of the filepath, minus the ".xlsx"
#print data.axes
newRow = data[data.Form == "Fresh1"]
df = df.append(data[data.Form == "Fresh1"])
fruit = df.drop(df.columns[[2,5]], axis=1) #Drop the columns with unnecessary information
fruit.columns=["Form", "Avg Price per Lb", "Yield", "Lb per Cup", "Avg Price per Cup", "Type", "Food"]
fruit = fruit.sort_index(axis=1) #Sort by column names to maintain the same order as the veg df below
fruit
Comments: You will notice that the above table includes just fruits tagged with "Fresh1" to provide exact matches. However, fruits such as blackberries and pineapple are excluded. This may be because the dataframe is only taking elements that have 9 columns. Pineapple, for example, has extraneous columns that get excluded in the analysis. Additionally, fruits might be missing because the above code only provides exact matches for "Fresh1", and the reason for this is that in cases like this, it is more important to have an accurate dataset than one that is comprehensive but may contain misinformation. We want to use an exact match here so that it returns consistent data throughout.
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
#Set directory
directory = 'C:\\Users\\Niveditha_2\\Documents\\STA141B\\assignment3\\vegetables'
files=[] #Initialize a list
for filename in os.listdir(directory):
if filename.endswith(".xlsx"): #pull all the Excel files
paths = (os.path.join(directory, filename))
files.append(paths) #Put the filepaths in the list
df=pd.DataFrame() #Initialize a dataframe
for i in files:
path = i.split("\\") #Split the filepath by \\ so we can use its elements in the dataframe later
data=pd.read_excel(i, skiprows=[0]).dropna() #Ignore the first row for every spreadsheet
data['Type'] = path[-2] #Adds the second to last element of the filepath, in this case, "fruit"
data['Food'] = path[-1][:-5] #Adds the last element of the filepath, minus the ".xlsx"
df = df.append(data[data.Form == "Fresh1"])
veg = df.drop(df.columns[[7, 8,9, 10]], axis=1)
veg.columns = ["Avg Price per Cup", "Avg Price per Lb", "Food", "Form", "Yield", "Lb per Cup", "Type"]
veg = veg.sort_index(axis=1)
veg
Comments: No, it doesn't produce any price for tomatoes. This is because the tomatoes' price is split into various categories: grape and cherry, roma, and beefsteak. The code similarly does not return values for vegetables that have different subcategories-- for example, spinach is split into boiled and raw, so there isn't one consolidated price value for it.
The code itself is designed to return exact matches for "Fresh1", so if something has a subcategory, it will not return it. This avoids errors and problems that may arise from data mismatches.
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
df3 = fruit.append(veg) #Append veg to fruit
df3 = df3.apply(lambda x: pd.to_numeric(x, errors = 'ignore')) #Make sure all columns with numbers are actually numeric
df3
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
#Most expensive foods per pound: compare "Avg Price per Lb" for fruits
exp = sns.factorplot("Food", "Avg Price per Lb", data = fruit, kind = "bar") #data=veg to see vegetable bar plot
exp.set_xticklabels(rotation=90) #Rotate axis labels so they don't overlap
exp.fig.suptitle('Fruit Prices in Dollars')
plt.show(exp)
Comments: Raspberries are the most expensive fruit and watermelons are the cheapest per pound.
#Distributions
dists=sns.boxplot("Avg Price per Lb", "Type", data = df3)
plt.show(dists)
Comments: Other than the existence of two outliers (raspberries and blueberries, as can be seen from the first plot), the fruit prices are less skewed. They also have a wider range than the vegetables, which have no outliers. The median for the vegetables group is slightly above \$2, and is higher than the median for fruits, which is a little higher than \'1.50.
#Best value for price
#Divide lb per cup by avg price per cup to get lb per price
df3['Value'] = df3['Avg Price per Cup']/df3['Yield']
df3=df3.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
#ascending=False for least value for money
vals=df3[:5:] #Only first five rows
valplot=sns.factorplot("Food", "Value", data = vals, kind = "bar") #Barplot
valplot.set_xticklabels(rotation=30) #Rotate axis labels for readability
valplot.fig.suptitle('Food Price (in Dollars) per Yield')
plt.show(valplot)
Comments: The five foods with the overall best value for money, by comparing price per cup to yield, are potatoes, iceberg lettuce, watermelon, radish, and onions. If we want to find out the five fruits and five vegetables with the best value for money, we can do that as well, and this is pictured below. The two graphs below show us that the fruits with the most value for money (in order of most to least valuable) are watermelon, bananas, apples, peaches, and pears. The vegetables results are potatoes, iceberg lettuce, radish, onions, and green peppers.
#Top 5 for fruit
fruit['Value'] = fruit['Avg Price per Cup']/fruit['Yield']
fruit = fruit.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
vals = fruit[:5:]
valplot = sns.factorplot("Food", "Value", data = vals, kind = "bar")
valplot.set_xticklabels(rotation=30)
valplot.fig.suptitle('Fruit Price (in Dollars) per Yield')
plt.show(valplot)
#Top 5 for veg
veg['Value'] = veg['Avg Price per Cup']/veg['Yield']
veg = veg.sort_values(by = "Value", axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')
vals = veg[:5:]
valplot=sns.factorplot("Food", "Value", data = vals, kind = "bar")
valplot.set_xticklabels(rotation=30)
valplot.fig.suptitle('Vegetable Price (in Dollars) per Yield')
plt.show(valplot)
Comments: I'm suprised by the fact that fruits similar to watermelon, such as canteloupe and honeydew, have less value for money than apples, peaches, and pears, especially because they're also some of the cheapest fruits in the fruit group (this can be verified visually by looking at the first barplot). Additionally, I would expect strawberries to be as expensive, if not more, than raspberries and blueberries, which is not the case. Even after ranking the fruits to see which one has the least value for money, strawberries does not make that list, which means it is not as relatively expensive (i.e. least valuable) as some of the other fruit. Also, when the fruits and vegetables together are ranked from least to greatest value from money (which can be done by changing ascending=False
in the code above, the list results are artichoke, asparagus, sweet corn, acorn squash, and raspberries. This is interesting because out of the five most relatively expensive items, four of them are vegetables and only one of them is a fruit, even though the fruit price distribution shows that there are two outliers with two fruits being extremely expensive per pound, as well as fruit having a higher price range than vegetables.
Comments: Since the spreadsheets have not recorded information on nutrition, based on intuition, I would assume that all the fruit (maybe not watermelon, since it has the highest water content as compared to actual nutrition value), as well as radishes and green peppers, would be the most beneficial in terms of price, yield, and nutrition. I came to that conclusion by looking at the highest value for money for both fruits and vegetables, and then decided that most fruits are heralded to be high in nutrition, and potatoes, iceberg lettuce, and onions would not necessarily provide nutritional value because potatoes are high in starch content, iceberg lettuce (like watermelon) has a high water content and not much else, and I've never really heard of onions providing high nutritional value.
Notes: Some input from Ricky Safran, Chad Pickering, and Hannah Kosinovsky, even though I'm pretty sure it was just moral support this time around. Everything else is courtesy of various Stackoverflow threads.