---
title: "Analyzing golf ball carry distance"
author: "Ved Piyush"
date: "10/12/2020"
output:
  pdf_document: default
  html_document: default
---

```{r setup, include=FALSE}
options(continue = " ")
knitr::opts_chunk$set(prompt = TRUE, comment = NA, background = "white",
tidy = TRUE, tidy.opts = list(width.cutoff = 60, blank = FALSE))
```

We will use the concepts learned from Sections 2, 5, and 7 to analyze golf ball carry distances. Our ultimate goal is to determine if a less expensive golf ball will perform similarly to a more expensive golf ball. Carry distance is the distance that the golf ball travels in the air after being hit by the golfer. Golfers would prefer a golf ball that covers more distance on average. Comparing the average carry distance across golf ball types can help a golfer make an informed decision as to which ball they should use.  

The data in this study was collected by Mark Crossfield, who is a golf instructor. He used a launch monitor to determine carry distance. Let's first look at a plot comparing the carry distances for the five different golf balls used by Mark.

**Question** - If we want to compare the mean carry distance for two golf balls, then do we have independent samples or dependent samples?  


```{r}
# read the csv file comprising the carry distance data
# make sure you have this file in your working directory first
carry_distance <- read.csv("Carry Distance.csv")

# let's look at the data
head(carry_distance)

# let's make side by side boxplots for the carry distances
par(mfrow = c(1,1))

# col = NA removes gray background in box
boxplot(formula = Carry.Distance ~ Ball, data = carry_distance, main = "Box and dot plot", ylab = "Carry distance (yards)",xlab = "Ball type", pars = list(outpch=NA), col = NA) 

stripchart(x = carry_distance$Carry.Distance ~ carry_distance$Ball, lwd = 2, col = "red",method = "jitter", vertical = TRUE, pch = 1, add = TRUE)
```
**Question** - Do you think there are differences among the means or variances? Remember a plot like this is meant to give an initial impression of the data relative to the research hypotheses. Focus on shifts of points in addition to the variability seen in the points and the location of the medians. 

We will compare the average carry distances of the Soft Feel golf balls to the Z Star golf balls. The Soft Feel golf balls cost around 20 dollars while the Z Star golf balls cost around 40 dollars. This comparison is interesting for a golfer because if there is no significant difference between the mean carry distance of these two balls, then the golfer can purchase the Soft Feel balls which are 20 dollars cheaper. 

Let's first construct the confidence interval for $\mu_1 - \mu_2$, where $\mu_1$ represents the population mean carry distance for the Soft Feel ball and $\mu_2$ represents the population mean carry distance for the Z Star ball. We will use $\alpha = 0.05$. 

We know that the $(1-\alpha)100\%$ CI for $\mu_1 - \mu_2$ with $\sigma_{1}^2$ and $\sigma_{2}^2$ unknown and possibly unequal is given by $$\bar{y}_1 - \bar{y}_2 - t_{\frac{\alpha}{2},v}\sqrt{\frac{s_{1}^2}{n_1}+\frac{s_{2}^2}{n_2}} < \mu_1 - \mu_2 < \bar{y}_1 - \bar{y}_2 + t_{\frac{\alpha}{2},v}\sqrt{\frac{s_{1}^2}{n_1}+\frac{s_{2}^2}{n_2}}$$ where $t_{\frac{\alpha}{2},v}$ is the $1-\frac{\alpha}{2}$ quantile from a t distribution with $$v = \frac{\left(\frac{s_{1}^2}{n_1}+\frac{s_{2}^2}{n_2}\right)^{2}}{\frac{\left(\frac{s_{1}^2}{n_1}\right)^{2}}{n_1 -1} + \frac{\left(\frac{s_{2}^2}{n_2}\right)^{2}}{n_2 -1}}$$ degrees of freedom. 

**Question** - Why does our assumption of unequal $\sigma_{1}^2$ and $\sigma_{2}^2$ make sense? What visual evidence can we present to justify this assumption? 

```{r}
alpha <- 0.05
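# Let 1 = Soft Feel and 2 = Z Star
# aggregate() orders its rows by ball name; the indices 3 and 4 used below
# assume these rows are Soft Feel and Z Star (print save.mean to confirm)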

save.mean <- aggregate(formula = Carry.Distance ~ Ball, data = carry_distance, FUN = mean)

ybar1 <- save.mean$Carry.Distance[3]
ybar2 <- save.mean$Carry.Distance[4]

save.var <- aggregate(formula = Carry.Distance ~ Ball, data = carry_distance, FUN = var)

s.sq1 <- save.var$Carry.Distance[3]
s.sq2 <- save.var$Carry.Distance[4]

save.n <- aggregate(formula = Carry.Distance ~ Ball, data = carry_distance, FUN = length)

n1 <- save.n$Carry.Distance[3]
n2 <- save.n$Carry.Distance[4]

#Variance unequal
nu <- (s.sq1/n1 + s.sq2/n2)^2 / ( (s.sq1/n1)^2 / (n1 - 1) +  (s.sq2/n2)^2 / (n2 - 1)) 

data.frame(ybar1, ybar2, s.sq1, s.sq2, n1, n2, nu, t.quant = qt(p = 1 - alpha/2, df = nu))

lower <- ybar1 - ybar2 - qt(p = 1 - alpha/2, df = nu) * sqrt(s.sq1/n1 + s.sq2/n2) 

upper <- ybar1 - ybar2 + qt(p = 1 - alpha/2, df = nu) * 
    sqrt(s.sq1/n1 + s.sq2/n2) 

data.frame(lower, upper)

# easier way to verify the calculations
t.test(formula = Carry.Distance ~ Ball, data = carry_distance[carry_distance$Ball %in% c("Soft Feel", "Z Star"),],var.equal = FALSE, conf.level = 0.95) 
```
From our calculations, we find that the $95 \%$ CI for $\mu_1 - \mu_2$ is (-6.03, 8.88). Since 0 is inside this interval, we do not have sufficient evidence to reject the null hypothesis. Therefore, we do not have sufficient evidence to indicate a difference between the mean carry distances for the two golf balls. 

**Question** - Interpret the confidence interval calculated above.  

**Question** - How can your conclusion above help a golfer save money? 

**Question** - Using the confidence interval calculated above, conduct the following hypothesis test and interpret the results. $$\begin{aligned}H_{0}: \mu_1 - \mu_2 = 0 \\ H_{a}: \mu_1 - \mu_2 \neq 0\end{aligned}$$ 

**Exploration** - Please conduct a similar analysis for the golf balls Q Star and Z Star XV. Interpret your results in context of the golf problem. 
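
To repeat the calculation for another pair of balls, the steps used above can be wrapped in a small function. The following is only a sketch: the function name `welch_ci` and the toy vectors `y1` and `y2` are made up for illustration and are not part of the golf data. The last lines compare the hand calculation with the interval reported by `t.test()`.

```{r}
# Hypothetical helper (not part of the original analysis):
# CI for mu1 - mu2 with variances unknown and possibly unequal
welch_ci <- function(y1, y2, alpha = 0.05) {
  n1 <- length(y1)
  n2 <- length(y2)
  # Estimated variance of each sample mean
  v1 <- var(y1)/n1
  v2 <- var(y2)/n2
  # Approximate degrees of freedom, same formula as nu above
  nu <- (v1 + v2)^2 / (v1^2/(n1 - 1) + v2^2/(n2 - 1))
  margin <- qt(p = 1 - alpha/2, df = nu) * sqrt(v1 + v2)
  est <- mean(y1) - mean(y2)
  c(lower = est - margin, upper = est + margin, nu = nu)
}

# Toy numbers for illustration only (NOT the golf data)
y1 <- c(201, 195, 210, 198, 205, 199)
y2 <- c(203, 190, 215, 207, 196, 211, 200)

my.ci <- welch_ci(y1 = y1, y2 = y2)
my.ci

# The lower and upper limits should match t.test()
check <- t.test(x = y1, y = y2, var.equal = FALSE, conf.level = 0.95)
all.equal(unname(my.ci[c("lower", "upper")]), as.numeric(check$conf.int))
```

For the golf data, `y1` and `y2` would be replaced with the carry distances of the two balls of interest, for example `carry_distance$Carry.Distance[carry_distance$Ball == "Q Star"]`.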

**Question** - How many possible pairwise comparisons can be made for the golf ball data? Suppose you are conducting a hypothesis test for each possible pairwise comparison and using a level of $\alpha$ for the type I error rate in each test. How does the probability of making at least one type I error for all multiple comparisons compare with $\alpha$? 

**Question** - Apart from the inference for the difference in means, why might the inferences for the ratio of variances for the carry distance also make sense? 

Now, we will do the inference for the ratio of variances. 

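We know that a $(1-\alpha)100\%$ CI for the ratio of variances is given by $$\frac{s_{1}^2}{s_{2}^2}F_{1-\frac{\alpha}{2},v_2,v_1} < \frac{\sigma_{1}^2}{\sigma_{2}^2} < \frac{s_{1}^2}{s_{2}^2}F_{\frac{\alpha}{2},v_2,v_1}$$ where $v_1 = n_1 - 1$ and $v_2 = n_2 - 1$. As with the t distribution earlier, the subscript gives the area to the right, so $F_{\frac{\alpha}{2},v_2,v_1}$ is the $1-\frac{\alpha}{2}$ quantile from an F distribution with $v_2$ numerator and $v_1$ denominator degrees of freedom. 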
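
The order of the degrees of freedom ($v_2$ first, then $v_1$) can be seen from a short derivation. This sketch assumes independent random samples from two normal populations and uses the same convention as for $t_{\frac{\alpha}{2},v}$, where the first subscript is the area to the right of the quantile. Under these assumptions, $\frac{S_{2}^2/\sigma_{2}^2}{S_{1}^2/\sigma_{1}^2}$ has an F distribution with $v_2$ numerator and $v_1$ denominator degrees of freedom, so $$\begin{aligned} 1-\alpha &= P\left(F_{1-\frac{\alpha}{2},v_2,v_1} < \frac{S_{2}^2}{S_{1}^2}\cdot\frac{\sigma_{1}^2}{\sigma_{2}^2} < F_{\frac{\alpha}{2},v_2,v_1}\right) \\ &= P\left(\frac{S_{1}^2}{S_{2}^2}F_{1-\frac{\alpha}{2},v_2,v_1} < \frac{\sigma_{1}^2}{\sigma_{2}^2} < \frac{S_{1}^2}{S_{2}^2}F_{\frac{\alpha}{2},v_2,v_1}\right) \end{aligned}$$ where the second line multiplies every part of the inequality by $\frac{S_{1}^2}{S_{2}^2}$. Replacing $S_{1}^2$ and $S_{2}^2$ with the observed $s_{1}^2$ and $s_{2}^2$ gives the interval.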

```{r}
alpha <- 0.05

data.frame(n1, n2)

qf(p = alpha/2, df1 = n2 - 1, df2 = n1 - 1)

qf(p = 1 - alpha/2, df1 = n2 - 1, df2 = n1 - 1)

lower <- s.sq1/s.sq2 * qf(p = alpha/2, df1 = n2 - 1, df2 = n1 - 1)

upper <- s.sq1/s.sq2 * qf(p = 1 - alpha/2, df1 = n2 - 1, df2 = n1 - 1)

data.frame(lower, upper)

# easier way to verify the calculations
var.test(x = carry_distance[carry_distance$Ball == "Soft Feel",]$Carry.Distance, y = carry_distance[carry_distance$Ball == "Z Star",]$Carry.Distance,conf.level = 0.95)
```

From our calculations, we find that the $95 \%$ CI for $\frac{\sigma_{1}^2}{\sigma_{2}^2}$ is (0.08, 1.61). Since 1 is inside this interval, we do not have sufficient evidence to reject the null hypothesis. Therefore we do not have sufficient evidence to indicate that the ratio of variances of carry distance for the two golf balls is different from 1.

**Question** - Interpret the confidence interval calculated above. 

**Question** - Using the confidence interval calculated above, conduct the following hypothesis test and interpret the results. $$\begin{aligned}H_{0}: \frac{\sigma_{1}^2}{\sigma_{2}^2} = 1 \\ H_{a}: \frac{\sigma_{1}^2}{\sigma_{2}^2} \neq 1\end{aligned}$$ 

**Exploration** - Please conduct a similar analysis for the golf balls Q Star and Z Star XV. Interpret your results in context of the golf problem.
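
As with the difference in means, the ratio of variances calculation can be wrapped in a function so that it can be reused for other pairs of balls. This is a minimal sketch: the function name `var_ratio_ci` and the toy vectors are hypothetical and are not the golf data. The result is compared with the interval from `var.test()`.

```{r}
# Hypothetical helper (not part of the original analysis):
# CI for sigma1^2 / sigma2^2 using F distribution quantiles
var_ratio_ci <- function(y1, y2, alpha = 0.05) {
  nu1 <- length(y1) - 1
  nu2 <- length(y2) - 1
  ratio <- var(y1)/var(y2)
  # Note the order of the degrees of freedom: nu2 first, then nu1
  c(lower = ratio * qf(p = alpha/2, df1 = nu2, df2 = nu1),
    upper = ratio * qf(p = 1 - alpha/2, df1 = nu2, df2 = nu1))
}

# Toy numbers for illustration only (NOT the golf data)
y1 <- c(201, 195, 210, 198, 205, 199)
y2 <- c(203, 190, 215, 207, 196, 211, 200)

my.var.ci <- var_ratio_ci(y1 = y1, y2 = y2)
my.var.ci

# The limits should match var.test()
check.var <- var.test(x = y1, y = y2, conf.level = 0.95)
all.equal(unname(my.var.ci), as.numeric(check.var$conf.int))
```

For the golf data, `y1` and `y2` would again be the carry distances for the two balls being compared.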