+++
title = 'Relationships between variables'
template = 'page-math.html'
+++

# Relationships between variables

a relationship can be investigated, causality can't.
graphically, you can use scatterplots:

![](6de852d30c13f092f1d0954f4d21c2c6.png)

## Correlation

correlation: the values of two variables are somehow associated with each other

- positive: higher values of variable 1 are usually associated with higher values of variable 2
- negative: higher values of variable 1 are usually associated with lower values of variable 2

the correlation is linear if the plotted points lie roughly on a straight line.
the population linear correlation coefficient is ρ.
the sample linear correlation coefficient (estimator for ρ) is:

$r = \frac{1}{n-1} \times \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})(y_{i} - \bar{y})}{s_{x} s_{y}}$

interpreting r:

- r = 1: perfect positive linear relationship
- r > 0: positive linear relationship
- r ≈ 0: no linear relationship (doesn't mean no relationship!)
- r < 0: negative linear relationship
- r = −1: perfect negative linear relationship

### Testing ρ = 0

test statistic:

$T_{\rho} = \frac{R - \rho}{\sqrt{\frac{1 - R^{2}}{n-2}}}$

has, under H0: ρ = 0, a t-distribution with n−2 degrees of freedom.

## Regression

if there's a correlation, the points can be described by the line
$y_{i} = \beta_{0} + \beta_{1} x_{i} + \text{error}_{i}$

the regression equation is $\hat{y} = b_{0} + b_{1} x$

where b₀ and b₁ are least-squares estimates of β₀ and β₁.

you want values that satisfy the least-squares property (i.e. minimise $\sum_{i} (\text{observed} - \text{model})^{2}$)

$\begin{aligned}
b_{1} &= r \frac{s_{y}}{s_{x}} &&\text{(the slope)} \\\\
b_{0} &= \bar{y} - b_{1} \bar{x} &&\text{(the y intercept)}
\end{aligned}$

### Testing linearity

Test:
- H0: β1 = 0
- HA: β1 ≠ 0

The score is:

$t_{\beta} = \frac{b_{1}}{s_{b_{1}}}$

(a realisation of the test statistic $T_{\beta}$, which has a t-distribution with n−2 degrees of freedom under H₀)

### Coefficient of determination

The coefficient of determination is the proportion of the variation in the y variable that the regression equation can explain:

$r^{2} = \frac{\text{explained variation}}{\text{total variation}}$

### Residuals

To check for a fixed standard deviation, make a residual plot.
Residuals are estimates for the errors.

residual: the difference between the observed $y_{i}$ and the predicted value $\hat{y}_{i} = b_{0} + b_{1} x_{i}$

$\text{residual}_{i} = y_{i} - \hat{y}_{i} = y_{i} - (b_{0} + b_{1} x_{i})$

A residual plot is a scatterplot of the residuals against the x values. There should be no obvious pattern in the residuals.

![](4670b5bf474343b006017ea93ea64fdb.png)
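
## Worked example

As a quick check of the formulas above, here is a minimal sketch in Python (assuming numpy and scipy are available; the dataset is made up purely for illustration). It computes the sample correlation r, the least-squares slope and intercept, the residuals, r², and the t-statistic for testing β1 = 0 with n−2 degrees of freedom.

```python
# Worked example: sample correlation, least-squares line, residuals, and the slope t-test.
# Minimal sketch (not part of the original notes); values are illustrative.
import numpy as np
from scipy import stats

# small made-up dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

# sample correlation coefficient r (estimator for rho)
s_x, s_y = x.std(ddof=1), y.std(ddof=1)
r = np.sum((x - x.mean()) * (y - y.mean())) / ((n - 1) * s_x * s_y)

# least-squares estimates: slope b1 and intercept b0
b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()

# residuals and coefficient of determination
y_hat = b0 + b1 * x
residuals = y - y_hat
ss_res = np.sum(residuals ** 2)        # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
r_squared = 1 - ss_res / ss_tot

# t-test of H0: beta1 = 0 (equivalently rho = 0), df = n - 2
s_b1 = np.sqrt(ss_res / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = b1 / s_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.3f}, b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"r^2 = {r_squared:.3f}, t = {t_stat:.3f}, p = {p_value:.4f}")
```

Note that the slope t-statistic here equals $r\sqrt{(n-2)/(1-r^{2})}$, so testing β1 = 0 and testing ρ = 0 give the same result.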