“Correlation analysis deals with the association between two or more variables.” —Simpson and Kafka
Correlation is a statistical technique used to show whether, and how strongly, variables are related to each other. The correlation coefficient is a single number that describes the degree of relationship between two variables; for example, it can tell us whether height and weight are related. There are different types of correlation.
The types of correlation covered here are:
1. Karl Pearson’s parametric correlation coefficient
2. Kendall tau rank correlation coefficient
3. Spearman’s rank correlation coefficient
4. Partial Correlation
Karl Pearson’s correlation coefficient
The assumptions of the Pearson correlation coefficient r are:
1. The variables should be normally distributed.
2. The relationship between the variables should be linear in form.
3. The data should be homoscedastic (roughly constant spread around the line).
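As a rough sketch (using hypothetical vectors x1 and y1, not the dataset analysed later in this article), these assumptions can be eyeballed in R with a normality test and a scatter plot:
x1 <- rnorm(100)
y1 <- 2 * x1 + rnorm(100)
shapiro.test(x1)   # normality check for x1 (a large p-value suggests no clear departure from normality)
shapiro.test(y1)   # normality check for y1
plot(x1, y1)       # the cloud of points should look roughly linear with a fairly constant spread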
The formula for the correlation coefficient is:
r = ∑(x − mx)(y − my) / √( ∑(x − mx)² ∑(y − my)² )
Where,
r = the Pearson correlation coefficient
x = first variable
y = second variable
mx = mean of the x variable
my = mean of the y variable
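As a minimal illustration (with hypothetical vectors x1 and y1, not part of the dataset used later), r can be computed directly from this formula and compared with R’s built-in cor() function:
x1 <- c(2, 4, 6, 8, 10)
y1 <- c(1, 3, 7, 9, 15)
mx <- mean(x1)
my <- mean(y1)
# r = sum((x - mx)(y - my)) / sqrt(sum((x - mx)^2) * sum((y - my)^2))
r_manual <- sum((x1 - mx) * (y1 - my)) / sqrt(sum((x1 - mx)^2) * sum((y1 - my)^2))
r_manual
cor(x1, y1, method = "pearson")   # should match r_manual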
Kendall tau rank correlation coefficient
Kendall rank correlation is a non-parametric test that measures the strength of dependence between two variables. If we consider two samples, a and b, each of size n, the total number of pairings of a with b is n(n-1)/2. The following formula is used to calculate the value of the Kendall rank correlation:
τ = 2(nc − nd) / (n(n − 1))
Where,
nc = number of concordant pairs
nd = number of discordant pairs
n = sample size (the size of a and b)
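A minimal sketch (with hypothetical vectors a and b, no ties) of how the concordant and discordant pairs enter this formula, checked against R’s built-in cor():
a <- c(1, 2, 3, 4, 5)
b <- c(2, 1, 4, 3, 5)
n <- length(a)
nc <- 0
nd <- 0
for (i in 1:(n - 1)) {
  for (j in (i + 1):n) {
    s <- sign(a[j] - a[i]) * sign(b[j] - b[i])
    if (s > 0) nc <- nc + 1   # concordant pair: a and b move in the same direction
    if (s < 0) nd <- nd + 1   # discordant pair: a and b move in opposite directions
  }
}
tau_manual <- 2 * (nc - nd) / (n * (n - 1))
tau_manual
cor(a, b, method = "kendall")   # should match tau_manual when there are no ties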
Spearman’s rank correlation coefficient
Spearman rank correlation is a non-parametric test that is used to measure the degree of association between two variables. It was developed by Spearman, and thus it is called the Spearman rank correlation. The Spearman rank correlation test makes no assumptions about the distribution of the data and is the appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
There are two methods to calculate Spearman’s correlation, depending on whether (1) your data has no tied ranks or (2) your data has tied ranks. The formula for when there are no tied ranks is:
ρ = 1 − 6 ∑di² / (n(n² − 1))
Where di = the difference between the ranks of each pair of observations and n = the number of observations.
The formula to use when there are tied ranks is:
ρ = ∑(x′ − mx′)(y′ − my′) / √( ∑(x′ − mx′)² ∑(y′ − my′)² )
Where,
x′ = rank of x
y′ = rank of y
mx′ = mean of x′
my′ = mean of y′
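A minimal sketch (hypothetical vectors with no tied ranks) of the di formula, checked against R’s built-in cor():
x1 <- c(10, 20, 30, 40, 50)
y1 <- c(12, 25, 22, 41, 60)
d  <- rank(x1) - rank(y1)   # di = difference between the ranks of each pair
n  <- length(x1)
rho_manual <- 1 - 6 * sum(d^2) / (n * (n^2 - 1))
rho_manual
cor(x1, y1, method = "spearman")   # should match rho_manual when there are no ties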
Partial Correlation
Partial correlation measures the association between two variables while controlling or adjusting for the effect of one or more additional variables. Partial correlations can be used in many situations where a relationship is assessed, for example, whether the sales value of a particular item is related to the expenditure on advertising when the effect of price is controlled.
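As a minimal sketch (using simulated sales, advertising and price data, not the dataset analysed below), a partial correlation can be obtained in base R by correlating the residuals of each variable after regressing out the control variable; the ppcor package also offers pcor.test() for the same purpose:
set.seed(1)
price       <- runif(100, 5, 20)
advertising <- runif(100, 1, 10)
sales       <- 3 * advertising - 2 * price + rnorm(100)
# residuals of each variable after removing the linear effect of price
res_sales <- resid(lm(sales ~ price))
res_adv   <- resid(lm(advertising ~ price))
cor(res_sales, res_adv)   # partial correlation of sales and advertising, controlling for price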
We are going to use the Auto MPG dataset. The dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University, and was used in the 1983 American Statistical Association Exposition.
It contains the following variables:
1. mpg: continuous
2. cylinders: multi-valued discrete
3. displacement: continuous
4. horsepower: continuous
5. weight: continuous
6. acceleration: continuous
7. model year: multi-valued discrete
8. origin: multi-valued discrete
9. car name: string (unique for each instance)
We create an object to store the link to the dataset.
x <- "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"
The dataset can be imported by using the read.table() function.
data1 <- read.table(x, dec = ",", header = FALSE)
We assign names to the columns of the dataset.
names(data1) <- c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration",
                  "model year", "origin", "car name")
We check the structure of the dataset as:
str(data1)
We want to convert the mpg, displacement, horsepower, weight and acceleration variables to the numeric data type.
We check the data types of all variables and store the positions of the variables with the factor data type.
w <- which(sapply(data1, class) == "factor")
We do not want to convert the "car name" variable to numeric, as it stores character data, so we remove its position (the last element) from w.
w<-w[-6]
We use the lapply() function to change the factor data type to the numeric data type:
data1[w] <- lapply( data1[w], function(x) as.numeric(as.character(x)) )
We change the "car name" variable to the character data type by using the as.character() function.
data1$`car name` <- as.character(data1$`car name`)
We check the structure of the data1 object.
str(data1)
We check the top six observations of data1.
head(data1)
We can visualize the data using scatter plots.
We install the "ggpubr" package.
install.packages("ggpubr")
We load the "ggpubr" package.
library("ggpubr")
We use the ggscatter() function to draw a scatter plot with mpg on the x-axis and weight on the y-axis. The argument add = "reg.line" adds a regression line to the plot, cor.coef = TRUE displays the correlation coefficient in the plot, and cor.method = "pearson" computes the Pearson correlation coefficient.
ggscatter(data1, x = "mpg", y = "weight",
          add = "reg.line",
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Miles/(US) gallon", ylab = "Weight (lbs)")
The correlation coefficient r is -0.83 and the p-value is less than 2.2e-16.
We can draw a Q-Q plot of the mpg variable of the dataset:
ggqqplot(data1$mpg, ylab = "MPG")
We can draw a Q-Q plot of the weight variable as:
ggqqplot(data1$weight, ylab = "WT")
We want to find the Pearson correlation between mpg and weight in data1. We use the cor.test() function to find the correlation between the variables.
res <- cor.test(data1$weight, data1$mpg, method = "pearson")
In the result above:
- t is the t-test statistic value (t = -29.814),
- df is the degrees of freedom (df = 396),
- p-value is the significance level of the t-test (p-value < 2.2e-16),
- conf.int is the 95% confidence interval of the correlation coefficient (conf.int = [-0.85974, -0.798747]),
- sample estimates is the correlation coefficient (Cor.coeff = -0.83).
Interpretation of the result
The p-value of the test is less than 2.2e-16, which is below the significance level alpha = 0.05. We can conclude that weight and mpg are significantly correlated, with a correlation coefficient of -0.83 and a p-value of less than 2.2e-16.
The function cor.test() returns a list containing the following components:
- p.value: the p-value of the test
- estimate: the correlation coefficient
We can extract the p-value of the correlation analysis as:
res$p.value
We can extract the correlation coefficient as:
res$estimate
Kendall rank correlation test
The Kendall rank correlation coefficient or Kendall’s tau statistic is used to estimate a rank-based measure of association. This test may be used if the data do not necessarily come from a bivariate normal distribution.
res2 <- cor.test(data1$weight, data1$mpg, method = "kendall")
tau is the Kendall correlation coefficient.
The correlation coefficient between the two variables is -0.6940 and the p-value is less than 2.2e-16.
Spearman rank correlation coefficient
Spearman’s rho statistic is also used to estimate a rank-based measure of association. This test may be used if the data do not come from a bivariate normal distribution.
res2 <- cor.test(data1$weight, data1$mpg, method = "spearman")
rho is the Spearman’s correlation coefficient.
The correlation coefficient between the two variables is -0.8749 and the p-value is less than 2.2e-16.
Interpret correlation coefficient
The correlation coefficient lies between -1 and 1:
- -1 indicates a strong negative correlation: this means that as x increases, y decreases.
- 0 means that there is no association between the two variables (x and y).
- 1 indicates a strong positive correlation: this means that y increases as x increases.
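A minimal sketch (with simulated data, not the Auto MPG dataset) illustrating these three cases:
set.seed(123)
x1 <- rnorm(200)
cor(x1, -x1 + rnorm(200, sd = 0.1))   # close to -1: strong negative correlation
cor(x1, rnorm(200))                   # close to 0: no association
cor(x1, x1 + rnorm(200, sd = 0.1))    # close to +1: strong positive correlation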