Chi-squared Test of Independence in R Programming

Chi-squared Test of Independence

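The chi-squared test of independence is a non-parametric test that determines whether there is a significant association between two categorical variables. It compares the observed frequency of each combination of categories with the frequency we would expect if the two variables were independent.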

The assumptions of the chi-square test are:

1. The data in the cells are frequencies (counts of cases), not percentages or other transformed values.

2. The levels of each variable are mutually exclusive.

3. Each subject contributes data to one and only one cell.

4. The groups are independent: observations in one group are not paired with or otherwise related to observations in another group.

5. Both variables are categorical, or have been converted to categorical form (for example, by binning a numeric variable).

6. The sample data are displayed in a contingency table, and the expected frequency count for each cell of the table is at least 5.
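
The last assumption can be checked in code before trusting the test. The following is a minimal sketch, assuming a small made-up table (the counts and labels are hypothetical, not the housetasks data). It falls back to a simulated p-value when an expected count is below 5, which is one common option rather than the only one.

```r
# Hypothetical 2 x 3 table of counts (made-up numbers for illustration)
toy <- matrix(c(12, 7, 9,
                 3, 8, 11),
              nrow = 2, byrow = TRUE,
              dimnames = list(group = c("A", "B"),
                              answer = c("yes", "no", "unsure")))

# Expected counts under independence, as computed by chisq.test()
expected <- suppressWarnings(chisq.test(toy))$expected

# Rule of thumb: every expected count should be at least 5
all_ok <- all(expected >= 5)

# If the rule fails, use a Monte Carlo p-value instead of the
# chi-square approximation
if (all_ok) {
  result <- chisq.test(toy)
} else {
  result <- chisq.test(toy, simulate.p.value = TRUE, B = 2000)
}

print(round(expected, 2))
print(result$p.value)
```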

Expected frequencies:

The expected frequency is the count we would expect in a cell if the two variables were independent. It is calculated for each cell of the contingency table as:

$$E = \frac{n_r \times n_c}{n}$$

Where

$E$ – represents the expected value of the cell in row r and column c,

$n_r$ – represents the total number of sample observations in row r,

$n_c$ – represents the total number of sample observations in column c,

$n$ – represents the total sample size.
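
As a small worked example of the formula, the sketch below computes the expected counts for a made-up 2 x 2 table (the numbers and the names r1, r2, c1, c2 are hypothetical). The outer product of the row totals and column totals, divided by the sample size, gives every cell at once.

```r
# Hypothetical 2 x 2 table of observed counts
toy <- matrix(c(20, 30,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("r1", "r2"), c("c1", "c2")))

n_r <- rowSums(toy)   # row totals: 50, 50
n_c <- colSums(toy)   # column totals: 30, 70
n   <- sum(toy)       # total sample size: 100

# E = n_r * n_c / n for every combination of row and column
expected <- outer(n_r, n_c) / n
print(expected)
```

For the cell in row r1 and column c1 this gives 50 × 30 / 100 = 15.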

Test statistic:

The test statistic of the chi-square test is:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where

$O$ – the observed value of a cell

$E$ – the expected value of that cell

$\chi^2$ – the chi-square value

$\sum$ – the sum over all cells of the table
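
To make the formula concrete, here is a minimal sketch that computes the statistic by hand on a made-up 2 x 2 table (hypothetical counts) and compares it with chisq.test(). Note that for 2 x 2 tables R applies the Yates continuity correction by default, so the comparison uses correct = FALSE.

```r
# Hypothetical observed counts
observed <- matrix(c(20, 30,
                     10, 40),
                   nrow = 2, byrow = TRUE)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Sum of (O - E)^2 / E over all cells
chi_sq <- sum((observed - expected)^2 / expected)

# Built-in value without the continuity correction, for comparison
builtin <- unname(chisq.test(observed, correct = FALSE)$statistic)

print(chi_sq)
print(builtin)
```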

Null hypothesis: there is no association between the two variables.

Alternative hypothesis: there is an association between the two variables.

At the conventional 5% significance level, if the p-value is greater than 0.05 we fail to reject the null hypothesis: the data do not provide enough evidence of an association. If the p-value is less than 0.05 we reject the null hypothesis in favor of the alternative hypothesis.

Degrees of freedom:

For a contingency table, the degrees of freedom are the number of cell counts that are free to vary once the row totals and column totals are fixed.

The degrees of freedom, df = (number of rows − 1) × (number of columns − 1)
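
The degrees of freedom and the p-value decision can be tied together in a short sketch. It assumes a made-up 2 x 3 table and a significance level of 0.05 (both hypothetical choices), computes df from the table dimensions, and recovers the p-value from the upper tail of the chi-square distribution with pchisq().

```r
# Hypothetical 2 x 3 table of observed counts
observed <- matrix(c(12, 7, 9,
                      3, 8, 11),
                   nrow = 2, byrow = TRUE)

# df = (rows - 1) * (columns - 1)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

test <- chisq.test(observed)

# p-value = probability of a chi-square value at least this large
p_manual <- pchisq(unname(test$statistic), df = df, lower.tail = FALSE)

# Decision at the assumed 5% significance level
alpha <- 0.05
decision <- if (p_manual < alpha) "reject H0" else "fail to reject H0"

print(df)
print(p_manual)
print(decision)
```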

We will use the housetasks data set from STHDA.

We store the online link to the data set in a variable.

file_path <- "http://www.sthda.com/sthda/RDoc/data/housetasks.txt"

We import the data set with the read.delim() function. The argument row.names = 1 tells R to use the first column as row names.

housetasks <- read.delim(file_path, row.names = 1)
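
To see what row.names = 1 does without downloading anything, the sketch below reads a tiny tab-separated string instead of the real file. The text, its "Task" header and its counts are made up for illustration and are not an extract of housetasks.txt.

```r
# Hypothetical tab-separated text standing in for a data file
toy_text <- "Task\tWife\tHusband\nLaundry\t10\t2\nRepairs\t1\t9\n"

# The first column (Task) becomes the row names instead of a data column
toy <- read.delim(text = toy_text, row.names = 1)

print(toy)
print(rownames(toy))
```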

We install the "gplots" library for visualization.

install.packages("gplots")

We load the "gplots" library using the following code:

library("gplots")

To plot the data we need it in table format. We first convert the data frame to a matrix with the as.matrix() function, and then convert that matrix to a table with the as.table() function.

dt <- as.table(as.matrix(housetasks))

We transpose the dt table with t(), so that its rows become columns and its columns become rows.

t(dt)
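
The conversion and the transpose can be tried on a small made-up data frame first. This is a sketch with hypothetical counts. It shows that the table keeps the row and column names, and that t() swaps the two dimensions.

```r
# Hypothetical 3 x 2 data frame of counts (tasks in rows, people in columns)
toy <- data.frame(Wife    = c(10, 1, 4),
                  Husband = c(2, 9, 5),
                  row.names = c("Laundry", "Repairs", "Shopping"))

# Data frame -> matrix -> table
dt_toy <- as.table(as.matrix(toy))

# Transpose: people in rows, tasks in columns
t_toy <- t(dt_toy)

print(dim(dt_toy))
print(dim(t_toy))
print(t_toy)
```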

We use balloonplot() to plot the data as dots. In this plot, the larger the value of a cell, the bigger its dot. We set label = FALSE to hide the cell values on the plot, and show.margins = FALSE to omit the row and column totals.

balloonplot(t(dt), main = "housetasks", xlab = "", ylab = "", label = FALSE, show.margins = FALSE)

We use the "graphics" library for further visualization. It ships with base R and is attached by default, so it does not need to be installed. The following call simply loads it explicitly:

library("graphics")

We use mosaicplot() to plot how the work is divided between Husband and Wife.

The argument shade = TRUE colors the cells of the graph.

The argument las = 2 draws the axis labels perpendicular to the axes, so the labels along the horizontal axis are vertical.

mosaicplot(dt, shade = TRUE, las = 2, main = "housetasks")

From this plot, we can see that the tasks Laundry, Main_meal, Dinner and Breakfast (the blue cells) are mainly done by the wife.

The chi-square test is run as:

chisq <- chisq.test(housetasks)

chisq

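Output:

R reports X-squared = 1944.5, df = 36 and p-value < 2.2e-16. That is, the chi-square value is 1944.5, the degrees of freedom are 36, and the p-value is less than 2.2e-16. Because the p-value is far below 0.05, we reject the null hypothesis and conclude that the task and the person who does it are associated.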

We can see the observed frequencies by using the following code:

chisq$observed

We can see the expected frequencies by using the following code:

round(chisq$expected,2)

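Pearson residual

The Pearson residual of a cell measures how far its observed count is from its expected count, scaled by the square root of the expected count. A large positive residual means the cell has more cases than expected under independence, and a large negative residual means it has fewer. The Pearson residual for a cell in a two-way table is:

$$r = \frac{O - E}{\sqrt{E}}$$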

We can calculate the residuals with the following code:

round(chisq$residuals, 3)
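
As a check on the residual formula, the sketch below recomputes the Pearson residuals by hand on a made-up 2 x 2 table (hypothetical counts) and compares them with the residuals component returned by chisq.test(). It also shows that the squared residuals add up to the chi-square statistic when no continuity correction is applied.

```r
# Hypothetical observed counts
observed <- matrix(c(20, 30,
                     10, 40),
                   nrow = 2, byrow = TRUE)

# correct = FALSE so the statistic is the plain sum of (O - E)^2 / E
test <- chisq.test(observed, correct = FALSE)

# r = (O - E) / sqrt(E) for every cell
manual <- (test$observed - test$expected) / sqrt(test$expected)

print(round(manual, 3))
print(sum(manual^2))
print(unname(test$statistic))
```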

The chi-square statistic is the sum of the contributions from each of the individual cells.

A cell's contribution is high when the difference between its observed and expected counts is large relative to its expected count. This happens either because the expected value is low or because the difference itself is large. If a variable has more than two levels, it can also be worth checking whether the distinction between one specific level and all the others combined is significant.

We can see the chi-square value as:

chisq$statistic

We can also find the contribution of each cell (each combination of row and column) to the chi-square statistic.

It is the ratio of the squared residual to the chi-square value, expressed here as a percentage.

contrib <- 100*chisq$residuals^2/chisq$statistic

round(contrib, 3)
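
Because the squared residuals add up to the chi-square statistic, the percentage contributions should add up to 100. The sketch below checks this on a made-up 2 x 3 table (hypothetical counts and labels) and locates the cell with the largest contribution.

```r
# Hypothetical 2 x 3 table of observed counts
observed <- matrix(c(12, 7, 9,
                      3, 8, 11),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   answer = c("yes", "no", "unsure")))

test <- chisq.test(observed)

# Percentage contribution of each cell to the chi-square statistic
contrib <- 100 * test$residuals^2 / test$statistic

# Row and column index of the cell that contributes most
top <- which(contrib == max(contrib), arr.ind = TRUE)

print(round(contrib, 3))
print(sum(contrib))
print(top)
```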
