# Using ggplot2 for Data Analytics in R on the Diamonds Data Set


We use the diamonds dataset, which ships with ggplot2, to explore ggplot() and qplot().

Load the ggplot2 package.

*library(ggplot2)*

View the diamonds dataset.

*View(diamonds)*

We inspect the structure of the diamonds dataset.

*str(diamonds)*


We check the first six observations of the diamonds dataset.

*head(diamonds)*

We check the summary of the variables in the diamonds dataset. It shows basic descriptive statistics for each variable.

*summary(diamonds)*

We check the dimensions of diamonds. It has 53,940 rows and 10 columns.

*dim(diamonds)*
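As a quick check of the figures above, the following sketch reads the dimensions and column names programmatically. It assumes ggplot2 is installed; the variable names `n_rows` and `n_cols` are illustrative only.

```r
library(ggplot2)  # provides the diamonds dataset

n_rows <- nrow(diamonds)  # number of observations
n_cols <- ncol(diamonds)  # number of variables

print(c(rows = n_rows, columns = n_cols))
print(names(diamonds))    # column names, including carat, cut, color and price
```

Storing the counts in variables makes them easy to reuse later, for example as the upper bound when sampling rows.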

We plot a histogram by passing the diamonds dataset to ggplot() and adding geom_histogram(). Aesthetic mappings, set with aes(), describe how variables in the data are mapped to visual properties (aesthetics) of geoms; here price is mapped to the x-axis. The binwidth argument sets the width of each bin, in this case 500 dollars.

*ggplot(data = diamonds) + geom_histogram(binwidth = 500, aes(x = price))*

We label the x-axis and y-axis with xlab() and ylab(), and add a title to the graph with ggtitle().

*ggplot(data = diamonds) + geom_histogram(binwidth = 500, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$") + ylab("Frequency")*

We add theme_minimal() to draw the graph with a minimal theme on a white background.

*ggplot(data = diamonds) + geom_histogram(binwidth = 500, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$") + ylab("Frequency") + theme_minimal()*
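One possible variation, offered as a sketch and not as the only way to write it: the mapping can be placed in ggplot() itself, and the title and axis labels can be set together with labs(). The object name `p_hist` is hypothetical.

```r
library(ggplot2)

# Map price to the x-axis once, in ggplot(), so every layer inherits it
p_hist <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 500) +
  labs(
    title = "Diamond Price Distribution",
    x = "Diamond Price U$",
    y = "Frequency"
  ) +
  theme_minimal()

print(p_hist)  # draws the histogram on the active graphics device
```

Saving the plot in an object also makes it possible to add further layers or limits later without retyping the whole expression.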

The graph shows a high frequency of diamonds priced below $5,000.

We can get the mean diamond price.

*mean(diamonds$price)*

We can get the median diamond price.

*median(diamonds$price)*
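Comparing the two values is a quick way to check the shape that the histogram suggests. A minimal sketch, assuming ggplot2 is loaded for the dataset; `avg_price` and `med_price` are illustrative names.

```r
library(ggplot2)

avg_price <- mean(diamonds$price)
med_price <- median(diamonds$price)

# A mean above the median is consistent with a right-skewed distribution,
# where a relatively small number of expensive diamonds pull the average up
print(c(mean = avg_price, median = med_price))
print(avg_price > med_price)
```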

xlim() sets the limits of the x-axis. Here we restrict the plot to prices between 0 and 2,500.

*ggplot(data = diamonds) + geom_histogram(binwidth = 500, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$ - Binwidth 500") + ylab("Frequency") + theme_minimal() + xlim(0, 2500)*

We change the binwidth to 100 to see how the graph changes.

*ggplot(data = diamonds) + geom_histogram(binwidth = 100, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$ - Binwidth 100") + ylab("Frequency") + theme_minimal() + xlim(0, 2500)*

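With the narrower bins, the frequency shown for diamonds priced between $500 and $1,000 drops from about 10,000 to about 2,000 per bar, because each bar now covers a price range of 100 instead of 500.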
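The effect of the bin width can also be checked by counting observations per bin by hand. This is a sketch under stated assumptions: the helper `count_bins()` is hypothetical, and its bins start at 0 and are closed on the right, which may not match the default bin alignment of geom_histogram() exactly.

```r
library(ggplot2)

# Keep the same price range as the plots above
prices <- diamonds$price[diamonds$price <= 2500]

# Count observations per bin for a given bin width
# (hypothetical helper; bins are (0, width], (width, 2 * width], ...)
count_bins <- function(x, width, upper = 2500) {
  table(cut(x, breaks = seq(0, upper, by = width)))
}

counts_500 <- count_bins(prices, 500)
counts_100 <- count_bins(prices, 100)

print(counts_500)
# Each narrow bin lies inside one wide bin, so the tallest narrow bin
# cannot be taller than the tallest wide bin
print(max(counts_100) <= max(counts_500))
```

The total number of observations is the same in both cases; only the way they are divided among bars changes.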

We change the binwidth again, to 50, to see how the distribution changes.

*ggplot(data = diamonds) + geom_histogram(binwidth = 50, aes(x = price)) + ggtitle("Diamond Price Distribution") + xlab("Diamond Price U$ - Binwidth 50") + ylab("Frequency") + theme_minimal() + xlim(0, 2500)*

We can compare the price distribution across diamond cuts by faceting on cut with facet_wrap().

*ggplot(data = diamonds) + geom_histogram(binwidth = 100, aes(x = price)) + ggtitle("Diamond Price Distribution by Cut") + xlab("Diamond Price U$") + ylab("Frequency") + theme_minimal() + facet_wrap(~cut)*

The frequencies differ widely between the cuts.
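Part of that difference comes from how many diamonds fall into each cut category, which can be tabulated directly. A minimal sketch; `cut_counts` is an illustrative name.

```r
library(ggplot2)

# Number of diamonds in each cut category
cut_counts <- table(diamonds$cut)
print(cut_counts)

# Share of each cut, as a percentage of all diamonds
print(round(100 * prop.table(cut_counts), 1))
```

If the panels for the smaller groups are hard to read, facet_wrap() accepts scales = "free_y" so that each panel gets its own y-axis.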

We can draw a scatter plot of price against carat.

*qplot(carat, price, data = diamonds)*

Now we take a sample of the diamonds dataset to get a clearer visualization.

To create a sample of the dataset, we use the sample() function.

First, we open the help page to read the description of sample().

*?sample*

Sampling is random, so each run would normally draw a different sample. To make the sample reproducible, we fix the random seed with the set.seed() function.

*set.seed(2)*

With the seed set immediately before sampling, the sample stays the same every time we run the code.
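The effect of the seed can be seen on a small example. This sketch uses only base R; the variable names are illustrative.

```r
set.seed(2)
first_draw <- sample(10, 3)   # three numbers drawn from 1 to 10

set.seed(2)                   # reset the seed to the same value
second_draw <- sample(10, 3)  # repeats the first draw exactly

third_draw <- sample(10, 3)   # no reset: continues the random stream,
                              # so it is generally a different draw

print(identical(first_draw, second_draw))
```

The seed has to be set again before each call that should be repeated; setting it once does not make every later call return the same values.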

We use sample(nrow(diamonds), 1000). Here nrow(diamonds) returns 53940, so sample() draws 1,000 row numbers at random, without replacement, from 1 to 53940.

We then use diamonds[sample(...), ] to select the rows of diamonds whose row numbers were drawn by sample(), keeping all columns. This returns 1,000 observations, which we store in a new dataset called dsmall.

*dsmall <- diamonds[sample(nrow(diamonds), 1000),]*

*dsmall*
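A short check, sketched here with the illustrative name `row_ids`, confirms that the sample has the expected shape and that no row was drawn twice.

```r
library(ggplot2)

set.seed(2)
row_ids <- sample(nrow(diamonds), 1000)  # 1000 distinct row numbers
dsmall <- diamonds[row_ids, ]            # those rows, all columns

print(dim(dsmall))
print(anyDuplicated(row_ids))  # 0 means no row number appears twice
```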

We create a scatter plot of price against carat for the dsmall dataset. We map the colour of each point to the color of the diamond and set a fixed point size with I().

*qplot(carat, price, data = dsmall, colour = color, size = I(4))*

We can also map the shape of the points to the cut of the diamonds.
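A minimal sketch of such a shape mapping, written with ggplot() as one possible form. The object name `p_shape` is hypothetical, and dsmall is recreated so that the block runs on its own.

```r
library(ggplot2)

set.seed(2)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Map point shape to cut. cut is an ordered factor, so ggplot2 may warn
# that using shapes for an ordinal variable is not advised.
p_shape <- ggplot(dsmall, aes(x = carat, y = price, shape = cut)) +
  geom_point()

print(p_shape)
```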

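We can make the points larger or smaller by passing the size inside I(), which tells qplot() to use the value as given instead of mapping it to a scale. Note that colour = "red" without I() is treated as a mapped variable, so the points get a default colour and a legend instead of being drawn in red.

*qplot(carat, price, data = dsmall, colour = "red", size = I(2))*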

Wrapping the colour in I() sets it as a constant, so the points are actually drawn in red. We also add the alpha parameter to make the points semi-transparent, which shows where the bulk of the points lie when many observations overlap.

*qplot(carat, price, data = dsmall, colour = I("red"), size = I(2), alpha = I(1/10))*
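The same distinction between mapping and setting exists in ggplot(). As a sketch, assuming the same sample as above and using the hypothetical object name `p_fixed`: values inside aes() are mapped from the data, while values passed directly to the geom are fixed constants, which plays the role of I() in qplot().

```r
library(ggplot2)

set.seed(2)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Inside aes(): carat and price are mapped from the data
# Outside aes(): colour, size and alpha are fixed for every point
p_fixed <- ggplot(dsmall, aes(x = carat, y = price)) +
  geom_point(colour = "red", size = 2, alpha = 1 / 10)

print(p_fixed)
```

With alpha set to 1/10, roughly ten overlapping points are needed before a spot appears fully opaque.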

We draw a scatter plot of price against carat for the dsmall dataset. We also add "smooth" to the geom parameter to overlay a smoothed trend line, which summarizes how price changes with carat.

*qplot(carat, price, data = dsmall, geom = c("point", "smooth"))*

We also draw the same scatter plot for the full diamonds dataset, again adding "smooth" to the geom parameter to overlay the smoothed line.

*qplot(carat, price, data = diamonds, geom = c("point", "smooth"))*
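For comparison, the sketch below builds a point-plus-smooth plot with ggplot(). The smoothing method is set to loess explicitly here as an assumption for illustration; by default ggplot2 picks the method based on the number of observations. The object name `p_smooth` is hypothetical.

```r
library(ggplot2)

set.seed(2)
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]

# Points first, then a loess curve with its confidence band on top
p_smooth <- ggplot(dsmall, aes(x = carat, y = price)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

print(p_smooth)
```

The sketch uses the 1,000-row sample because loess becomes slow and memory-hungry on large datasets.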

We compare the price per carat across the different diamond colors using boxplots.

*qplot(color, price / carat, data = diamonds, geom = "boxplot")*
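The centre line of each box is the group median, which can also be computed directly. A minimal sketch; `price_per_carat` and `median_by_color` are illustrative names.

```r
library(ggplot2)

price_per_carat <- diamonds$price / diamonds$carat

# Median price per carat for each colour grade (D to J);
# these correspond to the centre lines of the boxes in the boxplot
median_by_color <- tapply(price_per_carat, diamonds$color, median)
print(round(median_by_color))
```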

We create jittered points to explore how the distribution of price per carat varies with the colour of the diamonds, using the geom parameter. The alpha parameter makes the points semi-transparent, so areas with many overlapping observations appear darker.

*qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1 / 5))*

As we decrease the alpha value, only the areas where many observations overlap remain dark.

*qplot(color, price / carat, data = diamonds, geom = "jitter", alpha = I(1 / 50))*

We create a histogram of carat for the diamonds dataset. We use the fill argument to colour the bars by the color variable.

*qplot(carat, data = diamonds, geom = "histogram", fill = color)*

We create a density plot of carat, with one curve for each diamond color.

*qplot(carat, data = diamonds, geom = "density", colour = color)*

We set binwidth to 0.01 to draw very narrow bins and limit the carat axis to the range 0 to 3. The facets argument draws one histogram of carat counts for each diamond color.

*qplot(carat, data = diamonds, facets = color ~ ., geom = "histogram", binwidth = 0.01, xlim = c(0, 3))*
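A roughly equivalent plot can be sketched with ggplot(), where the facets argument becomes facet_grid(). This is an illustration, not a required rewrite; `p_facet` is a hypothetical name.

```r
library(ggplot2)

p_facet <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.01) +
  facet_grid(color ~ .) +  # one row of panels per colour
  xlim(0, 3)               # values outside this range are dropped with a warning

print(p_facet)
```

The formula color ~ . puts color on the rows and nothing on the columns, so the panels are stacked vertically and share the same carat axis.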

We draw a scatter plot of price per carat against carat for the dsmall dataset, with a smoothing curve added by geom_smooth().

*qplot(carat, price / carat, data = dsmall, ylab = expression(frac(price, carat)), xlab = "Weight (carats)", main = "Small diamonds", xlim = c(.2, 1)) + geom_smooth()*