R statistics assignment
Question 13.
a.
read.csv()
metro <- read.csv("MetroMedian.csv", header = T)
This reads the file from the working directory and stores it in the data frame “metro”. Setting header = T (short for TRUE) tells read.csv() to treat the first row as column names, so the variable names from the file are retained.
b.
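install.packages("reshape2")
install.packages("data.table")
library(reshape2)
library(data.table)
The package “reshape2” is installed so that its function melt() can be used to create molten (long-format) data from the data frame that was read. The “data.table” package is installed because it offers tools that make data frames easier to manipulate. install.packages() only downloads a package; each one must also be loaded with library() before its functions can be called. Load data.table after reshape2; both packages define a melt() function, and the one loaded last is found first.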
tidyMetro <- melt(metro,id.vars=c("RegionName","State","SizeRank"),variable.name="date",na.rm=TRUE)
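Use the melt() function to convert the data frame from wide to long format. Its inputs are the object to be converted, the identifier columns to keep as they are (id.vars), and the name of the new column that holds the former column names (variable.name="date"); the measurements themselves go into a column called value. Setting na.rm=TRUE drops the rows whose value is missing.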
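The wide-to-long conversion can be tried on a small made-up table before running it on the real file. The sketch below assumes reshape2 is installed; the data frame wide and its month columns are invented for illustration and are not taken from MetroMedian.csv.

```r
library(reshape2)  # provides melt() for data frames

# Hypothetical wide table: one row per region, one column per month
wide <- data.frame(
  RegionName = c("Albany", "Austin"),
  State      = c("NY", "TX"),
  SizeRank   = c(1, 2),
  X2010.01   = c(100, 200),
  X2010.02   = c(110, NA)
)

# Every column not listed in id.vars is stacked into date/value pairs;
# na.rm = TRUE drops the row whose value is NA
long <- melt(wide,
             id.vars = c("RegionName", "State", "SizeRank"),
             variable.name = "date",
             na.rm = TRUE)

print(long)
nrow(long)  # 3 rows: four month cells minus the one NA
```

Each region now appears once per month that has a value, which is the shape the later mean() calculations rely on.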
c.
mean(tidyMetro$value[tidyMetro$State=="NY"])
Use the R function mean(). The expression inside it selects the value column of the tidyMetro data frame for the rows where State=="NY", and mean() averages those values.
d.
regionMean <- function(valueFrame,searchRegion) {
mean(valueFrame$value[valueFrame$RegionName==searchRegion])}
The function above is stored in the variable regionMean. Its inputs are the data frame (valueFrame) and the region to look for (searchRegion). Inside the function, the R function mean() averages the values of the variable of interest for the rows where RegionName is the same as the searchRegion passed to the function.
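As a quick check of how the function behaves, it can be called on a tiny invented data frame. The data frame toy and its region names are assumptions made up for this sketch, not values from the assignment data.

```r
regionMean <- function(valueFrame, searchRegion) {
  mean(valueFrame$value[valueFrame$RegionName == searchRegion])
}

# Hypothetical long-format data standing in for tidyMetro
toy <- data.frame(
  RegionName = c("Albany", "Albany", "Austin"),
  value      = c(100, 110, 200)
)

regionMean(toy, "Albany")  # 105
regionMean(toy, "Boston")  # NaN: no rows match, so mean() gets an empty vector
```

The comparison valueFrame$RegionName == searchRegion produces a logical vector with one entry per row, and indexing value with it keeps only the matching rows.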
Question 14
a.
beaches <- read.csv("BeachWaterQuality.csv", header = T)
Because the data is stored as a .csv (comma-separated values) file, use the function read.csv() to read it, taking the column names in the first row as the variable names.
b.
beaches$Results[is.na(beaches$Results)] <- 0
Select the variable Results from the beaches data frame, use is.na() to find its missing (NA) entries, and assign 0 to them.
head(beaches)
Display the first few rows to check the format of the data frame.
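The replacement step can be seen in isolation on a short invented vector. The data frame toy below is a hypothetical stand-in for the Results column. Keep in mind that replacing NA with 0 treats a missing sample as a zero count, which changes any average computed afterwards.

```r
# Hypothetical stand-in for the Results column
toy <- data.frame(Results = c(12, NA, 30, NA))

# is.na() marks the missing entries; assigning through that index
# overwrites only those positions
toy$Results[is.na(toy$Results)] <- 0

toy$Results  # 12 0 30 0
```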
c.
new.Date <- strptime(as.character(beaches$Sample.Date),"%m/%d/%Y")
Create a variable new.Date that holds the dates in R’s date-time format (class POSIXlt), using the R function strptime(). as.character() converts the date variable to character type so that strptime() can parse it. The format string “%m/%d/%Y” says that the dates in the file are written as month/day/year, with the year in four digits. On input, %m and %d accept months and days with or without a leading 0.
beaches$new.date <- new.Date
Add the newly created variable new.Date to the beaches data frame as a column named new.date.
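The format string can be tested on a few hand-written dates before it is applied to the whole column. The strings in raw are invented examples, not values from BeachWaterQuality.csv.

```r
# Hypothetical date strings in month/day/four-digit-year form
raw <- c("7/4/2015", "07/04/2015", "12/31/2014")

parsed <- strptime(raw, "%m/%d/%Y")

class(parsed)               # "POSIXlt" "POSIXt"
format(parsed, "%Y-%m-%d")  # "2015-07-04" "2015-07-04" "2014-12-31"

# Text that does not match the format parses to NA
bad <- strptime("not a date", "%m/%d/%Y")
is.na(bad)                  # TRUE
```

Checking for NA after parsing is a cheap way to spot rows whose dates are written in a different format.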
d.
beachPlot <- function(beachData,beachName,sampleLocation){
beaches2 <- subset(beachData, Beach.Name==beachName & Sample.Location==sampleLocation)
plot(beaches2$new.date,beaches2$Results, ylab = "Bacterial Count", xlab = "Date",main=c('Bacterial count for', beachName,sampleLocation))
# sort by date once so the line connects the points in time order
ord <- order(beaches2$new.date)
lines(beaches2$new.date[ord], beaches2$Results[ord],col="red")
}
Create a function and assign it the name beachPlot. Its inputs are the data frame, beachName and sampleLocation. subset() uses these inputs to select from the input data frame the rows whose Beach.Name and Sample.Location match, and the result is stored in the data frame beaches2. The plot() function then creates the plot from the x-axis variable, the y-axis variable, the y-axis label, the x-axis label and the main title. The title is entered as a character vector so that it can include the beachName and sampleLocation given to the function.
Add a red line to the plot for Results against date. The rows are sorted by date with order() first so that the line connects the points in time order. lines() draws on the existing plot and uses its axis ranges, so it takes no xlim or ylim arguments.
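The reason for sorting before drawing the line is easier to see with a few unsorted points. The dates and counts below are invented for this sketch; with the real data the same idea applies to the new.date and Results columns.

```r
# Hypothetical unsorted sample dates and bacterial counts
d <- as.Date(c("2015-07-03", "2015-07-01", "2015-07-02"))
y <- c(30, 10, 20)

ord <- order(d)  # positions that put the dates in ascending order
ord              # 2 3 1

plot(d, y, xlab = "Date", ylab = "Bacterial Count")
lines(d[ord], y[ord], col = "red")  # connects the points from left to right
```

Without the reordering, lines() would join the points in row order and the line would double back on itself.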
Question 15.
a.
mileage <- read.csv("Insight (3).csv", header = T)
head(mileage)
The data set in the directory is stored as a .csv file with the name “Insight (3).csv”. Use the read.csv() function to read the data file and store the variables in a data frame called mileage.
Check the first few rows of the data frame using the head() function.
b.
plot(MPG~Avg.Temp, data = mileage, ylab="",xlab="Average Temperature",main="MPG against Avg.Temp and Car Said",col="blue")
Use the plot() function to create the plot. In the formula, the tilde means y~x, so MPG goes on the y-axis and Avg.Temp on the x-axis, with both taken from the mileage data frame. Leave the y-axis label empty because a second series will be added afterwards. Label the x-axis now because both variables are plotted against the same x-variable. Add a title using main=”” and set the colour for this plot to blue.
c.
abline(lm(MPG~Avg.Temp,data = mileage),col="blue")
Add a trend line to the plot. The line is the line of best fit from a linear model of the response variable (MPG) on the predictor (Avg.Temp), with both variables taken from the mileage data frame. Set the color of the trend line to blue using col=”blue”.
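When abline() is given a fitted linear model, it draws the line defined by that model's intercept and slope. The sketch below uses invented numbers that lie exactly on a straight line, so the coefficients are easy to verify by hand; toy is a hypothetical stand-in for the mileage data.

```r
# Hypothetical data: MPG rises by 4 for every 15 degrees
toy <- data.frame(
  Avg.Temp = c(30, 45, 60, 75),
  MPG      = c(40, 44, 48, 52)
)

fit <- lm(MPG ~ Avg.Temp, data = toy)
coef(fit)  # intercept 32, slope about 0.2667 (that is, 4/15)

plot(MPG ~ Avg.Temp, data = toy, col = "blue")
abline(fit, col = "blue")  # draws MPG = 32 + 0.2667 * Avg.Temp
```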
d.
par(new = TRUE)
par() is the R function that sets graphical parameters. Setting new=TRUE tells R not to clear the plotting area before the next high-level plot, so the new plot is drawn on top of the existing one.
plot(Car.Said~Avg.Temp, data=mileage,col="red",ylab="MPG/Car Said",xlab="",axes=FALSE)
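Use the plot() function to draw the second series (Car.Said) over the existing plot. Set the color to red and add the y-axis label. axes=FALSE suppresses the axes of the second plot so that they are not drawn over the first. Note that this second plot() call chooses its own y-axis range, so the red points match the scale of the existing axis only if both plots use the same range (for example through ylim).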
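Because each plot() call picks its own y-axis range, two series overlaid with par(new = TRUE) are not automatically drawn on the same scale. One possible alternative, shown here as a sketch with invented numbers rather than the Insight data, is to compute a common ylim and add the second series with points().

```r
# Hypothetical stand-in for the mileage data
toy <- data.frame(
  Avg.Temp = c(30, 45, 60, 75),
  MPG      = c(40, 44, 48, 52),
  Car.Said = c(43, 47, 50, 55)
)

# One y-range covering both series, so they share the same axis
yr <- range(toy$MPG, toy$Car.Said)

plot(MPG ~ Avg.Temp, data = toy, col = "blue", ylim = yr,
     xlab = "Average Temperature", ylab = "MPG / Car Said")
points(Car.Said ~ Avg.Temp, data = toy, col = "red")

# Trend lines from the two linear models
abline(lm(MPG ~ Avg.Temp, data = toy), col = "blue")
abline(lm(Car.Said ~ Avg.Temp, data = toy), col = "red")
```

With a shared axis, the vertical gap between the blue and red points can be read directly as the difference between measured and reported mileage.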
e.
Adding a red line that fits the linear model.
abline(lm(Car.Said~Avg.Temp,data = mileage),col="red")
Create a linear model of Car.Said on Avg.Temp and add its trend line for the second plot. Set the colour to red.
par(new=FALSE)
legend("topleft",legend=c("Measured MPG","Car Reported MPG"),
text.col=c("blue","red"),pch=c(16,16),col=c("blue","red"))
Add a legend to the plot and place it at the top left. The labels of the legend are “Measured MPG” and “Car Reported MPG”, and the text colors are blue and red respectively. pch sets the plotting symbol (16 is a filled circle), and col colors the symbols blue and red respectively.