r:flamescourse

By Goele Bossaert (KULeuven), 29/09/2014-02/10/2014

- ask for help on mailing lists : http://www.r-project.org/mail.html (read the posting guide)

- R for Dummies (Joris Meys = gives advanced R course?)

- open a new r script
- left upper corner = script (= programme)
- lower left = console
- upper right corner = workspace & history
- lower right corner
- files = file manager
- plots
- packages
- help = same as http://cran.r-project.org/manuals.html (more or less), in the manuals, there are
- Introduction to R
- R Data Import/Export
- R installation and Administration
- the rest is more advanced

in the console, question mark + word

?plot

→ opens related documentation in the help screen

- R is case-sensitive
- end a command with ; or a new line
- comment lines start with #
- assign a value with ← or =

x<-5 x=5

R studio != R

- R studio = syntax coloured
- R = doesn't have other windows, so you have to ask for them to be shown

- R studio
- lower right corner, tab “packages”
- click on the link = get information

- R
*library()*: list of loaded packages*library(help=nlme)*: help on the nlme package

- packages have to be installed only once, but loaded every session
- install a new package (e.g. rsiena)
- R studio
- in the “packages” tab, click on
*install packages*and type*RSiena*

- R
*install.packages(“RSiena”)*

- load a package
- R studio
- click on the checkbox

- R
*library(“RSiena”)*

- list of loaded packages
*search()*

- list of available packages

library()

- load package
*robustreg*- actually, first need to install it

install.package("robustreg") library("robustreg")

- check the content of this package

library(help=robustreg)

- check which packages are loaded in your system

search()

getwd() setwd()

# remove vector test1 rm(list="test1") #remove all vectors rm(list=ls())

cf. script_1.R

Functions:

- order() = list the place of the smallest number, then the second smallest, etc.
- rev() = reverse
- e.g.
*rev(order(x))*: list from the biggest to the smallest places

- sort() : output the numbers from smallest to biggest (N.B. : doesn't change your vector)
- unique(): prune the vector of equal values (if y[2,3,2,4], unique will give [2,3,4])

Numeric functions

- abs() : absolute value
- trunc() : cut the decimals
- round() : round to most proximate integer
*round(z, digits=2)*outputs 2 decimals

y <- 1:8 #y=(1,2,3,4,5,6,7,8) z <- 20:1 seq(from=-2, to=3, by=0.5) u <- rep(2:5,2) #u=(2,3,4,5,2,3,4,5) v <- rep(2:5,2:5) #v=[2 2 3 3 3 4 4 4 4 5 5 5 5 5]

- Create the vector test1 with numbers 1.5, 0.7, 45.6
- Create the following vectors:
- create a vector x with elements (1,2,3,..,100)
- create a vector y with elements (0,5,10,15,….,500)
- create a vector z1 with elements (1,1,1,2,2,2,….,50,50,50)
- create a vector z2 with elements (1,2,2,3,4,4,…,9,10,10)

test1<-c(1.5, 0.7, 45.6) test1 x<-c(1:100) x y<-c(seq(from=0, to=500, by=5)) y z<-rep(1:50,3) z z1<-sort(z1) z1 z1<-sort(rep(1:50,3)) z1 z1<-sort(rep(1:50,each=3)) z1 z2.1 <- rep(1:2,5) z2.1 z2<-rep(1:10,z2.1) z2 z2 <- rep(1:10, rep(c(1,2),5)) z2

x <- c(9,10,8,5,9) y <- x<9 #y = FALSE FALSE TRUE TRUE FALSE

N.B.: TRUE = 1 and FALSE = 0

- Logical expressions : <, ⇐, >, >=, ==, !=
- logical operators : &, |, !, %in%
- e.g. x %in% c(3, 6, 2, 10)

How many values >= 10 in a given vector ?

x <- 1:20 y <-x>10 y sum(y)

nchar(x) = number of characters in x

- less information → most information: logical < numeric < character
- str(x) will give you the type of vector
- missing values
- NA for logical & numeric data
- “” for character

x[subset]

y <- c(33, 55, 4, 22, 89) y #2nd and 3rd values y[c(2,3)] #everything exept the 5th value y[-5] y[y > 30] === Exercise 3 === <code> y1 <- c(seq(from=-5, to=10)) y1 y1 <- -5:10 y1 y2 <- y1>-2 & y1<7 y2 sum(y2) progr <- c("SAS progr", "R progr", "SPSS progr") progr stat <- substr(progr, c(1,1,1), c(3,1,4)) stat stat <- substr(progr, 1, c(3,1,4)) stat first <- c(1,1,1) last <- nchar(progr)-6 stat <- substr(prog, first, last) stat

= sum of vectors of same type , same length

bind vectors by binding rows (rbind()) or columns (cbind()) together

x <- rbind(c(1:4), c(5:8)) x

x = 2×4 integer matrix 2 = rows, 4 = columns

name rows and colums (by default [1,] and [,1];

- dimnames(x)
- rownames(x)
- colnames (x)

#exercise 3 v <- rnorm(100) v mat1 <- matrix(data=v, nrow=10, ncol=10, byrow=TRUE) mat1 mat1 <- rbind(c(1:10), mat1) mat1 mat1 <- rbind(mat1, 10:1) mat1 mat1 <- cbind(1:12, mat1) mat1

Matrix extended : can contain more than numbers in any “cell”.

- merge() to merge data frames

authors <- data.frame( surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"), nationality = c("US", "Australia", "US", "UK", "Australia"), deceased = c("yes", rep("no", 4))) books <- data.frame( name = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley", "McNeil", "R Core"), title = c("Exploratory Data Analysis", "Modern Applied Statistics ...", "LISP-STAT", "Spatial Statistics", "Stochastic Simulation", "Interactive Data Analysis", "An Introduction to R"), other.author = c(NA, "Ripley", NA, NA, NA, NA, "Venables & Smith")) authors books m1 <- merge(authors, books, by.x = "surname", by.y = "name") m1 m2 <- merge(books, authors, by.x = "name", by.y = "surname") m2

- order has to be correct : x related to authors in the first merge() and w to books
- by default, takes only what is in common (if only in one data frame, not taken) = natural join (all = FALSE)
- change it to all=TRUE to get a full outer join
- change it to all.x=TRUE = take all from x data frame (and intersection between x and y, but not what is in y and not in x) = left outer join
- same with all.y=TRUE = right outer join

sub1 <- authors[1:3, 1:2] sub1 sub2 <- authors$nationality sub2

With the dollar sign ($) = list (other construct, not seen yet)

Can also use *subset()*:

?airquality #need the datasets packages loaded sub1 <- subset(airquality, Temp > 80, select = c(Ozone, Temp)) sub1 sub2 <- subset(airquality, Day == 1, select = -Temp) sub2 sub3 <- subset(airquality, select = Ozone:Wind) sub3

- need the doBy library to be loaded (and installed)

air10 <- airquality[1:10,] # nothing after the comma = take all variables air10 sort_air10 <- orderBy(~Temp, data=air10) sort_air10

- Data Loblolly(library(datasets))

- Check the data description
- Ask about the attributes of the data
- Take a subset of data Loblolly which contains the observations from 5 until 12 and only the variables height and age
- Take a subset of Loblolly data which has only seed levels 307 and 311
- Sort the data Loblolly in a way that:
- age will be in descending order (from the largest to the smallest);
- age will be sorted in descending order (from the largest to the smallest), height will be sorted in ascending order (from the smallest to the largest) for every level of age

?Loblolly str(Loblolly) subl1 <- Loblolly[5:12, 1:2] subl1 subl2 <- subset(Loblolly, Seed == 307 | Seed == 311,) subl2 Loblolly_age_desc <- orderBy(~-age, data=Loblolly) Loblolly_age_desc Loblolly_age_desc2 <- orderBy(~-age+height, data=Loblolly) Loblolly_age_desc2

= sum of vectors (differents types and different length)

#list x <-1:2 y <- letters[1:2] z <-1:3 list(x,y,z) #name list mixlist <- list(logica = c(T,T,T,F), plant = c("tree", "bush","grass"), comment="these components are unrelated")

- clickodrome : workspace > import dataset > from text file…
- with code

chol <- read.table(file=file.choose(), header=TRUE)

If no window opens, check for it in the background

- import .csv and .xls/.xlsx

package *foreign* to import from SPSS or SAS

- SPSS:
*read.spss* - SAS:
*read.xport*or*read.ssd*

builtins() #or ls(baseenv(), all = TRUE)

Name <- function(arg1, arg2...) {expression}

We will use a t-test as example (even though t-test function already exists).

t.test.p <- function(x, mu = 0) { n <- length(x) t.stat <- (mean(x) - mu)/(sd(x)/n^0.5) 2 * (1 - pt(abs(t.stat), n - 1)) } y <- 5:27 y t.test.p(y)

- mu = 0, by default mu equals zero (if not specified)
- by default, the function outputs the last thing you calculated in your function
- to circumvent this, create a list with the different values as latest command in your function

- tip: first test the commands of your function, then put it in a function
- all variables created by the function are publicly available afterwards

Write a function which returns the most elementary statistics for a sample x :

min, median, max, mean, sd and length. Apply your function on a vector x with values from 25 to 80.

v = c(25:80) v elem_stats <- function(v) { min1 = min(v) median1 = median(v) max1 = max(v) mean1 = mean(v) sd1 = sd(v) length1 = length(v) list (min = min1, median = median1, max = max1, mean = mean1, sd = sd1, length = length1) } elem_stats(v)

- high-level graphic functions (e.g. plot(), curve()) = drawing a graphic
- low-level graphic functions to draw extra things on the graphic (title, legend, etc.)
- graphs are output on the “Plots” tab in the lower right corner window, you can navigate between graphs with the two arrows (← and →) under the tabls
- par() to set or look at parameter from the graph
- mfrow() is a par() parameter to split the screen into several sections to have several graphics on one screen
- par(mfrow(c(2,2))) va créer 4 graphiques (2 x 2)

- colours: palette() for a list ; colors() for more choice (list them), but use the name corresponding: colors(“sienna4”)
- add an extra point :
*points(1000, 260)*- to add text:
*text(locator(1), “new point”, cex=1.5)*- run this line separately, because you have to point where you want to put the text (because locator() was used)

- add a line with abline() for a straight line or lines() for a curve
- add a legend with legend()

plot(sin, -2*pi, 2*pi, col=1, lty=1) plot(cos, -2*pi, 2*pi, col=2, lty=2, add=TRUE) plot(tan, -2*pi, 2*pi, col=3, lty=3, add=TRUE) legend(locator(1), c('Sine', 'Cosine', 'Tangent'), lty=c(1, 2, 3), col=c(1, 2, 3))

01/10/2014

- different distributions, cumulative probabilities
- median absolute deviation : explanation

02/10/2014

write.csv(galileo,file='galileo.csv',row.names=FALSE) write.csv(central.park.cloud,file='central_park_cloud',row.names=FALSE) write.csv(cfb,file='cfb',row.names=FALSE) write.csv(alltime.movies,file='alltime.movies',row.names=FALSE) write.csv(normtemp,file='normtemp',row.names=FALSE)

chisq() smokers (pipe, sigaret), dead or alive

chol <- read.table(file=file.choose(), header=TRUE) names(chol) tab.chol <- table(chol$SMOKE, chol$MORT) tab.chol chisq.test(chol$SMOKE, chol$MORT)

output:

Pearson's Chi-squared test data: chol$SMOKE and chol$MORT X-squared = 1.6677, df = 2, p-value = 0.4344

p-value > 0.05 → cannot reject hypothesis (that smokers and non-smokers are different concerning death).

- t-test is a parametric test → need to check that data are normally distributed
- use shapiro.test() to check normality
- if normal, use t.test()
- if not, use wilcox.test()

- check that the two samples are normal (with shapiro.test())
- check whether variances are equal or not (to know what to test)
- F test to compare two variances:
*var.test()*

- cor()

- use lm() for linear model
- this gives you the a and b from y=ax+b so you can draw it with abline() in your graph
- different models (e.g. polynomial model)

r/flamescourse.txt · Last modified: 2014/10/03 11:47 by carl