Start → Programs → Teaching → Maths
Click on Maths. You will then find various icons, including one for R. Double-click on this icon. The screen should then display a window headed
R Console
After the red > prompt, type
data(faithful)
to load the faithful data frame, which has 272 rows and 2 columns: the duration of each eruption and the waiting time until the next eruption (both in minutes) for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Typing
names(faithful)
will give you the names of the variables contained in this data frame, while typing
?faithful
will give a more detailed description of the data frame, and typing
faithful
will display all the data (but you will need to scroll back with the right hand scroll bar to see it all). You can refer to the first column alone as faithful[,1] or as faithful[,"eruptions"] and similarly for the second column, but it makes life a bit easier to type
attach(faithful)
after which you can refer to the first column simply as eruptions and to the second simply as waiting. You can refer to the third element of waiting simply as waiting[3], and this is in fact the same as faithful[3,2] since waiting constitutes the second column of faithful.
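If you would like to convince yourself that these ways of referring to a column really are equivalent, a minimal sketch along the following lines should do it. It avoids attach so that it can be run on its own, it uses the faithful$eruptions form (standard R, though not mentioned above), and the variable names are made up purely for illustration.

```r
# faithful is supplied with R, so this runs in a fresh session.
data(faithful)

# Three ways of picking out the first column (names here are illustrative only)
first_by_index  <- faithful[, 1]
first_by_name   <- faithful[, "eruptions"]
first_by_dollar <- faithful$eruptions

# Each comparison should print TRUE
identical(first_by_index, first_by_name)
identical(first_by_name, first_by_dollar)

# The third waiting time, reached as an element of the column
# and as row 3, column 2 of the data frame
third_from_column <- faithful$waiting[3]
third_from_frame  <- faithful[3, 2]
third_from_column == third_from_frame
```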
Try finding some simple descriptive statistics. If you simply type
mean(waiting)
the mean waiting time will be displayed, whereas if you type
m <- mean(waiting)
then the variable m will be given a value equal to this mean. You can then type m to see its value, but you can also use it in further calculations, which may make formulae look less cumbersome than if you use mean(waiting) every time. (You may care to note that older versions of S and R allowed _ instead of <- for assignment, although in the words of Venables and Ripley, "We regard the use of _ for assignment as unreadable"; current versions of R no longer accept _ as an assignment operator. They also point out that, "Assignments using the right-hand pointing combination -> are also allowed to make assignments in the opposite direction, but these are never needed and are little used in practice.")
Some other descriptive statistics, the meanings of which are pretty obvious, are given by median, var and sd. You may also care to see what results from quantile. Naturally, you can refer to the second of the numbers which are printed out when you type quantile(waiting) as quantile(waiting)[2]. You may also care to try summary or fivenum. The number of observations can be found from length.
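The functions just mentioned can be tried together in a short session such as the sketch below. It works on a copy of the waiting column called w, a name chosen here only for brevity, so that it does not depend on attach having been called.

```r
data(faithful)
w <- faithful$waiting   # w is just a convenient local name

m <- mean(w)            # stored for reuse rather than printed
c(mean = m, median = median(w), variance = var(w), sd = sd(w))

q <- quantile(w)        # minimum, quartiles and maximum
q                       # the whole set of quantiles
q[2]                    # the second of them, the lower quartile

summary(w)              # quartiles together with the mean
fivenum(w)              # Tukey's five-number summary
n <- length(w)          # number of observations
n
```

Note that the standard deviation is simply the square root of the variance, so sd(w)^2 and var(w) should agree up to rounding.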
Sometimes you want not just the mean of one variable, but the means of several. In this connection it is worth looking at the result of
apply(faithful,2,mean)
(or, e.g., apply(faithful,2,sum) or apply(faithful,1,mean)). The second argument says whether the function is applied to each row (1) or to each column (2).
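To see what the second argument of apply does, it may help to compare the shapes of the results, as in the following sketch. The names col_means and row_means are illustrative, and colMeans is a standard R shortcut not mentioned above that should give the same answer as the column-wise call.

```r
data(faithful)

# Second argument 2: apply the function to each column
col_means <- apply(faithful, 2, mean)
col_means                      # one value per variable

# Second argument 1: apply the function to each row
row_means <- apply(faithful, 1, mean)
length(row_means)              # one value per observation

# colMeans is a built-in shortcut for the column-wise case
col_means_shortcut <- colMeans(faithful)
all.equal(col_means, col_means_shortcut)
```

Averaging an eruption duration with a waiting time across a row has no obvious interpretation here; the row-wise call is shown only to illustrate the mechanics.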
You should then try some simple graphical techniques. Investigate the results of boxplot(waiting) or stem(waiting) (you can vary the display of the latter by, e.g., typing stem(waiting,scale=2)).
Sometimes we want to investigate a subset of the data. We can display the values of waiting for which eruptions is less than 3 by waiting[eruptions < 3], and we can similarly refer to eruptions[eruptions < 3]. If you need two conditions they can be joined by | for 'or' as in
waiting[(eruptions<3)|(eruptions>5)]
or by & for 'and' (and for that matter you can use ! for 'not').
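The three logical operators can be tried out as in the sketch below. The cut-off values 3 and 5 are simply those used in the text, the result names are invented for illustration, and the columns are copied into ordinary variables so that the sketch does not rely on attach.

```r
data(faithful)
eruptions <- faithful$eruptions
waiting   <- faithful$waiting

# 'or': eruptions shorter than 3 minutes or longer than 5 minutes
short_or_very_long <- waiting[(eruptions < 3) | (eruptions > 5)]

# 'and': eruptions between 3 and 5 minutes inclusive
middling <- waiting[(eruptions >= 3) & (eruptions <= 5)]

# 'not': everything that is not shorter than 3 minutes
not_short <- waiting[!(eruptions < 3)]

# The first two subsets are complementary, so their sizes
# should add up to the total number of observations
length(short_or_very_long) + length(middling) == length(waiting)
```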
Try to examine the data in the way that they do. Begin by looking at histograms and box-and-whisker plots of intereruption times and scatter plots of intereruption times (waiting times) against eruption duration times. Look for a suitable definition of a 'short' duration, in the sense that it appears that the scatterplot falls into two distinct parts depending on whether the eruption concerned is short or long. Then try a parallel box-and-whisker plot showing such plots of intereruption times for short and long eruption times side by side.
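One possible way of producing the plots just described is sketched below. The cut-off d = 3 is only a placeholder assumption, not a recommended answer; you should replace it with whatever value the scatter plot suggests to you. The name duration_class is likewise made up for this sketch.

```r
data(faithful)

# Histogram and box-and-whisker plot of the waiting times
hist(faithful$waiting, xlab = "Waiting time (minutes)",
     main = "Intereruption times")
boxplot(faithful$waiting, ylab = "Waiting time (minutes)")

# Scatter plot of waiting time against eruption duration
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (minutes)",
     ylab = "Waiting time (minutes)")

# Placeholder cut-off between 'short' and 'long' eruptions (an assumption)
d <- 3
duration_class <- factor(ifelse(faithful$eruptions < d, "short", "long"),
                         levels = c("short", "long"))

# Parallel box-and-whisker plots, one for each class of eruption
boxplot(faithful$waiting ~ duration_class,
        xlab = "Eruption duration class",
        ylab = "Waiting time (minutes)")
```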
Try to guess a simple prediction rule of the form that short eruptions will be followed by intereruption intervals of length x while long ones will be followed by intervals of length y, for suitable values of x and y. You can then define a variable representing your prediction by using a construction of the form
predwaiting <- ifelse(eruptions < d, x, y)
If d is the value you choose to distinguish short from long eruption times, then such a variable takes the value x if the condition eruptions < d is true and otherwise ('else') takes the value y.
You can then find the error with your prediction rule by
error <- predwaiting - waiting
You could then try a histogram or a boxplot of the errors or a Q-Q plot to see how closely they appear to be normally distributed.
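Putting the last few steps together, a complete run might look like the sketch below. The values of d, x and y are arbitrary placeholders chosen only so that the code runs; they are assumptions, not the answer, and you should substitute your own guesses.

```r
data(faithful)

# Placeholder values (assumptions for illustration only)
d <- 3     # cut-off between short and long eruptions, in minutes
x <- 55    # predicted wait after a short eruption, in minutes
y <- 80    # predicted wait after a long eruption, in minutes

# The prediction is x where the eruption was short and y otherwise
predwaiting <- ifelse(faithful$eruptions < d, x, y)

# Positive errors mean the rule predicted too long a wait
error <- predwaiting - faithful$waiting
summary(error)

# Graphical checks on the errors
hist(error, xlab = "Prediction error (minutes)", main = "Errors")
boxplot(error, ylab = "Prediction error (minutes)")
qqnorm(error)   # normal Q-Q plot
qqline(error)   # reference line through the quartiles
```

If the points in the Q-Q plot lie close to the reference line, the errors are roughly normally distributed; marked curvature or stray points at the ends suggest otherwise.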
Try and see how well your rule is obeyed by the two data sets provided by Chatterjee et al. which can be found at
and
Try plotting the errors using your rule with the original data frame faithful and with the two data sets provided by Chatterjee et al.
This is just a suggestion. Other ways of exploring the data may occur to you; at this stage the most important thing is to get used to using R to carry out simple exploration of data. You may like to know that there are quite a lot of data sets supplied with R; for a list, type
data()