Title: | Foundations and Applications of Statistics Using R (2nd Edition) |
---|---|
Description: | Data sets and utilities to accompany the second edition of "Foundations and Applications of Statistics: an Introduction using R" (R Pruim, published by AMS, 2017), a text covering topics from probability and mathematical statistics at an advanced undergraduate level. R is integrated throughout, and access to all the R code in the book is provided via the snippet() function. |
Authors: | Randall Pruim [aut, cre] |
Maintainer: | Randall Pruim <[email protected]> |
License: | GPL (>=2) |
Version: | 1.2.4 |
Built: | 2025-03-05 05:35:18 UTC |
Source: | https://github.com/rpruim/fastr2 |
ACT scores and college GPA for a small sample of college students.
A data frame with 26 observations on the following 2 variables.
ACT score
GPA
gf_point(GPA ~ ACT, data = ACTgpa)
Flights categorized by destination city, airline, and whether or not the flight was on time.
A data frame with 11000 observations on the following 3 variables.
a factor with levels LosAngeles, Phoenix, SanDiego, SanFrancisco, Seattle
a factor with levels Delayed, OnTime
a factor with levels Alaska, AmericaWest
Barnett, Arnold. 1994. “How numbers can trick you.” Technology Review, vol. 97, no. 7, pp. 38–45.
These and similar data appear in many text books under the topic of Simpson's paradox.
tally(airline ~ result, data = AirlineArrival, format = "perc", margins = TRUE)
tally(result ~ airline + airport, data = AirlineArrival, format = "perc", margins = TRUE)

AirlineArrival2 <-
  AirlineArrival %>%
  group_by(airport, airline, result) %>%
  summarise(count = n()) %>%
  group_by(airport, airline) %>%
  mutate(total = sum(count), percent = count / total * 100) %>%
  filter(result == "Delayed")

AirlineArrival3 <-
  AirlineArrival %>%
  group_by(airline, result) %>%
  summarise(count = n()) %>%
  group_by(airline) %>%
  mutate(total = sum(count), percent = count / total * 100) %>%
  filter(result == "Delayed")

gf_line(percent ~ airport, color = ~ airline, group = ~ airline, data = AirlineArrival2) %>%
  gf_point(percent ~ airport, color = ~ airline, size = ~ total, data = AirlineArrival2) %>%
  gf_hline(yintercept = ~ percent, color = ~ airline, data = AirlineArrival3, linetype = "dashed") %>%
  gf_labs(y = "percent delayed")
Air pollution measurements at three locations.
A data frame with 6 observations on the following 2 variables.
a numeric vector
a factor with levels Hill Suburb, Plains Suburb, Urban City
David J. Saville and Graham R. Wood, Statistical methods: A geometric primer, Springer, 1996.
data(AirPollution)
summary(lm(pollution ~ location, data = AirPollution))
Undergraduate students in a physics lab recorded the height from which a ball was dropped and the time it took to reach the floor.
A data frame with 30 observations on the following 2 variables.
height in meters
time in seconds
Steve Plath, Calvin College Physics Department
gf_point(time ~ height, data = BallDrop)
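Since a freely falling object obeys time = sqrt(2 * height / g), a power-law fit with exponent near 1/2 is a natural follow-up. A brief sketch (using the variable names from the example above):

# slope of the log-log fit should be close to 0.5
ball.lm <- lm(log(time) ~ log(height), data = BallDrop)
coef(ball.lm)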
Major League batting data for the seasons from 2000-2005.
A data frame with 8062 observations on the following 22 variables.
unique identifier for each player
year
for players who were traded mid-season, indicates which portion of the season the data cover
three-letter code for team
a factor with levels AA, AL, NL
games
at bats
runs
hits
doubles
triples
home runs
runs batted in
stolen bases
caught stealing
bases on balls (walks)
strike outs
intentional base on balls
hit by pitch
a numeric vector
sacrifice fly
grounded into double play
data(Batting)
gf_histogram( ~ HR, data = Batting)
Data from an experiment to determine the efficacy of various methods of eradicating buckthorn, an invasive woody shrub. Buckthorn plants were chopped down and the stumps treated with various concentrations of glyphosate. The next season, researchers returned to see whether the plant had regrown.
A data frame with 165 observations on the following 3 variables.
number of new shoots coming from stump
concentration of glyphosate applied
whether the stump was considered dead
David Dornbos, Calvin College
data(Buckthorn)
This data frame contains data from an experiment to see if insects are more attracted to some colors than to others. The researchers prepared colored cards with a sticky substance so that insects that landed on them could not escape. The cards were placed in a field of oats in July. Later the researchers returned, collected the cards, and counted the number of cereal leaf beetles trapped on each card.
A data frame with 24 observations on the following 2 variables.
color of card; one of B(lue), G(reen), W(hite), Y(ellow)
number of insects trapped on the card
M. C. Wilson and R. E. Shade, Relative attractiveness of various luminescent colors to the cereal leaf beetle and the meadow spittlebug, Journal of Economic Entomology 60 (1967), 578–580.
data(Bugs)
favstats(trapped ~ color, data = Bugs)
A theme for use with lattice graphics.
col.fastR(bw = FALSE, lty = 1:7)
bw |
whether color scheme should be "black and white" |
lty |
vector of line type codes |
Returns a list that can be supplied as the theme to trellis.par.set().
This theme was used in the production of the book Foundations and Applications of Statistics
Randall Pruim
trellis.par.set, show.settings
trellis.par.set(theme = col.fastR(bw = TRUE))
show.settings()
trellis.par.set(theme = col.fastR())
show.settings()
Convenience wrappers around apply() to compute row and column percentages of matrix-like structures, including the output of xtabs.
col.perc(x)
row.perc(x)
x |
matrix-like structure |
Randall Pruim
row.perc(tally(~ airline + result, data = AirlineArrival))
col.perc(tally(~ airline + result, data = AirlineArrival))
These data were collected by I-Cheng Yeh to determine how the compressive strength of concrete is related to its ingredients (cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate) and age.
Concrete is a data frame with the following variables.
percentage of limestone
water-cement ratio
compressive strength (MPa) after 28 days
Appeared in Devore's "Probability and Statistics for Engineering and the Sciences" (6th ed.). The variables have been renamed.
These data were collected by I-Cheng Yeh to determine how the compressive strength of concrete is related to its ingredients (cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate) and age.
ConcreteAll is a data frame with the following 9 variables.
amount of cement (kg/m^3)
amount of blast furnace slag (kg/m^3)
amount of fly ash (kg/m^3)
amount of water (kg/m^3)
amount of superplasticizer (kg/m^3)
amount of coarse aggregate (kg/m^3)
amount of fine aggregate (kg/m^3)
age of concrete in days
compressive strength measured in MPa
Concrete is a subset of ConcreteAll.
Data were obtained from the Machine Learning Repository (https://archive.ics.uci.edu/ml/) where they were deposited by I-Cheng Yeh ([email protected]) who retains the copyright for these data.
I-Cheng Yeh (1998), "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808.
data(Concrete)
Temperature of a mug of water as it cools.
data(CoolingWater1)
data(CoolingWater2)
data(CoolingWater3)
data(CoolingWater4)
A data frame with the following variables.
time
time in seconds
temp
temperature in Celsius (CoolingWater1, CoolingWater2) or Fahrenheit (CoolingWater3, CoolingWater4)
These data were collected by Stan Wagon and his students at Macalester College to explore Newton's Law of Cooling and the ways that the law fails to capture all of the physics involved in cooling water. CoolingWater1 and CoolingWater2 appeared in a plot in Wagon (2013) and were (approximately) extracted from the plot. CoolingWater3 and CoolingWater4 appeared in a plot in Wagon (2005).
The data in CoolingWater2 and CoolingWater4 were collected with a film of oil on the surface of the water to minimize evaporation.
R. Portmann and S. Wagon. "How quickly does hot water cool?" Mathematica in Education and Research, 10(3):1-9, July 2005.
R. Israel, P. Saltzman, and S. Wagon. "Cooling coffee without solving differential equations". Mathematics Magazine, 86(3):204-210, 2013.
data(CoolingWater1)
data(CoolingWater2)
data(CoolingWater3)
data(CoolingWater4)
if (require(ggformula)) {
  gf_line(temp ~ time, color = ~ condition, data = rbind(CoolingWater1, CoolingWater2))
}
if (require(ggformula)) {
  gf_line(temp ~ time, color = ~ condition, data = rbind(CoolingWater3, CoolingWater4))
}
William Gosset analyzed data from an experiment comparing the yield of regular and kiln-dried corn.
A data frame with 11 observations on the following 2 variables.
yield of regular corn (lbs/acre)
yield of kiln-dried corn (lbs/acre)
Gosset (Student) reported on the results of seeding plots with two different kinds of seed. Each type of seed (regular and kiln-dried) was planted in adjacent plots, accounting for 11 pairs of "split" plots.
These data are also available at DASL, the data and story library (https://dasl.datadescription.com/).
W.S. Gosset, "The Probable Error of a Mean," Biometrika, 6 (1908), pp 1-25.
Corn2 <- stack(Corn)
names(Corn2) <- c('yield', 'treatment')
lm(yield ~ treatment, data = Corn2)
t.test(yield ~ treatment, data = Corn2)
t.test(Corn$reg, Corn$kiln)
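Because Gosset's plots were planted in adjacent pairs, a paired analysis is also natural. A short sketch using the same columns as the example above:

# paired t-test respects the split-plot pairing described above
t.test(Corn$reg, Corn$kiln, paired = TRUE)
# equivalently, a one-sample t-test on the differences
t.test(Corn$kiln - Corn$reg)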
Cuckoos are known to lay their eggs in the nests of other (host) birds. The eggs are then adopted and hatched by the host birds. These data were originally collected by O. M. Latter in 1902 to see how the size of a cuckoo egg is related to the species of the host bird.
A data frame with 120 observations on the following 2 variables.
length of egg (mm)
a factor with levels hedge sparrow, meadow pipet, pied wagtail, robin, tree pipet, wren
L.H.C. Tippett, The Methods of Statistics, 4th Edition, John Wiley and Sons, Inc., 1952, p. 176.
These data are also available from DASL, the data and story library (https://dasl.datadescription.com/).
data(Cuckoo)
gf_boxplot(length ~ species, data = Cuckoo)
A famous example of Simpson's paradox.
A data frame with 326 observations.
a subject id
a factor with levels Bl, Wh
a factor with levels Bl, Wh
a factor with levels Yes, No
a factor with levels death, other
Radelet, M. (1981). Racial characteristics and imposition of the death penalty. American Sociological Review, 46:918–927.
tally(penalty ~ defendant, data = DeathPenalty)
tally(penalty ~ defendant + victim, data = DeathPenalty)
The data come from an experiment to determine how terminal velocity depends on the mass of the falling object. A helium balloon was rigged with a small basket and just enough ballast to make it neutrally buoyant. Mass was then added, and the terminal velocity was calculated by measuring the time it took to fall between two sensors once terminal velocity had been reached. Larger masses were dropped from greater heights and used more widely spaced sensors.
A data frame with 42 observations on the following 5 variables.
time (in seconds) to travel between two sensors
net mass (in kg) of falling object
distance (in meters) between two sensors
average velocity (in m/s) computed from time and height
calculated drag force (in N, force.drag = mass * 9.8) using the fact that at terminal velocity the drag force is equal to the force of gravity
Calvin College physics students under the supervision of Professor Steve Plath.
data(Drag)
with(Drag, force.drag / mass)
gf_point(velocity ~ mass, data = Drag)
The effect of a single 600 mg dose of ascorbic acid versus a sugar placebo on the muscular endurance (as measured by repetitive grip strength trials) of fifteen male volunteers (19-23 years old).
A data frame with 15 observations on the following 5 variables.
number of repetitions until reaching 50% of maximal grip strength after taking vitamin C
which treatment was done first, a factor with levels Placebo, Vitamin
number of repetitions until reaching 50% of maximal grip strength after taking placebo
Three initial maximal contractions were performed for each subject, with the greatest value indicating maximal grip strength. Muscular endurance was measured by having the subjects squeeze the dynamometer, hold the contraction for three seconds, and repeat continuously until a value of 50% of maximal grip strength was achieved for three consecutive contractions. Endurance was defined as the number of repetitions required to go from maximal grip strength to that 50% level. Subjects were given positive verbal encouragement in an effort to have them complete as many repetitions as possible.
The study was conducted in a double-blind manner with crossover.
These data are available from OzDASL, the Australasian data and story library (https://dasl.datadescription.com/).
Keith, R. E., and Merrill, E. (1983). The effects of vitamin C on maximum grip strength and muscular endurance. Journal of Sports Medicine and Physical Fitness, 23, 253-256.
data(Endurance)
t.test(Endurance$vitamin, Endurance$placebo, paired = TRUE)
t.test(log(Endurance$vitamin), log(Endurance$placebo), paired = TRUE)
t.test(1/Endurance$vitamin, 1/Endurance$placebo, paired = TRUE)
gf_qq( ~ vitamin - placebo, data = Endurance)
gf_qq( ~ log(vitamin) - log(placebo), data = Endurance)
gf_qq( ~ 1/vitamin - 1/placebo, data = Endurance)
A cross-tabulation of whether a student smokes and how many of his or her parents smoke from a study conducted in the 1960's.
A data frame with 5375 observations on the following 2 variables.
a factor with levels DoesNotSmoke, Smokes
a factor with levels NeitherSmokes, OneSmokes, BothSmoke
S. V. Zagona (ed.), Studies and issues in smoking behavior, University of Arizona Press, 1967.
The data also appear in
Brigitte Baldi and David S. Moore, The Practice of Statistics in the Life Sciences, Freeman, 2009.
data(FamilySmoking)
xchisq.test( tally(parents ~ student, data = FamilySmoking) )
This data frame gives the number of fumbles by each NCAA FBS team for the first three weeks in November, 2010.
A data frame with 120 observations on the following 7 variables.
NCAA football team
rank based on fumbles per game through games on November 26, 2010
number of wins through games on November 26, 2010
number of losses through games on November 26, 2010
number of fumbles on November 6, 2010
number of fumbles on November 13, 2010
number of fumbles on November 20, 2010
The fumble counts listed here are total fumbles, not fumbles lost. Some of these fumbles were recovered by the team that fumbled.
https://www.teamrankings.com/college-football/stat/fumbles-per-game
data(Fumbles)
m <- max(Fumbles$week1)
table(factor(Fumbles$week1, levels = 0:m))
favstats( ~ week1, data = Fumbles)
# compare with Poisson distribution
cbind(
  fumbles = 0:m,
  observedCount = table(factor(Fumbles$week1, levels = 0:m)),
  modelCount = 120 * dpois(0:m, mean(Fumbles$week1)),
  observedPct = table(factor(Fumbles$week1, levels = 0:m)) / 120,
  modelPct = dpois(0:m, mean(Fumbles$week1))
) %>% signif(3)

showFumbles <- function(x, lambda = mean(x), ...) {
  result <-
    gf_dhistogram( ~ x, binwidth = 1, alpha = 0.3) %>%
    gf_dist("pois", lambda = lambda)
  print(result)
  return(result)
}
showFumbles(Fumbles$week1)
showFumbles(Fumbles$week2)
showFumbles(Fumbles$week3)
geolm creates a graphical representation of the fit of a linear model.
geolm(formula, data = parent.env(), type = "xz", version = 1, plot = TRUE, ...)

to2d(x, y, z, type = NULL, xas = c(0.4, -0.3), yas = c(1, 0), zas = c(0, 1))
formula |
a formula as used in |
data |
a data frame as in |
type |
character: indicating the type of projection to use to collapse multi-dimensional data space into two dimensions of the display. |
version |
an integer (currently |
plot |
a logical: should the plot be displayed? |
... |
other arguments passed to |
x, y, z |
numeric. |
xas, yas, zas |
numeric vector of length 2 indicating the projection of |
Randall Pruim
lm.
geolm(pollution ~ location, data = AirPollution)
geolm(distance ~ projectileWt, data = Trebuchet2)
The order of the resulting factor is determined by the order in which unique labels first appear in the vector or factor x.
givenOrder(x)
x |
a vector or factor to be converted into an ordered factor. |
givenOrder(c("First", "Second", "Third", "Fourth", "Fifth", "Sixth"))
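By way of contrast, factor() orders levels alphabetically by default. A small sketch of the difference:

x <- c("Second", "First", "Third")
levels(factor(x))       # alphabetical: "First" "Second" "Third"
levels(givenOrder(x))   # order of first appearance: "Second" "First" "Third"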
Allan Rossman used to live on a golf course in a spot where dozens of balls would come into his yard every week. He collected the balls and eventually tallied up the numbers on the first 500 golf balls he collected. Of these, 486 bore the number 1, 2, 3, or 4. The remaining 14 golf balls were omitted from the data.
The format is: num [1:4] 137 138 107 104
Data collected by Allan Rossman in Carlisle, PA.
data(golfballs)
golfballs / sum(golfballs)
chisq.test(golfballs, p = rep(0.25, 4))
In a 1979 study by Bishop and Heberlein, 237 hunters were each offered one of 11 cash amounts (bids) ranging from $1 to $200 in return for their hunting permits. The data record, for each bid amount, how many hunters kept and how many sold their permit.
A data frame with 11 rows and 5 columns.
Each row corresponds to a bid (in US dollars) offered for a goose permit. The columns keep and sell indicate how many hunters offered that bid kept or sold their permit, respectively. n is the sum of keep and sell, and prop_sell is the proportion that sold.
Bishop and Heberlein (Amer. J. Agr. Econ. 61, 1979).
goose.mod <- glm(cbind(sell, keep) ~ log(bid), data = GoosePermits, family = binomial())

gf_point(0 ~ bid, size = ~ keep, color = "gray50", data = GoosePermits) %>%
  gf_point(1 ~ bid, size = ~ sell, color = "navy") %>%
  gf_function(fun = makeFun(goose.mod)) %>%
  gf_refine(guides(size = "none"))

ggplot(data = GoosePermits) +
  geom_point(aes(x = bid, y = 0, size = keep), colour = "gray50") +
  geom_point(aes(x = bid, y = 1, size = sell), colour = "navy") +
  stat_function(fun = makeFun(goose.mod)) +
  guides(size = "none")

gf_point((sell / (sell + keep)) ~ bid, data = GoosePermits,
         size = ~ sell + keep, color = "navy") %>%
  gf_function(fun = makeFun(goose.mod)) %>%
  gf_text(label = ~ as.character(sell + keep), colour = "white", size = 3) %>%
  gf_refine(scale_size_area()) %>%
  gf_labs(y = "probability of selling")

ggplot(data = GoosePermits) +
  stat_function(fun = makeFun(goose.mod)) +
  geom_point(aes(x = bid, y = sell / (sell + keep), size = sell + keep), colour = "navy") +
  geom_text(aes(x = bid, y = sell / (sell + keep), label = as.character(sell + keep)),
            colour = "white", size = 3) +
  scale_size_area() +
  labs(y = "probability of selling")
GPA, ACT, and SAT scores for a sample of students.
A data frame with 271 observations on the following 4 variables.
ACT score
college grade point average
SAT mathematics score
SAT verbal score
data(GPA)
splom(GPA)
Two identical footballs, one air-filled and one helium-filled, were used outdoors on a windless day at The Ohio State University's athletic complex. Each football was kicked 39 times and the two footballs were alternated with each kick. The experimenter recorded the distance traveled by each ball.
A data frame with 39 observations on the following 3 variables.
trial number
distance traveled by air-filled football (yards)
distance traveled by helium-filled football (yards)
These data are available from DASL, the data and story library (https://dasl.datadescription.com/).
Lafferty, M. B. (1993), "OSU scientists get a kick out of sports controversy", The Columbus Dispatch (November, 21, 1993), B7.
data(HeliumFootballs)
gf_point(helium ~ air, data = HeliumFootballs)
gf_dhistogram( ~ (helium - air), data = HeliumFootballs,
               fill = ~ (helium > air), bins = 15, boundary = 0)
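Since the two footballs were kicked alternately in matched trials, a paired comparison is a natural analysis. A brief sketch using the variables above:

# paired t-test of helium minus air distances
t.test(HeliumFootballs$helium, HeliumFootballs$air, paired = TRUE)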
This data set contains the results of an experiment comparing the efficacy of different forms of dry ice application in reducing the temperature of the calf muscle.
The 12 subjects in this study came three times, at least four days apart,
and received one of three ice treatments (cubed ice, crushed ice, or ice
mixed with water). In each case, the ice was prepared in a plastic bag and
applied dry to the subject's calf muscle. The temperature measurements were
taken on the skin surface and inside the calf muscle (via a 4 cm long probe)
every 30 seconds for 20 minutes prior to icing, for 20 minutes during icing,
and for 2 hours after the ice had been removed. The temperature
measurements are stored in variables that begin with b (baseline), t (treatment), or r (recovery), followed by a numerical code for the elapsed time formed by concatenating the number of minutes and seconds. For example, r1230 contains the temperatures 12 minutes and 30 seconds after the ice had been removed.
Variables include
identification number
a factor with levels female, male
weight of subject (kg)
height of subject (cm)
skinfold thickness
calf diameter (cm)
age of subject
a factor with levels intramuscular, surface
a factor with levels crushed, cubed, wet
baseline temperature at time 0
baseline temperature 30 seconds after start
baseline temperature 1 minute after start
baseline temperature 19 minutes 30 seconds after start
treatment temperature at beginning of treatment
treatment temperature 30 seconds after start of treatment
treatment temperature 1 minute after start of treatment
treatment temperature 19 minutes 30 seconds after start of treatment
recovery temperature at start of recovery
recovery temperature 30 seconds after start of recovery
recovery temperature 1 minute after start of recovery
recovery temperature 120 minutes after start of recovery
Dykstra, J. H., Hill, H. M., Miller, M. G., Michael T. J., Cheatham, C. C., and Baker, R.J., Comparisons of cubed ice, crushed ice, and wetted ice on intramuscular and surface temperature changes, Journal of Athletic Training 44 (2009), no. 2, 136–141.
data(Ice)
gf_point(weight ~ skinfold, color = ~ sex, data = Ice)

if (require(readr) && require(tidyr)) {
  Ice2 <-
    Ice %>%
    gather("key", "temp", b0:r12000) %>%
    separate(key, c("phase", "time"), sep = 1) %>%
    mutate(time = parse_number(time), subject = as.character(subject))
  gf_line(temp ~ time, data = Ice2 %>% filter(phase == "t"),
          color = ~ sex, group = ~ subject, alpha = 0.6) %>%
    gf_facet_grid(treatment ~ location)
}
The article developed four measures of central bank independence and explored their relation to inflation outcomes in developed and developing countries. This datafile deals with two of these measures in 23 nations.
A data frame with 23 observations on the following 5 variables.
country where data were collected
questionnaire index of independence
annual inflation rate, 1980-1989 (percent)
legal index of independence
developed (1) or developing (2) nation
These data are available from OzDASL, the Australasian Data and Story Library (https://dasl.datadescription.com/).
A. Cukierman, S.B. Webb, and B. Negapi, "Measuring the Independence of Central Banks and Its Effect on Policy Outcomes," World Bank Economic Review, Vol. 6 No. 3 (Sept 1992), 353-398.
data(Inflation)
Extract information from a maxLik object
information(object, ...)
object |
an object of class |
... |
additional arguments |
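A minimal sketch of one way information() might be used, assuming the maxLik package is available; the data and starting value here are hypothetical:

if (require(maxLik)) {
  set.seed(123)
  x <- rexp(50, rate = 2)                                   # hypothetical data
  loglik <- function(lambda) sum(dexp(x, rate = lambda, log = TRUE))
  fit <- maxLik(loglik, start = c(lambda = 1))
  information(fit)                                          # information at the MLE
}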
The number of points scored by Michael Jordan in each game of the 1986-87 regular season.
A data frame with 82 observations on the following 2 variables.
a numeric vector
a numeric vector
data(Jordan8687)
gf_qq( ~ points, data = Jordan8687)
Subjects were students in grades 4-6 from three school districts in Michigan. Students were selected from urban, suburban, and rural school districts with approximately 1/3 of their sample coming from each district. Students indicated whether good grades, athletic ability, or popularity was most important to them. They also ranked four factors: grades, sports, looks, and money, in order of their importance for popularity. The questionnaire also asked for gender, grade level, and other demographic information.
A data frame with 478 observations on the following 11 variables.
a factor with levels boy
girl
grade in school
student age
a factor with levels other, White
a factor with levels Rural, Suburban, Urban
a factor with levels Brentwood Elementary, Brentwood Middle, Brown Middle, Elm, Main, Portage, Ridge, Sand, Westdale Middle
a factor with levels Grades, Popular, Sports
rank of 'make good grades' (1 = most important for popularity; 4 = least important)
rank of 'being good at sports' (1 = most important for popularity; 4 = least important)
rank of 'being handsome or pretty' (1 = most important for popularity; 4 = least important)
rank of 'having lots of money' (1 = most important for popularity; 4 = least important)
These data are available at DASL, the data and story library (https://dasl.datadescription.com/).
Chase, M. A., and Dummer, G. M. (1992), "The Role of Sports as a Social Determinant for Children," Research Quarterly for Exercise and Sport, 63, 418-424.
data(Kids)
tally(goals ~ urban.rural, data = Kids)
chisq.test(tally(~ goals + urban.rural, data = Kids))
These data are from a little survey given to a number of students in introductory statistics courses. Several of the items were prepared in multiple versions and distributed randomly to the students.
A data frame with 279 observations on the following 20 variables.
a number between 1 and 30
which version of the 'favorite color' question was on the survey. A factor with levels v1, v2
favorite color if among predefined choices. A factor with levels black, green, other, purple, red
favorite color if not among choices above.
which version of the 'favorite animal' question was on the survey. A factor with levels v1, v2
favorite animal if among predefined choices. A factor with levels elephant, giraffe, lion, other.
favorite animal if not among the predefined choices.
which version of the 'pulse' question was on the survey
self-reported pulse
which of three versions of the TV question was on the survey
a factor with levels <1, >4, >8, 1-2, 2-4, 4-8, none, other
a numeric vector
which of two versions of the 'surprise' question was on the survey
a factor with levels no, yes
which of two versions of the 'play' question was on the survey
a factor with levels no, yes
which of two versions of the 'disease' question was on the survey
a factor with levels A, B
which of two versions of the 'homework' question was on the survey
a factor with levels A, B
1.1. Write down any number between 1 and 30 (inclusive).
2.1. What is your favorite color? Choices: black; red; green; purple; other
2.2. What is your favorite color?
3.1. What is your favorite zoo animal? Choices: giraffe; lion; elephant; other
3.2. What is your favorite zoo animal?
4.1. Measure and record your pulse.
5.1. How much time have you spent watching TV in the last week?
5.2. How much time have you spent watching TV in the last week? Choices: none; under 1 hour; 1-2 hours; 2-4 hours; more than 4 hours
5.3. How much time have you spent watching TV in the last week? Choices: under 1 hour; 1-2 hours; 2-4 hours; 4-8 hours; more than 8 hours
6.1. Social science researchers have conducted extensive empirical studies and concluded that the expression "absence makes the heart grow fonder" is generally true. Do you find this result surprising or not surprising?
6.2. Social science researchers have conducted extensive empirical studies and concluded that the expression "out of sight out of mind" is generally true. Do you find this result surprising or not surprising?
7.1. Suppose that you have decided to see a play for which the admission charge is $20 per ticket. As you prepare to purchase the ticket, you discover that you have lost a $20 bill. Would you still pay $20 for a ticket to see the play?
7.2. Suppose that you have decided to see a play for which the admission charge is $20 per ticket. As you prepare to enter the theater, you discover that you have lost your ticket. Would you pay $20 to buy a new ticket to see the play?
8.1. Suppose that the United States is preparing for the outbreak of an unusual Asian disease that is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows: If program A is adopted, 200 people will be saved. If program B is adopted, there is a 1/3 probability that 600 people will be saved and a 2/3 probability that nobody will be saved. Which of the two programs would you favor?
8.2. Suppose that the United States is preparing for the outbreak of an unusual Asian disease that is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows:
If program A is adopted, 400 people will die. If program B is adopted, there is a 1/3 probability that no one will die and a 2/3 probability that all 600 people will die. Which of the two programs would you favor? A or B
9.1. A national survey of college students revealed that professors at this college assign "significantly more homework than the nationwide average for an institution of its type." How does this finding compare with your experience? Choices: A. That sounds about right to me; B. That doesn't sound right to me.
9.2. A national survey of college students revealed that professors at this college assign an amount of homework that "is fairly typical for institutions of its type." How does this finding compare with your experience? Choices: A. That sounds about right to me; B. That doesn't sound right to me.
data(LittleSurvey)
tally(surprise ~ surprisever, data = LittleSurvey)
tally(disease ~ diseasever, data = LittleSurvey)
In this experiment, hyperactive and control students were given a mathematics test in either a quiet or loud testing environment.
A data frame with 40 observations on the following 3 variables.
score on a mathematics test
a factor with levels hi, lo
a factor with levels control, hyper
Sydney S. Zentall and Jandira H. Shaw, Effects of classroom noise on performance and activity of second-grade hyperactive and control children, Journal of Educational Psychology 72 (1980), no. 6, 830.
data(MathNoise)
xyplot(score ~ noise, data = MathNoise, group = group, type = 'a',
       auto.key = list(columns = 2, lines = TRUE, points = FALSE))
gf_jitter(score ~ noise, data = MathNoise, color = ~ group,
          alpha = 0.4, width = 0.1, height = 0) %>%
  gf_line(score ~ noise, data = MathNoise, color = ~ group,
          group = ~ group, stat = "summary")
This version of maxLik stores additional information in the returned object, enabling a plot method.
maxLik2(loglik, ..., env = parent.frame())
loglik |
a log-likelihood function as for |
... |
additional arguments passed to |
env |
an environment in which to evaluate |
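A minimal sketch of maxLik2() in use, assuming the maxLik package is available and that additional arguments (such as start) are passed through to maxLik(); the binomial counts below are hypothetical:

if (require(maxLik)) {
  # log-likelihood for a binomial proportion: 23 successes in 60 trials (hypothetical)
  loglik <- function(pi) 23 * log(pi) + 37 * log(1 - pi)
  fit <- maxLik2(loglik, start = c(pi = 0.5))
  plot(fit)                  # plot method enabled by maxLik2
}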
Individual player statistics for the 2004-2005 Michigan Intercollegiate Athletic Association basketball season.
A data frame with 134 observations on the following 27 variables.
jersey number
player's name
games played
games started
minutes played
average minutes played per game
field goals made
field goals attempted
field goal percentage
3-point field goals made
3-point field goals attempted
3-point field goal percentage
free throws made
free throws attempted
free throw percentage
offensive rebounds
defensive rebounds
total rebounds
rebounds per game
personal fouls
games fouled out
assists
turnovers
blocked shots
steals
points scored
points per game
MIAA sports archives (https://www.miaa.org/)
data(MIAA05)
gf_histogram( ~ FTPct, data = MIAA05)
Team batting statistics, runs allowed, and runs scored for the 2004 Major League Baseball season.
A data frame with 30 observations on the following 20 variables.
team city, a factor
League, a factor with levels AL, NL
number of wins
number of losses
number of games
number of runs scored
opponents' runs (number of runs allowed)
run difference (R minus OR)
number of at bats
number of hits
number of doubles
number of triples
number of home runs
number of walks (bases on balls)
number of strike outs
number of stolen bases
number of times caught stealing
batting average
slugging percentage
on base average
data(MLB2004)
gf_point(W ~ Rdiff, data = MLB2004)
Results of NCAA basketball games
Nine variables describing NCAA Division I basketball games.
date on which game was played
visiting team
visiting team's score
home team
home team's score
code indicating games played at neutral sites (n or N) or in tournaments (T)
where game was played
a character indicating which season the game belonged to
a logical indicating whether the game is a postseason game
data(NCAAbb)
# select one year and add some additional variables to the data frame
NCAA2010 <-
  NCAAbb %>%
  filter(season == "2009-10") %>%
  mutate(
    dscore = hscore - ascore,
    homeTeamWon = dscore > 0,
    numHomeTeamWon = -1 + 2 * as.numeric(homeTeamWon),
    winner = ifelse(homeTeamWon, home, away),
    loser = ifelse(homeTeamWon, away, home),
    wscore = ifelse(homeTeamWon, hscore, ascore),
    lscore = ifelse(homeTeamWon, ascore, hscore)
  )
NCAA2010 %>%
  select(date, winner, loser, wscore, lscore, dscore, homeTeamWon) %>%
  head()
Results of National Football League games (2007 season, including playoffs)
A data frame with 267 observations on the following 7 variables.
date on which game was played
visiting team
score for visiting team
home team
score for home team
‘betting line’
'over/under' line (for combined score of both teams)
data(NFL2007)
NFL <- NFL2007
NFL$dscore <- NFL$homeScore - NFL$visitorScore
w <- which(NFL$dscore > 0)
NFL$winner <- NFL$visitor; NFL$winner[w] <- NFL$home[w]
NFL$loser <- NFL$home; NFL$loser[w] <- NFL$visitor[w]
# did the home team win?
NFL$homeTeamWon <- NFL$dscore > 0
table(NFL$homeTeamWon)
table(NFL$dscore > NFL$line)
nlmin and nlmax are thin wrappers around nlm, a non-linear minimizer. nlmax avoids the necessity of modifying the function to construct a minimization problem from a problem that is naturally a maximization problem. The summary method for the resulting objects provides output that is easier for humans to read.
nlmax(f, ...)

nlmin(f, ...)

## S3 method for class 'nlmax'
summary(object, nsmall = 4, ...)

## S3 method for class 'nlmin'
summary(object, nsmall = 4, ...)
f |
a function to optimize |
... |
additional arguments passed to |
object |
an object returned from |
nsmall |
a numeric passed through to |
summary( nlmax( function(x) 5 - 3*x - 5*x^2, p=0 ) )
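A corresponding sketch for nlmin(), minimizing a simple quadratic; as in the example above, the starting value p is passed through to nlm():

summary( nlmin( function(x) (x - 2)^2 + 1, p = 0 ) )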
In order to test the effect of room noise, subjects were given a test under 5 different sets of conditions: 1) no noise, 2) intermittent low volume, 3) intermittent high volume, 4) continuous low volume, and 5) continuous high volume.
A data frame with 50 observations on the following 5 variables.
subject identifier
score on the test
numeric code for condition
a factor with levels high, low, none
a factor with levels continuous, intermittent, none
data(Noise)
Noise2 <- Noise %>% filter(volume != 'none')
model <- lm(score ~ volume * frequency, data = Noise2)
anova(model)
gf_jitter(score ~ volume, data = Noise2, color = ~ frequency,
          alpha = 0.4, width = 0.1, height = 0) %>%
  gf_line(score ~ volume, data = Noise2, group = ~ frequency,
          color = ~ frequency, stat = "summary")
gf_jitter(score ~ frequency, data = Noise2, color = ~ volume,
          alpha = 0.4, width = 0.1, height = 0) %>%
  gf_line(score ~ frequency, data = Noise2, group = ~ volume,
          color = ~ volume, stat = "summary")
The Pallets data set contains data from a firm that recycles pallets. Pallets from warehouses are bought, repaired, and resold. (Repairing a pallet typically involves replacing one or two boards.) The company has four employees who do the repairs. The employer sampled five days for each employee and recorded the number of pallets repaired.
A data frame with 20 observations on the following 3 variables.
number of pallets repaired
a factor with levels A, B, C, D
a factor with levels day1, day2, day3, day4, day5
Michael Stob, Calvin College
data(Pallets)
# Do the employees differ in the rate at which they repair pallets?
pal.lm1 <- lm(pallets ~ employee, data = Pallets)
anova(pal.lm1)
# Now using day as a blocking variable
pal.lm2 <- lm(pallets ~ employee + day, data = Pallets)
anova(pal.lm2)
gf_line(pallets ~ day, data = Pallets, group = ~ employee, color = ~ employee) %>%
  gf_point() %>%
  gf_labs(title = "Productivity by day and employee")
Student-collected data from an experiment investigating the design of paper airplanes.
A data frame with 16 observations on the following 5 variables.
distance plane traveled (cm)
type of paper used
a numeric vector
design of plane (hi performance or simple)
order in which planes were thrown
These data were collected by Stewart Fischer and David Tippetts, statistics students at the Queensland University of Technology in a subject taught by Dr. Margaret Mackisack. Here is their description of the data and its collection:
The experiment decided upon was to see if by using two different designs of paper aeroplane, how far the plane would travel. In considering this, the question arose, whether different types of paper and different angles of release would have any effect on the distance travelled. Knowing that paper aeroplanes are greatly influenced by wind, we had to find a way to eliminate this factor. We decided to perform the experiment in a hallway of the University, where the effects of wind can be controlled to some extent by closing doors.
In order to make the experimental units as homogeneous as possible we allocated one person to a task, so person 1 folded and threw all planes, person 2 calculated the random order assignment, measured all the distances, checked that the angles of flight were right, and checked that the plane release was the same each time.
The factors that we considered each had two levels as follows:
Paper: A4 size, 80g and 50g
Design: High Performance Dual Glider, and Incredibly Simple Glider (patterns attached to original report)
Angle of release: Horizontal, or 45 degrees upward.
The random order assignment was calculated using the random number function of a calculator. Each combination of factors was assigned a number from one to eight, the random numbers were generated and accordingly the order of the experiment was found.
These data are also available at OzDASL, the Australasian Data and Story Library (https://dasl.datadescription.com/).
Mackisack, M. S. (1994). What is the use of experiments conducted by statistics students? Journal of Statistics Education, 2, no 1.
data(PaperPlanes)
Period and pendulum length for a number of string and mass pendulums constructed by physics students. The same mass was used throughout, but the length of the string was varied from 10 cm to 16 m.
A data frame with 27 observations on the following 3 variables.
length of the pendulum (in meters)
average time of period (in seconds) over several swings of the pendulum
an estimate of the accuracy of the length measurement
Calvin College physics students under the direction of Professor Steve Plath.
data(Pendulum)
gf_point(period ~ length, data = Pendulum)
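For a simple pendulum, theory gives period = 2 * pi * sqrt(length / g), so a log-log fit should have slope near 1/2. A brief sketch using the variables above:

# slope of the log-log fit should be close to 0.5
pendulum.lm <- lm(log(period) ~ log(length), data = Pendulum)
coef(pendulum.lm)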
Does having a pet or a friend cause more stress?
A data frame with 45 observations on the following 2 variables.
a factor with levels C(ontrol), F(riend), or P(et)
average heart rate while performing a stressful task
Forty-five women, all self-proclaimed dog-lovers, were randomly divided into three groups of subjects. Each performed a stressful task either alone, with a friend present, or with their dog present. The average heart rate during the task was used as a measure of stress.
K. M. Allen, J. Blascovich, J. Tomaka, and R. M. Kelsey, Presence of human friends and pet dogs as moderators of autonomic responses to stress in women, Journal of Personality and Social Psychology 61 (1991), no. 4, 582–589.
These data also appear in
Brigitte Baldi and David S. Moore, The Practice of Statistics in the Life Sciences, Freeman, 2009.
data(PetStress)
xyplot(rate ~ group, data = PetStress, jitter.x = TRUE, type = c('p', 'a'))
gf_jitter(rate ~ group, data = PetStress, width = 0.1, height = 0) %>%
  gf_line(group = 1, stat = "summary", color = "red")
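A one-way ANOVA comparing the three randomized groups is a natural follow-up; a brief sketch using the variables above:

pet.lm <- lm(rate ~ group, data = PetStress)
anova(pet.lm)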
Phenotype and genotype data from the Finland United States Investigation of NIDDM (type 2) Diabetes (FUSION) study.
Data frames with the following variables.
subject ID number for matching between data sets
a factor with levels case
control
body mass index
a factor with levels F, M
age of subject at time phenotypes were collected
a factor with levels former, never, occasional, regular
total cholesterol
waist circumference (cm)
weight (kg)
height (cm)
waist hip ratio
systolic blood pressure
diastolic blood pressure
RS name of SNP
numeric ID for SNP
first allele coded as 1 = A, 2 = C, 3 = G, 4 = T
second allele coded as 1 = A, 2 = C, 3 = G, 4 = T
both alleles coded as a factor
number of A alleles
number of C alleles
number of G alleles
number of T alleles
Similar to the data presented in
Laura J. Scott, Karen L. Mohlke, Lori L. Bonnycastle, Cristen J. Willer, Yun Li, William L. Duren, Michael R. Erdos, Heather M. Stringham, Peter S. Chines, Anne U. Jackson, Ludmila Prokunina-Olsson, Chia-Jen J. Ding, Amy J. Swift, Narisu Narisu, Tianle Hu, Randall Pruim, Rui Xiao, Xiao-Yi Y. Li, Karen N. Conneely, Nancy L. Riebow, Andrew G. Sprau, Maurine Tong, Peggy P. White, Kurt N. Hetrick, Michael W. Barnhart, Craig W. Bark, Janet L. Goldstein, Lee Watkins, Fang Xiang, Jouko Saramies, Thomas A. Buchanan, Richard M. Watanabe, Timo T. Valle, Leena Kinnunen, Goncalo R. Abecasis, Elizabeth W. Pugh, Kimberly F. Doheny, Richard N. Bergman, Jaakko Tuomilehto, Francis S. Collins, and Michael Boehnke, A genome-wide association study of type 2 diabetes in Finns detects multiple susceptibility variants, Science (2007).
data(Pheno); data(FUSION1); data(FUSION2)
FUSION1m <- merge(FUSION1, Pheno, by = "id", all.x = FALSE, all.y = FALSE)
xtabs( ~ t2d + genotype, data = FUSION1m)
xtabs( ~ t2d + Gdose, data = FUSION1m)
chisq.test(xtabs( ~ t2d + genotype, data = FUSION1m))
f1.glm <- glm(factor(t2d) ~ Gdose, data = FUSION1m, family = binomial)
summary(f1.glm)
This data set contains information collected from rolling the pair of pigs (found in the game "Pass the Pigs") 6000 times.
A data frame with 6000 observations on the following 6 variables.
roll number (1-6000)
numerical code for position of black pig
position of black pig coded as a factor
numerical code for position of pink pig
position of pink pig coded as a factor
score of the roll
height from which pigs were rolled (5 or 8 inches)
starting position of the pigs (0 = both pigs backwards, 1 = one backwards, one forwards, 2 = both forwards)
In "Pass the Pigs", players roll two pig-shaped rubber dice and earn or lose points depending on the configuration of the rolled pigs. Players compete individually to earn 100 points. On each turn, a player rolls he or she decides to stop or until "pigging out" or
The pig configurations and their associated scores are
1 = Dot Up (0)
2 = Dot Down (0)
3 = Trotter (5)
4 = Razorback (5)
5 = Snouter (10)
6 = Leaning Jowler (15)
7 = Pigs are touching one another (-1; lose all points)
One pig Dot Up and one Dot Down ends the turn (a "pig out") and results in 0 points for the turn. If the pigs touch, the turn is ended and all points for the game must be forfeited. Two pigs in the Dot Up or Dot Down configuration score 1 point. Otherwise, the scores of the two pigs in different configurations are added together. The score is doubled if both pigs have the same configuration, so, for example, two Snouters are worth 40 rather than 20.
John C. Kern II, Duquesne University ([email protected])
data(Pigs)
tally( ~ black, data = Pigs)
if (require(tidyr)) {
  Pigs %>%
    select(roll, black, pink) %>%
    gather(pig, state, black, pink) %>%
    tally(state ~ pig, data = ., format = "prop", margins = TRUE)
}
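The scoring rules above can be expressed as a small function. The helper below is only an illustrative sketch (pigScore() is not part of the package); it assumes the numerical position codes 1-7 described above, with 7 indicating touching pigs:

pigScore <- function(p1, p2) {
  base <- c(0, 0, 5, 5, 10, 15)                    # scores for codes 1-6
  if (p1 == 7 || p2 == 7) return(-1)               # touching pigs: points forfeited
  if (all(sort(c(p1, p2)) == c(1, 2))) return(0)   # one Dot Up + one Dot Down: pig out
  if (p1 == p2) {
    if (p1 <= 2) return(1)                         # double Dot Up or Dot Down scores 1
    return(2 * base[p1])                           # other doubles score double
  }
  base[p1] + base[p2]                              # otherwise add the two scores
}
pigScore(5, 5)   # two Snouters: 40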
Major League Baseball pitching statistics for the 2005 season.
A data frame with 653 observations on the following 26 variables.
unique identifier for each player
year
for players who played with multiple teams in the same season, stint is increased by one each time the player joins a new team
three-letter identifier for team
league team plays in, coded as AL or NL
wins
losses
games played in
games started
complete games
shut outs
saves recorded
outs recorded (innings pitched, measured in outs rather than innings)
hits allowed
earned runs allowed
home runs allowed
walks (bases on balls) allowed
strike outs
earned run average
intentional walks
wild pitches
number of batters hit by pitch
balks
batters faced pitching
ratio of ground balls to fly balls
runs allowed
data(Pitching2005)
gf_point(IPouts/3 ~ W, data = Pitching2005, ylab = "innings pitched", xlab = "wins")
See maxLik2 and maxLik for how to create the objects this method plots.
## S3 method for class 'maxLik2' plot(x, y, ci = "Wald", hline = FALSE, ...)
x |
an object of class |
y |
ignored |
ci |
a character vector with values among
|
hline |
a logical indicating whether a horizontal line should be added |
... |
additional arguments, currently ignored. |
The data give the survival times (in hours) in a 3 x 4 factorial experiment, the factors being (a) three poisons and (b) four treatments. Each combination of the two factors is used for four animals. The allocation to animals is completely randomized.
A data frame with 48 observations on the following 3 variables.
type of poison (1, 2, or 3)
manner of treatment (1, 2, 3, or 4)
time until death (hours)
These data are also available from OzDASL, the Australasian Data and Story Library (https://dasl.datadescription.com/). (Note: The time measurements of the data at OzDASL are in units of tens of hours.)
Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations (with Discussion). J. R. Statist. Soc. B, 26, 211-252.
Aitkin, M. (1987). Modelling variance heterogeneity in normal regression using GLIM. Appl. Statist., 36, 332-339.
Smyth, G. K., and Verbyla, A. P. (1999). Adjusted likelihood methods for modelling dispersion in generalized linear models. Environmetrics 10, 696-709. http://www.statsci.org/smyth/pubs/ties98tr.html.
data(Poison)
poison.lm <- lm(time ~ factor(poison) * factor(treatment), data = Poison)
plot(poison.lm, which = c(4, 2))
anova(poison.lm)
# improved fit using a transformation
poison.lm2 <- lm(1/time ~ factor(poison) * factor(treatment), data = Poison)
plot(poison.lm2, which = c(4, 2))
anova(poison.lm2)
Investigators studied physical characteristics and ability in 13 football punters. Each volunteer punted a football ten times. The investigators recorded the average distance for the ten punts, in feet. They also recorded the average hang time (time the ball is in the air before the receiver catches it), and a number of measures of leg strength and flexibility.
A data frame with 13 observations on the following 7 variables.
mean distance for 10 punts (feet)
mean hang time (seconds)
right leg strength (pounds)
left leg strength (pounds)
right leg flexibility (degrees)
left leg flexibility (degrees)
overall leg strength (foot-pounds)
These data are also available at OzDASL (https://dasl.datadescription.com/).
"The relationship between selected physical performance variables and football punting ability" by the Department of Health, Physical Education and Recreation at the Virginia Polytechnic Institute and State University, 1983.
data(Punting) gf_point(hang ~ distance, data = Punting)
data(Punting) gf_point(hang ~ distance, data = Punting)
Data from an experiment to see whether flavor and location of rat poison influence the consumption by rats.
A data frame with 20 observations on the following 3 variables.
a numeric vector
a factor with levels bread, butter-vanilla, plain, and roast beef
a factor with levels A, B, C, D, and E
data(RatPoison)
gf_line(consumption ~ flavor, group = ~ location, color = ~ location,
    data = RatPoison) %>%
  gf_point()
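With one observation for each of the 20 flavor-by-location combinations, location can be treated as a blocking factor in an additive two-way model. The sketch below is one such analysis, not code from the book.

data(RatPoison)
# additive two-way model with location as a blocking factor (a sketch)
rat.aov <- aov(consumption ~ flavor + location, data = RatPoison)
summary(rat.aov)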
A matrix of random golf ball numbers simulated using rmultinom(n = 10000, size = 486, prob = rep(0.25, 4)).
data(rgolfballs)
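A comparable matrix can be regenerated directly; the seed below is arbitrary and not from the book.

set.seed(123)  # arbitrary seed, chosen only for reproducibility
sim.golfballs <- rmultinom(n = 10000, size = 486, prob = rep(0.25, 4))
dim(sim.golfballs)  # 4 x 10000: one column per simulated sample of 486 golf balls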
Results of an experiment comparing how far a rubber band travels to the amount it was stretched prior to launch.
A data frame with 16 observations on the following 2 variables.
amount rubber band was stretched before launch
distance rubber band traveled
data(RubberBand)
gf_point(distance ~ stretch, data = RubberBand) %>%
  gf_lm(interval = "confidence")
Subjects were asked to complete a pencil and paper maze when they were smelling a floral scent and when they were not.
A data frame with 21 observations on the following 12 variables.
ID number
a factor with levels F and M
a factor with levels N and Y
opinion of the odor (indiff, neg, or pos)
age of subject (in years)
which treatment was first, scented or unscented
time (in seconds) in first unscented trial
time (in seconds) in second unscented trial
time (in seconds) in third unscented trial
time (in seconds) in first scented trial
time (in seconds) in second scented trial
time (in seconds) in third scented trial
These data are also available at DASL, the data and story library (https://dasl.datadescription.com/).
Hirsch, A. R., and Johnston, L. H. "Odors and Learning," Smell & Taste Treatment and Research Foundation, Chicago.
data(Scent)
summary(Scent)
This command will display and/or execute small snippets of R code from the book Foundations and Applications of Statistics: An Introduction Using R.
snippet(
  name,
  eval = TRUE,
  execute = eval,
  view = !execute,
  echo = TRUE,
  ask = getOption("demo.ask"),
  verbose = getOption("verbose"),
  lib.loc = NULL,
  character.only = FALSE,
  regex = NULL,
  max.files = 10L
)
name: name of snippet.
eval: a logical. An alias for 'execute'.
execute: a logical. If TRUE, the snippet code is executed.
view: a logical. If TRUE, the snippet code is displayed.
echo: a logical. If TRUE, the R input is shown as the snippet code is executed.
ask: a logical (or "default") indicating whether the user should be prompted before each new page of graphical output.
verbose: a logical. If TRUE, additional diagnostic messages are printed.
lib.loc: character vector of directory names of R libraries, or NULL. The default value of NULL corresponds to all libraries currently known.
character.only: logical. If TRUE, 'name' must be a character string.
regex: ignored. Retained for backwards compatibility.
max.files: an integer limiting the number of files retrieved.
snippet works much like demo, but the interface is simplified. Partial matching is used to select snippets, so any unique prefix is sufficient to specify a snippet. Sequenced snippets (identified by trailing 2-digit numbers) will be executed in sequence if a unique prefix to the non-numeric portion is given. To run just one of a sequence of snippets, provide the full snippet name. See the examples.
Randall Pruim
snippet("normal01") # prefix works snippet("normal") # this prefix is ambiguous snippet("norm") # sequence of "histogram" snippets snippet("hist", eval = FALSE, echo = TRUE, view = FALSE) # just one of the "histogram" snippets snippet("histogram04", eval = FALSE, echo = TRUE, view = FALSE) # Prefix too short, but a helpful message is displayed snippet("h", eval = FALSE, echo = TRUE, view = FALSE)
snippet("normal01") # prefix works snippet("normal") # this prefix is ambiguous snippet("norm") # sequence of "histogram" snippets snippet("hist", eval = FALSE, echo = TRUE, view = FALSE) # just one of the "histogram" snippets snippet("histogram04", eval = FALSE, echo = TRUE, view = FALSE) # Prefix too short, but a helpful message is displayed snippet("h", eval = FALSE, echo = TRUE, view = FALSE)
A bar of soap was weighed after showering to see how much soap was used each shower.
A data frame with 15 observations on the following 3 variables.
days since start of soap usage and data collection
weight of bar of soap (in grams)
According to Rex Boggs:
I had a hypothesis that the daily weight of my bar of soap [in grams] in my shower wasn't a linear function, the reason being that the tiny little bar of soap at the end of its life seemed to hang around for just about ever. I wanted to throw it out, but I felt I shouldn't do so until it became unusable. And that seemed to take weeks.
Also I had recently bought some digital kitchen scales and felt I needed to use them to justify the cost. I hypothesized that the daily weight of a bar of soap might be dependent upon surface area, and hence would be a quadratic function ... .
The data ends at day 22. On day 23 the soap broke into two pieces and one piece went down the plughole.
Data collected by Rex Boggs and available from OzDASL (https://dasl.datadescription.com/).
data(Soap)
gf_point(weight ~ day, data = Soap)
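Boggs's surface-area hypothesis can be examined by comparing linear and quadratic fits. The sketch below is one way to do that, not code from the book.

data(Soap)
soap.lin  <- lm(weight ~ day, data = Soap)
soap.quad <- lm(weight ~ poly(day, 2), data = Soap)
anova(soap.lin, soap.quad)  # does the quadratic term improve the fit?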
Measurements of the diameter (in meters) and mass (in kilograms) of a set of steel ball bearings.
A data frame with 12 observations on the following 2 variables.
diameter of bearing (m)
mass of the bearing (kg)
These data were collected by Calvin College physics students under the direction of Steve Plath.
data(Spheres)
gf_point(mass ~ diameter, data = Spheres)
gf_point(mass ~ diameter, data = Spheres) %>%
  gf_refine(scale_x_log10(), scale_y_log10())
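For solid spheres of constant density, mass should scale roughly with the cube of the diameter, so a log-log fit offers a quick check. This sketch is illustrative, not code from the book.

data(Spheres)
sphere.lm <- lm(log(mass) ~ log(diameter), data = Spheres)
coef(sphere.lm)  # slope should be close to 3 if mass is proportional to diameter^3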
This function creates plots showing the "consumption" of residual sum of squares resulting from adding predictors to a model.
SSplot(
  model1,
  model2,
  n = 1,
  col1 = "gray50",
  size1 = 0.6,
  col2 = "navy",
  size2 = 1,
  col3 = "red",
  size3 = 1,
  ...,
  env = parent.frame()
)
model1: a linear model.
model2: a linear model, often one in which rand() is used to add random predictors.
n: an integer specifying how many times to regenerate the random predictors in model2.
col1, col2, col3: colors for the line segments in the plot.
size1, size2, size3: sizes of the line segments in the plot.
...: additional arguments (currently ignored).
env: an environment in which to evaluate the models.
SSplot(
  lm(strength ~ limestone + water, data = Concrete),
  lm(strength ~ limestone + rand(7), data = Concrete),
  n = 50)

## Not run:
SSplot(
  lm(strength ~ water + limestone, data = Concrete),
  lm(strength ~ water + rand(7), data = Concrete),
  n = 1000)
## End(Not run)
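The RSS "consumption" displayed by SSplot can also be read from sequential ANOVA tables. The sketch below uses the same Concrete models as the example above.

# sequential sums of squares show how much RSS each added predictor consumes
anova(lm(strength ~ limestone + water, data = Concrete))
anova(lm(strength ~ water + limestone, data = Concrete))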
An experiment was conducted by students at The Ohio State University in the fall of 1993 to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights.
A data frame with 30 observations on the following 7 variables.
performance order
number of experimenter block
resting heart rate (beats per minute)
final heart rate
height of step (hi or lo)
whether subject stepped fast, medium, or slow
An experiment was conducted by students at The Ohio State University in the fall of 1993 to explore the nature of the relationship between a person's heart rate and the frequency at which that person stepped up and down on steps of various heights. The response variable, heart rate, was measured in beats per minute. There were two different step heights: 5.75 inches (coded as lo) and 11.5 inches (coded as hi). There were three rates of stepping: 14 steps/min (coded as slow), 21 steps/min (coded as medium), and 28 steps/min (coded as fast). This resulted in six possible height/frequency combinations. Each subject performed the activity for three minutes. Subjects were kept on pace by the beat of an electric metronome. One experimenter counted the subject's pulse for 20 seconds before and after each trial. The subject always rested between trials until her or his heart rate returned to close to the beginning rate. Another experimenter kept track of the time spent stepping. Each subject was always measured and timed by the same pair of experimenters to reduce variability in the experiment. Each pair of experimenters was treated as a block.
These data are available at DASL, the data and story library (https://dasl.datadescription.com/).
data(Step)
gf_jitter(HR - restHR ~ freq, color = ~ height, data = Step,
    group = ~ height, height = 0, width = 0.1) %>%
  gf_line(stat = "summary", group = ~ height)
gf_jitter(HR - restHR ~ height, color = ~ freq, data = Step,
    group = ~ freq, height = 0, width = 0.1) %>%
  gf_line(stat = "summary", group = ~ freq)
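A two-factor model for the change in heart rate, consistent with the design described above, might look like the following sketch (not code from the book; the blocking variable is omitted here).

data(Step)
step.aov <- aov((HR - restHR) ~ height * freq, data = Step)
summary(step.aov)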
Results of an experiment on the effect of prior information on the time to fuse random dot stereograms. One group (NV) was given either no information or just verbal information about the shape of the embedded object. A second group (group VV) received both verbal information and visual information (e.g., a drawing of the object).
A data frame with 78 observations on the following 2 variables.
time until subject was able to fuse a random dot stereogram
treatment group: NV (no visual instructions) or VV (visual instructions)
These data are available at DASL, the data and story library (https://dasl.datadescription.com/).
Frisby, J. P. and Clatworthy, J. L., "Learning to see complex random-dot stereograms," Perception, 4, (1975), pp. 173-178.
Cleveland, W. S. Visualizing Data. 1993.
data(Stereogram)
favstats(time ~ group, data = Stereogram)
gf_violin(time ~ group, data = Stereogram, alpha = 0.2, fill = "skyblue") %>%
  gf_jitter(time ~ group, data = Stereogram, height = 0, width = 0.25)
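A two-sample comparison of the fusion times is a natural follow-up; this is a sketch, not code from the book.

data(Stereogram)
t.test(time ~ group, data = Stereogram)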
Standardized test scores and GPAs for 1000 students.
A data frame with 1000 observations on the following 6 variables.
ACT score
SAT score
has the student graduated from college?
college GPA at graduation
high school GPA
year of graduation or expected graduation
data(Students)
gf_point(ACT ~ SAT, data = Students)
gf_point(gradGPA ~ hsGPA, data = Students)
The results from a study comparing different preparation methods for taste test samples.
A data frame with 16 observations on 2 (Taste1) or 4 (TasteTest) variables.
taste score from a group of 50 testers
a factor with levels coarse and fine
a factor with levels hi and lo
a factor with levels A, B, C, and D
The samples were prepared for tasting using either a coarse screen or a fine screen, and with either a high or low liquid content. A total taste score is recorded for each of 16 groups of 50 testers each. Each group had 25 men and 25 women, each of whom scored the samples on a scale from -3 (terrible) to 3 (excellent). The sum of these individual scores is the overall taste score for the group.
E. Street and M. G. Carroll, Preliminary evaluation of a food product, Statistics: A Guide to the Unknown (Judith M. Tanur et al., eds.), Holden-Day, 1972, pp. 220-238.
data(TasteTest)
data(Taste1)
gf_jitter(score ~ scr, data = TasteTest, color = ~ liq,
    width = 0.2, height = 0) %>%
  gf_line(stat = "summary", group = ~ liq)
df_stats(score ~ scr | liq, data = TasteTest)
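The 2 x 2 factorial structure (screen type by liquid content, with four groups per cell) suggests a two-way ANOVA; a sketch, not code from the book:

data(TasteTest)
taste.aov <- aov(score ~ scr * liq, data = TasteTest)
summary(taste.aov)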
This function computes degrees of freedom for a 2-sample t-test from the standard deviations and sample sizes of the two samples.
tdf(sd1, sd2, n1, n2)
sd1: standard deviation of sample 1
sd2: standard deviation of sample 2
n1: size of sample 1
n2: size of sample 2
estimated degrees of freedom for 2-sample t-test
data(KidsFeet, package = "mosaicData")
fs <- favstats(length ~ sex, data = KidsFeet)
fs
t.test(length ~ sex, data = KidsFeet)
tdf(fs[1, 'sd'], fs[2, 'sd'], fs[1, 'n'], fs[2, 'n'])
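For reference, the Welch-Satterthwaite approximation that tdf presumably implements can be computed directly; the helper below is a sketch for comparison, not the package's code.

# Welch-Satterthwaite approximate degrees of freedom (a sketch)
welch_df <- function(sd1, sd2, n1, n2) {
  v1 <- sd1^2 / n1
  v2 <- sd2^2 / n2
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}
welch_df(2, 3, 10, 15)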
Tread wear is estimated by two methods: weight loss and groove wear.
A data frame with 16 observations on the following 2 variables.
estimated wear (1000's of miles) based on weight loss
estimated wear (1000's of miles) based on groove wear
These data are available at DASL, the Data and Story Library (https://dasl.datadescription.com/).
R. D. Stichler, G. G. Richey, and J. Mandel, "Measurement of Treadwear of Commercial Tires", Rubber Age, 73:2 (May 1953).
data(TireWear)
gf_point(weight ~ groove, data = TireWear)
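Because both methods estimate wear for the same 16 tires, a paired comparison is natural; a sketch, not code from the book:

data(TireWear)
t.test(TireWear$weight, TireWear$groove, paired = TRUE)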
Used by Tufte as an example of the importance of context, these data show the traffic fatality rates in New England in the 1950s. Connecticut increased enforcement of speed limits in 1956. In their full context, it is difficult to say if the decline in Connecticut traffic fatalities from 1955 to 1956 can be attributed to the stricter enforcement.
A data frame with 9 observations on the following 6 variables.
a year from 1951 to 1959
number of traffic deaths in Connecticut
deaths per 100,000 in New York
deaths per 100,000 in Connecticut
deaths per 100,000 in Massachusetts
deaths per 100,000 in Rhode Island
Tufte, E. R. The Visual Display of Quantitative Information, 2nd ed. Graphics Press, 2001.
Donald T. Campbell and H. Laurence Ross. "The Connecticut Crackdown on Speeding: Time-Series Data in Quasi-Experimental Analysis", Law & Society Review Vol. 3, No. 1 (Aug., 1968), pp. 33-54.
Gene V. Glass. "Analysis of Data on the Connecticut Speeding Crackdown as a Time-Series Quasi-Experiment" Law & Society Review, Vol. 3, No. 1 (Aug., 1968), pp. 55-76.
data(Traffic)
gf_line(cn.deaths ~ year, data = Traffic)
if (require(tidyr)) {
  TrafficLong <-
    Traffic %>%
    select(-2) %>%
    gather(state, fatality.rate, ny:ri)
  gf_line(fatality.rate ~ year, group = ~ state, color = ~ state,
          data = TrafficLong) %>%
    gf_point(fatality.rate ~ year, group = ~ state, color = ~ state,
             data = TrafficLong) %>%
    gf_lims(y = c(0, NA))
}
Measurements from an experiment that involved firing projectiles with a small trebuchet under different conditions.
Data frames with the following variables.
the object serving as projectile; a factor with levels including bean, big washer, bigWash, BWB, foose, golf, MWB, SWB, tennis ball, and wood
weight of projectile (in grams)
weight of counter weight (in kg)
distance projectile traveled (in cm)
a factor with levels a, b, B, and c describing the configuration of the trebuchet
Trebuchet1 and Trebuchet2 are subsets of Trebuchet restricted to a single value of counterWt.
Data collected by Andrew Pruim as part of a Science Olympiad competition.
data(Trebuchet)
data(Trebuchet1)
data(Trebuchet2)
gf_point(distance ~ projectileWt, data = Trebuchet1)
gf_point(distance ~ projectileWt, data = Trebuchet2)
gf_point(distance ~ projectileWt, color = ~ factor(counterWt),
    data = Trebuchet) %>%
  gf_smooth(alpha = 0.2, fill = ~ factor(counterWt))
These objects are undocumented.
Some are left-overs from a previous version of the book and package. In other cases, the functions are of limited suitability for general use.
Randall Pruim
Unemployment data
data(Unemployment)
A data.frame with 10 observations on the following 4 variables.
unemp: Millions of unemployed people
production: Federal Reserve Board index of industrial production
year: year
iyear: indexed year
Paul F. Velleman and Roy E. Welsch. "Efficient Computing of Regression Diagnostics", The American Statistician, Vol. 35, No. 4 (Nov., 1981), pp. 234-242. (https://www.jstor.org/stable/2683296)
data(Unemployment)
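A scatterplot in the style of the other data set examples in this package might look like the sketch below (not code from the book).

data(Unemployment)
gf_point(unemp ~ production, data = Unemployment)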
Compute vectors associated with 1-way ANOVA
vaov(x, ...)

## S3 method for class 'formula'
vaov(x, data = parent.frame(), ...)
x: a formula.
...: additional arguments.
data: a data frame.
This is primarily designed for demonstration purposes to show how 1-way ANOVA models partition variance. It may not work properly for more complicated models.
A data frame with variables including grandMean, groupMean, ObsVsGrand, STotal, ObsVsGroup, SError, GroupVsGrand, and STreatment. The usual SS terms can be computed from these by summing.
aov(pollution ~ location, data = AirPollution)
vaov(pollution ~ location, data = AirPollution)
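Building on the example above, and assuming the S-prefixed columns hold squared deviations, the usual sums of squares can be recovered by summing those columns; this is a sketch, not code from the book.

v <- vaov(pollution ~ location, data = AirPollution)
# assuming STotal, STreatment, and SError contain squared deviations,
# their column sums give SS Total, SS Treatment, and SS Error
colSums(v[c("STotal", "STreatment", "SError")])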
Alternatives to prop.test and binom.test.
wilson.ci(x, n = 100, conf.level = 0.95)
x: number of 'successes'
n: number of trials
conf.level: confidence level
wald.ci produces Wald confidence intervals. wilson.ci produces Wilson confidence intervals (also called "plus-4" confidence intervals), which are Wald intervals computed from data formed by adding 2 successes and 2 failures. The Wilson confidence intervals have better coverage rates for small samples.
Lower and upper bounds of a two-sided confidence interval.
Randall Pruim
A. Agresti and B. A. Coull, Approximate is better than ‘exact’ for interval estimation of binomial proportions, American Statistician 52 (1998), 119–126.
prop.test(12, 30)
prop.test(12, 30, correct = FALSE)
wald.ci(12, 30)
wilson.ci(12, 30)
wald.ci(12 + 2, 30 + 4)
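To make the "plus-4" relationship concrete, the sketch below computes a Wald interval by hand and then applies the same formula after adding 2 successes and 2 failures, using the standard normal-approximation formula (this is not the package's code).

# Wald interval computed by hand (a sketch)
p.hat <- 12 / 30
z <- qnorm(0.975)
p.hat + c(-1, 1) * z * sqrt(p.hat * (1 - p.hat) / 30)

# "plus-4" version: add 2 successes and 2 failures, then use the Wald formula
p.tilde <- (12 + 2) / (30 + 4)
p.tilde + c(-1, 1) * z * sqrt(p.tilde * (1 - p.tilde) / (30 + 4))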
The labor force participation rate of women in each of 19 U.S. cities in each of two years.
A data frame with 19 observations on the following 3 variables.
name of a U.S. city (coded as a factor with 19 levels)
percent of women in labor force in 1972
percent of women in labor force in 1968
These data are from the United States Department of Labor Statistics and are also available at DASL, the Data and Story Library (https://dasl.datadescription.com/).
data(WorkingWomen)
gf_point(labor72 ~ labor68, data = WorkingWomen)