Suppose we have a daily time series of animal activity at a zoo, spanning many years. A subset of the very large dataset might look like this:
    library(data.table)
    type <- c(rep('giraffe',90), rep('monkey',90), rep('anteater',90))
    status <- as.factor(c(rep('display',31), rep('caged',28), rep('display',31),
                          rep('caged',25), rep('display',35), rep('caged',30),
                          rep('caged',10), rep('display',10), rep('caged',10),
                          rep('display',60)))
    date <- rep(seq.Date(as.Date("2001-01-01"), as.Date("2001-03-31"), "day"), 3)
Here `type` is the kind of animal and `status` indicates what the animal was doing that day, e.g. caged or on display.
    animals <- data.table(type, status, date)
    animals
             type  status       date
      1:  giraffe display 2001-01-01
      2:  giraffe display 2001-01-02
      3:  giraffe display 2001-01-03
      4:  giraffe display 2001-01-04
      5:  giraffe display 2001-01-05
     ---
    266: anteater display 2001-03-27
    267: anteater display 2001-03-28
    268: anteater display 2001-03-29
    269: anteater display 2001-03-30
    270: anteater display 2001-03-31
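Suppose we want to aggregate this into a monthly series that summarizes each animal's status over the whole month. In the new series, `status` reflects the animal's status at the start of the month. `fullmonth` is a binary variable (1 = TRUE, 0 = FALSE) indicating whether that status held for the entire month, and `anydisp` is a binary variable (1 = TRUE, 0 = FALSE) indicating whether the animal was on display at any time during the month (>= 1 day). So, because the giraffe was on display for all of January and all of March but caged throughout February, it is flagged accordingly.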
    date <- rep(seq.Date(as.Date("2001-01-01"), as.Date("2001-03-31"), "month"), 3)
    type <- c(rep('giraffe',3), rep('monkey',3), rep('anteater',3))
    status <- as.factor(c('display','caged','display',
                          'caged','display','caged',
                          'caged','display','display'))
    fullmonth <- c(1,1,1,0,1,0,0,1,1)
    anydisp <- c(1,0,1,1,1,1,1,1,1)
    animals2 <- data.table(date, type, status, fullmonth, anydisp)
    animals2
          date     type  status fullmonth anydisp
    2001-01-01  giraffe display         1       1
    2001-02-01  giraffe   caged         1       0
    2001-03-01  giraffe display         1       1
    2001-01-01   monkey   caged         0       1
    2001-02-01   monkey display         1       1
    2001-03-01   monkey   caged         0       1
    2001-01-01 anteater   caged         0       1
    2001-02-01 anteater display         1       1
    2001-03-01 anteater display         1       1
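I thought the zoo package might be the way to go, but after playing with it I found that it does not handle non-numeric values well. Even if I assign arbitrary numeric values to the qualitative component (status), it is not clear how that would solve the problem.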
    ## aggregate function with zoo?
    library(zoo)
    animals$activity <- as.numeric(ifelse(status == 'display', 1, 0))
    animals2 <- subset(animals, select = c(date, activity))
    datas <- zoo(animals2)
    monthlyzoo <- aggregate(datas, as.yearmon, sum)
    Error in Summary.factor(1L, na.rm = FALSE) :
      sum not meaningful for factors
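Does anyone know of a solution using sqldf or data.table?
Update
I want to add a new requirement: the date shown should be the first day of the month, even if the data start later in that month. For example, this dataset illustrates the situation: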
    animals2 <- animals[30:270,]
    head(animals2)
    setkey(animals2, "type", "date")
    oo <- animals2[, list(date = date[1],
                          status = status[1],
                          fullmonth = 1 * all(status == status[1]),
                          anydisplay = any(status == "display") * 1),
                   by = list(month(date), type)][, month := NULL]
    oo
           type       date  status fullmonth anydisplay
    1: anteater 2001-01-30   caged         0          1
    2: anteater 2001-02-01 display         1          1
    3: anteater 2001-03-01 display         1          1
    4:  giraffe 2001-01-01 display         1          1
    5:  giraffe 2001-02-01   caged         1          0
    6:  giraffe 2001-03-01 display         1          1
    7:   monkey 2001-01-01   caged         0          1
    8:   monkey 2001-02-01 display         1          1
    9:   monkey 2001-03-01 display         0          1

    sqldf("select min(date) date, type, status,
                  max(status) = min(status) fullmonth,
                  sum(status = 'display') > 0 anydisp
           from animals2
           group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
           order by type, date")
            date     type  status fullmonth anydisp
    1 2001-01-30 anteater   caged         0       1
    2 2001-02-01 anteater display         1       1
    3 2001-03-01 anteater display         1       1
    4 2001-01-01  giraffe display         1       1
    5 2001-02-01  giraffe   caged         1       0
    6 2001-03-01  giraffe display         1       1
    7 2001-01-01   monkey   caged         0       1
    8 2001-02-01   monkey display         1       1
    9 2001-03-01   monkey   caged         0       1
Any of the solutions can be adapted to this by post-processing the dates:
    dateswitch <- paste(year(animals2$date), month(animals2$date), 1, sep = '/')
    dateswitch <- as.Date(dateswitch, "%Y/%m/%d")
    animals2$date <- as.Date(dateswitch)

Here is an sqldf solution:
    library(sqldf)

    # define input data.frame, where the type, status and date variables
    # are as defined in the question
    animals <- data.frame(type, status, date)

    sqldf("select min(date) date, type, status,
                  max(status) = min(status) fullmonth,
                  sum(status = 'display') > 0 anydisp
           from animals
           group by type, strftime('%Y %m', date * 3600 * 24, 'unixepoch')
           order by type, date")
With the data shown, the output of this command is:
            date     type  status fullmonth anydisp
    1 2001-01-01 anteater   caged         0       1
    2 2001-02-01 anteater display         1       1
    3 2001-03-01 anteater display         1       1
    4 2001-01-01  giraffe display         1       1
    5 2001-02-01  giraffe   caged         1       0
    6 2001-03-01  giraffe display         1       1
    7 2001-01-01   monkey   caged         0       1
    8 2001-02-01   monkey display         1       1
    9 2001-03-01   monkey   caged         0       1
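As a cross-check that needs neither sqldf nor data.table, the same summary can be sketched in base R. This is only an illustration under the question's definitions: `summarise_month` is a hypothetical helper name, and `status` is taken explicitly from the earliest date in each type/month group. With the sample data this yields `display` for the monkey in March (the monkey is on display on 2001-03-01 and caged from 2001-03-02), matching the data.table output in the question's update rather than the `caged` shown above. A likely reason, as far as I understand SQLite, is that a bare `status` column is not guaranteed to come from the `min(date)` row when the query contains several min/max aggregates.

```r
# Rebuild the daily data from the question (base R only).
type <- c(rep("giraffe", 90), rep("monkey", 90), rep("anteater", 90))
status <- c(rep("display", 31), rep("caged", 28), rep("display", 31),
            rep("caged", 25), rep("display", 35), rep("caged", 30),
            rep("caged", 10), rep("display", 10), rep("caged", 10),
            rep("display", 60))
date <- rep(seq(as.Date("2001-01-01"), as.Date("2001-03-31"), by = "day"), 3)
animals <- data.frame(type, status, date, stringsAsFactors = FALSE)

# Hypothetical helper: summarise one type/month group of daily rows.
summarise_month <- function(d) {
  d <- d[order(d$date), ]                      # earliest day first
  data.frame(
    date      = d$date[1],                     # first date present in the month
    type      = d$type[1],
    status    = d$status[1],                   # status on that first date
    fullmonth = as.integer(all(d$status == d$status[1])),
    anydisp   = as.integer(any(d$status == "display")),
    stringsAsFactors = FALSE
  )
}

# Split by animal type and year-month, summarise each group, then recombine.
groups <- split(animals,
                list(animals$type, format(animals$date, "%Y-%m")),
                drop = TRUE)
monthly <- do.call(rbind, lapply(groups, summarise_month))
monthly <- monthly[order(monthly$type, monthly$date), ]
rownames(monthly) <- NULL
monthly
```

Sorting inside the helper makes the "status at the start of the month" rule independent of the row order of the input.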
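Added: The poster later added a further requirement to the question, namely that the date be shown as the first of the month even if the data do not begin until later in that month. If DF is the result of the sqldf statement above, it can be converted with: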
    library(zoo)
    transform(DF, date = as.Date(as.yearmon(date)))
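Or, perhaps better, drop the day part altogether (since showing a date for which there is no data could be misleading) and use the "yearmon" class, which carries only the year and month: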
    library(zoo)
    transform(DF, date = as.yearmon(date))
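If loading zoo is not desirable, the first-of-month conversion can also be sketched in base R by formatting each date with a fixed day component. The vector `d` below is made-up illustration data, not part of the question.

```r
# Illustrative dates (hypothetical), one of which falls late in its month.
d <- as.Date(c("2001-01-30", "2001-02-01", "2001-03-15"))

# Keep year and month, force the day to 01, and parse back to Date.
first_of_month <- as.Date(format(d, "%Y-%m-01"))
first_of_month
```

The same expression could be applied to a `date` column inside `transform()`, provided that column is of class Date.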