我有一个矩阵(垫)与dims“13,20000000”和以下组 [1,] "wildtype" [2,] "wildtype" [3,] "wildtype" [4,] "wildtype" [5,] "wildtype" [6,] "wildtype" [7,] "wildtype" [8,] "wildtype" [9,] "wildtype" [10,] "wildtype" [11,] "mutant" [12,
[1,] "wildtype" [2,] "wildtype" [3,] "wildtype" [4,] "wildtype" [5,] "wildtype" [6,] "wildtype" [7,] "wildtype" [8,] "wildtype" [9,] "wildtype" [10,] "wildtype" [11,] "mutant" [12,] "mutant" [13,] "mutant"
使用以下R代码,我在每个数据点上运行lm()20M次.
lm(mat~groups)非常快.需要很长时间的是使用summary(lm1)为每个模型提取pvalue.
我怎样才能加速提取pvalues?
tvals_out <-'/tmp/tvals_lm.csv' infile <- '/tmp/tempdata.dat' con <- file(infile, "rb") dim <- readBin(con, "integer", 2) mat <- matrix( readBin(con, "numeric", prod(dim)), dim[1], dim[2]) close(con) groups = factor(c(rep('wt', 10), rep('mut', 3))) lm1 <- lm(mat ~ groups) # This is the longest running bit sum_lm1 <- summary(lm1) num_pixels <- dim(mat)[2] result_pvalues <- numeric(num_pixels) result_pvalues <- vapply(sum_lm1, function(x) x$coefficients[,4][2], FUN.VALUE = 1) write.table(result_pvalues, tvals_out, sep=','); outCon <- file(tvals_out, "wb") writeBin(result_pvalues, outCon) close(outCon)
编辑:
我从mat对象中添加了10个数据点的样本数据
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38) mat <- matrix(m, nrow=13)以下函数能够在大约25秒内从13×20,000,000矩阵(与您的矩阵)的拟合中提取p值.
pvalOnly2 <- function(fit) { # get estimates est <- fit$coefficients[fit$qr$pivot, ] # get R: see stats:::summary.lm to see how this is calculated p1 <- 1L:(fit$rank) R <- diag(chol2inv(fit$qr$qr[p1, p1, drop = FALSE])) # get residual sum of squares for each resvar <- colSums(fit$residuals^2) / fit$df.residual # R is same for each coefficient, resvar is same within each model se <- sqrt(outer(R, resvar)) pt(abs(est / se), df = fit$df.residual, lower.tail = FALSE) * 2 }
这计算与调用汇总(或本杰明的pvalOnly函数)相同的p值.但是,它会跳过摘要为每个模型执行的所有其他步骤,从而使其更快. (请注意,Benjamin的pvalOnly调用vcov,后者又调用summary,这就是为什么它不能节省时间).
在一个小矩阵上,这比摘要快约30倍:
m <- c(28, 28, 28, 29, 33, 39, 49, 58, 63,64,30, 27, 24, 20, 17, 19, 33, 49, 56,57,36, 32, 28, 23, 20, 27, 48, 77, 96, 103,27, 26, 26, 23, 21, 23, 33, 46, 53,52,24, 20, 17, 13, 11, 14, 33, 47, 40,32,40, 46, 49, 48, 44, 49, 57, 59, 61,53,22, 24, 26, 32, 38, 39, 44, 53, 59,58,16, 16, 14, 10,7, 14, 34, 55, 62,61,28, 25, 21, 19, 22, 32, 45, 58, 64,61,28, 26, 21, 16, 14, 19, 33, 50, 59,59,17, 16, 15, 14, 17, 25, 38, 54, 61,58,11, 11, 12, 13, 16, 23, 34, 46, 51,45,22, 21, 20, 19, 16, 18, 32, 51, 50,38) mat <- matrix(m, nrow=13) groups <- rep(c("wildtype", "mutant"), times = c(10, 3)) fit <- lm(mat ~ groups) library(microbenchmark) microbenchmark(summary = do.call("cbind", lapply(summary(fit), function(f) coef(f)[, 4])), pvalOnly2(fit))
结果:
Unit: microseconds expr min lq mean median uq max neval cld summary 3383.085 3702.238 3978.110 3919.0755 4147.4015 5475.223 100 b pvalOnly2(fit) 81.538 91.541 136.903 137.1275 157.5535 459.415 100 a
然而,当你有更多的模型适合时,速度优势要大得多.在13×1000的矩阵上,它具有大约300倍的优势.在我的机器上,当有2000万列时,它在25秒内计算p值 – 实际上是拟合< -lm(mat~groups)步骤的两倍.
> mat <- mat[, rep(1:10, 2e6)] # just replicating same coefs > dim(mat) [1] 13 20000000 > system.time(fit <- lm(mat ~ groups)) user system elapsed 37.272 10.296 58.372 > system.time(pvals <- pvalOnly2(fit)) user system elapsed 21.945 1.889 24.322
得到的p值是正确的(与您的摘要相同):
> dim(pvals) [1] 2 20000000 > pvals[, 1:10] [,1] [,2] [,3] [,4] [,5] [,6] (Intercept) 0.006048267 0.01234835 0.02655251 0.0004555316 0.001004109 0.01608319 groupswildtype 0.129224604 0.22806894 0.88113522 0.2064583345 0.103624361 0.84642990 [,7] [,8] [,9] [,10] (Intercept) 0.0004630405 0.1386393 0.05107805 5.042796e-05 groupswildtype 0.2717139022 0.1539826 0.66351492 5.942893e-02
(顺便说一下,分析显示函数中的几乎所有运行时间都花在了pt函数中 – 因为在C中完成它的速度和在任何语言中都要快得多).
在回复您的评论时,您还可以使用以下函数获取每个模型的p值(来自F统计量),这与pvalOnly2的速度相似:
modelPvalOnly <- function(fit) { f <- t(fit$fitted.values) if (attr(fit$terms, "intercept")) { mss <- rowSums((f - rowMeans(f)) ^ 2) numdf <- fit$rank - 1 } else { mss <- rowSums(f ^ 2) numdf <- fit$rank } resvar <- colSums(fit$residuals^2) / fit$df.residual fstat <- (mss / numdf) / resvar pval <- pf(fstat, numdf, fit$df.residual, lower.tail = FALSE) pval }