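I have a set of x positions (a start and an end for each segment) for drawing segments along the x axis to create a custom read map in R. Half the task of plotting these segments is determining their y positions so that no two overlapping segments sit on the same y level. For each segment, I iterate over the y levels, starting from the first, until I reach a level that does not yet contain a segment overlapping the current one. I then record the end position of the current segment on that level and move on to the next segment.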
The actual code works as follows:
    # Dummy data
    # A list of start and end positions for each segment along the x axis. Sorted by start.
    # Passing the function few.reads draws a map in half a second.
    # Passing it many.reads takes about half an hour to complete.
    few.reads <- data.frame(
      start = c(rep(10, 150), rep(16, 100), rep(43, 50)),
      end   = c(rep(30, 150), rep(34, 100), rep(57, 50))
    )

    many.reads <- data.frame(
      start = c(rep(10, 15000), rep(16, 10000), rep(43, 5000)),
      end   = c(rep(30, 15000), rep(34, 10000), rep(57, 5000))
    )

    #---
    # A function to draw a series of overlapping segments (or "reads" in my case) along
    # the x axis. Where reads overlap, they are "stacked" down the y axis
    #---
    drawReads <- function(reads) {

      # sort the reads by their start positions
      reads <- reads[order(reads$start), ]

      # minimum and maximum for x axis
      minstart <- min(reads$start)
      maxend <- max(reads$end)

      # initialise yread: a vector to keep track of used y levels
      # (yread[y] holds the end position of the last segment placed on level y)
      yread <- c(minstart - 1)
      ypos <- c()  # holds the y position of the ith segment

      #---
      # This iteration step is the bottleneck. Worst case, when all reads are stacked
      # on top of each other, it has to iterate over many y levels to find the correct
      # position for the later reads
      #---
      # iterate over segments
      for (r in 1:nrow(reads)) {
        read <- reads[r, ]
        start <- read$start
        placed <- FALSE

        # iterate through yread to find the next available
        # y pos at this x pos (start)
        y <- 1
        while (!placed) {

          if (yread[y] < start) {
            ypos[r] <- y
            yread[y] <- read$end
            placed <- TRUE
          }

          # current y pos is used by another segment, increment
          y <- y + 1

          # initialize another y pos if we're at the end of the list
          if (y > length(yread)) {
            yread[y] <- minstart - 1
          }
        }
      }

      #---
      # This is the plotting step
      # Once we are here the rest of the process is very quick
      #---
      # find the maximum y pos that is used to size up the plot
      maxy <- length(yread)
      miny <- 1

      reads$ypos <- ypos + miny

      print("New Plot...")
      # Now we have all the information, start the plot
      plot.new()
      plot.window(xlim = c(minstart, maxend + ((maxend - minstart) / 10)), ylim = c(1, maxy))

      axis(3, xaxp = c(minstart, maxend, (maxend - minstart) / 10))
      axis(2, yaxp = c(miny, maxy, 3), tick = FALSE, labels = FALSE)

      print("Draw the reads...")
      maxy <- max(reads$ypos)
      segments(reads$start, maxy - reads$ypos, reads$end, maxy - reads$ypos, col = "blue")
    }
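My actual data set is very large and contains regions that can hold up to 600,000 reads. The reads naturally stack on top of one another, so it is easy to hit the worst case, in which all reads overlap each other. The time it takes to plot a large number of reads is unacceptable to me, so I am looking for a way to make the process more efficient. Can I replace my loop with something faster? Is there an algorithm that can arrange the reads more quickly? I really can't think of a better way right now.

Thanks for your help.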
Fill each y level greedily. Once a level is complete, go down one level and never go back up. Pseudocode:
    y <- 1
    while segment-list.not-empty
      i <- 1
      current <- segment-list[i]
      current.plot(y)
      segment-list.remove(i)
      i <- segment-list.find_first_greater(current.end)
      while (i > 0)
        current <- segment-list[i]
        current.plot(y)
        segment-list.remove(i)
        i <- segment-list.find_first_greater(current.end)
      y <- y + 1
This won't necessarily produce an "optimal" plot in any sense, but at least it is O(n log n).
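The level-by-level idea above could be sketched in R roughly as follows. This is only a sketch under stated assumptions: the function name `assign_levels` and the helper vectors `jump` and `nxt` are hypothetical, every segment is assumed to have `end >= start`, and two segments that merely touch (one starts exactly where the other ends) are treated as overlapping, matching the `yread[y] < start` test in the question. Rather than physically removing segments from a list, it marks them as placed and keeps a "next unplaced index" array with path compression to skip over them; a single vectorised `findInterval` call does the binary search for the first start beyond each end.

```r
# Assign each segment a y level by filling one level at a time.
# `start` and `end` are numeric vectors of equal length (assumed end >= start).
# Returns an integer vector of levels in the original input order.
assign_levels <- function(start, end) {
  n <- length(start)
  level <- integer(n)
  if (n == 0L) return(level)

  # Work on the segments sorted by start position.
  ord <- order(start, end)
  s <- start[ord]
  e <- end[ord]

  # jump[k]: index of the first sorted segment whose start is strictly
  # greater than e[k]; n + 1 means there is no such segment.
  jump <- findInterval(e, s) + 1L

  # nxt[i] leads towards the smallest unplaced index >= i.
  # Index n + 1 is a sentinel that always points to itself.
  nxt <- seq_len(n + 1L)

  lev_sorted <- integer(n)
  placed <- 0L
  y <- 0L
  while (placed < n) {
    y <- y + 1L      # open a new level
    i <- 1L          # start from the first segment still unplaced
    repeat {
      # Resolve i to the smallest unplaced index >= i ...
      root <- i
      while (nxt[root] != root) root <- nxt[root]
      # ... and compress the path so later lookups are short.
      while (nxt[i] != root) {
        tmp <- nxt[i]
        nxt[i] <- root
        i <- tmp
      }
      if (root > n) break          # nothing left that fits on this level
      lev_sorted[root] <- y
      placed <- placed + 1L
      nxt[root] <- root + 1L       # mark as placed ("removed")
      i <- jump[root]              # first segment starting after this one ends
    }
  }

  level[ord] <- lev_sorted
  level
}

# Example: the middle segment overlaps the first, the last one does not.
reads <- data.frame(start = c(10, 16, 43), end = c(30, 34, 57))
reads$ypos <- assign_levels(reads$start, reads$end)
# reads$ypos is 1, 2, 1
```

The returned vector could stand in for the `ypos` vector that the loop in `drawReads` builds (there, computed on the sorted data frame), with the plotting step left unchanged. Because the first-fit loop in the question also puts a segment on level 1 exactly when it starts after the last segment already on level 1, and so on for each later level, this sketch should produce the same levels as that loop, only without scanning every occupied level for each read.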