I'm trying to load in a small number of fields from a tab-separated file with many more unused fields using fread in the data.table package.
To this end, I'm using the select option, which works great for reading in the columns.
However, when I don't specify the classes of the various fields, the automated selector doesn't work (most/all of the numeric variables end up being read as numerically tiny numbers like 1.896916e-316).
To fix this, my first instinct was to change the code from:
    DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"))
to
    DT <- fread("data.txt", select = c("V1", "V2", ..., "Vn"), colClasses = c("numeric", ..., "character"))
i.e., to match the select character vector with a colClasses character vector of equal length, with (obviously) the type of the i-th selected field in select equal to the i-th element of colClasses.
However, fread doesn't seem to like this--even when select is used, colClasses expects a character vector with as many fields as the WHOLE file:
Error in fread("data.txt", select = c("V1", "V2", ..., "Vn", : colClasses is unnamed and length 25 but there are 256 columns. See ?data.table for colClasses usage.
This could be fine if I only had to do this with one file--I'd simply pad out the rest of the character vector with "character" (or whatever type) because they're being tossed anyway.
However, I'm planning to repeat this process 13 times or so on files corresponding to other years. They have the same column names, but the columns may appear in a different order (and the number of columns differs from year to year), which ruins the loop-ability (as well as taking a lot more time).
The following worked, but hardly seems efficient (coding-wise):
    DT <- fread("data.txt", select = c("V1", "V2", "V3"), colClasses = c(V1 = "factor", V2 = "character", V3 = "numeric"))
This is a pain because I'm taking 25 columns, so it's a huge block of code being taken up by specifying the column types. I can't take advantage of rep to save space, e.g.
    colClasses = c(rep("character", times = 3), rep("numeric", times = 20))
Any suggestions for making this look/work better?
Here is a preview of the data for reference:
         LEAID FIPST NAME                                                   SCHLEV AGCHRT CCDNF GSLO   V33  TOTALREV  TFEDREV
    1: 0100002    01 ALABAMA YOUTH SERVICES                                      N      3     1   03     0        -2       -2
    2: 0100005    01 ALBERTVILLE CITY                                           03      3     1   PK  4143  38394000  6326000
    3: 0100006    01 MARSHALL COUNTY                                            03      3     1   PK  5916  58482000 11617000
    4: 0100007    01 HOOVER CITY                                                03      3     1   PK 13232 154703000 10184000
    5: 0100008    01 MADISON CITY                                               03      3     1   PK  8479  89773000  6648000
   ---
18293: 5680180    56 NORTHEAST WYOMING BOCES                                    07      3     1    N    -2        -2       -2
18294: 5680250    56 REGION V BOCES                                             07      3     1    N    -2        -2       -2
18295: 5680251    56 WYOMING DEPARTMENT OF FAMILY SERVICES                      02      3     1   KG    82        -2       -2
18296: 5680252    56 YOUTH EMERGENCY SERVICES, INC. - ADMINISTRATION OFFICE      N      3     1   07    29        -1       -1
18297: 5680253    56 WYOMING BEHAVIORAL INSTITUTE                                N      N     1   01     0        -2       -2
2 solutions
#1
3
Actually found a solution in a more careful reading of this illustration of the drop/select/colClasses options by Mr. Dowle:
    DT <- fread("data.txt", select = c("V1", "V2", "V3"), colClasses = list(character = c("char_names"), factor = c("factor_names"), numeric = c("numeric_names")))
I didn't realize this before because there were some other problems with my fread attempts due to bad formatting of my .csv file.
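For anyone who wants to try the list form on something runnable, here is a minimal sketch. The temporary file, its contents, and the `types` variable are made up for illustration (only a few column names are borrowed from the preview in the question); `sep = "\t"` is passed explicitly rather than relying on auto-detection.

```r
library(data.table)

# Hypothetical stand-in for one yearly file: tab-separated, with unused columns.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "LEAID\tFIPST\tNAME\tUNUSED1\tTOTALREV\tUNUSED2",
  "0100005\t01\tALBERTVILLE CITY\tx\t38394000\ty",
  "0100006\t01\tMARSHALL COUNTY\tx\t58482000\ty"
), tmp)

# One entry per type; each entry lists the names of the columns of that type.
types <- list(
  character = c("LEAID", "FIPST", "NAME"),
  numeric   = c("TOTALREV")
)

# The same list supplies both the columns to keep and their classes.
DT <- fread(tmp, sep = "\t",
            select = unlist(types, use.names = FALSE),
            colClasses = types)
print(sapply(DT, class))
```

Because the list is keyed by column name rather than position, the same `types` object should be reusable across files whose columns come in a different order, as long as every listed name is present in each file.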
Still, I am wont to call it a bug that the natural approach doesn't work:
    DT <- fread("data.txt", select = c("V1", ..., "Vn"), colClasses = c("type1", ..., "typen"))
#2
1
Perhaps something along these lines:
    # Read only the header line and split it into column names
    # (readLines takes the file as its first argument, `con`; it has no `file` argument)
    varnames <- strsplit(readLines("data.txt", n = 1), "\t")[[1]]
    valid <- c("LEAID", "FIPST", "NAME", "SCHLEV", "AGCHRT", "CCDNF", "GSLO", "V33", "TOTALREV", "TFEDREV")
    # TRUE for the columns to keep
    colC <- varnames %in% valid
    # Start with every column marked "NULL" (i.e. dropped), then fill in the wanted types
    colCchar <- rep("NULL", length(varnames))
    # Types are assigned in the order the wanted columns appear in the file
    colCchar[colC] <- c(rep("numeric", 2), rep("character", 2), rep("numeric", 2), "character", rep("numeric", 3))
    dt <- fread("data.txt", colClasses = colCchar)
Obviously untested, since the header line with its 200+ columns was not provided. It won't be robust to variation in the order of the target variables, but your problem description did "leave something to be desired". I cannot quite figure out how the column names would be the same but would vary. You may need to use match to get the order of the desired variables.
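One way to get around the ordering issue is to key the types by column name and subset that vector against each file's header. The following is only a sketch under assumptions: the `wanted` vector, the `write_year` and `read_year` helpers, and both temporary files are hypothetical, and it relies on `fread(..., nrows = 0)` returning just the column names.

```r
library(data.table)

# Desired columns and their types, keyed by name (hypothetical subset).
wanted <- c(LEAID = "character", NAME = "character", TOTALREV = "numeric")

# Helper (illustrative only): write lines to a temporary file, return its path.
write_year <- function(lines) {
  f <- tempfile(fileext = ".txt")
  writeLines(lines, f)
  f
}

# Two made-up yearly files: same column names, different order and column count.
files <- c(
  write_year(c("LEAID\tNAME\tEXTRA\tTOTALREV",
               "0100005\tALBERTVILLE CITY\tx\t38394000")),
  write_year(c("TOTALREV\tEXTRA\tEXTRA2\tNAME\tLEAID",
               "58482000\tx\ty\tMARSHALL COUNTY\t0100006"))
)

read_year <- function(f) {
  # Read the header only, then keep the wanted entries present in this file.
  header <- names(fread(f, sep = "\t", header = TRUE, nrows = 0))
  cc <- wanted[names(wanted) %in% header]
  fread(f, sep = "\t", header = TRUE, select = names(cc), colClasses = cc)
}

# Stack the years, matching columns by name rather than by position.
all_years <- rbindlist(lapply(files, read_year), use.names = TRUE)
print(all_years)
```

Since everything is matched by name, neither the column order nor the number of unused columns in a given year should matter.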