Parsing an HTML table with the XML / RCurl R packages, without using the readHTMLTable function

Source: collected from the internet. Publisher: 自由互联. Published: 2021-06-13

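I am trying to scrape/extract data from a single HTML table: http://www.theplantlist.org/tpl/record/kew-419248 and from some very similar pages.
I initially tried the following function to read the table, but it is not ideal because I want to split each species name into its component parts (genus / species / infraspecies / author, etc.).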

library(XML)
readHTMLTable("http://www.theplantlist.org/tpl/record/kew-419248")

I used SelectorGadget to identify a unique XPath (not necessarily the shortest) for each table element I want to extract:

For genus names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "genus", " " ))]

For species names: //*[contains(concat( " ", @class, " " ), concat( " ", "Synonym", " " ))]//*[contains(concat( " ", @class, " " ), concat( " ", "species", " " ))]

For infraspecies rank: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]

For infraspecies names: //*[contains(concat( " ", @class, " " ), concat( " ", "infraspe", " " ))]

For confidence level (image): //*[contains(concat( " ", @class, " " ), concat( " ", "synonyms", " " ))]//img

For sources: //*[contains(concat( " ", @class, " " ), concat( " ", "source", " " ))]//a

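I now want to extract this information into a data frame / table.

I tried using the xpathSApply function from the XML package to extract some of this data:

For example, for the infraspecies rank: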

library(XML)
library(RCurl)
infraspeciesrank = htmlParse(getURL("http://www.theplantlist.org/tpl/record/kew-419248"))
path=' //*[contains(concat( " ", @class, " " ), concat( " ", "infraspr", " " ))]'
xpathSApply(infraspeciesrank, path)

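However, this approach is problematic because the data has gaps (for example, only some rows of the table have an infraspecies rank, so what I get back is a list of the three ranks in the table, with no gaps). The output is also of a class that I cannot attach to a data frame.

Does anyone know a better way to extract the information from this table into a data frame?

Any help would be much appreciated!

Tom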
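
The two problems described in the question can be reproduced on a small inline snippet. This is only a sketch: the HTML below is a hypothetical stand-in whose markup is assumed rather than copied from the site, and the "var." rank values are invented for illustration. It suggests that passing xmlValue to xpathSApply gives a plain character vector, and that querying row by row keeps a slot for rows that have no match.

```r
library(XML)

# Hypothetical three-row table; only rows 2 and 3 carry an infraspecies rank.
html <- '<table class="names synonyms"><tbody>
<tr><td><i class="genus">Critesion</i> <i class="species">chilense</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">chilense</i> <span class="infraspr">var.</span> <i class="infraspe">chilense</i></td></tr>
<tr><td><i class="genus">Hordeum</i> <i class="species">pratense</i> <span class="infraspr">var.</span> <i class="infraspe">brongniartii</i></td></tr>
</tbody></table>'
doc <- htmlParse(html, asText = TRUE)

# Document-wide query: with xmlValue the result is a character vector,
# but rows without a match are skipped, so row alignment is lost.
ranks_all <- xpathSApply(doc, "//*[@class='infraspr']", xmlValue)

# Row-wise query: one result per table row, NA where the rank is absent.
rows <- getNodeSet(doc, "//table[@class='names synonyms']/tbody/tr")
ranks_by_row <- sapply(rows, function(row) {
  hit <- xpathSApply(row, ".//*[@class='infraspr']", xmlValue)
  if (length(hit) == 0) NA_character_ else hit[1]
})

length(ranks_all)     # fewer values than rows
length(ranks_by_row)  # one value per row
```

The leading "." in the row-wise XPath matters: without it the expression is evaluated against the whole document rather than the current row.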

Here is another solution, which splits each species name into its component parts

library(XML)
library(plyr)

# read url into html tree
url = "http://www.theplantlist.org/tpl/record/kew-419248"
doc = htmlTreeParse(url, useInternalNodes = TRUE)

# extract nodes containing desired information
xp_expr = "//table[@class= 'names synonyms']/tbody/tr"
nodes = getNodeSet(doc, xp_expr)

# function to extract desired fields from a given node    
fields = list('genus', 'species', 'infraspe', 'authorship')
read_node = function(node){

    dl = lapply(fields, function(x) xpathSApply(node, 
       paste(".//*[@class = ", "'", x, "'", "]", sep = ""), xmlValue))
    tmp = rep(' ', length(dl))
    tmp[sapply(dl, length) == 1] = unlist(dl)
    confidence = xpathSApply(node, './/img', xmlGetAttr, 'alt')
    return(c(tmp, confidence))
}

# apply function to all nodes and return data frame
df = ldply(nodes, read_node)
names(df) = c(fields, 'confidence')

It produces the following output

      genus      species     infraspe               authorship confidence
1 Critesion     chilense              (Roem. & Schult.) Á.Löve          H
2   Hordeum     chilense     chilense                                   L
3   Hordeum  cylindricum                                Steud.          H
4   Hordeum depauperatum                                Steud.          H
5   Hordeum     pratense brongniartii                Macloskie          L
6   Hordeum    secalinum     chilense                  É.Desv.          L