I have some long (about 3k lines) input XML documents that typically look like this:
<chapter someAttributes="someValues">
  <title>someTitle</title>
  <p>multiple paragraphs</p>
  <p>...</p>
  <li>
    <p>- some text</p>
  </li>
  <li>
    <p>- some other text</p>
  </li>
  <!-- another li elements -->
  <p>multiple other paragraphs</p>
  <p>...</p>
  <li>
    <p>1. some text</p>
  </li>
  <li>
    <p>2. some other text</p>
  </li>
  <!-- another li elements -->
  <p>multiple other paragraphs</p>
  <p>...</p>
  <!-- there are other elements such as table, illustration, ul etc. -->
</chapter>
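What I want is to wrap every scattered sequence of li elements (by "scattered" I mean separated by paragraphs, tables, illustrations, etc.) in an ol or ul element, chosen according to some semantics, and return the wrapped XML.
- If the first character of the paragraph is -, the sequence should become a ul with the attribute mark="DASH".
- If the paragraphs start with 1., 2., 3., etc., I want an ol with numeration="ARABIC".
For example (this is just one sequence):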
<ul mark="DASH">
  <li>
    <p> some text</p>
  </li>
  <li>
    <p> some other text</p>
  </li>
</ul>
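As you can see, I also need to remove the "marker characters" (- or 1., 2., 3., etc.) from all the paragraphs.
The input XML is more complex than I have described (nested sequences, inner sequences inside table elements), but I am looking for ideas, especially on how to capture and process specific sequences with this kind of semantics.
I want the output XML to keep exactly the same order, with only the li elements wrapped. XSLT 2.0 / EXSLT can be used if needed.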
Here is an XSLT 2.0 stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output indent="yes"/>

  <!-- Identity template: copies every node that no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Group adjacent child elements by whether they are li elements. -->
  <xsl:template match="chapter">
    <xsl:copy>
      <xsl:for-each-group select="*" group-adjacent="boolean(self::li)">
        <xsl:choose>
          <!-- The first li of the group decides the kind of list. -->
          <xsl:when test="current-grouping-key() and ./p[1][starts-with(., '-')]">
            <ul mark="DASH">
              <xsl:apply-templates select="current-group()"/>
            </ul>
          </xsl:when>
          <!-- Anchored with ^ so a number later in the paragraph does not match. -->
          <xsl:when test="current-grouping-key() and ./p[1][matches(., '^[0-9]+\.')]">
            <ol numeration="arabic">
              <xsl:apply-templates select="current-group()"/>
            </ol>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Remove the leading marker (a dash, or digits followed by a period). -->
  <xsl:template match="li/p/text()[1]">
    <xsl:value-of select="replace(., '^(-|[0-9]+\.)', '')"/>
  </xsl:template>

</xsl:stylesheet>
When I use Saxon 9.3 with this stylesheet and the sample input
<chapter someAttributes="someValues">
  <title>someTitle</title>
  <p>multiple paragraphs</p>
  <p>...</p>
  <li>
    <p>- some text</p>
  </li>
  <li>
    <p>- some other text</p>
  </li>
  <!-- another li elements -->
  <p>multiple other paragraphs</p>
  <p>...</p>
  <li>
    <p>1. some text</p>
  </li>
  <li>
    <p>2. some other text</p>
  </li>
  <!-- another li elements -->
  <p>multiple other paragraphs</p>
  <p>...</p>
  <!-- there are other elements such as table, illustration, ul etc. -->
</chapter>
I get the following output:
<?xml version="1.0" encoding="UTF-8"?>
<chapter>
  <title>someTitle</title>
  <p>multiple paragraphs</p>
  <p>...</p>
  <ul mark="DASH">
    <li>
      <p> some text</p>
    </li>
    <li>
      <p> some other text</p>
    </li>
  </ul>
  <p>multiple other paragraphs</p>
  <p>...</p>
  <ol numeration="arabic">
    <li>
      <p> some text</p>
    </li>
    <li>
      <p> some other text</p>
    </li>
  </ol>
  <p>multiple other paragraphs</p>
  <p>...</p>
</chapter>
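Note that the output above loses the someAttributes attribute and the comments of chapter, and that only direct children of chapter are grouped. The question also mentions nested sequences and sequences inside table elements, so a possible generalization is sketched below. It has not been run against the real documents and rests on several assumptions that are not confirmed by the question: any element with li children, other than an existing ul or ol, is treated as a container whose sequences need wrapping; such containers have element-only content (text and comment children are still dropped, as in the stylesheet above); the first li of each run decides the list type; and the uppercase value ARABIC is taken from the question rather than from the output shown.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output indent="yes"/>

  <!-- Identity template: copies every node that no other template handles. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: any element with li children, other than an existing
       ul/ol, holds sequences that still need wrapping. -->
  <xsl:template match="*[li][not(self::ul or self::ol)]">
    <xsl:copy>
      <!-- Keep the container's own attributes, e.g. someAttributes. -->
      <xsl:apply-templates select="@*"/>
      <!-- Assumption: element-only content; text and comment children
           of the container are not selected and so are dropped. -->
      <xsl:for-each-group select="*" group-adjacent="boolean(self::li)">
        <xsl:choose>
          <xsl:when test="current-grouping-key()
                          and starts-with(normalize-space(p[1]), '-')">
            <ul mark="DASH">
              <xsl:apply-templates select="current-group()"/>
            </ul>
          </xsl:when>
          <xsl:when test="current-grouping-key()
                          and matches(normalize-space(p[1]), '^[0-9]+\.')">
            <ol numeration="ARABIC">
              <xsl:apply-templates select="current-group()"/>
            </ol>
          </xsl:when>
          <xsl:otherwise>
            <!-- apply-templates rather than copy-of, so that nested
                 containers (e.g. table cells) are processed as well. -->
            <xsl:apply-templates select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Strip the leading marker from the first text node of the first p
       of every li, including li elements inside pre-existing lists. -->
  <xsl:template match="li/p[1]/text()[1]">
    <xsl:value-of select="replace(., '^\s*(-|[0-9]+\.)', '')"/>
  </xsl:template>

</xsl:stylesheet>
```

Using apply-templates in the otherwise branch is what lets the same template fire again for a table cell or other nested element that contains its own li runs. If the containers can hold comments or mixed content that must survive, the select="*" would need to be widened and whitespace-only text nodes handled, since they would otherwise split a run of adjacent li elements.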