XPath: Find Next Occurrence of an Element in an Ancestor

I am trying to find the next instance of <fn> within a section of sec-type='reading'.
XML Sample:
<?xml version="1.0" encoding="US-ASCII"?>
<book>
<sec sec-type="reading">
<title>Section 1</title>
<p>Sample <bold>Bold</bold>Text <fn><label>1</label></fn> Some more text</p>
<!-- more variations of stuff at various levels -->
<sec>
<title>Section 1.1</title>
<p>Another paragraph with a footnote <fn><label>2</label></fn></p>
</sec>
<sec>
<title>Section 1.2</title>
<p>Another paragraph with a footnote <fn><label>3</label></fn></p>
</sec>
</sec>
<sec sec-type="reading">
<title>Section 2</title>
<p>Sample <bold>Bold</bold>Text <fn><label>6</label></fn> Some more text</p>
<!-- more variations of stuff at various levels -->
<sec>
<title>Section 2.1</title>
<p>Another paragraph with a footnote <fn><label>8</label></fn></p>
</sec>
<sec>
<title>Section 2.2</title>
<p>Another paragraph with a footnote <fn><label>9</label></fn></p>
</sec>
</sec>
</book>
The purpose is to see if the FN labels are sequentially ordered within a section. I've numbered the second section with 6-9 to make it easier to see if it's working.
This is what I want:
Footnote 1 [Next: 2]
Footnote 2 [Next: 3]
Footnote 3 [Next: ]
Footnote 6 [Next: 8]
Footnote 8 [Next: 9]
Footnote 9 [Next: ]
The eventual goal is to return a warning for Footnote 6 [Next: 8]
This is the schematron I've got so far. This gives me:
Footnote 1 [Next: 2]
Footnote 2 [Next: 3]
**Footnote 3 [Next: 6]**
Footnote 6 [Next: 8]
Footnote 8 [Next: 9]
Footnote 9 [Next: ]
It finds the next instance of the footnote. However, I don't want it to cross the sections - so Footnote 3 [Next: 6] is wrong.
<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://purl.oclc.org/dsdl/schematron"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
queryBinding="xslt2" >
<!--check if footnotes are sequential within a reading -->
<pattern id="footnote-sequential">
<rule context="fn">
<let name="next" value="following::fn[1]/label/text()"/>
<assert test="number(label/text()) > 40">
Footnote <value-of select='label'/>
[Next: <value-of select="$next"/>]
</assert>
</rule>
</pattern>
</schema>
Note: number(label/text()) > 40 in the assert is there to catch everything at the moment. It'll eventually be something along the lines of number(current)+1 != number(next)
The closest I've gotten is ancestor::sec[@sec-type='reading']//following::fn[1]/label/text() - but that loses the "next" and gives me weird results like this:
Footnote 1 [Next: 1236]
Footnote 2 [Next: 1236]
Footnote 3 [Next: 1236]
Footnote 6 [Next: 689]
Footnote 8 [Next: 689]
Footnote 9 [Next: 689]

You need intersect.
[EDIT]
Set $next to the fn element instead of the label text:
<let name="next" value="following::fn[1]"/>
[/EDIT]
Go to your current section and take all footnotes of this section:
<let name="sect-fns" value="ancestor::sec[@sec-type='reading']//fn" />
Make the intersect on $next and $sect-fns:
<let name="next" value="$next intersect $sect-fns" />
Check if $next is empty or its label is number(./label) + 1:
<assert test="not($next) or number(./label) + 1 = number($next/label)">
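For anyone who wants to sanity-check the expected pairs outside Schematron, here is a minimal Python sketch of the same idea: collect the footnotes of each reading section in document order and only look for the "next" one inside that list. The function names `next_labels` and `gaps` are made up for this illustration, the sample is a trimmed copy of the XML from the question, and it assumes reading sections are not nested inside each other.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question.
SAMPLE = """<book>
<sec sec-type="reading">
<title>Section 1</title>
<p>Sample <bold>Bold</bold>Text <fn><label>1</label></fn> Some more text</p>
<sec><title>Section 1.1</title><p>Another <fn><label>2</label></fn></p></sec>
<sec><title>Section 1.2</title><p>Another <fn><label>3</label></fn></p></sec>
</sec>
<sec sec-type="reading">
<title>Section 2</title>
<p>Sample <bold>Bold</bold>Text <fn><label>6</label></fn> Some more text</p>
<sec><title>Section 2.1</title><p>Another <fn><label>8</label></fn></p></sec>
<sec><title>Section 2.2</title><p>Another <fn><label>9</label></fn></p></sec>
</sec>
</book>"""


def next_labels(xml_text):
    """Return (label, next_label_or_None) pairs without crossing reading sections."""
    root = ET.fromstring(xml_text)
    pairs = []
    for reading in root.iter("sec"):
        if reading.get("sec-type") != "reading":
            continue
        # All footnote labels of this reading section, in document order.
        labels = [fn.findtext("label") for fn in reading.iter("fn")]
        for i, label in enumerate(labels):
            nxt = labels[i + 1] if i + 1 < len(labels) else None
            pairs.append((label, nxt))
    return pairs


def gaps(pairs):
    """Keep only the pairs where the next label is not current + 1."""
    return [(cur, nxt) for cur, nxt in pairs
            if nxt is not None and int(nxt) != int(cur) + 1]


pairs = next_labels(SAMPLE)
for cur, nxt in pairs:
    print(f"Footnote {cur} [Next: {nxt or ''}]")
print(gaps(pairs))  # [('6', '8')]
```

The last footnote of each section gets no "next", which corresponds to the empty intersect result in the Schematron version.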


How to avoid row names in further analysis in R?

I'm running the following example from the GGEBiplotGUI package and, of course, it works properly.
library(GGEBiplotGUI)
data("Ontario")
Ontario
GGEBiplot(Data = Ontario)
But when I download the "Ontario" data and run the same script against my local copy, it does not behave the same way. See the example below.
Ontario <- read.csv("Book.csv")
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario)
The result is the following table (columns 0 to 10), in which the row numbers (1 to 17) are taken as genotypes and "X" is treated as another location.
See the result below.
X BH93 EA93 HW93 ID93 KE93 NN93 OA93 RN93 WP93
1 ann 4.460 4.150 2.849 3.084 5.940 4.450 4.351 4.039 2.672
2 ari 4.417 4.771 2.912 3.506 5.699 5.152 4.956 4.386 2.938
3 aug 4.669 4.578 3.098 3.460 6.070 5.025 4.730 3.900 2.621
4 cas 4.732 4.745 3.375 3.904 6.224 5.340 4.226 4.893 3.451
5 del 4.390 4.603 3.511 3.848 5.773 5.421 5.147 4.098 2.832
6 dia 5.178 4.475 2.990 3.774 6.583 5.045 3.985 4.271 2.776
7 ena 3.375 4.175 2.741 3.157 5.342 4.267 4.162 4.063 2.032
8 fun 4.852 4.664 4.425 3.952 5.536 5.832 4.168 5.060 3.574
9 ham 5.038 4.741 3.508 3.437 5.960 4.859 4.977 4.514 2.859
10 har 5.195 4.662 3.596 3.759 5.937 5.345 3.895 4.450 3.300
11 kar 4.293 4.530 2.760 3.422 6.142 5.250 4.856 4.137 3.149
12 kat 3.151 3.040 2.388 2.350 4.229 4.257 3.384 4.071 2.103
13 luc 4.104 3.878 2.302 3.718 4.555 5.149 2.596 4.956 2.886
14 m12 3.340 3.854 2.419 2.783 4.629 5.090 3.281 3.918 2.561
15 reb 4.375 4.701 3.655 3.592 6.189 5.141 3.933 4.208 2.925
16 ron 4.940 4.698 2.950 3.898 6.063 5.326 4.302 4.299 3.031
17 rub 3.786 4.969 3.379 3.353 4.774 5.304 4.322 4.858 3.382
How can I fix this problem, so that the row names and "X" are not treated as variables in the GGEBiplotGUI analysis?
I have also tried the following, and none of it worked:
attributes(Ontario)$row.names <- NULL
print(Ontario, row.names = F)
row.names(Ontario) <- NULL
Ontario[, -1] ## It deletes the first column not the 0 one.
Many thanks in advance!
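This code worked properly. The genotype names in column X become the row names, and that column is dropped before plotting:
Ontario <- read.csv("Libro.csv")
rownames(Ontario) <- Ontario$X   # use the genotype names as row names
Ontario1 <- Ontario[, -1]        # drop the now-redundant X column
library(GGEBiplotGUI)
GGEBiplot(Data = Ontario1)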

Get posts with tags having a column value not more than 7

Say, I have two entities in many-to-many relationship:
posts
-id
-title
-body
tags
-id
-title
-sequence (int)
post_tag
-post_id
-tag_id
I want to grab those posts whose tags go up to a sequence value of 7 and no higher. Bear with me; an example will make more sense:
post_tag:
=========
-post_id: 1
-tag_id: 3
-post_id: 1
-tag_id: 4
tags:
=====
-id: 3
-sequence: 2
-id: 4
-sequence: 7
Post ID 1 should be returned
post_tag:
=========
-post_id: 2
-tag_id: 4
-post_id: 2
-tag_id: 5
tags:
=====
-id: 4
-sequence: 7
-id: 5
-sequence: 8
Post ID 2 should NOT be returned because it has a tag whose sequence exceeds 7.
post_tag:
=========
-post_id: 3
-tag_id: 2
-post_id: 3
-tag_id: 3
-post_id: 3
-tag_id: 4
tags:
=====
-id: 2
-sequence: 1
-id: 3
-sequence: 2
-id: 4
-sequence: 7
Post ID 3 should be returned
Here is what I have tried so far:
$posts = Post::whereHas('tags', function($q){ $q->where('sequence', 7);})->get();
But it also returns posts that have a tag with a sequence greater than 7. I am not complaining; I know why that happens (the condition only requires that some tag has sequence 7). I just don't know how to exclude those posts.
Hint: In terms of query, the problem can be thought of as:
Not more than X and X inclusive
Probably what you are looking for is whereDoesntHave()
$posts = Post::whereHas('tags', function ($q) {
$q->where('sequence', 7);
})
->whereDoesntHave('tags', function($q) {
$q->where('sequence', '>', 7);
})->get();
What you actually need is a check for equality and less than, like this:
$posts = Post::whereHas('tags', function($q) {
$q->where('sequence', '<=', 7);
})->get();
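To compare the two suggestions against the example data, here is a small sketch using Python's built-in sqlite3 module. It is not Eloquent code: the SQL is hand-written to mirror the intent of whereHas / whereDoesntHave (an EXISTS plus a NOT EXISTS), and the post titles are placeholders invented for the test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE tags (id INTEGER PRIMARY KEY, sequence INTEGER);
CREATE TABLE post_tag (post_id INTEGER, tag_id INTEGER);
-- Data from the three examples in the question (titles are placeholders).
INSERT INTO posts (id, title) VALUES (1, 'one'), (2, 'two'), (3, 'three');
INSERT INTO tags (id, sequence) VALUES (2, 1), (3, 2), (4, 7), (5, 8);
INSERT INTO post_tag (post_id, tag_id) VALUES
    (1, 3), (1, 4), (2, 4), (2, 5), (3, 2), (3, 3), (3, 4);
""")

# Has a tag with sequence 7 AND has no tag with a sequence above 7.
HAS_7_AND_NOTHING_ABOVE = """
SELECT p.id FROM posts p
WHERE EXISTS (
    SELECT 1 FROM post_tag pt JOIN tags t ON t.id = pt.tag_id
    WHERE pt.post_id = p.id AND t.sequence = 7)
AND NOT EXISTS (
    SELECT 1 FROM post_tag pt JOIN tags t ON t.id = pt.tag_id
    WHERE pt.post_id = p.id AND t.sequence > 7)
ORDER BY p.id
"""

# Only requires that some tag has a sequence of 7 or less.
HAS_SOME_TAG_UP_TO_7 = """
SELECT p.id FROM posts p
WHERE EXISTS (
    SELECT 1 FROM post_tag pt JOIN tags t ON t.id = pt.tag_id
    WHERE pt.post_id = p.id AND t.sequence <= 7)
ORDER BY p.id
"""

strict = [row[0] for row in conn.execute(HAS_7_AND_NOTHING_ABOVE)]
loose = [row[0] for row in conn.execute(HAS_SOME_TAG_UP_TO_7)]
print(strict)  # [1, 3]
print(loose)   # [1, 2, 3]
```

On this data the existence check with `<= 7` alone still matches post 2, because post 2 does have a tag with sequence 7; it is the extra "no tag above 7" condition that excludes it.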

Nokogiri extract object that is not wrapped in tags

I'm loading regulations to a database and putting them in a tree hierarchy. When the XML I'm scraping is set up as below, it is trivial to scrape it:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along this line:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
title = subpart.xpath("hd").first.text
# add is a method that creates object and saves it to database, returning its id
id = add(title,'SUBPART',parent_id)
subpart.xpath('section').each do |section|
title = section.xpath('hd').first.text
add(title,'SECTION',id)
end
subpart.xpath('appendix').each do |app|
title = app.xpath('hd').first.text
content = app.to_s
add(title,'APPENDIX',id,content) #content is an optional input
end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of to extract these appendices is to iterate through the children of the <EXTRACT> node and check whether each one is an <HD> with "Appendix" in its text, then save everything after it until I hit the next <HD> with "Appendix" in the text.
Feels like a pretty clunky solution. Is there a better way to do this?
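For what it's worth, the grouping pass described above is fairly short once written out. Below is a rough sketch of it in Python with the standard-library ElementTree rather than Nokogiri, just to show the shape of the loop; `split_appendices` and the shortened sample are made up for this illustration, and it assumes the appendix headings are direct children of <EXTRACT>.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the <EXTRACT> block in the question.
SAMPLE = """<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60</HD>
<NOTE><HD SOURCE="HED">Note:</HD><P>Text for appendix A.</P></NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60</HD>
<NOTE><HD SOURCE="HED">Note:</HD><P>Text for appendix B.</P></NOTE>
</EXTRACT>"""


def split_appendices(extract):
    """Group the direct children of <EXTRACT> under each 'Appendix' heading."""
    appendices = []
    current = None
    for child in extract:
        title = "".join(child.itertext()).strip()
        if child.tag == "HD" and "Appendix" in title:
            # A new appendix starts at every top-level heading that mentions "Appendix".
            current = {"title": title, "content": []}
            appendices.append(current)
        elif current is not None:
            # Everything up to the next such heading belongs to the current appendix.
            current["content"].append(ET.tostring(child, encoding="unicode"))
    return appendices


for appendix in split_appendices(ET.fromstring(SAMPLE)):
    print(appendix["title"], len(appendix["content"]))
```

The nested <HD SOURCE="HED"> elements are not mistaken for appendix headings because only direct children of <EXTRACT> are inspected.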

Splitting a String with Pig

I have a String in the following format :
Sat, 09 Jul 2011 05:38:24 GMT
I would like to get output like this:
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions, but I keep getting errors, so I will explain the problem again. I have an XML file with a node containing Tue, 05 Jul 2011 10:10:30 GMT, from which I would like to extract two separate strings as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON
EDITED ANSWER:
That example is a bit more clear ... I grabbed an RSS feed example, and did a quick test. The code below worked using a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster@hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
If your string length is constant, then you can use the builtin SUBSTRING function. Think of it like the cut command in Linux.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 4, 25);
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '(\\d.*\\d)', 1);
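Since the question asks for two separate strings (date and time), here is a quick way to check the pattern and the fixed offsets outside Pig. This is a Python sketch, not Pig Latin; `split_pubdate` is a name made up for the illustration, and the slice offsets assume the fixed-width layout of the example string.

```python
import re

# Two capture groups: the date part and the time part.
PUBDATE = re.compile(r"(\d{2} [A-Za-z]{3} \d{4}) (\d{2}:\d{2}:\d{2})")


def split_pubdate(value):
    """Return (date, time) from an RSS-style date string, or None if it does not match."""
    match = PUBDATE.search(value)
    if match is None:
        return None
    return match.group(1), match.group(2)


value = "Sat, 09 Jul 2011 05:38:24 GMT"
print(split_pubdate(value))        # ('09 Jul 2011', '05:38:24')
# Fixed-offset alternative, only valid when every value has this exact layout.
print(value[5:16], value[17:25])   # 09 Jul 2011 05:38:24
```

The same two capture groups could be used in two REGEX_EXTRACT calls (group 1 and group 2) to get separate date and time fields.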

Retrieving an element from XPath using text as basis

Terrible title, I know, but is there a way in XPath to get to a desired link by only knowing that the link is second going back from the last ellipsis?
In this instance, the desired link is /2
<div class="page">
1
2
3
...
50
51
52
</div>
In this case, it is /3.
<div class="page">
1
2
3
4
...
50
51
52
</div>
And, just to throw a spanner, in the works... this one is 21:
<div class="page">
1
2
3
...
18
19
20
21
22
...
50
51
52
</div>
I've tried all sorts of ways to get at it, from writing out counts to throwing magic beans out of my window, but nothing works. And now I'm out of magic beans. :(
Any suggestions for this problem (XPath, not the magic beans!) are welcome!
/div[@class='page']/text()[normalize-space()='…'][last()]/preceding-sibling::a[2]
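To check the idea against the three examples, here is a small Python sketch of the same selection logic (last ellipsis text node, then the second link before it). It uses the standard-library ElementTree, which cannot evaluate this XPath itself, so the steps are spelled out by hand. The link markup (<a href="/N">N</a>) is an assumption, since the original anchors did not survive in the samples above, and `link_before_last_ellipsis` is a made-up name.

```python
import xml.etree.ElementTree as ET


def link_before_last_ellipsis(div, ellipsis="..."):
    """Return the second <a> before the last ellipsis text node, or None."""
    children = list(div)
    last = None
    for i, child in enumerate(children):
        # ElementTree stores the text that follows an element in its .tail.
        if (child.tail or "").strip() == ellipsis:
            last = i
    if last is None:
        return None
    # Links that precede the ellipsis, in document order.
    anchors = [c for c in children[:last + 1] if c.tag == "a"]
    return anchors[-2] if len(anchors) >= 2 else None


def make_div(labels):
    """Build a sample div; every label except '...' becomes a link (assumed markup)."""
    parts = ["..." if label == "..." else f'<a href="/{label}">{label}</a>'
             for label in labels]
    return ET.fromstring('<div class="page">' + " ".join(parts) + "</div>")


first = make_div(["1", "2", "3", "...", "50", "51", "52"])
third = make_div(["1", "2", "3", "...", "18", "19", "20", "21", "22",
                  "...", "50", "51", "52"])
print(link_before_last_ellipsis(first).get("href"))  # /2
print(link_before_last_ellipsis(third).get("href"))  # /21
```

If the page uses the single character … rather than three dots, pass it as the `ellipsis` argument.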
