This is the XML document:
<CT>
<CHILD 1> 10 </CHILD 1>
<CHILD 2> 20 </CHILD 2>
</CT>
<TH>
<CHILD 3> 100 </CHILD 3>
<CHILD 4> 200 </CHILD 4>
</TH>
<CT>
<CHILD 1> 30 </CHILD 1>
<CHILD 6> 40 </CHILD 6>
</CT>
<TH>
<CHILD 7> 300 </CHILD 7>
<CHILD 8> 400 </CHILD 8>
</TH>
I want to fetch the value 30, for which I have used the following XPath:
root/parent/th../ct/child1
I cannot change root/parent/th, which is fixed by my requirements, but I can change the rest of the XPath, starting from ../ct/child1.
Your XPath expression has an error: it should be /th/../ rather than /th../
Also, /root/parent/th selects two th elements, and it is not clear which one should be used, because both of them have a preceding ct with a child1 child.
The straightforward answer is:
/root/parent/TH[last()]/preceding-sibling::CT[1]/CHILD1
That is, the CHILD1 of the first preceding-sibling CT element of the last TH.
For a more useful answer, we need more information.
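To make the selection logic concrete, here is a minimal sketch in Python that walks the same path by hand. Python's standard xml.etree.ElementTree does not support the preceding-sibling axis, so the sketch emulates it with a loop rather than evaluating the XPath itself. The root/parent wrapper and the element names without spaces (CHILD1 and so on) are assumptions, since the document as posted is not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed version of the posted document.
SAMPLE = """<root><parent>
<CT><CHILD1> 10 </CHILD1><CHILD2> 20 </CHILD2></CT>
<TH><CHILD3> 100 </CHILD3><CHILD4> 200 </CHILD4></TH>
<CT><CHILD1> 30 </CHILD1><CHILD6> 40 </CHILD6></CT>
<TH><CHILD7> 300 </CHILD7><CHILD8> 400 </CHILD8></TH>
</parent></root>"""


def child1_before_last_th(xml_text):
    """Emulate TH[last()]/preceding-sibling::CT[1]/CHILD1 by hand."""
    parent = ET.fromstring(xml_text).find("parent")
    siblings = list(parent)
    th_positions = [i for i, el in enumerate(siblings) if el.tag == "TH"]
    if not th_positions:
        return None
    # Walk backwards from the last TH to the nearest CT sibling.
    for el in reversed(siblings[:th_positions[-1]]):
        if el.tag == "CT":
            value = el.findtext("CHILD1")
            return value.strip() if value is not None else None
    return None


print(child1_before_last_th(SAMPLE))
```

With an engine that supports full XPath 1.0, the single expression above does the same job without the loop.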
Related
I have n number of these type of xml files.
<students roll_no="1">
<name>abc</name>
<gender>m</gender>
<maxmarks>
<marks>
<year>2014</year>
<maths>100</maths>
<english>100</english>
<spanish>100</spanish>
</marks>
<marks>
<year>2015</year>
<maths>110</maths>
<english>110</english>
<spanish>110</spanish>
</marks>
</maxmarks>
<marksobt>
<marks>
<year>2014</year>
<maths>90</maths>
<english>95</english>
<spanish>82</spanish>
</marks>
<marks>
<year>2015</year>
<maths>94</maths>
<english>98</english>
<spanish>02</spanish>
</marks>
</marksobt>
</students>
I need output like
roll_no name gender year eng_max_marks maths_max_marks spanish_max_marks
1 abc m 2014 100 100 100
1 abc m 2015 110 110 110
I am able to retrieve the marks row-wise in a single statement, but I am not able to extract roll_no and name along with them.
A = LOAD 'student.xml' using org.apache.pig.piggybank.storage.XMLLoader('marks') as (x:chararray);
B = FOREACH A GENERATE XPath(x, 'marks/year'), XPath(x, 'marks/english'), XPath(x, 'marks/maths'), XPath(x, 'marks/spanish');
This returns
year eng_max_marks maths_max_marks spanish_max_marks
2014 100 100 100
2015 110 110 110
I can extract both chunks, but I don't see how to join the other fields. I can't use a cross join because I have n other files.
Let's forget the attribute (roll_no) for now. How can I extract the rest of the nodes?
name gender year eng_max_marks maths_max_marks spanish_max_marks
abc m 2014 100 100 100
abc m 2015 110 110 110
I don't want to use the marks(1)/english approach, because the number of these nodes can vary, and I don't want to adopt a dirty workaround.
Any pointers?
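To make the target shape concrete, here is a minimal sketch in Python (not Pig) of the flattening being asked for: read the student-level fields once, then repeat them on every marks row under maxmarks. It assumes a well-formed version of the sample (closing </marks> tags, a quoted roll_no attribute, matching <students> tags), and the function name is made up for illustration. It is not a Pig solution, only a reference for what the joined rows should look like.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed, shortened version of one student file.
STUDENT = """<students roll_no="1">
<name>abc</name>
<gender>m</gender>
<maxmarks>
<marks><year>2014</year><maths>100</maths><english>100</english><spanish>100</spanish></marks>
<marks><year>2015</year><maths>110</maths><english>110</english><spanish>110</spanish></marks>
</maxmarks>
</students>"""


def max_marks_rows(xml_text):
    """Return one tuple per <marks> under <maxmarks>, prefixed with student fields."""
    student = ET.fromstring(xml_text)
    # Fields that live on the parent and must be repeated on every row.
    shared = (student.get("roll_no"),
              student.findtext("name"),
              student.findtext("gender"))
    rows = []
    for marks in student.findall("maxmarks/marks"):
        per_year = tuple(marks.findtext(tag)
                         for tag in ("year", "english", "maths", "spanish"))
        rows.append(shared + per_year)
    return rows


for row in max_marks_rows(STUDENT):
    print(" ".join(row))
```

Because the loop runs over however many marks elements exist, it does not depend on positional access such as marks(1).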
I'm loading regulations into a database and putting them in a tree hierarchy. When the XML I'm scraping is set up as below, it is trivial to scrape it:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along this line:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
  title = subpart.xpath("hd").first.text
  # add is a method that creates an object, saves it to the database, and returns its id
  id = add(title, 'SUBPART', parent_id)
  subpart.xpath('section').each do |section|
    title = section.xpath('hd').first.text
    add(title, 'SECTION', id)
  end
  subpart.xpath('appendix').each do |app|
    # read the heading from the appendix itself, not from the last section
    title = app.xpath('hd').first.text
    content = app.to_s
    add(title, 'APPENDIX', id, content) # content is an optional input
  end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of to extract these appendices is to iterate through the children of the <EXTRACT> node and check each one to see whether its name is "HD" and "Appendix" is in its text, then save everything that follows until I hit the next <HD> with "Appendix" in the text.
Feels like a pretty clunky solution. Is there a better way to do this?
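For reference, the iteration described above is short once written down. Here is a minimal sketch in Python using the standard xml.etree.ElementTree (the question uses Ruby; Python is used here only for illustration). The sample is a shortened stand-in for the real <EXTRACT> block, the function name is hypothetical, and the check is "starts with Appendix" rather than "contains Appendix", which is an assumption about the headings.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the <EXTRACT> block in the question.
EXTRACT = """<EXTRACT>
<HD SOURCE="HD1">Appendix A (shortened title)</HD>
<NOTE><HD SOURCE="HED">Note:</HD><P>Text for appendix A.</P></NOTE>
<HD SOURCE="HD1">Appendix B (shortened title)</HD>
<NOTE><HD SOURCE="HED">Note:</HD><P>Text for appendix B.</P></NOTE>
<P>Extra paragraph for appendix B.</P>
</EXTRACT>"""


def split_appendices(extract):
    """Group the direct children of <EXTRACT> under the preceding appendix heading."""
    groups = []
    for node in extract:
        text = (node.text or "").strip()
        if node.tag == "HD" and text.startswith("Appendix"):
            # A new appendix starts here; later siblings belong to it.
            groups.append((text, []))
        elif groups:
            groups[-1][1].append(node)
        # Nodes before the first appendix heading are ignored.
    return groups


appendices = split_appendices(ET.fromstring(EXTRACT))
for title, nodes in appendices:
    print(title, [n.tag for n in nodes])
```

Only direct children are inspected, so the nested <HD SOURCE="HED"> inside each <NOTE> does not start a new group.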
I have a String in the following format :
Sat, 09 Jul 2011 05:38:24 GMT
I would like output like this:
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions, but I keep getting errors, so I will re-explain the problem. I have an XML file with a node containing Tue, 05 Jul 2011 10:10:30 GMT, from which I would like to extract two separate strings, as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON
EDITED ANSWER:
That example is a bit more clear ... I grabbed an RSS feed example, and did a quick test. The code below worked using a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster@hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
If your string length is constant, then you can use the builtin SUBSTRING function. Think of it like the cut command in Linux.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 4, 25);
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '([\d].*[\d])', 1);
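The question asks for the date and the time as two separate strings, which a regex with two capture groups handles directly. Here is a minimal sketch in Python (not Pig), reusing the date and time pattern from the edited answer; the function name is made up for illustration.

```python
import re

# Same shape as the pattern in the edited answer, split into two groups:
# group 1 is "dd Mon yyyy", group 2 is "hh:mm:ss".
PUBDATE = re.compile(r"(\d{2}\s[a-zA-Z]{3}\s\d{4})\s(\d{2}:\d{2}:\d{2})")


def split_pubdate(value):
    """Return (date, time) from an RSS-style pubDate string, or None if it does not match."""
    match = PUBDATE.search(value)
    return match.groups() if match else None


date_part, time_part = split_pubdate("Sat, 09 Jul 2011 05:38:24 GMT")
print(date_part)
print(time_part)
```

In Pig the same idea would presumably be two REGEX_EXTRACT calls on the same field, one per group index, though that is not tested here.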
Using Ruby: ruby 1.9.3dev (2011-09-23 revision 33323) [i686-linux]
I have the following string:
str = 'Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'
I want to match the following:
35 (a number which is part of href attribute value)
TU4 Sep 5 Activity (the text of the <a> tag)
First question from Manager on TU4 Sep 5 Activity 1. (the remaining text after last <br/><br/> tags)
For achieving the same I have written the following regex
result = str.match(/<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*)<\/a>.*<br\/><br\/>(?<message>.*)/)
This produces following result:
#<MatchData "<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
activity_id:"35"
activity_title:"TU4 Sep 5 Activity 1"
message:"First question from Manager on TU4 Sep 5 Activity 1.">
But I guess this is not efficient.
Is it possible to have only the required values (as listed above under what I want to match) returned in the match result, with the following value excluded from it?
"<a href=\"/activities/35\">TU4 Sep 5 Activity 1</a> <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
Thanks,
Jignesh
The appropriate way to do this is NOT to use regexen. Instead, use the Nokogiri library to parse your HTML easily:
require 'nokogiri'
doc = Nokogiri::HTML.parse(str)
activity_id = doc.css('[href^="/activities"]').attr('href').value[/\d+$/]
activity_title = doc.css('[href^="/activities"]')[0].inner_text
message = doc.search("//text()").last
This will do exactly what your regexp was attempting, with much lower chance of random failure.
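For comparison, here is the same parser-based idea sketched with Python's standard html.parser (Python is used only for illustration; this is not the Nokogiri API). It assumes the string contains the <a href="/activities/35"> link that the question describes, and the class name and its attributes are made up for this sketch.

```python
from html.parser import HTMLParser


class ActivityParser(HTMLParser):
    """Collect the activity link's id and text, plus every non-blank text chunk."""

    def __init__(self):
        super().__init__()
        self.activity_id = None
        self.activity_title = ""
        self.texts = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("/activities/"):
            # The id is the last path segment of the href.
            self.activity_id = href.rsplit("/", 1)[-1]
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.activity_title += data
        if data.strip():
            self.texts.append(data.strip())


STR = ('Message relates to activity <a href="/activities/35">TU4 Sep 5 Activity 1</a> '
       '<img src="/images/layout/placeholder.png" width="222" height="149"/>'
       '<br/><br/>First question from Manager on TU4 Sep 5 Activity 1.')

parser = ActivityParser()
parser.feed(STR)
parser.close()
message = parser.texts[-1]  # the last text chunk, after the final <br/>
print(parser.activity_id, parser.activity_title, message, sep="\n")
```

As in the Nokogiri version, the message is simply the last text node, so nothing depends on how many tags sit between the link and the trailing text.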
Terrible title, I know, but is there a way in XPath to get to a desired link by only knowing that the link is second going back from the last ellipsis?
In this instance, the desired link is /2
<div class="page">
1
2
3
...
50
51
52
</div>
In this case, it is /3.
<div class="page">
1
2
3
4
...
50
51
52
</div>
And, just to throw a spanner in the works, this one is /21:
<div class="page">
1
2
3
...
18
19
20
21
22
...
50
51
52
</div>
I've tried all sorts of ways to get at it, from writing out counts to throwing magic beans out of my window, but nothing works. And now I'm out of magic beans. :(
Any suggestions for this problem (XPath, not the magic beans!) are welcome!
/div[@class='page']/text()[normalize-space()='…'][last()]/preceding-sibling::a[2]
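To see what that expression picks, here is a minimal sketch in Python that applies the same rule by hand: find the last ellipsis, then step back two links. Python's standard xml.etree.ElementTree has no text-node siblings (the text after an element is stored in its .tail), so this emulates the XPath rather than running it. The markup is an assumption: each page number is taken to be an <a href="/N">N</a> link and the ellipsis is taken to be three plain dots, as displayed in the question.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the first example in the question.
PAGE = ('<div class="page">'
        '<a href="/1">1</a> <a href="/2">2</a> <a href="/3">3</a> ... '
        '<a href="/50">50</a> <a href="/51">51</a> <a href="/52">52</a>'
        '</div>')


def link_before_last_ellipsis(div_markup, marker="..."):
    """Return the href of the second link going back from the last ellipsis, or None."""
    links = ET.fromstring(div_markup).findall("a")
    # A link "holds" an ellipsis when the text right after it is the marker.
    holders = [i for i, a in enumerate(links) if (a.tail or "").strip() == marker]
    if not holders or holders[-1] < 1:
        return None
    # The holder is the first link going back from the ellipsis, so the
    # second one is the link just before the holder.
    return links[holders[-1] - 1].get("href")


print(link_before_last_ellipsis(PAGE))
```

Pass marker="…" if the page uses the single ellipsis character, as the XPath above does.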