Splitting a String with Pig - hadoop

I have a String in the following format :
Sat, 09 Jul 2011 05:38:24 GMT
I would have an output like this :
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions, I have had errors. I will re-explain the problem. I have an XML file where I have a node : Tue, 05 Jul 2011 10:10:30 GMT from which I would like to extract two separated String as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON

EDITED ANSWER:
That example is a bit more clear ... I grabbed an RSS feed example, and did a quick test. The code below worked using a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster#hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
If your string length is constant, then you can use the builtin SUBSTRING function. Think of it like the cut command in Linux.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 4, 25);
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '([\d].*[\d])', 1);

Related

FORTRAN : maxloc

my data looks like
JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
22.60 24.60 30.60 34.60 36.20 35.70 32.10 30.20 31.40 31.60 28.00 24.80
25.40 27.60 32.40 34.60 36.50 38.10 31.70 31.40 30.30 30.20 27.00 23.90
and there are like hundreds of rows! I want to find a maximum value in each row and write it in different column next to data along with month
so my out put will be
36.20 MAY
38.10 JUN
.
.
I want to use maxloc function, but i have no idea how to use it!
Try
index = maxloc(myTable(3,:))
print *, myTable((/1,3/), index)
It should select the highest value from the third row and display the first and third value at this index.

Ruby: Grouping a date range by year and month

I am creating a range for each month in my example range.
example_range = (Time.zone.today..2.years.from_now)
Output should look like so:
=> [Wed, 03 Aug 2016..Wed, 31 Aug 2016, Thu, 01 Sep 2016..Fri,
30 Sep 2016, Sat, 01 Oct 2016..Mon, 03 Oct 2016, ...]
At the moment I'm doing this, which doesn't work for ranges longer than a year, because the grouping will put January '16 and January '17 in one group.
example_range.group_by(&:month).each { |_, month| month.first..month.last }
I also tried this, but ruby segfaults on this for some reason...
example_range.group_by(&:year).map{ |ary| ary.group_by(&:month)}
Does anyone know a more beautiful (or at least working) way of doing this?
How is this:
example_range.group_by {|date| [date.year, date.month] }.map {|_, month| month.first..month.last }
If you are using Active Support (Rails), this will also work:
example_range.group_by(&:beginning_of_month).map {|_, month| month.first..month.last }
The best solution I think is this:
example_range.group_by {|date| date.month.to_s + "-" + date.year.to_s}
You can adjust the way you need.

Nokogiri extract object that is not wrapped in tags

I'm loading regulations to a database and putting them in a tree hierarchy. When the XML Im scraping is set up as below, it is trivial to scrape it:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along this line:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
title = subpart.xpath("hd").first.text
# add is a method that creates object and saves it to database, returning its id
id = add(title,'SUBPART',parent_id)
subpart.xpath('section').each do |section|
title = section.xpath('hd').first.text
add(title,'SECTION',id)
end
subpart.xpath('appendix').each do |app|
title = section.xpath('hd').first.text
content = app.to_s
add(title,'APPENDIX',id,content) #content is an optional input
end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of extracting these appendices is to iterate through the <EXTRACT> node and check the tags to see if its name is "HD" and "Appendix" is in the text. Then save everything after until I hit the next <HD> with "Appendix" in the text.
Feels like a pretty clunky solution. Is there a better way to do this?

Ruby Regex to match multiple portions of a string

Using Ruby: ruby 1.9.3dev (2011-09-23 revision 33323) [i686-linux]
I have the following string:
str = 'Message relates to activity TU4 Sep 5 Activity 1 <img src="/images/layout/placeholder.png" width="222" height="149"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1.'
I want to match the following:
35 (a number which is part of href attribute value)
TU4 Sep 5 Activity (the text for tag)
First question from Manager on TU4 Sep 5 Activity 1. (the remaining text after last <br/><br/> tags)
For achieving the same I have written the following regex
result = str.match(/<a href="\/activities\/(?<activity_id>\d+)">(?<activity_title>.*)<\/a>.*<br\/><br\/>(?<message>.*)/)
This produces following result:
#<MatchData "TU4 Sep 5 Activity 1 <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
activity_id:"35"
activity_title:"TU4 Sep 5 Activity 1"
message:"First question from Manager on TU4 Sep 5 Activity 1.">
But I guess this is not efficient.
Is it possible that somehow only the required values(as mentioned above under what I want to match) is returned in the matched result and the following
value gets excluded from matched result:
"TU4 Sep 5 Activity 1 <img src=\"/images/layout/placeholder.png\" width=\"222\" height=\"149\"/><br/><br/>First question from Manager on TU4 Sep 5 Activity 1."
Thanks,
Jignesh
The appropriate way to do this is NOT to use regexen. Instead, use the Nokogiri library to easily parse your html:
require 'nokogiri'
doc = Nokogiri::HTML.parse(str)
activity_id = doc.css('[href^="/activities"]').attr('href').value[/\d+$/]
activity_title = doc.css('[href^="/activities"]')[0].inner_text
message = doc.search("//text()").last
This will do exactly what your regexp was attempting, with much lower chance of random failure.

Get the word after a particular word in a Ruby string?

How do I get the word after a particular word in a Ruby string?
For example:
From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>
I just want to get:
Ysxrb<abc#gmail.com
xyzabc
I think your question/requirement may need a bit of refinement.
You state: "How to get the word after a particular word in a ruby string?" and your example text is this : "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
and then you finally say that what you really want out of these string are the following words:
"'Ysxrb' and 'xyzabc'".
Will you always be parsing email text, which is what this appears to be? If so, then there are some more specific approaches you could take. For instance, in this example you could do something like this:
eml = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
tokens = eml.split(/[\s\:]/)
which would yield this:
["From", "Ysxrb", "To", "", "Subject", "", "xyzabc", "Date", "", "Tue,", "19", "Jun", "2012", "03", "26", "56", "-0700", "Message-ID", "", "<9D.A1.02635.ABB40EF4#ecout1>"]
At this point, if the word following "To" and "Subject" are what you're after, you could simply get the first non-blank array element after each one, like this:
tokens[tokens.find_index("From") + 1] => "Ysxrb"
tokens[tokens.find_index("Subject") + 2] => "xyzabc" # + 2 is needed because of the newline.
You can use a regular expresion, try this on a irb console:
string = "From:Ysxrb<abc#gmail.com>\nTo: <xyzn#gmail.com>Subject:"
/From:(.+)\n/.match string
$1
$1 hold the backreference we capture with the parenthesis in the regular expression
You could try a regexp, here's an example:
>> s = "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
=> "From:Ysxrb\nTo: Subject: xyzabc\nDate: Tue, 19 Jun 2012 03:26:56 -0700\nMessage-ID: <9D.A1.02635.ABB40EF4#ecout1>"
>> m, w1, w2 = s.match(/^From:(\w*)\W+.*Subject: (\w*)/).to_a
=> ["From:Ysxrb\nTo: Subject: xyzabc", "Ysxrb", "xyzabc"]
>> w1
=> "Ysxrb"
>> w2
=> "xyzabc"
to find out a good regexp for your requirements, you may use rubular, a Ruby regular expression editor

Resources