Nokogiri extract object that is not wrapped in tags - ruby

I'm loading regulations into a database and putting them in a tree hierarchy. When the XML I'm scraping is set up as below, it is trivial to scrape:
<CHAPTER>
<PART>
<SUBPART>
<SECTION>
<HD>Section Title</HD>
</SECTION>
<SECTION> ... </SECTION>
<APPENDIX>
<HD>Appendix Title</HD>
<P>Appendix content...</P>
<FOO>More content in unexpected tags</FOO>
</APPENDIX>
</SUBPART>
<SUBPART> ... </SUBPART>
</PART>
</CHAPTER>
Since I have to know what the parent ID is, I do something along these lines:
parent_id = 1
doc.xpath("//chapter/part/subpart").each do |subpart|
  title = subpart.xpath("hd").first.text
  # add is a method that creates an object and saves it to the database, returning its id
  id = add(title, 'SUBPART', parent_id)
  subpart.xpath('section').each do |section|
    title = section.xpath('hd').first.text
    add(title, 'SECTION', id)
  end
  subpart.xpath('appendix').each do |app|
    # read the heading from the appendix node itself (section is out of scope here)
    title = app.xpath('hd').first.text
    content = app.to_s
    add(title, 'APPENDIX', id, content) # content is an optional input
  end
end
However, the XML is not always set up in such a logical way. Sometimes, the appendices are not wrapped in tags :(
When this is the case, the XML looks like this:
<EXTRACT>
<HD SOURCE="HD1">Appendix A to § 1926.60—Substance Data Sheet, for 4-4′ Methylenedianiline</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix A are identical to those set forth in appendix A to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix B to § 1926.60—Substance Technical Guidelines, MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix B are identical to those set forth in appendix B to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix C to § 1926.60—Medical Surveillance Guidelines for MDA</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix C are identical to those set forth in appendix C to § 1910.1050 of this chapter.</P>
</NOTE>
<HD SOURCE="HD1">Appendix D to § 1926.60—Sampling and Analytical Methods for MDA Monitoring and Measurement Procedures</HD>
<NOTE>
<HD SOURCE="HED">Note:</HD>
<P>The requirements applicable to construction work under this appendix D are identical to those set forth in appendix D to § 1910.1050 of this chapter.</P>
</NOTE>
</EXTRACT>
<CITA>
[57 FR 35681, Aug. 10, 1992, as amended at 57 FR 49649, Nov. 3, 1992; 61 FR 5510, Feb. 13, 1996; 61 FR 31431, June 20, 1996; 63 FR 1296, Jan. 8, 1998; 69 FR 70373, Dec. 6, 2004; 70 FR 1143, Jan. 5, 2005; 71 FR 16674, Apr. 3, 2006; 71 FR 50191, Aug. 24, 2006; 73 FR 75588, Dec. 12, 2008; 76 FR 33611, June 8, 2011; 77 FR 17889, Mar. 26, 2012]
</CITA>
The only way I can think of to extract these appendices is to iterate through the children of the <EXTRACT> node and check whether each tag's name is "HD" and its text contains "Appendix", then save everything after it until I hit the next <HD> with "Appendix" in the text.
That feels like a pretty clunky solution. Is there a better way to do this?
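One way to make the "collect everything until the next appendix heading" idea less clunky is Ruby's Enumerable#slice_before, which starts a new group at every element that matches a condition. The sketch below is only an illustration of that grouping step: it works on plain [tag_name, text] pairs that stand in for the direct children of <EXTRACT>, and the helper names appendix_heading? and group_appendices are hypothetical. With Nokogiri, the pairs could presumably be built with something like extract.element_children.map { |n| [n.name, n.text] }, but that wiring is an assumption and is not tested here.

```ruby
# A child is represented as a [tag_name, text] pair (a stand-in for a real
# XML node; this representation is an assumption of the sketch).
def appendix_heading?(child)
  name, text = child
  name == 'HD' && text.start_with?('Appendix')
end

# Start a new group at every appendix heading, drop any leading group that
# does not begin with a heading, and return one hash per appendix.
def group_appendices(children)
  children
    .slice_before { |child| appendix_heading?(child) }
    .select { |group| appendix_heading?(group.first) }
    .map { |group| { title: group.first[1], body: group.drop(1) } }
end

children = [
  ['HD',   'Appendix A to § 1926.60'],
  ['NOTE', 'Note: requirements for appendix A'],
  ['HD',   'Appendix B to § 1926.60'],
  ['NOTE', 'Note: requirements for appendix B']
]

group_appendices(children).each do |appendix|
  puts "#{appendix[:title]}: #{appendix[:body].size} node(s)"
end
```

Because only the direct children are inspected, the nested <HD SOURCE="HED"> inside each <NOTE> never starts a new group.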

Why BER-TLV "DF9A" tag is recognized as "invalid"?

I have trouble understanding why all the BER-TLV parsers I found:
https://paymentcardtools.com/emv-tlv-parser
https://emvlab.org/tlvutils/
https://chrome.google.com/webstore/detail/tlv-parser/iemijfhdlipdpcjfnphcdalpccnkfedb
recognize this tag, DF9A03001736, as "invalid", while DF5603001736 and DF0903001736 work just fine.
What's the difference?
Just follow the description provided in EMV Book 3, Annex B1.
"invalid" case: DF9A03001736
DF - 1101 1111
9A - 1001 1010 - here, in the second byte of the Tag, the b8 is set (1), which means that 'Another byte follows', so the following byte (value 03) is also part of the Tag
03 - 0000 0011 - the last byte of the Tag, i.e. the actual Tag is DF9A03
so, in our sequence we have:
DF9A03 - Tag
00 - Length (zero, so there is no Value)
17 - already the start of a new Tag
36 - Length for Tag 17 (0x36 = 54 bytes), but no further data follows
So the parser (https://paymentcardtools.com/emv-tlv-parser) fails because no data is available (Error during parsing Tag 17: Not enough data)
correct example: DF5603001736
DF - 1101 1111
56 - 0101 0110 - here b8 is clear (0), so no more bytes belong to the Tag and we just have Tag DF56
the sequence is:
DF56 - Tag
03 - Length
001736 - Value
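The tag rule walked through above can be sketched in a few lines of Ruby. This is a minimal illustration and not how any of the linked parsers is implemented: the method names are hypothetical, and only short-form lengths (a single byte below 0x80) are handled.

```ruby
# Format an array of byte values as an uppercase hex string.
def to_hex(bytes)
  bytes.map { |b| format('%02X', b) }.join
end

# Read one tag starting at pos; returns [tag_bytes, position_after_tag].
def read_tag(bytes, pos)
  first = bytes.fetch(pos)
  tag = [first]
  pos += 1
  # b5..b1 of the first byte all set: the tag continues in subsequent bytes.
  if (first & 0x1F) == 0x1F
    loop do
      byte = bytes.fetch(pos)
      tag << byte
      pos += 1
      break if (byte & 0x80).zero? # b8 clear: this was the last tag byte
    end
  end
  [tag, pos]
end

# Parse the first TLV object of a hex string (short-form lengths only).
def parse_first_tlv(hex_string)
  bytes = [hex_string].pack('H*').bytes
  tag, pos = read_tag(bytes, 0)
  length = bytes.fetch(pos)
  raise ArgumentError, 'long-form length is not handled in this sketch' if length >= 0x80
  value = bytes[pos + 1, length]
  raise ArgumentError, 'not enough data' if value.size < length
  { tag: to_hex(tag), length: length, value: to_hex(value),
    rest: to_hex(bytes.drop(pos + 1 + length)) }
end

p parse_first_tlv('DF9A03001736') # tag DF9A03, length 0, rest "1736"
p parse_first_tlv('DF5603001736') # tag DF56, length 3, value "001736"
```

Parsing the leftover "1736" from the first case raises the "not enough data" error, which mirrors the failure reported for Tag 17.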

Error: syntax error, unexpected "Identifier", expecting EXTERNAL or GLOBAL

Hi, I was wondering if you all could help me figure this error out. I'm rather new to COBOL, as this is my first (and only) COBOL class in my major.
I keep getting this error whenever I try to compile:
lab3a.cob:23: Error: syntax error, unexpected "Identifier", expecting EXTERNAL or GLOBAL
I can't see what I'm doing wrong.
My code:
IDENTIFICATION DIVISION.
PROGRAM-ID. "LAB3A".
Author. Fielding Featherston
* Takes inputs from file and seperates.
ENVIRONMENT DIVISION.
INPUT-OUTPUT SECTION.
FILE-CONTROL.
SELECT InFile
ASSIGN to "lab3-in.dat"
ORGANIZATION is LINE SEQUENTIAL.
DATA DIVISION.
FILE SECTION.
FD InFile.
01 InString.
05 PIC X(13).
05 Instrument PIC X(12).
88 Brass value "Bugle" "Flugelhorn"
"Sousaphone" "Trombone"
"Trumpet" "Tuba".
88 Percussion value "Bass Drum" "Bells" "Bongos"
"Castanets" "Chimes" "Cymbals"
"Snare Drum" "Xylophone".
88 Strings value "Banjo" "Bass" "Cello" "Guitar"
"Harp" "Lyre"
"Mandolin" "Violin".
88 Woodwind value "Bagpipes" "Bassoon" "Clarinet"
"Flute" "Oboe"
"Piccolo" "Saxophone".
WORKING-STORAGE SECTION.
01 BrassCount PIC 9(3).
01 PerCount PIC 9(3).
01 StringCount PIC 9(3).
01 WoodCount PIC 9(3).
01 OtherCount PIC 9(3).
01 BrassStr PIC ZZ9.
01 PerStr PIC ZZ9.
01 StringStr PIC ZZ9.
01 WoodStr PIC ZZ9.
01 OtherStr PIC ZZ9.
01 InStringLength PIC 99.
01 EndFileStr PIC X VALUE "n".
88 EndFile VALUE "y"
When Set to False is "y".
PROCEDURE DIVISION.
000-Main.
Open Input InFile
Perform until EndFile
Read InFile
At end
Set EndFile to FALSE
Not at End
PERFORM 100-SeperateStrings
PERFORM 200-ClassCount
END-READ
END-PERFORM
CLOSE InFile
Move BrassCount to BrassStr
Move PerCount to PerStr
Move StringCount to StringStr
Move WoodCount to WoodStr
Move OtherCount to OtherStr
DISPLAY "Counts"
DISPLAY " Brass: " FUNCTION TRIM(BrassStr)
DISPLAY " Percussion: " FUNCTION TRIM(PerStr)
DISPLAY " String: " FUNCTION TRIM(StringStr)
DISPLAY " Woodwind: " FUNCTION TRIM(WoodStr)
DISPLAY " OTHER: " FUNCTION TRIM(OtherStr)
STOP RUN.
100-SeperateStrings.
MOVE FUNCTION Length(InString) to InStringLength
UNSTRING InString (14:InStringLength)
INTO Instrument
END-UNSTRING.
200-ClassCount.
IF Brass
Add 1 to BrassCount
ELSE IF Percussion
Add 1 to PerCount
ELSE IF Strings
Add 1 to StringCount
ELSE IF Woodwind
Add 1 to WoodCount
ELSE
Add 1 to OtherCount
END-IF.
An EXTERNAL or GLOBAL clause in the context of the error may only occur in a record description entry; that is, a data entry that begins with 1 or 01. Given that the error occurs between two 88 level items, it appears the compiler is confused about where it is while scanning the source code.
There is some unusual formatting that may be creating a problem for the compiler. In particular, line 22 contains a number of TAB characters that should not, but may, confuse the compiler. Also, lines 33 and 46 contain a number of TAB characters at the end of each source line, causing the lines to exceed 72 characters.
Another possible issue is tab expansion: whether the compiler replaces each TAB character with 4 or 8 spaces. Again, this affects whether the text exceeds 72 characters. In the absence of a SOURCE FORMAT directive, source text after column 72 is ignored.
Until you know the effect that tabs have on the source code, I suggest replacing all tabs with spaces.
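To see how tabs might affect the column limit before replacing them, a small helper script can expand them and report lines that end up longer than 72 characters. The Ruby sketch below is one way to do that. The tab width is an assumption, because it is not known here how the compiler expands tabs, and the method names are hypothetical.

```ruby
# Expand TAB characters to the next tab stop (tab_width is an assumption).
def expand_tabs(line, tab_width = 8)
  out = +''
  line.each_char do |ch|
    if ch == "\t"
      out << ' ' * (tab_width - out.length % tab_width)
    else
      out << ch
    end
  end
  out
end

# Return [line_number, expanded_width] for every line wider than limit.
def overlong_lines(source, tab_width = 8, limit = 72)
  result = []
  source.each_line.with_index(1) do |line, number|
    width = expand_tabs(line.chomp, tab_width).length
    result << [number, width] if width > limit
  end
  result
end

# Usage, assuming the source file from the question is in the current directory:
#   overlong_lines(File.read('lab3a.cob')).each { |n, w| puts "line #{n}: #{w} columns" }
```

Running it with both 4 and 8 as the tab width shows whether the two expansion rules disagree for a given file.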

bash awk get numbers in two digits

I want to correct wrong metadata or add missing metadata for the 75 CDs I have ripped from disc.
I got the track info from AllMusic and stripped it down to almost usable "CSV" data.
Number";"1";"Piece";"Nocturne for piano No. 2 in E flat major, Op. 9/2, CT. 109";"Componist";"Frédéric Chopin
MainPiece";"";"Piece";"Symphony No. 9 in E minor ("From the New World"), B. 178 (Op. 95) (first published as No. 5)
Number";"2";"Piece";"Largo";"Componist";"Antonin Dvorák
Number";"3";"Piece";"La plus que lente, waltz for piano (or orchestra), L. 121";"Componist";"Claude Debussy
Number";"4";"Piece";"Waldesrauschen (Forest Murmurs), for piano (Zwei Konzertetuden No. 1), S. 145/1 (LW A218/1)";"Componist";"Franz Liszt
MainPiece";"";"Piece";"Oboe Concerto, for oboe, strings & continuo in D minor, Op. 8/9, RV 454
Number";"5";"Piece";"Allegro";"Componist";"Antonio Vivaldi
Number";"6";"Piece";"Largo";"Componist";"Antonio Vivaldi
Number";"7";"Piece";"Allegro";"Componist";"Antonio Vivaldi
MainPiece";"";"Piece";"Cello Concerto in A major, G. 475
Number";"8";"Piece";"1. Allegro";"Componist";"Luigi Boccherini
Number";"9";"Piece";"2. Adagio";"Componist";"Luigi Boccherini
Number";"10";"Piece";"3. Rondò - Allegro";"Componist";"Luigi Boccherini
MainPiece";"";"Piece";"Serenade No. 12 for winds in C minor ("Nacht Musique"), K. 388 (K. 384a)
Number";"11";"Piece";"Allegro";"Componist";"Wolfgang Amadeus Mozart
Number";"12";"Piece";"Liebesträume, notturno for piano No. 3 in A flat major ("O Lieb, so lang du lieben kannst"), S. 541/3 (LW A103/3)";"Componist";"Franz Liszt
MainPiece";"";"Piece";"Phantasiestücke (4) for violin, cello & piano in A minor, Op. 88
Number";"13";"Piece";"Romanze";"Componist";"Robert Schumann
MainPiece";"";"Piece";"Sinfonia Concertante for violin, cello, oboe, bassoon & orchestra, H. 1/105
Number";"14";"Piece";"Andante";"Componist";"Franz Joseph Haydn
I would like to use awk to turn this into a script that sets the metadata:
eyeD3 -n 01 -a composer -t mainpiece piece 01*.mp3
and also to rename the files:
mv 01*.mp3 01 [composer] mainpiece piece.mp3
The mainpiece / piece part is manual, but I would like to rewrite 1 to 01.
I found something with printf ("%2d" ,$1,$2), but this complains about .mp3.
Does anyone have suggestions for me?
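In printf-style format strings, "%2d" pads a number to width 2 with spaces, while "%02d" pads it with zeros, and awk's printf follows the same convention. The sketch below shows the difference in Ruby on one line of the data above. It is only an illustration of the padding step, and the method name padded_track is hypothetical.

```ruby
# Split one line of the semicolon-separated data and return the track
# number padded to two digits, or nil for lines without a track number.
def padded_track(line)
  fields = line.split('";"')
  return nil unless fields[0] == 'Number'

  format('%02d', Integer(fields[1], 10))
end

line = 'Number";"6";"Piece";"Largo";"Componist";"Antonio Vivaldi'
puts padded_track(line)       # prints 06
puts format('%2d', 6).inspect # prints " 6" (space padded, not zero padded)
```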

Ruby ARGF & RegEx: How to split on paragraph carriage return "\r\n" but not end of line "\r\n"

I am trying to pre-process some text using regex in ruby to input into a mapper job and would like to split on the carriage return denoting the paragraph.
The text will be coming into the mapper using ARGF.each as part of a hadoop streaming job
"\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
"\r\n" # <----- this is where I would like to split
"Precisely such had the paragraph originally stood from the printer's\r\n"
Once I have done this I will chomp the newline /carriage return of each line.
This will look something like this:
ARGF.each do |text|
paragraph = text.split(INSERT_REGEX_HERE)
#some more blah will happen beyond here
end
UPDATE:
The desired output then is an array as follows:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,\r\n"
"daughter of James Stevenson, Esq. of South Park, in the county of\r\n"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born\r\n"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,\r\n"
"1789\"\r\n"
[1] "Precisely such had the paragraph originally stood from the printer's\r\n"
]
Ultimately what I want is the following array with no carriage returns within the array:
[
[0] "\"Walter Elliot, born March 1, 1760, married, July 15, 1784, Elizabeth,"
"daughter of James Stevenson, Esq. of South Park, in the county of"
"Gloucester, by which lady (who died 1800) he has issue Elizabeth, born"
"June 1, 1785; Anne, born August 9, 1787; a still-born son, November 5,"
"1789\""
[1] "Precisely such had the paragraph originally stood from the printer's"
]
Thanks in advance for any insights.
Beware: when you do ARGF.each do |text|, text will be each single line, NOT the whole text block.
You can give ARGF.each a custom line separator; it will then return two "lines", which are the two paragraphs in your case.
Try this:
paragraphs = ARGF.each("\r\n\r\n").map{|p| p.gsub("\r\n","")}
First, split input into two paragraphs, then use gsub to remove unwanted line breaks.
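The separator approach can be tried without a Hadoop job by reading from a StringIO, whose each_line accepts a separator in the same way. The sketch below is a variation on the line above rather than the answer's exact code: instead of deleting every "\r\n" (which would glue "Elizabeth," and "daughter" together), it splits each paragraph into its lines, which matches the nested array shown in the question. The method name paragraphs is hypothetical.

```ruby
require 'stringio'

# Read paragraph by paragraph (separator "\r\n\r\n"), then split every
# paragraph into its lines and drop the empty leftovers.
def paragraphs(io)
  io.each_line("\r\n\r\n").map do |para|
    para.split("\r\n").reject(&:empty?)
  end
end

input = StringIO.new("line one,\r\nline two\r\n\r\nsecond paragraph\r\n")
p paragraphs(input)
# prints [["line one,", "line two"], ["second paragraph"]]
```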
To split the text use:
result = text.gsub(/(?<!\")\\r\\n|(?<=\\\")\\r\\n/, '').split(/[\r\n]+\"\\r\\n\".*?[\r\n]+/)

Splitting a String with Pig

I have a string in the following format:
Sat, 09 Jul 2011 05:38:24 GMT
I would like output like this:
09 Jul 2011
05:38:24
Thanks.
[EDIT]
I have tried many solutions, but I keep getting errors, so I will re-explain the problem. I have an XML file containing a node with the value Tue, 05 Jul 2011 10:10:30 GMT, from which I would like to extract two separate strings as illustrated above.
I have tried this code:
register /usr/lib/pig/piggybank.jar;
items = LOAD ' depeche/2011_7_10_12_30_rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
source_name = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>(.*)</pubDate>', 1) AS pubdate:chararray,
sortie = FOREACH pubdate GENERATE SUBSTRING((chararray)$0, 4, 25);
illustrate sortie;
error:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 21, column 333> mismatched input '=' expecting SEMI_COLON
EDITED ANSWER:
That example is a bit more clear ... I grabbed an RSS feed example, and did a quick test. The code below worked using a sample which contained all of the elements in your example above. I used REGEX_EXTRACT instead of SUBSTRING to get the pubdate, however.
--rss.pig
REGISTER piggybank.jar
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);
data = FOREACH items GENERATE REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
REGEX_EXTRACT(item, '<pubDate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubDate>', 1) AS pubdate:chararray;
dump data;
--rss.txt
<rss version="2.0">
<channel>
<title>News</title>
<link>http://www.hannonhill.com</link>
<description>Hannon Hill News</description>
<language>en-us</language>
<pubDate>Tue, 10 Jun 2003 04:00:00 GMT</pubDate>
<generator>Cascade Server</generator>
<webMaster>webmaster@hannonhill.com</webMaster>
<item>
<title>News Item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>Description of news item 1 here.</description>
<pubDate>Tue, 03 Jun 2003 09:39:21 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>
<item>
<title>News Item 2</title>
<link>http://www.hannonhill.com/news/item2.html</link>
<description>Description of news item 2 here.</description>
<pubDate>Fri, 30 May 2003 11:06:42 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item2.html</guid>
</item>
<item>
<title>News Item 3</title>
<link>http://www.hannonhill.com/news/item3.html</link>
<description>Description of news item 3 here.</description>
<pubDate>Tue, 20 May 2003 08:56:02 GMT</pubDate>
<guid>http://www.hannonhill.com/news/item3.html</guid>
</item>
</channel>
</rss>
Results for rss.pig:
(http://www.hannonhill.com/news/item1.html,News Item 1,Description of news item 1 here.,03 Jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,News Item 2,Description of news item 2 here.,30 May 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,News Item 3,Description of news item 3 here.,20 May 2003 08:56:02)
ORIGINAL ANSWER:
There are several methods that would work here, so I'll cover two: SUBSTRING and REGEX_EXTRACT.
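If your string length is constant, then you can use the builtin SUBSTRING function, which takes a zero-based start index and an exclusive stop index. Think of it like the cut command in Linux. The start index is 5 because index 4 is the space after the comma.
OUTPUT = FOREACH INPUT GENERATE SUBSTRING((chararray)$0, 5, 25);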
Otherwise, you can use the builtin REGEX_EXTRACT to pull the string that you're looking for. Given the example, the easiest regex match that I came up with was to begin the string with the first digit, and end with the last digit, capturing all characters in between.
OUTPUT = FOREACH INPUT GENERATE REGEX_EXTRACT((chararray)$0, '([\d].*[\d])', 1);
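Both ideas can be checked quickly outside Pig. The Ruby sketch below is not Pig code; it only illustrates the fixed-offset slice and a regular expression with two capture groups, one for the date and one for the time, which is the split the question asks for. The constant and method names are hypothetical. In the sample string, index 4 is the space after the comma, so the slice starts at index 5.

```ruby
# Two capture groups: the date part and the time part of an RFC 822 style date.
PUBDATE = /(\d{2} [A-Za-z]{3} \d{4}) (\d{2}:\d{2}:\d{2})/

# Returns [date, time], or nil when the text does not match.
def split_pubdate(text)
  match = PUBDATE.match(text)
  match && [match[1], match[2]]
end

sample = 'Sat, 09 Jul 2011 05:38:24 GMT'
puts sample[5...25]           # prints 09 Jul 2011 05:38:24
p split_pubdate(sample)       # prints ["09 Jul 2011", "05:38:24"]
```

The regular expression does not depend on fixed positions, so it may be the more forgiving choice if the day name or spacing ever varies.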
