How do I extract text from a web page with <br /> tags using Hpricot? - ruby

I'm trying to parse an HTML file using Hpricot and Ruby, but I'm having issues extracting "free floating" text which is not enclosed in tags like <p></p>.
require 'hpricot'
text = <<SOME_TEXT
Testing:<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
SOME_TEXT
parsed = Hpricot(text)
parsed = parsed.search('//a[#href="http://www.somelink.com/foo/bar.html"]').first.following_siblings
puts parsed
I would expect the result to be
<br />
line 1<br />
line 2<br />
line 3<br />
line 4<br />
line 5<br />
<b>Here's some more text</b>
But I am getting
<br />
<br />
<br />
<br />
<br />
<br />
<b>Here's some more text</b>
How can I make Hpricot return line 1, line 2, etc?

Your first step is to read the following_siblings documentation:
Find sibling elements which follow the current one. Like the other “sibling” methods, this weeds out text and comment nodes.
Then you should use the Hpricot source to generalize how following_siblings works to get something that works like following_siblings but doesn't filter out non-container nodes:
parsed = Hpricot(text)
link = parsed.search('//a[#href="http://www.somelink.com/foo/bar.html"]').first
link_sibs = link.parent.children
what_you_want = link_sibs[link_sibs.index(link) + 1 ... link_sibs.length]
puts what_you_want
That's pretty much following_siblings with parent.children instead of parent.containers. Having access to the source code of the libraries you use is pretty handy and studying it is to be encouraged.

It's been a while since I've used Hpricot but here's some things I remember that might help:
The quick way to get all the text:
irb(main):023:0> print parsed.inner_text
Testing:
line 1
line 2
line 3
line 4
line 5
Here's some more text
The downside to that is you get the text embedded in tags too.
Similarly, we can search for all 'text()' nodes:
irb(main):033:0> puts (parsed / 'text()')
Testing:
line 1
[...]
line 5
So, we can do this:
irb(main):036:0> puts (parsed / 'text()')[2 .. -3]
line 1
line 2
line 3
line 4
line 5
or:
irb(main):037:0> (parsed / 'text()')[2 .. -3]
=> #<Hpricot::Elements["\n line 1", " \n line 2", "\n line 3", "\n line 4", "\n line 5", "\n "]>
or:
irb(main):039:0> (parsed / 'text()')[2 .. -3].map{ |t| t.inner_text.strip }
=> ["line 1", "line 2", "line 3", "line 4", "line 5", ""]
The main idea for grabbing data/text from a web page is look for landmarks you can use to navigate through the page. Often we can grab text from inside a <div> or <p> tag. If a page doesn't give you landmarks you have to use other tricks; Looking for a series of text nodes followed by <br> nodes maybe, or the five lines following an <a> tag with a certain href attribute. That's the fun and challenge of dealing with HTML.
In the back of my mind there's a nagging thought that there is a more elegant way to do this, but this seems to be working. Dig around on the Hpricot Challenge page for variations on themes on digging out content.

Related

Clarification of Nokogiri::NodeSet XML Content based on 'puts node' and 'puts node.inspect'

I rarely use xpath() but when I do I keep tripping myself up on interpreting content of Nokogiri::Nodesets and believe I now know where I have always gone wrong.
Simply put when I do a 'puts NodeSet' I have always assumed that I could search the Nodeset based on the returned XML. But the first tag returned does not appear to actually part of the node XML.
'puts n1' returns XML that has a SPAN as the first element of the XML, but if I then do an search n1.xpath('SPAN') or n1.xpath('SPAN/DIV') no nodes are found. n1.xpath('DIV') returns the output I expect and proves no SPAN tag in the XML.
The only way I can logically explain this to myself is if assume that the first xml tag of a 'puts node' is the "Node Name" and not part of the node XML. This works for me going forward but am I missing something that is going to bite me elsewhere.
CODE:
docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep this works for me.
'nodeset = xpath(tag1/tag2)' returns a 'nodeset' containing member node 'tag2'
'puts nodeset' displays the 'tag2' node member
'nodeset.xpath('*')' returns the content of 'tag2
'nodeset.xpath('tag2')' invalid as 'tag2' is not part of the content of 'tag2'

XPath: Matching a text between two similar tags

I'm trying to scrape a website with messy structure, the text I'm requiring is laying between the first 5 consecutive br tags (No more and no less, exactly 5) and the following 2 consecutive br tags.
It looks like this:
<p class="A">
"Some text"
<br>
"Some text"
<br>
<br>
"Some text"
<br>
<br>
<br>
<br>
<br>
"Required text"
<br>
"Required text"
<br>
"Required text"
<br>
<br>
</p>
Scrapy converts <br> tags to newline characters, so you can just extract the whole text and split it at 5 newline characters:
> text = sel.xpath('//text()').extract()
['\n"Some text"\n', '\n"Some text"\n', ...]
> values = ''.join(text).split('\n\n\n\n\n')[1]
'\n"Required text"\n\n"Required text"\n\n"Required text"\n\n\n'
> values.strip().split('\n\n')
['"Required text"', '"Required text"', '"Required text"']

Parsing with ruby

I am new to ruby and I have a school project were I am parsing a xml file and need to get data after certain tags. I can only use core ruby. No gems
pFile = File.open("myfile.mzML", "r")
regmsLvl = "ms level\" value=\""
pFile.each_line { |line|
scn = line.scan(/#{regmsLvl}(\d)/)
#what I want to do but doesn't work
if scn == 1
puts("Got it!")
end
#what I have to do to compare if == 1
if scn != nil
scn.each do |val|
if val[0].to_i == 1
puts("Got it!")
end
end
end
}
# a sample line that I am parsing is:
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />
This seems silly.
line.scans out put makes scn a 2d array. How can I just have it be a string that gets overridden each pass. Or how should I change this whole thing. Any suggestions are appreciated.
puts(scn) prints out the 1 but if I do scn == 1 or scn.to_i == 1 it never gets into the if. I have tried scn.pop and scn.pop.pop
I have added a section to show what I am trying to do now.
I need to check the ms level: if 1 then get scan start time and then the binary. This is the code that I am now working with.
xmlfile = File.new("afile.mzML")
xmldoc = Document.new(xmlfile)
root = xmldoc.root
puts "Root element : " + root.attributes["xmlns"]
xmldoc.elements.each("mzML/run/spectrumList/spectrum/cvParam"){
|e| if e.attributes["value"].to_i ==1
# Now I need to get start time: #
["mzML/run/spectrumList/spectrum/cvParam/scanList/scan/value"]
# and then
["mzML/run/spectrumList/spectrum/cvParam/binaryDataArrayList/binaryDataArray/binary"]
end
}
<run id="ru_0" defaultInstrumentConfigurationRef="ic_0" sampleRef="sa_0" defaultSourceFileRef="sf_ru_0">
<spectrumList count="3310" defaultDataProcessingRef="dp_sp_0">
<spectrum id="scan=8839" index="0" defaultArrayLength="171" dataProcessingRef="dp_sp_0">
<cvParam cvRef="MS" accession="MS:1000525" name="spectrum representation" />
<cvParam cvRef="MS" accession="MS:1000511" name="ms level" value="1" />
<cvParam cvRef="MS" accession="MS:1000294" name="mass spectrum" />
<cvParam cvRef="MS" accession="MS:1000130" name="positive scan" />
<scanList count="1">
<cvParam cvRef="MS" accession="MS:1000795" name="no combination" />
<scan>
<cvParam cvRef="MS" accession="MS:1000016" name="scan start time" value="5429.47" unitAccession="UO:0000010" unitName="second" unitCvRef="UO" />
</scan>
</scanList>
<binaryDataArrayList count="2">
<binaryDataArray encodedLength="1824">
<cvParam cvRef="MS" accession="MS:1000514" name="m/z array" unitAccession="MS:1000040" unitName="m/z" unitCvRef="MS" />
<cvParam cvRef="MS" accession="MS:1000523" name="64-bit float" />
<cvParam cvRef="MS" accession="MS:1000576" name="no compression" />
<binary>AAAAQBCdgkAAAACAP6KCQAAAAAA8pIJAAAAAYAWlgkAAAABgQ6aCQAAAAGCzp4JAAAAAQEaogkAAAACgDKqCQAAAAEAgqoJAAAAAwEOqgkAAAABAWKqCQAAAAGBErIJAAAAAIOetgkAAAABAMLCCQAAAAGDlsYJAAAAA4DeygkAAAACAw7SCQAAAACBauIJAAAAAwFC6gkAAAACAYb6CQAAAAIDnwYJAAAAAwDjHgkAAAAAATMyCQAAAAADnzIJAAAAAAArOgkAAAACgTc6CQAAAAKBqzoJAAAAAQJLPgkAAAACAVNCCQAAAAAAK0oJAAAAAIF7SgkAAAADABNSCQAAAAKAx1YJAAAAAYHXXgkAAAAAg3teCQAAAAOAf2oJAAAAAICbcgkAAAAAAx92CQAAAAKA03oJAAAAAIBXigkAAAABAO+KCQAAAAKCr5YJAAAAAYMnlgkAAAADgK+aCQAAAAKDq6YJAAAAAAC3qgkAAAACgNe6CQAAAAMCA74JAAAAAANL0gkAAAAAAUfiCQAAAAOCt+YJAAAAA4O75gkAAAACAPPqCQAAAAGBq/oJAAAAAwEQCg0AAAABAKAqDQAAAAAAoDoNAAAAA4G0Og0AAAADAZhKDQAAAACCBEoNAAAAAwIQWg0AAAABAjheDQAAAAMA+GoNAAAAAQIYag0AAAAAA7RyDQAAAAEB9HYNAAAAAwIseg0AAAADgbyKDQAAAAAAPJINAAAAAgEUlg0AAAACgYCaDQAAAAOBfKoNAAAAA4DAug0AAAADAZi+DQAAAAAA0MINAAAAAoFMwg0AAAAAgMjKDQAAAACA2NINAAAAAgDk2g0AAAAAg+DyDQAAAAOAfPoNAAAAAAKU/g0AAAAAgQUKDQAAAAKBVQoNAAAAAYNRHg0AAAAAgf0qDQAAAAICZSoNAAAAAIDFQg0AAAAAgM1KDQAAAAEBjUoNAAAAAoGNUg0AAAAAAZ1aDQAAAAABqWINAAAAAYHhZg0AAAACAfl2DQAAAAEAcXoNAAAAAICpfg0AAAADgw2GDQAAAAACmZ4NAAAAAQDRog0AAAABAiWqDQAAAAAAibYNAAAAAQHpug0AAAABAEnKDQAAAAABCcoNAAAAAoHxyg0AAAACgGXaDQAAAAMBDdoNAAAAAgJR2g0AAAAAgHHqDQAAAAEBGeoNAAAAAIHh6g0AAAABAl3qDQAAAAKCkfYNAAAAAYE5+g0AAAAAAm36DQAAAAEDigYNAAAAAQGWCg0AAAABAjYKDQAAAACClgoNAAAAA4ESGg0AAAABgYIaDQAAAAMDSh4NAAAAAYCqIg0AAAADAT4qDQAAAAACCioNAAAAAwJmOg0AAAABAnZKDQAAAAKDJlINAAAAAgHGWg0AAAABgl5eDQAAAAEB4mINAAAAA4B2eg0AAAADgKKCDQAAAAGAvooNAAAAAwJakg0AAAABAUaiDQAAAAGBgqoNAAAAAIBatg0AAAADAxa6DQAAAAKCosoNAAAAAICy6g0AAAAAAbrqDQAAAAACRuoNAAAAAAMa/g0AAAACgOsCDQAAAAABzwoNAAAAAIOTCg0AAAACADcWDQAAAAGB4xoNAAAAAQOfGg0AAAAAAvceDQAAAAEBZyoNAAAAA4OnKg0AAAAAgMs6DQAAAAOC/z4NAAAAAYInUg0AAAABgftaDQAAAAODC1oNAAAAAwJXXg0AAAAAAgdiDQAAAAKA/2oNAAAAAoILag0AAAABghtyDQAAAAGCm3INAAAAAAO7cg0AAAACgr9+DQAAAAGCY4oNAAAAAgDbkg0AAAABAN+WDQAAAAKBU5oNA</binary>
</binaryDataArray>
I think you were pretty close. Assuming you can use that REXML library (which looks like it's part of the core ruby library) you should be able to do this
require 'rexml/document'
xmlfile = File.new("afile.mzML")
xmldoc = REXML::Document.new(xmlfile)
root = xmldoc.root
start_time = nil
binary = nil
# get the ms level
ms_level = root.elements["spectrumList/spectrum/cvParam[#name='ms level']"].attributes["value"].to_i
if ms_level == 1
# get the scan start time
start_time = root.elements["spectrumList/spectrum/scanList/scan/cvParam[#name='scan start time']"].attributes["value"]
# get the binary
binary = root.elements["spectrumList/spectrum/binaryDataArrayList/binaryDataArray/binary"].text
end
p start_time # => "5429.47"
p binary # => that crazy long binary
This REXML tutorial is helpful: http://www.germane-software.com/software/rexml/docs/tutorial.html
Note, I made a few assumptions, like the elements would always exist, the ms level was always an int, the file structure is always the same. Those assumptions may not be true in your situation but this should be a start.

Converting a multi line string to an array in Ruby using line breaks as delimiters

I would like to turn this string
"P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
into an array that looks like in ruby.
["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
using split doesn't return what I would like because of the line breaks.
This is one way to deal with blank lines:
string.split(/\n+/)
For example,
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
string.split(/\n+/)
#=> ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS",
# "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
To accommodate files created under Windows (having line terminators \r\n) replace the regular expression with /(?:\r?\n)+/.
I like to use this as a pretty generic method for handling newlines and returns:
lines = string.split(/\n+|\r+/).reject(&:empty?)
string = "P07091 MMCNEFFEG
P06870 IVGGWECEQHS
SP0A8M0 VVPVADVLQGR
P01019 VIHNESTCEQ"
Using CSV::parse
require 'csv'
CSV.parse(string).flatten
# => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Another way using String#each_line :-
ar = []
string.each_line { |line| ar << line.strip unless line == "\n" }
ar # => ["P07091 MMCNEFFEG", "P06870 IVGGWECEQHS", "SP0A8M0 VVPVADVLQGR", "P01019 VIHNESTCEQ"]
Building off of #Martin's answer:
lines = string.split("\n").reject(&:blank?)
That'll give you only the lines that are valued
Split can take a parameter in the form of the character to use to split, so you can do:
lines = string.split("\n")
I think it should be noted that in some situations, line breaks can include not only newlines (\n) but also carriage returns (\r) and that there could potentially be any combination or quantity thereof. Let's take the following string for example:
str = "Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4... \n
Useful Line 5\r \n
Useful Line 6\n\r
Useful Line 7\n\r\n\r
Useful Line 8 \r\n\r\n
Useful Line 9\r\r\r Useful Line 10\n\n\n\n\nUseful Line 11 \r Useful Line 12"
To deal with all instances of \n and \r, I would do the following to replace all instances of \r with \n using gsub, and then I would combine all consecutive instances of \n using squeeze(arg):
str.gsub("\r", "\n").squeeze("\n")
which would result in :
#=>
"Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Useful Line 9
Useful Line 10
Useful Line 11
Useful Line 12"
...which brings me to our next issue. Sometimes those extra line breaks contain unwanted whitespace and not truly blank or empty lines. To deal with not only line breaks but also unwanted empty lines, I would add the each_line, reject, and strip method like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join
which would result in the desired string:
#=>
Useful Line 1 ....
Useful Line 2
Useful Line 3
Useful Line 4...
Useful Line 5
Useful Line 6
Useful Line 7
Useful Line 8
Usefule Line 9
Useful Line 10
Useful Line 11
Useful Line 12
Now more specifically to the OP, we could then simply use split("\n") to finish it all off (as was already mentioned by others):
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.join.split("\n")
or we could simply skip straight to the desired array by replacing each_line with map and leaving off the unnecessary join like so:
str.gsub("\r", "\n").squeeze("\n").split("\n").map.reject{|x| x.strip == ""}
both of which would result in:
#=>
["Useful Line 1 ....", " Useful Line 2", "Useful Line 3", " Useful Line 4... ", "Useful Line 5", " Useful Line 6", "Useful Line 7", " Useful Line 8 ", "Usefule Line 9", " Useful Line 10", "Useful Line 11 ", " Useful Line 12"]
NOTE:
You may also want to strip off leading and trailing whitespace from each line in which case we could replace .join.split("\n") with .map(&:strip) like so:
str.gsub("\r", "\n").squeeze("\n").each_line.reject{|x| x.strip == ""}.map(&:strip)
or
str.gsub("\r", "\n").squeeze("\n").split("\n").map.reject{|x| x.strip == ""}.map(&:strip)
which would both result in:
#=>
["Useful Line 1 ....", "Useful Line 2", "Useful Line 3", "Useful Line 4...", "Useful Line 5", "Useful Line 6", "Useful Line 7", "Useful Line 8", "Usefule Line 9", "Useful Line 10", "Useful Line 11", "Useful Line 12"]

Should Nokogiri::XML.parse be creating separate Text nodes for linefeeds?

I have an XML document created by an outside tool:
<?xml version="1.0" encoding="UTF-8"?>
<suite>
<id>S1</id>
<name>First Suite</name>
<description></description>
<sections>
<section>
<name>section 1</name>
<cases>
<case>
<id>C1</id>
<title>Test 1.1</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
<case>
<id>C2</id>
<title>Test 1.2</title>
<type>Other</type>
<priority>4 - Must Test</priority>
<estimate></estimate>
<milestone></milestone>
<references></references>
</case>
</cases>
</section>
</sections>
</suite>
From irb, I do the following: (Output suppressed until final command)
> require('nokogiri')
> doc = Nokogiri::XML.parse(open('./test.xml'))
> test_case = doc.search('case').first
=> #<Nokogiri::XML::Element:0x3ff75851bc44 name="case" children=[#<Nokogiri::XML::Text:0x3ff75851b8fc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b7bc name="id" children=[#<Nokogiri::XML::Text:0x3ff75851b474 "C1">]>, #<Nokogiri::XML::Text:0x3ff75851b1cc "\n ">, #<Nokogiri::XML::Element:0x3ff75851b078 name="title" children=[#<Nokogiri::XML::Text:0x3ff75851ad58 "Test 1.1">]>, #<Nokogiri::XML::Text:0x3ff75851aa9c "\n ">, #<Nokogiri::XML::Element:0x3ff75851a970 name="type" children=[#<Nokogiri::XML::Text:0x3ff75851a6c8 "Other">]>, #<Nokogiri::XML::Text:0x3ff7585191d8 "\n ">, #<Nokogiri::XML::Element:0x3ff7585190d4 name="priority" children=[#<Nokogiri::XML::Text:0x3ff758518d64 "4 - Must Test">]>, #<Nokogiri::XML::Text:0x3ff758518ad0 "\n ">, #<Nokogiri::XML::Element:0x3ff7585189a4 name="estimate">, #<Nokogiri::XML::Text:0x3ff758518670 "\n ">, #<Nokogiri::XML::Element:0x3ff758518558 name="milestone">, #<Nokogiri::XML::Text:0x3ff7585182b0 "\n ">, #<Nokogiri::XML::Element:0x3ff758518184 name="references">, #<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">]>
This results in a number of children that look like the following:
#<Nokogiri::XML::Text:0x3ff758517ef0 "\n ">
I want to iterate through these XML nodes without having to do something like:
> real_nodes = test_case.children.reject{|n| n.node_name == 'text' && n.content.strip!.empty?}
I couldn't find a parse parameter in the Nokogiri docs to suppress the treating of newlines as separate nodes. Is there a way to do this during the parse instead of after?
Check the documentation. You can just do this:
doc = Nokogiri::XML.parse(open('./test.xml')) do |config|
config.noblanks
end
That will load the file without any empty nodes.
The text nodes are the result of pretty-printing the XML. The spec doesn't require whitespace between tags, and, for efficiency, a huge XML file could be stripped of inter-tag whitespace to save space and reduce transfer time, without sacrificing the data content.
This might show what's happening:
require 'nokogiri'
xml = '<foo></foo>'
Nokogiri::XML(xml).at('foo').child
=> nil
With no whitespace between the tags there's no text node either.
xml = '<foo>
</foo>'
Nokogiri::XML(xml).at('foo').child
=> #<Nokogiri::XML::Text:0x3fcee9436ff0 "\n">
doc.at('foo').child.class
=> Nokogiri::XML::Text
With whitespace for pretty-printing, the XML has a text node following the foo tag.

Resources