How to remove white space from HTML text - ruby

How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text

Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
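Note that strip only trims the ends of the string, so the line break and indentation before "10" would remain. A minimal sketch of two ways to clean the inner whitespace as well, using plain String methods; the helper names collapse_whitespace and clean_lines are made up for illustration, and the raw string is only an assumption about what text might return:

```ruby
# Collapse every run of whitespace (including newlines) into one space.
def collapse_whitespace(text)
  text.split.join(' ')
end

# Keep the line structure, but trim each line and drop empty ones.
def clean_lines(text)
  text.lines.map(&:strip).reject(&:empty?)
end

# Assumed stand-in for the value returned by `.text`.
raw = "Kühlungsborner Straße\n          10\n      "

puts raw.strip                 # only the ends are trimmed
puts collapse_whitespace(raw)  # single line
p clean_lines(raw)             # one entry per non-empty line
```

Which variant fits depends on whether the street and house number should end up on one line or stay separate.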

Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]

Related

How to extract text from HTML without concatenating it using Nokogiri and Ruby [duplicate]

When I scrape several related nodes from HTML or XML to extract the text, all the text is joined into one long string, making it impossible to recover the individual text strings.
For instance:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
But what I want is:
["foo", "bar", "baz"]
The same happens when scraping XML:
doc = Nokogiri::XML(<<EOT)
<root>
<block>
<entries>foo</entries>
<entries>bar</entries>
<entries>baz</entries>
</block>
</root>
EOT
doc.search('entries').text # => "foobarbaz"
Why does this happen and how do I avoid it?
This is an easily solved problem that results from not reading the documentation about how text behaves when used on a NodeSet versus a Node (or Element).
The NodeSet documentation says text will:
Get the inner text of all contained Node objects
Which is what we're seeing happen with:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
<p>baz</p>
</body>
</html>
EOT
doc.search('p').text # => "foobarbaz"
because:
doc.search('p').class # => Nokogiri::XML::NodeSet
Instead, we want to get each Node and extract its text:
doc.search('p').first.class # => Nokogiri::XML::Element
doc.search('p').first.text # => "foo"
which can be done using map:
doc.search('p').map { |node| node.text } # => ["foo", "bar", "baz"]
Ruby allows us to write that more concisely using:
doc.search('p').map(&:text) # => ["foo", "bar", "baz"]
The same things apply whether we're working with HTML or XML, as HTML is a more relaxed version of XML.
A Node has several aliased methods for getting at its embedded text. From the documentation:
#content ⇒ Object
Also known as: text, inner_text
Returns the contents for this Node.

Properly separate String elements in an Array [duplicate]

I am trying to parse an HTML page using Nokogiri to get some company names.
names = []
names << Nokogiri::HTML(mypage).css(".name a").text
My result is:
["MikeGetsLeadsUruBlondeLaunch LIVERynoRyderBoyer ProductionsStrangerxCerealLume CubeKatapyMacaulay Outdoor PromotionsFlixit ABMedia MosaicLiftCast.TVcool.mediaPeekKLIKseeStreamingo SolutionsPvgnaalughaUser"]
But what I'd like to get is:
["MikeGetsLeads", "Uru", "Blonde", "Launch LIVE", "RynoRyderBoyer Productions", "Stranger", "xCereal", "Lume Cube", "Katapy", "Macaulay Outdoor Promotions", "Flixit AB", "Media Mosaic", "LiftCast.TV", "cool.media", "Peek", "KLIKsee", "Streamingo Solutions", "Pvgna", "alugha", "User"]
I tried to use .split but it does not give me the right result either. On this page, each name belongs to a <div>, so it's clearly separated in the HTML structure.
The HTML structure looks like this
<div class='name'>
MikeGetsLeads
</div>
The problem is, you are using text with a NodeSet, not with individual nodes. With a NodeSet all the text is concatenated into a single String. Per the NodeSet.inner_text AKA text documentation:
Get the inner text of all contained Node objects
and the actual code is:
def inner_text
collect(&:inner_text).join('')
end
whereas Node.content AKA text or inner_text
Returns the content for this Node
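To see why the joined string cannot be split back apart, here is a toy model of that behavior in plain Ruby. ToyNode and ToyNodeSet are illustrative stand-ins, not Nokogiri classes; only the inner_text body mirrors the implementation quoted above:

```ruby
# Stand-in for a node: it just carries its own text.
ToyNode = Struct.new(:inner_text)

# Stand-in for a NodeSet: an enumerable wrapper around nodes.
class ToyNodeSet
  include Enumerable

  def initialize(nodes)
    @nodes = nodes
  end

  def each(&block)
    @nodes.each(&block)
  end

  # Joins with an empty separator, so node boundaries are lost.
  def inner_text
    collect(&:inner_text).join('')
  end
end

set = ToyNodeSet.new([ToyNode.new('Launch LIVE'), ToyNode.new('Uru')])
p set.inner_text        # one string, no separator between the names
p set.map(&:inner_text) # one string per node
```

Because names such as "Launch LIVE" contain spaces themselves, no split on the joined string can reliably recover the original entries.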
Meditate on this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<div>
<p>foo</p>
<p>bar</p>
</div>
EOT
doc.css('p').class # => Nokogiri::XML::NodeSet
doc.css('p').text # => "foobar"
Instead, you need to use text on individual nodes:
doc.css('p').map{ |n| n.class } # => [Nokogiri::XML::Element, Nokogiri::XML::Element]
doc.css('p').map{ |n| n.text } # => ["foo", "bar"]
The previous line can be simplified to:
doc.css('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
require 'rubygems'
require 'nokogiri'
require 'pp'
names = []
mypage = File.open("myhtml.html", "r")
Nokogiri::HTML(mypage).css(".name a").each do |item|
names << item.text
end
pp names
returns:
["MikeGetsLeads", "MikeGetsLeads2", "MikeGetsLeads3"]

How to export HTML data to a CSV file

I am trying to scrape and make a CSV file from this HTML:
<ul class="object-props">
<li class="object-props-item price">
<strong>CHF 14'800.-</strong>
</li>
<li class="object-props-item milage">31'000 km</li>
<li class="object-props-item date">08.2012</li>
</ul>
I want to extract the price and mileage using:
require 'rubygems'
require 'nokogiri'
require 'CSV'
require 'open-uri'
url= "/tto.htm"
data = Nokogiri::HTML(open(url))
CSV.open('csv.csv', 'wb') do |csv|
csv << %w[ price mileage ]
price=data.css('.price').text
mileage=data.css('.mileage').text
csv << [price, mileage]
end
The result is not really what I'm expecting. Two columns are created, but how can I remove the characters like CHF and km, and why is the mileage data not displayed?
My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.
You could add split.first or split.last to get the number without the unit of measure, e.g.:
2.3.0 :007 > 'CHF 100'.split.last
=> "100"
2.3.0 :008 > '99 km'.split.first
=> "99"
Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex extracts the number, which is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and the space character:
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
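If a plain integer is wanted, for example, the same idea can be taken one step further. A minimal sketch using String#delete with a negated character set; digits_to_i is a hypothetical helper name, and it assumes the values never carry decimal places, since any decimals would be merged into the number:

```ruby
# Keep only the digits and convert them; return nil when none are present.
def digits_to_i(text)
  digits = text.delete('^0-9')
  digits.empty? ? nil : digits.to_i
end

p digits_to_i("CHF 14'900.-") # price without currency or separators
p digits_to_i("61'000 km")    # mileage without the unit
p digits_to_i('n/a')          # no digits at all
```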
why are the data of the mileage not displaying
Because you have a mismatch between the CSS selector and the actual class parameter:
require 'nokogiri'
doc = Nokogiri::HTML(%q(<li class="object-props-item milage">61'000 km</li>))
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is called on a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text from one node from another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. By doing so the .price class no longer exists so your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.

How to get Nokogiri inner_HTML object to ignore/remove escape sequences

Currently, I am trying to get the inner HTML of an element on a page using Nokogiri. However, I'm not just getting the text of the element, I'm also getting its escape sequences. Is there a way I can suppress or remove them with Nokogiri?
require 'nokogiri'
require 'open-uri'
page = Nokogiri::HTML(open("http://the.page.url.com"))
page.at_css("td[custom-attribute='foo']").parent.css('td').css('a').inner_html
this returns => "\r\n\t\t\t\t\t\t\t\tTheActuallyInnerContentThatIWant\r\n\t"
What is the most effective and direct Nokogiri (or Ruby) way of doing this?
page.at_css("td[custom-attribute='foo']")
.parent
.css('td')
.css('a')
.text # since you need the text, not inner_html
.strip # removes leading and trailing whitespace
See String#strip.
Sidenote: css('td a') is likely more efficient than css('td').css('a').
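One caveat worth knowing: String#strip does not treat the non-breaking space (U+00A0, what &nbsp; decodes to) as whitespace. A hedged sketch of a fallback, assuming the scraped text might contain such characters; strip_unicode_space is a made-up helper name:

```ruby
# Trim leading and trailing whitespace, including Unicode spaces such as
# the non-breaking space, which String#strip leaves in place.
def strip_unicode_space(text)
  text.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

# Assumed input: the string from the question with non-breaking spaces added.
raw = "\r\n\t\t\u00A0TheActuallyInnerContentThatIWant\u00A0\r\n\t"

p raw.strip                # the non-breaking spaces survive
p strip_unicode_space(raw) # fully trimmed
```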
It's important to drill in to the closest node containing the text you want. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
</body>
</html>
EOT
doc.at('body').inner_html # => "\n <p>foo</p>\n "
doc.at('body').text # => "\n foo\n "
doc.at('p').inner_html # => "foo"
doc.at('p').text # => "foo"
at, at_css and at_xpath return a Node/XML::Element. search, css and xpath return a NodeSet. There's a big difference in how text or inner_html return information when looking at a Node or NodeSet:
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.at('p') # => #<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>
doc.search('p') # => [#<Nokogiri::XML::Element:0x3fd635cf36f4 name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf3514 "foo">]>, #<Nokogiri::XML::Element:0x3fd635cf32bc name="p" children=[#<Nokogiri::XML::Text:0x3fd635cf30dc "bar">]>]
doc.at('p').class # => Nokogiri::XML::Element
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.at('p').text # => "foo"
doc.search('p').text # => "foobar"
Notice that using search returned a NodeSet and that text returned all the nodes' text concatenated together. This is rarely what you want.
Also notice that Nokogiri is smart enough to figure out whether a selector is CSS or XPath 99% of the time, so using the generic search and at for either type of selector is very convenient.

Nokogiri text node contents

Is there any clean way to get the contents of text nodes with Nokogiri? Right now I'm using
some_node.at_xpath( "//whatever" ).first.content
which seems really verbose for just getting text.
You want only the text?
doc.search('//text()').map(&:text)
Maybe you don't want all the whitespace and noise. If you want only the text nodes containing a word character,
doc.search('//text()').map(&:text).delete_if{|x| x !~ /\w/}
Edit: It appears you only wanted the text content of a single node:
some_node.at_xpath( "//whatever" ).text
Just look for text nodes:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>This is a text node </p>
<p> This is another text node</p>
</body>
</html>
EOT
doc.search('//text()').each do |t|
t.replace(t.content.strip)
end
puts doc.to_html
Which outputs:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<p>This is a text node</p>
<p>This is another text node</p>
</body></html>
BTW, your code example doesn't work. at_xpath( "//whatever" ).first is redundant and will fail. at_xpath will find only the first occurrence, returning a Node. first is superfluous at that point, if it would work, but it won't because Node doesn't have a first method.
I have <data><foo>bar</foo></data>, how do I get at the "bar" text without doing doc.at_xpath( "//data/foo" ).children.first.content?
Assuming doc contains the parsed DOM:
doc.to_xml # => "<?xml version=\"1.0\"?>\n<data>\n <foo>bar</foo>\n</data>\n"
Get the first occurrence:
doc.at('foo').text # => "bar"
doc.at('//foo').text # => "bar"
doc.at('/data/foo').text # => "bar"
Get all occurrences and take the first one:
doc.search('foo').first.text # => "bar"
doc.search('//foo').first.text # => "bar"
doc.search('data foo').first.text # => "bar"

Resources