File IO read by delimiter? - ruby

I have the following script that reads a file and splits it into an array of records, each ending with </h1>. How do I read only the contents between <h1> and </h1>?
This is my script:
out_array = []
open('foo.html') do |f|
f.each('</h1>') do |record|
record.gsub!("\n", ' ')
out_array.push record
end
end
# print array
p out_array
This is my HTML:
</h1>
akwotdfg
<h1>
<h1>I am foo</h1>
<h1>
Stubborn quaz
</h1>
<h3>
iThis
is a reas
long one line shit
</h3>
<h1>I am foo</h1>
This is my output:
["</h1>", " akwotdfg <h1> <h1>I am foo</h1>", " <h1> Stubborn quaz </h1>", " <h3> iThis is a reas long one line shit </h3> <h1>I am foo</h1>", " "]

Take a look at the following code:
out_array = open('foo.html') do |f|
f.read.scan(/<h1>(.*)<\/h1>/)
end
puts out_array
execution result:
I am foo
I am foo
updated for multi-line scan:
out_array = open('tempdir/foo.html') do |f|
f.read.scan(/<h1>([^<]*?)<\/h1>/m)
end
out_array.map! {|e| e[0].strip}
p out_array
execution result:
["I am foo", "Stubborn quaz", "I am foo"]
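The multi-line pattern can be checked without a file. This is only a minimal sketch, assuming the HTML is already in a string; the h1_texts helper and the SAMPLE_HTML constant are names made up for illustration, and the sample is a shortened copy of the HTML from the question:

```ruby
# Hypothetical helper: returns the stripped text of every <h1>...</h1> pair
# whose content contains no other tag.
def h1_texts(html)
  # [^<]*? cannot cross a "<", so an unclosed <h1> is skipped instead of
  # swallowing the tags that follow it. The negated class already matches
  # newlines; /m (which only changes ".") is kept to mirror the answer.
  html.scan(/<h1>([^<]*?)<\/h1>/m).flatten.map(&:strip)
end

SAMPLE_HTML = <<~HTML
  </h1>
  akwotdfg
  <h1>
  <h1>I am foo</h1>
  <h1>
  Stubborn quaz
  </h1>
  <h3>
  long one line
  </h3>
  <h1>I am foo</h1>
HTML

p h1_texts(SAMPLE_HTML)
# => ["I am foo", "Stubborn quaz", "I am foo"]
```

Any <h1> that contains nested markup is silently dropped by this pattern, which is one reason it only suits input you control.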

Don't use regular expressions to deal with HTML or XML. For trivial content you manage it's possible, but your code becomes liable to break for anything that can change at someone else's bidding.
Instead use a parser, like Nokogiri:
require 'nokogiri'
html = '
</h1>
akwotdfg
<h1>
<h1>I am foo</h1>
<h1>
Stubborn quaz
</h1>
<h3>
iThis
is a reas
long one line
</h3>
<h1>I am foo</h1>
'
doc = Nokogiri::HTML(html)
h1_contents = doc.search('h1').map(&:text)
puts h1_contents
Which outputs:
# >>
# >> I am foo
# >>
# >> Stubborn quaz
# >>
# >>
# >> iThis
# >> is a reas
# >> long one line
# >>
# >> I am foo
# >> I am foo
# >>
# >> Stubborn quaz
# >>
# >> I am foo
Notice that Nokogiri is returning the content inside the <h3> block. This is correct/expected behavior because the HTML is malformed. Nokogiri fixes malformed HTML in an attempt to help retrieve usable content, but because there are many possible locations for the closing tag, Nokogiri inserts the closing tag at the last location that would be syntactically correct. Humans know to do it earlier, but this is software trying to be helpful.
This situation requires you to preprocess the HTML to make it correct. I'm using a single, simple, sub to fix the first <h1> found:
doc = Nokogiri::HTML(html.sub(/^(<h1>)$/, '\1</h1>'))
h1_contents = doc.search('h1').map(&:text)
puts h1_contents
# >> I am foo
# >>
# >> Stubborn quaz
# >> I am foo

Related

Ruby - gsub br tags to \n\n for API, but including any whitespace

I've got <br> tags in my client's data that I need to replace with '\n\n' in my Rails API for a React Native app.
Sometimes there are spaces before or after the <br> tag, or both.
I'm looking for a gsub to say "any <br> tag, and also include any whitespace before or after it, replace with '\n\n'".
Right now I'm doing:
module ApiHelper
def parse_newlines(string)
string = string.gsub('<br>', '\n\n')
string = string.gsub(' <br>', '\n\n')
string = string.gsub('<br> ', '\n\n')
string = string.gsub(' <br> ', '\n\n')
end
end
Is there something cleaner?
EDIT: Thanks all. I want to accept both Gavin's and the Tin Man's answers...Gavin because he gave me the down and dirty solution, but Tin Man for such a great/in depth explanation on a more robust way using Nokogiri...
2nd EDIT: I take it back. Tin man...using Nokogiri is actually much more readable. Your argument about using regex's in your comment is valid. In the end your code is easier to understand. Giving you the accepted answer, even though I am using Gavin's for now.
This'll do it:
module ApiHelper
def parse_newlines(string)
# Handles <br>, <br/>, <br />
string.gsub(/\s*<br\s*?\/?>\s*/, "\n\n")
end
end
# irb
> parse_newlines(" <br> ")
=> "\n\n"
> parse_newlines(" <br /> ")
=> "\n\n"
> parse_newlines("<br />")
=> "\n\n"
When messing with HTML or XML it's better to use a parser. I'd start with:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>this<br>is<br> some <br>
text<br>and<br >some<br/>more</p>
EOT
doc.search('br').each { |br| br.replace("\n\n") }
doc.to_html
# => "<p>this\n" +
# "\n" +
# "is\n" +
# "\n" +
# " some \n" +
# "\n" +
# "\n" +
# "text\n" +
# "\n" +
# "and\n" +
# "\n" +
# "some\n" +
# "\n" +
# "more</p>\n"
Whitespace in HTML displayed by a browser is gobbled by the browser so space runs, or multiple returns will be reduced to a single space or a single line unless you wrap it with <pre> tags or do something similar.
If you absolutely need to strip spaces before and after where you're inserting new-lines, I'd use an extra step:
require 'nokogiri'
doc = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<p>this<br>is<br> some <br>
text<br>and<br >some<br/>more</p>
EOT
doc.search('p').each do |p|
p.inner_html = p.inner_html.gsub(/ +</, '<').gsub(/> +/, '>')
end
doc.to_html
# => "<p>this<br>is<br>some<br>\n" +
# "text<br>and<br>some<br>more</p>\n"
doc.search('br').each { |br| br.replace("\n\n") }
doc.to_html
# => "<p>this\n" +
# "\n" +
# "is\n" +
# "\n" +
# "some\n" +
# "\n" +
# "\n" +
# "text\n" +
# "\n" +
# "and\n" +
# "\n" +
# "some\n" +
# "\n" +
# "more</p>\n"
Note: Technically, <br> is equivalent to a single "\n", not "\n\n". <p> would be two new-lines because that constitutes a paragraph.
You can try with:
string = 'Lorem <br> Ipsum'
puts string.gsub(/\s(<br>)\s/, '\n\n')
# => Lorem\n\nIpsum
puts string.gsub(/\s(<br>)\s/, "\n\n")
# Lorem
#
# Ipsum
And note the difference between '\n\n' and "\n\n".
module ApiHelper
def parse_newlines(string)
string.gsub(/\s*<br>\s*/, "\n\n")
end
end
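The quoting difference is easy to verify by counting characters. A small sketch; the constant names are only for illustration:

```ruby
# Single quotes keep the backslash: the string is \, n, \, n.
SINGLE_QUOTED = '\n\n'
# Double quotes interpret the escape: the string is two newline characters.
DOUBLE_QUOTED = "\n\n"

p SINGLE_QUOTED.length  # => 4
p DOUBLE_QUOTED.length  # => 2

# Using the double-quoted form as the replacement yields real newlines.
p 'Lorem <br> Ipsum'.gsub(/\s*<br>\s*/, DOUBLE_QUOTED)
# => "Lorem\n\nIpsum"
```

That is why the original parse_newlines, which used '\n\n', put literal backslashes into the API output.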

How to remove white space from HTML text

How do I remove spaces in my code? If I parse this HTML with Nokogiri:
<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>
I get the following output:
Kühlungsborner Straße
10
which is not left-justified.
My code is:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text
Please try strip:
address_street = page_detail.xpath('//div[@class="address-thoroughfare mobile-inline-comma ng-binding"]').text.strip
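Keep in mind that strip only trims the ends of the string; a newline and indentation in the middle of the text stay. If the goal is a single clean line, one option is to split on whitespace and rejoin. This is a sketch on a plain string that assumes the node text looks like the question's output; collapse_whitespace is a made-up helper name:

```ruby
# Assumed shape of the text returned for the <div> in the question.
RAW_ADDRESS = "Kühlungsborner Straße\n                    10\n                "

# Hypothetical helper: collapse every run of whitespace (spaces, tabs,
# newlines) into one space and drop leading/trailing whitespace.
def collapse_whitespace(text)
  text.split.join(' ')
end

p RAW_ADDRESS.strip
# => "Kühlungsborner Straße\n                    10"
p collapse_whitespace(RAW_ADDRESS)
# => "Kühlungsborner Straße 10"
```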
Consider this:
require 'nokogiri'
doc = Nokogiri::HTML('<div class="address-thoroughfare mobile-inline-comma ng-binding">Kühlungsborner Straße
10
</div>')
doc.search('div').text
# => "Kühlungsborner Straße\n 10\n "
puts doc.search('div').text
# >> Kühlungsborner Straße
# >> 10
# >>
The given HTML doesn't replicate the problem you're having. It's really important to present valid input that duplicates the problem. Moving on....
Don't use xpath, css or search with text. You usually won't get what you expect:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<span>foo</span>
<span>bar</span>
</div>
</body>
</html>
EOT
doc.search('span').class # => Nokogiri::XML::NodeSet
doc.search('span') # => [#<Nokogiri::XML::Element:0x3fdb6981bcd8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981b5d0 "foo">]>, #<Nokogiri::XML::Element:0x3fdb6981aab8 name="span" children=[#<Nokogiri::XML::Text:0x3fdb6981a054 "bar">]>]
doc.search('span').text
# => "foobar"
Note that text returned the concatenated text of all nodes found.
Instead, walk the NodeSet and grab the individual node's text:
doc.search('span').map(&:text)
# => ["foo", "bar"]

How to export HTML data to a CSV file

I am trying to scrape and make a CSV file from this HTML:
<ul class="object-props">
<li class="object-props-item price">
<strong>CHF 14'800.-</strong>
</li>
<li class="object-props-item milage">31'000 km</li>
<li class="object-props-item date">08.2012</li>
</ul>
I want to extract the price and mileage using:
require 'rubygems'
require 'nokogiri'
require 'CSV'
require 'open-uri'
url= "/tto.htm"
data = Nokogiri::HTML(open(url))
CSV.open('csv.csv', 'wb') do |csv|
csv << %w[ price mileage ]
price=data.css('.price').text
mileage=data.css('.mileage').text
csv << [price, mileage]
end
The result is not really what I'm expecting. Two columns are created, but how can I remove the characters like CHF and km, and why is the mileage data not displaying?
My guess is that the text in the HTML includes units of measure; CHF for Swiss Francs for the price, and km for kilometers for the mileage.
You could add split.first or split.last to get the number without the unit of measure, e.g.:
2.3.0 :007 > 'CHF 100'.split.last
=> "100"
2.3.0 :008 > '99 km'.split.first
=> "99"
Removing/ignoring the unwanted text is not a Nokogiri problem, it's a String processing problem:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>
EOT
str = doc.at('strong').text # => "CHF 14'900.-"
At this point str contains the text of the <strong> node.
A simple regex will extract the digits, which is the most straightforward way to grab the data:
str[/[\d']+/] # => "14'900"
sub could be used to remove the 'CHF ' substring:
str.sub('CHF ', '') # => "14'900.-"
delete could be used to remove the characters C, H, F and space:
str.delete('CHF ') # => "14'900.-"
tr could be used to remove everything that is NOT 0..9, ', . or -:
str.tr("^0-9'.-", '') # => "14'900.-"
Modify one of the above if you don't want ', . or -.
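If the CSV should hold plain numbers rather than strings like "14'900", the regex approach can be combined with delete and to_i. A minimal sketch; extract_number is a hypothetical helper, and it assumes the apostrophe is only ever a thousands separator:

```ruby
# Hypothetical helper: take the first run of digits and apostrophes that
# starts with a digit, drop the apostrophes, and convert to an Integer.
# Returns nil when the string contains no digits.
def extract_number(str)
  digits = str[/\d[\d']*/]
  digits && digits.delete("'").to_i
end

p extract_number("CHF 14'900.-")  # => 14900
p extract_number("61'000 km")     # => 61000
p extract_number("n/a")           # => nil
```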
why are the data of the mileage not displaying
Because you have a mismatch between the CSS selector and the actual class parameter:
require 'nokogiri'
doc = Nokogiri::HTML(%q(<li class="object-props-item milage">61'000 km</li>))
doc.at('.mileage').text # =>
# ~> NoMethodError
# ~> undefined method `text' for nil:NilClass
# ~>
# ~> /var/folders/yb/whn8dwns6rl92jswry5cz87dsgk2n1/T/seeing_is_believing_temp_dir20160428-96035-1dajnql/program.rb:5:in `<main>'
Instead it should be:
doc.css('.milage').text # => "61'000 km"
But that's not all that's wrong. There's a subtle problem waiting to bite you later.
css or search returns a NodeSet whereas at or at_css returns an Element:
doc.css('.milage').class # => Nokogiri::XML::NodeSet
doc.at('.milage').class # => Nokogiri::XML::Element
Here's what happens when text is passed a NodeSet containing multiple matching nodes:
doc = Nokogiri::HTML('<p>foo</p><p>bar</p>')
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.at('p').class # => Nokogiri::XML::Element
doc.at('p').text # => "foo"
When text is used with a NodeSet it returns the text of all nodes concatenated into a single string. This can make it really difficult to separate the text from one node from another. Instead, use at or one of the at_* equivalents to get the text from a single node. If you want to extract the text from each node individually and get an array use:
doc.search('p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
Finally, notice that your HTML sample isn't valid:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <p>li class="object-props-item price"
# >> <strong>CHF 14'900.-</strong>
# >> </p>
# >> <li class="object-props-item milage">61'000 km</li>')
# >> </body></html>
Here's what happens:
doc = Nokogiri::HTML(<<EOT)
li class="object-props-item price"
<strong>CHF 14'900.-</strong>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price') # => nil
Nokogiri has to do a fix-up to make sense of the first line, so it wraps it in <p>. By doing so the .price class no longer exists so your code will fail again.
Fixing the tag results in a correct response:
doc = Nokogiri::HTML(<<EOT)
<li class="object-props-item price">
<strong>CHF 14'900.-</strong>
</li>
<li class="object-props-item milage">61'000 km</li>')
EOT
doc.at('.price').to_html # => "<li class=\"object-props-item price\">\n<strong>CHF 14'900.-</strong>\n</li>"
This is why it's really important to make sure your input is valid. Trying to duplicate your problem is difficult without it.

Deleting between tags and using variables in regex in `gsub`

My @outbound_text looks something like this:
<CREATE-EVENT>\n\t\t\t\t<COLLECTION>PAM</COLLECTION>\n\t\t\t\t<EVENT-TYPE>survey_answer</EVENT-TYPE>\n\t\t\t\t<JSON-STRING>\n\t\t\t\t\t{\n\t\t\t\t\t question1:done,\n\t\t\t\t\t question2:done,\n\t\t\t\t\t question3:done,\n\t\t\t\t\t question4:done,\n\t\t\t\t\t question5:done,\n\t\t\t\t\t question6:done\n\t\t\t\t\t}\n\t\t\t\t</JSON-STRING>\n\t\t\t</CREATE-EVENT>\n\n\t\t\t\n <EMAIL>\n <ADDRESS>bot_client_id</ADDRESS>\n <SUBJECT>PAM responses for Wednesday October 07</SUBJECT>\n <BODY>\nHi, there
I want to remove everything between <CREATE-EVENT> and </CREATE-EVENT>.
I tried the following, where tag is "CREATE-EVENT":
open_tag = "<" + tag + ">"
close_tag = "</" + tag + ">"
@outbound_text.gsub!(/#{open_tag}/(.*)\/#{close_tag}/, '')
The following is what variable substitution into a regex looks like:
/#{open_tag}.*#{close_tag}/, ...
Pretend that the opening / and closing / of the regex are double quote marks and have at it.
Here's a full example:
tag = 'CREATE-EVENT'
open_tag = "<#{tag}>"
close_tag = "</#{tag}>"
any_text = ".*"
html_tag = /#{open_tag}
#{any_text}
#{close_tag}/xm
@outbound_text = %q{
hello
<CREATE-EVENT>
<COLLECTION>PAM</COLLECTION>
<EVENT-TYPE>
</CREATE-EVENT>
world
}
p @outbound_text.gsub!(html_tag, '')
--output:--
"\nhello\n \nworld\n"
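Two refinements are worth considering when a tag name is interpolated into a pattern like this. Regexp.escape guards against tag names that contain regex metacharacters, and a lazy .*? stops at the first closing tag instead of the last one. A sketch under those assumptions; remove_tag is a made-up helper name:

```ruby
# Hypothetical helper: delete every <tag>...</tag> span, tags included.
def remove_tag(text, tag)
  open_tag  = Regexp.escape("<#{tag}>")
  close_tag = Regexp.escape("</#{tag}>")
  # .*? is lazy, so each opening tag pairs with the nearest closing tag;
  # /m lets "." match the newlines inside the span.
  text.gsub(/#{open_tag}.*?#{close_tag}/m, '')
end

input = "a<CREATE-EVENT>\n<x>1</x>\n</CREATE-EVENT>b<CREATE-EVENT>2</CREATE-EVENT>c"
p remove_tag(input, 'CREATE-EVENT')
# => "abc"
```

With a greedy .* the same input would lose the "b" as well, because the match would run from the first opening tag to the last closing tag.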
When dealing with XML or HTML, don't use regular expressions unless the markup is extremely trivial and you own the task of generating it. Odds are very good that your code will break with a small change to the incoming data. Read "Match All Occurrences of a Regex", which tries to explain the issues of using patterns to parse XML and HTML.
Instead, use something more resilient: a parser. Here's how I'd do it:
xml = <<EOT
<CREATE-EVENT>
<COLLECTION>PAM</COLLECTION>
<EVENT-TYPE>survey_answer</EVENT-TYPE>
<JSON-STRING>
{
question1:done,
question2:done,
question3:done,
question4:done,
question5:done,
question6:done
}
</JSON-STRING>
</CREATE-EVENT>
<EMAIL>
<ADDRESS>bot_client_id</ADDRESS>
<SUBJECT>PAM responses for Wednesday October 07</SUBJECT>
<BODY/>
</EMAIL>
EOT
require 'nokogiri'
doc = Nokogiri::XML::DocumentFragment.parse('<root>' + xml + '</root>')
Your XML example isn't syntactically correct because it's missing a root node and has an unterminated <EMAIL> node, so I added </EMAIL> and wrapped xml in <root> when parsing it. In real life you'd pass the entire XML string, assuming it is valid XML, using:
doc = Nokogiri::XML(xml)
Once it's parsed into a DOM, I can use:
doc.at('CREATE-EVENT').children.remove
to remove the child nodes of <CREATE-EVENT>, resulting in:
puts doc.to_xml
# >> <root><CREATE-EVENT/>
# >> <EMAIL>
# >> <ADDRESS>bot_client_id</ADDRESS>
# >> <SUBJECT>PAM responses for Wednesday October 07</SUBJECT>
# >> <BODY/>
# >> </EMAIL>
# >> </root>
At this point <CREATE-EVENT/> is now empty.
If you want to substitute something into that node it's equally easy:
word = 'bar'
doc.at('CREATE-EVENT').children = "<foo>#{ word }</foo>"
which results in:
# >> <root><CREATE-EVENT><foo>bar</foo></CREATE-EVENT>
# >> <EMAIL>
# >> <ADDRESS>bot_client_id</ADDRESS>
# >> <SUBJECT>PAM responses for Wednesday October 07</SUBJECT>
# >> <BODY/>
# >> </EMAIL>
# >> </root>
There are very few times I'd ever use sub or gsub to change HTML or XML. Instead I'd grab a parser first. It might not be as fast, but it's a much more robust solution, which translates to being able to sleep through the night a lot more often.
You can read more about using Nokogiri by searching Stack Overflow (nokogiri), or the internet.
@outbound_text.gsub(/<CREATE-EVENT>(.*)<\/CREATE-EVENT>/m, '\1')
#=> "\n\t\t\t\t<COLLECTION>PAM</COLLECTION>\n\t\t\t\t<EVENT-TYPE>
# survey_answer</EVENT-TYPE>\n\t\t\t\t<JSON-STRING>\n\t\t\t\t\t
# {\n\t\t\t\t\t question1:done,\n\t\t\t\t\t question2:done,
# \n\t\t\t\t\t question3:done,\n\t\t\t\t\t question4:done,
# \n\t\t\t\t\t question5:done,\n\t\t\t\t\t question6:done
# \n\t\t\t\t\t}\n\t\t\t\t</JSON-STRING>\n\t\t\t\n\n\t\t\t\n <EMAIL>\n
# <ADDRESS>bot_client_id</ADDRESS>\n <SUBJECT>PAM
# responses for Wednesday October 07</SUBJECT>\n <BODY>\nHi, there"
I've broken the return string so it can be seen more easily. The problem is that you forgot /m (multiline) at the end of the regex.
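The effect of /m is easy to see in isolation: without it, "." does not match a newline, so a pattern spanning several lines never matches. A tiny sketch with a made-up sample string:

```ruby
MULTILINE_SAMPLE = "<a>\nline\n</a>"

# Without /m the "." stops at the first newline, so there is no match.
p MULTILINE_SAMPLE.match?(/<a>.*<\/a>/)   # => false
# With /m the "." also matches newlines.
p MULTILINE_SAMPLE.match?(/<a>.*<\/a>/m)  # => true
```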

Need clarification with 'each-do' block in my ruby code

Given an html file:
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
1
2
</span>
</div>
...more divs
<div class="NormalMid">
<span class="style-span">
"Data 20:"
20
21
22
23
</span>
</div>
...more divs
</div>
Using these SO posts as reference:
How do I integrate these two conditions block codes to mine in Ruby?
and
How to understand this Arrays and loops in Ruby?
My code:
require 'nokogiri'
require 'pp'
require 'open-uri'
data_file = 'site.htm'
file = File.open(data_file, 'r')
html = open(file)
page = Nokogiri::HTML(html)
page.encoding = 'utf-8'
rows = page.xpath('//div[@class="NormalMid"]')
details = rows.collect do |row|
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
data = row.children[0].children[0].to_s.strip
links = link.collect {|item| item.at_xpath('@href').to_s.strip}
detail[data.to_sym] = links
end
detail
end
details.reject! {|d| d.empty?}
pp details
The output:
[{:"Data 1:"=>
["http://www.site.com/data/1",
"http://www.site.com/data/2"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20",
"http://www.site.com/data/21",
"http://www.site.com/data/22",
"http://www.site.com/data/20",]},
...
}]
Everything is going good, exactly what I wanted.
BUT if you change these lines of code:
detail = {}
[
[row.children.first.element_children,row.children.first.element_children],
].each do |part, link|
to:
detail = {}
[
[row.children.first.element_children],
].each do |link|
I get the output of
[{:"Data 1:"=>
["http://www.site.com/data/1"]},
...
{:"Data 20 :"=>
["http://www.site.com/data/20"]},
...
}]
Only the first anchor href is stored in the array.
I just need some clarification on why it's behaving that way. Because the argument part in the argument list is not being used, I figured I didn't need it there. But my program doesn't work correctly if I delete the corresponding row.children.first.element_children as well.
What is going on in the [[obj,obj],].each do block? I just started ruby a week ago, and I'm still getting used to the syntax, any help will be appreciated. Thank You :D
EDIT
rows[0].children.first.element_children[0] will have the output
Nokogiri::XML::Element:0xcea69c name="a" attributes=[#<Nokogiri::XML::Attr:0xcea648
name="href" value="http://www.site.com/data/1">] children[<Nokogiri::XML::Text:0xcea1a4
"1">]>
puts rows[0].children.first.element_children[0]
1
You made your code overly complicated. Looking at your code, it seems you are trying to get something like the following:
require 'nokogiri'
doc = Nokogiri::HTML::Document.parse <<-eotl
<div>
<div class="NormalMid">
<span class="style-span">
"Data 1:"
1
2
</span>
</div>
<div class="NormalMid">
<span class="style-span">
"Data 20:"
20
21
22
23
</span>
</div>
</div>
eotl
rows = doc.xpath("//div[@class='NormalMid']/span[@class='style-span']")
val = rows.map do |row|
[row.at_xpath("./text()").to_s.tr('"','').strip,row.xpath(".//@href").map(&:to_s)]
end
Hash[val]
# => {"Data 1:"=>["http://site.com/data/1", "http://site.com/data/2"],
# "Data 20:"=>
# ["http://site.com/data/20",
# "http://site.com/data/21",
# "http://site.com/data/22",
# "http://site.com/data/23"]}
What is going on in the [[obj,obj],].each do block?
Look at the two examples below:
[[1],[4,5]].each do |a|
p a
end
# >> [1]
# >> [4, 5]
[[1,2],[4,5]].each do |a,b|
p a, b
end
# >> 1
# >> 2
# >> 4
# >> 5
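Applied to the question, the same rule explains the missing links. With one block parameter, link receives the whole inner array (a one-element array holding the node set), so collect visits a single item; calling at_xpath on that node set presumably returns just its first match. The sketch below uses a plain array as a stand-in for the NodeSet, and the method name is made up for illustration:

```ruby
# Plain array standing in for row.children.first.element_children.
HREFS = %w[/data/1 /data/2 /data/3]

# Hypothetical demo: records what each style of block parameter receives.
def block_param_demo
  seen = {}
  # Two parameters: the inner array is unpacked, link is the node set itself.
  [[HREFS, HREFS]].each { |_part, link| seen[:two_params] = link }
  # One parameter: link is the inner array, i.e. the node set wrapped in [].
  [[HREFS]].each { |link| seen[:one_param] = link }
  # Parentheses force unpacking even with a single parameter.
  [[HREFS]].each { |(link)| seen[:unpacked] = link }
  seen
end

result = block_param_demo
p result[:two_params]  # => ["/data/1", "/data/2", "/data/3"]
p result[:one_param]   # => [["/data/1", "/data/2", "/data/3"]]
p result[:unpacked]    # => ["/data/1", "/data/2", "/data/3"]
```

So the duplicated element in the original code was only there to trigger unpacking; dropping the outer array literal, or writing |(link)|, achieves the same thing.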

Resources