Ruby Web Scrape (Nokogiri) - cleanup - ruby

I'm experimenting with scraping a website for data.
This is what I've put together after a few days of research. However, the output from Nokogiri is not as "clean" as I would expect: when I print my array, I get a lot of line-break characters ("\n") in the output.
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
property_details = d.text
details_array.push(property_details)
end
Pry.start(binding)
While in Pry, if I display details_array or address_array, output looks like:
[2] pry(main)> details_array
=> ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n\n \n Dresden East\n \n \n\n $289,900\n \n \n \n 3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max
Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n
\n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq
ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home
Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n
\n Intown South\n Peoplestown\n \n \n
\n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3
ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n
\n 3 bd\n 1+ ba\n 1,449
sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :

It looks like you're not digging into the document far enough with your selector. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<p>foo</p>
<p>bar</p>
</div>
</body>
</html>
EOT
doc.search('div').map(&:text) # => ["\n foo\n bar\n "]
When looking at the text of a parent tag you'll get both the text nodes used to format the HTML, plus the text of the desired <p> node.
If you drill down to the actual nodes you want and then get their text you'll remove the inter-tag formatting:
doc.search('div p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
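When drilling down to individual nodes is not practical, the whitespace can also be collapsed after the text has been extracted. The following is only a sketch in plain Ruby; the sample string is a shortened, made-up stand-in for one entry of details_array, and the names raw, one_line and fields are illustrative rather than taken from the question:

```ruby
# Hypothetical sample shaped like one entry of details_array (shortened).
raw = "\n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n $289,900\n \n "

# Collapse every run of whitespace (spaces and newlines) into one space.
one_line = raw.split.join(' ')

# Or keep one array element per non-blank line.
fields = raw.lines.map(&:strip).reject(&:empty?)

puts one_line
p fields
```

The first form suits display; the second keeps the pieces separate, which is usually more convenient if the data is headed for a CSV.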

Related

Ruby regexp for replace some sequence

How can I convert the string
text = "test test1 \n \n \n \n \n \n \n \n \n \n \n \n \n \n test2 \n"
to
test test1 \n\n\n\n\n\n\n\n\n\n\n\n\n\n test2\n
I tried text.gsub(/\s\n/, '\n'), but it added an extra backslash:
test test1\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n test2\\n
Use double quotes, instead of single:
text.gsub(/\s\n/, "\n")
With single quotes, \n has the meaning of \ and n, one after another. With double, it is interpreted as new line.
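The difference can be seen side by side in a small sketch. The short string below is an assumption chosen for readability, not the one from the question:

```ruby
text = "a \n \nb \n"

# Single-quoted replacement: a backslash and the letter n, two characters.
single = text.gsub(/\s\n/, '\n')

# Double-quoted replacement: "\n" is one newline character.
double = text.gsub(/\s\n/, "\n")

p single  # inspect shows each literal backslash escaped, as \\n
p double
```

Because p prints the inspected form, the doubled backslash in the first result is only how Ruby displays a single literal backslash.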
I expect that either the space after "test1" is to be removed as well or the space after "test2" is not to be removed. @ndn assumed the former was intended. If the second interpretation applies, you could do the following:
r = /
(?<=\n) # match \n in a positive lookbehind
\s # match a whitespace character
(?=\n) # match \n in a positive lookahead
/x # extended/free-spacing regex definition mode
text.gsub(r,"")
#=> "test test1 \n\n\n\n\n\n\n\n\n\n\n\n\n\n test2 \n"
or:
text.gsub(/\n\s(?=\n)/, "\n")

Ruby split keep the delimiter before the string

I have the following string :
a = '% abc \n %% abcd \n %% efgh\n '
I would like the output to be
['% abc \n', '%% abcd \n', '%% efgh \n']
If I have
b = '%% abc \n %% efg \n %% ijk \n'
I would like the output to be
['%% abc \n', '%% efg \n', '%% ijk \n']
I use b.split('%%').collect!{|v| '%%' + v } and it works fine for case 2,
but it doesn't work for case 1.
I saw some posts that use 'scan' or 'split' to keep the delimiter when it comes after the string.
For example: 'a; b; c' becomes ['a;', 'b;' ,'c']
But I want the opposite: ['a', ';b', ';c']
There need not be a space between \n and %%, since \n denotes a new line.
A solution I came up with was
test_string = '% asd \n %% asf sdaf \n %% adsasd asdf asd asf '
delimiter = '%%'
index_of_percent = test_string.index(delimiter)
if index_of_percent == 0
  # The string already starts with the delimiter: split and re-attach it.
  result = test_string.split(delimiter).reject(&:empty?).collect { |v| delimiter + v }
else
  # Split only the part from the first delimiter on, then put the leading part back in front.
  result = test_string.slice(index_of_percent..-1).split(delimiter).reject(&:empty?).collect { |v| delimiter + v }
  result.unshift(test_string[0...index_of_percent])
end
(?<=\\n)\s*(?=%%)
You can split on the space using lookarounds. See demo.
https://regex101.com/r/fM9lY3/7
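Applied to the single-quoted string from the question, that pattern could be used with String#split roughly as follows. This is a sketch under the assumption that the \n sequences are literal backslash-n pairs, which is what the (?<=\\n) lookbehind expects; the names parts and trimmed are illustrative:

```ruby
# Single-quoted, so each \n is a literal backslash followed by n.
a = '% abc \n %% abcd \n %% efgh\n '

# Split on the whitespace that sits between a literal \n and a following %%.
parts = a.split(/(?<=\\n)\s*(?=%%)/)

# The last piece keeps its trailing space; rstrip removes it
# without touching the literal \n characters.
trimmed = parts.map(&:rstrip)

p parts
p trimmed
```

Since the lookarounds consume nothing, both the \n and the %% stay attached to their pieces.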
You could do it this way
def splitter(s)
  # reject(&:empty?) added to handle the trailing space in a
  s.lines.map { |n| n.lstrip.chomp(' ') }.reject(&:empty?)
end
# double quotes are used so that \n is a real newline
# rather than a backslash followed by n
a = "% abc \n %% abcd \n %% efgh\n "
b = "%% abc \n %% efg \n %% ijk \n"
splitter(a)
#=> ["% abc \n", "%% abcd \n", "%% efgh\n"]
splitter(b)
#=> ["%% abc \n", "%% efg \n", "%% ijk \n"]
String#lines partitions the string right after each newline character by default and returns an Array. We then call Array#map, and for each line call lstrip to remove the leading space and chomp(' ') to remove a trailing space without removing the \n. Finally we reject any empty strings, which occur for variable a because of its trailing space.
You can also use
a.split(/\\n\s?/).collect{|e| "#{e}\\n"}
a.split(/\\n\s?/)
# ["% abc ", "%% abcd ", "%% efgh"]
.collect{|e| "#{e}\\n"}
# will append \n
# ["% abc \\n", "%% abcd \\n", "%% efgh\\n"]

How to use Ruby's gsub function to replace excessive '\n' on a string

I have this string:
string = "SEGUNDA A SEXTA\n05:24 \n05:48\n06:12\n06:36\n07:00\n07:24\n07:48\n\n08:12 \n08:36\n09:00\n09:24\n09:48\n10:12\n10:36\n11:00 \n11:24\n11:48\n12:12\n12:36\n13:00\n13:24\n13:48 \n14:12\n14:36\n15:00\n15:24\n15:48\n16:12\n16:36 \n17:00\n17:24\n17:48\n18:12\n18:36\n19:00\n19:48 \n20:36\n21:24\n22:26\n23:15\n00:00\n"
I'd like to replace every occurrence of \n\n with a single \n and, if possible, also remove all " " (spaces) between the numbers and the newline character \n.
I'm trying to do:
string.gsub(/\n\n/, '\n')
but it replaces \n\n with \\n.
Can anyone help me?
The real reason is that a single-quoted string doesn't interpret escape sequences (like \n).
string.gsub(/\n/, '\n')
This replaces the single newline character with two characters, '\' and 'n'.
You can see the difference by printing the string:
[302] pry(main)> puts '\n'
\n
=> nil
[303] pry(main)> puts "\n"
=> nil
[304] pry(main)> string = '\n'
=> "\\n"
[305] pry(main)> string = "\n"
=> "\n"
I think you're looking for:
string.gsub( / *\n+/, "\n" )
This searches for zero or more spaces followed by one or more newlines, and replaces the match with a single newline.
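As a quick check, here is that pattern applied to a shortened version of the schedule string. The abbreviated input and the names schedule and cleaned are assumptions made for the example:

```ruby
# Shortened sample: trailing spaces before some newlines, one doubled newline.
schedule = "SEGUNDA A SEXTA\n05:24 \n05:48\n\n08:12 \n"

# Zero or more spaces followed by one or more newlines become one newline.
cleaned = schedule.gsub(/ *\n+/, "\n")

p cleaned
```

Both problems from the question are handled in one pass: the spaces before a newline are dropped and consecutive newlines collapse into one.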

RegEx words match

An element on the page has the content that I'm trying to pull.
Here's the element.content after a parse with Nokogiri:
["\n \n \n \n itemId[0]=1234;\n \n \n \n \n \n \n \n My Project: First Edition\n \n ", "\n \n \n \n itemId[1]=2345;\n \n \n \n \n \n \n \n My Second Edition\n \n ", "\n \n \n \n itemId[2]=1234;\n \n \n \n \n \n \n \n Third\n \n \n"]
I was able to get the RegEx for the itemId[0]=1234 part, which is (/itemId.\d+..\d{4}/), but I'm totally stuck on how to grab the names of the content. Any advice? Perhaps I could just parse the HTML with Ruby?
Given a string like this:
s= "\n \n \n \n itemId[0]=1234;\n \n \n \n \n \n \n \n My Project: First Edition\n \n "
You could do this:
m = s.match(/(itemId\[\d+\]=\d+);(.*)/m)
item = m[1]
# itemId[0]=1234
name = m[2].strip
# My Project: First Edition
Basically, you pull out the itemId... part using (more or less) your existing expression, grab the rest of the string ((.*)) in multi-line mode (/m, so that . matches a newline), and then strip off the offending whitespace outside the regex using strip. You don't have to build one unreadable regex that does everything you need; post-processing a match result is allowed and sometimes even encouraged.
I suggest you use split to find all non-empty lines.
str.split(/\s*\n\s*/)
should do the trick.
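One caveat worth illustrating: when the string starts with whitespace, split returns an empty first element. A sketch using the sample string from the earlier answer, with strip applied first (the names with_blank, item and name are illustrative):

```ruby
s = "\n \n \n \n itemId[0]=1234;\n \n \n \n \n \n \n \n My Project: First Edition\n \n "

# Without strip, the leading whitespace produces an empty first element.
with_blank = s.split(/\s*\n\s*/)

# Stripping first leaves only the non-empty lines.
item, name = s.strip.split(/\s*\n\s*/)

p with_blank
p item
p name
```

Note that item still carries the trailing semicolon; item.chomp(';') would remove it if needed.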

How to add string "\n" literally at the end of each line in Ruby?

Here is a string str:
str = "line1
line2
line3"
We would like to add string "\n" to the end of each line:
str = "line1 \n
line2 \n
line3 \n"
A method is defined:
def mod_line(str)
s = ""
str.each_line do |l|
s += l + '\\n'
end
end
The problem is that '\n' is a line feed and was not added to the end of the str even with escape \. What's the right way to add '\n' literally to each line?
String#gsub/String#gsub! plus a very simple regular expression can be used to achieve that:
str = "line1
line2
line3"
str.gsub!(/$/, ' \n')
puts str
Output:
line1 \n
line2 \n
line3 \n
The platform-independent solution:
str.gsub(/\R/) { " \\n#{$~}" }
It searches for line breaks (line feeds and carriage returns) and replaces each with itself, preceded by a space and a literal \n.
\n needs to be interpreted as a special character. You need to put it in double quotes.
"\n"
Your attempt:
'\\n'
only escapes the backslash, which is actually redundant. With or without escaping on the backslash, it gives you a backslash followed by the letter n.
Also, your method mod_line returns the result of str.each_line, which is the original string str. You need to return the modified string s:
def mod_line(str)
...
s
end
Also, be aware that each line of the original string already ends with "\n", so you would be adding a second "\n" to each line (making it two lines).
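If the goal really is the literal two-character sequence at the end of each line, as in the desired output shown in the question, one possible version of mod_line that also returns the modified string might look like this. It is a sketch, not the asker's final code:

```ruby
def mod_line(str)
  # ' \n' in single quotes is a space, a backslash and the letter n,
  # not a line feed. chomp removes the real line break first.
  str.lines.map { |line| line.chomp + ' \n' }.join("\n")
end

str = "line1\nline2\nline3"
result = mod_line(str)
puts result
```

Here map plus join builds the new string in one expression, so the method's return value is the modified text rather than the original str.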
This is the closest I got to it.
def mod_line(str)
s = ""
str.each_line do |l|
s += l
end
p s
end
Using p instead of puts leaves the \n on the end of each line.
