An element on the page has content that I'm trying to pull.
Here's the element.content after a parse with Nokogiri:
["\n \n \n \n itemId[0]=1234;\n \n \n \n \n \n \n \n My Project: First Edition\n \n ", "\n \n \n \n itemId[1]=2345;\n \n \n \n \n \n \n \n My Second Edition\n \n ", "\n \n \n \n itemId[2]=1234;\n \n \n \n \n \n \n \n Third\n \n \n"]
I was able to get the RegEx for the itemId[0]=1234 which is (/itemId.\d+..\d{4}/) but I'm totally stuck on how to grab the names of the content. Any advice? Perhaps I can just parse with Ruby through HTML?
Given a string like this:
s= "\n \n \n \n itemId[0]=1234;\n \n \n \n \n \n \n \n My Project: First Edition\n \n "
You could do this:
m = s.match(/(itemId\[\d+\]=\d+);(.*)/m)
item = m[1]
# itemId[0]=1234
name = m[2].strip
# My Project: First Edition
Basically, you pull out the itemId... part using (more or less) your existing expression, grab the rest of the string ((.*)) in multi-line mode (/m, so that . matches a newline), and then strip off the offending whitespace outside the regex using strip. You don't have to build one unreadable regex that does everything you need; post-processing a match result is allowed and sometimes even encouraged.
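To apply the same idea to the whole array from the question, something like the following might work. It is only a sketch: the whitespace runs in the sample data are shortened, and the group names (index, id, name) and the hash layout are my own choices, not anything Nokogiri gives you.

```ruby
# Sample data modelled on the question; whitespace runs are shortened.
contents = [
  "\n \n itemId[0]=1234;\n \n \n My Project: First Edition\n \n ",
  "\n \n itemId[1]=2345;\n \n \n My Second Edition\n \n ",
  "\n \n itemId[2]=1234;\n \n \n Third\n \n \n"
]

# Named groups (hypothetical names) keep the match readable.
# /m lets . match newlines, so the last group captures the rest of the string.
pattern = /itemId\[(?<index>\d+)\]=(?<id>\d+);(?<name>.*)/m

items = contents.map do |text|
  m = text.match(pattern)
  next unless m # skip strings without an itemId
  { index: m[:index].to_i, id: m[:id].to_i, name: m[:name].strip }
end.compact

items.each { |item| puts "#{item[:id]}: #{item[:name]}" }
```

Strings that do not match are skipped rather than raising, which seems the safer default when scraping.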
I suggest you use split to find all non-empty lines.
str.split(/\s*\n\s*/)
should do the trick.
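A quick check of the split approach on a shortened version of the sample string. One assumption on my part: calling strip first, because otherwise the leading whitespace run produces an empty first element.

```ruby
s = "\n \n itemId[0]=1234;\n \n \n My Project: First Edition\n \n "

# Without strip, the leading whitespace run yields an empty first element.
raw_parts = s.split(/\s*\n\s*/)

# With strip, only the non-empty lines remain.
lines = s.strip.split(/\s*\n\s*/)

p raw_parts
p lines
```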
Related
I'm experimenting on how to scrape a website for data.
This is what I've put together after a few days of research; however, the output from Nokogiri is not as "clean" as I would expect. When I print my array, I get a lot of line-break "\n" characters in the output.
require 'httparty'
require 'nokogiri'
require 'open-uri'
require 'pry'
require 'csv'
# Assigning the page to scrape
page = HTTParty.get('http://www.realtor.com/realestateandhomes-search/Atlanta_GA/type-single-family-home/price-na-500000')
# Transform the http response into a Nokogiri in order to parse it
parse_page = Nokogiri::HTML(page)
# Create an empty array for property details
details_array = []
parse_page.css('div.srp-item-body').map do |d|
  property_details = d.text
  details_array.push(property_details)
end
Pry.start(binding)
While in Pry, if I display details_array or address_array, output looks like:
[2] pry(main)> details_array
=> ["\n \n \n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n\n \n Dresden East\n \n \n\n $289,900\n \n \n \n 3 bd\n 2 ba\n 1,566 sq ft\n
0.3 acres lot\n \n \n \n \n Single Family Home\n \n \n \n \n
Brokered by Re/Max Town And Country\n \n \n
\n \n \n Brokered by \n Re/Max
Town And Country\n \n \n \n ", "\n \n
\n \n 2141 Dunwoody Gln,\n
Atlanta,\n GA\n 30338\n \n \n\n
\n \n $469,900\n \n \n
\n 4 bd\n 3 ba\n 2,850 sq
ft\n 0.3 acres lot\n 2 car\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Buckhead Home Realty Llc\n \n \n \n
\n \n Brokered by \n Buckhead Home
Realty Llc\n \n \n \n ", "\n \n
\n \n 1048 Martin St SE,\n
Atlanta,\n GA\n 30315\n \n \n\n
\n Intown South\n Peoplestown\n \n \n
\n $164,900\n \n \n \n
5 bd\n 3 ba\n 2,376 sq ft\n
7,405 sq ft lot\n \n \n \n \n
Single Family Home\n \n \n \n \n
Brokered by Greenlet Llc\n \n \n \n
\n \n Brokered by \n Greenlet Llc\n
\n \n \n ", "\n \n \n \n
1048 Martin St SE,\n Atlanta,\n GA\n
30315\n \n \n\n \n Intown South\n
Peoplestown\n \n \n \n $164,900\n
\n \n \n 5 bd\n 3
ba\n 2,055 sq ft\n 7,584 sq ft lot\n
\n \n \n \n Single Family Home\n
\n \n \n \n Brokered by
Greenlet, Llc\n \n \n \n \n
\n Brokered by \n Greenlet, Llc\n \n
\n \n ", "\n \n \n \n
1991 Woodbine Ter NE,\n Atlanta,\n GA\n
30329\n \n \n\n \n Sagamore Hills\n
\n \n \n $299,900\n \n \n
\n 3 bd\n 1+ ba\n 1,449
sq ft\n 0.8 acres lot\n \n \n
\n \n Single Family Home\n \n \n
\n :
It looks like you're not digging into the document far enough with your selector. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<div>
<p>foo</p>
<p>bar</p>
</div>
</body>
</html>
EOT
doc.search('div').map(&:text) # => ["\n foo\n bar\n "]
When looking at the text of a parent tag you'll get both the text nodes used to format the HTML, plus the text of the desired <p> node.
If you drill down to the actual nodes you want and then get their text you'll remove the inter-tag formatting:
doc.search('div p').map(&:text) # => ["foo", "bar"]
See "How to avoid joining all text from Nodes when scraping" also.
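If you cannot narrow the selector (or you already have the parent's text), you can also normalize the whitespace afterwards. The sketch below uses plain Ruby on a shortened sample of the output shown in the question; the variable names are made up for illustration.

```ruby
# Shortened sample of one entry from details_array in the question.
raw = "\n \n 2265 Tanglewood Cir NE,\n Atlanta,\n GA\n 30345\n \n \n $289,900\n \n 3 bd\n 2 ba\n "

# One array element per non-blank line.
fields = raw.split("\n").map(&:strip).reject(&:empty?)

# Or collapse every whitespace run into a single space.
one_line = raw.gsub(/\s+/, ' ').strip

p fields
puts one_line
```

Drilling down with a more specific selector is still the cleaner fix, since it also tells you which value is which.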
How can I convert the string
text = "test test1 \n \n \n \n \n \n \n \n \n \n \n \n \n \n test2 \n"
to
test test1 \n\n\n\n\n\n\n\n\n\n\n\n\n\n test2\n
I tried text.gsub(/\s\n/, '\n'), but it added an extra backslash:
test test1\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n test2\\n
Use double quotes, instead of single:
text.gsub(/\s\n/, "\n")
With single quotes, \n means the two characters \ and n, one after the other. With double quotes, it is interpreted as a newline.
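A small sketch that makes the difference visible, using a shortened version of the string from the question:

```ruby
single = '\n' # two characters: a backslash and the letter n
double = "\n" # one character: a line feed

puts single.length
puts double.length

text = "test test1 \n \n test2 \n"

broken = text.gsub(/\s\n/, '\n') # inserts a literal backslash followed by n
fixed  = text.gsub(/\s\n/, "\n") # inserts a real newline

p broken
p fixed
```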
I expect that either the space after "test1" is to be removed as well or the space after "test2" is not to be removed. @ndn assumed the former was intended. If the second interpretation applies, you could do the following:
r = /
(?<=\n) # match \n in a positive lookbehind
\s # match a whitespace character
(?=\n) # match \n in a positive lookahead
/x # extended/free-spacing regex definition mode
text.gsub(r,"")
#=> "test test1 \n\n\n\n\n\n\n\n\n\n\n\n\n\n test2 \n"
or:
text.gsub(/\n\s(?=\n)/, "\n")
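As a sanity check, the two variants can be compared on a short input (a sketch; the sample string is mine, not from the question). Both remove only whitespace that sits between two newlines, so the space after the first and last words survives.

```ruby
text = "a \n \n \n b \n"

# Variant 1: lookbehind and lookahead, nothing but the whitespace is consumed.
lookaround = text.gsub(/(?<=\n)\s(?=\n)/, "")

# Variant 2: consume the leading newline and put it back in the replacement.
consuming = text.gsub(/\n\s(?=\n)/, "\n")

p lookaround
p consuming
```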
I have a string that looks like this.
mystring="The Body of a\r\n\t\t\t\tSpider"
I want to replace all the \r, \n, \t etc with a whitespace.
The code I wrote for this is :
mystring.gsub(/\\./, " ")
But this isn't doing anything to the string.
Help.
\r, \n and \t are escape sequences representing carriage return, line feed and tab. Although they are written as two characters, they are interpreted as a single character:
"\r\n\t".codepoints #=> [13, 10, 9]
Because it is such a common requirement, there's a shortcut \s to match all whitespace characters:
mystring.gsub(/\s/, ' ')
#=> "The Body of a      Spider"
Or \s+ to match multiple whitespace characters:
mystring.gsub(/\s+/, ' ')
#=> "The Body of a Spider"
/\s/ is equivalent to /[ \t\r\n\f]/
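This also explains why the original /\\./ attempt did nothing: that pattern looks for a literal backslash followed by any character, and the double-quoted string contains no backslashes at all. A minimal sketch (the single-quoted literal variable is only there for contrast):

```ruby
mystring = "The Body of a\r\n\t\t\t\tSpider"

# No literal backslash in the string, so the pattern never matches.
puts mystring.match?(/\\./)

# A single-quoted string keeps the backslashes as real characters.
literal = 'The Body of a\r\n\t\t\t\tSpider'
replaced = literal.gsub(/\\./, ' ')

p replaced
```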
String#tr is designed for character-by-character substitution. It appears to be a bit quicker than String#gsub:
mystring.tr "\r", ' '
It also has an in-place version (this will replace all carriage returns, line feeds, tabs, form feeds and spaces with a space):
mystring.tr! "\s\r\n\t\f", ' '
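Note that tr replaces each character one for one, so a run of six whitespace characters becomes six spaces. If a single space is wanted, one option (a sketch, not the only way) is to chain squeeze, or to split on whitespace and join:

```ruby
mystring = "The Body of a\r\n\t\t\t\tSpider"

# One space per replaced character.
spaced = mystring.tr("\r\n\t", ' ')

# Collapse runs of spaces into one.
squeezed = spaced.squeeze(' ')

# Alternative: split on any whitespace run, then join with single spaces.
joined = mystring.split.join(' ')

p spaced
p squeezed
p joined
```

Keep in mind that tr! returns nil when nothing was changed, so its return value should not be chained.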
Stefen's answer is really cool; as always, a very short and clean solution. But here is what I tried, which replaces all special characters. [Posted just as an optional solution] ;)
> a = "The Body of a\r\n\t\t\t\tSpider"
=> "The Body of a\r\n\t\t\t\tSpider"
> a.gsub(/[^0-9A-Za-z]/, ' ')
=> "The Body of a      Spider"
You can use strip and then append a space to your string (note that strip only removes leading and trailing whitespace):
mystring.strip + " "
If you literally have \r\n\t in your string:
mystring="The Body of a\r\n\t\t\t\tSpider"
mystring.split(/[\r\t\n]/)
I have this string:
string = "SEGUNDA A SEXTA\n05:24 \n05:48\n06:12\n06:36\n07:00\n07:24\n07:48\n\n08:12 \n08:36\n09:00\n09:24\n09:48\n10:12\n10:36\n11:00 \n11:24\n11:48\n12:12\n12:36\n13:00\n13:24\n13:48 \n14:12\n14:36\n15:00\n15:24\n15:48\n16:12\n16:36 \n17:00\n17:24\n17:48\n18:12\n18:36\n19:00\n19:48 \n20:36\n21:24\n22:26\n23:15\n00:00\n"
And I'd like to replace all \n\n occurrences with a single \n and, if possible, also remove all the " " (spaces) between the numbers and the newline character \n.
I'm trying to do:
string.gsub(/\n\n/, '\n')
but it is replacing \n\n by \\n
Can anyone help me?
The real reason is that a single-quoted string doesn't process escape sequences (like \n).
string.gsub(/\n/, '\n')
It replaces one single character \n with two characters '\' and 'n'
You can see the difference by printing the string:
[302] pry(main)> puts '\n'
\n
=> nil
[303] pry(main)> puts "\n"

=> nil
[304] pry(main)> string = '\n'
=> "\\n"
[305] pry(main)> string = "\n"
=> "\n"
I think you're looking for:
string.gsub( / *\n+/, "\n" )
This searches for zero or more spaces followed by one or more newlines, and replaces the match with a single newline.
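Here is that substitution applied to a shortened version of the schedule string. The split into a header and a list of times is my own addition, just to show the cleaned result is easy to work with.

```ruby
# Shortened sample of the string from the question.
schedule = "SEGUNDA A SEXTA\n05:24 \n05:48\n07:48\n\n08:12 \n08:36\n"

# Drop spaces before a newline and collapse repeated newlines into one.
cleaned = schedule.gsub(/ *\n+/, "\n")

# First line is the header, the rest are the times.
header, *times = cleaned.split("\n")

p cleaned
p header
p times
```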
How do I replace all non-word chars (\W) that are also not space characters (\s)?
This is the desired functionality:
"the (quick)! brown \n fox".gsub(regex, "#")
# => "the #quick## brown \n fox"
"the (quick)! brown \n fox".gsub(/[^\w\s]/, "#")
This works by making the regex match anything that is neither a word character nor a whitespace character.
I think you need a regex like this one:
/[^\w\s]/
When you add a circumflex ^ to the start of a character set, it negates the set, so that any character except those in the set is matched.
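A short sketch comparing the negated set with plain \W on the string from the question, to show why the \s inside the brackets matters:

```ruby
input = "the (quick)! brown \n fox"

# Replace characters that are neither word characters nor whitespace.
masked = input.gsub(/[^\w\s]/, "#")

# Plain \W would also replace the spaces and the newline.
all_non_word = input.gsub(/\W/, "#")

p masked
p all_non_word
```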