#scan suddenly returns an empty array - ruby

I am creating a scraper for articles from www.dev.to, which should read in the title, author and body of the article. I am using #scan to get rid of white space and other characters after the author name. At first I assumed the author name would consist of a first name and a last name, then realized some authors list only one name. Now that I have changed the regex accordingly, the method has stopped working and #scan returns an empty array. How can I fix this?
require 'open-uri'
require 'nokogiri'
def scrape_post(path)
  url = "https://dev.to/#{path}"
  html_content = URI.open(url).read
  doc = Nokogiri::HTML(html_content)
  doc.search('.article-wrapper').each do |element|
    title = element.search('.crayons-article__header__meta').search('h1').text.strip
    author_raw = element.search('.crayons-article__subheader').text.strip
    author = author_raw.scan(/\A\w+(\s|\w)\w+/).first
    body = doc.at_css('div#article-body').text.strip
    @post = Post.new(id: @next_id, path: path, title: title, author: author, body: body, read: false)
  end
  @post
end
Example of input data:
path = rahxuls/preventing-copying-text-in-a-webpage-4acg
Expected output:
title = "Preventing copying text in a webpage 😁"
author_raw = "Rahul\n \n\n \n Nov 6\n\n\n ・2 min read"
author = "Rahul"

From the String#scan docs:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
By adding the parentheses to the middle of your regex, you created a capturing group. Scan will return whatever that group captures. In the example you gave, it will be 'u'.
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(\s|\w)\w+/) #=> [["u"]]
The group can be marked as non-capturing by starting it with ?:, which restores the behavior of your old implementation:
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".scan(/\A\w+(?:\s|\w)\w+/) #=> ["Rahul"]
Or you can add a named capture group to what you actually want to extract.
"Rahul\n \n\n \n Nov 6\n\n\n ・2 min read".match(/\A(?<name>\w+(\s|\w)\w+)/)[:name] #=> "Rahul"

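As a side note, if the goal is only the author's name, one option is to skip the word-counting regex and take everything up to the first newline. This is a minimal sketch that assumes the name always occupies the first line of the subheader text; `extract_author` and the "Jane Doe" sample are made up for illustration and are not part of the original code.

```ruby
# Hypothetical helper: treat the first line of the raw subheader text as the name.
# Assumption: the author name never contains a newline.
def extract_author(author_raw)
  # \A[^\n]* always matches (possibly an empty string), so no nil check is needed.
  author_raw[/\A[^\n]*/].strip
end

single = extract_author("Rahul\n \n\n \n Nov 6\n\n\n ・2 min read")
double = extract_author("Jane Doe\n \n\n \n Nov 6\n\n\n ・2 min read")

puts single
puts double
```

This works for one-word and multi-word names alike, since it does not depend on how many words the name has.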
Related

How to find text across HTML tag boundaries?

I have HTML like this:
<div>Lorem ipsum <b>dolor sit</b> amet.</div>
How can I find a plain text based match for my search string ipsum dolor in this HTML? I need the start and end XPath node pointers for the match, plus character indexes to point inside these start and stop nodes. I use Nokogiri to work with the DOM, but any solution for Ruby is fine.
Difficulty:
I can't node.traverse {|node| … } through the DOM and do a plain text search whenever a text node comes across, because my search string can cross tag boundaries.
I can't do a plain text search after converting the HTML to plain text, because I need the XPath indexes as result.
I could implement it myself with basic tree traversal, but before I do I'm asking if there is a Nokogiri function or trick to do it more comfortably.
You could do something like:
doc.search('div').find{|div| div.text[/ipsum dolor/]}
In the end, we used the code below. It is shown for the example from the question, but it also works in the general case of arbitrarily deep HTML tag nesting, which is what we need.
In addition, the search ignores runs of excess (≥2) whitespace characters. That is why we have to locate the end of the match explicitly instead of adding the length of the search string to the start position: the number of whitespace characters in the search string and in the matched text may differ.
doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'
# (1) Find search string in document text, "plain text in plain text".
quote_query =
quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
start_index = doc.text.index(/#{quote_query}/i)
end_index = start_index+doc.text[/#{quote_query}/i].size
# (2) Find XPath values and character indexes for our search match.
#
# To do this, walk through all text nodes and count characters until
# encountering both the start_index and end_index character counts
# of our search match.
start_xpath, start_offset, end_xpath, end_offset = nil
i = 0
doc.xpath('.//text() | text()').each do |x|
 offset = 0
 x.text.split('').each do
   if i == start_index
     e = x.previous
     sum = 0
     while e
       sum+= e.text.size
       e = e.previous
     end
     start_xpath = x.path.gsub(/^\?/, '').gsub(
       /#{Regexp.quote('/text()')}.*$/, ''
     )
     start_offset = offset+sum
   elsif i+1 == end_index
     e = x.previous
     sum = 0
     while e
       sum+= e.text.size
       e = e.previous
     end
     end_xpath = x.path.gsub(/^\?/, '').gsub(
       /#{Regexp.quote('/text()')}.*$/, ''
     )
     end_offset = offset+1+sum
   end
   offset+=1
   i+=1
 end
end
At this point, we can retrieve the desired XPath values for the start and stop of the search match (and in addition, character offsets pointing to the exact character inside the XPath designated element for the start and stop of the search match). We get:
puts start_xpath   # prints /div
puts start_offset  # prints 6
puts end_xpath     # prints /div/b
puts end_offset    # prints 5
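The whitespace-tolerant search from step (1) can be tried in isolation on a plain string, without Nokogiri. The sketch below is only an illustration: `find_quote` is a hypothetical helper name, and it uses `MatchData#begin` and `MatchData#end` to get both boundaries from a single match rather than running the regex twice.

```ruby
# Hypothetical helper: find `quote` in `text`, letting every run of whitespace
# in the quote match any run of whitespace in the text.
# Returns [start_index, end_index] (end is exclusive) or nil if not found.
def find_quote(text, quote)
  words = quote.split(/[[:space:]]+/).map { |w| Regexp.escape(w) }
  pattern = Regexp.new(words.join('[[:space:]]+'), Regexp::IGNORECASE)
  match = pattern.match(text)
  match && [match.begin(0), match.end(0)]
end

text = "Lorem  ipsum \n dolor sit amet."
start_index, end_index = find_quote(text, 'ipsum dolor')

puts start_index
puts end_index
puts text[start_index...end_index].inspect
```

The matched span is longer than the quote itself here, which is exactly the case where adding the quote's length to the start index would give the wrong end position.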

Regex to extract last number portion of varying URL

I'm creating a URL parser and have three kinds of URLs. From each I would like to extract the number at the end of the URL, increment it by 10 and update the URL. I'm trying to use a regex for the extraction, but I'm new to regex and having trouble.
These are the three URL structures whose last number I'd like to increment:
Increment last number 20 by 10:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/
Increment last number 50 by 10:
https://forums.questionablecontent.net/index.php/board,1.50.html
Increment last number 30 by 10:
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/
The regex \d+(?!.*\d) matches the last chunk of digits in the string. Pass it to gsub with a block to increment the number and substitute the result back into the string.
See this Ruby demo:
strs = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
arr = strs.map {|item| item.gsub(/\d+(?!.*\d)/) {$~[0].to_i+10}}
Note: $~ is a MatchData object, and using the [0] index we can access the whole match value.
Results:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/30/
https://forums.questionablecontent.net/index.php/board,1.60.html
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.40/
Try this regex (with the dot before html escaped so that it only matches a literal "."):
\d+(?=(\/)|(\.html))
It will extract the last number.
Demo: https://regex101.com/r/zqUQlF/1
Substitute back with this regex:
(.*?)(\d+)((\/)|(\.html))
Demo: https://regex101.com/r/zqUQlF/2
This regex matches only the last whole number in each URL by using a lookahead (which 'sees' patterns but doesn't consume any characters):
\d+(?=\D*$)
Like this:
urls = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
pattern = /(\d+)(?=[^\d]+$)/
urls.each do |url|
url.gsub!(pattern) {|m| m.to_i + 10}
end
puts urls
You can also test it online here: https://ideone.com/smBJCQ
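For comparison, here is a non-mutating variant of the same idea, wrapped in a small helper. It is a sketch only: `bump_last_number` and its `step` parameter are names invented for this example, and it assumes the number to change is always the last run of digits in the URL.

```ruby
# Hypothetical helper: return a copy of `url` with its last run of digits
# increased by `step`. The original string is left untouched.
def bump_last_number(url, step = 10)
  # \d+(?=\D*\z) matches digits that are followed only by non-digits up to the end.
  url.sub(/\d+(?=\D*\z)/) { |digits| (digits.to_i + step).to_s }
end

urls = [
  'http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/',
  'https://forums.questionablecontent.net/index.php/board,1.50.html',
  'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/'
]
bumped = urls.map { |url| bump_last_number(url) }

puts bumped
```

Using sub instead of gsub! avoids modifying the input strings, which matters if they are frozen or reused elsewhere.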

best way to find substring in ruby using regular expression

I have a string https://stackoverflow.com. I want a new string that contains the domain from the given string using regular expressions.
Example:
x = "https://stackoverflow.com"
newstring = "stackoverflow.com"
Example 2:
x = "https://www.stackoverflow.com"
newstring = "www.stackoverflow.com"
"https://stackoverflow.com"[/(?<=:\/\/).*/]
#⇒ "stackoverflow.com"
(?<=..) is a positive lookbehind.
If string = "http://stackoverflow.com",
a really easy way is string.split("http://")[1]. But this isn't regex.
A regex solution would be as follows:
string.scan(/^http:\/\/(.+)$/).flatten.first
To explain:
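String#scan returns an array of all matches of the regex. Because this regex contains a capture group, each match is itself an array of the captured groups.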
The regex:
^ matches beginning of line
http: matches those characters
\/\/ matches //
(.+) sets a "match group" containing one or more of any character. This is the value returned by the scan.
$ matches end of line
.flatten.first extracts the results from String#scan, which in this case returns a nested array.
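To make the shape of the return value concrete, here is a small sketch comparing String#scan with and without a capture group, plus String#[] with a capture index as a shorter way to get the same value. The "a1b22c333" sample string is made up for illustration.

```ruby
url = "http://stackoverflow.com"

# With a capture group, every match is an array of its captures.
grouped = url.scan(/^http:\/\/(.+)$/)

# Without groups, every match is simply the matched string.
plain = "a1b22c333".scan(/\d+/)

# String#[] with a capture index returns the first match's group directly.
direct = url[/^http:\/\/(.+)$/, 1]

p grouped
p plain
p direct
```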
You might want to try this:
#!/usr/bin/env ruby
str = "https://stackoverflow.com"
if mtch = str.match(/(?::\/\/)(\S+)/)
  f1 = mtch.captures.first
end
The regex contains two groups: the first is a non-capturing group that matches the :// separator, and the second captures every non-whitespace character after it. The captures method returns an array of the captured groups, so captures.first assigns the desired result to f1.
I hope this solves your problem.
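If a regex is not a hard requirement, Ruby's standard uri library can extract the host without one. The sketch below shows both routes side by side; `host_via_uri` and `host_via_regex` are hypothetical helper names, and the regex variant assumes the host ends at the next slash or at the end of the string.

```ruby
require 'uri'

# Hypothetical helper: let the standard library parse the URL.
def host_via_uri(url)
  URI.parse(url).host
end

# Hypothetical helper: lookbehind for "://", then everything up to the next slash.
def host_via_regex(url)
  url[%r{(?<=://)[^/]+}]
end

puts host_via_uri("https://www.stackoverflow.com")
puts host_via_regex("https://stackoverflow.com/questions/1")
```

Unlike the .* version, the [^/]+ pattern stops before any path that follows the domain.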

Parse file, find a string and store next values

I need to parse a file according to different rules.
The file contains several lines.
I go through the file line by line. When I find a specific string, I have to store the data present in the next lines until a specific character is found.
Example of file:
start {
/* add comment */
first_step {
sub_first_step {
};
sub_second_step {
code = 50,
post = xxx (aaaaaa,
bbbbbb,
cccccc,
eeeeee),
number = yyyy (fffffff,
gggggg,
jjjjjjj,
ppppppp),
};
So, in this case:
File.open(@file_to_convert, "r").each_line do |line|
In "line" I have my current line. I need to:
1) find when the line contains the string "xxx"
if line.include?("xxx") then
Correct?
2) store the next values (e.g. aaaaaa, bbbbbb, cccccc, eeeeee) in an array until I find the character ")", which marks the end of the section.
I think that when I reach the line containing "xxx", I have to iterate over the following lines inside the "if" block.
Try this:
file_contents = File.read(@file_to_convert)
lines = file_contents[/xxx \(([^)]+)\)/, 1].split
# => ["aaaaaa,", "bbbbbb,", "cccccc,", "eeeeee"]
The regex (xxx \(([^)]+)\)) takes all the text after xxx ( until the next ), and split splits it into its items.
I think this is what you are looking for:
looking = true
results = []
File.open(@file_to_convert, "r").each_line do |line|
  if looking
    if line.include?("xxx")
      looking = false
      # The first item sits after the opening parenthesis on the marker line.
      results << line[/\(([^,]*)/, 1]
    end
  elsif line.include?(")")
    # The closing parenthesis marks the last item.
    results << line.strip.delete('),')
    break
  else
    results << line.strip.delete(',')
  end
end
puts results
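The same line-by-line idea can be written as a helper that works on any text, which makes it easy to try without a file on disk. This is a sketch under assumptions: `collect_items` is a hypothetical name, the marker is expected to be followed by " (", and the items are expected to contain no parentheses or commas of their own.

```ruby
# Hypothetical helper: collect the comma-separated items that follow
# "<marker> (" up to the closing ")", even when they span several lines.
def collect_items(text, marker)
  items = []
  collecting = false
  text.each_line do |line|
    unless collecting
      next unless line.include?("#{marker} (")
      collecting = true
      # Keep only what follows the opening parenthesis on the marker line.
      line = line.split("(", 2).last
    end
    done = line.include?(")")
    # Drop the closing parenthesis and anything after it.
    line = line.split(")", 2).first if done
    items.concat(line.split(",").map(&:strip).reject(&:empty?))
    break if done
  end
  items
end

sample = <<~TEXT
  code = 50,
  post = xxx (aaaaaa,
  bbbbbb,
  cccccc,
  eeeeee),
  number = yyyy (fffffff,
  gggggg),
TEXT

items = collect_items(sample, "xxx")
p items
```

To use it with the file from the question, the text could be read with File.read first and passed in as the first argument.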

Regex to match value only once in text value

I am dealing with a dirty data source that has some key-value pairs I have to extract. For example:
First Name = John Last Name = Smith Home Phone = 555-333-2345 Work Phone = Email = john.doe@email.com Zip From = 11772 Zip To = 11782 First Name = John First Name = John
To extract the First Name, I am using this regular expression:
/First Name = ([a-zA-Z]*)/
How do I prevent multiple matches in the case where the First Name is duplicated as shown above?
Here is a version of this on Rubular.
match will only get the first match (you would use scan to get all):
str.match(/First Name = ([a-zA-Z]*)/).captures.first
#=> "John"
(given your string is in str)
[] will also give you the first match:
str[/First Name = ([a-zA-Z]*)/, 1]
The 1 means the first capture group
/^First Name = ([a-zA-Z]*)/
This will work too. Adding ^ anchors the match to the start of a line, so repeated keys later in the same line are not matched.
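Putting the answers above together, here is a short sketch that contrasts taking the first match with collecting all of them. `first_value` is a hypothetical helper name, and the record is a shortened version of the sample from the question.

```ruby
record = "First Name = John Last Name = Smith Zip From = 11772 " \
         "First Name = John First Name = John"

# Hypothetical helper: return the first alphabetic value recorded for `key`, or nil.
def first_value(record, key)
  record[/#{Regexp.escape(key)} = ([a-zA-Z]*)/, 1]
end

first_name = first_value(record, "First Name")
last_name = first_value(record, "Last Name")

# scan, by contrast, returns every occurrence.
all_first_names = record.scan(/First Name = ([a-zA-Z]*)/).flatten

p first_name
p last_name
p all_first_names
```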
