How to produce xpath link with characters from [a-z] - xpath

I am scraping data from a website and need to iterate over its pages, but instead of a numeric counter the pages use an alphabetical index:
http://funny2.com/jokesb.htm
http://funny2.com/jokesc.htm
...
I can't figure out how to include the [a-z] iterator. I tried
http://funny2.com/jokes^[a-z]+$.htm
which didn't work.

XPath 1.0 doesn't support regular expressions. However, because Scrapy is built atop lxml, it supports some EXSLT extensions, in particular the re extension. You can use EXSLT operations by prefixing them with the corresponding namespace, like this:
response.xpath(r'//a[re:test(@href, "jokes[a-z]+\.htm")]/@href')
Docs: https://doc.scrapy.org/en/latest/topics/selectors.html?highlight=selector#using-exslt-extensions
If you just need to extract the links, use LinkExtractor with a regular expression:
LinkExtractor(allow=r'/jokes[a-z]+\.htm').extract_links(response)
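To see which hrefs that pattern would accept without running a spider, the same regular expression can be tried with Python's re module. This is only a sketch in Python 3: the hrefs list is made up for illustration and is not taken from the site.

```python
import re

# Same pattern as in the LinkExtractor call above.
PAGE_PATTERN = re.compile(r'/jokes[a-z]+\.htm')

# Hypothetical hrefs, invented for this example.
hrefs = ['/jokesa.htm', '/jokesb.htm', '/about.htm', '/jokes1.htm']

# Keep only the hrefs that contain an alphabetical index page.
matching = [h for h in hrefs if PAGE_PATTERN.search(h)]
print(matching)
```

The digit in '/jokes1.htm' is rejected because [a-z]+ requires at least one lowercase letter directly after "jokes".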

You can iterate through every letter of the alphabet and format each letter into a URL template:
from string import ascii_lowercase
# ascii_lowercase == 'abcdefghijklmnopqrstuvwxyz'
for char in ascii_lowercase:
    url = "http://funny2.com/jokes{}.htm".format(char)
In a Scrapy context, you need a way to increment the character in the URL. You can find it with a regex, work out the next character in the alphabet, and substitute it into the current URL, something like:
import re
from string import ascii_lowercase
from scrapy import Request

def parse(self, response):
    # Letter of the page currently being parsed, e.g. 'b' for jokesb.htm
    current_char = re.search(r'jokes(\w)\.htm', response.url).group(1)
    next_index = ascii_lowercase.index(current_char) + 1
    if next_index < len(ascii_lowercase):  # stop after 'z'
        next_char = ascii_lowercase[next_index]
        next_url = re.sub(r'jokes\w\.htm', 'jokes{}.htm'.format(next_char), response.url)
        # Re-enter parse so the following letter is handled the same way
        yield Request(next_url, self.parse)
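The letter arithmetic can also be pulled out into a plain function that is testable without Scrapy. The following is a minimal Python 3 sketch; the helper name next_page_url is hypothetical, and it assumes the index is a single lowercase letter as in the URLs from the question.

```python
import re
from string import ascii_lowercase

def next_page_url(url):
    """Return the URL for the next letter, or None after 'z' or if no letter page is found."""
    match = re.search(r'jokes([a-z])\.htm', url)
    if match is None:
        return None
    next_index = ascii_lowercase.index(match.group(1)) + 1
    if next_index >= len(ascii_lowercase):
        return None  # 'z' was the last page
    next_char = ascii_lowercase[next_index]
    # Splice the new letter in place of the captured one.
    return url[:match.start(1)] + next_char + url[match.end(1):]

print(next_page_url('http://funny2.com/jokesb.htm'))
```

Returning None at the end of the alphabet gives the caller an explicit stop condition instead of an IndexError.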

Related

Force quotes in yaml

I have a ruby hash that is something like this:
myhash = { title: 'http://google.com'}
I'm trying to add this to a yaml file like this:
params['myhash'] = myhash
File.open('config.yaml', 'w') do |k|
  k.write params.to_yaml
end
The problem is that YAML is removing the quotes around the links even though they are needed (they contain ':').
According to several questions on Stack Overflow, YAML should only remove the quotes when they are not needed.
I found a solution, but it's really ugly and I would prefer not to use it if there is another way.
I suppose that yaml should be including the quotes in this case. Is there any reason why it's not doing this?
Note: the links are dynamically created
Quotes aren't necessary for your example string. From the specs:
Normally, YAML insists the “:” mapping value indicator be separated from the value by white space. A benefit of this restriction is that the “:” character can be used inside plain scalars, as long as it is not followed by white space.
For example:
h = { value1: 'quotes: needed', value2: 'quotes:not needed' }
puts h.to_yaml
Results in:
---
:value1: 'quotes: needed'
:value2: quotes:not needed
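The colon rule quoted from the spec can be written as a one-line check. This is only a Python 3 sketch of that single rule, with a hypothetical helper name; a real YAML emitter considers other cases as well, so it should not be read as a full quoting test.

```python
import re

def colon_forces_quotes(value):
    """True if value contains a ':' immediately followed by whitespace."""
    # Simplification: covers only the ': ' rule quoted above.
    return re.search(r':\s', value) is not None

for text in ['quotes: needed', 'quotes:not needed', 'http://google.com']:
    print(repr(text), colon_forces_quotes(text))
```

The URL from the question falls in the same category as 'quotes:not needed': its colon is followed by '/', not by white space.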
After a couple of hours I found it easier to do in Python.
Usage: python quotes.py *.yml
This script uses the literal block style if a string contains '\n', and double quotes otherwise.
It uses ruamel instead of the yaml library, which does not seem to handle some UTF-8 entries.
# Python 2 code: unicode is the text type here (str on Python 3).
from ruamel import yaml
import io
import sys

class quote_or_literal(unicode):
    pass

def str_presenter(dumper, data):
    if data.count("\n"):  # check for multiline string
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    else:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')

yaml.add_representer(quote_or_literal, str_presenter)

def quote_dict(d):
    # Recursively wrap every non-dict value so str_presenter is applied to it
    new = {}
    for k, v in d.items():
        if isinstance(v, dict):
            v = quote_dict(v)
        else:
            v = quote_or_literal(v)
        new[k] = v
    return new

def ensure_quotes(path):
    with io.open(path, 'r', encoding='utf-8') as stream:
        a = yaml.load(stream, Loader=yaml.Loader)
        a = quote_dict(a)
    with io.open(path, 'w', encoding='utf-8') as stream:
        yaml.dump(a, stream, allow_unicode=True,
                  width=1000, explicit_start=True)

if __name__ == "__main__":
    for path in sys.argv[1:]:
        ensure_quotes(path)

Regex to extract last number portion of varying URL

I'm creating a URL parser and have three kinds of URLs. From each one I would like to extract the number at the end of the URL, increment it by 10, and update the URL. I'm trying to use a regex for the extraction, but I'm new to regex and having trouble.
These are the three URL structures whose last number I'd like to increment:
Increment last number 20 by 10:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/
Increment last number 50 by 10:
https://forums.questionablecontent.net/index.php/board,1.50.html
Increment last number 30 by 10:
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/
With the \d+(?!.*\d) regex, you get the last chunk of digits in the string. Then use gsub with a block to modify the number and put it back into the result.
See this Ruby demo:
strs = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
arr = strs.map {|item| item.gsub(/\d+(?!.*\d)/) {$~[0].to_i+10}}
Note: $~ is a MatchData object, and using the [0] index we can access the whole match value.
Results:
http://forums.scamadviser.com/site-feedback-issues-feature-requests/30/
https://forums.questionablecontent.net/index.php/board,1.60.html
https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.40/
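For comparison, the same idea can be sketched in Python 3 with re.sub and a replacement function. The pattern is the one from the answer above; the function name bump_last_number and its step parameter are assumptions made for this example.

```python
import re

urls = [
    'http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/',
    'https://forums.questionablecontent.net/index.php/board,1.50.html',
    'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/',
]

def bump_last_number(url, step=10):
    """Add step to the last run of digits in url."""
    # \d+(?!.*\d) matches a digit run that has no digit anywhere after it.
    return re.sub(r'\d+(?!.*\d)', lambda m: str(int(m.group(0)) + step), url)

updated = [bump_last_number(u) for u in urls]
for u in updated:
    print(u)
```

As in the Ruby version, the replacement is computed per match, so the rest of the URL is left untouched.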
Try this regex:
\d+(?=(\/)|(.html))
It will extract the last number.
Demo: https://regex101.com/r/zqUQlF/1
Substitute back with this regex:
(.*?)(\d+)((\/)|(.html))
Demo: https://regex101.com/r/zqUQlF/2
This regex matches only the last whole number in each URL by using a lookahead (which 'sees' a pattern but doesn't consume any characters):
\d+(?=\D*$)
Like this:
urls = ['http://forums.scamadviser.com/site-feedback-issues-feature-requests/20/', 'https://forums.questionablecontent.net/index.php/board,1.50.html', 'https://forums.comodo.com/how-can-i-help-comodo-please-we-need-you-b39.30/']
pattern = /(\d+)(?=[^\d]+$)/
urls.each do |url|
  url.gsub!(pattern) {|m| m.to_i + 10}
end
puts urls
You can also test it online here: https://ideone.com/smBJCQ

Is there an easy way to find a substring that matches a pattern within a string and extract it?

I know this is open-ended, but I'm not sure how to go about it.
Say I have the string "FDBFBDFLDJVHVBDVBD" and want to find every sub-string that starts with something like "BDF" and ends with either "EFG" or "EDS", is there an easy way to do this?
You can use re.finditer
>>> import re
>>> s = "FDBFBDFLDJVHVBDVBDBDFEFGEDS"
>>> print [s[a.start(): a.end()] for a in re.finditer('BDF', s)]
['BDF', 'BDF']
find every sub-string that starts with something like "BDF" and ends with either "EFG" or "EDS"
It is a job for a regular expression. To extract all such substrings as a list:
import re
substrings = re.findall(r'BDF.*?E(?:FG|DS)', text)
If a substring might contain newlines then pass flags=re.DOTALL.
Example:
>>> re.findall(r'BDF.*?E(?:FG|DS)', "FDBFBDFLDJVHVBDVBDBDFEFGEDS")
['BDFLDJVHVBDVBDBDFEFG']
.*? is non-greedy, so each match stops at the first possible ending after its start. Remove the ? to make it greedy and get the longest match instead.
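The difference between the two quantifiers is easy to check side by side. A short Python 3 sketch on the same example string (the variable names are chosen for this illustration):

```python
import re

text = "FDBFBDFLDJVHVBDVBDBDFEFGEDS"

# Non-greedy: stop at the first EFG/EDS after the first BDF.
lazy = re.findall(r'BDF.*?E(?:FG|DS)', text)

# Greedy: run to the last EFG/EDS in the string.
greedy = re.findall(r'BDF.*E(?:FG|DS)', text)

print(lazy)
print(greedy)
```

Neither variant reports overlapping matches, because findall resumes searching after the end of the previous match.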
Seeing as there is no regex expert here yet, I will propose this solution (BTW I added "BDFEFGEDS" to the end of your string so it would give some results):
import re
s = "FDBFBDFLDJVHVBDVBDBDFEFGEDS"
endings = ['EFG', 'EDS']
matches = []
for ending in endings:
    match = re.findall(r'(?=(BDF.*{0}))'.format(ending), s)
    matches.extend(match)
print matches
giving the result:
['BDFLDJVHVBDVBDBDFEFG', 'BDFEFG', 'BDFLDJVHVBDVBDBDFEFGEDS', 'BDFEFGEDS']

How to extract href from a tag using ruby regex?

I have this link, which I declare like this:
link = '<a href="https://www.congress.gov/bill/93rd-congress/house-bill/11461">H.R.11461</a>'
The question is how could I use regex to extract only the href value?
Thanks!
If you want to parse HTML, you can use the Nokogiri gem instead of using regular expressions. It's much easier.
Example:
require "nokogiri"
link = '<a href="https://www.congress.gov/bill/93rd-congress/house-bill/11461">H.R.11461</a>'
link_data = Nokogiri::HTML(link)
href_value = link_data.at_css("a")[:href]
puts href_value # => https://www.congress.gov/bill/93rd-congress/house-bill/11461
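The same parser-over-regex approach can be sketched with Python 3's standard html.parser module. The anchor markup below is an assumption reconstructed from the URL and link text shown in this thread, and the class name HrefCollector is hypothetical.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)

# Assumed markup, rebuilt from the URL and link text in this thread.
link = '<a href="https://www.congress.gov/bill/93rd-congress/house-bill/11461">H.R.11461</a>'

collector = HrefCollector()
collector.feed(link)
print(collector.hrefs)
```

Because the parser handles attribute quoting itself, this keeps working if the tag has other attributes or unusual spacing.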
You should be able to use a regular expression like this:
href\s*=\s*"([^"]*)"
See this Rubular example of that expression.
The capture group will give you the URL, e.g.:
link = '<a href="https://www.congress.gov/bill/93rd-congress/house-bill/11461">H.R.11461</a>'
match = /href\s*=\s*"([^"]*)"/.match(link)
if match
  url = match[1]
end
Explanation of the expression:
href matches the href attribute
\s* matches 0 or more whitespace characters (this is optional -- you only need it if the HTML might not be in canonical form).
= matches the equal sign
\s* again allows for optional whitespace
" matches the opening quote of the href URL
( begins a capture group for extraction of whatever is matched within
[^"]* matches 0 or more non-quote characters. Since quotes inside HTML attributes must be escaped this will match all characters up to the end of the URL.
) ends the capture group
" matches the closing quote of the href attribute's value
In order to capture just the url you can do this:
/(href\s*\=\s*\\\")(.*)(?=\\)/
And use the second match.
http://rubular.com/r/qcqyPv3Ww3

How to parse a resource for the ID in ruby

I have a relative URI / resource:
"/v1/threads/110408889879497140/"
I want to just parse out the ID (the final number in this string).
Hoping something other than regex :)
a = "/v1/threads/110408889879497140/"
a.split('/').last
you can also do it with rpartition:
"/v1/threads/110408889879497140/".rpartition('threads/').last.chop
Use scan with regex:
a.scan(/\d{5,}/)
If you want to isolate the digits in a string without regex, you can use the fact that the digits occupy the ASCII range 48 to 57 and do something like:
a = "/v1/threads/110408889879497140/"
digits = a.each_char.select { |c| c.ord.between?(48, 57) }.join
p digits #=> "1110408889879497140"
Note that this keeps every digit in the string, including the 1 from "v1", so the result is not just the ID.
The path part of a URL has the same shape as a file path, so you can use File.basename, which was designed to work with file paths:
File.basename("/v1/threads/110408889879497140/")
# => "110408889879497140"
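A regex-free equivalent can be sketched in Python 3 with the standard posixpath module. Unlike Ruby's File.basename, Python's basename returns an empty string when the path ends with a slash, so the trailing slash is stripped first; the variable names are chosen for this example.

```python
import posixpath

resource = "/v1/threads/110408889879497140/"

# Strip the trailing slash, then take the last path component.
thread_id = posixpath.basename(resource.rstrip('/'))
print(thread_id)
```

The result is still a string; convert it with int(thread_id) if a number is needed.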
