Scrapy xpath scraping meta

I'm scraping this URL with Scrapy: http://quotes.toscrape.com/
It works great when I do:
response.xpath("//meta[@itemprop='keywords']/@content").extract()
response.xpath("//meta[@itemprop='keywords'][1]/@content").extract_first()
But when I try to get the second meta from that list using an index:
response.xpath("//meta[@itemprop='keywords'][2]/@content").extract_first()
it returns nothing.
What am I missing?
Thanks!

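You need to wrap the expression before the index in parentheses.
Instead of:
"//meta[@itemprop='keywords'][2]/@content"
it should be:
"(//meta[@itemprop='keywords'])[2]/@content"
This is needed because a positional predicate such as [2] binds to the step it follows, not to the whole path. Without parentheses, [2] selects a meta that is the second matching child of its own parent, and on this page each quote block holds only one such meta, so nothing matches. With parentheses, [2] indexes the complete node-set returned by the path.
You can test this:
$ scrapy shell "http://quotes.toscrape.com/"
In [1]: response.xpath("//meta[@itemprop='keywords'][2]/@content").extract_first()
In [2]: response.xpath("(//meta[@itemprop='keywords'])[2]/@content").extract_first()
Out[2]: 'abilities,choices'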
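The per-parent behavior of a positional predicate can be reproduced without Scrapy. The sketch below is only an illustration under stated assumptions: it uses Python's standard-library xml.etree.ElementTree on a made-up fragment (the SAMPLE markup is hypothetical, not the real page). ElementTree implements only a subset of XPath and does not accept the parenthesized form, so indexing the whole result is done with an ordinary Python list index instead. The attribute predicate is left out because every meta in the fragment is a keywords meta.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two "quote" blocks, each holding exactly one keywords meta.
SAMPLE = """
<body>
  <div class="quote">
    <meta itemprop="keywords" content="change,thinking"/>
  </div>
  <div class="quote">
    <meta itemprop="keywords" content="abilities,choices"/>
  </div>
</body>
""".strip()

root = ET.fromstring(SAMPLE)

# [1] is evaluated per parent: every div has a first meta child, so both match.
first_per_parent = [m.get("content") for m in root.findall(".//meta[1]")]

# [2] is also evaluated per parent: no div has a second meta child, so nothing matches.
second_per_parent = [m.get("content") for m in root.findall(".//meta[2]")]

# Indexing the complete result list plays the role of (...)[2] in full XPath.
all_keywords = [m.get("content") for m in root.findall(".//meta")]
second_overall = all_keywords[1]

print(first_per_parent)
print(second_per_parent)
print(second_overall)
```

In a Scrapy shell the same effect comes from the parenthesized XPath itself, or from taking all matches and indexing the resulting list in Python.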

Related

Selecting id attribute using xpath/scrapy

I am trying to select the user name from the following forum URL.
However, when I use the following in the Scrapy shell:
admin:~/workspace/scrapper (master) $ scrapy shell "https://bitcointalk.org/index.php?action=profile;u=22232"
In [1]: response.xpath('//*[@id='bodyarea']/table/tbody/tr/td/table/tbody/tr[2]/td[1]/table/tbody/tr[1]/td[2]')
File "<ipython-input-4-abe70514018b>", line 1
response.xpath('//*[@id='bodyarea']/table/tbody/tr/td/table/tbody/tr[2]/td[1]/table/tbody/tr[1]/td[2]')
^
SyntaxError: invalid syntax
However, in Chrome the selector works fine.
Any suggestions as to what I am doing wrong?
I appreciate your replies!

This is caused by inconsistent quoting. You're using single quotes both to delimit the Python string and for the string literal inside the XPath, so Python ends the string at the second quote.
Use either
'//*[@id="bodyarea"]/table...'
or
"//*[@id='bodyarea']/table..."

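The quoting problem in the thread above can be checked without fetching any page. The sketch below only inspects Python strings; the shortened XPath and the variable names are illustrative assumptions, not the full selector from the question.

```python
# Outer single quotes, inner double quotes.
xpath_a = '//*[@id="bodyarea"]/table'
# Outer double quotes, inner single quotes.
xpath_b = "//*[@id='bodyarea']/table"
# The same quote character inside and outside also works if the inner ones are escaped.
xpath_c = '//*[@id=\'bodyarea\']/table'

# xpath_b and xpath_c are the same Python string; xpath_a differs only in the
# quote character used inside the XPath, which XPath treats as equivalent.
print(xpath_b == xpath_c)
print(xpath_a.replace('"', "'") == xpath_b)

# The original form is rejected by the Python parser before any XPath is evaluated.
broken_source = "response.xpath('//*[@id='bodyarea']/table')"
try:
    compile(broken_source, "<example>", "eval")
    parses = True
except SyntaxError:
    parses = False
print(parses)
```

The last check shows that the error comes from Python's own parser, which is why the same selector can work in Chrome and still fail in the shell.
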
extracting data from txt file?

I want to extract data from a text file. The file contains the following, say:
<img src="a.jpg" alt="abc" height="12px" width="12px">
<div class="ab3" id="1122">
<img src="b.jpg" alt="abc" height="12px" width="12px">
<div class="cd5" id="9876">
I want to extract the "id" values from the text file shown above.
The output should be:
1122
9876
I tried using findstr, find, etc. (DOS commands), but was not able to find the right regular expression.
Is there any other way? Any help?

I agree with @izogfif: you should consider other tools for this task.
But to answer what you asked, this regex:
id="[0-9]+"
will give you output like this:
id="1122"
id="9876"
From there you can save those results (or use a pipe, however that is done in DOS), and then this regex:
[0-9]*
will give you this output:
1122
9876

Use the following regex:
( id=")[^"]*"
This matches the id attribute together with its quoted value, whatever the value contains.
You can replace id with any other attribute you are searching for.
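If a scripting language is an option instead of findstr, the two-step idea above (match the attribute, then keep only the value) collapses into one regex with a capture group. This is a minimal sketch in Python, assuming sample lines modeled on the question; the function name attribute_values is hypothetical.

```python
import re

# Sample lines modeled on the text in the question.
TEXT = '''<img src="a.jpg" alt="abc" height="12px" width="12px">
<div class="ab3" id="1122">
<img src="b.jpg" alt="abc" height="12px" width="12px">
<div class="cd5" id="9876">'''


def attribute_values(text, name):
    """Return the double-quoted values of attribute `name`, in order of appearance."""
    # \b stops the name from matching as the tail of a longer word such as "grid".
    # The capture group keeps only the text between the quotes.
    pattern = re.compile(r'\b' + re.escape(name) + r'="([^"]*)"')
    return pattern.findall(text)


ids = attribute_values(TEXT, "id")
print("\n".join(ids))
```

Passing a different attribute name, for example "alt", returns that attribute's values instead. For anything beyond simple, regular input like this, an HTML parser is the more robust choice.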

String substitution in Puppet?

Is it possible to do a string substitution/transformation in Puppet using a regular expression?
If $hostname is "web1", I want $hostname_without_number to be "web". The following isn't valid Puppet syntax, but I think I need something like this:
$hostname_without_number = $hostname.gsub(/\d+$/, '')

Yes, it is possible.
Check the puppet function reference: http://docs.puppetlabs.com/references/2.7.3/function.html
There's a regular expression substitution function built in. It probably calls the same underlying gsub function.
$hostname_without_number = regsubst($hostname, '\d+$', '')
Or if you prefer to actually call out to Ruby, you can use an inline ERB template:
$hostname_without_number = inline_template('<%= hostname.gsub(/\d+$/, "") %>')
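Before putting a pattern into a manifest, it can help to try it on a few sample hostnames. The sketch below does that in Python rather than Ruby, so treat it as an approximation: the two regex dialects differ in places, although for a simple pattern like \d+$ on single-line input they should behave the same. The function name is hypothetical.

```python
import re


def strip_trailing_digits(hostname):
    """Remove a run of digits at the end of the name and leave the rest untouched."""
    return re.sub(r"\d+$", "", hostname)


for name in ("web1", "db12", "web", "web1a"):
    print(name, "->", strip_trailing_digits(name))
```

Only digits at the very end are removed, so "web1a" comes back unchanged.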

This page:
https://blog.kumina.nl/2010/03/puppet-tipstricks-testing-your-regsubst-replacings-2/comment-page-1/
explains it quite well and includes a handy trick for testing your regular expressions with irb.
With this link and freiheit's answer I was able to solve my problem of replacing '\' with '/'.
$programfiles_sinbackslash = regsubst($env_programfiles,'\','/','G')

There is a way to parse from smarty?

Folks, in PHP I have code like this:
'homeSize' => Image::getSize('home')))
Is there a way to get the same value from within Smarty?
For example:
'largeSize' => Image::getSize('large')

Using the modifier "getimagesize" you can get all the required info about an image.
I wrote an example of how to split the result into separate variables here:
http://www.i-do-this.com/blog/49/Getting-the-image-dimensions-in-smarty

I solved it myself. To do this from Smarty: {assign var='largeSize' value=Image::getSize('large')}

Ruby Typhoeus Request: url with quotes

I'm having a problem making a request with Typhoeus because my query needs to contain quotation marks.
If the URL is
url = "http://app.com/method.json?'my_query'"
everything works fine. However, the method I'm calling only returns the results I want if the query is the following (I've tested it in a browser):
url2 = "http://app.com/method.json?"my_query""
When running
Typhoeus::Request.get(url2)
I get (URI::InvalidURIError)
Escaping the quotes with "\" does not work. How can I do this?
Thanks

You should properly encode your URI with URI.encode or CGI.escape; doing so gives you valid URLs like these:
http://app.com/method.json?%27my_query%27 # Single quotes
http://app.com/method.json?%22my_query%22 # Double quotes

Try:
require 'uri'
URI.encode('"foo"')
=> "%22foo%22"

Passing JSON, quotes, etc. in a GET request is tricky. In Ruby 2.x we can use the URI module's escape method (it was removed in Ruby 3.0):
> URI.escape('http://app.com/method.json?agent={"account":
{"homePage":"http://demo.my.com","name":"Senior Leadership"}}')
But I suggest sending it as a POST request instead and passing the data in the message body.
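The percent-encoding described in these answers is not specific to Ruby. As a hedged illustration, the sketch below does the same thing with Python's standard urllib.parse; the endpoint is the placeholder URL from the question and build_url is a hypothetical helper.

```python
from urllib.parse import quote, unquote

BASE = "http://app.com/method.json"  # placeholder endpoint from the question


def build_url(raw_query):
    """Percent-encode the raw query text and append it after '?'."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return BASE + "?" + quote(raw_query, safe="")


single = build_url("'my_query'")
double = build_url('"my_query"')
print(single)
print(double)

# A JSON payload survives the round trip: decoding the encoded text gives it back.
payload = '{"name":"Senior Leadership"}'
encoded = quote(payload, safe="")
print(unquote(encoded) == payload)
```

Only the query text is encoded here, not the whole URL, so the "?" separator and the scheme stay intact.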
