scrapy using xpath selector by class result in syntax error

I am trying to scrape https://books.toscrape.com/index.html using Scrapy. This code:
def parse(self, response):
    all_books = response.xpath("//article")
    for book in all_books:
        book_title = book.xpath(".//h3/a/@title").get()
        book_price = book.xpath(".//div[@class="product_price"]/p[@class="price_color"]/text()").get()
        print(book_title)
        print(book_price)
will result in:
    book_price = book.xpath(".//div[@class="product_price"]/p[@class="price_color"]/text()").get()
^
SyntaxError: invalid syntax
This is very strange, because it is a standard XPath selector that I copied from the Chrome inspection tool (where it works), and I copied it 1:1 as the instructor did in his lesson, yet only I get the error. What did I do wrong?

The double quotes inside the XPath end the Python string early. You need to change the outer quotes to single quotes (or change the inner quotes, or escape the inner quotes like this \"):
book_price = book.xpath('.//div[@class="product_price"]/p[@class="price_color"]/text()').get()

As a side note, you could just use:
book_price = book.xpath('.//p[@class="price_color"]/text()').get()
Matching on the class of the p element alone is enough, so the div predicate can be dropped, which makes the code shorter.

Related

How to have ruby conditionally check if variables exist in a string?

So I have a string from a rendered template that looks like
"Dear {{user_name}},\r\n\r\nThank you for your purchase. If you have any questions, we are happy to help.\r\n\r\n\r\n{{company_name}}\r\n{{company_phone_number}}\r\n"
All those variables like {{user_name}} are optional and do not need to be included but I want to check that if they are, they have {{ in front of the variable name. I am using liquid to parse and render the template and couldn't get it to catch if the user only uses 1 (or no) opening brackets. I was only able to catch the proper number of closing brackets. So I wrote a method to check that if these variables exist, they have the correct opening brackets. It only works, however, if all those variables are found.
here is my method:
def validate_opening_brackets?(template)
  text = %w(user_name company_name company_phone_number)
  text.all? do |variable|
    next unless template.include? variable
    template.include? "{{#{variable}"
  end
end
It works, but only if all variables are present. If, for example, the template created by the user does not include user_name, then it will return false. I've also done this loop using each, and creating a variable outside of the block that I assign false if the conditions are not met. I would really, however, like to get this to work using the all? method, as I can just return a boolean and it's cleaner.

In the original method, next without a value makes the block return nil for a missing variable, and all? treats nil as false. If the question is how to rewrite the all? block so that it returns true when every variable name that is present has two brackets before it, and false otherwise, then you could use something like this:
def validate_opening_brackets?(template)
  variables = %w(user_name company_name company_phone_number)
  variables.all? do |variable|
    !template.include?(variable) || template.include?("{{#{variable}")
  end
end
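
A quick way to check how this version behaves is to run it against templates that omit a variable or use a single opening bracket. The sketch below repeats the method so that it runs on its own; the sample templates are made up for illustration and are not from the question.

```ruby
# Same method as above, repeated so the snippet is self-contained.
def validate_opening_brackets?(template)
  variables = %w(user_name company_name company_phone_number)
  variables.all? do |variable|
    # An absent variable passes; a present one must appear after "{{".
    !template.include?(variable) || template.include?("{{#{variable}")
  end
end

# Hypothetical templates, not taken from the question.
only_company = "Thanks!\r\n{{company_name}}\r\n"
single_brace = "Dear {user_name}},\r\n{{company_name}}\r\n"
no_variables = "Dear Customer, thank you."

puts validate_opening_brackets?(only_company) # true
puts validate_opening_brackets?(single_brace) # false
puts validate_opening_brackets?(no_variables) # true
```

Note that the check looks at the template as a whole, not at each occurrence: a template containing both {{user_name}} and {user_name}} would still pass.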

TL;DR
There are multiple ways to do this, but the easiest way I can think of is to wrap a regular expression in the escaped characters used by Mustache/Liquid (e.g. double curly braces) and use alternation to check for each of your variable names between them. You can then use String#scan and return a Boolean from Enumerable#any? based on the contents of the Array returned by #scan.
This works with your posted example, but there may certainly be other use cases where you need a more complex solution. YMMV.
Example Code
This solution escapes the leading and trailing { and } characters to avoid having them treated as special characters, and then interpolates the variable names joined with | for alternation. The alternation is wrapped in a non-capturing group so that the braces apply to every name, not only to the first and last. It returns a Boolean depending on whether templated variables are found.
def template_string_has_interpolations?(str)
  var_names = %w[user_name company_name company_phone_number]
  regexp = /\{\{(?:#{var_names.join('|')})\}\}/
  str.scan(regexp).any?
end
Tested Examples
template_string_has_interpolations? "Dear {{user_name}},\r\n\r\nThank you for your purchase. If you have any questions, we are happy to help.\r\n\r\n\r\n{{company_name}}\r\n{{company_phone_number}}\r\n"
#=> true
template_string_has_interpolations? "Dear Customer,\r\n\r\nThank you for your purchase. If you have any questions, we are happy to help.\r\n\r\n\r\nCompany, Inc.\r\n(555) 555-5555\r\n"
#=> false

Xpath is correct but no result after scraping

I am trying to crawl the names of all the cities listed on the following page:
https://www.zomato.com/directory
I have tried the following XPath expressions.
# 1st approach:
def parse(self, response):
    cities_name = response.xpath('//div//h2//a/text()').extract_first()
    items['cities_name'] = cities_name
    yield items

# 2nd approach:
def parse(self, response):
    for city in response.xpath("//div[@class='col-l-5 col-s-8 item pt0 pb5 ml0']"):
        l = ItemLoader(item=CountryItem(), selector=city)
        l.add_xpath("cities_name", ".//h2//a/text()")
        yield l.load_item()
        yield city
Actual result: Crawl 0 pages and scrape 0 items
Expected: Adelaide, Ballarat etc

First thing to note:
Your XPath is a bit too specific. CSS classes in HTML don't always come in a reliable order: class1 class2 could end up as class2 class1, or the attribute may contain stray whitespace such as a trailing space.
When you match directly with [@class="class1 class2"], there's a high chance that it will fail. Instead, you should use the contains() function.
Second:
You have a small error in your cities_name XPath. In the HTML body the nesting is a>h2>text, while in your code it is reversed: h2>a>text.
That being said, I managed to get it working with these CSS and XPath selectors:
$ parsel "https://www.zomato.com/directory"
> p.mb10>a>h2::text +first
Adelaide
> p.mb10>a>h2::text +len
736
> -xpath
switched to xpath
> //p[contains(@class,"mb10")]/a/h2/text() +first
Adelaide
> //p[contains(@class,"mb10")]/a/h2/text() +len
736
parselcli - https://github.com/Granitosaurus/parsel-cli

You have a wrong XPath:
def parse(self, response):
    for city_node in response.xpath("//h2"):
        l = ItemLoader(item=CountryItem(), selector=city_node)
        l.add_xpath("city_name", ".//a/text()")
        yield l.load_item()

The main reason you are not getting any results from that page is that the site's HTML is not well-formed. You can get the results with the html5lib parser; I tried different parsers, but that one did the trick. The following shows how to do it. I used a CSS selector, though.
import scrapy
from bs4 import BeautifulSoup

class ZomatoSpider(scrapy.Spider):
    name = "zomato"
    start_urls = ['https://www.zomato.com/directory']

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html5lib')
        for item in soup.select(".row h2 > a"):
            yield {"name": item.text}

simplified regex for modifying a string in ruby

Here is my original string:
"Chassis ID TLV\n\tMAC: 00:xx:xx:xx:xx:xx\nPort ID TLV\n\tIfname: Ethernet1/3\nTime to Live TLV\n\t120"
and I want the string to be formatted as:
"Chassis ID TLV;00:xx:xx:xx:xx:xx\nPort ID TLV;Ethernet1/3\nTime to Live TLV;120"
So I used the following Ruby string methods to do it:
y = x.gsub(/\t[a-zA-Z\d]+:/,"\t")
y = y.gsub(/\t /,"\t")
y = y.gsub("\n\t",";")
I am looking for a one-liner to do the above. Since I am not used to regex, I did it in sequential steps, and I mess it up when I try to combine them.

Replace the following construct
[\n\r]\t(?:\w+: )?
with ;, see a demo on regex101.com.
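
In Ruby, this replacement can be written as a single gsub call. A minimal sketch, assuming the input is held in a variable named x as in the question:

```ruby
x = "Chassis ID TLV\n\tMAC: 00:xx:xx:xx:xx:xx\nPort ID TLV\n\tIfname: Ethernet1/3\nTime to Live TLV\n\t120"

# Replace a line break followed by a tab, plus an optional "Label: " prefix,
# with a single ";". Line breaks that are not followed by a tab are kept.
y = x.gsub(/[\n\r]\t(?:\w+: )?/, ";")

p y
# => "Chassis ID TLV;00:xx:xx:xx:xx:xx\nPort ID TLV;Ethernet1/3\nTime to Live TLV;120"
```

The optional group (?:\w+: )? is what lets the same pattern handle both the labelled lines (MAC:, Ifname:) and the bare value 120.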

I'd tackle it as a few smaller steps:
input = "Chassis ID TLV\n\tMAC: 00:xx:xx:xx:xx:xx\nPort ID TLV\n\tIfname: Ethernet1/3\nTime to Live TLV\n\t120"
input.split(/\n\t?/).map { |s| s.sub(/\A[^:]+\:\s*/, '') }.join(';')
# => "Chassis ID TLV;00:xx:xx:xx:xx:xx;Port ID TLV;Ethernet1/3;Time to Live TLV;120"
That way you have control over each element instead of being entirely dependent on the regular expression to do it as one shot.

Spring EL - "There is still more data in the expression"

Is it possible to parse an expression like the one shown below, where I call a method and would like to have text after the result of that method call?
String expressionString = "obj.someMethod()'test'";
ExpressionParser parser = new SpelExpressionParser();
Expression expression = parser.parseExpression(expressionString);
When I run code like the one below I get the following error:
org.springframework.expression.spel.SpelParseException: EL1041E:(pos 23): After parsing a valid expression, there is still more data in the expression: 'test'
If I remove the 'test' string, then it parses and evaluates correctly.

Even though SpEL is an expression language, it is based on the underlying programming language and follows its rules.
So, just as you would use the + operator to concatenate a method result with a string in Java, you should do the same in SpEL:
"obj.someMethod() + 'test'"
is the correct expression for you.
You can also use:
"obj.someMethod().concat('test')"
if someMethod() returns a String, of course.

What is Ruby doing with gsub here?

I'm working on converting code from Ruby to Node.js. I came across these lines at the end of a function and I'm curious what the original developers were trying to accomplish:
url = url.gsub "member_id", "member_id__hashed"
url = url.gsub member_id, member_id_hashed
url
I'm assuming that url at the end is Ruby's equivalent to return url;
As for the lines with gsub, from what I've found online that's the wrong syntax, right? Shouldn't it be:
url = url.gsub(var1, var2)?
If it is correct, why are they calling it twice, once with quotes and once without?

gsub does a global substitute on a string. If I had to guess, the URL might be in the form of
http://somewebsite.com?member_id=123
If so, the code has the following effect:
url.gsub "member_id", "member_id__hashed"
# => "http://somewebsite.com?member_id__hashed=123"
Assuming member_id = "123", and member_id_hashed is some hashed version of the id, then the second line would replace "123" with the hashed version.
url.gsub member_id, member_id_hashed
# => "http://somewebsite.com?member_id__hashed=abc"
So you're going from http://somewebsite.com?member_id=123 to http://somewebsite.com?member_id__hashed=abc
Documentation: https://ruby-doc.org/core-2.6/String.html#method-i-gsub

I'm assuming that the url at the end is Ruby's equivalent to return url;
If that code is part of a method or block, indeed, the line url is the value returned by the method. This is because by default a method in Ruby returns the value of the last expression that was evaluated in the method. The keyword return can be used (as in many other languages) to produce an early return of a method, with or without a return value.
that's the wrong syntax, right? shouldn't it be
url = url.gsub(var1, var2)?
The arguments used to invoke a method in Ruby may be enclosed in parentheses, but they may also be listed after the method name without them.
Both:
url = url.gsub var1, var2
and
url = url.gsub(var1, var2)
are correct and they produce the same result.
Parentheses around method arguments are often omitted in Ruby, but this is not always possible. One such case is when one of the arguments is itself a method call with arguments.
The parentheses are then used to make everything clear both for the interpreter and for readers of the code.
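
A small sketch of both call styles. The URL, the values, and the helper named hashed are made up for illustration and are not part of the original code.

```ruby
# Hypothetical helper, standing in for whatever hashing the real code uses.
def hashed(id)
  id.reverse
end

url = "http://somewebsite.com?member_id=123"
member_id = "123"

# Without parentheses: fine for a simple call.
a = url.gsub member_id, "abc"

# With a nested call as an argument, parentheses make the grouping explicit.
b = url.gsub(member_id, hashed(member_id))

p a # => "http://somewebsite.com?member_id=abc"
p b # => "http://somewebsite.com?member_id=321"
```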
If it is correct, why are they calling it twice, once with quotes and once without?
There are two calls of the same method, with different arguments:
url = url.gsub "member_id", "member_id__hashed"
The arguments of url.gsub are the literal strings "member_id" and "member_id__hashed".
url = url.gsub member_id, member_id_hashed
This time the arguments are the variables member_id and member_id_hashed.
This works the same in JavaScript and many other languages that use double quotes to enclose the string literals.
String#gsub is a method of class String that searches and replaces in a string and returns a new string. Its name is short for "global substitute" (it replaces all occurrences). To replace only the first occurrence, use String#sub.
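
The difference between sub and gsub, and the implicit return value, can be seen in a short sketch. The sample strings and the method name rename_param are assumptions chosen for illustration.

```ruby
query = "a=1&a=2"

p query.sub("a", "b")   # => "b=1&a=2"  (first occurrence only)
p query.gsub("a", "b")  # => "b=1&b=2"  (every occurrence)
p query                 # => "a=1&a=2"  (the original string is unchanged)

# A method returns the value of its last expression, so the bare `url`
# on the last line plays the role of `return url`.
def rename_param(url)
  url = url.gsub "member_id", "member_id__hashed"
  url
end

p rename_param("http://somewebsite.com?member_id=123")
# => "http://somewebsite.com?member_id__hashed=123"
```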
