Xpath Contains OR definition clarification - xpath

While scraping websites with Scrapy, I ran into a general question about using XPath's contains() and or to write more concise expressions.
For example, to find the 'contact' link on both Spanish and English websites, I've been using the following XPath query:
//*[contains(text(), 'Contact') or contains(text(), 'Contact Us') or contains(text(), 'Contactar') or contains(text(),'Contacto') or contains(text(), 'Contáctenos')]
The issue I am facing is that the query above might be ambiguous. For example, if a website had both a 'Contacto' and a 'Contactenos' link, it isn't clear to me which one would be returned. I have tried changing the order of the conditions, but, unlike most or statements I am used to working with, the order seems to make no difference to which one is returned. Does anyone know how the 'or' keyword is evaluated?
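For reference, in XPath 1.0 the or keyword is a plain boolean operator inside the predicate: it decides whether each node is kept, not which alternative is preferred, and in practice Scrapy's selectors hand the matches back in document order. The sketch below only emulates that predicate in plain Python with the standard library (xml.etree.ElementTree does not support contains() or or itself). The helper name matching_elements and the sample markup are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Keywords from the query above; 'Contact Us' is omitted because anything
# containing it also contains 'Contact'.
KEYWORDS = ("Contact", "Contactar", "Contacto", "Contáctenos")


def matching_elements(root, keywords=KEYWORDS):
    """Keep every element whose leading text contains at least one keyword.

    Elements are visited in document order, so the keyword order cannot
    change which elements are returned or how they are ordered.
    """
    return [
        el for el in root.iter()
        if el.text and any(word in el.text for word in keywords)
    ]


# Hypothetical page fragment with two contact links.
page = ET.fromstring(
    "<div><a href='/es/contactenos'>Contáctenos</a>"
    "<a href='/es/contacto'>Contacto</a></div>"
)

hits = [el.get("href") for el in matching_elements(page)]
swapped = [el.get("href") for el in matching_elements(page, KEYWORDS[::-1])]
print(hits)
print(hits == swapped)
```

Both links match regardless of keyword order. If one variant should win, one option is to run separate queries in priority order and take the first non-empty result.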

Related

Syntax for scraping double quotes in rapidminer (XPATH)

I'm having trouble using XPath in RapidMiner when trying to retrieve reviews from the Google Play Store. The problem seems to be that these reviews are in double quotes and I can't get RapidMiner to return the text, only blanks. I have a number of other XPath queries that work fine for other elements, where I use divs, spans, etc. I'm able to get this query to work in Google Spreadsheet through =importXML, but not in RapidMiner.
This is what I have in XPath:
//*[@class='review-text']")
So I added a /text() to the end and still got nothing. I have played around with using //div instead of //* and have used h:/span as well. I'm kind of hoping there's a special syntax for retrieving quotes that I'm unaware of?
Here is the HTML i'm looking to scrape in the image below:
https://i.stack.imgur.com/dl6I8.png
Please see my comment below on further failed tests. Thanks.

xpath and scrapy not getting text into a paragraph with multiple attributes

I am trying to write a web scraper using scrapy and xpath but I am experiencing a frustrating problem.
I need the text of a paragraph with the following HTML:
<p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>
I might be wrong, but since the p has an id attribute, it should be possible to refer to it simply with
response.xpath('//p[@id="match-date"]/text()').extract()
However, this doesn't work.
I know a little XPath and have written scrapers in the past, but this one is giving me trouble. I tried many solutions, but none of them seems to work:
response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id,"match-date")]/text()').extract()
response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()
I also tried using "contains", as suggested in many answers, but that did not work either. This might be a stupid mistake on my part... it would be great if someone could help me!
Thank you so much
Maybe match-date is loaded via AJAX/JS. Please disable JavaScript in your browser and then check whether match-date is still there.
Also, for the sake of simplicity, use CSS selectors instead of XPath.
response.css('#match-date::text').extract()
EDIT:
To get the value of the data-dt attribute, do this:
response.css('#match-date::attr(data-dt)').extract()
or, with XPath:
response.xpath('//p[@id="match-date"]/@data-dt').extract()
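To check the selector independently of the live page, the same lookup can be tried against a static copy of the markup. This is a minimal sketch using the standard library's ElementTree, which supports the attribute predicate; the wrapping div and the data-dt value are placeholders, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# The paragraph from the question, wrapped in a hypothetical <div>.
# The data-dt value is a placeholder, not taken from the real page.
snippet = (
    '<div><p class="list-details__item__date" id="match-date" '
    'data-dt="placeholder">04.03.2017 - 15:00</p></div>'
)
root = ET.fromstring(snippet)

# Select the paragraph by its id attribute.
node = root.find(".//p[@id='match-date']")

match_text = node.text            # text content of the paragraph
match_attr = node.get("data-dt")  # attribute value of data-dt
print(match_text, match_attr)
```

If the lookup works on the static snippet but Scrapy returns nothing, the element is probably missing from the HTML that Scrapy actually downloads, which fits the AJAX/JS explanation.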

How to search for a substring (WHERE column LIKE '%foo%')

I'm reading parse API documentation at https://parse.com/docs/rest/guide#queries and can't find how to search by a substring. SQL equivalent would be:
... WHERE column_name LIKE "%foo%"
There's a bunch of options such as $gt, $lt, $in, and similar, but there's no option for LIKE. It's a pretty common use case... What am I missing?
We've found something that looks like an undocumented feature.
There's a $regex lookup that isn't mentioned in the official API reference. It allows matching by regular expressions, which solved the problem for us.
We believe it should be documented here:
https://parse.com/docs/rest/guide/#queries-query-constraints
But apparently it isn't.
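As a rough sketch of how such a constraint could be encoded for the REST API's where parameter: the helper name regex_where is made up for illustration, and it assumes that Python's regex escaping is compatible with the pattern syntax the backend accepts.

```python
import json
import re
from urllib.parse import parse_qs, urlencode


def regex_where(field, substring):
    """Build a URL-encoded 'where' parameter matching `substring` anywhere in `field`."""
    # re.escape keeps characters such as '.' or '*' literal inside the pattern.
    constraint = {field: {"$regex": re.escape(substring)}}
    return urlencode({"where": json.dumps(constraint)})


query_string = regex_where("column_name", "foo")

# Decode it again to show the constraint that would be sent.
decoded = json.loads(parse_qs(query_string)["where"][0])
print(query_string)
print(decoded)
```

An unanchored pattern like this corresponds to LIKE '%foo%'; it will likely be slow on large classes because it cannot use an index.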
In your case, I think you should follow Parse's tutorial here: http://blog.parse.com/learn/engineering/implementing-scalable-search-on-a-nosql-backend/
It is very helpful in terms of search. The main ideas are:
Split the text into single words and store them in an array field.
Then search on the new field using a containedIn (or $in in the REST API) query.
There's also a question regarding this too: How to make a "like" query in Parse.com
EDIT: New blog link: https://web.archive.org/web/20150416171914/http://blog.parse.com/2013/03/19/implementing-scalable-search-on-a-nosql-backend/
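The two steps above can be sketched as follows. The function names and the field name words are hypothetical, and the tokenization (lowercasing, splitting on non-word characters) is an assumption rather than what the blog post prescribes.

```python
import re


def to_words(text):
    """Split text into unique lowercase words, the value stored in the extra array field."""
    return sorted(set(re.findall(r"\w+", text.lower())))


def in_constraint(field, query):
    """Build an $in constraint from the words of a search query."""
    return {field: {"$in": to_words(query)}}


# Value to store alongside the original text when saving an object.
words = to_words("Implementing scalable search on a NoSQL backend")

# Constraint to send when searching.
constraint = in_constraint("words", "Scalable SEARCH")
print(words)
print(constraint)
```

Note that this matches whole words only, so it is not a full replacement for LIKE '%foo%'.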

Import Internal Error during ImportXML with Google Spreadsheet

I am trying to import some data (market capitalization) from the Bloomberg website into my Google spreadsheet, but Google gives me an Import Internal Error.
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
I really do not know what causes this problem. In the past I got around it by adjusting the XPath query, but this time I couldn't find an XPath query that works.
Does anybody know the reason for this error, or how I can make it work?
I am not familiar with Google Spreadsheet, but I think there is simply a superfluous closing parenthesis in your code.
Replace
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP"),"//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
with
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
Also, are you sure it's ImportXml and not ImportXML?
If this does not solve your problem, you have to explain what exactly you are looking for in the HTML.
Edit
Applying the XPath expression you show to the HTML source, I get the following result:
<td xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" class="company_stat">641,807.15</td>
Is this what you would have expected? If yes, then XPath is not at fault and the problem lies somewhere else. If not, then please describe what you are looking for and I'll try to find a suitable XPath expression.
Second Edit
The following formula works fine for me:
=ImportXML("http://www.bloomberg.com/quote/7731:JP","//table[@class='key_stat_data']//tr[7]/td")
Resulting cell value:
641,807.15
The XPath expression now looks for a particular table (since there are only 3 tables in the HTML and all of them have unique class attribute values).
EDIT
The reason why your initial path expression does not work is that it contains tbody; see this excellent answer for more information. Credit for this goes to @JensErat.
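The tbody pitfall can be reproduced offline. Browsers add a tbody element to the DOM they display even when the raw HTML has none, so a path copied from an inspector may not match the source that an importer sees. The sketch below uses Python's standard library and a made-up two-row table rather than the real Bloomberg page.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup as a server might send it: rows sit directly under <table>.
raw = ET.fromstring(
    "<div><table class='key_stat_data'>"
    "<tr><td>first row</td></tr>"
    "<tr><td>641,807.15</td></tr>"
    "</table></div>"
)

# Path in the style a browser inspector produces, including the inserted <tbody>.
with_tbody = raw.findall(".//table/tbody/tr[2]/td")

# Path that skips the intermediate level with '//'.
without_tbody = raw.findall(".//table[@class='key_stat_data']//tr[2]/td")

tbody_hits = len(with_tbody)
cell_values = [td.text for td in without_tbody]
print(tbody_hits, cell_values)
```

The first path finds nothing because the raw markup has no tbody, while the second one reaches the cell either way.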

How to do SQL IN like query in hibernate search

A simplified scenario:
Search for books whose content contains "success" AND whose author is in a list of passed names (possibly thousands).
I looked into filter:
http://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#query-filter
It seems that Hibernate Search has no native support for this.
What is the recommended approach for this problem? I think I am not alone.
Thanks for any input.
Let me post my current solution.
Get the search results for the keywords with minimal projections, then loop through the results and keep only those that match the IN list.
I am not using filter.
Open to other alternatives once convinced.
If you look at http://lucene.apache.org/java/2_4_1/queryparsersyntax.html (the "Field Grouping" section at the end), you can write a query like this:
content:success AND author:("firstname" "secondname" "thirdname" ...)
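Since the list of names is passed in at runtime, the query string would have to be generated. A minimal sketch of that step, assuming the field names content and author from the example and assuming the query parser's default operator is OR, so that the grouped names behave like an IN list; the function name is hypothetical.

```python
def author_in_query(keyword, authors):
    """Build 'content:<keyword> AND author:("a" "b" ...)' using Lucene field grouping."""
    # Escape backslashes and double quotes so each name stays one quoted phrase.
    quoted = " ".join(
        '"{}"'.format(name.replace("\\", "\\\\").replace('"', '\\"'))
        for name in authors
    )
    return "content:{} AND author:({})".format(keyword, quoted)


query = author_in_query("success", ["firstname", "secondname", "thirdname"])
print(query)
```

The resulting string can then be handed to whatever query parser the application already uses.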
