XPath and Scrapy not getting text from a paragraph with multiple attributes

I am trying to write a web scraper using Scrapy and XPath, but I am running into a frustrating problem.
I need the text in a paragraph which has this HTML:
<p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>
I might be wrong, but since the p has an id attribute, it should be referable simply using
response.xpath('//p[@id="match-date"]/text()').extract()
Anyway, this doesn't work.
I know a little XPath and I have written scrapers in the past, but this one is giving me trouble. I tried many solutions, but none of them seems to work:
response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id, "match-date")]/text()').extract()
response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()
I also tried using contains, as suggested in many answers, but that did not work either. This might be a stupid mistake on my part... it would be great if someone could help me!
Thank you so much

Maybe match-date is loaded via AJAX/JS. Disable JavaScript in your browser and check whether match-date is still there.
Also, for the sake of simplicity, use CSS selectors instead of XPath:
response.css('#match-date::text').extract()
EDIT:
To get the value of the data-dt attribute, do this:
response.css('#match-date::attr(data-dt)').extract()
Or with XPath:
response.xpath('//p[@id="match-date"]/@data-dt').extract()
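If it helps to see those selectors in context, here is a minimal sketch of a Scrapy spider using both of them. The start URL is a placeholder, and the data-dt attribute is an assumption taken from the edit above; it only returns something if that attribute is present in the raw HTML rather than injected by JavaScript.

# Minimal sketch, not the asker's actual spider; the start URL is a placeholder.
import scrapy

class MatchDateSpider(scrapy.Spider):
    name = "match_date"
    start_urls = ["https://example.com/some-match-page"]  # placeholder URL

    def parse(self, response):
        # Visible text of the paragraph (empty if it is rendered client-side).
        text = response.css("#match-date::text").get()
        # The data-dt attribute, which may exist in the raw HTML even when
        # the visible text is filled in by JavaScript.
        data_dt = response.css("#match-date::attr(data-dt)").get()
        yield {"match_date_text": text, "match_date_attr": data_dt}

You can also try the same two selectors interactively with scrapy shell before running the spider.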

Related

Getting an XPath from an XML document

I am trying to get some values from an online XML document, but I cannot find the right XPath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact XPath.
The website is this one, and I am trying to get the "WillPay" information from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value for "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values into the Google Spreadsheet document.
What I've tried is locating the XPath for all of it; my best attempt was
/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this XPath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay
An alternative:
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
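If you want to sanity-check that expression before putting it into IMPORTXML, a rough Python/lxml sketch like the following could help; the feed URL is a placeholder (the question only refers to "this one"), and the element names are taken from the expressions above.

import requests
from lxml import etree

# Placeholder URL for the AOSBS XML feed from the question.
xml_bytes = requests.get("https://example.com/AOSBS_XML").content
root = etree.fromstring(xml_bytes)

# Same expression as suggested above.
will_pay = root.xpath(
    '//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]'
    '//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay'
)
print(will_pay)  # e.g. ['3350'] if the expression matches

In Google Sheets, the same expression would then go into the second argument of IMPORTXML(url, xpath_query).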

Stuck on an XPath request

I need help with an XPath request in IMPORTXML. I am absolutely not a pro in this field.
I had a request of this type:
//*[@id="search"]/div[1]/a/@href
which I had picked up from the search field on the societe.com page.
The page has since changed, and I have tried a lot of things; I think the ID is now input_search, but despite everything I tried, I can't get the right code.
Could you guide me on this problem?
Thank you.
EDIT: Here is how I retrieve the info. CompagnieName is just an example and can be replaced with any company. I think the XPath line is not correct, but I cannot find what to change; maybe a problem with the div or something else...
The XPath you showed works if you search for a company that actually exists.
However, if you want the complete result list you may want to try that URL instead:
https://www.societe.com/cgi-bin/liste?nom=XX
and this XPath:
//*[#id="liste"]/a/#href

Is there a specific way to write XPath expressions in RapidMiner for web crawling?

I have tried so many options over many days to try to extract data. I don't know where I am going wrong.
For example, I am on the website reviewcentre.com and am looking at car-selling site reviews.
I am struggling badly to retrieve information; most of my XPaths appear to be incorrect.
Where can I best learn how to do this properly? I have spent days at this.
https://www.reviewcentre.com/car_dealers/we_buy_any_car_-_wwwwebuyanycarcom-review_14068020
I know how to copy XPaths, but when it comes to RapidMiner, I can't extract the data.
I know I am doing it wrong, but unfortunately I don't know what's right.
Examples include:
//*[@id="ReviewTitle-14068020"]
h:html/h:head/h:title/text()
this one works!
//*[#id="ReviewBox-14068020"]/div[1]/div[2]/p[2]/span
It appears I have no problem retrieving the XPath from the website, but using it to extract data in RapidMiner is not working at all. I would really appreciate it if anyone could point me in the right direction.
Obviously, you don't want to use unique IDs in your XPaths.
Make sure you have understood the concept of XML namespaces, too.
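As a small illustration of the first point (outside RapidMiner, in Python/lxml), the per-review number in the id can be replaced with a starts-with() test; whether the ReviewTitle- id is still present on the page is an assumption based on the examples in the question. Inside RapidMiner, the element names would additionally need the h: prefix, as in the h:html/h:head/h:title example above, because pages are converted to namespaced XHTML.

import requests
from lxml import html

# Review page from the question.
url = ("https://www.reviewcentre.com/car_dealers/"
       "we_buy_any_car_-_wwwwebuyanycarcom-review_14068020")
tree = html.fromstring(requests.get(url).content)

# Matches the review title regardless of its numeric suffix.
titles = tree.xpath('//*[starts-with(@id, "ReviewTitle-")]//text()')
print(titles)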

Using wildcards in Selenium IDE

I'm somewhat new to automation, and am learning everything autodidactically, so forgive me if my terminology is a bit off. I've searched high and low for an answer to this question, and I can't seem to find anything. I presume it's my limited vocabulary when it comes to this stuff... anyway...
I'm attempting to write a test that performs all the actions necessary to complete a tutorial by using the recorder. However, for one particular step, the element ID changes. For example, the ID I'm trying to click is this:
//li[@id='message_661119']/div[2]/div[2]/a/img
However, for each new user that is performing the tutorial "quest", the number of the id changes.
Is there any way to get Selenium to recognize, or use, wildcards? Example:
//li[@id='message_******']/div[2]/div[2]/a/img
Of course, the example above does not work.
Any advice would be immensely helpful. Thank you!!
You can use starts-with() for this:
//li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img
It's one of the examples mentioned in Locating Techniques in Selenium's docs for starts-with().
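For what it's worth, here is a short sketch of the same locator used from Selenium WebDriver in Python rather than Selenium IDE; the div[2]/div[2]/a/img tail and the message_ prefix come from the question, and the URL is a placeholder.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("https://example.com/tutorial")  # placeholder URL

# starts-with() matches message_661119, message_661120, and so on.
img = driver.find_element(
    By.XPATH, "//li[starts-with(@id, 'message_')]/div[2]/div[2]/a/img"
)
img.click()
driver.quit()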
In the Target field of the command in Selenium IDE, where you can see message_123123, click on the dropdown list and choose an option related to xpath:idRelative. If that one doesn't work, try other options which do not include that annoying message_123123. This way you'll identify the webpage element by its location rather than its ID. I solved my issue this way.

Problems with Xalan using XPath (unclosed tags)

Greetings,
I'm facing a problem with the following tech-stack: JWebUnit -> HtmlUnit -> Xalan.
I'm trying to find an element by XPath, but the HTML document is pretty malformed.
Xalan stops finding elements when I reach the /body element in the XPath. I believe it's because the document contains two <body> tags, one of which is unclosed.
Everything works for /html/head or /html. But when I try /html/body (or /html/body[1], //body[1], or anything inside those tags) I only get null from Xalan.
Is there any way to get around that? I just can't change the HTML document itself. Thank you kindly for your attention.
Best regards,
Thiago
HtmlUnit must be using something to convert HTML to XML. Perhaps you can tell it to use jsoup or tagsoup, which are very tolerant of messy HTML?
It's also worth writing code to dump the XML tree to a file so you can see what's in it.
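Not the asker's Java stack, but as a quick way to see what a tolerant HTML parser does with a document containing a second, unclosed <body>, here is a Python/lxml sketch: the parser repairs the markup, and //body queries work again on the repaired tree.

from lxml import etree, html

broken = """
<html><head><title>t</title></head>
<body><p>first body
<body><p>second body</p></body>
</html>
"""

tree = html.fromstring(broken)
# Dump the repaired tree so you can see what the parser actually built.
print(etree.tostring(tree, pretty_print=True).decode())
print(tree.xpath("//body//p/text()"))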
