Getting an XPath from an XML document - xpath

I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the "WillPay" values from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value for "Number=1" should be 3350 (or something close to this; it changes quite often), and I would like to load all of these values into the Google Spreadsheet document.
I tried to work out the full xpath, and my best attempt was
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon

Try using this xpath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay

An alternative:
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
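
To sanity-check a filter like this outside the spreadsheet, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The sample document is made up: its nesting is only assumed from the element names in the question and answers, and every value except 3350 is invented. ElementTree supports only a subset of XPath, so the attribute is read with .get() rather than a trailing /@WillPay step.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real feed's layout is assumed from the names in the question.
SAMPLE = """
<AOSBS_XML>
  <Meetings>
    <MeetingInfo Venue="S1">
      <Races>
        <RaceInfo RaceNo="1">
          <Pools>
            <PoolInfo Pool="WIN">
              <OddsSet>
                <OddsInfo Number="1" WillPay="3350"/>
                <OddsInfo Number="2" WillPay="780"/>
              </OddsSet>
            </PoolInfo>
            <PoolInfo Pool="PLA">
              <OddsSet>
                <OddsInfo Number="1" WillPay="1200"/>
              </OddsSet>
            </PoolInfo>
          </Pools>
        </RaceInfo>
      </Races>
    </MeetingInfo>
  </Meetings>
</AOSBS_XML>
"""

root = ET.fromstring(SAMPLE)

# Same venue, race and pool filters as in the answers, without the Number
# predicate, so every OddsInfo in the WIN pool is returned.
path = (
    ".//MeetingInfo[@Venue='S1']"
    "//RaceInfo[@RaceNo='1']"
    "//PoolInfo[@Pool='WIN']"
    "//OddsInfo"
)

# Map each Number to its WillPay value (both are attribute strings).
will_pay = {node.get("Number"): node.get("WillPay") for node in root.findall(path)}
print(will_pay)
```

In the spreadsheet, dropping the Number predicate from the first expression should likewise return all the WillPay values for that pool instead of just one.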

Related

Block for xPath request

I need help with an XPath request in importXML. I am absolutely not a pro in the field.
I had a request of this type:
//*[@id="search"]/div[1]/a/@href
which I had taken from the search field on the societe.com page.
The page has since changed. I think the ID would now be input_search, but despite trying a lot of things, I can't get the right expression.
Could you guide me on this problem?
Thank you.
EDIT: Here is how I retrieve the info. CompagnieName is just an example and can be replaced with any company. I think the XPath line is not correct, but I cannot find what to change (a problem with the div, or something else...).
The Xpath you showed works if you search for a company that actually exists.
However, if you want the complete result list you may want to try that URL instead:
https://www.societe.com/cgi-bin/liste?nom=XX
and this XPath:
//*[@id="liste"]/a/@href

Confused about XPath Syntax

Problem Summary:
Hi, I'm trying to learn to use the Scrapy Framework for python (available at https://scrapy.org). I'm following along with a tutorial I found here: https://www.scrapehero.com/scrape-alibaba-using-scrapy/, but I was going to use a different site for practice rather than just copy them on Alibaba. My goal is to get game data from https://www.mlb.com/scores.
So I need to use XPath to tell the spider which parts of the HTML to scrape (I'm about halfway down that tutorial page on the scrapehero site, at the "Construct Xpath selectors for the product list" section). Problem is I'm having a hell of a time figuring out what the syntax should actually be to get the pieces I want. I've been going over XPath examples all morning trying to figure out the right syntax, but I haven't been able to get it.
Background info:
So what I want is- from https://www.mlb.com/scores, I want an xpath() command which will return an array with all the games displayed.
Following along with the tutorial, what I understand about how to do this is that I'd want to inspect the elements on the webpage, determine their class/id, and specify that in the xpath command.
I've tried a lot of variations to get the data but all are returning empty arrays.
I don't really have any training in XPath so I'm not sure if my syntax is just off somewhere or what, but I'd really appreciate any help on getting this command to return the objects I'm looking for. Thanks for taking the time to read this.
Code:
Here are some of the attempts that didn't work:
response.xpath("//div[@class='g5-component--mlb-scores__game-wrapper']")
response.xpath("//div[@class='g5-component]")
response.xpath("//li[@class='mlb-scores__list-item mlb-scores__list-item--game']")
response.xpath("//li[@class='mlb-scores__list-item']")
response.xpath("//div[@!data-game-pk-id > 0]")'
response.xpath("//div[contains(@class, 'g5-component')]")
Expected Results and Actual Results
I want an XPath command that returns an array containing a selector object for each game on the mlb.com/scores page.
So far I've been able to get generic returns that aren't actually what I want (I can get a selector that returns the whole page by just leaving out the predicates, but whenever I try to specify I end up with an empty array).
So for all my attempts I either get the wrong objects or an empty array.
You always need to check the HTML source code (Ctrl+U in a browser) for the data you need. For the MLB page, you'll find that the content you want to parse is loaded dynamically using JavaScript.
You can try using Scrapy-Splash to get the target content from your start_urls, or you can find the direct HTTP request used to get the information you want (using the Network tab of Chrome Developer Tools) and parse the JSON:
https://statsapi.mlb.com/api/v1/schedule?sportId=1,51&date=2019-06-26&gameTypes=E,S,R,A,F,D,L,W&hydrate=team(leaders(showOnPreview(leaderCategories=[homeRuns,runsBattedIn,battingAverage],statGroup=[pitching,hitting]))),linescore(matchup,runners),flags,liveLookin,review,broadcasts(all),decisions,person,probablePitcher,stats,homeRuns,previousPlay,game(content(media(featured,epg),summary),tickets),seriesStatus(useOverride=true)&useLatestGames=false&language=en&leagueId=103,104,420
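
Once you have that JSON response, no XPath is needed. Here is a minimal sketch of pulling the games out of it with Python's json module. The sample payload is hand-written and heavily trimmed, and the field names (dates, games, gamePk, teams, score) are my assumption about the response shape, so check them against what the Network tab actually shows.

```python
import json

# Hand-written, trimmed stand-in for the API response.
# Field names and values are assumptions, not copied from a real response.
SAMPLE = """
{
  "dates": [
    {
      "date": "2019-06-26",
      "games": [
        {"gamePk": 1001,
         "teams": {"away": {"team": {"name": "Away Team A"}, "score": 3},
                   "home": {"team": {"name": "Home Team A"}, "score": 5}}},
        {"gamePk": 1002,
         "teams": {"away": {"team": {"name": "Away Team B"}},
                   "home": {"team": {"name": "Home Team B"}}}}
      ]
    }
  ]
}
"""


def list_games(payload):
    """Flatten the nested payload into one dict per game."""
    games = []
    for day in payload.get("dates", []):
        for game in day.get("games", []):
            away = game["teams"]["away"]
            home = game["teams"]["home"]
            games.append({
                "id": game["gamePk"],
                "date": day.get("date"),
                "away": away["team"]["name"],
                "home": home["team"]["name"],
                # Scores may be absent for games that have not started.
                "away_score": away.get("score"),
                "home_score": home.get("score"),
            })
    return games


games = list_games(json.loads(SAMPLE))
for g in games:
    print(g)
```

In a Scrapy callback the same function could be called as list_games(json.loads(response.text)), which gives you the "array with one item per game" the question asks for.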

xpath and scrapy not getting text into a paragraph with multiple attributes

I am trying to write a web scraper using scrapy and xpath but I am experiencing a frustrating problem.
I need the text in a paragraph which has HTML
<p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>
I might be wrong, but since the p has an id attribute, it should be referable simply using
response.xpath('//p[@id="match-date"]/text()').extract()
Anyway this won't work.
I know a little XPath and I was able to write scrapers in the past, but this one is giving me trouble. I tried many solutions, but none seems to work:
response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id,"match-date")]/text()').extract()
response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()
I also tried using "contains" as stated in many answers, but it did not work either. This might be a stupid mistake I am making... it would be great if someone could help me!
Thank you so much
Maybe match-date is loaded via AJAX/JS. Please disable JavaScript in your browser and then see whether match-date is there or not.
Also, for the sake of simplicity, use CSS selectors instead of XPath.
response.css('#match-date::text').extract()
EDIT:
To get the value of the data-dt attribute, do this:
response.css('#match-date::attr(data-dt)').extract()
Or, with XPath:
response.xpath('//p[@id="match-date"]/@data-dt').extract()
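
A quick way to test the AJAX/JS theory without touching the browser is to look at the raw HTML that Scrapy receives. Below is a minimal sketch using only the standard library's html.parser. The sample markup is hypothetical (in particular, the data-dt value is invented); in a spider you would feed it response.text instead.

```python
from html.parser import HTMLParser


class IdProbe(HTMLParser):
    """Record the attributes and text of the first element with a given id.

    Simplification: nested elements with the same tag name are not tracked,
    which is fine for a leaf element such as a <p>.
    """

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.attrs = None   # attributes of the matching element, once found
        self.text = ""      # text found inside the matching element
        self._inside = False
        self._tag = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.attrs is None and attrs.get("id") == self.target_id:
            self.attrs = attrs
            self._inside = True
            self._tag = tag

    def handle_endtag(self, tag):
        if self._inside and tag == self._tag:
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


# Hypothetical raw markup: the element exists but has no text until a script runs.
RAW_HTML = (
    '<div><p class="list-details__item__date" id="match-date" '
    'data-dt="4,3,2017,15,00"></p></div>'
)

probe = IdProbe("match-date")
probe.feed(RAW_HTML)
print("found:", probe.attrs is not None)
print("text:", repr(probe.text))
print("data-dt:", probe.attrs.get("data-dt") if probe.attrs else None)
```

If the element is found but its text is empty, that would suggest the date is filled in by a script, which is why /text() returns nothing while reading an attribute still works.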

Getting Cell Contents Via XPath for ImportXML()

I am trying to scrape data from https://www.snpedia.com/index.php/Rs7136259 to create an automated database of genomic information using google sheets.
I would like to retrieve the odds ratio contained in a table on the page. I have tried to figure out the XPath, but nothing I do works. I copied as XPath from InspectElement but that's returning a #N/A error. The information I am trying to scrape is the "Odds Ratio".
My current query:
=importxml(J2,"//*div[@id="mw-content-text"]/table/tr[7]/td")
Thanks for your input. I have searched the other links but could not figure it out. Sorry for being so green.
As noted in the comments, *div is not valid XPath. Another problem is that you have double quotes inside of double quotes, which is also invalid.
It looks like this works:
=importxml(J2,"//*[@id='mw-content-text']/table/tr[7]/td")

Google spreadsheet ImportXML Error:"the XPath query did not return any data"

I continue to get this error when I try to run this XPath query
//div[@iti='0']
on this link (flight search from google)
https://www.google.com/flights/#search;f=LGW;t=JFK;d=2014-05-22;r=2014-05-26
I get something like this:
=ImportXML("https://www.google.fr/flights/#search;f=jfk;t=lgw;d=2014-02-22;r=2014-02-26";"//div[@iti='0']")
I verified and the XPath is correct (I get the answer wanted using XPath helper, the answer wanted are the data relative to the first flight selected).
I guess that it is a problem of syntax, but I tried more or less all the combinations of lower/uppercase, punctuation (replacing ; , ' ") and I tried to link the URI and the XPath query stored in cells, but nothing works.
Any help will be appreciated.
As a matter of fact, maybe it is a bug in the new Google Sheets, or they have changed how the function works. I've activated mine, and when I try to use ImportXML it simply won't work. I still have some old sheets here (on the old mechanism) and they work normally. If I copy and paste the formula from an old sheet to a new one, it simply doesn't get any data.
Here is an example:
=ImportXML("http://www.nytimes.com/pages/todayspaper/index.html";"//div[@class='columnGroup first']//h3")
If I run this on the old mechanism it works fine, but if I run the same on the new mechanism, first it will exchange my ";" for a "," and then it will bring a "#N/A" with a warning of "Error: Imported XML content cannot be parsed".
Edit (05/05/2015):
I am happy to say that I tested this function again today on the new spreadsheets and they've fixed it. I was checking that every two months and now finally they have solved this issue. The example I've added above is now returning information.
I'm sorry, but you won't be able to easily parse Google result pages. Your function throws an error because the content of the page you see in your browser is generated by JavaScript, and Google Sheets doesn't execute JavaScript.
Your ImportXML has the right syntax; it doesn't return anything because the node you're looking for isn't there (importXML Parse Error).
You will have to find another source if you want these results in your spreadsheet. For info, some libraries already parse the usual result page (http://www.seerinteractive.com/blog/google-scraper-in-google-docs-update for example, if it still works), but I doubt finding one for your special case will be easy.
This gives the answer (importXML Parse Error), but it's not entirely obvious.
ImportXML doesn't load Javascript. When you're building ImportXML queries on Google results, make sure you're testing against a version of the page that has Javascript turned off. You can do this using the Chrome DevTools.
(But I agree that ImportXML is fickle, idiosyncratic, and generally rage-inducing).
