What am I doing wrong with this Xpath Query? - xpath

I've been having a play with some Xpath queries but I just can't get this one.
Here's the current string: "/html/body/div/div[8]/table/tr/td[2]/a"
It returns the information below, but I need to grab "Australia", which is node 5. I've tried last() and putting an index on the a element, but no luck.
Anyone able to help?

The following seems to work:
/html/body/div/div[8]/table/tr[3]/td[2]/a
You seem to have been on the wrong row. But will the structure always be this static? It may be better to key on something more stable in the page, such as an href containing "country", so the query is more resilient to structure changes.
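
As a rough illustration of the two approaches, here is a minimal Python sketch. The table markup and href values are hypothetical stand-ins, not the real page. It uses the standard library's ElementTree, which supports only a subset of XPath, so the "href contains country" check is done in Python rather than with contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the table in the question.
SAMPLE = """
<table>
  <tr><td>1</td><td><a href="/region/oceania">Oceania</a></td></tr>
  <tr><td>2</td><td><a href="/city/sydney">Sydney</a></td></tr>
  <tr><td>3</td><td><a href="/country/australia">Australia</a></td></tr>
</table>
"""

table = ET.fromstring(SAMPLE)

# Positional: third row, second cell, mirroring tr[3]/td[2]/a.
by_position = table.find("./tr[3]/td[2]/a").text

# Attribute-based: any link whose href mentions "country", whichever row it is in.
by_href = [a.text for a in table.iter("a") if "country" in a.get("href", "")]

print(by_position, by_href)
```

The positional query breaks as soon as a row is added above the target, while the href check keeps working as long as the link format stays the same.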

Related

Getting a xPath from XML document

I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the "WillPay" values from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value of "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values onto the google spreadsheet document.
I tried to work out the full XPath, and my best attempt was
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this XPath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay
An alternative:
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
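
To see how attribute predicates narrow the match step by step, here is a small Python sketch. The XML is a hypothetical fragment shaped after the element names in the question, not the real feed. ElementTree cannot return an attribute directly with /@WillPay, so the attribute is read with .get() after the element is selected.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment that follows the element and attribute names from the question.
FEED = """
<AOSBS_XML>
  <Meetings>
    <MeetingInfo Venue="S1">
      <Races>
        <RaceInfo RaceNo="1">
          <Pools>
            <PoolInfo Pool="WIN">
              <OddsSet>
                <OddsInfo Number="1" WillPay="3350"/>
                <OddsInfo Number="2" WillPay="1200"/>
              </OddsSet>
            </PoolInfo>
            <PoolInfo Pool="PLA">
              <OddsSet>
                <OddsInfo Number="1" WillPay="900"/>
              </OddsSet>
            </PoolInfo>
          </Pools>
        </RaceInfo>
      </Races>
    </MeetingInfo>
  </Meetings>
</AOSBS_XML>
"""

root = ET.fromstring(FEED)

# Each [@name='value'] predicate filters on an attribute of the element before it.
path = ".//MeetingInfo[@Venue='S1']//RaceInfo[@RaceNo='1']//PoolInfo[@Pool='WIN']//OddsInfo"

# Map runner number to WillPay for the WIN pool only.
will_pay = {odds.get("Number"): odds.get("WillPay") for odds in root.findall(path)}

# A single value, equivalent to adding [@Number='1'] and reading @WillPay.
first = root.find(path + "[@Number='1']").get("WillPay")

print(will_pay, first)
```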

Block for xPath request

I need help with an xPath request, in importXML. I am absolutely not a pro in the field.
I had a query of this type:
//*[@id="search"]/div[1]/a/@href
which I had obtained from the search field on the societe.com page.
Since the page changed I have tried many things. I think the ID is now input_search, but despite many attempts I can't get the right expression.
Could you guide me on this problem?
Thank you.
EDIT: Here is how I retrieve the info. CompagnieName is just an example and can be replaced with any company. I think the XPath line is not correct, but I cannot find what to change; it may be a problem with a div or something else...
The Xpath you showed works if you search for a company that actually exists.
However, if you want the complete result list you may want to try that URL instead:
https://www.societe.com/cgi-bin/liste?nom=XX
and this XPath:
//*[@id="liste"]/a/@href

Problem finding correct Selector for response.xpath or response.css in Scrapy on Coinmarketcap

I'd like to loop over the top 20 exchanges on coinmarketcap to crawl the tables, e.g. https://coinmarketcap.com/exchanges/fatbtc/
I have now spent a few hours trying to find the selector, e.g. for Price.
In Scrapy Shell I tried ... and many more, but none of them work:
From the XPath Helper add-on:
response.xpath('/html/body/div[@id='__next']/div[@class='cmc-app-wrapper.cmc-app-wrapper--env-prod.sc-1mezg3x-0.fUoBLh']/div[@class='container.cmc-main-section']/div[@class='cmc-main-section__content']/div[@class='cmc-exchanges.sc-1tluhf0-0.wNRWa']/div[@class='cmc-details-panel-table.sc-3klef5-0.cSzKTI']/div[@class='cmc-markets-listing.lgsxp9-0.eCrwnv']/div[@class='cmc-table.sc-1yv6u5n-0.dNLqEp']/div[@class='cmc-table__table-wrapper-outer']/div/table/tbody/tr[@class='cmc-table-row.sc-1ebpa92-0.kQmhAn'][1]/td[@class='cmc-table__cell.cmc-table__cell--sortable.cmc-table__cell--right.cmc-table__cell--sort-by__price']').getall()
From Chrome Inspector:
response.xpath('/td[@class='cmc-table__cell.cmc-table__cell--sortable.cmc-table__cell--right.cmc-table__cell--sort-by__price']').getall()
From Chrome Inspector's Copy XPath:
response.xpath('//*[@id="__next"]/div/div[2]/div[1]/div[2]/div[2]/div/div[2]/div[3]/div/table/tbody/tr[1]/td[5]').extract()
I'm using the Chrome Inspector and, since today, an add-on called "XPath Helper" to show the selectors, but I still don't really understand what I'm doing there :(. I'd really appreciate any idea of how to access that data, and anything that gives me a better understanding of how to find these selectors.
Pretty easy (I used position() to skip table header):
for row in response.xpath('//table[@id="exchange-markets"]//tr[position() > 1]'):
    price = row.xpath('.//span[@class="price"]/text()').get()
    # price = row.xpath('.//span[@class="price"]/@data-usd').get() #if you need to be more precise
XPaths are basically //tagname[@attribute='value'] patterns matched against the HTML.
For your site, you can loop over names with //table[@id='exchange-markets']//tr/td[2]/a
and get prices with //table[@id='exchange-markets']//tr/td[5]
where we are basically saying to look within the table rows at column 5.
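
The row-by-row idea can be tried offline with a small sketch. This one deliberately uses the standard library's ElementTree instead of Scrapy, and the table content is hypothetical. ElementTree does not support position() > 1, so a Python slice skips the header row instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; only the table id and column layout follow the answers above.
PAGE = """
<html><body>
<table id="exchange-markets">
  <tr><th>Rank</th><th>Name</th><th>Pair</th><th>Volume</th><th>Price</th></tr>
  <tr><td>1</td><td><a href="/currencies/bitcoin/">Bitcoin</a></td><td>BTC/USDT</td><td>100</td>
      <td><span class="price" data-usd="9000.5">$9000.50</span></td></tr>
  <tr><td>2</td><td><a href="/currencies/ethereum/">Ethereum</a></td><td>ETH/USDT</td><td>50</td>
      <td><span class="price" data-usd="250.25">$250.25</span></td></tr>
</table>
</body></html>
"""

root = ET.fromstring(PAGE)

# Select every row of the table by its id, then drop the header row with a slice.
rows = root.findall(".//table[@id='exchange-markets']//tr")[1:]

results = []
for row in rows:
    name = row.find("./td[2]/a").text                           # column 2: name link
    price = row.find(".//span[@class='price']").get("data-usd")  # precise value from the attribute
    results.append((name, price))

print(results)
```

Queries that start from a stable id and then use relative paths inside each row tend to survive layout changes better than the long absolute paths copied from the browser.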

Google sheets importxml weird import - Can't get the correct path to elements

I'm trying to get some data from this website https://etfdb.com/etf/VOO/ with IMPORTXML. Unfortunately, I was not able to scrape a particular element of the page; I only got data from these two functions:
=IMPORTXML("https://etfdb.com/etf/VOO","//*")
=IMPORTXML("https://etfdb.com/etf/VOO","/html")
I tried to see if the browser is only loading data through JS but after disabling it the site loaded correctly, so I don't think JS might be the problem here.
How come after running a simple function like this, I get an error saying the scraped content is empty?
//span[contains(text(),'Tracks This Index:')]/following-sibling::span
EDIT: added spreadsheet with desired output https://docs.google.com/spreadsheets/d/1Zn0fQwenYZo6u4jP0yZ7J-NCzyzRnqabR3CDUz8jP3E/edit?usp=sharing
How about this answer?
Issue:
Unfortunately, the value cannot be retrieved from the HTML data of the URL with the XPath //span[contains(text(),'Tracks This Index:')]/following-sibling::span. For example, even when //span is used, #N/A is returned. The reason for this issue is explained in Rubén's answer.
Workaround:
Here, I would like to propose a workaround. Please think of this as just one of several possible answers. In this workaround, the value you want is extracted from the full text of the body. Although individual tags inside the body cannot be retrieved, //body can be, and fortunately its text includes the value you want. The flow of this workaround is as follows.
Retrieve values from the xpath of //body.
Retrieve the value you want by the regular expression.
Sample formula:
=TEXTJOIN("",TRUE,IFNA(ARRAYFORMULA(TRIM(REGEXEXTRACT(IMPORTXML(A1,"//body"),"Tracks This Index: (\w.+)"))),""))
In this sample, the cell "A1" has the URL of https://etfdb.com/etf/VOO.
After the value of //body was retrieved, the value is retrieved by the regular expression.
The important point of this workaround is the methodology. I think that there are various formulas for retrieving the value. So please think of above sample formula as just one of them.
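
The same two-step flow (take the whole body text, then pull the value out with a regular expression) can be sketched in Python. The body text below is a hypothetical stand-in for what //body returns, and Python's re module is used in place of the spreadsheet functions.

```python
import re

# Hypothetical stand-in for the text returned by IMPORTXML(A1, "//body"), one entry per cell.
body_cells = [
    "VOO Vanguard S&P 500 ETF",
    "Tracks This Index: S&P 500 Index",
    "Expense Ratio: 0.03%",
]

# Same pattern as in the sample formula.
pattern = re.compile(r"Tracks This Index: (\w.+)")

# Cells without a match are skipped (the IFNA step); matches are trimmed and joined (TRIM, TEXTJOIN).
matches = []
for cell in body_cells:
    found = pattern.search(cell)
    if found:
        matches.append(found.group(1).strip())

index_name = "".join(matches)
print(index_name)
```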
Note:
If you use the above formula for another URL, an error might occur. Please be careful about this.
References:
IMPORTXML
REGEXEXTRACT
ARRAYFORMULA
IFNA
TEXTJOIN
If this was not the direction you want, I apologize.
This is a partial answer.
The problem occurs because https://etfdb.com/etf/VOO/ isn't a valid XHTML file.
Some failures:
Use of <hr> instead of <hr/>
Use of <br> instead of <br/>
These failures mean that IMPORTXML can't parse the sibling tags that follow them.
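
The effect of an unclosed <br> on a strict XML parser can be reproduced with a short sketch. It uses Python's ElementTree on a hypothetical fragment. How IMPORTXML parses pages internally is not documented here, so this only illustrates strict XML parsing in general.

```python
import xml.etree.ElementTree as ET

def parses_as_xml(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Hypothetical fragments: the only difference is <br> versus <br/>.
html_style = "<div><span>Tracks This Index:</span><br><span>S&amp;P 500 Index</span></div>"
xhtml_style = "<div><span>Tracks This Index:</span><br/><span>S&amp;P 500 Index</span></div>"

print(parses_as_xml(html_style))   # the unclosed <br> leaves the parser with a mismatched </div>
print(parses_as_xml(xhtml_style))  # the self-closed <br/> keeps the document well-formed
```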

Can't query a node with xpath query

I am having some difficulty querying a node in an XML document. The document is http://ods.od.nih.gov/api/index.aspx?resourcename=BotanicalBackground&readinglevel=Health%20Professional
I am trying to get the text of the first node.
I have tried these queries and none of them seemed to work:
*[name()='ImageURL']
//captionedimage[1]
//Factsheet/RelatedImages/captionedimage[1]/ImageURL/text()
//RelatedImages/*[1]
I'd greatly appreciate any help.
Your last three XPath expressions seem to work (you can quickly check them at http://www.xpathtester.com/test or http://www.freeformatter.com/xpath-tester.html). The problem is probably linked to the environment you use.
When I tried them in Scrapy, the XPaths containing uppercase letters retrieved nothing; only //factsheet/relatedimages/captionedimage[1]/imageurl/text() worked. This behavior surprises me and I have no idea why it happens. But you should definitely try to gather more info on the environment you're using.
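
One possible explanation, offered only as an assumption about the environment: XPath name tests are case-sensitive, and HTML parsers typically lowercase tag names, so a document read as HTML would only match lowercase paths. The sketch below shows both effects with Python's standard library on a hypothetical fragment that reuses the element names from the question. It does not use Scrapy itself.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment using the element names from the question.
DOC = (
    "<Factsheet><RelatedImages><captionedimage>"
    "<ImageURL>http://example.com/a.jpg</ImageURL>"
    "</captionedimage></RelatedImages></Factsheet>"
)

# Parsed as XML, names keep their case, so only the exact-case path matches.
root = ET.fromstring(DOC)
exact = root.find("./RelatedImages/captionedimage[1]/ImageURL")
lowered = root.find("./relatedimages/captionedimage[1]/imageurl")

class TagCollector(HTMLParser):
    """Records start-tag names as the HTML parser reports them."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

# Parsed as HTML, the same markup yields lowercase tag names.
collector = TagCollector()
collector.feed(DOC)

print(exact.text, lowered, collector.tags)
```

If that is what happens here, parsing the response as XML rather than HTML should make the mixed-case paths match again.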
Try this...
./Factsheet/RelatedImages/*[local-name() = 'captionedimage' and position()=1]/*[local-name() = 'ImageURL']
