Import Internal Error during ImportXML with Google Spreadsheet - xpath

I am trying to import some data (Market Capitalization) from the Bloomberg website into my Google spreadsheet, but Google returns an Import Internal Error.
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
I do not know what causes this problem. In the past I worked around it by adjusting the XPath query, but this time I couldn't find a query that works.
Does anybody know the reason for this error, or how I can make it work?

I am not familiar with Google Spreadsheet, but I think there is simply a superfluous closing parenthesis in your code.
Replace
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP"),"//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
with
=INDEX(ImportXml("http://www.bloomberg.com/quote/7731:JP","//*[@id='quote_main_panel']/div[1]/div[1]/div[3]/table/tbody/tr[7]/td"),1,1)
Also, are you sure it's ImportXml and not ImportXML?
If this does not solve your problem, you have to explain what exactly you are looking for in the HTML.
Edit
Applying the XPath expression you show to the HTML source, I get the following result:
<td xmlns="http://www.w3.org/1999/xhtml" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" class="company_stat">641,807.15</td>
Is this what you would have expected? If yes, then XPath is not at fault and the problem lies somewhere else. If not, then please describe what you are looking for and I'll try to find a suitable XPath expression.
Second Edit
The following formula works fine for me:
=ImportXML("http://www.bloomberg.com/quote/7731:JP","//table[@class='key_stat_data']//tr[7]/td")
Resulting cell value:
641,807.15
The XPath expression now looks for a particular table (since there are only 3 tables in the HTML and all of them have unique class attribute values).
EDIT
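The reason why your initial path expression does not work is that it contains tbody. Browsers typically insert a tbody element into the DOM even when the HTML source has none, so a path copied from the browser may not match the source that ImportXML fetches. See this excellent answer for more information. Credit for this goes to @JensErat.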
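The tbody point can be reproduced outside Google Sheets. The following is a minimal sketch in Python, assuming a simplified, hypothetical stand-in for the page source (the "Volume" row is invented, and the real page has more rows); it only illustrates how a path with a tbody step behaves when the markup has no tbody.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page source.
# Note that the markup contains no <tbody> element.
SOURCE = """
<div id="quote_main_panel">
  <table class="key_stat_data">
    <tr><th>Volume</th><td class="company_stat">1,000</td></tr>
    <tr><th>Market Cap</th><td class="company_stat">641,807.15</td></tr>
  </table>
</div>
"""

root = ET.fromstring(SOURCE)

# Path in the style copied from browser dev tools: it has a tbody step
# that does not exist in the source, so nothing matches.
with_tbody = root.findall(".//table[@class='key_stat_data']/tbody/tr[2]/td")

# Same lookup with the tbody step replaced by the descendant axis.
without_tbody = root.findall(".//table[@class='key_stat_data']//tr[2]/td")

print(len(with_tbody))                    # 0
print([td.text for td in without_tbody])  # ['641,807.15']
```

Using `//tr` instead of `/tbody/tr` makes the expression work whether or not a tbody is present.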

Related

Getting a xPath from XML document

I am trying to get some values from an online XML document, but I cannot find the right xpath to navigate to those values. I want to import these values into a Google Spreadsheet document, which requires me to get the exact xpath.
The website is this one, and I am trying to get the "WillPay" values from MeetingInfo Venue=S1, Races RaceNo=1, Pools PoolInfo Pool=WIN, in OddsInfo.
For now, the value of "Number=1" should be 3350 (or something close to this, it changes quite often), and I would like to load all of these values into the Google Spreadsheet document.
I tried to work out the full XPath, and my best attempt was
"/AOSBS_XML/Meetings/MeetingInfo/Races/Pools/PoolInfo/OddsSet/OddsInfo/@WillPay"
but it doesn't work.
I've been stuck on this problem for months now and I've been avoiding it, but realised I can't anymore because it's hindering my work. Please help.
Thanks!
-Brandon
Try using this XPath expression:
//MeetingInfo[@Venue="S1"]/Races//RaceInfo[@RaceNo="1"]//Pools//PoolInfo[@Pool="WIN"]//OddsSet//OddsInfo[@Number="1"]/@WillPay
An alternative:
//OddsInfo[@WillPay][ancestor::PoolInfo[@Pool='WIN'] and ancestor::RaceInfo[@RaceNo='1'] and ancestor::MeetingInfo[@Venue='S1']]
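To see how the attribute predicates narrow the selection, here is a small Python sketch. The XML below is a hypothetical sample that only mirrors the element and attribute names mentioned in the question; the real feed may be structured differently. Python's standard `xml.etree.ElementTree` supports only a subset of XPath (no trailing `/@WillPay` step and no `ancestor::` axis), so the attribute is read with `.get()` instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; values other than 3350 are invented for illustration.
SAMPLE = """
<AOSBS_XML>
  <Meetings>
    <MeetingInfo Venue="S1">
      <Races>
        <RaceInfo RaceNo="1">
          <Pools>
            <PoolInfo Pool="WIN">
              <OddsSet>
                <OddsInfo Number="1" WillPay="3350"/>
                <OddsInfo Number="2" WillPay="1200"/>
              </OddsSet>
            </PoolInfo>
            <PoolInfo Pool="PLA">
              <OddsSet>
                <OddsInfo Number="1" WillPay="1500"/>
              </OddsSet>
            </PoolInfo>
          </Pools>
        </RaceInfo>
      </Races>
    </MeetingInfo>
  </Meetings>
</AOSBS_XML>
"""

root = ET.fromstring(SAMPLE)

# Each predicate filters on one attribute; // skips intermediate levels.
path = (".//MeetingInfo[@Venue='S1']//RaceInfo[@RaceNo='1']"
        "//PoolInfo[@Pool='WIN']//OddsInfo")

# Map runner number to its WillPay value for the WIN pool only.
will_pay = {odds.get("Number"): odds.get("WillPay") for odds in root.findall(path)}
print(will_pay)  # {'1': '3350', '2': '1200'}
```

Dropping the `[@Number="1"]` predicate, as done here, returns every runner in the pool rather than a single value.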

Google sheets importxml weird import - Can't get the correct path to elements

I'm trying to get some data from the website https://etfdb.com/etf/VOO/ with IMPORTXML. Unfortunately, I was not able to scrape a particular element of the page; the only functions that returned data were these two:
=IMPORTXML("https://etfdb.com/etf/VOO","//*")
=IMPORTXML("https://etfdb.com/etf/VOO","/html")
I tried to see if the browser is only loading data through JS but after disabling it the site loaded correctly, so I don't think JS might be the problem here.
How come after running a simple function like this, I get an error saying the scraped content is empty?
//span[contains(text(),'Tracks This Index:')]/following-sibling::span
EDIT: added spreadsheet with desired output https://docs.google.com/spreadsheets/d/1Zn0fQwenYZo6u4jP0yZ7J-NCzyzRnqabR3CDUz8jP3E/edit?usp=sharing
How about this answer?
Issue:
Unfortunately, the value cannot be retrieved from the HTML of the URL with the XPath //span[contains(text(),'Tracks This Index:')]/following-sibling::span. For example, even //span returns #N/A. The reason for this issue is explained in Rubén's answer.
Workaround:
Here, I would like to propose a workaround. Please think of this as just one of several possible answers. In this workaround, the value you want is extracted from the full text of the body. Although the individual tags inside the body cannot be retrieved, //body can be, and fortunately the value you want is included in the text returned for //body. The flow of this workaround is as follows.
Retrieve values from the xpath of //body.
Retrieve the value you want by the regular expression.
Sample formula:
=TEXTJOIN("",TRUE,IFNA(ARRAYFORMULA(TRIM(REGEXEXTRACT(IMPORTXML(A1,"//body"),"Tracks This Index: (\w.+)"))),""))
In this sample, the cell "A1" has the URL of https://etfdb.com/etf/VOO.
After the text of //body is retrieved, the wanted value is extracted from it with the regular expression.
The important point of this workaround is the methodology. Various formulas could retrieve the value, so please think of the sample formula above as just one of them.
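The same two-step idea (take the body text, then pull the value out with a regular expression) can be sketched in Python. The rows below are a hypothetical stand-in for what IMPORTXML(A1,"//body") returns; the regular expression is the one from the sample formula.

```python
import re

# Hypothetical stand-in for the text chunks returned for //body, one per row.
body_rows = [
    "VOO Vanguard S&P 500 ETF",
    "Tracks This Index: S&P 500 Index ",
    "ETF Database Themes",
]

# Same pattern as in the sample formula.
pattern = re.compile(r"Tracks This Index: (\w.+)")


def extract_index(rows):
    """Keep the captured group from matching rows, trim it, and join the results.

    This mirrors REGEXEXTRACT (capture), IFNA (skip non-matching rows),
    TRIM (strip whitespace), and TEXTJOIN (concatenate).
    """
    hits = []
    for row in rows:
        match = pattern.search(row)
        if match:
            hits.append(match.group(1).strip())
    return "".join(hits)


print(extract_index(body_rows))  # S&P 500 Index
```

As with the formula, the pattern depends on the exact label text, so it will return an empty result on pages that word the label differently.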
Note:
If you use the above formula for another URL, an error might occur. Please be careful about this.
References:
IMPORTXML
REGEXEXTRACT
ARRAYFORMULA
IFNA
TEXTJOIN
If this was not the direction you want, I apologize.
This is a partial answer.
The problem occurs because https://etfdb.com/etf/VOO/ isn't a valid XHTML file.
Some failures:
Use of <hr> instead of <hr/>
Use of <br> instead of <br/>
These failures prevent IMPORTXML from parsing the sibling tags that follow them.
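The effect of an unclosed <br> on a strict XML parser can be shown with a short Python sketch. The markup strings are hypothetical, and the sketch says nothing about how IMPORTXML parses pages internally; it only illustrates why such tags are not well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: identical except for how <br> is written.
html_style = "<div><span>Tracks This Index:</span><br><span>S&amp;P 500 Index</span></div>"
xhtml_style = "<div><span>Tracks This Index:</span><br/><span>S&amp;P 500 Index</span></div>"


def parses_as_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(html_style))   # False: <br> is never closed
print(parses_as_xml(xhtml_style))  # True: <br/> is self-closing
```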

xpath and scrapy not getting text into a paragraph with multiple attributes

I am trying to write a web scraper using scrapy and xpath but I am experiencing a frustrating problem.
I need the text in a paragraph which has HTML
<p class="list-details__item__date" id="match-date">04.03.2017 - 15:00</p>
I might be wrong, but since the p has an id attribute, it should be referable simply using
response.xpath('//p[@id="match-date"]/text()').extract()
Anyway this won't work.
I know a little XPath and I was able to write scrapers in the past, but this one is giving me trouble. I tried many solutions, but none seems to work:
response.xpath('//p[contains(@class, "list-details__item__date") and contains(@id,"match-date")]/text()').extract()
response.xpath('//p[@class="list-details__item__date" and @id="match-date"]/text()').extract()
I also tried using "contains" as stated in many answers, but it did not work as well. This might be a stupid mistake I am doing...it would be great if someone could help me!
Thank you so much
Maybe match-date is loaded via AJAX/JS. Please disable JavaScript in your browser and then check whether match-date is still there.
Also, for the sake of simplicity, use CSS selectors instead of XPath.
response.css('#match-date::text').extract()
EDIT:
To get value of data-dt attribute do this
response.css('#match-date::attr(data-dt)').extract()
OR XPath
response.xpath('//p[@id="match-date"]/@data-dt').extract()
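One way to check that the XPath itself is sound is to run it against the static snippet from the question. The sketch below uses Python's standard library rather than scrapy, and the `data-dt` value is hypothetical since the question does not show it. If the lookup works here but returns nothing in the spider, that supports the suspicion that the element is missing from the downloaded response.

```python
import xml.etree.ElementTree as ET

# Paragraph from the question, wrapped in a div; the data-dt value is hypothetical.
SNIPPET = (
    '<div><p class="list-details__item__date" id="match-date" '
    'data-dt="4,3,2017,15,00">04.03.2017 - 15:00</p></div>'
)

root = ET.fromstring(SNIPPET)

# Selecting by id alone is enough; the extra class attribute does not interfere.
paragraph = root.find(".//p[@id='match-date']")

print(paragraph.text)            # 04.03.2017 - 15:00
print(paragraph.get("data-dt"))  # 4,3,2017,15,00
```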

Getting Cell Contents Via XPath for ImportXML()

I am trying to scrape data from https://www.snpedia.com/index.php/Rs7136259 to create an automated database of genomic information using google sheets.
I would like to retrieve the odds ratio contained in a table on the page. I have tried to figure out the XPath, but nothing I do works. I copied as XPath from InspectElement but that's returning a #N/A error. The information I am trying to scrape is the "Odds Ratio".
My current query:
=importxml(J2,"//*div[@id="mw-content-text"]/table/tr[7]/td")
Thanks for your input. I have searched the other links but could not figure it out. Sorry for being so green.
As noted in the comments, *div is not valid XPath. Another problem is that you have double quotes inside of double quotes, which is also invalid.
It looks like this works:
=importxml(J2,"//*[@id='mw-content-text']/table/tr[7]/td")
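The quoting rule can be made mechanical. The helper names below are hypothetical; the sketch assumes only that a Google Sheets string literal is delimited by double quotes and that an embedded double quote is written as two double quotes. It builds the formula text for a given cell reference and XPath.

```python
def sheets_string(value: str) -> str:
    """Quote a value as a Sheets string literal, doubling embedded double quotes."""
    return '"' + value.replace('"', '""') + '"'


def importxml_formula(url_ref: str, xpath: str) -> str:
    """Build an IMPORTXML formula; url_ref is a cell reference such as J2."""
    return f"=IMPORTXML({url_ref},{sheets_string(xpath)})"


# Single quotes inside the XPath need no escaping.
print(importxml_formula("J2", "//*[@id='mw-content-text']/table/tr[7]/td"))
# =IMPORTXML(J2,"//*[@id='mw-content-text']/table/tr[7]/td")

# Double quotes inside the XPath are doubled so the formula still parses.
print(importxml_formula("J2", '//*[@id="mw-content-text"]/table/tr[7]/td'))
# =IMPORTXML(J2,"//*[@id=""mw-content-text""]/table/tr[7]/td")
```

Using single quotes inside the XPath, as in the working formula above, is usually the simpler choice.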

Import table using IMPORTXML xpath

I am trying to import a table from the following website
http://financials.morningstar.com/valuation/price-ratio.html?t=MOS&region=usa&culture=en-US
using the Google Spreadsheet function ImportXML, but I have problems with the XPath. I found this one for the table I am looking for:
//*[@id="valuation_history_table"]
and I am using this formula: =importXML(A4,"//*[@id="valuation_history_table"]")
but I get the following error msg:
Error
Formula parse error.
Could you please help me?
You're getting a formula parse error because you used double quotes inside your XPath. You also wouldn't be able to pull the table in from that page anyway. If you want a much lighter, less complicated endpoint, use this URL instead (the parameters are basically the same); with it you can use the XPath //tr to get the whole table:
=IMPORTXML("http://financials.morningstar.com/valuate/valuation-history.action?&t=XNYS:MOS&region=usa&culture=en-US&cur=&type=price-earnings","//tr")
