XPath query for importing web content - xpath

Can anyone please suggest how to import '500112' and 'SBIN' from https://www.moneycontrol.com/india/stockpricequote/banks-public-sector/statebankindia/SBI in Google spreadsheet using importData or importXML functions?

Try using
=IMPORTXML(A1,"//ctag[#class='mob-hide']//span") #where A1 is the url
this should get you both.
adding, for example:
=IMPORTXML(A1,"//ctag[#class='mob-hide']//span[1]")
at the end should output just
500112
Edit:
Since the question was asked, the site started using dynamically loaded data which GS can't handle. Using the tools in your browser's Developer tab, you can find out that the target data is loaded from a different site (see below) and that it is in json format.
So you need to use GS's importJSON() function for that:
A1 = https://priceapi.moneycontrol.com/pricefeed/bse/equitycash/SBI
A2 =importJSON(A1)
Make sure there's enough space on the sheet to expand the output. Once you do, you'll find the two target items under the columns Data Bseid and Data Nseid, probably in columns AL and AM.

Related

Google Sheets ImportXML issues

I have a google sheet that I'm trying to automate as much as possible for my WoW Raid group. What I'm trying to do here is parse some data from WoW's armory to automatically pull a persons item level.
I am having issues pulling from WoW's website directly (https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy), but I can pull the item level from another site (https://raider.io/characters/us/sargeras/beansy). The only difference I can spot is that one site I can pull from a [ /div/span/b clas"text-white" ] and from WoW the information is directly in [ /div/class="media-text" ]
WOW Formula =IMPORTXML(C32,"//*[#id='character-profile-mount']/div/div/div[2]/div/div[1]/div[1]/div/div[2]/div[1]/a[1]/div/div[2]")
Raider IO Formula =IMPORTXML(C31,"//*[#id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b")
WOW Inspect Element <div class="Media-text">184 ilvl</div>
Raider IO Inspect Element <b class="text-white">184</b>
Above are the respective formula's and elements I've used. Raider IO's pulls properly and outputs 184 as it's information. However WoW's does not pull properly and outputs N/A Google Sheets Output Screencap
Does anyone have any ideas on why this might be happening?
Thanks in advance!
I think that the https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy prepares the values using Javascript. For example, when the HTML without using Javascript is retrieved from this URL, Media-text cannot be found in the retrieved HTML. On the other hand, https://raider.io/characters/us/sargeras/beansy has the values in the HTML without using Javascript. I thought that the difference is due to this.
But in order to retrieve the value of 184 from URL of the former, when I saw the HTML without using Javascript, I noticed that the value is included in the metadata. So when the value of 184 is retrieved from the metadata, the sample formula is as follows.
Sample formula:
=REGEXEXTRACT(IMPORTXML(A1,"//meta[#name='description']/#content"),"(\d+) ilvl")
In this formula, the URL of https://worldofwarcraft.com/en-us/character/us/sargeras/Beansy is put to the cell "A1".
Result:
Also, as additional modification, about your =IMPORTXML(URL,"//*[#id='content']/div/div/div/div[2]/div[1]/div[1]/section/div/div[1]/div/span/b"), in this case, the xpath might be able to be modified simple a little as follows.
Modified formula:
=IMPORTXML(A1,"//span[contains(text(),'Item Level')]/b[#class='text-white']")
In this formula, the URL of https://raider.io/characters/us/sargeras/beansy is put to the cell "A1".
Result:
References:
IMPORTXML
REGEXEXTRACT

Morningstar xpath return empty in Google Sheet (Imported content is empty) [duplicate]

This question already has answers here:
Scraping data to Google Sheets from a website that uses JavaScript
(2 answers)
Closed last month.
I am trying to pull a number from the Morningstar "Cash Flow" page an arbitrary stock ticker using XPath. I have the tested the XPath on the morningstar website by an XPath tester and it returned desired values. However, when I want to use this value in a google sheet, it returns #N/A (Imported content is empty.).
=IMPORTXML("http://financials.morningstar.com/cash-flow/cf.html?t=fb&region=usa&culture=en-US", "//div[#id='data_tts1']/div")
I did a bit of research on this and find out that data in such websites generated dynamically and downloads the content in stages, Therefore, page needs to be loaded first to be able to pull any data out of it!
I'm wondering if there is any solution to this issue?
You help would much be appreciated.
it's empty as it should be because the content you are trying to scrape is of JavaScript origin. Google Sheets does not support imports of JS elements. you can always test this by disabling JS for a given site and only what's left can be scraped:
It might be possible. But you have to prepare a custom sheet to extract the data. Use IMPORTDATA to parse the .json which contains the data :
http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=672024&callback=jsonp1585016592836&_=1585016593002
AFAIK, you couldn't import directly the .csv version (specific headers needed, so curl or other specific tools would be required).
http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t=XNAS:FB&region=usa&culture=en-US&cur=&reportType=cf&period=12&dataType=A&order=asc&columnYear=5&curYearPart=1st5year&rounding=3&view=raw&r=764423&denominatorView=raw&number=3
Since this .json is very special (contains html tags), i don't think a custom script for GoogleSheets could import it correctly. So once the .json is loaded in GoogleSheets, TRANSPOSE the rows to columns and use formulas to locate your data (target the cells which contain data_s1 and data_s2 for example). Use CONCAT to merge the cells of interest. Then split the result into columns (use a custom separator). SEARCH for the data you want and clean the results with SUBSTITUTE. The method is dirty but i think it could be automated for the whole process.

Importing book names from goodreads.com into Google Sheets with ImportXML gives "Import Internal Error" sometimes

I have a formula that fetches names of books from goodreads.com:
=IMPORTXML("https://www.goodreads.com/book/show/" & gr_id; "//*[#id='bookTitle']")
where gr_id is a column containing ids of the books. For example when gr_id=23848607, it fetches from URL https://www.goodreads.com/book/show/23848607 and the result is "Warheart".
The formula worked fine some time ago. I did not change anything and now I noticed it stopped working for some of the books (still working for others). Instead of the name of the book now it gives N/A with "Import Internal Error" hint. The ids that do not work are:
48332548
35906922
How to make it work for all books?
There were many questions posted about "Import Internal Error" problems. I tried some solutions including copying the formula to a fresh sheet, but it did not work.
Update: I tried the following different XPath formulas instead of "//*[#id='bookTitle']".
"//h1[#id='bookTitle']"
"//h1"
Those different XPath formulas worked the same as the original XPath formula. They worked correctly for the same ids that the original one did and produced N/As for the same ids that the original one did.
Update: I just re-checked and all my formulas worked correctly for all gr_ids (I had not changed anything since the time when they did not work.) May be someone knows how to prevent them from stopping working in the future.
Update: I re-checked once again. Of all gr_ids only this one was showing N\A now: 35906922. I created an example spreadsheet, because my working spreadsheet contains too many unrelated details, but the problem did not appear in the example spreadsheet. I went back to my working spreadsheet and reloaded it - and the problem disappeared in my working spreadsheet too. Then I added more test data in the example spreadsheet and the following new example gr_ids showed N\A:
48213012
48213092
I tried to make a copy of the example spreadsheet to see if it fixes the problem. The behavior in the copy example spreadsheet was identical to the original example spreadsheet - the problem only with two gr_ids specified above.
if you run full IMPORTXML on those two IDs you can see it won't return anything at all:
=IMPORTXML("https://www.goodreads.com/book/show/48213012-fathers-and-sons", "//*")
which means that Google Sheets can't reach the XML content for some reason (could be something similar to https://stackoverflow.com/a/24891676/5632629)
therefore we can try to read the source code directly with IMPORTDATA where we can find around 70 elements with the same information so we pick one, isolate it and remove HTML tags. then we just wrap the prior formula in IFERROR and force the formula to take a 2nd look if it fails first time. the result is like this:
=IFERROR(IMPORTXML("https://www.goodreads.com/book/show/"&A:A, "//*[#id='bookTitle']"),
REGEXEXTRACT(QUERY(ARRAY_CONSTRAIN(
IMPORTDATA("https://www.goodreads.com/book/show/"&A:A), 100, 1),
"select Col1 where Col1 contains '</title>'"), ">(.*) by"))
IMPORTXML() seems to be unreliable. I decided not to use it, because I did not find an acceptable solution to my problem. Instead of using IMPORTXML() I exported my books from goodreads.com to csv file (there is such a feature of goodreads.com) and then imported the csv file into my spreadsheet. This is not be an perfect solution, because I need to re-import every time I need to update the books, but at least it works.

Google spreadsheet ImportXML Error:"the XPath query did not return any data"

I continue to get this error when I try to run this XPath query
//div[#iti='0']
on this link (flight search from google)
https://www.google.com/flights/#search;f=LGW;t=JFK;d=2014-05-22;r=2014-05-26
I get something like this:
=ImportXML("https://www.google.fr/flights/#search;f=jfk;t=lgw;d=2014-02-22;r=2014-02-26";"//div[#iti='0']")
I verified and the XPath is correct (I get the answer wanted using XPath helper, the answer wanted are the data relative to the first flight selected).
I guess that it is a problem of syntax, but I tried more or less all the combinations of lower/uppercase, punctuation (replacing ; , ' ") and I tried to link the URI and the XPath query stored in cells, but nothing works.
Any help will be appreciated.
As a matter of fact, maybe it is a bug on the new google sheets or they have changed how the function works. I've activated mine and when I try to use the ImportXML it simply wont work. Since I have some old sheets here (on the old mechanism) they still work normally. If I copy and paste the script from the old to the new one it simply doesn't get any data.
Here a example:
=ImportXML("http://www.nytimes.com/pages/todayspaper/index.html";"//div[#class='columnGroup first']//h3")
If I run this on the old mechanism it works fine, but if I run the same on the new mechanism, first it will exchange my ";" for a "," and then it will bring a "#N/A" with a warning of "Error: Imported XML content cannot be parsed".
Edit (05/05/2015):
I am happy to say that I tested this function again today on the new spreadsheets and they've fixed it. I was checking that every two months and now finally they have solved this issue. The example I've added above is now returning information.
I'm sorry, but you won't be able to easily parse Google result pages. The reason your function throws an error is because the content of the page you see in your browser is generated by javascript, and Google spreadsheet doesn't execute js.
Your ImportXML has the right syntax, it doesn't return anything because the node you're looking for isn't there (importXML Parse Error).
You will have to find another source if you want these result in your spreadsheet. For info some libraries already parse the usual result page (http://www.seerinteractive.com/blog/google-scraper-in-google-docs-update for example, if it still works), but I doubt finding one for your special case will be easy.
This gives the answer (importXML Parse Error), but it's not entirely obvious.
ImportXML doesn't load Javascript. When you're building ImportXML queries on Google results, make sure you're testing against a version of the page that has Javascript turned off. You can do this using the Chrome DevTools.
(But I agree that ImportXML is fickle, idiosyncratic, and generally rage-inducing).

Import data from URL

The St. Louis Federal Reserve Bank has a great set of data available on a variety of their web pages, such as:
http://research.stlouisfed.org/fred2/series/OILPRICE/downloaddata?cid=32217
http://www.federalreserve.gov/releases/h10/summary/default.htm
http://research.stlouisfed.org/fred2/series/DGS20
The data sets get updated, some as often as daily. I tend to have an interest in the daily data (see the above settings on the URLS)
I'd like to import these kinds of price or rate data streams (accessible as CSV or Excel files at the above URLs) directly into Mathematica.
I've looked at the documentation on Importing[] but I find scant documentation (actually none) on how to go about something like this.
It looks like I need to navigate to the pages, send some data to select specific files and formats, trigger the download, then access the downloaded data from my own machine. Even better if I could access the data directly from the sites.
I had hoped Wolfram Alpha might make this sort thing easy, but I haven't had any success.
FinancialData[] would seem natural for this sort of thing, but I don't see anyway to do it. Financial data has lots of features, but I don't see a way yo get this sort of thing.
Does anyone have any experience with this or can someone point me in the right direction?
You can Import directly from a URL. For example, the data from federalreserve.gov can be obtained and visualized as follows.
url = "http://www.federalreserve.gov/datadownload/Output.aspx?";
url = url<>"rel=H10&series=a660e724c705cea4b7bd1d1b85789862&lastObs=&";
url = url<>"from=&to=&filetype=csv&label=include&layout=seriescolumn";
data = Import[url, "CSV"];
DateListPlot[data[[7 ;;]], Joined -> True]
I broke up url for convenience, since it's so long. I had to examine the contents of data before I knew exactly how to plot it - a step that is typically necessary. I'm sure that the data from stlouisfed.org can be obtained in a similar way, but it requires the use of an API with key to access it.
As Mark said, you can get the data directly from a URL. Your oil data can be imported from a different URL than you had:
http://research.stlouisfed.org/fred2/data/OILPRICE.txt
With that URL, you can do this:
oil = Import["http://research.stlouisfed.org/fred2/data/OILPRICE.txt",
"Table", "HeaderLines" -> 12, "DateStringFormat" -> {"Year", "Month", "Day"}];
DateListPlot[oil, Joined -> True, PlotRange -> All]
Note that "HeaderLines"->12 option strips off the header text in the first 12 lines (you have to count the header lines to know how many to remove). I've also specified the date format.
To find that URL, do as you did before, but click on a data series and then choose View Data from the menu on the left when you see the chart.
The documentation has a short example on extracting data out of a webpage:
http://reference.wolfram.com/mathematica/howto/CleanUpDataImportedFromAWebsite.html
Of course, what actually needs to be done will vary significantly from page to page.
discussion on how to do this with your API key here:
http://library.wolfram.com/infocenter/MathSource/7583/
the function is based on the API documentation. I haven't looked at the code for a couple of years and from memory I put it together rather quickly but I have used it regularly for over 2 years without problems. Here is an example for monthly non seasonally adjusted retail sales from early 1992 to now:
wolfram alpha also uses FRED data so you could use that as an alternative to direct import but it is more tricky to get the query right. I prefer to use FRED directly. Also from memory the data is only available on alpha the day after the release, which is not what you would typically want.

Resources