How can I enumerate the XPaths of all elements displayed on a given URL or web page, based on exact-match or unique objects?
Could someone provide a sample or reference?
I'm not sure if this is what you mean, but the XPath
//*
should return all elements on your web page as a list you can iterate over.
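If what you need is the path to each element rather than the elements themselves, here is a minimal sketch of one way to build it. It uses only Python's standard library and assumes the page is well-formed XML/XHTML with no namespaces; a real-world HTML page would usually need a forgiving HTML parser first. The function name and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def enumerate_xpaths(root):
    """Yield (xpath, element) for every element under root, in document order."""
    def walk(elem, path):
        yield path, elem
        seen = {}  # how many children with each tag name we have passed so far
        for child in elem:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            # Positional predicate makes the path point at exactly one element.
            yield from walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    yield from walk(root, f"/{root.tag}")


# Hypothetical sample document, standing in for a fetched page.
SAMPLE = "<html><body><div><p>a</p><p>b</p></div><span>c</span></body></html>"

root = ET.fromstring(SAMPLE)
xpaths = [path for path, _ in enumerate_xpaths(root)]
for path in xpaths:
    print(path)
```

Each path is positional (for example /html/body[1]/div[1]/p[2]), so it identifies a single element but will break as soon as the page layout changes.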
Related
Normally, one would use an XPath query to obtain a certain value or node. In my case, I'm doing some web scraping with Google Sheets, using the IMPORTXML function to update some values automatically. Two examples are given below:
=importxml("http://www.creditagricoledtvm.com.br/";"(//td[@class='xl7825385'])[9]")
=importxml("http://www.bloomberg.com/quote/ELIPCAM:BZ";"(//span)[32]")
The problem is that the pages I'm scraping will change every now and then and I understand very little about XML/XPath, so it takes a lot of trial and error to get to a node. I was wondering if there is any tool I could use to point to an element (either in the page or in its code) that would provide an appropriate query.
For example, in the second case, I noticed the info I wanted was in a span node (hence (//span)), so I printed all of them in a spreadsheet and used the line count to find the [32] index. This takes a long time to load, so it's pretty inconvenient. Also, I don't even remember how I figured out the //td[@class='xl7825385'] query. Hence my question: is there a more practical method of pointing to page elements?
Some clues:
Learning XPath basics is still useful. W3Schools is a good starting point.
https://www.w3schools.com/xml/xpath_intro.asp
Otherwise, the built-in dev tools of your browser can generate an XPath for you. Select an element, right-click on it, then choose Copy > Copy XPath.
https://developers.google.com/web/tools/chrome-devtools/open
Browser extensions like Chropath can generate absolute or relative XPath for you.
https://autonomiq.io/chropath/
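The "print every span and count the lines" step from the question can also be automated. The following is only a sketch, assuming the markup is well-formed enough for Python's standard XML parser; the helper name and the sample markup are hypothetical. It returns the N to use in an expression of the form (//span)[N].

```python
import xml.etree.ElementTree as ET


def position_of(root, tag, needle):
    """Return the 1-based N such that (//tag)[N] is the first element
    whose text content contains needle, or None if nothing matches."""
    # root.iter(tag) walks matching elements in document order,
    # which is the same order (//tag)[N] counts in.
    for position, elem in enumerate(root.iter(tag), start=1):
        if needle in "".join(elem.itertext()):
            return position
    return None


# Hypothetical page fragment; the values are placeholders.
PAGE = (
    "<html><body>"
    "<span>Open</span>"
    "<div><span>Price <b>12.5</b></span></div>"
    "<span>Close</span>"
    "</body></html>"
)

root = ET.fromstring(PAGE)
n = position_of(root, "span", "12.5")
print(f"(//span)[{n}]")
```

This only removes the manual counting; a positional index is still fragile when the page changes, so a query based on an attribute or nearby label is usually more stable when one is available.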
I'm trying to get some data from this website https://etfdb.com/etf/VOO/ with IMPORTXML. Unfortunately, I was not able to scrape a particular element of the page; I only got data from these two functions:
=IMPORTXML("https://etfdb.com/etf/VOO","//*")
=IMPORTXML("https://etfdb.com/etf/VOO","/html")
I tried to see if the browser is only loading data through JS but after disabling it the site loaded correctly, so I don't think JS might be the problem here.
How come after running a simple function like this, I get an error saying the scraped content is empty?
//span[contains(text(),'Tracks This Index:')]/following-sibling::span
EDIT: added spreadsheet with desired output https://docs.google.com/spreadsheets/d/1Zn0fQwenYZo6u4jP0yZ7J-NCzyzRnqabR3CDUz8jP3E/edit?usp=sharing
How about this answer?
Issue:
Unfortunately, the value cannot be retrieved with the XPath //span[contains(text(),'Tracks This Index:')]/following-sibling::span from the HTML data of the URL. For example, even when //span is used, #N/A is returned. The reason for this issue is explained in Rubén's answer.
Workaround:
Here, I would like to propose a workaround. Please think of this as just one of several possible answers. In this workaround, the value you want is extracted from the full text of the body. Although the individual tags inside the body cannot be retrieved, //body can be, and fortunately the value you want is included in the text returned for //body. The flow of this workaround is as follows.
Retrieve values from the xpath of //body.
Retrieve the value you want by the regular expression.
Sample formula:
=TEXTJOIN("",TRUE,IFNA(ARRAYFORMULA(TRIM(REGEXEXTRACT(IMPORTXML(A1,"//body"),"Tracks This Index: (\w.+)"))),""))
In this sample, cell A1 contains the URL https://etfdb.com/etf/VOO.
After the text of //body is retrieved, the value you want is extracted from it with the regular expression.
The important point of this workaround is the methodology. There are various formulas that could retrieve the value, so please think of the sample formula above as just one of them.
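To make the second step easier to experiment with outside the spreadsheet, here is a rough Python equivalent of the REGEXEXTRACT part. It uses the same pattern as the formula; the function name and the body text are invented placeholders, not real data from the site.

```python
import re

# Same pattern as in the sample formula: capture everything after the label
# up to the end of that line ("." does not match a newline by default).
PATTERN = re.compile(r"Tracks This Index: (\w.+)")


def extract_index_name(body_text):
    """Return the text following 'Tracks This Index: ', or '' if absent."""
    match = PATTERN.search(body_text)
    return match.group(1).strip() if match else ""


# Hypothetical stand-in for the text returned by //body.
BODY_TEXT = "Some header\nTracks This Index: Example Index\nSome footer"

print(extract_index_name(BODY_TEXT))
```

Returning an empty string when nothing matches mirrors what IFNA(..., "") does in the formula.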
Note:
If you use the above formula for other URLs, an error might occur. Please be careful about this.
References:
IMPORTXML
REGEXEXTRACT
ARRAYFORMULA
IFNA
TEXTJOIN
If this was not the direction you want, I apologize.
This is a partial answer.
The problem occurs because https://etfdb.com/etf/VOO/ isn't a valid XHTML file.
Some failures:
Use of <hr> instead of <hr/>
Use of <br> instead of <br/>
These failures prevent IMPORTXML from parsing the sibling tags that follow them.
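The effect of an unclosed tag on a strict XML parser can be reproduced locally. This sketch uses Python's standard XML parser purely as an illustration; whether the parser behind IMPORTXML fails in exactly the same way is the assumption made in this answer, not something the snippet proves. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML-style void tag: the parser keeps waiting for a </br> that never comes.
print(parses_as_xml("<div>a<br>b</div>"))
# XHTML-style self-closing tag: well-formed.
print(parses_as_xml("<div>a<br/>b</div>"))
```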
I'm trying to find the most reliable way to determine which element in a resource a given search parameter refers to. So far I process the xpath expression and hope to find a match, but this seems hacky.
Is there a standard or more consistent way to determine what element(s) within a resource a search parameter should use?
Not right now - parsing the XPath is it. This is an active point of discussion in the standards group and will probably result in some modification to the SearchParameter resource (I hope!).
I have two objects on the same page but in different locations (tabs). I want to verify each of those objects separately ...
I can't uniquely identify either of the objects because they have the same properties.
These objects are clearly unique to a point because they have completely different text, which means you will be able to create an object that matches only one of them. My suggestion would be to look for the object by its text property: one of them will always have "Top Ranking"; for the other you will need to turn the text into a regular expression, something like "Participants (\d+)".
I am assuming this next suggestion is unlikely to be possible, so I saved it for last, but the best solution would of course be to get someone with access to give these elements ids for you to search for. In the long term this will be much easier for you to maintain, and not relying on text will allow the test to run in any language.
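As a quick way to check the two text patterns before putting them into the object repository, here is a small Python sketch. It assumes the second label literally reads like "Participants (12)", with real parentheses around the number; in that case the parentheses have to be escaped, because an unescaped ( ) is a capture group in most regex dialects. The function name and the sample labels are hypothetical.

```python
import re

TOP_RANKING = re.compile(r"^Top Ranking$")
# \( and \) match literal parentheses; \d+ matches the changing number.
PARTICIPANTS = re.compile(r"^Participants \(\d+\)$")


def classify(label):
    """Return which of the two objects a visible label belongs to, or None."""
    if TOP_RANKING.match(label):
        return "top_ranking"
    if PARTICIPANTS.match(label):
        return "participants"
    return None


for label in ["Top Ranking", "Participants (12)", "Participants (345)"]:
    print(label, "->", classify(label))
```

If the label turns out not to contain parentheses, the unescaped form from the answer above is the one to use.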
Manaysah, do these objects have different indexes? Use the Object Spy to determine which index each one has; the ordinal identifier index may be a solution to your problem. You could also try adding an innertext object property if possible, using a wildcard for the number inside the (), as it appears to be dynamic.
Try using XPath for the objects; the XPaths will definitely be different.
I am trying to parse one element from a website that is inside of a table. This is the exact xpath expression that I use:
[xpathParser search:@"/table[1]/tr[2]/td[1]"];
However, when I run the program, my string comes up empty. I'm wondering if the site is blocking me from parsing, or whether my expression is correct. If it helps, this is the site, and the piece I am trying to parse is the element Atlantic.
http://cluster.leaguestat.com/download.php?client_code=ahl&file_path=daily-report/daily-report.html
There are several 'Atlantic' sections on the page, so I'm not sure what you mean by "the element Atlantic". Your XPath expression might not be correct, as the tr is not a direct child of table (there is a tbody in between). You might want to try //table/tbody/tr[2]/td[1], and use the XPath Checker Firefox plugin to test expressions.
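The difference an intermediate tbody makes can be seen in a small Python sketch. The markup below is a made-up fragment, not the real page, and it assumes the parsed tree really does contain a tbody element; some parsers only see a tbody if it is present in the source, so it is worth checking what your parser actually produces.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with an explicit tbody between table and tr.
DOC = """<html><body><table><tbody>
<tr><td>Division</td></tr>
<tr><td>Atlantic</td></tr>
</tbody></table></body></html>"""

root = ET.fromstring(DOC)

# "/" only steps to direct children, so skipping tbody finds nothing.
direct = root.findall(".//table/tr[2]/td[1]")
# Including the tbody step reaches the second row's first cell.
via_tbody = root.findall(".//table/tbody/tr[2]/td[1]")

print(len(direct))
print([td.text for td in via_tbody])
```

Note also that the original expression starts with a single "/", which anchors it at the document root, so it could only match if table were the root element; starting with "//" searches the whole document.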