XPath with IMPORTXML in Google Spreadsheet

I'm trying to scrape a web page with IMPORTXML in Google Spreadsheet, reading the content of this page:
http://ddp.usach.cl/procesos-de-seleccion-internos
What I need to do is select the list below "Lista de Procesos" and split it into rows. I went to the page, inspected the element, and copied its XPath:
//*[@id="node-page-442"]/div[1]/div/div/div/p[5]
That led to this formula:
=importxml("http://ddp.usach.cl/node/442";"//*[@id='node-page-442']/div[1]/div/div/div/p[7]/text()")
However, when I try to load it I get the error #N/A:
"Imported content is empty"

One path to get the nodes following the h4 element with the content "Lista de Procesos" is
//article[@id='node-page-442']/div[contains(@class, 'content')]/div[contains(@class, 'field-name-body')]/div[@class='field-items']/div[contains(@class,'field-item')]/h4[contains(text(), 'Lista de Procesos')]/following-sibling::*
The retrieved siblings are complete, but they come back as a flat sequence without any grouping. If you can use XSLT 2.0, you could structure them with xsl:for-each-group and group-starting-with="strong". But this is only one possibility.
The expression can be shortened to:
//h4[contains(text(),'Lista de Procesos')]/following-sibling::*
Maybe this suits your needs better.
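
To make the following-sibling idea concrete outside of Sheets, here is a minimal Python sketch. It is only an illustration: the sample markup is a made-up, simplified stand-in for the real page, and because the standard library's ElementTree does not support the following-sibling axis or contains(), the helper `following_siblings` (a hypothetical name) emulates them by hand. Unlike the XPath, it stops at the first matching h4.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified stand-in for the page body (not the real markup).
SAMPLE = """
<div class="field-item">
  <p>Intro</p>
  <h4>Lista de Procesos</h4>
  <p><strong>Proceso 1</strong> Analista</p>
  <p><strong>Proceso 2</strong> Secretaria</p>
</div>
"""


def following_siblings(root, tag, needle):
    """Return the elements after the first <tag> whose text contains needle.

    Roughly emulates //tag[contains(text(), needle)]/following-sibling::*
    for the first matching element only.
    """
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == tag and needle in (child.text or ""):
                return children[i + 1:]
    return []


root = ET.fromstring(SAMPLE)
# One row per sibling, with nested text (e.g. inside <strong>) flattened.
rows = [
    "".join(el.itertext()).strip()
    for el in following_siblings(root, "h4", "Lista de Procesos")
]
print(rows)
```

Each entry in `rows` corresponds to one sibling node, which is the "complete but unstructured" result described above.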

Related

How to get list of all exchanges with xpath to google sheets

I'm trying to get the list of cryptocurrency exchanges from the second page of CoinGecko's exchange list into my Google Sheet.
The result should look like:
Tokenize
Bibox
Vebitcoin
...
I tried this formula:
=IMPORTXML("https://www.coingecko.com/en/exchanges?page=2", "//*[contains(text(),' exchange')]")
As a result, I get the error:
Imported content is empty.

How about this modified XPath?
Modified XPath:
//span/a[contains(@href,'/en/exchanges/')]
and
//span[@class='pt-2 flex-column']/a[contains(@href,'/en/exchanges/')]
Modified formula:
=IMPORTXML(A1,"//span/a[contains(@href,'/en/exchanges/')]")
In this case, the URL https://www.coingecko.com/en/exchanges?page=2 is put in cell "A1".
Note:
The list of cryptocurrency exchanges can be retrieved with the modified XPath. However, it seems that the values Tokenize, Bibox and Vebitcoin are not included.
Reference:
IMPORTXML

To get the list of all exchanges, you can also use the following formula:
=ARRAYFORMULA(REGEXEXTRACT(QUERY(TRANSPOSE(IMPORTDATA("https://api.coingecko.com/api/v3/search?locale=en&img_path_only=1"));"select * WHERE Col1 starts with ""name""");"name:""(.+)"""))
The output contains roughly 8000 elements.
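
For readers who want to see what that formula is doing, here is a rough Python sketch of the same idea: pull every value stored under a "name" key out of a JSON document. The payload below is invented for illustration (the real API response is much larger and its exact layout is not shown in this thread), and `collect_names` is a hypothetical helper, not part of any library.

```python
import json

# Invented payload for illustration only; the real response is far larger.
SAMPLE = (
    '{"coins": [{"id": "bitcoin", "name": "Bitcoin"}],'
    ' "exchanges": [{"id": "bibox", "name": "Bibox"},'
    ' {"id": "tokenize", "name": "Tokenize"}]}'
)


def collect_names(node):
    """Recursively gather every string stored under a "name" key."""
    names = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "name" and isinstance(value, str):
                names.append(value)
            else:
                names.extend(collect_names(value))
    elif isinstance(node, list):
        for item in node:
            names.extend(collect_names(item))
    return names


names = collect_names(json.loads(SAMPLE))
print(names)
```

Like the formula, this collects every "name" field it finds, so the result is not limited to exchanges.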

ImportXML function in Google Sheets produces error 'Imported content is empty'!

Here is the ImportXML formula I am using:
=IMPORTXML("https://finance.yahoo.com/quote/RY.TO/profile",K6)
Cell K6 contains the following XPath query:
//*[@id="Col1-0-Profile-Proxy"]/section/div[1]/div/div/p[2]/strong[1]
I got the XPath query by using the Copy XPath function in Google Chrome (i.e. after inspecting the element I am interested in).
The element I am interested in is the Sector associated with the Royal Bank (e.g. Financial Services).
Any help would be appreciated. Many thanks!!

The Copy XPath function is handy. However, the query it suggests is usually clumsy and sometimes does not yield the desired result. Here is an alternative approach:
//span[.='Sector']/following-sibling::strong[1]
This selects the span whose text is "Sector" and then its first following strong sibling. You can also append /text() to select the text directly, like this:
=IMPORTXML($A$10;"//span[.='Sector']/following-sibling::strong[1]/text()")
which returns: Financial Services
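
The "find the label, then take the value next to it" pattern can be sketched in Python as well. This is an illustration under assumptions: the markup is a made-up miniature of a profile block (the "Example Industry" value is a placeholder), and since the standard library's ElementTree has no following-sibling axis, the hypothetical helper `value_after_label` walks the siblings manually.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a profile block; values are placeholders.
SAMPLE = """
<section>
  <p>
    <span>Sector</span>: <strong>Financial Services</strong>
    <span>Industry</span>: <strong>Example Industry</strong>
  </p>
</section>
"""


def value_after_label(root, label, label_tag="span", value_tag="strong"):
    """Emulate //span[.='label']/following-sibling::strong[1]."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == label_tag and (child.text or "").strip() == label:
                # Return the first later sibling with the wanted tag.
                for sibling in children[i + 1:]:
                    if sibling.tag == value_tag:
                        return sibling
    return None


root = ET.fromstring(SAMPLE)
sector = value_after_label(root, "Sector").text
print(sector)
```

Anchoring on the visible label rather than on element positions is what makes this kind of query less brittle than a copied absolute path.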

ImportXML xpath to google sheets returning #N/A

I am a beginner at programming in general and at Google Sheets in particular. I've been trying to get this (what seems to me) simple web query working for a while using the importxml() function. I am trying to pull a reference from a citation generation website, where you search for a PubMed ID number (PMID).
The site is https://mickschroeder.com/citation/?q=18515037 where 18515037 is the PMID. This brings up a citation.
Allison MA, Kwan K, Ditomasso D, Wright CM, Criqui MH. The epidemiology of
abdominal aortic diameter. J Vasc Surg. 2008;48(1):121-7.
I used Inspect Element and got the XPath as:
//*[@id="citation_formatted"]/text()
So I have tried
=importxml(ttps://mickschroeder.com/citation/?q=18515037, "//*[@id="citation_formatted"]/text()")
And it returns #N/A or blank. I've tried taking out the * but can't get it working. Do I need to escape the () in text()? Or do I have the XPath totally wrong? I searched for the answer, but I'm so new that I can't apply those concepts.
Thanks for any help you can give.

Extract a specific node from an XML file

I want to extract only the Body node/tag from an XML file using doc.xpath in Ruby.
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts the text of every node.
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But this does not work.
Does anyone have any suggestions on how to extract exactly the Body section of an XML file using doc.xpath in Ruby?

I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
extracts every element named wcm:element that is a child of wcm:root, which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements named Body that are children of wcm:element.
What you need to do is get the value of the wcm:element element whose name attribute is set to Body. You access attributes in XPath by prefixing them with an @ sign, and to express a "where" condition you use [...], a predicate. Your XPath statement needs to be:
/wcm:root/wcm:element[@name = 'Body']
I'm assuming that your XPath execution environment is fine with the namespace prefix (wcm) because you say that your first query returned content.
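
The attribute predicate can be tried quickly in any XPath-capable library. Below is a small sketch in Python rather than Ruby, using only the standard library; the namespace URI and the "Title" element are placeholders made up for the example, so substitute whatever your file actually declares.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; use the one declared in your own file.
NS = {"wcm": "http://example.com/wcm"}

SAMPLE = """
<wcm:root xmlns:wcm="http://example.com/wcm">
  <wcm:element name="Title">Tobacco marketing online</wcm:element>
  <wcm:element name="Body"><p>A new study suggests that tobacco companies may be using online video portals.</p></wcm:element>
</wcm:root>
"""

root = ET.fromstring(SAMPLE)

# Relative to the root element, pick the wcm:element whose name attribute
# equals "Body" (the same predicate as in the XPath above).
body = root.find("wcm:element[@name='Body']", NS)

# Flatten the nested markup (<p>, <a>, ...) into plain text.
body_text = "".join(body.itertext()).strip()
print(body_text)
```

Note that the predicate tests the name attribute, not the element name, which is why the earlier /Body step matched nothing.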

Google Spreadsheet ImportXML error #NA not received any data as a result of XPath queries

I have a problem scraping one website: motoallegro.
I want to get the titles of all ads on this page.
So I set this formula in Google Spreadsheet:
=ImportXML("http://allegro.pl/samochody-149?order=qd&string=Primera+GT&search_scope=automotive&department=automotive";"//header/h2/a/span")
This formula always returns the #N/A error: no data was received as a result of the XPath query.
But if I try to get other data from the same page, for example the H1 text:
=ImportXML("http://allegro.pl/samochody-149?order=qd&string=Primera+GT&search_scope=automotive&department=automotive";"//h1/span")
The result is correct: "Primera GT"
I want to add that the XPath rule //header/h2/a/span IS CORRECT. I tested it with a few Firefox XPath plugins.
Any ideas why the ImportXML formula does not return data even though the XPath rule is correct?

Google seems to strip non-HTML4 tags like <header/> and <section/>. You could use <div id="listing">...</div> to access only the headlines you need.
Try this XPath expression:
//div[@id='listing']//h2/a/span
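
As a quick sanity check of that expression, here is a minimal Python sketch. The markup is a made-up, well-formed miniature of a listing page (not the real Allegro HTML), and it only shows that the descendant step `//h2` reaches the titles without naming the `header` element at all.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a listing page; not the real site markup.
SAMPLE = """
<html><body>
  <h1><span>Primera GT</span></h1>
  <div id="listing">
    <article><header><h2><a href="#1"><span>Nissan Primera GT 2.0</span></a></h2></header></article>
    <article><header><h2><a href="#2"><span>Primera GT P11</span></a></h2></header></article>
  </div>
  <div id="sidebar"><h2><a href="#3"><span>Promoted ad</span></a></h2></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Same query as above; ElementTree needs a leading "." for a relative search.
titles = [
    span.text
    for span in root.findall(".//div[@id='listing']//h2/a/span")
]
print(titles)
```

Because the path is anchored on the div's id and then uses a descendant step, it keeps working whether or not the intermediate header elements survive parsing.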
