I'm trying to do web scraping with importXML in a Google Spreadsheet, reading the content of this page:
http://ddp.usach.cl/procesos-de-seleccion-internos
What I need to do is select the list below "Lista de Procesos" and separate it into rows. I went to the page, inspected it, and copied the XPath:
//*[@id="node-page-442"]/div[1]/div/div/div/p[5]
Resulting in this code:
=importxml("http://ddp.usach.cl/node/442";"//*[@id='node-page-442']/div[1]/div/div/div/p[7]/text()")
However, when I try to load it I get an #N/A error:
"Imported content is empty"
One path to get the nodes following the h4 element with the content "Lista de Procesos" is
//article[@id='node-page-442']/div[contains(@class, 'content')]/div[contains(@class, 'field-name-body')]/div[@class='field-items']/div[contains(@class,'field-item')]/h4[contains(text(), 'Lista de Procesos')]/following-sibling::*
The retrieved children are not structured, but complete. If you can use XSLT-2.0, you could structure them by using for-each-group with group-starting-with='strong'. But this is only one possibility.
The expression could be reduced to the simple term:
//h4[contains(text(),'Lista de Procesos')]/following-sibling::*
Maybe this suits your needs better.
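To see what the following-sibling step selects, here is a minimal Python sketch. It rests on two assumptions: the markup is a simplified, made-up stand-in for the real page, and because the standard library's ElementTree supports only a subset of XPath (no following-sibling axis, no contains()), the helper following_siblings is a hypothetical function that emulates the axis by walking the parent's children.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page body (not the real markup).
SAMPLE = """<div>
  <p>Intro</p>
  <h4>Lista de Procesos</h4>
  <p><strong>Proceso A</strong></p>
  <p>Detalle A</p>
  <p><strong>Proceso B</strong></p>
</div>"""

def following_siblings(root, tag, needle):
    """Return the siblings after the first `tag` element whose text contains `needle`."""
    for parent in root.iter():
        children = list(parent)
        for i, child in enumerate(children):
            if child.tag == tag and needle in (child.text or ""):
                return children[i + 1:]
    return []

root = ET.fromstring(SAMPLE)
# One row per sibling, flattened to its text content.
rows = ["".join(el.itertext()).strip()
        for el in following_siblings(root, "h4", "Lista de Procesos")]
print(rows)
```

Each returned element becomes one row, which is roughly how IMPORTXML lays out multiple matches in the sheet.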
Related
I'm trying to get a list of cryptocurrency exchanges from the second page of CoinGecko into my Google Sheet.
The result should look like:
Tokenize
Bibox
Vebitcoin
...
I tried it with:
IMPORTXML("https://www.coingecko.com/en/exchanges?page=2", "//*[contains(text(),' exchange')]")
As a result, I get this error:
Imported content is empty.
How about this modified XPath?
Modified XPath:
//span/a[contains(@href,'/en/exchanges/')]
and
//span[@class='pt-2 flex-column']/a[contains(@href,'/en/exchanges/')]
Modified formula:
=IMPORTXML(A1,"//span/a[contains(@href,'/en/exchanges/')]")
In this case, the URL https://www.coingecko.com/en/exchanges?page=2 is placed in cell A1.
Note:
The list of cryptocurrency exchanges can be retrieved by the modified xpath. But in this case, it seems that the values of Tokenize, Bibox and Vebitcoin are not included.
Reference:
IMPORTXML
To get the list of all exchanges, you can also use the following formula:
=ARRAYFORMULA(REGEXEXTRACT(QUERY(TRANSPOSE(IMPORTDATA("https://api.coingecko.com/api/v3/search?locale=en&img_path_only=1"
));"select * WHERE Col1 starts with ""name""");"name:""(.+)"""))
Output (~ 8000 elements) :
Here is the ImportXML formula I am using:
=IMPORTXML("https://finance.yahoo.com/quote/RY.TO/profile",K6)
Cell K6 contains the following xpath query:
//*[@id="Col1-0-Profile-Proxy"]/section/div[1]/div/div/p[2]/strong[1]
I got the xpath query by using the Copy XPath function in Google Chrome (e.g. after inspecting the element I am interested in).
The element I am interested in is the Sector associated with the Royal Bank (e.g. Financial Services)
Any help would be appreciated. Many thanks!!
Using the Copy XPath function is a handy feature. However, the suggested query is usually clumsy and sometimes does not yield the desired result. Here is an alternative approach:
//span[.='Sector']/following-sibling::strong[1]
Select the span whose text is "Sector", then select the first following strong sibling. We can also select its text() node directly, like this:
=IMPORTXML($A$10;"//span[.='Sector']/following-sibling::strong[1]/text()")
which returns: Financial Services
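The label-based lookup can be tried outside the spreadsheet with a small Python sketch. Everything here is an assumption for illustration: the fragment is made up (including the "Industry" row and its value), and value_after_label is a hypothetical helper. The standard library's ElementTree understands the [.='Sector'] predicate and the ".." parent step but not following-sibling, so the sibling step is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a label/value profile block.
SAMPLE = """<div>
  <p>
    <span>Sector</span>: <strong>Financial Services</strong><br/>
    <span>Industry</span>: <strong>Banks</strong>
  </p>
</div>"""

def value_after_label(root, label):
    """Return the text of the first <strong> that follows <span>label</span>."""
    # ".." steps from each matching span back up to its parent element.
    for parent in root.findall(f".//span[.='{label}']/.."):
        seen_label = False
        for child in parent:
            if child.tag == "span" and "".join(child.itertext()) == label:
                seen_label = True
            elif seen_label and child.tag == "strong":
                return (child.text or "").strip()
    return None

root = ET.fromstring(SAMPLE)
sector = value_after_label(root, "Sector")
print(sector)
```

Anchoring on the visible label rather than on positional steps such as div[1]/div/div/p[2] tends to survive small layout changes on the page.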
I am a beginner to programming in general and Google in particular. I've been trying to get this (what seems to me) simple web query working for a while using the importxml() function. I am trying to pull a reference from a citation generation website, where you search by a PubMed ID number (PMID).
The site is https://mickschroeder.com/citation/?q=18515037 where 18515037 is the PMID. This brings up a citation.
Allison MA, Kwan K, Ditomasso D, Wright CM, Criqui MH. The epidemiology of
abdominal aortic diameter. J Vasc Surg. 2008;48(1):121-7.
I did inspect element and got the XPath as:
//*[@id="citation_formatted"]/text()
So I have tried
=importxml(https://mickschroeder.com/citation/?q=18515037, "//*[@id="citation_formatted"]/text()")
And it returns #N/A or blank. I've tried taking out the * but can't get it working. Do I need to escape the () in the text()? Or do I have the XPath totally wrong? I did a search for the answer, but I figure I'm so new that I can't apply those concepts.
Thanks for any help you can give.
I want to extract only the body node/tag from an XML file using doc.xpath in Ruby
The node to extract from the XML file:
<wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals, such as YouTube, to get around advertising restrictions and market their products to young people.</p>
</wcm:element>
I have tried the following:
page_content = doc.xpath("/wcm:root/wcm:element").inner_text
But this extracts the text of every node.
Then I tried this:
page_content = doc.xpath("/wcm:root/wcm:element/Body")
But that does not work.
Does anyone have any suggestions on how to extract exactly the body section of an XML file using doc.xpath in Ruby?
I'm not 100% certain I've understood what you mean but… let's not let that stop us. You want to get the content of a particular node from the input. Your first XPath statement:
/wcm:root/wcm:element
is extracting every element with name wcm:element that is a child of the wcm:root element which is the root element.
Your second:
/wcm:root/wcm:element/Body
is similar but looks for elements with name Body which are children of the wcm:element.
What you need to do is get the value of the wcm:element element whose name attribute is set to Body. You access attributes in XPath by prefixing them with an @ sign, and to express a where condition you use [...], a predicate. Your XPath statement needs to be:
/wcm:root/wcm:element[@name = 'Body']
I'm assuming that your XPath execution environment is fine with the namespace prefix (wcm), because you say that your first query returned content.
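The attribute predicate can be checked in isolation with a short sketch. It is written in Python with the standard library's ElementTree rather than Ruby, and it makes some assumptions: the namespace URI http://example.com/wcm is a made-up placeholder (the real file declares its own), and the "Title" element and the shortened body text are invented sample data.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI; the real document declares its own for "wcm".
NS = {"wcm": "http://example.com/wcm"}

# Invented sample with two wcm:element nodes that differ only by @name.
SAMPLE = """<wcm:root xmlns:wcm="http://example.com/wcm">
  <wcm:element name="Title">Tobacco marketing online</wcm:element>
  <wcm:element name="Body"><p>A new study suggests that <a href="ssNODELINK/SmokingAndCancer">tobacco</a> companies may be using online video portals.</p></wcm:element>
</wcm:root>"""

root = ET.fromstring(SAMPLE)

# The path is relative to the root element, so this corresponds to
# /wcm:root/wcm:element[@name = 'Body'] in the answer above.
body = root.find("wcm:element[@name='Body']", NS)
body_text = "".join(body.itertext()).strip()
print(body_text)
```

Without the predicate, the same path matches both elements, which is the behavior the first query in the question ran into.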
I have a problem with scraping one website, motoallegro.
I want to get the titles of all ads on this page.
So I set this formula in a Google spreadsheet:
=ImportXML("http://allegro.pl/samochody-149?order=qd&string=Primera+GT&search_scope=automotive&department=automotive";"//header/h2/a/span")
This formula always returns an #N/A error: no data was received as a result of the XPath query.
But if I try to get other data from the same page, for example H1 text:
=ImportXML("http://allegro.pl/samochody-149?order=qd&string=Primera+GT&search_scope=automotive&department=automotive";"//h1/span")
The result is correct: "Primera GT"
I want to add that the XPath rule //header/h2/a/span IS CORRECT. I tested it with a few Firefox XPath plugins.
Any ideas why the Google spreadsheet formula ImportXML does not return the correct data with a correct XPath rule?
Google seems to strip non-HTML4-tags like <header/> and <section/>. You could use <div id="listing">...</div> for accessing only the headlines you need.
Try this XPath expression:
//div[@id='listing']//h2/a/span
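As a rough illustration of why this path is more robust, here is a Python sketch using the standard library's ElementTree. The markup is a made-up, well-formed stand-in for the listing page (the ad titles and the sidebar block are invented). The point is that the expression never names the header element, so it keeps matching whether or not that wrapper is present.

```python
import xml.etree.ElementTree as ET

# Invented stand-in for the listing page; only the structure matters here.
SAMPLE = """<html><body>
  <h1><span>Primera GT</span></h1>
  <div id="listing">
    <article><header><h2><a href="#1"><span>Nissan Primera GT 2.0</span></a></h2></header></article>
    <article><header><h2><a href="#2"><span>Primera GT P11</span></a></h2></header></article>
  </div>
  <div id="sidebar"><h2><a href="#3"><span>Promoted ad</span></a></h2></div>
</body></html>"""

root = ET.fromstring(SAMPLE)

# Scope the search to the listing container, then match h2/a/span at any depth.
titles = [span.text for span in root.findall(".//div[@id='listing']//h2/a/span")]
print(titles)
```

Scoping by the container's id also keeps headlines from other parts of the page, such as the sidebar in this sample, out of the result.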