Xpath with contains() in importXML() - xpath

I've done dozens times, but now don't get what I'm doing wrong. I want to extract specific records, into 2 separate columns (I know that order wil not match), so I use:
//a/#href[contains(.; "github")]
and
//*[contains(text(); "Pricing:")]
But non of them is working - where my mistake?
(my sandbox: https://docs.google.com/spreadsheets/d/11Z3xybq_eYQvjn2-UBOomgeJxFrrsFoXKzF9yZSeASM/edit#gid=1841586203 with LT localle)

damn, those google sheet localles!!!... must be:
//a/#href[contains(., "github")]
and
//*[contains(text(), "Pricing:")]
I'll keep for further reference.

Related

Extract substring using importxml and substring-after

Using Google sheet 'ImportXML', I was able to extract the following data from a url(in cell A2) using:
=IMPORTXML(A2,"//a/#href[substring-after(., 'AGX:')]").
Data:
/vector/AGX:5WH
/vector/AGX:Z74
/vector/AGX:C52
/vector/AGX:A27
/vector/AGX:C6L
But, I want to extract the code after "/vector/AGX:". The code is not fixed to 3 letters and number of rows is not fixed as well.
I used =INDEX(SPLIT(AP2,"/,'vector',':'"),1,2). But it applied to only one line of data. Had to copy the index+split function to the whole column and had to insert an additional column to store the codes.
5WH
Z74
C52
A27
C6L
But, I want to be able to extract the code(s) after AGX: using ImportXML in one go. Is there a way?
Solution
Your issue is in how you are implementing the index formula. The first parameter returns the rows (in your case each element) and the second the column (in your case either AGX or the code after that).
If instead of getting a single cell we apply this formula on a range and we do not set any value for the row, the formula will return all the values achieving what you were aiming for. Here is its implementation (where F1:F5 will be the range of values you want this formula to be applied) :
=INDEX(SPLIT(F1:F5,"/,'vector',':'"),,2)
If you are interested in a solution simply using IMPORTXML and XPATH, according to the documentation you could use a substring as follows:
=IMPORTXML(A1,"//a/#href[substring-after(.,'SGX:')]")
The drawback of this is that it will return the full string and not exclusively what is after the SGX: which means that you would need to use a Google sheet formula to splitting this. This is the furthest I have achieved exclusively using XPath. In XML it would be easier to apply a forEach and really select what is after the : but I believe in sheets is more complicated if not impossible just using XPath.
I hope this has helped you. Let me know if you need anything else or if you did not understood something. :)

Google Sheet importxml - How to retrieve only the 5 first values?

I try to use Google Sheet's importxml function to get list of value, but only need first 12 value.
So how can I do it, please?
My query: =IMPORTXML("https://muagame.vn/may-ps4.html","//h3")
You want to retrieve the values from the URL of https://muagame.vn/may-ps4.html with the xpath of //h3.
When the xpath of //h3 is used, 12 items are retrieved. You want to retrieve the 1st 5 items.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
In this answer, the xpath is modified. Please modify the xpath of =IMPORTXML("https://muagame.vn/may-ps4.html","//h3") as follows.
From:
//h3
To:
//li[position()<=5]/h3
In the HTML data, the tag h3 is put in the tag li. So in order to retrieve the 1st 5 items of h3, I used li[position()<=5].
Result:
In this case, the formula is =IMPORTXML("https://muagame.vn/may-ps4.html","//li[position()<=5]/h3").
Reference:
position
If I misunderstood your question and this was not the result you want, I apologize.
try:
=QUERY(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), "limit 5")
or:
=ARRAY_CONSTRAIN(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), 5, 1)

Google Sheets IMPORTXML query

I'm using Google Sheets as web scraper.
I have been using this IMPORTXML
=importxml(A1, "//div[#class='review-content']//text()")
and this is the results
Row1: {"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row2: {"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row3: {"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}
but am having trouble figuring out how to get only the "publishedDate" value.
Example:
Row1: 2019-01-05T22:19:28Z
Row2: 2018-12-10T22:19:28Z
Row3: 2018-12-09T22:19:28Z
Any ideas as to what I may be missing
How about these 3 samples? I thought them from the samples of your question. I think that there are several answers for your situation. So please think of this as 3 samples of them.
It supposes that the URL is put in the cell "A1".
Sample 1:
=ARRAYFORMULA(MID(IMPORTXML(A1, "//div[#class='review-content']//text()"),19,20))
When the length of string of each value is the constant, how about this?
The value is retrieved by MID().
Sample 2:
=ARRAYFORMULA(INDEX(SPLIT(IMPORTXML(A1, "//div[#class='review-content']//text()"),"""",TRUE,TRUE),,4))
When the position of each value is the constant, how about this?
The value is retrieved by SPLIT() and INDEX().
Sample 3:
=ARRAYFORMULA(REGEXEXTRACT(IMPORTXML(A1, "//div[#class='review-content']//text()"),"publishedDate"":""(\w.+?)"""))
When the pattern of each value is the constant, how about this?
The value is retrieved by REGEXEXTRACT().
References:
MID
SPLIT
INDEX
REGEXEXTRACT
If these were not the results you want, I apologize. At that time, in order to correctly replicate your situation, can you provide the URL you are using as #Rubén says?

Return the count of courses using satisfies contains XQuery

I have been struggling to return the count of courses from this XML file that contain "Cross-listed" as their description. The problem I encounter is because I am using for, it iterates and gives me "1 1" instead of "2". When I try using let instead I get 13 which means it counts all without condition even when I point return count($c["Cross-listed"]. What am I doing wrong and how can I fix it? Thanks in advance
for $c in doc("courses.xml")//Department/Course
where some $desc in $c/Description
satisfies contains($desc, "Cross-listed")
return count($c)
The problem I encounter is because I am using for
You are quite correct. You don't need to process items individually in order to count them.
You've made things much too difficult. You want
count(doc("courses.xml")//Department/Course[Description[contains(., "Cross-listed"]])
The key thing here is: you want a count, so call the count() function, and give it an argument which selects the set of things you want to include in the count.

ImportXML Xpath Query Return txt

I need to use Google Spreadsheet ImportXML return a value from this website...
http://www.e-go.com.au/calculatorAPI2?pickuppostcode=2000&pickupsuburb=SYDNEY+CITY&deliverypostcode=4000&deliverysuburb=BRISBANE&type=Carton&width=40&height=35&depth=65&weight=2&items=3
the website simply displays the below in text and code...
error=OK
eta=Overnight
price=64.69
I need to return the values after last line 'price=', being a newbee I'm struggling with xpath query (?) required to make this happens...
=importxml("url",?)
Your help is greatly appreciated.
Thank you in advance.
Regards
first of all, IMPORTXML() won't work because your webpage is not formatted correctly for XML, and google sheets doesn't like it.
All hope is not lost tho, as your output is so simple. you can simply load the whole output using IMPORTDATA() and then process within google sheets
have a look at the output of the following formulae (where the url is stored in A1)
=IMPORTDATA(A1)
=transpose(IMPORTDATA(A1))
=index(IMPORTDATA(A1),3,1) - IF there are always 3 results, and price will always be in the third one this will work
=filter(IMPORTDATA(A1),left(IMPORTDATA(A1),5)="price") - if the price can appear in any of the result lines, but always starting with "price"

Resources