I'm using Google Sheets as a web scraper.
I have been using this IMPORTXML formula:
=importxml(A1, "//div[@class='review-content']//text()")
and these are the results:
Row1: {"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row2: {"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row3: {"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}
but I am having trouble figuring out how to get only the "publishedDate" value.
Example:
Row1: 2019-01-05T22:19:28Z
Row2: 2018-12-10T22:19:28Z
Row3: 2018-12-09T22:19:28Z
Any ideas as to what I may be missing?
How about these 3 samples? I derived them from the sample values in your question. There are several possible answers for your situation, so please think of these as just 3 of them.
They assume that the URL is in cell A1.
Sample 1:
=ARRAYFORMULA(MID(IMPORTXML(A1, "//div[@class='review-content']//text()"),19,20))
When the length of each value is constant, how about this?
The value is retrieved by MID().
Sample 2:
=ARRAYFORMULA(INDEX(SPLIT(IMPORTXML(A1, "//div[@class='review-content']//text()"),"""",TRUE,TRUE),,4))
When the position of each value is constant, how about this?
The value is retrieved by SPLIT() and INDEX().
Sample 3:
=ARRAYFORMULA(REGEXEXTRACT(IMPORTXML(A1, "//div[@class='review-content']//text()"),"publishedDate"":""(\w.+?)"""))
When the pattern of each value is constant, how about this?
The value is retrieved by REGEXEXTRACT().
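To check the logic of the three samples outside of Sheets, here is a minimal Python sketch that mirrors them on the sample rows from the question. The function names are only illustrative, and the Python string and regex calls stand in for MID(), SPLIT()/INDEX() and REGEXEXTRACT(); they are not what Sheets runs internally.

```python
import re

# Sample rows copied from the question.
rows = [
    '{"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}',
    '{"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}',
    '{"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}',
]

def by_position(text):
    # Like MID(text, 19, 20): MID counts from 1, Python slices count from 0.
    return text[18:38]

def by_split(text):
    # Like INDEX(SPLIT(text, """", TRUE, TRUE),, 4):
    # split on the double quote, drop empty pieces, take the 4th piece.
    parts = [part for part in text.split('"') if part]
    return parts[3]

def by_regex(text):
    # Like REGEXEXTRACT with the pattern publishedDate":"(\w.+?)"
    match = re.search(r'publishedDate":"(\w.+?)"', text)
    return match.group(1) if match else None

for row in rows:
    print(by_position(row), by_split(row), by_regex(row))
```

On these rows all three approaches give the same value. They differ in what they assume: a fixed length, a fixed position, or a fixed pattern.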
References:
MID
SPLIT
INDEX
REGEXEXTRACT
If these were not the results you want, I apologize. In that case, in order to correctly replicate your situation, can you provide the URL you are using, as @Rubén says?
I have a sheet where we paste values copied from a pdf into a column, such as:
2715411.0 28.10.2021 600.00
In Google Sheets there are columns with formulas that split these values, one of which is:
=ArrayFormula(INDEX(SPLIT(REGEXREPLACE(C2:C274, "\s", "♥"),"♥"),ROW(C2)-ROW(C2),1))
This formula is returning "2715411" instead of "2715411.0". I've tested the formula with the value "2715411.1" and it works, so I'm assuming it's because the number is being "rounded".
Another thing to take into consideration is that sometimes the number we paste is something like "32434346 28.10.2021 600.00", so always forcing decimal places can't be the answer.
Can anyone help?
Thank you in advance.
=ArrayFormula(SUBSTITUTE(SPLIT(SUBSTITUTE(C2:C274,".","♦")," "),"♦","."))
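The idea behind this formula can be sketched in Python: hide the dots behind a placeholder so the pieces no longer look numeric, split, then put the dots back. The coerce_like_a_sheet helper below is an assumption, a rough stand-in for a spreadsheet converting numeric-looking text into a number and redisplaying it; it is not how Sheets is actually implemented.

```python
def coerce_like_a_sheet(token):
    # Assumption: numeric-looking text becomes a number, so "2715411.0"
    # is shown as "2715411". Anything else is left untouched.
    try:
        number = float(token)
    except ValueError:
        return token
    return str(int(number)) if number.is_integer() else str(number)

def naive_split(line):
    # Split on spaces and let every piece be coerced.
    return [coerce_like_a_sheet(token) for token in line.split(" ")]

def masked_split(line, mask="♦"):
    # Replace "." with a placeholder, split, then restore the dots.
    masked = line.replace(".", mask)
    pieces = [coerce_like_a_sheet(token) for token in masked.split(" ")]
    return [piece.replace(mask, ".") for piece in pieces]

line = "2715411.0 28.10.2021 600.00"
print(naive_split(line))
print(masked_split(line))
print(masked_split("32434346 28.10.2021 600.00"))
```

With the mask in place the trailing ".0" and ".00" survive, and a value without decimals such as 32434346 is left as it was.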
Using the Google Sheets IMPORTXML function, I was able to extract the following data from a URL (in cell A2) using:
=IMPORTXML(A2,"//a/@href[substring-after(., 'AGX:')]")
Data:
/vector/AGX:5WH
/vector/AGX:Z74
/vector/AGX:C52
/vector/AGX:A27
/vector/AGX:C6L
But I want to extract only the code after "/vector/AGX:". The code is not always 3 characters long, and the number of rows is not fixed either.
I used =INDEX(SPLIT(AP2,"/,'vector',':'"),1,2), but it applied to only one line of data. I had to copy the INDEX+SPLIT formula down the whole column and insert an additional column to store the codes.
5WH
Z74
C52
A27
C6L
But I want to be able to extract the code(s) after AGX: using IMPORTXML in one go. Is there a way?
Solution
The issue is in how you are using the INDEX formula. After the range, its first index argument selects the row (in your case each element) and the second selects the column (in your case either AGX or the code after it).
If, instead of a single cell, we apply this formula to a range and leave the row argument empty, the formula returns the values for all rows, which is what you were aiming for. Here is its implementation (where F1:F5 is the range of values you want the formula applied to):
=INDEX(SPLIT(F1:F5,"/,'vector',':'"),,2)
If you are interested in a solution using only IMPORTXML and XPath, according to the documentation you could use a substring function as follows:
=IMPORTXML(A1,"//a/@href[substring-after(.,'AGX:')]")
The drawback is that it returns the full string, not only what comes after AGX:, so you would still need a Google Sheets formula to split it. This is the furthest I have gotten using XPath alone. In XML it would be easier to apply a forEach and select only what comes after the :, but I believe in Sheets this is more complicated, if not impossible, using XPath alone.
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
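For comparison, the splitting step itself is simple once the full strings are available. Here is a minimal Python sketch on the sample data from the question; code_after is a hypothetical helper name, and the sketch only illustrates the "take everything after the marker" step, not the IMPORTXML call.

```python
# Sample values copied from the question.
hrefs = [
    "/vector/AGX:5WH",
    "/vector/AGX:Z74",
    "/vector/AGX:C52",
    "/vector/AGX:A27",
    "/vector/AGX:C6L",
]

def code_after(href, marker="AGX:"):
    # str.partition returns (before, marker, after);
    # the middle part is empty when the marker is not present.
    _, found, after = href.partition(marker)
    return after if found else None

codes = [code_after(href) for href in hrefs]
print(codes)
```

Because it takes everything after the marker, it does not depend on the code being 3 characters long or on the number of rows.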
I am trying to use the Google Sheets IMPORTXML function to get a list of values, but I only need the first 5 values.
How can I do that, please?
My query: =IMPORTXML("https://muagame.vn/may-ps4.html","//h3")
You want to retrieve the values from the URL of https://muagame.vn/may-ps4.html with the xpath of //h3.
When the xpath of //h3 is used, 12 items are retrieved. You want to retrieve the 1st 5 items.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
In this answer, the xpath is modified. Please modify the xpath of =IMPORTXML("https://muagame.vn/may-ps4.html","//h3") as follows.
From:
//h3
To:
//li[position()<=5]/h3
In the HTML data, the h3 tag is inside the li tag. So in order to retrieve the first 5 h3 items, I used li[position()<=5].
Result:
In this case, the formula is =IMPORTXML("https://muagame.vn/may-ps4.html","//li[position()<=5]/h3").
Reference:
position
If I misunderstood your question and this was not the result you want, I apologize.
try:
=QUERY(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), "limit 5")
or:
=ARRAY_CONSTRAIN(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), 5, 1)
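One difference between the answers is worth keeping in mind: in XPath, li[position()<=5] counts positions among the li children of each parent, while QUERY with "limit 5" and ARRAY_CONSTRAIN cut the combined result. The Python sketch below emulates both behaviours on a made-up document with two lists (7 and 3 items). The markup is hypothetical, not the real page, and ElementTree is used only to walk the tree, since it does not evaluate position()<=5 itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two lists, with 7 and 3 items.
first_list = "".join(f"<li><h3>a{i}</h3></li>" for i in range(1, 8))
second_list = "".join(f"<li><h3>b{i}</h3></li>" for i in range(1, 4))
root = ET.fromstring(f"<body><ul>{first_list}</ul><ul>{second_list}</ul></body>")

def first_n_per_parent(root, n):
    # Emulates //li[position()<=n]/h3: the first n li children of EACH parent.
    titles = []
    for parent in root.iter():
        for li in parent.findall("li")[:n]:
            titles.extend(h3.text for h3 in li.findall("h3"))
    return titles

def first_n_overall(root, n):
    # Emulates limiting the combined //h3 result to n rows.
    return [h3.text for h3 in root.iter("h3")][:n]

print(first_n_per_parent(root, 5))
print(first_n_overall(root, 5))
```

If the page has only one list, both give the same 5 items. With several lists, the XPath version can return more than 5.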
I've done this dozens of times, but now I don't get what I'm doing wrong. I want to extract specific records into 2 separate columns (I know that the order will not match), so I use:
//a/@href[contains(.; "github")]
and
//*[contains(text(); "Pricing:")]
But neither of them is working. Where is my mistake?
(my sandbox: https://docs.google.com/spreadsheets/d/11Z3xybq_eYQvjn2-UBOomgeJxFrrsFoXKzF9yZSeASM/edit#gid=1841586203 with LT locale)
Damn, those Google Sheets locales! It must be:
//a/@href[contains(., "github")]
and
//*[contains(text(), "Pricing:")]
I'll keep this for future reference.
I'm currently struggling to find the formula that will resolve my problem.
Here's the status quo:
In Sheet 1, column A, I have a set of strings, such as:
/search.action?gender=men&brand=10177&tag=10203&tag=10336
/search.action?gender=women&brand=11579&tag=10001&tag=10138
/search.action?gender=men&brand=12815&tag=10203&tag=10299
/search.action?gender=women&brand=1396&tag=10203&tag=10513
/search.action?gender=women&brand=11&tag=10001&tag=10073
/search.action?gender=women&brand=1396&tag=10203&tag=10336
/search.action?gender=women&brand=13
In Sheet 2, column A, I have a set of strings such as:
brand=10177
brand=12815
brand=13
brand=1396
brand=11579
Finally, in Sheet 1, column B will be my "filter", with the formula I'm struggling to find. The goal of the formula is to detect, for each string in Sheet 1, whether one of the strings in Sheet 2 is present (as an exact match!). Right now it only finds approximate matches. As you can see, row 5 shouldn't return anything, but with my current formula it does.
Here's the formula:
{=IFERROR(INDEX('Sheet 2'!$A$1:$A$5;MATCH(1;COUNTIF(A1;"*"&'Sheet 2'!$A$1:$A$5&"*");0));"")}
Any idea on the matter?
Please note that I don't want to use VBA or macros, only a formula.
Thanks a lot for your help!
The following should solve your problem, I guess:
=VLOOKUP(MID(A2,FIND("&",A2)+1,FIND("&",A2,FIND("&",A2)+1)-FIND("&",A2)-1),Sheet2!A:A,1,FALSE)
Basically, with the FIND function I identified the start and length of the string between the "&" signs and used that in VLOOKUP.
Another point to mention is that this formula only looks at the text between the first 2 "&" signs.
For completeness, here is another solution based on this answer:
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5,A1)),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This is a bit more general and it doesn't matter how many search tags there are.
However, as it stands, it would match brand=13 in the second sheet with brand=1396 in the first sheet. To avoid that, you could add an ampersand to the search strings:
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This formula throws a #VALUE error if there is no match. To avoid this, you would need to wrap it in IFERROR:
=IFERROR(INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1))),"")
All these are array formulae.
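The ampersand trick can be checked outside the spreadsheet with a short Python sketch on the sample data from the question. find_brand is a hypothetical helper; it returns the first matching brand rather than the one in the highest row as the MAX-based formula does, which gives the same result here because at most one brand matches each string exactly.

```python
# Sample data copied from the question.
urls = [
    "/search.action?gender=men&brand=10177&tag=10203&tag=10336",
    "/search.action?gender=women&brand=11579&tag=10001&tag=10138",
    "/search.action?gender=men&brand=12815&tag=10203&tag=10299",
    "/search.action?gender=women&brand=1396&tag=10203&tag=10513",
    "/search.action?gender=women&brand=11&tag=10001&tag=10073",
    "/search.action?gender=women&brand=1396&tag=10203&tag=10336",
    "/search.action?gender=women&brand=13",
]
brands = ["brand=10177", "brand=12815", "brand=13", "brand=1396", "brand=11579"]

def find_brand(url, brands):
    # Same idea as FIND(brand & "&", url & "&"): the trailing "&" forces the
    # brand to end at a parameter boundary, so brand=13 cannot match brand=1396.
    padded = url + "&"
    for brand in brands:
        if brand + "&" in padded:
            return brand
    return ""  # like the IFERROR(..., "") fallback

for url in urls:
    print(repr(find_brand(url, brands)))
```

Row 5 (brand=11) returns an empty string, and the last row still matches brand=13 because the ampersand is appended to the searched string as well.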