I am trying to use Google Sheets' IMPORTXML function to get a list of values, but I only need the first 12 values.
How can I do that, please?
My query: =IMPORTXML("https://muagame.vn/may-ps4.html","//h3")
You want to retrieve the values from the URL of https://muagame.vn/may-ps4.html with the xpath of //h3.
When the xpath of //h3 is used, 12 items are retrieved. You want to retrieve the 1st 5 items.
If my understanding is correct, how about this answer? Please think of this as just one of several possible answers.
In this answer, the xpath is modified. Please modify the xpath of =IMPORTXML("https://muagame.vn/may-ps4.html","//h3") as follows.
From:
//h3
To:
//li[position()<=5]/h3
In the HTML data, each h3 tag is nested inside an li tag. So in order to retrieve the first 5 h3 items, I used li[position()<=5].
Result:
In this case, the formula is =IMPORTXML("https://muagame.vn/may-ps4.html","//li[position()<=5]/h3").
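One detail that may matter here: in XPath, position() is counted among the sibling li elements of each parent, so //li[position()<=5]/h3 returns the first 5 li of every list on the page, not the first 5 overall. Below is a minimal Python sketch that emulates this selection with the standard library. The markup in SAMPLE and the helper name first_n_li_h3 are made up for illustration; the real page structure is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (NOT the real page): two lists of <li><h3>.
SAMPLE = """
<html><body>
  <ul>
    <li><h3>Item 1</h3></li>
    <li><h3>Item 2</h3></li>
    <li><h3>Item 3</h3></li>
    <li><h3>Item 4</h3></li>
    <li><h3>Item 5</h3></li>
    <li><h3>Item 6</h3></li>
    <li><h3>Item 7</h3></li>
  </ul>
  <ul>
    <li><h3>Other A</h3></li>
    <li><h3>Other B</h3></li>
  </ul>
</body></html>
"""


def first_n_li_h3(root, n):
    """Emulate //li[position()<=n]/h3: positions are counted per parent."""
    results = []
    for parent in root.iter():               # visit every element
        for li in parent.findall("li")[:n]:  # first n <li> children of this parent
            for h3 in li.findall("h3"):      # <h3> directly inside that <li>
                results.append(h3.text)
    return results


root = ET.fromstring(SAMPLE)
titles = first_n_li_h3(root, 5)
print(titles)
```

If the page contains more than one list, QUERY with "limit 5" or ARRAY_CONSTRAIN on the plain //h3 result may be the simpler way to cap the total number of rows.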
Reference:
position
If I misunderstood your question and this was not the result you want, I apologize.
try:
=QUERY(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), "limit 5")
or:
=ARRAY_CONSTRAIN(IMPORTXML("https://muagame.vn/may-ps4.html", "//h3"), 5, 1)
Related
I am trying to get a list of cryptocurrency exchanges from the 2nd page of CoinGecko into my Google Sheet.
The result should look like:
Tokenize
Bibox
Vebitcoin
...
I tried to do it with:
IMPORTXML("https://www.coingecko.com/en/exchanges?page=2", "//*[contains(text(),' exchange')]")
As a result, I get the error:
Imported content is empty.
How about this modified xpath?
Modified xpath:
//span/a[contains(@href,'/en/exchanges/')]
and
//span[@class='pt-2 flex-column']/a[contains(@href,'/en/exchanges/')]
Modified formula:
=IMPORTXML(A1,"//span/a[contains(@href,'/en/exchanges/')]")
In this case, the URL https://www.coingecko.com/en/exchanges?page=2 is put in cell "A1".
Note:
The list of cryptocurrency exchanges can be retrieved by the modified xpath. But in this case, it seems that the values of Tokenize, Bibox and Vebitcoin are not included.
Reference:
IMPORTXML
To get the list of all exchanges, you can also use the following formula:
=ARRAYFORMULA(REGEXEXTRACT(QUERY(TRANSPOSE(IMPORTDATA("https://api.coingecko.com/api/v3/search?locale=en&img_path_only=1"));"select * WHERE Col1 starts with ""name""");"name:""(.+)"""))
Output (~ 8000 elements) :
Using Google Sheets' IMPORTXML, I was able to extract the following data from a URL (in cell A2) using:
=IMPORTXML(A2,"//a/@href[substring-after(., 'AGX:')]").
Data:
/vector/AGX:5WH
/vector/AGX:Z74
/vector/AGX:C52
/vector/AGX:A27
/vector/AGX:C6L
But I want to extract the code after "/vector/AGX:". The code is not fixed at 3 characters, and the number of rows is not fixed either.
I used =INDEX(SPLIT(AP2,"/,'vector',':'"),1,2), but it applies to only one line of data. I had to copy the INDEX+SPLIT formula down the whole column and insert an additional column to store the codes.
5WH
Z74
C52
A27
C6L
But, I want to be able to extract the code(s) after AGX: using ImportXML in one go. Is there a way?
Solution
Your issue is in how you are using the INDEX formula. Its row argument selects the row (in your case, each element) and its column argument selects the column (in your case, either AGX or the code after it).
If you apply the formula to a range instead of a single cell and leave the row argument empty, it returns all the values, which is what you were aiming for. Here is the implementation (where F1:F5 is the range of values the formula is applied to):
=INDEX(SPLIT(F1:F5,"/,'vector',':'"),,2)
If you are interested in a solution simply using IMPORTXML and XPATH, according to the documentation you could use a substring as follows:
=IMPORTXML(A1,"//a/@href[substring-after(.,'SGX:')]")
The drawback is that this returns the full string rather than only the part after SGX:, which means you would still need a Google Sheets formula to split it. This is as far as I have got using XPath alone. In XML it would be easier to apply a forEach and select exactly what comes after the :, but I believe that in Sheets this is more complicated, if not impossible, using XPath alone.
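For reference, here is what the splitting step looks like outside Sheets, as a minimal Python sketch. It mirrors the behaviour of XPath's substring-after (an empty string when the marker is absent). The helper name code_after is hypothetical, and the sample values are the ones from the question.

```python
# Sample hrefs taken from the question.
hrefs = [
    "/vector/AGX:5WH",
    "/vector/AGX:Z74",
    "/vector/AGX:C52",
    "/vector/AGX:A27",
    "/vector/AGX:C6L",
]


def code_after(href, marker="AGX:"):
    """Return the text after the first occurrence of marker, or '' if absent."""
    _, found, tail = href.partition(marker)
    return tail if found else ""


# Works for codes of any length and for any number of rows.
codes = [code_after(h) for h in hrefs]
print(codes)
```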
I hope this has helped you. Let me know if you need anything else or if you did not understand something. :)
I'm using Google Sheets as a web scraper.
I have been using this IMPORTXML formula:
=importxml(A1, "//div[@class='review-content']//text()")
and these are the results:
Row1: {"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row2: {"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row3: {"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}
but am having trouble figuring out how to get only the "publishedDate" value.
Example:
Row1: 2019-01-05T22:19:28Z
Row2: 2018-12-10T22:19:28Z
Row3: 2018-12-09T22:19:28Z
Any ideas as to what I may be missing?
How about these 3 samples? I worked them out from the sample data in your question. I think there are several possible answers for your situation, so please think of these as just 3 of them.
They suppose that the URL is put in cell "A1".
Sample 1:
=ARRAYFORMULA(MID(IMPORTXML(A1, "//div[@class='review-content']//text()"),19,20))
When the string length of each value is constant, how about this?
The value is retrieved by MID().
Sample 2:
=ARRAYFORMULA(INDEX(SPLIT(IMPORTXML(A1, "//div[@class='review-content']//text()"),"""",TRUE,TRUE),,4))
When the position of each value is constant, how about this?
The value is retrieved by SPLIT() and INDEX().
Sample 3:
=ARRAYFORMULA(REGEXEXTRACT(IMPORTXML(A1, "//div[@class='review-content']//text()"),"publishedDate"":""(\w.+?)"""))
When the pattern of each value is constant, how about this?
The value is retrieved by REGEXEXTRACT().
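To make the difference between the three samples concrete, here is a small Python sketch that applies the same three ideas (fixed position, split on the quote character, regular expression) to the rows from the question. The function names are made up for illustration. Note that Python slices are 0-based, so MID(...,19,20) becomes [18:38], and column 4 of the split becomes index 3.

```python
import re

# Rows copied from the question (including the unbalanced quote at the end).
rows = [
    '{"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}',
    '{"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}',
    '{"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}',
]


def by_position(row):
    """Like MID(row, 19, 20): 20 characters starting at the 19th."""
    return row[18:38]


def by_split(row):
    """Like INDEX(SPLIT(row, '"'), , 4): the 4th piece between quotes."""
    return row.split('"')[3]


PATTERN = re.compile(r'publishedDate":"(\w.+?)"')


def by_regex(row):
    """Like REGEXEXTRACT: capture the value that follows publishedDate."""
    match = PATTERN.search(row)
    return match.group(1) if match else None


for row in rows:
    print(by_position(row), by_split(row), by_regex(row))
```

The regular-expression variant is probably the most robust of the three, since it keeps working if the value length or the field order changes.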
References:
MID
SPLIT
INDEX
REGEXEXTRACT
If these were not the results you want, I apologize. In that case, in order to correctly replicate your situation, can you provide the URL you are using, as @Rubén says?
I've done this dozens of times, but now I don't get what I'm doing wrong. I want to extract specific records into 2 separate columns (I know that the order will not match), so I use:
//a/@href[contains(.; "github")]
and
//*[contains(text(); "Pricing:")]
But none of them is working. Where is my mistake?
(my sandbox: https://docs.google.com/spreadsheets/d/11Z3xybq_eYQvjn2-UBOomgeJxFrrsFoXKzF9yZSeASM/edit#gid=1841586203 with the LT locale)
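Damn, those Google Sheets locales! Inside the XPath string the argument separator is always a comma, whatever the sheet's locale, so it must be:
//a/@href[contains(., "github")]
and
//*[contains(text(), "Pricing:")]
I'll keep this for future reference.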
I am looking to write an XPath query to return the full element ID from a partial ID that I have constructed. Does anyone know how I could do this? From the following HTML (I have cut this down to remove work specific content) I am looking to extract f41_txtResponse from putting f41_txt into my query.
<input id="f41_txtResponse" class="GTTextField BGLQSTextField2 txtResponse" value="asdasdadfgasdfg" name="f41_txtResponse" title="" tabindex="21"/>
Cheers
You can use contains to select the element:
//*[contains(@id, 'f41_txt')]
Thanks to Thomas Jung I have been able to figure this out. If I use:
//*[contains(./@id, 'f41_txt')]/@id
This will return just the ID I am looking for.
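As a rough illustration of what that expression selects, here is a minimal Python sketch using only the standard library. ElementTree's limited XPath support has no contains(), so the filter is written as a plain substring test. The wrapping form element, the second input and the helper name ids_containing are assumptions added for the example.

```python
import xml.etree.ElementTree as ET

# The first <input> is from the question; the <form> wrapper and the
# second <input> are hypothetical, added so there is something to filter out.
SAMPLE = (
    '<form>'
    '<input id="f41_txtResponse" class="GTTextField BGLQSTextField2 txtResponse" '
    'value="asdasdadfgasdfg" name="f41_txtResponse" title="" tabindex="21"/>'
    '<input id="f42_other"/>'
    '</form>'
)


def ids_containing(root, fragment):
    """Emulate //*[contains(@id, fragment)]/@id with a substring test."""
    found = []
    for element in root.iter():
        element_id = element.get("id")  # None when the element has no id
        if element_id is not None and fragment in element_id:
            found.append(element_id)
    return found


root = ET.fromstring(SAMPLE)
print(ids_containing(root, "f41_txt"))
```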
I suggest not using the numbers from the ID when you compose XPath expressions from a partial ID. Those numbers represent dynamic elements, and dynamic elements change across deploys and releases of the system under test. The purpose is to identify elements uniquely.
Something like this may be a better option; you get the idea:
//input[contains(@id, '_txtResponse')]/@id
It worked for me like this:
//*[contains(./@id, 'f41_txt')]