Extract substring using importxml and substring-after - xpath

Using Google sheet 'ImportXML', I was able to extract the following data from a url(in cell A2) using:
=IMPORTXML(A2,"//a/#href[substring-after(., 'AGX:')]").
Data:
/vector/AGX:5WH
/vector/AGX:Z74
/vector/AGX:C52
/vector/AGX:A27
/vector/AGX:C6L
But, I want to extract the code after "/vector/AGX:". The code is not fixed to 3 letters and number of rows is not fixed as well.
I used =INDEX(SPLIT(AP2,"/,'vector',':'"),1,2). But it applied to only one line of data. Had to copy the index+split function to the whole column and had to insert an additional column to store the codes.
5WH
Z74
C52
A27
C6L
But, I want to be able to extract the code(s) after AGX: using ImportXML in one go. Is there a way?

Solution
Your issue is in how you are implementing the index formula. The first parameter returns the rows (in your case each element) and the second the column (in your case either AGX or the code after that).
If instead of getting a single cell we apply this formula on a range and we do not set any value for the row, the formula will return all the values achieving what you were aiming for. Here is its implementation (where F1:F5 will be the range of values you want this formula to be applied) :
=INDEX(SPLIT(F1:F5,"/,'vector',':'"),,2)
If you are interested in a solution simply using IMPORTXML and XPATH, according to the documentation you could use a substring as follows:
=IMPORTXML(A1,"//a/#href[substring-after(.,'SGX:')]")
The drawback of this is that it will return the full string and not exclusively what is after the SGX: which means that you would need to use a Google sheet formula to splitting this. This is the furthest I have achieved exclusively using XPath. In XML it would be easier to apply a forEach and really select what is after the : but I believe in sheets is more complicated if not impossible just using XPath.
I hope this has helped you. Let me know if you need anything else or if you did not understood something. :)

Related

What is simplest way to filter blanks or nulls in Tabulator?

I have a requirement to filter out blank cells OR non-blank cells for a given column. I could imagine a custom header formatter function that would add a control in the header to do this, but I would need to write a lot of code for this, right? Both the headerFilter and the function would take significant custom coding. Am I over-thinking this problem? even if there is nothing built in to tabulator, is there an easy way to do it?
It is not clear from your question what you are looking to filter.
If you want to filter out rows if the value of a certain property is empty then you can use a filter function (in this case im using the age field):
table.setFilter(function(data){
return !!data.age;
});

Google Sheets IMPORTXML query

I'm using Google Sheets as web scraper.
I have been using this IMPORTXML
=importxml(A1, "//div[#class='review-content']//text()")
and this is the results
Row1: {"publishedDate":"2019-01-05T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row2: {"publishedDate":"2018-12-10T22:19:28Z","updatedDate":"null","reportedDate":"null}
Row3: {"publishedDate":"2018-12-09T22:19:28Z","updatedDate":"null","reportedDate":"null}
but am having trouble figuring out how to get only the "publishedDate" value.
Example:
Row1: 2019-01-05T22:19:28Z
Row2: 2018-12-10T22:19:28Z
Row3: 2018-12-09T22:19:28Z
Any ideas as to what I may be missing
How about these 3 samples? I thought them from the samples of your question. I think that there are several answers for your situation. So please think of this as 3 samples of them.
It supposes that the URL is put in the cell "A1".
Sample 1:
=ARRAYFORMULA(MID(IMPORTXML(A1, "//div[#class='review-content']//text()"),19,20))
When the length of string of each value is the constant, how about this?
The value is retrieved by MID().
Sample 2:
=ARRAYFORMULA(INDEX(SPLIT(IMPORTXML(A1, "//div[#class='review-content']//text()"),"""",TRUE,TRUE),,4))
When the position of each value is the constant, how about this?
The value is retrieved by SPLIT() and INDEX().
Sample 3:
=ARRAYFORMULA(REGEXEXTRACT(IMPORTXML(A1, "//div[#class='review-content']//text()"),"publishedDate"":""(\w.+?)"""))
When the pattern of each value is the constant, how about this?
The value is retrieved by REGEXEXTRACT().
References:
MID
SPLIT
INDEX
REGEXEXTRACT
If these were not the results you want, I apologize. At that time, in order to correctly replicate your situation, can you provide the URL you are using as #Rubén says?

Xpath with contains() in importXML()

I've done dozens times, but now don't get what I'm doing wrong. I want to extract specific records, into 2 separate columns (I know that order wil not match), so I use:
//a/#href[contains(.; "github")]
and
//*[contains(text(); "Pricing:")]
But non of them is working - where my mistake?
(my sandbox: https://docs.google.com/spreadsheets/d/11Z3xybq_eYQvjn2-UBOomgeJxFrrsFoXKzF9yZSeASM/edit#gid=1841586203 with LT localle)
damn, those google sheet localles!!!... must be:
//a/#href[contains(., "github")]
and
//*[contains(text(), "Pricing:")]
I'll keep for further reference.

Excel - Search an exact match within a string

I'm currently struggling on finding the formula that will resolve my problem.
Here's the status quo:
In Sheet 1, column A, I have a set of string, such as:
/search.action?gender=men&brand=10177&tag=10203&tag=10336
/search.action?gender=women&brand=11579&tag=10001&tag=10138
/search.action?gender=men&brand=12815&tag=10203&tag=10299
/search.action?gender=women&brand=1396&tag=10203&tag=10513
/search.action?gender=women&brand=11&tag=10001&tag=10073
/search.action?gender=women&brand=1396&tag=10203&tag=10336
/search.action?gender=women&brand=13
In Sheet 2, column A, I have a set of strings such as:
brand=10177
brand=12815
brand=13
brand=1396
brand=11579
Finally, in sheet 1, column B will be my "filter" with the formula I'm struggling to find. The goal of my formula is to detect in any of the strings in sheet 1 if one of the string in sheet 2 is present (as an exact match!). Indeed, now it only finds approximative matches. As you can see, the row 5 shouldn't return anything. But with my current formula it does.
Here's the formula:
{=IFERROR(INDEX('Sheet 2'!$A$1:$A$5;MATCH(1;COUNTIF(A1;"*"&'Sheet 2'!$A$1:$A$5&"*");0));"")}
Any idea on the matter?
Please note that I don't want to use VBA, macros, but only a formula.
Thanks a lot for your help!
Following will solve your problem I guess:
=VLOOKUP(MID(A2,FIND("&",A2)+1,FIND("&",A2,FIND("&",A2)+1)-FIND("&",A2)-1),Sheet2!A:A,1,FALSE)
Basically with find function I have identified the start and length of the string in between "&" signs. and used in vlookup.
Another point to mention is this formula is only looking for the first 2 "&" signs.
For completeness, here is another solution based on this answer
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5,A1)),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This is a bit more general and it doesn't matter how many search tags there are.
However as it stands it would match brand=13 in the second sheet with brand=1396 in the first sheet. To avoid that you could add an ampersand to the search strings:-
=INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1)))
This formula throws a #VALUE error if there is no match: to avoid this, you would need to put an IFERROR statement round it:-
=IFERROR(INDEX(Sheet2!$A$1:$A$5,MAX(IF(ISERROR(FIND(Sheet2!$A$1:$A$5&"&",A1&"&")),-1,1)*(ROW(Sheet2!$A$1:$A$5)-ROW(Sheet2!$A$1)+1))),"")
All these are array formulae.

SUMIFS function in Google Spreadsheet

I'm trying to have a similar function to SUMIFS (like SUMIF but with more than a single criterion) in a Google Spreadsheet. MS-Excel has this function built-in (http://office.microsoft.com/en-us/excel-help/sumifs-function-HA010342933.aspx?CTT=1).
I've tried to use ArrayFormula (http://support.google.com/docs/bin/answer.py?hl=en&answer=71291), similar to the SUMIF:
=ARRAYFORMULA(SUM(IF(A1:A10>5, A1:A10, 0)))
By Adding AND:
=ARRAYFORMULA(SUM(IF(AND(A1:A10>5,B1:B10=1), C1:C10, 0)))
But the AND function didn't pick up the ArrayFormula instruction and returned FALSE all the times.
The only solution I could find was to use QUERY which seems a bit slow and complex:
=SUM(QUERY(A1:C10,"Select C where A>5 AND B=1"))
My Target is to fill up a table (similar to a Pivot Table) with many values to calculate:
=SUM(QUERY(DataRange,Concatenate( "Select C where A=",$A2," AND B=",B$1)))
Did anyone manage to do it in a simpler and faster way?
The simplest way to easily make SumIFS-like functions in my opinion is to combine the FILTER and SUM function.
SUM(FILTER(sourceArray, arrayCondition_1, arrayCondition_2, ..., arrayCondition_30))
For example:
SUM(FILTER(A1:A10;A1:A10>5;B1:B10=1)
Explanation: the FILTER() filters the rows in A1:A10 where A1:A10 > 5 and B1:B10 = 1. Then SUM() sums the values of those cells.
This approach is very flexible and easily allows for making COUNTIFS() functions for example as well (just use COUNT() instead of SUM()).
I found a faster function to fill up the "pivot table":
=ARRAYFORMULA(SUM(((Sample!$A:$A)=$A2) * ((Sample!$B:$B)=B$1) * (Sample!$C:$C) ))
It seems to run much faster without the heavier String and Query functions.
As of December, 2013, Google Sheets now has a SUMIFS function, as mentioned in this blog post and documented here.
Note that old spreadsheets are not converted to the new version, though you can try copy-pasting the data into a new workbook.
This guy used the Filter function to chop down the array by the criteria, then the sum function to add it all in the same cell.
http://www.youtube.com/watch?v=Q4j3uSqet14
It worked like a charm for me.

Resources