XPath: fetching a table with text and images in Google Sheets

I'm trying to import this table into Google Sheets: https://exvius.gamepedia.com/Chaining/Bolting_Strike
I also want the title text for the cells that contain images.
I can't figure out how to get the text of the full table together with img/@alt where it's available. I can get the table with
=IMPORTXML("https://exvius.gamepedia.com/Chaining/Bolting_Strike","//table[@class='wikitable']/tbody/tr[position()>=3]")
and only the image texts with
=IMPORTXML("https://exvius.gamepedia.com/Chaining/Bolting_Strike","//table[@class='wikitable']/tbody/tr[position()>=3]/td/a/img/@alt")
But I can't seem to do both. Is that a limitation of Google Sheets' IMPORTXML?
I've tried OR and other boolean operators with no luck. I also tried axes, but that didn't work for me either.

I propose something like this :
Sheet
Description:
In B1 we have the url of the webpage.
In B3 we have the following formula to import the first part of the table :
=QUERY(IMPORTHTML(B1;"table";1);"select Col1,Col2,Col3 OFFSET 2";0)
Columns L to O contain the following formulas, which fetch the element names and the ability names (the latter are used as the key in a VLOOKUP step). There are four formulas because an ability can have two element names. In L3, M3, N3 and O3 we have:
=IMPORTXML(B1;"//td/a[1]/img[@srcset]/ancestor::td[1]/preceding::a[1][@title]")
=IMPORTXML(B1;"//td/a[1]/img[@srcset]/@alt")
=IMPORTXML(B1;"//td/a[2]/img[@srcset]/ancestor::td[1]/preceding::a[1][@title]")
=IMPORTXML(B1;"//td/a[2]/img[@srcset]/@alt")
The formula in E4 is a one-liner that merges the results of two VLOOKUPs. Each VLOOKUP pairs an ability name with an element.
=ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(IFERROR(VLOOKUP(C4:INDIRECT("C"&COUNTA(C:C)+2);L:M;2;FALSE);"")&"|"&ARRAYFORMULA(IFERROR(VLOOKUP(C4:INDIRECT("C"&COUNTA(C:C)+2);N:O;2;FALSE);"")));"^\||\|$";""))
To finish, in H3 we have the last part of the table :
=QUERY(IMPORTHTML(B1;"table";1);"select Col5,Col6 OFFSET 2";0)
The rest (colours, borders, ...) is standard and conditional formatting.
Side note: I'm based in Europe, so you might have to replace ; with , in the formulas.
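The lookup-and-merge step in E4 can be easier to follow outside a spreadsheet. The following is only a rough Python sketch of that logic, not part of the sheet itself: the function names and the sample ability/element data are hypothetical, and the two lists stand in for columns L:M and N:O.

```python
import re

def vlookup(key, table):
    """Return the value paired with the first matching key, or "" when
    there is no match (the role IFERROR(VLOOKUP(...); "") plays)."""
    for candidate, value in table:
        if candidate == key:
            return value
    return ""

def merge_elements(abilities, first_pairs, second_pairs):
    """Join both lookups with "|" and trim a leading or trailing "|",
    as the REGEXREPLACE pattern ^\\||\\|$ does."""
    merged = []
    for name in abilities:
        joined = vlookup(name, first_pairs) + "|" + vlookup(name, second_pairs)
        merged.append(re.sub(r"^\||\|$", "", joined))
    return merged

# Hypothetical sample data standing in for columns C, L:M and N:O.
abilities = ["Ability A", "Ability B", "Ability C"]
first_pairs = [("Ability A", "Lightning"), ("Ability B", "Fire")]
second_pairs = [("Ability B", "Light")]

result = merge_elements(abilities, first_pairs, second_pairs)
print(result)
```

An ability with one element keeps just that element, one with two elements gets both separated by "|", and one with no image ends up as an empty string.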

Related

How to use a filter formula in Google Sheets when the data contains #N/A

For my example, I have two columns, A and B, in Google Sheets.
Column A holds a list of stock symbols such as AAPL, IBM, etc.
Column B holds a simple formula: GOOGLEFINANCE(A2,"price")
Sometimes GOOGLEFINANCE returns an error and the cell displays #N/A, but that is not my issue.
I would like to apply a filter on column B that shows all symbols whose price is greater than 100 or is #N/A.
I'd prefer not to use an extra column to achieve that.
I'm struggling with it and haven't found a way to get this result.
Note that my issue isn't GOOGLEFINANCE itself; it is just an example that produces #N/A values.
My thought was to use a filter with a formula like: =OR(ISNA(B:B), B:B>100)
But it seems to ignore the #N/A cells and doesn't show them.
Link for example
In my question I tried the formula "=OR(ISNA(B:B), B:B>100)".
What I learned is that when Google Sheets sees a cell with #N/A in this column, the result of the formula is automatically #N/A, even though ISNA is the first condition.
To solve it I used a formula like this:
=IF(ISERROR(B:B), TRUE, B:B>100)
I updated the sheet in case someone wants to check it.
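To illustrate why the IF version behaves differently from the OR version, here is a small Python model of the behaviour described above. It is a sketch under an assumption: that OR returns an error whenever any of its arguments is an error, while IF only evaluates the branch it needs. The names `NA`, `or_like_sheets` and the sample prices are made up for the illustration.

```python
NA = object()  # hypothetical stand-in for a cell showing #N/A

def greater_than(value, threshold):
    """Comparing an error cell yields an error, as B2>100 does on #N/A."""
    return NA if value is NA else value > threshold

def is_error(value):
    return value is NA

def or_like_sheets(*args):
    """Assumed model of OR: any error argument makes the whole result an error."""
    if any(arg is NA for arg in args):
        return NA
    return any(args)

def keep_with_or(price):
    # Mirrors =OR(ISNA(B2), B2>100)
    return or_like_sheets(is_error(price), greater_than(price, 100))

def keep_with_if(price):
    # Mirrors =IF(ISERROR(B2), TRUE, B2>100): the comparison is only
    # evaluated when the cell is not an error.
    return True if is_error(price) else greater_than(price, 100)

prices = [250.0, NA, 42.0]
or_results = [keep_with_or(p) for p in prices]
if_results = [keep_with_if(p) for p in prices]
print(if_results)
```

In this model the OR variant yields an error for the #N/A row, so the filter cannot treat it as TRUE, while the IF variant keeps both the high price and the #N/A row.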

Query referencing 20 sheets / Indirect error with multiple ranges

I have 20 sheets (Eagle, Kestrel, etc.) and want to reference the whole group of them in different queries.
To stop the query formula text from becoming massive, I tried the INDIRECT function, but it looks like INDIRECT cannot return multiple ranges.
Example for just 2 sheets:
Query({Indirect(A1)}) where A1 contains the text Eagle!F3:I33;Kestrel!F3:I33
gives the INDIRECT error "not a valid cell/range reference".
The 2 formulas below work OK but become unwieldy when referencing 20 sheets.
Query({Eagle!F3:I33;Kestrel!F3:I33})
Query({indirect(A2); indirect(A3)}) where A2 is Eagle!F3:I33 and A3 is Kestrel!F3:I33
Suggestions please (no script).
Challenge 2: how to include the sheet name (bird) in Col1 of the query output. The sheet name (bird) is written in cell A1 of each sheet.
Here is the solution that I settled on.
Problem summary
Challenge 1: Avoid oversized query formula when referencing many sheets/tabs.
Challenge 2: Return sheet name as part of the query output.
Key information
Script is not an option, as it causes access and performance issues for users in my organisation.
The INDIRECT function cannot pull multiple ranges into a query.
There is no function that returns sheet names (except within Script).
I started with a static list of sheet names.
Each sheet contains Name and Total data, but needs to be tagged with the sheet name to identify it in the query output. Each sheet also includes the sheet name in cell A1 (not used in the solution).
Solutions
Solution to Challenge 1: Specify the unique sheet ranges & select statements within hidden helper columns then reference them in the query.
Solution to Challenge 2: Insert sheet name as text within each select statement.
=query(
{query({indirect(B4)},C4);query({indirect(B5)},C5);
query({indirect(B6)},C6);query({indirect(B7)},C7);
query({indirect(B8)},C8);query({indirect(B9)},C9);
query({indirect(B10)},C10);query({indirect(B11)},C11);
query({indirect(B12)},C12);query({indirect(B13)},C13);
query({indirect(B14)},C14);query({indirect(B15)},C15);
query({indirect(B16)},C16);query({indirect(B17)},C17);
query({indirect(B18)},C18);query({indirect(B19)},C19);
query({indirect(B20)},C20);query({indirect(B21)},C21);
query({indirect(B22)},C22);query({indirect(B23)},C23)}
,"where Col3 >="&F2 &B2 ,0)
useful screen shot - helper columns and output
Cells F2 and B2 feed the final query. F2 is the user-defined minimum value to return. B2 controls the ordering of the output.
B2 builds an extra piece of text for the query statement, depending on the user's choice in the dropdown in E2.
=if(E2="order by lap count"," order by Col3 desc",)
The ,0 at the end of the final wraparound query is the optional headers argument. Zero tells QUERY that the input data has no header rows, which is necessary for this query.
The curly brackets inside each sheet query convert the column names F, G, H to Col1, Col2, Col3.
The curly brackets and semicolons in the final wraparound query combine the sheet query outputs into one array, one underneath the other.
Top tip: when referencing multiple sheets/tabs in a query, it is better to create a wraparound query (as above) to filter the output. If you filtered the individual sheet queries instead and one of them returned no data, the curly brackets in the wraparound query would return an array error.
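The overall shape of the solution (tag each sheet's rows with the sheet name, stack them, then filter and order once in the outer query) can be sketched in Python. This is only an illustration of the idea: the function names and sample rows are hypothetical, and it assumes each helper select statement produces sheet name, Name and Total in that order, so that Col3 is the total.

```python
def sheet_query(rows, sheet_name):
    """Inner query: tag every (name, total) row with its sheet name."""
    return [(sheet_name, name, total) for name, total in rows]

def combined_query(sheets, minimum, order_by_total=False):
    """Outer query: stack all inner results, then filter and order once."""
    stacked = []
    for sheet_name, rows in sheets.items():
        stacked.extend(sheet_query(rows, sheet_name))
    # Equivalent in spirit to "where Col3 >= F2"
    filtered = [row for row in stacked if row[2] >= minimum]
    # Equivalent in spirit to the optional " order by Col3 desc" text in B2
    if order_by_total:
        filtered.sort(key=lambda row: row[2], reverse=True)
    return filtered

# Hypothetical sample data; "Osprey" is an invented sheet with no rows.
sheets = {
    "Eagle": [("Ann", 12), ("Bob", 3)],
    "Kestrel": [("Cy", 7)],
    "Osprey": [],
}

result = combined_query(sheets, minimum=5, order_by_total=True)
print(result)
```

Filtering only in the outer step keeps every inner result unfiltered, which is the same design choice as the top tip above.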

Matching (querying?) criteria with IMPORTRANGE

Forgive me if I am not using the correct terminology; I sort of crash-coursed myself in Google Sheets a few days ago.
Is there a way to use IMPORTRANGE to import a data range from spreadsheet 2 into spreadsheet 1, where the rows taken from spreadsheet 2 are matched against criteria in spreadsheet 1 that correspond to criteria in spreadsheet 2? I have a specific set of data in spreadsheet 1 that has the same content as spreadsheet 2 (which I don't maintain) and spreadsheet 3 (which is maintained by a third person), but not in the same order. I am now being given access to data in spreadsheets 2 and 3 that I didn't previously have.
EXAMPLE:
https://docs.google.com/spreadsheets/d/1ByN9Ju8QiiHTfFgow7lDF4VN-zBRqP1gzpAK73ZRBNg/edit?usp=sharing
You work with IMPORTRANGE content the same way as you do with any range within your spreadsheet. Good practice is to use columns with unique content as IDs for searching, filtering, etc.
If you want to put the content of somebody else's spreadsheet into yours, you can control which parts you take.
For example:
To get the REGISTRATION number from sheet 3,
think of the VLOOKUP construction:
=VLOOKUP(key, table with the key in its leftmost column, number of the column to take the value from, false)
The VLOOKUP formula takes the name in your table as the key (the first parameter), so you must rebuild your IMPORTRANGE so that the key is in the leftmost column.
2nd parameter of VLOOKUP will look like this:
{IMPORTRANGE("Sheet3url","Sheet!Columnwithname"),IMPORTRANGE("Sheet3url","Sheet!Columnwithregistration")}
This is your temporary table made of 2 importranges.
You want 2nd column of this construction - which is column with registration.
Whole vlookup looks like this:
=VLOOKUP(key,{IMPORTRANGE("Sheet3url","Sheet!Columnwithname"),IMPORTRANGE("Sheet3url","Sheet!Columnwithregistration")},2,false)
It's much easier when the key is already on the left. If you want to extract SEX and DOB, use the formula below twice, with 2 and then 3 as the column index:
=VLOOKUP(key,IMPORTRANGE("Sheet3url","Sheet!Columnsfromname to DOB"),2,false)
Beware: using multiple IMPORTRANGE calls makes your sheet slow.
If you have hundreds of rows, you should wrap the formula in ARRAYFORMULA to process all rows in one go.
You can also first IMPORTRANGE somebody's table into a side area of your own sheet and operate on it there.
That is advisable when working with big datasets and not that many files.
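As a rough illustration of the "temporary table with the key on the left" idea, the Python sketch below pairs a name column with a registration column and looks names up in it. Everything here is hypothetical (the helper names and the sample names and registration numbers); it only mirrors what the two-column array and VLOOKUP do.

```python
def build_lookup_table(key_column, value_column):
    """Pair two imported columns side by side, key column first,
    like {IMPORTRANGE(names), IMPORTRANGE(registrations)}."""
    return list(zip(key_column, value_column))

def vlookup(key, table, index):
    """Exact-match lookup: return the value in the 1-based column
    `index` of the first row whose leftmost cell equals `key`."""
    for row in table:
        if row[0] == key:
            return row[index - 1]
    return None  # plays the role of #N/A

# Hypothetical data standing in for the other person's spreadsheet.
names = ["Dana", "Eli", "Fay"]
registrations = ["R-103", "R-207", "R-318"]
table = build_lookup_table(names, registrations)

# Names in my own sheet, in a different order, with one unknown name.
my_names = ["Fay", "Dana", "Zed"]
result = [vlookup(name, table, 2) for name in my_names]
print(result)
```

The order of rows in the source no longer matters, because each value is fetched by its key rather than by position.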

ImportXML XPath issue using Google Sheets on a simple web scraping query

I've been trying, with no success, to use IMPORTXML in Google Sheets to scrape the Advanced Receiving table data from the URL https://www.pro-football-reference.com/boxscores/201912290car.htm.
I've tried the XPath copied directly from Chrome's inspect panel: //*[@id="div_receiving_advanced"]
but I always get the "Imported content is empty" error message.
I'm stumped because it works for the Passing, Rushing, & Receiving table data with the XPath: //*[@id="div_player_offense"]
When I use the XPath //*[@id="all_receiving_advanced"], I get the following results.
unparsed results
However, I'd like to parse the data from the 2nd column so it looks like this.
parsed results
Any help would be greatly appreciated.
Since some players have no value in specific columns (e.g. "Rec/Br"), directly transforming the data returned by IMPORTXML will produce a scrambled table.
2 solutions :
A) Use the IMPORTFROMWEB add-on (the number of requests is limited in the free plan) with JS rendering activated and a base selector option to keep the data structure. XPath expressions needed for the data :
/th/a
/td[@data-stat="team"]
/td[@data-stat="targets"]
/td[@data-stat="rec"]
/td[@data-stat="rec_yds"]
/td[@data-stat="rec_first_down"]
/td[@data-stat="rec_air_yds"]
/td[@data-stat="rec_air_yds_per_rec"]
/td[@data-stat="rec_yac"]
/td[@data-stat="rec_yac_per_rec"]
/td[@data-stat="rec_broken_tackles"]
/td[@data-stat="rec_broken_tackles_per_rec"]
/td[@data-stat="rec_drops"]
/td[@data-stat="rec_drop_pct"]
for the headers :
//div[@id="div_receiving_advanced"]//th[contains(@class,"poptip")]
for the base selector :
//div[@id="div_defense_advanced"]//tr[@data-row][not(@class)]
Formula used in C6 :
IMPORTFROMWEB(B1;B2:O2;B3:C4)
Output :
Side note : IMPORTFROMWEB often outputs loading errors.
B) Use IMPORTDATA and formulas to generate the table. First we load the data of interest with a filter (QUERY). Then we fix the blank cells problem with SUBSTITUTE. After that we extract the data with REGEXEXTRACT. Finally we apply a last filter and SPLIT the data to populate the cells.
Formula :
=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'rec_broken_tackles_per_rec'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3}).+targets.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+?rec.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14"));"select * WHERE NOT Col1 contains '<'");";"))
Output :
In both cases, blank cells are replaced with 0.
My working workbook is here.
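The two key ideas of solution B (fill empty cells with 0 first, then pull the values out with a regular expression) can be shown in a much simplified Python sketch. It is not a translation of the formula above: the regular expression is a shorter one written for this example, and the sample row is hypothetical markup that only assumes cells carry a data-stat attribute as in the XPath expressions listed earlier.

```python
import re

def fill_empty_cells(html_row):
    """Same idea as SUBSTITUTE(...; "></td>"; ">0</td>"): give empty
    cells a value so columns do not shift."""
    return html_row.replace("></td>", ">0</td>")

# Simplified pattern for this sketch: a data-stat cell, an optional
# opening <a ...> tag, then the cell text up to the next tag.
CELL = re.compile(r'data-stat="([^"]+)"\s*>(?:<a[^>]*>)?([^<]*)')

def extract_row(html_row):
    """Return the cell values of one table row, in document order."""
    return [value for _, value in CELL.findall(fill_empty_cells(html_row))]

# Hypothetical row; the player link and name are invented.
row = (
    '<tr><th data-stat="player"><a href="/players/X/XxxxXx00.htm">Player One</a></th>'
    '<td data-stat="team">CAR</td>'
    '<td data-stat="targets">5</td>'
    '<td data-stat="rec_broken_tackles"></td></tr>'
)

result = extract_row(row)
print(result)
```

Without the fill step the empty cell would come back as an empty string, and in a spreadsheet SPLIT an empty field is easy to lose, which is the scrambling problem the answer describes.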
EDIT :
For "Advanced Defense Table" with IMPORTDATA :
=ARRAYFORMULA(SPLIT(QUERY(ARRAYFORMULA(REGEXREPLACE(ARRAYFORMULA(SUBSTITUTE(QUERY(IMPORTDATA(B3);"select Col1 where Col1 contains 'def_tgt_yds_per_att'");"></td>";">0</td>"));".+htm.+?>(.+?)<.+team.+([A-Z]{3})<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?def.+?>(.+?)<.+?bli.+?>(.+?)<.+?qb_.+?>(.+?)<.+?qb_.+?>(.+?)<.+?sac.+?>(.+?)<.+?pre.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+?tac.+?>(.+?)<.+";"$1;$2;$3;$4;$5;$6;$7;$8;$9;$10;$11;$12;$13;$14;$15;$16;$17;$18;$19;$20;$21;$22"));"select * WHERE NOT Col1 contains '<'");";"))

Get meaning for a word from dictionary Using XPath Google sheets importxml function

I'm trying to use the IMPORTXML function in Google Sheets to grab the meanings of words, and related information, from https://www.powerthesaurus.org/
I more or less succeeded in getting some data from another website, but as a newbie I'm having trouble getting any data from this one, in cell D6 of this Google Sheet.
=ImportXML("https://www.powerthesaurus.org/"&A6,"//*[@id='link link--primary link--term']")
Could someone show me the correct formula?
You're looking for synonyms. Note that Power Thesaurus can display up to 200 of them.
To get the first 50 synonyms in one cell (since you have one word per row), you can try this:
Create 50 numbered columns in your Google Sheet.
Apply this formula to the first cell and drag it to the right.
=IMPORTXML("https://www.powerthesaurus.org/abbreviation/synonyms";"(//div[@class='pt-thesaurus-card__term'])"&"["&B2&"]")
Then use a JOIN formula to get all the words in one cell (XX:XX is the range of your columns, B3:F3 in the provided screenshot).
=JOIN("|";XX:XX)
Result :
Alternatively, we could have used this one-liner (with some cleanup afterwards), but Google Sheets returns a blank cell even though the XPath is perfectly valid:
=IMPORTXML("https://www.powerthesaurus.org/abbreviation/synonyms";"normalize-space(//div[@class='pt-list-terms__container'])")
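For readers who want to see the "pick the n-th match, then join" approach outside Sheets, here is a minimal Python sketch using the standard library's ElementTree. The page fragment is invented for the example (only the two class names come from the formulas above), and because ElementTree supports just a subset of XPath, the position is applied by indexing the list of matches instead of writing (...)[n] in the expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the synonyms page.
PAGE = """
<html><body>
  <div class="pt-list-terms__container">
    <div class="pt-thesaurus-card__term">shortening</div>
    <div class="pt-thesaurus-card__term">contraction</div>
    <div class="other">ignored</div>
    <div class="pt-thesaurus-card__term">acronym</div>
  </div>
</body></html>
"""

def nth_term(root, n):
    """Return the text of the n-th (1-based) term div, or "" if absent."""
    terms = root.findall(".//div[@class='pt-thesaurus-card__term']")
    return terms[n - 1].text if 1 <= n <= len(terms) else ""

root = ET.fromstring(PAGE.strip())

# One lookup per numbered column, like dragging the formula to the right.
columns = [nth_term(root, n) for n in range(1, 5)]

# Join the non-empty results with "|"; skipping blanks is a choice made
# in this sketch, not something the JOIN formula is claimed to do.
joined = "|".join(term for term in columns if term)
print(joined)
```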
