Octoparse Xpath or iframe Problem extracting - xpath

I am trying to extract the prices of some products on the Mercado Libre website.
The problem is that when a product has a discount, the price text is not extracted.
I have included one link with a discount and one without. I want Octoparse to extract the price in both situations.
How can I do it?
LINKS:
https://articulo.mercadolibre.com.mx/MLM-666847965-funda-protector-iphone-7-8-se-2020-supcase-ubstyle-negro-_JM?quantity=1#position=1&type=item&tracking_id=9e0a5e4a-891d-4b89-add3-7aca91d6969a
https://articulo.mercadolibre.com.mx/MLM-721688631-protector-funda-case-rudo-iphone-6-7-8-x-xs-xr-xs-max-11-pro-_JM?quantity=1#searchVariation=43860021612&position=8&type=pad&tracking_id=9e0a5e4a-891d-4b89-add3-7aca91d6969a&is_advertising=true&ad_domain=VQCATCORE_LST&ad_position=8&ad_click_id=YTY0MWNiMWQtMDFmNi00ZGJmLThjZjMtYWM3YWQyZTc3OWNl

I don't know Octoparse well, but if you can specify the XPath manually, you can use:
(//fieldset[contains(@class,"item-price")]//@content)[last()]
This selects the exact price (with decimals) of each item. The value is taken from the content attribute of the enclosing span element. So in your case:
254.62 and 250.0 will be extracted.
Alternative ways :
A) :
string(//fieldset[contains(@class,"item-price")]//span[@class="price-tag"])
Output :
$ 254 . 62 and $ 250
B) :
(//fieldset[contains(@class,"item-price")]//span[@class="price-tag-fraction"]/text())[last()]
Output :
254 and 250
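To illustrate why taking the last match helps with discounted items, here is a minimal Python sketch using only the standard library. The two markup snippets are invented for illustration (the real Mercado Libre pages are much larger and may be structured differently), and last_price is a hypothetical helper name. The sketch assumes a discounted item carries two content attributes inside the price fieldset, with the original price first and the current price last.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Invented sample markup (assumption): a discounted item and a regular item.
DISCOUNTED = """
<div>
  <fieldset class="item-price item-price--discount">
    <del><span class="price-tag" content="300.0">300</span></del>
    <span class="price-tag" content="254.62">254</span>
  </fieldset>
</div>
"""
REGULAR = """
<div>
  <fieldset class="item-price">
    <span class="price-tag" content="250.0">250</span>
  </fieldset>
</div>
"""

def last_price(markup: str) -> Optional[str]:
    """Mimic (//fieldset[contains(@class,"item-price")]//@content)[last()]."""
    root = ET.fromstring(markup)
    values = [
        el.get("content")
        for fieldset in root.iter("fieldset")
        if "item-price" in fieldset.get("class", "")
        for el in fieldset.iter()  # document order, like the XPath node-set
        if el.get("content") is not None
    ]
    return values[-1] if values else None

print(last_price(DISCOUNTED), last_price(REGULAR))
```

Because the last match is taken over the whole result set, the same expression works whether the fieldset holds one price or two.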

Related

How to get list of all exchanges with xpath to google sheets

I am trying to get a list of cryptocurrency exchanges from the second page of CoinGecko into my Google Sheet.
I want a result like:
Tokenize
Bibox
Vebitcoin
...
I tried it with:
IMPORTXML("https://www.coingecko.com/en/exchanges?page=2", "//*[contains(text(),' exchange')]")
As a result, I get the error:
Imported content is empty.
How about this modified xpath?
Modified xpath:
//span/a[contains(@href,'/en/exchanges/')]
and
//span[@class='pt-2 flex-column']/a[contains(@href,'/en/exchanges/')]
Modified formula
=IMPORTXML(A1,"//span/a[contains(@href,'/en/exchanges/')]")
In this case, the URL https://www.coingecko.com/en/exchanges?page=2 is placed in cell A1.
Note:
The list of cryptocurrency exchanges can be retrieved by the modified xpath. But in this case, it seems that the values of Tokenize, Bibox and Vebitcoin are not included.
Reference:
IMPORTXML
To get the list of all exchanges, you can also use the following formula :
=ARRAYFORMULA(REGEXEXTRACT(QUERY(TRANSPOSE(IMPORTDATA("https://api.coingecko.com/api/v3/search?locale=en&img_path_only=1"
));"select * WHERE Col1 starts with ""name""");"name:""(.+)"""))
The output contains roughly 8000 elements.
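The REGEXEXTRACT step in that formula boils down to pulling every "name" value out of the raw API text. A rough Python equivalent is sketched below. The sample payload is invented for illustration and its shape is an assumption, not the actual CoinGecko response; extract_names is a hypothetical helper name.

```python
import re

# Invented sample text (assumption): the real API response may look different.
SAMPLE = (
    '{"exchanges":[{"id":"tokenize","name":"Tokenize"},'
    '{"id":"bibox","name":"Bibox"},'
    '{"id":"vebitcoin","name":"Vebitcoin"}]}'
)

def extract_names(raw_text):
    """Return every value stored under a "name" key, in order of appearance."""
    return re.findall(r'"name"\s*:\s*"([^"]+)"', raw_text)

print(extract_names(SAMPLE))
```

When the payload is valid JSON, parsing it with the json module and walking the structure is usually more robust than a regex; the regex version is shown only because it mirrors the spreadsheet formula.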

My XPath in Google Sheets IMPORTXML command always returns #N/A

I am trying to scrape some data from the website Sporcle (specifically the Date Earned from one of the Badges) but the XPath that I got from [F12-->right-clicking the element-->Copy-->Copy XPath] does not seem to work with the google sheets command IMPORTXML; all I ever get is #N/A.
=IMPORTXML("https://www.sporcle.com/user/Jimmy/badges/earned/","//*[@id='badge-container']/div[1]/div[3]")
The website uses dynamic rendering, so classic methods don't work. I see three ways to do it:
With IMPORTXML: we retrieve the JSON data from a script element and parse it with formulas.
With IMPORTXML + the ImportJSON script: we retrieve the JSON data from a script element and parse it with the script (cleaner).
With the IMPORTFROMWEB add-on (the number of requests is limited in the "free" plan).
Solution 1 :
First, we extract the JSON data in A1 with IMPORTXML and the following formula :
=IMPORTXML(B1;"substring-before(substring-after(//*[contains(text(),'badge_limiter')],'var badgeList = [{'),'}]')")
Then we parse the data with a combination of multiple formulas. In J2 we write :
=QUERY(ARRAYFORMULA(SPLIT(TRANSPOSE(SPLIT(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(REGEXREPLACE(M1;"(""\w+?_\w+?"":)";"");""",";""";");"""";"");"},";"");"{"));";"));"select Col1,Col6")
Solution 2 :
First, we extract the JSON data in A1 with IMPORTXML and the following formula :
=IMPORTXML(B1;"substring-before(substring-after(//*[contains(text(),'badge_limiter')],'var badgeList = '),'}]')")&"}]"
Then we parse the data with the script. Formula used in F1 is :
=ImportJSONFromSheet("Feuille 15";"/badge_name,/earned_date")
Where Feuille 15 is the name of the sheet I'm working with. The rest is to select the columns of interest.
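The logic of Solution 2 (cut the JSON array out of the script text, then pick two fields) can be sketched in Python as follows. The script text and the badge values are invented for illustration; only the variable name badgeList and the keys badge_name and earned_date come from the formulas above. The helper names are hypothetical and emulate the XPath functions substring-after and substring-before.

```python
import json

def substring_after(text, marker):
    """XPath substring-after: text after the first marker, '' if absent."""
    return text.partition(marker)[2]

def substring_before(text, marker):
    """XPath substring-before: text before the first marker, '' if absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""

# Invented sample script content (assumption), shaped like the formula expects.
SCRIPT_TEXT = (
    'var badge_limiter = 3; var badgeList = ['
    '{"badge_name":"Early Bird","earned_date":"2020-01-05","badge_id":1},'
    '{"badge_name":"Night Owl","earned_date":"2020-02-11","badge_id":2}'
    ']; var other = 1;'
)

def parse_badges(script_text):
    # Same cut as the formula: everything after "var badgeList = " up to the
    # first "}]", with the "}]" appended again to restore valid JSON.
    raw = substring_before(substring_after(script_text, "var badgeList = "), "}]") + "}]"
    return [(badge["badge_name"], badge["earned_date"]) for badge in json.loads(raw)]

print(parse_badges(SCRIPT_TEXT))
```

If the marker is missing from the script text, json.loads raises an error here, which is roughly the counterpart of the formula returning an error in the sheet.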
Solution 3 :
XPaths used for badge names and dates:
//td[@class='left-align link-col col-width-1']
//td[@class="col-width-3"]
Then we enter the formula in B5:
=IMPORTFROMWEB(C1;C2:D2;B3:C3)
Note : be sure to set jsRendering to TRUE.
Side note : I'm based in Europe, so you'll probably need to replace ; with , in the formulas.

importxml to Google Sheets + hltv

I'm trying to import player data from HLTV into Google Sheets with IMPORTXML, but I can't get it to work. I've discovered that there are multiple div classes in a row, and inside them there are spans where the actual data is.
I have tried multiple ways to get either all the info together or one piece at a time, but I'm starting to run out of options.
For example:
=IMPORTXML("https://www.hltv.org/stats/players/11893/ZywOo","//@class='Statistics-row'//@class='columns'")
I have also tried to get players from a certain country at https://www.hltv.org/stats/players
Can someone help?
An alternative to @Madhurjya's proposal. With the IMPORTFROMWEB add-on you can have:
XPaths used :
//div[@class="statistics"]//span[1]
//div[@class="statistics"]//span[2]
Formula :
=IMPORTFROMWEB(B1;B2:C2)
But also :
Xpaths used :
//a[preceding-sibling::img[@alt="France"]]
//img[@alt="France"]/@alt
Formula :
=IMPORTFROMWEB(B1;B2:C2)
Note: the number of requests is limited. Check the pricing or write your own Google Apps Script.

IMPORTXML Google Sheets for every 2nd node?

I'm having trouble trying to get a value with IMPORTXML in a google spreadsheet ...
I am using this XPath:
//*[contains(@class,"price")] which smoothly returns ALL prices posted on a web page.
The problem is that within that same class (and I don't know why, with dynamic IDs!) I have 2 nodes/prices: "Registered Customer Price" and "Non-Customer Price", which is the 2nd value and the one I am interested in obtaining.
So, I wanted to apply it like this:
(//*[contains(@class,"price")])[2] and with this, I only get the 2nd price... but of the whole page!
(and not the 2nd price of each and every item!)
I assume it is a "syntax" problem ... but no matter how many times I try it, I don't get the expected result!
Can you give me a hand with this?
Thanks in advance for any suggestion!
Just use :
//div[@class='price-box'][2]//span[@id]
EDIT : With IMPORTFROMWEB:
//h4[.="Precio unitario por unidad"]/following-sibling::span/span[@id]
EDIT 2 : More robust XPath :
//h4[.="Precio unitario por unidad"]/following-sibling::span[@class="price-excluding-tax"][count(following-sibling::*)=0]/span[@id]
try:
=FILTER(IMPORTXML(
"http://www.maxiconsumo.com/sucursal_villa_dominico/comestibles/aceites/aceite-girasol.html";
"//*[contains(@id,'price-including-tax')]"); MOD(ROW(INDIRECT("A1:A"&COUNTA(IMPORTXML(
"http://www.maxiconsumo.com/sucursal_villa_dominico/comestibles/aceites/aceite-girasol.html";
"//*[contains(@id,'price-including-tax')]")))); 2)=0)
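The FILTER/MOD formula keeps only the even-numbered rows of the imported column, i.e. the second price of each pair. The same selection, sketched in Python on invented sample values (the prices below are assumptions, not data from the page; every_second is a hypothetical helper name):

```python
def every_second(values):
    """Keep items 2, 4, 6, ... (1-based), like MOD(ROW(...), 2) = 0."""
    return values[1::2]

# Hypothetical column: registered-customer price, then non-customer price, per item.
prices = ["100,00", "110,00", "200,00", "220,00", "300,00", "330,00"]
non_customer = every_second(prices)
print(non_customer)
```

This only works if every item really contributes exactly two prices in a fixed order; an item with a single price would shift all later pairs.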

How to extract the price with importxml google sheets xpath

Good morning,
I can't extract the price on this page with the importxml function:
https://www.t-collector.com/reine?prop%5Bcolor%5D=black&product=26&side=front
I need it to update my google merchant files.
I've tried different formulas like:
=importxml(G2;"//span[@itemprop='price']")
=importxml(G2;"//b[@itemprop='price']/@content")
=importxml(G2;"//b[@itemprop='price'][1]/@content")
=importxml(G2;"//meta[@itemprop='price'][1]/@content")
=importxml("G2";"//span[@itemprop='price']")
but nothing works
Thanks
Sincerely
The website uses dynamic rendering, so Selenium would normally be required here. But we can try with Google Sheets, using a custom script to load the JSON data directly.
The script to import JSON data with GoogleSheets (credits to Paul Gambill) : https://gist.github.com/paulgambill/cacd19da95a1421d3164
And the data :
https://www.t-collector.com/campaigns/C-PGE7F?format=json&store=tcollectorofficiel
We use SQL-like formulas to keep only the price.
EDIT : Solution with IMPORTXML :
You can use the following formula (tested with 5 shirts) :
=IMPORTXML(A2;"substring-after(substring-before((//script)[6],'"",""category""'),',""price"":""')")
EDIT 2 : Fix to extract the default displayed price in euros :
=IMPORTXML(A2;"substring-after(substring-before(//script[starts-with(.,'var campaignObj')],'"",""gbp""'),'""eur"":""')")
EDIT 3 : To ignore on sale prices, we can use the following one liner :
=IF(IMPORTXML(A2;"substring(substring-after(//script[starts-with(.,'var campaignObj')],'""compare_at_prices"":{""eur"":""'),1,1)")=0;IMPORTXML(A2;"substring-after(substring-before(//script[starts-with(.,'var campaignObj')],'"",""gbp""'),'""eur"":""')");IMPORTXML(A2;"substring-before(substring-after(//script[starts-with(.,'var campaignObj')],'""compare_at_prices"":{""eur"":""'),'""')"))
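The branching in this one-liner is easier to follow outside a spreadsheet. Below is a minimal Python sketch of the same decision: if the compare-at price in euros starts with 0, return the displayed euro price, otherwise return the compare-at price. The two script strings are invented for illustration and only imitate the fragments the formula searches for; the real campaignObj may contain much more. The helper names are hypothetical and emulate XPath substring-after and substring-before.

```python
def substring_after(text, marker):
    """XPath substring-after: text after the first marker, '' if absent."""
    return text.partition(marker)[2]

def substring_before(text, marker):
    """XPath substring-before: text before the first marker, '' if absent."""
    head, found, _ = text.partition(marker)
    return head if found else ""

COMPARE_MARKER = '"compare_at_prices":{"eur":"'

# Invented sample script contents (assumption).
NOT_ON_SALE = ('var campaignObj = {"price":{"eur":"19.90","gbp":"17.50"},'
               '"compare_at_prices":{"eur":"0","gbp":"0"}};')
ON_SALE = ('var campaignObj = {"price":{"eur":"14.90","gbp":"12.90"},'
           '"compare_at_prices":{"eur":"24.90","gbp":"21.90"}};')

def regular_price(script_text):
    compare_tail = substring_after(script_text, COMPARE_MARKER)
    if compare_tail[:1] == "0":
        # No sale: take the displayed euro price, as in EDIT 2.
        return substring_after(substring_before(script_text, '","gbp"'), '"eur":"')
    # On sale: take the compare-at (pre-sale) euro price instead.
    return substring_before(compare_tail, '"')

print(regular_price(NOT_ON_SALE), regular_price(ON_SALE))
```

Like the formula, this depends on the exact key order in the script text, so it is fragile if the site changes how the object is serialized.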
