XPath in Google Sheets: counting a class icon from HTML

I'm relatively new to XPath in Google Sheets. I am trying to get scores from a movie website where the score is out of five stars, with images used for the stars, so I need to count the class icon-star-full in the HTML below:
<span class="rating "><i class="icon-star-full"></i> <i class="icon-star-full"></i> <i class="icon-star-full"></i> <i class="icon-star-full"></i> <i class="icon-star"></i></span>
In Google Sheets, the count function seems to work fine for every class I try except icon-star-full. For example, count(//*[@class='rating']) works fine: I get a count of every element with the class rating. However, count(//*[@class='icon-star-full']) returns 0 on every page. For example, for the HTML above I should get 4 for my count, but I get 0.
Is there a different way I should be doing the count for icons?

try:
=COUNTA(IFERROR(QUERY(ARRAY_CONSTRAIN(IMPORTDATA(A1), 1000, 1),
"where Col1 contains 'icon-star-full'")))&"/5"
where A1 is your URL
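To sanity-check the expected value outside Sheets, here is a minimal Python sketch that counts elements carrying a given class in the sample markup from the question. It assumes the stars are present in the raw HTML rather than rendered by JavaScript; ClassCounter and count_class are illustrative names, not part of any Sheets or site API.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
SAMPLE = (
    '<span class="rating "><i class="icon-star-full"></i> '
    '<i class="icon-star-full"></i> <i class="icon-star-full"></i> '
    '<i class="icon-star-full"></i> <i class="icon-star"></i></span>'
)


class ClassCounter(HTMLParser):
    """Counts start tags whose class attribute contains a given class token."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Split on whitespace so "icon-star" does not match "icon-star-full".
        classes = (dict(attrs).get("class") or "").split()
        if self.wanted in classes:
            self.count += 1


def count_class(html, wanted):
    parser = ClassCounter(wanted)
    parser.feed(html)
    return parser.count


full_stars = count_class(SAMPLE, "icon-star-full")
rating_spans = count_class(SAMPLE, "rating")
print(f"{full_stars}/5")  # prints 4/5 for the sample above
```

If a script like this finds the stars in the downloaded page but the sheet formula still returns 0, the problem is more likely in the formula than in the page source.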

Related

ImportXML function in Google Dynamic XML path

I am trying to import the headlines and landing page URLs from the "New + Updated" section of this page:
https://www.nytimes.com/wirecutter/
The issue is that the class "_988f698c" keeps changing as the headline is being replaced with a new headline/topic.
I need a workaround so that the IMPORTXML function dynamically captures the class of the object in that position. The current formula is:
=IMPORTXML("https://www.nytimes.com/wirecutter/","//*[@class='_988f698c']")
Here is the HTML for one entry as an example. The class "_988f698c" refreshes every hour or so as new headlines come in.
<li class="e9a6bea7">
<a class="_988f698c" href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
<p class="_9d1f22a9">today
</p>
</li>
Is there a way I can do this?
Step back a little and look for an alternative path that does not depend on the randomly generated class names.
For the title, use:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a"
)
For the URL attached to the title:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a/@href"
)
For the text indicating the day of publication:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/p"
)
If you want to collect everything together, use | to combine the paths:
=IMPORTXML(
"https://www.nytimes.com/wirecutter/",
"//ul[@data-testid='new-and-updated']/li/a |
//ul[@data-testid='new-and-updated']/li/a/@href |
//ul[@data-testid='new-and-updated']/li/p"
)
Only use this if you are absolutely sure that every value will always exist. If one is missing, the results shift to different rows, which breaks any formulas that depend on fixed values in specific cells.
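For checking the idea outside Sheets, here is a rough Python sketch of selecting by the stable data-testid attribute instead of the generated class names. The li element is copied from the question; the wrapping ul with data-testid="new-and-updated" is assumed from the answer's XPath, and the fragment is simplified so that the standard-library XML parser accepts it. Reading each li as a unit also keeps title, URL, and date aligned when one of them is missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment: the <li> comes from the question,
# the surrounding <ul data-testid="new-and-updated"> is an assumption.
SAMPLE = """
<div>
  <ul data-testid="new-and-updated">
    <li class="e9a6bea7">
      <a class="_988f698c" href="https://www.nytimes.com/wirecutter/reviews/gir-spatula-review/">Why We Love GIR Spatulas</a>
      <p class="_9d1f22a9">today
      </p>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(SAMPLE)

rows = []
# Select list items through the stable attribute, not the random class names.
for li in root.findall(".//ul[@data-testid='new-and-updated']/li"):
    link = li.find("a")
    date = li.find("p")
    rows.append((
        link.text.strip() if link is not None and link.text else "",
        link.get("href", "") if link is not None else "",
        date.text.strip() if date is not None and date.text else "",
    ))

for title, url, day in rows:
    print(title, url, day, sep=" | ")
```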

Xpath: Select elements based on descendants that do not have a certain attribute

I am scraping through real estate listings from a certain site that contains multiple pages.
Here, I have summarized a structure nested deep in the DOM. I want to select all list items except those with a descendant that carries a certain attribute value, like <div id="nav-ad-container">
<ul class="photo-cards photo-cards_wow photo-cards_short photo-cards_extra-attribution">
<li>..</li>
<li>..</li>
<li>
<div id="nav-ad-container" class="zsg-aspect-ratio"></div>
</li>
<li>..</li>
<li>..</li>
<li>..</li>
</ul>
However, the attribute and its value change from page to page.
For example:
@id = 'nav-ad-container' or @class = 'nav-ad-empty'
In general, I want to retrieve the list items that do not contain the name pattern 'nav-ad'.
Things that I've tried with no success (each still selects every list item):
xpath + //li[not(contains(@class, 'nav-ad'))]
xpath + //li[not((contains(@class,'nav-ad')) or contains(@id,'nav-ad'))]
Can anyone guide me toward a solution? I feel like I'm pretty close but missing something.
Filter by the class name of the list items or their descendants:
//li[not(contains(descendant-or-self::node()/@class,'nav-ad'))]
(not tested)
Try
//li[not(descendant-or-self::node()/@class[contains(.,'nav-ad')])]
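The difference between testing the li itself and testing its descendants can be sketched in Python with the standard library. This is only an illustration under assumptions: the ".." placeholders from the question are replaced with letters so the kept items can be told apart, and mentions_nav_ad is a hypothetical helper. It checks both class and id, since the question says either attribute may carry the 'nav-ad' marker.

```python
import xml.etree.ElementTree as ET

# Structure from the question; the letters stand in for the elided content.
SAMPLE = """
<ul class="photo-cards photo-cards_wow photo-cards_short photo-cards_extra-attribution">
  <li>a</li>
  <li>b</li>
  <li><div id="nav-ad-container" class="zsg-aspect-ratio"></div></li>
  <li>c</li>
  <li>d</li>
  <li>e</li>
</ul>
"""


def mentions_nav_ad(element):
    """True if the element's own class or id contains 'nav-ad'."""
    return any("nav-ad" in element.get(name, "") for name in ("class", "id"))


root = ET.fromstring(SAMPLE)
items = root.findall("li")

# Like the attempts in the question: only the <li>'s own attributes are
# tested, and no <li> has any, so every item is still selected.
own_attrs_only = [li for li in items if not mentions_nav_ad(li)]

# Descendant-or-self check: li.iter() yields the <li> and every element below.
kept = [li for li in items
        if not any(mentions_nav_ad(node) for node in li.iter())]

print(len(own_attrs_only), len(kept))  # prints 6 5
```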

Comparing XPath values in Selenium

I am trying to branch on the result of XPath lookups.
I want to check whether a post has comments or not.
<div class="media-body">
<a href="https://url" class="ellipsis">
<span class="pull-right count orangered">
+26 </span>
post title </a>
<div class="media-info ellipsis">
admin <i class="fa fa-clock-o"></i> date </div>
</div>
If there is a comment, <span class="pull-right count orangered"> is generated. If there are no comments, it is not generated.
xpath comment //*[@id="thema_wrapper"]/div[3]/div/div/div[3]/div/div[7]/div[2]/div[1]/div[2]/a/span
xpath nocomment //*[@id="thema_wrapper"]/div[3]/div/div/div[3]/div/div[7]/div[2]/div[1]/div[2]/a/
I think this can be handled with if/else, but I don't know how.
if
#nocomment start
else
#comment stop
I searched a lot but couldn't find anything. Please help me.
Here's an XPath example to select/click an entry that has no comments. This website seems to use the same system as your sample data:
http://cineaste.co.kr/
To select the entries with no comments in the movies block ("영화이야기"), just use:
//h3[.="영화이야기"]/following::div[@class="widget-small-box"][1]//li[@class="ellipsis"][not(contains(.,"+"))]
We check for the presence of "+" in the li node to filter the data.
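The if/else the question asks about can be sketched offline against the sample markup, without a browser. This is a rough illustration, not the site's actual behavior: the second fragment is a hypothetical "no comments" version made by removing the counter span, and has_comments and action_for are made-up names.

```python
import xml.etree.ElementTree as ET

# Post markup from the question (has a comment counter).
WITH_COMMENTS = """
<div class="media-body">
  <a href="https://url" class="ellipsis">
    <span class="pull-right count orangered">
    +26 </span>
    post title </a>
  <div class="media-info ellipsis">
    admin <i class="fa fa-clock-o"></i> date </div>
</div>
"""

# Same post with the counter <span> removed (hypothetical "no comments" case).
WITHOUT_COMMENTS = """
<div class="media-body">
  <a href="https://url" class="ellipsis">
    post title </a>
  <div class="media-info ellipsis">
    admin <i class="fa fa-clock-o"></i> date </div>
</div>
"""


def has_comments(media_body):
    """A post has comments when its link contains a counter <span>."""
    link = media_body.find("a")
    return link is not None and link.find("span") is not None


def action_for(html):
    post = ET.fromstring(html)
    if has_comments(post):
        return "skip"   # comments present: leave the post alone
    return "click"      # no comments: this is a post to open


print(action_for(WITH_COMMENTS), action_for(WITHOUT_COMMENTS))  # prints skip click
```

In Selenium the same existence test is usually written with find_elements (plural), which returns an empty list when nothing matches instead of raising NoSuchElementException.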
Oh, it's the same system. I tested it and there was an error.
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//h3[.='영화이야기']/following::div[@class='widget-small-box'][1]//li[@class='ellipsis'][not(contains(.,'+'))]"}
(Session info: chrome=81.0.4044.138)
from selenium import webdriver
import time
path = r"C:\chromed\chromedriver.exe"  # raw string so the backslashes are kept literally
driver = webdriver.Chrome(path)  # path to chromedriver
'''
'''
driver.get("http://cineaste.co.kr/")  # url
time.sleep(0.5)
# first entry in the movies block that has no "+" comment counter
postclick = driver.find_element_by_xpath("//h3[.='영화이야기']/following::div[@class='widget-small-box'][1]//li[@class='ellipsis'][not(contains(.,'+'))]")
postclick.click()
driver.close()
Could you make an example with the site? I want to ignore the posts with comments and just click the ones without comments.

IMPORTXML and XPath in Google Sheets

I have a question about how to get the number of pages from the HTML below.
One of the problems is that I never know how many spans each book page will have. Here there are just 3, and the "pages" span is the 2nd in the list, but it can be at any position, so I can't just get it by using //p[@class='book']//text()[2]
I need to extract "300" using the Google Sheets IMPORTXML function.
<p class="book">
<span>condition: <b>good</b></span>
<br>
<span>pages: <b>300</b></span>
<br>
<span>color: <b>red</b></span>
<br>
</p>
I tried adding
[contains('pages: ')]
but with no success.
Any suggestions?
P.S. //p[@class='book']//text() by itself
returns
condition:
good
pages:
300
color:
red
So you look for a span that starts with 'pages:' and then take the value from its b child:
//p[@class='book']/span[starts-with(., 'pages:')]/b/text()
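The same lookup by label instead of by position can be sketched in Python. This is an illustration only: the br tags are written as <br/> so the standard-library XML parser accepts the fragment, and field_value is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Markup from the question, with <br> self-closed for the XML parser.
SAMPLE = """
<p class="book">
  <span>condition: <b>good</b></span>
  <br/>
  <span>pages: <b>300</b></span>
  <br/>
  <span>color: <b>red</b></span>
  <br/>
</p>
"""


def field_value(book, label):
    """Return the <b> text of the span whose text starts with label, or None."""
    for span in book.findall("span"):
        if (span.text or "").strip().startswith(label):
            bold = span.find("b")
            return bold.text if bold is not None else None
    return None


book = ET.fromstring(SAMPLE)
pages = field_value(book, "pages:")
print(pages)  # prints 300
```

Because the span is found by its label, the result stays the same no matter how many other spans come before it.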

xpath trying to select content inside a div except one, with text included

I'm trying to select the content inside a div. This div has some text inside and some additional tags, and I don't want to select the first div inside it. I was trying this selector, but it only gives me the tags, without the text:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/*[not(@class='subHeadline')]
the div that is giving me problems is this one:
<div class="viewHalfWidthSize">
.......
</div>
<div class="viewHalfWidthSize">
<div class="subHeadline firefinder-match">The Fine Print</div> <----------Except this div I want everything inside of this div!!
<strong class="firefinder-match">Validity: </strong>
Expires 27 June 2013.
<br class="firefinder-match">
<strong class="firefinder-match">Purchase: </strong>
Limit 1 per 2 people. May buy multiple as gifts.
<br class="firefinder-match">
<strong class="firefinder-match">Redemption: </strong>
Booking required online at
<a target="_blank" href="http://grouponbookings.co.uk/lautre-pied-march/" class="firefinder-match">http://grouponbookings.co.uk/lautre-pied-march/</a>
. 48-hour cancellation policy; late cancellation incurs a £30 surcharge per person.
<br class="firefinder-match">
<strong class="firefinder-match">Further information: </strong>
Valid Mon-Sun midday-2.45pm; Mon-Wed 6pm-10.45pm. Must be 18 or older, ID may be requested. Valid only on set tasting menu only; menu is dependent on market changes and seasonality and is subject to change. Max. two hours seating time. Discretionary service charge will be added to the bill based on original price. Original value verified 19 March 2013 at 9.01am.
<br class="firefinder-match">
<a target="_blank" href="http://www.groupon.co.uk/universal-fine-print" style="color: #339933;" class="firefinder-match">See the rules</a>
that apply to all deals.
</div>
The * matches element nodes and not text nodes. Try replacing * with node() to select all node types.
To break down what your XPath is doing:
You are looking anywhere in the document (//) for a div with class 'contentDealDescriptionFacts cf'.
Then you are looking for the 2nd div under that, and only if it also has the class viewHalfWidthSize. Note that this is not the 2nd div that has the class, but the div that is 2nd AND has that class. So if the divs with that class are the 3rd and 4th, it wouldn't match anything, because the 2nd div with the class has position() = 4. If you want the 2nd viewHalfWidthSize div, then you'll want [@class='viewHalfWidthSize'][position()=2].
Finally, you are returning a nodelist of all elements without the class subHeadline. If you change the * to node() then you will get a nodelist of all nodes.
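The element-versus-node distinction can be illustrated in Python, where the standard-library parser keeps text separately from child elements. This is a loose analogy to * versus node(), not an XPath evaluation: the fragment is a shortened, well-formed stand-in for the second viewHalfWidthSize div, and child_nodes and is_sub_headline are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the div from the question (<br> self-closed).
SAMPLE = (
    '<div class="viewHalfWidthSize">'
    '<div class="subHeadline firefinder-match">The Fine Print</div>'
    '<strong class="firefinder-match">Validity: </strong>'
    'Expires 27 June 2013.'
    '<br class="firefinder-match"/>'
    '<strong class="firefinder-match">Purchase: </strong>'
    'Limit 1 per 2 people.'
    '</div>'
)


def child_nodes(parent):
    """Yield child elements and text pieces in document order, like node()."""
    if parent.text:
        yield parent.text
    for child in parent:
        yield child
        if child.tail:
            yield child.tail


def is_sub_headline(node):
    """True for an element whose class list includes subHeadline."""
    return (not isinstance(node, str)
            and "subHeadline" in node.get("class", "").split())


container = ET.fromstring(SAMPLE)

# Like *: child elements only, so the loose text is lost.
elements_only = [c.tag for c in container if not is_sub_headline(c)]

# Like node(): elements and text together, minus the subHeadline div.
all_nodes = [n if isinstance(n, str) else f"<{n.tag}>"
             for n in child_nodes(container) if not is_sub_headline(n)]

print(elements_only)
print(all_nodes)
```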
The following XPath:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[not(name(.)='div' and position() = 1)]
should return what you want as long as the first child node is the div you want to ignore.
If you change it to:
//div[@class='contentDealDescriptionFacts cf']/div[@class='viewHalfWidthSize' and position()=2]/node()[position() != count(../div[1]/preceding-sibling::node()) + 1]
then it should work regardless. It returns your nodelist, then counts the nodes that precede the first div, and excludes the node whose position is one greater than that count (i.e. the first div itself).
As yet another alternative, you could keep your original solution but replace not(@class='subHeadline') with
not(contains(concat(' ', @class, ' '), ' subHeadline '))
which checks whether the class attribute contains subHeadline as a whole word, on the assumption that your classes are space-separated. This then matches your fragment, which has the class "subHeadline firefinder-match".
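The padding trick is easy to check with plain strings. Here is a small Python sketch comparing it with exact equality and with a naive substring test; the function names are illustrative only.

```python
def has_class_exact(class_attr, name):
    """Mirrors @class='name': the whole attribute must equal the name."""
    return class_attr == name


def has_class_token(class_attr, name):
    """Mirrors contains(concat(' ', @class, ' '), ' name ')."""
    return f" {name} " in f" {class_attr} "


attr = "subHeadline firefinder-match"

print(has_class_exact(attr, "subHeadline"))   # False: attribute has two classes
print(has_class_token(attr, "subHeadline"))   # True: matched as a whole token

# A plain substring test would also hit longer, unrelated class names.
print("subHeadline" in "subHeadlineExtra x")                 # True
print(has_class_token("subHeadlineExtra x", "subHeadline"))  # False
```

Padding both sides with spaces is what keeps a class such as subHeadlineExtra from counting as subHeadline.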
