My goal is to scrape every link whose URL contains either the word "apple" or the word "pear", and for each scraped link I also need the anchor text.
At present I am using the following:
=IMPORTXML(A1,"//a/@href[contains(., 'apple')]")
Unfortunately, this only returns the links containing "apple". I still need to add a second condition for "pear" and scrape the anchor text.
Thank you for your help.
Try this:
=IMPORTXML(A1,"//a/@href[contains(., 'apple') or contains(., 'pear')]")
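The formula above returns only the URLs. Selecting the a element itself rather than its href, for example //a[contains(@href, 'apple') or contains(@href, 'pear')], should return the anchor text instead. To check the logic outside Sheets, here is a minimal Python sketch using only the standard library. The sample markup and the function name links_matching are made up for illustration, and ElementTree only accepts well-formed markup, so a real page would need an HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup standing in for the page in A1.
SAMPLE = """
<html><body>
  <a href="/fruit/apple-pie">Apple pie</a>
  <a href="/fruit/pear-tart">Pear tart</a>
  <a href="/fruit/plum-jam">Plum jam</a>
</body></html>
"""


def links_matching(markup, words):
    """Return (href, anchor text) for every <a> whose href contains any word."""
    root = ET.fromstring(markup.strip())
    results = []
    for a in root.iter("a"):
        href = a.get("href", "")
        if any(word in href for word in words):
            # itertext() also collects text nested in child tags such as <b>.
            results.append((href, "".join(a.itertext()).strip()))
    return results


fruit_links = links_matching(SAMPLE, ["apple", "pear"])
for href, text in fruit_links:
    print(href, text)
```

Like contains() in XPath, the substring test is case-sensitive, so "Apple" in a URL would not match "apple".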
I'm importing into Google Sheets with IMPORTXML with the following XPATH:
=IMPORTXML(A2;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li")
A2 containing the URL (https://stt.wiki/wiki/20th_Century_Pistol).
From the website I want to import the list entries in the "Basic" column and "Crafted From" row of the table.
There are only two list entries in this section of the table:
"x1 Basic Security Codes" and
"x4 Basic Casing"
Therefore, I expected to get only those two list entries as rows in my sheet.
Instead, I got an additional blank row above those two entries. When I change "td[1]" to "td[3]" in the XPATH query however, there are no extra blanks.
I don't understand where the additional blank row is coming from and how I can avoid it.
Google Sheet with desired and actual result
When I looked at the HTML of the URL, there were 2 li tags in the ul tag, so I think that your xpath is correct. Given your issue, I suspected that the sup tag might be affecting the result, but I'm not sure whether this is the direct cause. So I would like to propose adding the attribute of the li to your xpath as follows.
Modified xpath:
Please modify your xpath as follows.
From:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li
To:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']
By adding [@style='white-space:nowrap'], only the li elements with style='white-space:nowrap' are retrieved.
Result:
The formula is =IMPORTXML(A1;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']"). Please put the URL in cell A1.
Note:
Also, you can use the xpath //*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[position()>1].
To complete @Tanaike's very neat answer, here is another expression:
=IMPORTXML(A2;"//th[contains(.,'Crafted')]/following::td[1]//li[contains(@style,'white')]")
If a blank line is added, it's because Google Sheets parses an additional blank li element containing a @style attribute.
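The effect of filtering list items on their style attribute can be reproduced in a small Python sketch with the standard library. The markup below is a hypothetical reduction of the table cell (one empty li plus the two real entries), not a copy of the real page, and the helper name texts is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical cell markup: one empty <li> plus the two real entries.
CELL = """
<td><ul>
  <li></li>
  <li style="white-space:nowrap">x1 Basic Security Codes</li>
  <li style="white-space:nowrap">x4 Basic Casing</li>
</ul></td>
"""

cell = ET.fromstring(CELL.strip())


def texts(nodes):
    """Return the stripped text content of each node."""
    return ["".join(node.itertext()).strip() for node in nodes]


# Without a predicate, the empty <li> shows up as a blank entry.
all_items = texts(cell.findall("./ul/li"))

# With the attribute predicate, only the styled entries remain.
styled_items = texts(cell.findall("./ul/li[@style='white-space:nowrap']"))

print(all_items)
print(styled_items)
```

ElementTree supports only a subset of XPath, so expressions such as li[position()>1] or contains(@style, 'white') would need a full XPath engine.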
I'm using nokogiri to scrape contents from https://www.nba.com/teams/warriors but I am unable to scrape the contents from the a tag.
I played around with adding and removing the classes but I receive an empty array.
base_url.css(".nba-player-index__trending-item a ['title']").map(&:text)
I would like to see: Jordan Bell.
You can get the title values with:
base_url.css(".nba-player-index__trending-item a:first").map{|item| item[:title] }
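The key point is that the name lives in the title attribute of the link, not in its text, which is why mapping over the text gives nothing useful. A minimal Python sketch of the same distinction, using only the standard library: the markup, including the href value, is a hypothetical stand-in for the real page, and the exact class match only works here because the element carries a single class.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the link wraps an image, so it has no text of its own.
ROSTER = """
<div>
  <div class="nba-player-index__trending-item">
    <a title="Jordan Bell" href="/players/jordan/bell"><img alt="" /></a>
  </div>
</div>
"""

root = ET.fromstring(ROSTER.strip())
links = root.findall(".//*[@class='nba-player-index__trending-item']/a")

# The text content of each link is empty...
link_texts = ["".join(a.itertext()).strip() for a in links]

# ...while the title attribute holds the player name.
titles = [a.get("title") for a in links]

print(link_texts, titles)
```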
I'm using Chrome Data Miner, and so far, failing to extract the data from my query: http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free
How do I write the Next Element XPath for this website? I tried every source I could find online, but nothing worked.
Thanks in advance!
You could look for a tags (//a) whose descendant::text() starts with "Next" and then get the href attribute of that a element.
% xpquery -p HTML '//a[starts-with(descendant::text(), "Next")]/@href' 'http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free'
href="http://www.allinlondon.co.uk/restaurants.php?type=name&tube=0&rest=glutenfree&region=0&cuisine=0&start=30&ordering=&expand="
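The same idea (find the link whose text starts with "Next", then read its href) can be sketched in Python with the standard library. The pager markup and the function name next_href are assumptions made for illustration, not the real page. Note one difference: the XPath above tests only the first descendant text node, while this sketch tests the whole joined text of the link.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the real page structure may differ.
PAGER = """
<div>
  <a href="/restaurants.php?start=0">Previous</a>
  <a href="/restaurants.php?start=30"><b>Next</b> page</a>
</div>
"""


def next_href(markup):
    """Return the href of the first link whose text starts with 'Next'."""
    root = ET.fromstring(markup.strip())
    for a in root.iter("a"):
        # Join nested text nodes so <b>Next</b> inside the link still counts.
        if "".join(a.itertext()).strip().startswith("Next"):
            return a.get("href")
    return None  # no "Next" link: probably the last page


print(next_href(PAGER))
```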
I'm making an Xpath as part of a scraping project I'm working on. However, the only defining feature of the text I want is the title attribute of the enclosing <a> tag like so:
<a title="Vacancy details">This is what I want to scrape</a>
Is it at all possible to refer to that title and create a path like this?
//tr/td[style='vertical-align:top']/a[title='Vacancy details']
Attributes in XPath expressions need to be prefixed with the @ symbol:
//tr/td/a[@title='Vacancy details']
//tr/td/a[@title='Vacancy details']/@title
The second expression grabs just the title, if that's all you want.
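As a quick check of the attribute predicate, here is a minimal Python sketch using the standard library, whose ElementTree module supports this form of predicate. The table markup, link targets, and link texts are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical table with two links that differ only in their title attribute.
TABLE = """
<table>
  <tr><td><a title="Vacancy details" href="/jobs/1">Editor</a></td></tr>
  <tr><td><a title="Company profile" href="/org/1">Acme</a></td></tr>
</table>
"""

root = ET.fromstring(TABLE.strip())

# Same predicate as the XPath above: match on the title attribute.
matches = root.findall(".//tr/td/a[@title='Vacancy details']")

link_texts = [a.text for a in matches]       # the text to scrape
titles = [a.get("title") for a in matches]   # the attribute itself

print(link_texts, titles)
```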
I would like to search for a link on a page by its domain name - possibly using contains()? And then only show the anchor text of that link.
I've been able to get all of the a tags using
//a[contains(text(), 'domain_name')]
but unable to retrieve just the anchor text. Can anybody help?
Just use the text() node:
//a[contains(@href, 'domain_name')]/text()
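One caveat: contains() on the href matches the domain name anywhere in the URL, including the path or query string. If that matters, a stricter check can compare the parsed hostname instead. The following Python sketch uses only the standard library; the sample markup, the domains, and the function name anchor_texts_for_domain are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical markup: the second link only mentions the domain in its query.
PAGE = """
<p>
  <a href="https://example.com/about">About Example</a>
  <a href="https://other.test/?ref=example.com">Other site</a>
</p>
"""


def anchor_texts_for_domain(markup, domain):
    """Return the anchor text of links whose hostname is domain or a subdomain."""
    root = ET.fromstring(markup.strip())
    texts = []
    for a in root.iter("a"):
        host = urlparse(a.get("href", "")).hostname or ""
        if host == domain or host.endswith("." + domain):
            texts.append("".join(a.itertext()).strip())
    return texts


print(anchor_texts_for_domain(PAGE, "example.com"))
```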