Using ImportXML to pull URL and anchor

My goal is to scrape every link whose URL contains either the word "apple" or the word "pear", and for each scraped link I also need the anchor text.
At present I am using the following:
=IMPORTXML(A1,"//a/@href[contains(., 'apple')]")
Unfortunately, this only returns the links containing "apple". I still need to add the second condition ("pear") and scrape the anchor text.
Thank you for your help.

Try this:
=IMPORTXML(A1,"//a/@href[contains(., 'apple') or contains(., 'pear')]")
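The formula above returns only the URLs. For the anchor text, moving the condition onto the a element itself, as in //a[contains(@href,'apple') or contains(@href,'pear')], should make IMPORTXML return the text of the matching links instead, which can go in a second column. To check the logic outside Sheets, here is a minimal Python sketch. It assumes a small, well-formed sample snippet (the markup, the KEYWORDS tuple and the matching_links helper are all made up for illustration), and because the standard library's ElementTree has no contains(), the substring test is done in plain Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup; a real page would be fetched first.
SAMPLE = """
<div>
  <a href="/fruit/apple-pie">Apple pie</a>
  <a href="/fruit/pear-tart">Pear tart</a>
  <a href="/fruit/plum-jam">Plum jam</a>
</div>
"""

KEYWORDS = ("apple", "pear")


def matching_links(markup, keywords=KEYWORDS):
    """Return (href, anchor text) pairs for links whose href contains a keyword."""
    root = ET.fromstring(markup)
    pairs = []
    for a in root.iter("a"):
        href = a.get("href", "")
        # Plain-Python equivalent of:
        #   contains(@href, 'apple') or contains(@href, 'pear')
        if any(word in href for word in keywords):
            text = "".join(a.itertext()).strip()
            pairs.append((href, text))
    return pairs


for href, text in matching_links(SAMPLE):
    print(href, text)
```

Each pair keeps the URL and its anchor text together, which is the same pairing you get in Sheets by placing the two formulas side by side.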

Related

Why does IMPORTXML with XPATH return unexpected blank row in addition to expected result?

I'm importing into Google Sheets with IMPORTXML with the following XPATH:
=IMPORTXML(A2;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li")
A2 containing the URL (https://stt.wiki/wiki/20th_Century_Pistol).
From the website I want to import the list entries in the "Basic" column and "Crafted From" row of the table.
There are only two list entries in this section of the table:
"x1 Basic Security Codes" and
"x4 Basic Casing"
Therefore, I expected to get only those two list entries as rows in my sheet.
Instead, I got an additional blank row above those two entries. When I change "td[1]" to "td[3]" in the XPATH query however, there are no extra blanks.
I don't understand where the additional blank row is coming from and how I can avoid it.
Google Sheet with desired and actual result

Looking at the HTML at that URL, the ul tag contains two li tags, so I think your XPath is correct. Given the symptom, I suspect the sup tag might be affecting the result, although I'm not sure this is the direct cause. I would therefore propose adding an attribute condition on li to your XPath, as follows.
Modified XPath:
Please change your XPath as follows.
From:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li
To:
//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']
By adding [@style='white-space:nowrap'], only the li elements with style='white-space:nowrap' are retrieved.
Result:
The formula is =IMPORTXML(A1;"//*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[@style='white-space:nowrap']"). Please put the URL in cell A1.
Note:
Alternatively, you can use the XPath //*[@id='mw-content-text']/div/table[1]/tbody/tr[4]/td[1]/ul/li[position()>1].

To complement @Tanaike's very neat answer, here is another expression:
=IMPORTXML(A2;"//th[contains(.,'Crafted')]/following::td[1]//li[contains(@style,'white')]")
If a blank line is added, it's because Google Sheets parses an additional blank li element containing a @style attribute.
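To see why filtering on the style attribute (or skipping the first item) removes the blank row, here is a small Python sketch. The markup is hypothetical: the real page's HTML is not reproduced here, and the style value on the empty li is invented purely for illustration. ElementTree does not support position()>1, so a list slice stands in for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the description above: one empty li
# followed by the two real entries. The empty item's style is made up.
SAMPLE = """
<td>
  <ul>
    <li style="list-style:none"></li>
    <li style="white-space:nowrap">x1 Basic Security Codes</li>
    <li style="white-space:nowrap">x4 Basic Casing</li>
  </ul>
</td>
"""

root = ET.fromstring(SAMPLE)


def texts(nodes):
    """Collect the stripped text content of each node."""
    return ["".join(node.itertext()).strip() for node in nodes]


# Unfiltered: the empty li shows up as a blank entry.
all_items = texts(root.findall("./ul/li"))

# Same idea as li[@style='white-space:nowrap'].
nowrap_items = texts(root.findall("./ul/li[@style='white-space:nowrap']"))

# Stand-in for li[position()>1]: drop the first item.
skip_first = all_items[1:]

print(all_items)
print(nowrap_items)
print(skip_first)
```

The attribute filter is the more robust of the two if the blank item is not always first; the positional variant only works when it is.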

Scraping contents of title from an 'a' tag

I'm using nokogiri to scrape contents from https://www.nba.com/teams/warriors but I am unable to scrape the contents from the a tag.
I played around with adding and removing the classes but I receive an empty array.
base_url.css(".nba-player-index__trending-item a ['title']").map(&:text)
I would like to see: Jordan Bell.

You can get the title values with:
base_url.css(".nba-player-index__trending-item a:first").map{|item| item[:title] }

XPath Next Page navigation

I'm using Chrome Data Miner and have so far failed to extract the data from my query: http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free
How do I write the Next Element XPath for this website? I tried every web source I could find, and nothing worked.
Thanks in advance!

You could look for a tags (//a) whose descendant::text() starts with "Next" and then get the href attribute of that a element.
% xpquery -p HTML '//a[starts-with(descendant::text(), "Next")]/@href' 'http://www.allinlondon.co.uk/restaurants.php?type=name&rest=gluten+free'
href="http://www.allinlondon.co.uk/restaurants.php?type=name&tube=0&rest=glutenfree&region=0&cuisine=0&start=30&ordering=&expand="
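The same "find the link whose text starts with Next, then read its href" idea can be sketched in Python. This is only an illustration under assumptions: the pager markup and the next_href helper are hypothetical, and the text test joins all descendant text and strips whitespace, which is slightly more lenient than starts-with(descendant::text(), "Next") because that expression looks at the first text node only.

```python
import xml.etree.ElementTree as ET

# Hypothetical pager markup; the real site's HTML is not reproduced here.
SAMPLE = """
<div class="pager">
  <a href="/restaurants.php?start=0">Previous</a>
  <a href="/restaurants.php?start=30"><b>Next</b> page</a>
</div>
"""


def next_href(markup):
    """Return the href of the first link whose text starts with 'Next'."""
    root = ET.fromstring(markup)
    for a in root.iter("a"):
        # Join text from nested tags too, e.g. <a><b>Next</b> page</a>.
        text = "".join(a.itertext()).strip()
        if text.startswith("Next"):
            return a.get("href")
    return None  # no "Next" link: probably the last page


print(next_href(SAMPLE))
```

A scraper loop would typically call this on each fetched page and stop when it returns None.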

Referring to "title" of <a> in Xpath request

I'm making an Xpath as part of a scraping project I'm working on. However, the only defining feature of the text I want is the title attribute of the enclosing <a> tag like so:
This is what I want to scrape
Is it at all possible to refer to that title and create a path like this?
//tr/td[style='vertical-align:top']/a[title='Vacancy details']

Attributes in XPath expressions need to be prefixed with the @ symbol...
//tr/td/a[@title='Vacancy details']

//tr/td/a[@title='Vacancy details']/@title
You can grab just the title if that's all you want
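The effect of the @ prefix can be checked with Python's standard-library ElementTree, which understands this subset of XPath. The table markup below is a hypothetical reconstruction (the href values in particular are invented). Without the @, title is read as a child element named title rather than an attribute, so nothing matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; only the title value and link text come from the question.
SAMPLE = """
<table>
  <tr>
    <td style="vertical-align:top">
      <a title="Vacancy details" href="/jobs/1">This is what I want to scrape</a>
    </td>
    <td><a title="Other" href="/jobs/2">Something else</a></td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE)

# With @: match on the title attribute.
matches = root.findall(".//tr/td/a[@title='Vacancy details']")
link_texts = [a.text for a in matches]
titles = [a.get("title") for a in matches]

# Without @: looks for a child element <title>, so the result is empty.
no_at = root.findall(".//tr/td/a[title='Vacancy details']")

print(link_texts, titles, no_at)
```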

xpath - how to show the anchor text of a specified link

I would like to search for a link on a page by its domain name - possibly using contains()? And then only show the anchor text of that link.
I've been able to get all of the a tag using
//a[contains(text(), 'domain_name')]
but unable to retrieve just the anchor text. Can anybody help?

Just use the text() node:
//a[contains(@href, 'domain_name')]/text()
