How to select specific elements in span from @class xpath response - xpath

I am learning to use XPath in Scrapy, but the HTML I am trying to scrape is quite complicated.
I have tried selecting things with HTML, CSS and XPath selectors, and I have got as far as this:
response.xpath('//span[starts-with(@class, "Animal-")]').getall()
which returns:
[u'<span class="Animal-1" title="Dogs" legs="4" tail="true"></span>', u'<span class="Animal-7" title="Birds" beak="true"></span>', u'<span class="Animal-24" title="Elephants"></span>']
I used a separate script to pull out just the contents of the title attribute to get me going, but I know that this is a hacky solution.
How can I return only the following:
Dogs
Birds
Elephants

XPath is very flexible and worth learning in depth; the code below will get you your result.
response.xpath('//span[starts-with(@class, "Animal-")]/@title').getall()
Cheers!!
Also, you can test the above XPath here and play around to learn more. The approach I have used works for all tag attributes; for example, to extract the href from all links, use //a/@href
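As a quick check of that expression outside of Scrapy, here is a minimal sketch using lxml (assuming lxml is installed; the HTML snippet is the one from the question):

```python
from lxml import html

# Minimal reproduction of the spans returned in the question.
snippet = """
<div>
  <span class="Animal-1" title="Dogs" legs="4" tail="true"></span>
  <span class="Animal-7" title="Birds" beak="true"></span>
  <span class="Animal-24" title="Elephants"></span>
</div>
"""

tree = html.fromstring(snippet)

# Selecting /@title returns the attribute values themselves as strings,
# which mirrors what Scrapy's response.xpath(...).getall() gives you.
titles = tree.xpath('//span[starts-with(@class, "Animal-")]/@title')
print(titles)  # ['Dogs', 'Birds', 'Elephants']
```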

Related

Setting the correct xpath

I'm trying to set the right XPath for use with RSelenium, but I'm not very experienced in this area, so any help would be much appreciated.
Since I'm not allowed to post pictures yet, I have tried to add a link to a screenshot of the HTML:
The html
I need R to scrape the dates (28-10-2020 - 13-11-2020), but so far I have not been able to set the correct XPath when using html_nodes.
I'm trying to scrape from sites like this one: https://www.boligsiden.dk/adresse/topperne-9-3-33-2620-albertslund-01650532___9__3__33
I usually do this in Python rather than R.
As you can see in this image, when you right-click on the element concerned you get a drop-down menu with an XPath to the element.
That said, the site layout can change, so a full XPath is only a good option in the short run; I prefer a relative XPath such as driver.find_element_by_xpath('//button[contains(text(), "Login")]').click()
In your case that would be find_element_by_xpath("//*[contains(@class, 'u-pb-4 u-block')]")
I hope this helps; the approach is mostly the same across different languages.
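To show how a contains(@class, ...) match behaves, here is a small Python sketch with lxml; only the class name u-pb-4 u-block comes from the answer above, and the surrounding markup is invented:

```python
from lxml import html

# Invented stand-in for the listing page; only the class name
# "u-pb-4 u-block" is taken from the answer above.
snippet = """
<div>
  <span class="u-pb-4 u-block">28-10-2020 - 13-11-2020</span>
</div>
"""

tree = html.fromstring(snippet)

# contains(@class, ...) matches elements whose class attribute contains
# the given substring, so it keeps working when extra classes are added.
dates = tree.xpath("//*[contains(@class, 'u-pb-4 u-block')]/text()")
print(dates)  # ['28-10-2020 - 13-11-2020']
```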

how to select the second <p> element using Xpath

I am trying to scrape full reviews from this webpage. (Full reviews - after clicking the 'Read More' button). This I am doing using RSelenium. I am able to select and extract text from the first <p> element, using the code
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][1]")
which selects the shorter, truncated review text.
But not able to extract full text reviews using the code
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@id][2]")
or
reviewNodes <- mybrowser$findElements(using = 'xpath', "//p[@itemprop = 'reviewBody']")
These show blank list elements. I don't know what is wrong. Please help me.
Drop the double slash and try to use the explicit descendant axis:
/descendant::p[#id][2]
(see the note from the W3C document on XPath that I mentioned in this answer)
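The difference matters because in //p[@id][2] the positional predicate applies per parent, not across the whole document. A short sketch in Python with lxml (the markup is invented; each review <p> sits in its own container, as on many review pages):

```python
from lxml import html

# Invented markup: each <p> is the only p-with-id inside its own parent.
doc = html.fromstring("""
<body>
  <div><p id="r1">short review</p></div>
  <div><p id="r2">full review</p></div>
</body>
""")

# //p[@id][2] means "each p that is the SECOND p-with-id of its parent";
# here every parent has only one such p, so nothing matches.
print(doc.xpath('//p[@id][2]'))   # []

# /descendant::p[@id][2] means "the second p-with-id in the whole document".
second = doc.xpath('/descendant::p[@id][2]')
print(second[0].text)             # full review
```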
As you're dealing with a list, you should first find the list items, e.g. using the CSS selector
div.srm
Based on these elements, you can then search further inside the list items, e.g. using the CSS selector
p[itemprop='reviewBody']
Of course you can also do it in one single expression, but that is not quite as neat imho:
div.srm p[itemprop='reviewBody']
Or in XPath (which I wouldn't recommend):
//div[@class='srm']//p[@itemprop='reviewBody']
If neither of these work for you, then the problem must be somewhere else.
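To illustrate the scoped search, here is a short Python sketch with lxml using the XPath variant above (the markup is invented; only srm and reviewBody come from the answer):

```python
from lxml import html

# Invented review markup; only "srm" and "reviewBody" come from the answer.
doc = html.fromstring("""
<body>
  <div class="srm"><p itemprop="reviewBody">great product</p></div>
  <div class="ad"><p itemprop="reviewBody">sponsored text</p></div>
</body>
""")

# Scoping to div.srm first keeps out lookalike <p> elements elsewhere.
reviews = doc.xpath("//div[@class='srm']//p[@itemprop='reviewBody']/text()")
print(reviews)  # ['great product']
```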

Cannot narrow down an xpath using mechanize in ruby to extract a table

I'm trying to get the data contained after this highlighted td tag:
(screenshot taken from firefox developer tools)
but I don't understand how I'm going to get there. I tried to use xpath
page.parser.xpath("//table//tbody//tr//td//ul//form//table//tbody//tr/td")
but this doesn't work, and I assume it's because I'm not identifying anything. I am not sure how to identify anything, though, since some of the elements have no ids or names. So the question is: how do I reach this tag?
Probably because the tbody isn't really there. There's a lot of discussion on SO about that particular issue.
Here's what I would probably do:
td = page.at '#tF0 .txt2'
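A rough XPath equivalent of that CSS selector, sketched in Python with lxml (the id tF0 and class txt2 come from the answer above; the table markup is invented):

```python
from lxml import html

# Invented table; only the id "tF0" and class "txt2" come from the answer.
# There is no <tbody> in the source even though browser dev tools show one,
# which is why XPaths copied from dev tools often fail to match.
doc = html.fromstring("""
<table id="tF0">
  <tr><td class="txt2">the data we want</td></tr>
</table>
""")

# The CSS selector "#tF0 .txt2" translates to this XPath
# (for an element whose class attribute is exactly "txt2").
td = doc.xpath("//*[@id='tF0']//*[@class='txt2']")[0]
print(td.text)  # the data we want
```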

How do I select a href="mailto" tag on a page?

I can use either XPath or CSS, it doesn't matter... but there are other a tags on the page. I just want the first a tag whose href starts with mailto:, or any one of them (there is actually just one, so the order doesn't matter).
You could use an XPath starts-with function:
mailto = doc.xpath('//a[starts-with(@href, "mailto:")]').first
The standardese is particularly thick in the XPath spec so hopefully the example is clear enough.
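The same idea in a runnable Python sketch with lxml (the page markup is invented; the expression is the one from the answer):

```python
from lxml import html

# Invented page with several <a> tags; only one is a mailto link.
doc = html.fromstring("""
<body>
  <a href="/home">Home</a>
  <a href="mailto:jane@example.com">Contact</a>
  <a href="/about">About</a>
</body>
""")

# starts-with(@href, "mailto:") filters out the ordinary links.
mailto = doc.xpath('//a[starts-with(@href, "mailto:")]')[0]
print(mailto.get('href'))  # mailto:jane@example.com
```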

extract xpath

I want to retrieve the XPath of an attribute (for example, the "brand" of a product from a retailer website).
One way of doing it is using add-ons like XPather or XPath Checker for Firefox, opening up the website in Firefox and right-clicking the desired attribute I am interested in. This is OK, but I want to capture this information for many attributes, and right-clicking each and every attribute may be time-consuming. The other problem I have is that some attributes I may be interested in will be there for one product, while other attributes may only be there for some other product, so I will have to go to that product and then do it manually again.
Is there an automated or programmatic way of retrieving the XPath of the desired attributes from a website, rather than having to do this manually?
Note that not all websites serve valid XML that you can run XPath on...
That said, you should check out HTML parsers that will allow you to use XPath on HTML even when it is not valid XML.
Since you did not specify the technology you are working with, I'll suggest the .NET HTML Agility Pack; if you need others, search for questions dealing with this here on SO.
The solution I use for this kind of thing is to write an xpath something like this:
//*[text()="Brand"]/following-sibling::*
//*[text()="Color"]/following-sibling::*
//*[text()="Size"]/following-sibling::*
//*[text()="Material"]/following-sibling::*
It works by finding all elements (labels) with the text you want and then looking to the next sibling in the HTML. Without a specific URL to see I can't help any further.
This is a generalised version; you can make more specific versions by replacing the asterisks with tag names, and you can navigate differently by replacing the following-sibling axis with another one.
I use XPaths in import.io to make APIs for this kind of thing all the time. It's just a matter of finding an XPath that's generic enough to find the HTML no matter where it is on the page, but specific enough to get the right data.
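A runnable sketch of that label-based lookup in Python with lxml (the product markup is invented; the expressions are from the answer above):

```python
from lxml import html

# Invented product-detail markup: label/value pairs as sibling elements.
doc = html.fromstring("""
<body>
  <ul>
    <li><span>Brand</span><span>Acme</span></li>
    <li><span>Color</span><span>Red</span></li>
  </ul>
</body>
""")

# Find the element whose text is the label, then step to its next sibling,
# which holds the value -- no ids or classes needed.
brand = doc.xpath('//*[text()="Brand"]/following-sibling::*')[0].text
color = doc.xpath('//*[text()="Color"]/following-sibling::*')[0].text
print(brand, color)  # Acme Red
```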
