scrapy xpath : selector with many <tr> <td> - xpath

Hello I want to ask a question
I scrape a website with xpath ,and the result is like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
now I need to use xpath to analyze this results again.
I want to save the first to address,the second to telephone,and the last one to map
But I can't get it.
Please guide me.Thank you!
Here is code,it's wrong. it will catch another thing.
store = sel.xpath("")
for s in store:
address = s.xpath("//tr/td[1]/text()").extract()
tel = s.xpath("//tr/td[2]/text()").extract()
map = s.xpath("//tr/td[3]/text()").extract()

As you can see in scrappy documentation to work with relative XPaths you have to use .// notation to extract the elements relative to the previous XPath, if not you're getting again all elements from the whole document. You can see this sample in the scrappy documentation that I referenced above:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case you code must be something like:
for s in store:
address = s.xpath(".//tr/td[1]/text()").extract()
tel = s.xpath(".//tr/td[2]/text()").extract()
map = s.xpath(".//tr/td[3]/text()").extract()
Hope this helps,

Related

Xpath get element above

suppose I have this structure:
<div class="a" attribute="foo">
<div class="b">
<span>Text Example</span>
</div>
</div>
In xpath, I would like to retrieve the value of the attribute "attribute" given I have the text inside: Text Example
If I use this xpath:
.//*[#class='a']//*[text()='Text Example']
It returns the element span, but I need the div.a, because I need to get the value of the attribute through Selenium WebDriver
Hey there are lot of ways by which you can figure it out.
So lets say Text Example is given, you can identify it using this text:-
//span[text()='Text Example']/../.. --> If you know its 2 level up
OR
//span[text()='Text Example']/ancestor::div[#class='a'] --> If you don't know how many level up this `div` is
Above 2 xpaths can be used if you only want to identify the element using Text Example, if you don't want to iterate through this text. There are simple ways to identify it directly:-
//div[#class='a']
From your question itself you have mentioned the answer for it
but I need the div.a,
try this
driver.findElement(By.cssSelector("div.a")).getAttribute("attribute");
use cssSelector for best result.
or else try the following xpath
//div[contains(#class, 'a')]
If you want attribute of div.a with it's descendant span which contains text something, try as below :-
driver.findElement(By.xpath("//div[#class = 'a' and descendant::span[text() = 'Text Example']]")).getAttribute("attribute");
Hope it helps..:)

accessing the text value of last nested <tr> element with no id or class hooks

I need to access the value of the 10th <td> element in the last row of a table. I can't use an ID as a hook because only the table has an ID. I've managed to make it work using the code below. Unfortunately, its static. I know I will always need the 10th <td> element, but I won't ever know which row it needs to be. I just know it needs to be the last row in the table. How would I replace "tr[6]" with the actual last <tr> dynamically? (this is probably really easy, but this is literally my first time doing anything with ruby).
page = Nokogiri::HTML(open(url))
test = page.css("tr[6]").map { |row|
row.css("td[10]").text}
puts test
You want to do:
page.at("tr:last td:eq(10)")
If you do not need to do anything else with the page you can actually make this a single line with
test = Nokogiri::HTML(open(url)).search("tr").last.search("td")[10].text
Otherwise (this will work):
page = Nokogiri::HTML(open(url))
test = page.search("tr").last.search("td")[10].text
puts test
Example:(Used a large table from another question on StackOverflow)
Nokogiri::HTML(open("http://en.wikipedia.org/wiki/Richard_Dreyfuss")).search('table')[1].search('tr').last.search('td').children.map{|c| c.text}.join(" ")
#=> "2013 Paranoia Francis Cassidy"
Is there a particular reason you want an Array with 1 element? My example will return a string but you could easily modify it to return an Array.
You can use CSS pseudo class selectors for this:
page.css("table#the-table-id tr:last-of-type td:nth-of-type(10)")
This first selects the <table> with the appropriate id, then selects the last <tr> child of that table, and then selects the 10th <td> of that <tr>. The result is an array of all matching elements, if youexpect there to be only one you could use at_css instead.
If you prefer XPath, you could use this:
page.xpath("//table[#id='the-table-id']/tr[last()]/td[10]")

Xpath/HtmlAgilityPack: Getting the specific attributes from href tag

I'm using the HtmlAgilityPack to parse href tags in an html file. The href tags look like this:
<h3 class="product-name">Super Cool Product</h3>
So far I can successfully pull out the url and the title together, and display it in a list. This is the main code I'm using to parse the html:
var linksOnPage = from lnks in document.DocumentNode.SelectNodes("//h3[#class='product-name']//a")
where
lnks.Attributes["href"] != null &&
lnks.InnerText.Trim().Length > 0
select new
{
Url = lnks.Attributes["href"].Value,
Text = lnks.InnerText
};
The code above gives me a result that looks like this:
Super Cool Product - http://www.somewebsite.com/blahblah
I'm trying to figure out how to pull out the name and url separately, and put them into separate strings, instead of pulling them out together and putting them into one string. I'm guessing there is some sort of Xpath notation I can use to do this. I would be extremely thankful if someone could lead me in the right direction
Thanks,
Miles

Parsing inner tags using Nokogiri

I'm stuck not being able to parse irregularly embedded html tags. Is there a way to remove all html tags from a node and retain all text?
I'm using the code:
rows = doc.search('//table[#id="table_1"]/tbody/tr')
details = rows.collect do |row|
detail = {}
[
[:word, 'td[1]/text()'],
[:meaning, 'td[6]/font'],
].collect do |name, xpath|
detail[name] = row.at_xpath(xpath).to_s.strip
end
detail
end
Using Xpath:
[:meaning, 'td[6]/font']
generates
:meaning: ! '<font size="3">asking for information specifying <font
color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
/what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>
On the other hand, using Xpath:
'td/font/text()'
generates
:meaning: asking for information specifying
thus ignoring all children of the node. What I want to achieve is this
:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you
This depends on what you need to extract. If you want all text in font elements, you can do it with the following xpath:
'td/font//text()'
It extracts all text nodes in font tags. If you want all text nodes in the cell, then:
'td//text()'
You can also call the text method on a Nokogiri node:
row.at_xpath(xpath).text
I added an answer for this same sort of question the other day. It's a very easy process.
Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby

Use Nokogiri to get all nodes in an element that contain a specific attribute name

I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
#doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = #doc.xpath("//*[#*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = #doc.xpath("//*[#*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = #doc.xpath("//*[#*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the #.
What you want is
elements = #doc.xpath("//*[#blah]")
This will give you all the elements that have an attribute named 'blah'.
You can use CSS selectors:
elements = #doc.css "[blah]"

Resources