What's the right xpath expression to get this data? - xpath

I'm trying to use the IMPORTXML function in Google Sheets to import data from the webpage http://chesstempo.com/chess-statistics/brainwolf.
The relevant section of code looks like this:
<h3>Stats for blitz tactics</h3>
Rating: 2420.5 (RD: 291.15) (Best Active Rating: 2462 Worst Active Rating: 2221)
<br>
What's the right xpath expression to extract the data?

Get the h3 tag by text and then get the first following text sibling:
//h3[. = "Stats for blitz tactics"]/following-sibling::text()[1]
Demo (using google chrome console):
$x('//h3[. = "Stats for blitz tactics"]/following-sibling::text()[1]')
[
"Rating: 2420.5 (RD: 291.15) (Best Active Rating: 2462 Worst Active Rating: 2221)"
]
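In Google Sheets, the same expression can be dropped straight into IMPORTXML (a sketch, assuming the page serves this markup statically; note the single quotes inside the formula):
=IMPORTXML("http://chesstempo.com/chess-statistics/brainwolf", "//h3[.='Stats for blitz tactics']/following-sibling::text()[1]")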

Related

Unsure why xpath query coming back empty

I'm new to scrapy and am experimenting with the shell, attempting to retrieve products from the following URL: https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7
The following is my code - I'm not sure why the final query is coming back empty:
$ scrapy shell
fetch("https://www.newbalance.co.nz/men/shoes/running/trail/?prefn1=sizeRefinement&prefv1=men_shoes_7")
div_product_lists = response.xpath('//div[@id="product-lists"]')
ul_product_list_main = div_product_lists.xpath('//ul[@id="product-list-main"]')
for li_tile in ul_product_list_main.xpath('//li[@class="tile"]'):
... print li_tile.xpath('//div[@class="product"]').extract()
...
[]
[]
If I inspect the page using a property inspector, then I am seeing data for the div (with class product), so am not sure why this is coming back empty. Any help would be appreciated.
The problem here is that xpath doesn't understand that class="product product-tile " means "This element has 2 classes, product and product-tile".
In an xpath selector, the class attribute is just a string, like any other.
Knowing that, you can search for the entire class string:
>>> li_tile.xpath('.//div[@class="product product-tile "]')
[<Selector xpath='.//div[@class="product product-tile "]' data='<div class="product product-tile " tabin'>]
If you want to find all the elements that have the "product" class, the easiest way to do so is using a css selector:
>>> li_tile.css('div.product')
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' product ')]" data='<div class="product product-tile " tabin'>]
As you can see by looking at the resulting Selector, this is a bit more complicated to achieve using only xpath.
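For completeness, the same one-class-among-many test can be written out by hand; this mirrors the XPath Scrapy generated above:
>>> li_tile.xpath(".//div[contains(concat(' ', normalize-space(@class), ' '), ' product ')]")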
The data you want to extract is more readily available in another div, the one with class product-top-spacer.
For instance, you can get all divs with class="product-top-spacer" as follows:
ts = response.xpath('//div[@class="product-top-spacer"]')
and then check the name and price of the first extracted div:
ts[0].xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
>> 'Leadville v3'
ts[0].xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
>> '$260.00'
All items can be viewed by iterating over ts:
for t in ts:
    itname = t.xpath('descendant::p[@class="product-name"]/a/text()').extract()[0]
    itprice = t.xpath('descendant::div[@class="product-pricing"]/text()').extract()[0].strip()
    itprice = ' '.join(itprice.split())  # some cleaning
    print(itname + ", " + itprice)

IMPORTXML and XPath in Google Sheets

I have a question: how do I get the page count from here?
One of the problems is that I never know how many spans there will be on each book page - here we have just 3, and the "pages" span happens to be number [2] in the list, but it could be any number, so I can't just get it with //p[@class='book']//text()[2]
I need to extract "300" using the Google Sheets IMPORTXML function
<p class="book">
<span>condition: <b>good</b></span>
<br>
<span>pages: <b>300</b></span>
<br>
<span>color: <b>red</b></span>
<br>
</p>
I tried adding
[contains('pages: ')]
but with no success here
Any suggestions?
p.s. //p[@class='book']//text() by itself
returns
condition:
good
pages:
300
color:
red
So look for a span that starts with 'pages:' and then take the value from its b child:
//p[@class='book']/span[starts-with(., 'pages:')]/b/text()
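In a Google Sheets formula that would look something like this (the URL is just a placeholder for the actual book page):
=IMPORTXML("https://example.com/book", "//p[@class='book']/span[starts-with(., 'pages:')]/b/text()")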

XPath for Google Results: <em> and description without date

I have 3 questions:
1) How can I get, with XPath, the bold-marked <em> text in the Google results? If there's no <em>, nothing should be shown.
2) =XPathOnUrl("https://www.google.de/search?q=KEYWORD&num=10";"//span[@class='st']") gives me the Google description, but how can I get the description without the <span class="f"> date?
3) I get the description with � in place of "ä", "ö" and "ü". How can these letters be displayed correctly?
HTML DOM code:
<span class="st">
<span class="f">18.11.2009 - </span>
This Thursday 19th November
<em>Moonshine</em>
turns 4 years old. I'm proud to say that's 4 years of Malaysian acts pretty much every month. We've ...
</span>
The code I used for this issue:
driver.get("https://www.google.de/?gws_rd=ssl#q=moonshine+site:blogspot.com&num=10");
List<WebElement> ele = driver.findElements(By.xpath("//span[@class='f']/following-sibling::text()"));
ele.toString();
for(int i=0;i<ele.size();i++)
{
    System.out.println(ele.get(i).getText());
}
This code throws an InvalidSelectorException:
The result of the xpath expression "//span[@class='f']/following-sibling::text()" is: [object Text]. It should be an element.
In theory the following xpath would capture only the text, i.e. the description:
//span[@class='f']/following-sibling::text()
However, you can't actually capture that text with Selenium, because of an open Selenium issue:
[selenium-developer-activity] Issue 5459 in selenium: InvalidSelectorError: The result of the xpath expression is: [object Text]
You can find the issue details at the link below:
http://grokbase.com/t/gg/selenium-developer-activity/13475y4cgj/issue-5459-in-selenium-invalidselectorerror-the-result-of-the-xpath-expression-is-object-text
Use the XPath below instead. It will return all the dates present on the page:
//span[@class='f']/text()
If you just want the description text, use the XPath below:
//span[@class='st' and not(@class='f')]/text()
Hope it will help you :)
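Since Selenium can't return bare text nodes, one common workaround is to take the full text of each span.st element and strip out the span.f date. A minimal sketch in Python (the selectors mirror the markup above; the driver setup and search URL are assumptions):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.google.de/search?q=moonshine+site:blogspot.com&num=10")
for st in driver.find_elements(By.CSS_SELECTOR, "span.st"):
    description = st.text
    # the date sits in a nested span.f; remove it from the combined text
    dates = st.find_elements(By.CSS_SELECTOR, "span.f")
    if dates:
        description = description.replace(dates[0].text, "", 1).strip()
    print(description)
driver.quit()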

scrapy xpath : selector with many <tr> <td>

Hello, I want to ask a question.
I scraped a website with xpath, and the result looks like this:
[u'<tr>\r\n
<td>address1</td>\r\n
<td>phone1</td>\r\n
<td>map1</td>\r\n
</tr>',
u'<tr>\r\n
<td>address1</td>\r\n
<td>telephone1</td>\r\n
<td>map1</td>\r\n
</tr>'...
u'<tr>\r\n
<td>address100</td>\r\n
<td>telephone100</td>\r\n
<td>map100</td>\r\n
</tr>']
Now I need to use xpath to analyze these results again.
I want to save the first field to address, the second to telephone, and the last one to map.
But I can't get it right.
Please guide me. Thank you!
Here is my code; it's wrong - it catches something else.
store = sel.xpath("")
for s in store:
    address = s.xpath("//tr/td[1]/text()").extract()
    tel = s.xpath("//tr/td[2]/text()").extract()
    map = s.xpath("//tr/td[3]/text()").extract()
As you can see in the Scrapy documentation, to work with relative XPaths you have to use the .// notation to extract elements relative to the previous XPath; otherwise you get all matching elements from the whole document again. The documentation gives this example:
For example, suppose you want to extract all <p> elements inside <div> elements. First, you would get all <div> elements:
divs = response.xpath('//div')
At first, you may be tempted to use the following approach, which is wrong, as it actually extracts all <p> elements from the document, not only those inside <div> elements:
for p in divs.xpath('//p'): # this is wrong - gets all <p> from the whole document
This is the proper way to do it (note the dot prefixing the .//p XPath):
for p in divs.xpath('.//p'): # extracts all <p> inside
So I think in your case your code should be something like:
for s in store:
    address = s.xpath(".//tr/td[1]/text()").extract()
    tel = s.xpath(".//tr/td[2]/text()").extract()
    map = s.xpath(".//tr/td[3]/text()").extract()
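Putting it together, here is a minimal self-contained sketch (assuming the rows were originally selected with an expression like '//tr'; the table below just mirrors the question's data). When each selector in store is already a <tr>, the sub-queries can be written relative to the row with './':
from scrapy.selector import Selector

html = """
<table>
  <tr><td>address1</td><td>phone1</td><td>map1</td></tr>
  <tr><td>address100</td><td>telephone100</td><td>map100</td></tr>
</table>
"""
store = Selector(text=html).xpath('//tr')
for s in store:
    address = s.xpath('./td[1]/text()').extract()
    tel = s.xpath('./td[2]/text()').extract()
    map = s.xpath('./td[3]/text()').extract()
    print(address, tel, map)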
Hope this helps,

How do I grab content based on an outside tag match?

I am trying to organize a list of links and names based on a tag that sits outside the group where the links and names reside. It's set up like so:
<h4>Volkswagen</h4>
<ul>
<li>beetle</li>
</ul>
<h4>Chevy</h4>
<ul>
<li>Volt / Electric</li>
</ul>
What I need is the result in the following format, with the name as a link eventually, but I can do that later if I can just get the items organized properly.
Each car brand could have multiple models of varying counts. I would need to organize them by car brand:
Volkswagen
Beetle Link Beetle
Jetta Link Jetta
Chevy
Volt Link Volt / Electric
S10 Link S10
I can get the list of brands with no problem. I am just having a hard time associating the batch of models with each brand, as the <h4> tags aren't nested, so I don't know how to associate them with the following <ul> list of cars.
I prefer to dive straight to each car, then back out to extract the car's brand:
cars = Hash.new { |h, k| h[k] = [] }
doc.xpath('//ul/li/a').each do |car|
  brand = car.at('../../preceding-sibling::h4[1]').text
  cars[brand] << {link: car['href'], name: car.text}
end
Note that the hash is initialized with a block specifying that the default value is an array. This allows appending hashes (via <<) as shown. The XPath ../../preceding-sibling::h4[1] says: go back up to the ul level and look back to the first preceding h4. This is the corresponding brand for the car.
Output:
{"Volkswagen"=>[
{:link=>"http://beetle.cars.com", :name=>"beetle"}
# others here
],
"Chevy"=>[
{:link=>"http://volt.cars.com", :name=>"Volt / Electric"}
# others here
]
}
I find this technique nice and simple, with just a single loop. Not everyone likes this style though.
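For reference, the same dive-in-then-back-out approach can be sketched in Python with lxml (a hypothetical equivalent of the Ruby above; the href values are taken from the sample output):
from collections import defaultdict
from lxml import html

page = """
<h4>Volkswagen</h4>
<ul><li><a href="http://beetle.cars.com">beetle</a></li></ul>
<h4>Chevy</h4>
<ul><li><a href="http://volt.cars.com">Volt / Electric</a></li></ul>
"""
doc = html.fromstring(page)
cars = defaultdict(list)
for car in doc.xpath('//ul/li/a'):
    # climb to the <ul>, then take the nearest preceding <h4> as the brand
    brand = car.xpath('../../preceding-sibling::h4[1]')[0].text_content()
    cars[brand].append({'link': car.get('href'), 'name': car.text_content()})
print(dict(cars))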
