I'm scraping a website that has JavaScript-based pagination, so I want to extract the page number from the href attribute. This is how the link looks:
<a href="javascript:AllerAPage('1', 'element_id');">Page 1</a>
Scrapy selectors support regular expressions:
sel.xpath('//a/@href').re(r"javascript:AllerAPage\('(\d+)',")
Note that the //a/@href xpath expression is just an example - you may have a different one.
Demo shows the work of the regex I've provided:
>>> import re
>>> s = "javascript:AllerAPage('1', 'element_id');"
>>> re.search(r"javascript:AllerAPage\('(\d+)',", s).group(1)
'1'
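For context, here is a minimal sketch of how that .re() call could sit inside a spider callback; the spider name, start URL, and yielded dict are placeholders, and response.xpath() exposes the same .re() API as sel.xpath():
import scrapy

class PagesSpider(scrapy.Spider):
    name = "pages"  # hypothetical spider name
    start_urls = ["http://example.com/listing"]  # placeholder URL

    def parse(self, response):
        # .re() applies the regex to each matched href and returns the
        # captured groups as a list of strings, e.g. ['1', '2', '3']
        for number in response.xpath('//a/@href').re(r"javascript:AllerAPage\('(\d+)',"):
            yield {"page": number}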
I'm trying to scrape the match statistics of a football game from yesterday at the following url:
https://www.flashscore.com/match/8S0QVm38/#match-statistics;0
I've written code just for WebDriver to select the stats I want and print them for me, so I can then see what I want to use. My code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get("https://www.flashscore.com/match/8S0QVm38/#match-statistics;0")
print(browser.find_elements_by_class_name("statText--homeValue"))
A list of elements is printed out and, to be honest, I don't know if this is what I was looking for, because what is returned doesn't show anything I can identify with what I'm looking at in the developer tools.
I'm trying to get all the numbers under statistics, like Possession and shots on target, but print returns a list of xpaths like this, where the session is the same but the element is always different:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="c53f5f3e-2c89-b34c-a639-ab50fbbf0c33")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="3e422b45-e26d-de44-8994-5f9788462ec4")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="9e110a54-4ecb-fb4b-9d8f-ccd1b210409d")>, ...]
Anyone know why this is and what I can do to get the actual numbers?
What you're getting are not XPaths, but a list of WebElement objects. To get the text of each, try
print([node.text for node in browser.find_elements_by_class_name("statText--homeValue")])
You have printed the WebElement objects themselves instead of their actual contents. For that you have to use .text on each element. Like,
elements = browser.find_elements_by_class_name("statText--homeValue")
for element in elements:
    print(element.text)
You can also opt for the list comprehension method shown in Andersson's answer.
Hope this helps! Cheers!
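Side note: the question already imports WebDriverWait and expected_conditions without using them. Since the Flashscore statistics appear to be rendered by JavaScript after the initial load, a sketch along these lines may be more reliable (the class name comes from the question; the 10-second timeout is an arbitrary choice):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Firefox()
browser.get("https://www.flashscore.com/match/8S0QVm38/#match-statistics;0")

# wait until at least one stat element is present before reading .text
elements = WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "statText--homeValue")))
print([element.text for element in elements])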
I want to extract all the functions listed inside the table at the link below: python functions list
I have tried using the Chrome developer console to get the exact xpath to use in spider.py, as below:
$x('//*[@id="built-in-functions"]/table[1]/tbody//a/@href')
but this returns a list of all href attributes (which I think is what the xpath expression refers to).
I need to extract the text from there, I believe, but appending /text() to the above xpath returns nothing. Can someone please help me extract the function names from the table?
I think this should do the trick
response.css('.docutils .reference .pre::text').extract()
a non-exact xpath equivalent of it (but that also works in this case) would be:
response.xpath('//table[contains(@class, "docutils")]//*[contains(@class, "reference")]//*[contains(@class, "pre")]/text()').extract()
Try this:
for td in response.css("#built-in-functions > table:nth-child(4) td"):
    print(td.css("span.pre::text").extract_first())
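If you just want to eyeball the results, the scrapy shell is handy. A quick sketch, assuming the question's link points at the built-in functions page of the Python docs (the URL is my guess, and the CSS classes may differ in newer builds of the docs):
scrapy shell https://docs.python.org/3/library/functions.html
>>> response.css('.docutils .reference .pre::text').extract()[:5]
# should print something like ['abs()', 'all()', 'any()', ...]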
I have a portion of html like below
<li><label>The Keyword:</label><span>The text</span></li>
I want to get the string "The Keyword: The text".
I know that I can get the xpath of the above html using Chrome inspect or FF Firebug, then hxs.select(xpath).extract(), then strip the html tags to get the string. However, that approach is not generic enough, since the xpath is not consistent across different pages.
Hence, I'm thinking of below approach:
Firstly, search for "The Keyword:" using
hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')
When I pprint it I get this output:
>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>
My question is how to get the wanted string: "The Keyword: The text". I am thinking of how to determine the xpath; if the xpath is known, then of course I can get the wanted string.
I am open to any solution other than scrapy's HtmlXPathSelector (e.g. lxml.html might have more features, but I am very new to it).
Thanks.
If I understand your question correctly, "following-sibling" is what you are looking for.
//*[contains(text(), "The Keyword:")]/following-sibling::span/text()
XPath Axes
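Since you mention being open to lxml.html, here is a minimal equivalent that uses the exact snippet from your question as input:
from lxml import html

snippet = '<li><label>The Keyword:</label><span>The text</span></li>'
tree = html.fromstring(snippet)

label = tree.xpath('//*[contains(text(), "The Keyword:")]')[0]
# following-sibling::span is the <span> immediately after the matched <label>
span_text = label.xpath('following-sibling::span/text()')[0]
print(label.text, span_text)  # The Keyword: The text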
I am trying to automate some tests using selenium webdriver. I am dealing with a third-party login provider (OAuth) who is using duplicate id's in their html. As a result I cannot "find" the input fields correctly. When I just select on an id, I get the wrong one.
This question has already been answered for jQuery. But I would like an answer (I am presuming using XPath) that will work in Selenium WebDriver.
On other questions about this issue, answers typically say "you should not have duplicate ids in html". Preaching to the choir there. I am not in control of the webpage in question. If I were, I would use class and id properly and just fix the problem that way.
Since I cannot do that, what options does XPath give me?
You can do it with driver.find_element_by_id. For example, if your duplicate "duplicate_id" is inside "div_ID", which is unique:
driver.find_element_by_id("div_ID").find_element_by_id("duplicate_id")
For another duplicate id under a different div:
driver.find_element_by_id("div_ID2").find_element_by_id("duplicate_id")
This XPath expression:
//div[@id='something']
selects all div elements in the XML document, the string value of whose id attribute is the string "something".
This XPath expression:
count(//div[@id='something'])
produces the number of the div elements selected by the first XPath expression.
And this XPath expression:
(//div[@id='something'])[3]
selects the third (in document order) div element that is selected by the first XPath expression above.
Generally:
(//div[@id='something'])[$k]
selects the $k-th such div element ($k must be substituted with a positive integer).
Equipped with this knowledge, one can get any specific div whose id attribute has string value "something".
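In Python Selenium (which the question is about), the indexed expression can be dropped straight into find_element. A sketch; the id value comes from the examples above, the URL is a placeholder, and the index is arbitrary:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL

# the parentheses matter: //div[@id='something'][3] without them selects
# each div that is the 3rd matching element among its own siblings
third_div = driver.find_element(By.XPATH, "(//div[@id='something'])[3]")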
Which language are you working in? Duplicate ids shouldn't be a problem, as you can grab virtually any attribute, not just the id tag, using xpath. The syntax will differ slightly in other languages (let me know if you want something other than Ruby), but this is how you do it:
driver.find_element(:xpath, "//input[@id='loginid']")
The way you go about constructing the xpath locator is the following.
From the html code you can pick any attribute:
<input id="gbqfq" class="gbqfif" type="text" value="" autocomplete="off" name="q">
Let's say, for example, that you want to construct your xpath from the html code above (Google's search box) using the name attribute. Your xpath will be:
driver.find_element(:xpath, "//input[@name='q']")
In other words, when the ids are the same, just grab another available attribute!
Improvement:
To avoid fragile xpath locators that depend on things like order in the XML document (which can change easily), you can use something even more robust: two xpath conditions instead of one. This can also be useful when dealing with html tags that are really similar. You can locate an element by two of its attributes like this:
driver.find_element(:id, 'amount') and driver.find_element(:xpath, "//input[@maxlength='50']")
or in pure xpath one liner if you prefer:
//input[@id='amount' and @maxlength='50']
Alternatively (and provided your xpath will only return one unique element) you can move one more step up the abstraction level, completely omitting the attribute values:
//input[@id and @maxlength]
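The snippets above are Ruby; for completeness, a rough Python equivalent of the two-attribute locator (attribute values borrowed from the example, URL a placeholder):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://example.com")  # placeholder URL

# both attribute conditions must hold on the same element
amount_input = driver.find_element(By.XPATH, "//input[@id='amount' and @maxlength='50']")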
It's not listed at http://selenium-python.readthedocs.io/locating-elements.html, but I'm able to access a method find_elements_by_id.
This returns a list of all elements with the duplicate ID.
links = browser.find_elements_by_id("link")
for link in links:
    print(link.get_attribute("href"))
You should use driver.findElement(By.xpath()), but while locating the element with Firebug you should select the absolute path for the particular element instead of the relative path; this is how you will get the element even with duplicate IDs.
From a table I am looking at on the web in Firefox, this is an xpath selector:
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tbody/x:tr[2]/x:td[2]/x:a
So I removed /x:tbody because that was added by Firefox. But how can it be generalised to get all the links in the table that have the same base xpath? The only obvious difference is that tr increases by 1 for each link in the table.
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[3]/x:td[2]/x:a
If there are successive tables of links on the page, the only apparent difference is that div increases from 1 to 2. So for the second table's link:
id('ls-page')/x:div[5]/x:div[2]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
/x:div[5]/x:div[1]
becomes
/x:div[5]/x:div[2]
1) Is there a method or process to use to generalise an xpath selector?
2) For each table, do I have to create two separate generalised functions, one to retrieve tables and one to retrieve links from tables?
Note I am referring to this site: live nrl stats. I have been reading the Scrapy documentation and the BeautifulSoup documentation, but am open to any suggestions regarding tooling as I am just learning.
XPath is a query language; I don't know of any automated means of generalising queries. It's something you have to work out for yourself based on the document structure.
My preferred library is lxml.etree. Here's a simple working example of a query that should return all of the match links.
I've saved the html to the working directory to avoid hitting the website frequently while testing.
from lxml import etree
import os

local_file = 'season2012.html'
url = "http://live.nrlstats.com/nrl/season2012.html"

# cache the page locally to avoid hitting the website on every test run
if not os.path.exists(local_file):
    from urllib2 import urlopen
    data = urlopen(url).read()
    with open(local_file, 'w') as f:
        f.write(data)
else:
    with open(local_file, 'r') as f:
        data = f.read()

doc = etree.HTML(data)
# each match link sits in the second cell of a row in a table of class "tablel"
for link in doc.xpath('//table[@class="tablel"]/tr/td[2]/a'):
    print "%s\t%s" % (link.attrib['href'], link.text)
Yielding:
/matches/nrl/match15300.html Melbourne v Newcastle
/matches/nrl/match15291.html Brisbane v St George Illawarra
/matches/nrl/match15313.html Penrith v Cronulla
/matches/nrl/match15312.html Parramatta v Manly
/matches/nrl/match15311.html Sydney Roosters v Warriors
[truncated]
I'd suggest working with the ElementTree object (doc in this example) in the interactive Python interpreter to test your queries, and having a look at other XPath questions and answers on SO for working query examples to aid your learning.
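As a small follow-up to question 2 (Python 2, continuing from the doc object above): you don't need separate per-table functions, because one class-based query already spans every table of that class. A sketch:
def iter_table_links(doc, table_class="tablel"):
    # one query covers all tables with the given class, in document order
    query = '//table[@class="%s"]/tr/td[2]/a' % table_class
    for link in doc.xpath(query):
        yield link.attrib['href'], link.text

for href, text in iter_table_links(doc):
    print "%s\t%s" % (href, text)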