I'm trying to scrape the match statistics of a football game from yesterday at the following URL:
https://www.flashscore.com/match/8S0QVm38/#match-statistics;0
I've written some code that just has WebDriver select the stats I want and print them, so I can then see what I want to use. My code is:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get("https://www.flashscore.com/match/8S0QVm38/#match-statistics;0")
print(browser.find_elements_by_class_name("statText--homeValue"))
A list of elements is printed out, and to be honest I don't know if this is what I was looking for, because what's returned shows nothing I can match up with what I'm seeing in the developer tools.
I'm trying to get all the numbers under statistics like Possession and shots on target, but print returns a list of xpaths like this, where the session is the same but the element is always different:
[<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="c53f5f3e-2c89-b34c-a639-ab50fbbf0c33")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="3e422b45-e26d-de44-8994-5f9788462ec4")>,
<selenium.webdriver.firefox.webelement.FirefoxWebElement (session="da88ca87-e318-934e-ba75-dca1d652cd37", element="9e110a54-4ecb-fb4b-9d8f-ccd1b210409d")>, ...]
Anyone know why this is and what I can do to get the actual numbers?
What you're getting are not XPaths, but a list of WebElement objects. To get the text from each, try:
print([node.text for node in browser.find_elements_by_class_name("statText--homeValue")])
You have printed the element objects themselves instead of their actual contents. To get the contents, you have to call .text on each element, like:
elements = browser.find_elements_by_class_name("statText--homeValue")
for element in elements:
    print(element.text)
You can also opt for the list comprehension approach shown in Andersson's answer.
Hope this helps! Cheers!
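Since the statistics tab on this page is rendered with JavaScript, an explicit wait before reading the values can make this more reliable. A minimal sketch, reusing the WebDriverWait and expected_conditions imports that were already in the question (note that newer Selenium versions replace find_elements_by_class_name with find_elements(By.CLASS_NAME, ...)):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
browser = webdriver.Firefox()
browser.get("https://www.flashscore.com/match/8S0QVm38/#match-statistics;0")
# Wait up to 10 seconds for at least one stat value to be present in the DOM
WebDriverWait(browser, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "statText--homeValue")))
print([node.text for node in browser.find_elements_by_class_name("statText--homeValue")])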
Related
I have an RDF/XML file and would like to find all the elements between the start and end of a particular tag. How could I do that?
For example:
<cim:BaseVoltage rdf:ID="_0526B48408F744919E7C03672FCD0D71">
<cim:BaseVoltage.isDC>false</cim:BaseVoltage.isDC>
<cim:BaseVoltage.nominalVoltage>400.000000000</cim:BaseVoltage.nominalVoltage>
</cim:BaseVoltage>
I would like to extract the values BaseVoltage.isDC and BaseVoltage.nominalVoltage, since they are between the start and end tags of <cim:BaseVoltage>. As mentioned, this is just an example and I have many more such start and end tags.
I thought of doing it using Xpath, but am not really sure how.
Parsing the RDF/XML file with XPath seemed like a really bad idea for this question; rdflib makes it very easy.
import rdflib
from rdflib.namespace import Namespace
BASE = Namespace('http://example.org/')
graph = rdflib.Graph()
graph.parse('rdf.xml', format='xml', publicID=BASE)
# graph[subject] yields every (predicate, object) pair attached to that subject
for p, o in graph[BASE['#_0526B48408F744919E7C03672FCD0D71']]:
    print(p, o)
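If you do want plain XPath instead, here is a minimal sketch using lxml; the cim namespace URI below is a placeholder and must be copied from the xmlns declaration in your actual file:
from lxml import etree
# Placeholder namespace URIs - take the cim URI from your file's xmlns declaration
ns = {'cim': 'http://iec.ch/TC57/CIM-schema-cim#',
      'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
doc = etree.parse('rdf.xml')
# Select every child element of the BaseVoltage element with the given rdf:ID
for child in doc.xpath('//cim:BaseVoltage[@rdf:ID="_0526B48408F744919E7C03672FCD0D71"]/*',
                       namespaces=ns):
    print(etree.QName(child).localname, child.text)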
I am trying to extract every title of this mailing list while registering how many replies each thread has.
According to Firebug, the XPath to the <ul> that contains all the titles is:
/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul
However, if I paste this directly in Scrapy Shell, it will yield an empty list:
scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
response.xpath('/html/body/table[2]/tbody/tr[1]/td[2]/table/tbody/tr/td/ul')
[]
After some trial and error (I couldn't figure out from the documentation any way to list the immediate sub-elements of a given Selector; please let me know if you know of one), I figured out that the "tbody" elements (which the browser inserts but which aren't in the raw HTML) made the XPath fail. By removing them, I was able to navigate as far as /td:
email_threads = response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td')
However, if I now attempt to reach the "ul", it does not work:
email_threads.xpath('/ul')
[]
Now, what confuses me the most is that running:
response.xpath('/html/body/table[2]/tr[1]/td[2]/table/tr/td//ul')
will give me the uls, but not in the same order as they appear on the website; it skips threads and returns them in a different order. Furthermore, it seems impossible to count the number of replies per thread this way.
What am I missing here? It's been a while since I've used Scrapy, but I don't recall it being this hard to figure out, and for whatever reason tutorials aren't turning up on Bing or Google for me.
I have never used Firebug, but looking at the HTML page you refer to, I'd say that the following XPath expression will give you all top-level threads:
//li[not(ancestor::li) and ./a/@name]
Starting from each list element, you then need to count the number of list-item descendants to get the number of replies to any given thread.
Using the Scrapy shell, this results in:
> scrapy shell http://seclists.org/fulldisclosure/2002/Jul/index.html
In [1]: threads = response.xpath('//li[not(ancestor::li) and ./a/@name]')
In [2]: for thread in threads:
   ...:     print thread, len(thread.xpath('descendant::li'))
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="0" href="0">Testing</a> <em'> 0
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="1" href="1">full disclosure'> 4
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="3" href="3">The Death Of TC'> 1
<Selector xpath='//li[not(ancestor::li) and ./a/@name]' data=u'<li><a name="7" href="7">Re: Announcing '> 24
[...]
Regarding your question on how to list all sub-elements from a given selector, you just need to realize that the result of running an XPath query on a selector is a SelectorList where each list element implements the Selector interface. So you can simply use XPath again to e.g. list all the children:
In [3]: thread.xpath('child::*')
Out[3]:
[<Selector xpath='child::*' data=u'<a name="309" href="309">it\'s all about '>,
<Selector xpath='child::*' data=u'<em>Florin Andrei (Jul 31)</em>'>,
<Selector xpath='child::*' data=u'<ul>\n<li><a name="313" href="313">it\'s a'>]
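Putting the two parts together, a minimal sketch (assuming the same page structure) that pairs each thread title with its reply count:
In [4]: for thread in threads:
   ...:     # The first <a> child holds the thread title
   ...:     title = thread.xpath('./a/text()').extract()[0]
   ...:     # Every nested <li> under the thread is one reply
   ...:     replies = len(thread.xpath('descendant::li'))
   ...:     print title, replies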
I want to scrape data from the table on this webpage http://www.changning.sh.cn/jact/front/front_mailpublist.action?sysid=9
Before writing a spider, I tested my XPath expressions in the Scrapy shell, but ran into one problem: XPath can't get any text out of the table.
Say I want to extract the text LM2015122827458 in the upper-left cell. I used response.xpath("//tr[@class = 'tr_css']/td[1]/text()").extract(), but only an empty list was returned. I tried alternative XPath expressions, including ones inspired by Chrome's "Copy XPath", but had no luck. I even used response.xpath("//text()") to extract all the text on the page to see if LM2015122827458 is there. It wasn't. So, is this a page that XPath can't deal with? Or did I do something wrong? Thank you very much!
This XPath is working fine for me:
//tr[@class='tr_css'][1]/td[@class='text-center'][1]
The table is most likely rendered by JavaScript, so the text is not in the raw HTML that Scrapy downloads; a real browser executes those scripts, which is why the Java code below works fine for me:
driver.get("http://www.changning.sh.cn/jact/front/front_mailpublist.action?sysid=9");
driver.manage().timeouts().implicitlyWait(30, TimeUnit.SECONDS);
String a = driver.findElement(By.xpath("//tr[@class='tr_css'][1]/td[@class='text-center'][1]")).getText();
System.out.println(a);
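For reference, a rough Python equivalent of the Java snippet above (same XPath; an explicit wait is used here in place of the implicit one):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://www.changning.sh.cn/jact/front/front_mailpublist.action?sysid=9")
# Wait up to 30 seconds for the page's scripts to render the first cell
cell = WebDriverWait(driver, 30).until(EC.presence_of_element_located(
    (By.XPATH, "//tr[@class='tr_css'][1]/td[@class='text-center'][1]")))
print(cell.text)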
Hope it will help you :)
I have a portion of html like below
<li><label>The Keyword:</label><span>The text</span></li>
I want to get the string "The Keyword: The text".
I know that I can get the XPath of the above HTML using Chrome's inspector or Firefox's Firebug, then call hxs.select(xpath).extract() and strip the HTML tags to get the string. However, that approach is not generic enough, since the XPath is not consistent across different pages.
Hence, I'm thinking of below approach:
Firstly, search for "The Keyword:" using
hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')
When I pprint it, I get this back:
>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>
My question is how to get the wanted string "The Keyword: The text". I am trying to work out the right XPath; once the XPath is known, I can of course get the wanted string.
I am open to any solution other than Scrapy's HtmlXPathSelector (e.g. lxml.html might have more features, but I am very new to it).
Thanks.
If I understand your question correctly, "following-sibling" is what you are looking for.
//*[contains(text(), "The Keyword:")]/following-sibling::span/text()
See XPath Axes for the full list of available axes.
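Since you mentioned being open to lxml.html, here is a minimal self-contained sketch of the same idea that joins the label text with its sibling span's text:
import lxml.html
html = '<li><label>The Keyword:</label><span>The text</span></li>'
doc = lxml.html.fromstring(html)
# Find the element containing the keyword, then its following <span> sibling
label = doc.xpath('//*[contains(text(), "The Keyword:")]')[0]
span = label.xpath('following-sibling::span')[0]
print('%s %s' % (label.text_content(), span.text_content()))  # The Keyword: The text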
This is an XPath selector for a table I am looking at on the web in Firefox:
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tbody/x:tr[2]/x:td[2]/x:a
So I remove /x:tbody, because that was added by Firefox. But how can this be generalised to get all the links in the table that share the same base XPath? The only obvious difference is that tr increases by 1 for each link in the table:
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
id('ls-page')/x:div[5]/x:div[1]/x:div[2]/x:table/x:tr[3]/x:td[2]/x:a
If there are successive tables of links on the page, the only difference appears to be that div increases from 1 to 2. So a link in the second table is:
id('ls-page')/x:div[5]/x:div[2]/x:div[2]/x:table/x:tr[2]/x:td[2]/x:a
/x:div[5]/x:div[1]
becomes
/x:div[5]/x:div[2]
1) Is there a method or process to use to generalise an XPath selector?
2) For each table, do I have to create two separate generalised functions, one to retrieve tables and one to retrieve links from tables?
Note: I am referring to this site, live nrl stats. I have been reading the Scrapy and BeautifulSoup documentation, but am open to any suggestions regarding tooling, as I am just learning.
XPath is a query language; I don't know of any automated means of generalizing queries. It's something you have to work out for yourself based on the document structure.
My preferred library is lxml.etree. Here's a simple working example of a query that should return you all of the match links.
I've saved the html to the working directory to avoid hitting the website frequently while testing.
from lxml import etree
import os

local_file = 'season2012.html'
url = "http://live.nrlstats.com/nrl/season2012.html"

# Cache the page locally to avoid hitting the website on every test run
if not os.path.exists(local_file):
    from urllib2 import urlopen
    data = urlopen(url).read()
    with open(local_file, 'w') as f:
        f.write(data)
else:
    with open(local_file, 'r') as f:
        data = f.read()

doc = etree.HTML(data)
# lxml parses the raw HTML, so there is no browser-inserted tbody to skip
for link in doc.xpath('//table[@class="tablel"]/tr/td[2]/a'):
    print "%s\t%s" % (link.attrib['href'], link.text)
Yielding:
/matches/nrl/match15300.html Melbourne v Newcastle
/matches/nrl/match15291.html Brisbane v St George Illawarra
/matches/nrl/match15313.html Penrith v Cronulla
/matches/nrl/match15312.html Parramatta v Manly
/matches/nrl/match15311.html Sydney Roosters v Warriors
[truncated]
I'd suggest working with the ElementTree object (doc in this example) in the interactive Python interpreter to test your queries, and having a look at other XPath questions and answers on SO for working query examples to aid your learning.
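On question 2, you don't need two separate functions; you can iterate table by table and then query the links relative to each table. A minimal sketch, reusing the doc object from the example above:
# Enumerate each links table, then query its links relative to that table
for i, table in enumerate(doc.xpath('//table[@class="tablel"]'), 1):
    print "Table %d" % i
    for link in table.xpath('.//tr/td[2]/a'):
        print "\t%s\t%s" % (link.attrib['href'], link.text)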