I have some HTML code:
<li><h3>Number Theory - Even Factors</h3>
<p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?</p>
<ol class="xyz">
<li>1183</li>
<li>1200</li>
<li>1050</li>
<li>840</li>
</ol>
<ul class="exp">
<li class="grey fleft">
<span class="qlabs_tooltip_bottom qlabs_tooltip_style_33" style="cursor:pointer;">
<span>
<strong>Correct Answer</strong>
Choice (A).</br>1183
</span>
Correct answer
</span>
</li>
<li class="primary fleft">
Explanatory Answer
</li>
<li class="grey1 fleft">Factors - Even numbers</li>
<li class="orange flrt">Medium</li>
</ul>
</li>
In the HTML snippet above, I am trying to extract the contents of <p lang="title">. Notice how <sup></sup> and <sub></sub> tags are used inside it.
My XPath expression .//p[@lang="title"]/text() does not retrieve the sub and sup contents. How do I get the output below?
Desired Output
Number N = 2<sup>6</sup>*5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?
XPath
You can simply get innerHTML with node() as below:
//p[@lang="title"]/node()
Note that it returns a list of nodes (text nodes and child elements), not a single string.
Python
You can get the required innerHTML with the Python code below (BeautifulSoup 4):
from bs4 import BeautifulSoup

def innerHTML(element):
    """Function that receives element and returns its innerHTML"""
    return element.decode_contents(formatter="html")

html = """<html>
<head>...
<body>...
Your HTML source code
..."""
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find('p', {"lang": "title"})
print(innerHTML(paragraph))
Output:
Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup> * 10<sup>7</sup>; how many factors of N are even numbers?
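If BeautifulSoup is not available, roughly the same result can be sketched with the standard library's xml.etree.ElementTree. This is only a sketch: it assumes the fragment is well-formed XML (real-world HTML often is not), and the inner_markup helper name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A trimmed, well-formed copy of the fragment from the question
fragment = (
    '<li><h3>Number Theory - Even Factors</h3>'
    '<p lang="title">Number N = 2<sup>6</sup> * 5<sup>5</sup> * 7<sup>6</sup>'
    ' * 10<sup>7</sup>; how many factors of N are even numbers?</p></li>'
)

def inner_markup(element):
    """Hypothetical helper: return the markup between an element's tags."""
    # Start with the text before the first child element
    parts = [element.text or ""]
    # ET.tostring() serializes each child together with its trailing text (tail)
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)

root = ET.fromstring(fragment)
paragraph = root.find(".//p[@lang='title']")
inner = inner_markup(paragraph)
print(inner)
```

ElementTree only understands a small subset of XPath, so for messy HTML a real HTML parser is still the safer choice.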
Related
I am banging my head against a wall here; it's probably something simple that I am missing.
I have a HTML un-ordered list (ul) like the following:
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>
Using XPath (version 1, compatible with Scrapy), how would I get the text out of all the li elements, including the nested ones?
Thanks for any help!
If you need xpath, use response.xpath('//ul//li/text()').extract().
If you can use css, it is shorter: response.css('ul li::text').extract()
Try with a simple xpath selector:
from scrapy.selector import Selector
selector = Selector(text="""
<ul>
<li>Elm 1</li>
<li>Elm 2 - with children
<ul>
<li>Nested Elm</li>
<li>Another Elm</li>
</ul>
</li>
</ul>""")
print(selector.xpath('//li/text()').extract())
This outputs:
['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']
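The list above still contains whitespace-only strings and trailing newlines, because text() also returns the whitespace between tags. A minimal post-processing sketch, using the output above as plain input data (the variable names are illustrative):

```python
# The strings returned by the selector, copied from the output above
texts = ['Elm 1', 'Elm 2 - with children\n ', 'Nested Elm', 'Another Elm', '\n ']

# Trim each string and drop the ones that were only whitespace
cleaned = [t.strip() for t in texts if t.strip()]
print(cleaned)
```

This keeps the document order of the items and only removes the layout whitespace.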
<ul>
<li class="xyz">
<div class="divClass">
<span class="ContentItem---status---dL0iS">
<span>Success</span>
</span>
<p class="ContentItem---title---37IqA">
<span>Test Check</span>
: Please display the text
</p>
</div>
</li>
<li class="xyz">
<div class="divClass">
<span class="ContentItem---status---dL0iS">
<span>Not COMPLETED</span>
</span>
<p class="ContentItem---title---37IqA">
<span>Knowledge</span> A Team
</p>
</div>
</li>
.... and so on
</ul>
This is my HTML structure. I have the text Test Check inside a span and : Please display the text inside a paragraph tag.
I need to check whether my structure contains the complete text Test Check: Please display the text.
I have tried several approaches but couldn't work out the complete path. This is what I tried:
//span[text()='Test Check']/p[text()=': Please display the text']
Can you please provide the XPath for this?
One possible solution is to search the given HTML text with BeautifulSoup and walk up from the matching text to its parent tags. I hope this solves your problem.
import re
from bs4 import BeautifulSoup

def get_tag_if_present(html_text):
    soup_obj = BeautifulSoup(html_text, "html.parser")
    # Find every text node that contains "Test Check"
    test_check = soup_obj.find_all(string=re.compile(r"Test Check"))
    result_val = "NOT FOUND"
    for each_value in test_check:
        # The matching text is expected to sit directly inside a span ...
        parent_tag_span = each_value.parent
        if parent_tag_span.name == "span":
            # ... and that span inside the p holding the rest of the text
            parent_p_tag = parent_tag_span.parent
            if parent_p_tag.name == "p" and "Please display the text" in parent_p_tag.get_text():
                result_val = parent_p_tag
                break
    return result_val
The returned result_val is the p tag that contains the text, or the string NOT FOUND if no such element exists.
This assumes that the two pieces of text live in a "span" tag and its parent "p" tag respectively; feel free to remove those conditions if you want to match the text anywhere in the given HTML.
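The same check can also be sketched with only the standard library, assuming the markup is well-formed enough for xml.etree.ElementTree to parse. The helper name find_paragraph and the trimmed sample markup are illustrative, not part of the original page. The span's own text is compared with the first part and the text following the span (its tail) with the second part.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed version of the structure from the question
html = """<ul>
  <li class="xyz"><div class="divClass">
    <p class="ContentItem---title---37IqA">
      <span>Test Check</span>
      : Please display the text
    </p>
  </div></li>
  <li class="xyz"><div class="divClass">
    <p class="ContentItem---title---37IqA">
      <span>Knowledge</span> A Team
    </p>
  </div></li>
</ul>"""

def find_paragraph(root, label, remainder):
    """Hypothetical helper: return the first p whose span holds `label`
    and whose text right after that span is `remainder`, else None."""
    for p in root.iter("p"):
        for span in p.findall("span"):
            span_text = (span.text or "").strip()
            following = (span.tail or "").strip()
            if span_text == label and following == remainder:
                return p
    return None

root = ET.fromstring(html)
match = find_paragraph(root, "Test Check", ": Please display the text")
missing = find_paragraph(root, "Knowledge", ": Please display the text")
print(match is not None, missing is None)
```

Stripping both parts means the line breaks and indentation between the span and the colon do not affect the comparison.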
I'm using nokogiri to scrape web pages. The structure of the page is made of an unordered list containing multiple list items each of which has a link, an image and text, all contained in a div.
I'm trying to find a clean way to extract the elements in each list item so I can have each li contained in an array or hash like so:
li[0] = ['Acme co 1', 'image1.png', 'Customer 1 details']
li[1] = ['Acme co 2', 'image2.png', 'Customer 2 details']
At the moment I get all the elements in one go then store them in separate arrays. Is there a better, more idiomatic way of doing this?
This is the code at the moment:
data = Nokogiri::HTML(html)
images = []
names = []
data.css('ul li img').each {|l| images << l}
data.css('ul li a').each {|a| names << a.text }
This is the html I'm working from:
<ul class="customers">
<li>
<div>
Acme co 1
<div class="customer-image">
<img src="image1.png"/>
</div>
<div class=" customer-description">
Customer 1 details
</div>
</div>
</li>
<li>
<div>
Acme co 2
<div class="customer-image">
<img src="image2.png"/>
</div>
<div class=" customer-description">
Customer 2 details
</div>
</div>
</li>
</ul>
Thanks
Assuming the code you have is giving you what you want, I wouldn't try to rewrite anything significant. You can be more brief and idiomatic by replacing your #each methods with #map:
data = Nokogiri::HTML(html)
images = data.css('ul li img')
names = data.css('ul li a').map(&:text)
This simplifies your code slightly, but your original version wasn't too bad.
My simplification may not generalise if you are, for example, scraping images from multiple regions on the page. In that case, reverting to something like your original may be fine.
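To get one array per list item, as the question asks, the usual approach is to loop over the li elements and query inside each one, rather than collecting every image and every name separately. The thread itself uses Ruby and Nokogiri; the sketch below shows the idea in Python with the standard library's xml.etree.ElementTree, on a tidied copy of the markup (class attributes without the leading space, distinct image names). Treat it as an illustration of the grouping, not as Nokogiri code.

```python
import xml.etree.ElementTree as ET

# Tidied, well-formed copy of the customer list from the question
html = """<ul class="customers">
  <li><div>Acme co 1
    <div class="customer-image"><img src="image1.png"/></div>
    <div class="customer-description">Customer 1 details</div>
  </div></li>
  <li><div>Acme co 2
    <div class="customer-image"><img src="image2.png"/></div>
    <div class="customer-description">Customer 2 details</div>
  </div></li>
</ul>"""

root = ET.fromstring(html)
rows = []
for li in root.findall("li"):
    outer = li.find("div")
    # The customer name is the text before the first nested div
    name = (outer.text or "").strip()
    image = outer.find(".//img").get("src")
    details = outer.find("div[@class='customer-description']").text.strip()
    rows.append([name, image, details])

print(rows)
```

Because each row is built from a single li, the name, image and description cannot drift out of step when one item is missing a field.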
Is it possible to select the first element in each row which matches a specific class? This is the HTML structure at the moment.
<ul>
<li>
<article>
<time class="published-date"></time>
<p>Text</p>
</article>
</li>
<li>
<article>
<time class="published-date"></time>
<p>Text</p>
</article>
</li>
</ul>
I was wondering what would be the best and most specific query string in terms of getting the time element with the class published-date in each row?
If there is more than one time element with class="published-date" in each row, you need to use indexing (1-based):
//ul/li/article/time[@class = "published-date"][1]
If there is only a single time element in every row, simply do:
//ul/li/article/time[@class = "published-date"]
Using the XPath selector....
//time[@class="published-date"]
...will select all time nodes with the class published-date. XPathFiddle
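To illustrate the "first match per row" idea, here is a small sketch using Python's standard xml.etree.ElementTree. The sample dates and the updated-date class are made up for the example. ElementTree's positional predicates do not follow full XPath semantics, so the sketch relies on find(), which returns the first matching child, instead of the [1] index.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: some rows carry more than one time element
html = """<ul>
  <li><article>
    <time class="published-date">2020-01-01</time>
    <time class="published-date">2020-01-02</time>
    <p>Text</p>
  </article></li>
  <li><article>
    <time class="updated-date">2020-02-01</time>
    <time class="published-date">2020-02-02</time>
    <p>Text</p>
  </article></li>
</ul>"""

root = ET.fromstring(html)
first_dates = []
for article in root.findall("li/article"):
    # find() returns the first matching child in document order, or None
    first = article.find("time[@class='published-date']")
    if first is not None:
        first_dates.append(first.text)

print(first_dates)
```

Scoping the query to each article is what makes the result "first per row" rather than "first in the document".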
I am trying to access a li element using indexing:
<div class="item-list">
<ul>
<li class="views-row views-row-1 views-row-odd views-row-first">
<li class="views-row views-row-2 views-row-even">
<li class="views-row views-row-3 views-row-odd">
<li class="views-row views-row-4 views-row-even">
<li class="views-row views-row-5 views-row-odd">
<li class="views-row views-row-6 views-row-even">
<li class="views-row views-row-7 views-row-odd">
<li class="views-row views-row-8 views-row-even">
<li class="views-row views-row-9 views-row-odd views-row-last">
</ul>
</div>
The code I am using is
@browser.div(:class,'item-list').ul.li(:index => 2)
These are elements on a page, and I will be using a loop to access each one. I thought indexing would solve the problem, but when I execute my code I get the following error:
expected #<Watir::LI:0x2c555f80 located=false selector={:index=>2, :tag_name=>"li"}> to exist (RSpec::Expectations::ExpectationNotMetError)
How can I access these elements using indexing?
If you've got class-naming that nice, forget indexing! Do a partial match on the "views-row" parameter:
@browser.li(:class => /views-row-1/)
This can easily be parameterized for looping (although I don't know what you're doing with the information so this loop will not be very exciting).
x = 0
until x==9
  x+=1
  puts @browser.li(:class => /views-row-#{x}/).text
end
You could also blindly loop through the li's contained in your div if you'd like:
@browser.div(:class,'item-list').lis.each do |li|
  puts li.text
end
According to the Watir wiki, Watir supports the :index method on the li element. So unless it is a bug in watir-webdriver, I think the index should work.
You may want to try the watir mailing list to see if this is a problem for others.