Extract text from nested div using XPath

I would like to get the text inside the h2 tag
<p>Mi. 5. Dezember 2018</p>
<h2>Slam: Jägerschlacht</h2>
<p>Einlass 19:30 Uhr // Beginn 20:30 Uhr</p>
<p>Tickets: 4€</p>
out of this page with XPath. The problem is I can't find the right XPath through all the divs. All I get when I use this Python code
from lxml import html
import requests
page = requests.get("https://www.gruener-jaeger-stpauli.de/")
tree = html.fromstring(page.content)
text = tree.xpath("/html/body/div/div/div/div/div/div/div[1]/div/div[2]/div/div/div[1]/div/a[1]/h2")
print (text)
is [<Element h2 at 0x25ae6341a98>]

It is better to use a handwritten XPath instead of a generated path.
Try it like this to get the first h2 element, selecting all of its text-node children with /text():
//a[contains(@class, 'event_box_gj')][1]/h2/text()
or drop the [1] to get all of them.
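A minimal runnable sketch of that answer, using an inline snippet instead of the live page (the markup below is an assumption reconstructed from the question):

```python
from lxml import html

# Stand-in for the live page; the event_box_gj structure is assumed
snippet = """
<div>
  <a class="event_box_gj" href="#">
    <p>Mi. 5. Dezember 2018</p>
    <h2>Slam: Jägerschlacht</h2>
    <p>Einlass 19:30 Uhr // Beginn 20:30 Uhr</p>
  </a>
</div>
"""
tree = html.fromstring(snippet)
# /text() selects the h2's text nodes rather than the element itself,
# so the result is a list of strings instead of Element objects
titles = tree.xpath("//a[contains(@class, 'event_box_gj')][1]/h2/text()")
print(titles)  # ['Slam: Jägerschlacht']
```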

Scrapy xpath selector doesn't retrieve the element

From this url: https://www.basketball-reference.com/boxscores/202110190LAL.html, I want to extract the text from this xpath:
//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]
But, the element I got is None.
In Scrapy shell I use this:
>>> text = response.xpath("//div[@id='div_four_factors']/table/tbody/tr[1]/td[1]/text()").get()
>>> print(text)
None
I have tried to write the right XPath for the element I want to retrieve, but I get no result.
That's because the table (and, it looks like, all the tables on that page) is loaded using JavaScript after the page itself has loaded, so the XPath doesn't exist in the response HTML you are parsing.
You can see this if you open the page in a web browser, right-click, and select "View page source" or something like that. Alternatively, you could just print(response.text), but it won't be formatted and will be difficult to read.
However, it does look like a copy of the table's HTML is commented out adjacent to where it is rendered, which means you can do this:
In [1]: import re
In [2]: pat = re.compile(r'<!--(.*?)-->', flags=re.DOTALL)
In [3]: text = response.xpath("//div[@id='all_four_factors']//comment()").get()
In [4]: selector = scrapy.Selector(text=pat.findall(text)[0])
In [5]: result = selector.xpath('//tbody/tr[1]/td[1]')
In [6]: result
Out[6]: [<Selector xpath='//tbody/tr[1]/td[1]' data='<td class="right " data-stat="pace">1...'>]
In [7]: result[0].xpath('./text()').get()
Out[7]: '112.8'
In [8]:
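The same trick also works outside the Scrapy shell; here is a standalone sketch with lxml, using a stripped-down stand-in for the page (the markup below is invented for illustration):

```python
import re
from lxml import html

# Minimal stand-in for the real page: the table ships inside an HTML comment
page = ('<div id="all_four_factors"><!--'
        '<table><tbody><tr><td data-stat="pace">112.8</td></tr></tbody></table>'
        '--></div>')

# Pull the comment body back out and parse it as regular HTML
pat = re.compile(r'<!--(.*?)-->', flags=re.DOTALL)
inner = pat.findall(page)[0]
cell = html.fromstring(inner).xpath('//tbody/tr[1]/td[1]/text()')[0]
print(cell)  # 112.8
```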

how to use regex in nokogiri xpath

<div class="ydpbfddd73dsignature">......
How do I use xpath to get whatever text comes after this tag?
I tried doing this
nokogiri_html=Nokogiri::HTML html
nokogiri_html.xpath('//div[@class="/.*signature/"]')
But it doesn't work.
You can apply the following XPath:
//div[substring(@class, string-length(@class) - 8) = "signature"]
which returns div nodes whose class attribute ends with "signature" (i.e. whose last 9 characters are "signature").
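A quick way to check that expression, sketched here in Python with lxml on made-up markup mimicking the randomized class prefix from the question:

```python
from lxml import html

# Invented class names for illustration; the prefix stands in for the
# randomized part from the question
doc = html.fromstring(
    '<div><div class="ydpbfddd73dsignature">Sent from my phone</div>'
    '<div class="content">Hello</div></div>'
)
# XPath 1.0 has no ends-with(), so emulate it with substring():
# take the last 9 characters of @class and compare them to "signature"
xpath = '//div[substring(@class, string-length(@class) - 8) = "signature"]'
nodes = doc.xpath(xpath)
print([n.text for n in nodes])  # ['Sent from my phone']
```

The same expression works unchanged in Nokogiri, since both libraries use XPath 1.0 engines.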

css/xpath selector to exclude the child node in the element when using selenium webdriver (java)

CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I do getText(), I get the complete text, like Text –text. Is it possible to exclude the muted subclass text?
I tried cssSelector = "a:not([span='muted'])", but it doesn't work.
xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [objectText]. It should be an element.
AFAIK this cannot be done with a CSS selector alone. You can try using a JavascriptExecutor to get the required text.
As you didn't mention which programming language you use, here is an example in Python:
link = driver.find_element_by_css_selector('a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
This will return just "Text" without " –text".
You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[@href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[@href='link']/span[@class='muted']")).getText();
// Remove the span text from the link text and trim any whitespace
// (note: String is immutable in Java, so the result must be reassigned)
linkText = linkText.replace(spanText, "").trim();

How to find an element's text in Capybara while ignoring inner element text

In the HTML example below I am trying to grab the $16.95 text in the outer span.price element and exclude the text from the inner span.sale one.
<div class="price">
<span class="sale">
<span class="sale-text">"Low price!"</span>
"$16.95"
</span>
</div>
If I were using Nokogiri this wouldn't be too difficult:
price = doc.css('.sale')
price.search('.sale-text').remove
price.text
Capybara, however, navigates rather than removes nodes. I knew something like price.text would grab text from all sub-elements, so I tried to use XPath to be more specific: p.find(:xpath, "//span[@class='sale']", :match => :first).text. However, this grabs text from the inner element as well.
Finally, I tried looping through all spans to see if I could separate the results but I get an Ambiguous error.
p.find(:css, 'span').each { |result| puts result.text }
Capybara::Ambiguous: Ambiguous match, found 2 elements matching css "span"
I am using Capybara/Selenium as this is for a web scraping project with authentication complications.
There is no single-statement way to do this with Capybara, since the DOM's concept of innerText doesn't really support what you want to do. Assuming p is the '.price' element, two ways you could get what you want are as follows:
Since you know the node you want to ignore, just subtract that text from the whole text:
p.find('span.sale').text.sub(p.find('span.sale-text').text, '')
Grab the innerHTML string and parse it with Nokogiri or Capybara.string (which just wraps Nokogiri elements in the Capybara DSL):
doc = Capybara.string(p['innerHTML'])
nokogiri_fragment = doc.native
#do whatever you want with the nokogiri fragment
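The remove-then-read idea also translates to Python with lxml, in case the fragment ends up being parsed there; a minimal sketch on the question's markup:

```python
from lxml import html

doc = html.fromstring(
    '<div class="price"><span class="sale">'
    '<span class="sale-text">Low price!</span>$16.95</span></div>'
)
sale = doc.xpath('//span[@class="sale"]')[0]
# Drop the inner node; drop_tree() removes the element and its text
# but keeps its tail text ("$16.95") attached to the parent
for inner in sale.xpath('.//span[@class="sale-text"]'):
    inner.drop_tree()
print(sale.text_content().strip())  # $16.95
```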

XPath - Nested path scraping

I'm trying to perform HTML scraping of a webpage. I'd like to fetch the three alternate texts (alt, highlighted) from the three img elements.
I'm using the following code to extract the whole img element of slide-1:
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
I'm not getting the alternate text values displayed; instead I get an empty list.
HTML Script used:
This is one possible XPath:
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation:
//div[@id='slide-1'] : this part finds the target <div> element by comparing the id attribute value. Notice the @attribute_name syntax used to reference an attribute in XPath. Omitting the @ symbol would change the selector's meaning to reference a child element of that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find child <a> elements whose class attribute equals 'cover-wrapper'.
/img/@alt : then from each such <a> element, find the child <img> element and return its alt attribute.
You might want to change the id filter to starts-with(@id, 'slide-') if you meant to return all 3 alt attributes in the screenshot.
Try this:
//a[@class="cover-wrapper"]/img/@alt
So, I first select the a nodes having class cover-wrapper, then select the img node, and then the alt attribute of img.
To find the whole image element:
//a[@class="cover-wrapper"]
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
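A minimal sketch of that expression in Python with lxml, on markup reconstructed from the question (the src and alt values are made up for illustration):

```python
from lxml import html

# Assumed structure: a slide div containing three cover links with images
snippet = """
<div class="showcase-wrapper" id="slide-1">
  <a class="cover-wrapper" href="#"><img src="a.jpg" alt="First cover"/></a>
  <a class="cover-wrapper" href="#"><img src="b.jpg" alt="Second cover"/></a>
  <a class="cover-wrapper" href="#"><img src="c.jpg" alt="Third cover"/></a>
</div>
"""
tree = html.fromstring(snippet)
# @alt at the end of the path returns attribute values as strings
alts = tree.xpath('//div[@id="slide-1"]/a[@class="cover-wrapper"]/img/@alt')
print(alts)  # ['First cover', 'Second cover', 'Third cover']
```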
