Scrapy: Extract HTML as string inside Element

I want to extract the HTML inside a div. For example, in this piece of HTML:
<div id="main"><h1><xyz>Title</xyz></h1></div>
I want to extract the div content, <h1><xyz>Title</xyz></h1>, as a string.
Is that possible with CSS or XPath Scrapy selectors?
Thanks :)

With XPath, use the dedicated function string():
string(//div[@id='main']/h1/xyz)
Output: "Title"
EDIT: To output the whole path if you're looking for "Title":
concat(concat("<",name(//*[.="Title"]/parent::*),">"),concat("<",name(//*[.="Title"]),">"),string(//*[.="Title"]),concat("</",name(//*[.="Title"]),">"),concat("</",name(//*[.="Title"]/parent::*),">"))
Output : <H1><XYZ>Title</XYZ></H1>

A solution with a CSS selector is not possible, but it is pretty simple with XPath:
desired_str = selector.xpath("//div[@id='main']").extract()
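
Note that calling .extract() on the div selector returns a list holding the div's outer HTML, wrapper tag included. If only the inner markup is wanted, one option is to serialize the child nodes and join them; in Scrapy that would look something like "".join(selector.xpath("//div[@id='main']/node()").extract()). The sketch below shows the same idea with Python's standard library so it runs without Scrapy. The inner_html helper is a hypothetical name, and the sample markup assumes the xyz tag is properly closed.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Serialize everything inside `element`, leaving out its own tag."""
    parts = [element.text or ""]
    for child in element:
        # tostring() also writes the child's tail text, so text that
        # follows a child element is preserved.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


# Assumed sample document: the question's snippet wrapped in a body element.
doc = ET.fromstring(
    '<body><div id="main"><h1><xyz>Title</xyz></h1></div></body>'
)
div = doc.find(".//div[@id='main']")
print(inner_html(div))
```

This is only an illustration of "children without the wrapper"; with real-world HTML that is not well-formed XML, a lenient HTML parser such as the one behind Scrapy's selectors is the better fit.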

Related

XPath one of multiple elements of an attribute

In this HTML, using Scrapy, I can access the full info-car attribute with the XPath './/@info-car':
<div class="car car-root"
info-car='{brand":"BMW","Price":"&#30000"name":"X5","color":null,"}'>
</div>
What is the XPath to pick only the name from info-car?

You can obtain the name by combining XPath with a regex. The non-greedy group stops at the first closing quote after the name. See the sample code below:
response.xpath(".//@info-car").re_first(r'"name":"(.*?)"')
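
To see what the regex does outside Scrapy, here is a minimal sketch that applies the same pattern to a plain string. The attribute value below is a hypothetical, well-formed variant (the one shown in the question is not valid JSON as written). When the attribute really holds valid JSON, parsing it is usually more robust than a regex.

```python
import json
import re

# Hypothetical, well-formed attribute value (an assumption for illustration).
info_car = '{"brand":"BMW","name":"X5","color":null}'

# Regex approach: the non-greedy group stops at the first closing quote.
match = re.search(r'"name":"(.*?)"', info_car)
name_from_regex = match.group(1) if match else None

# JSON approach: parse the whole attribute and read the key directly.
name_from_json = json.loads(info_car).get("name")

print(name_from_regex, name_from_json)
```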

XPath to element text

<p><span class="label">key</span>value</p>
How can I get just the "value" out using XPath? I managed to get to the element using the following expression:
//span[@class='label']/..

Try this to get the required value:
//p[span[@class='label']]/text()

You just have to use text() to get the text from the p:
//span[@class='label']/../text()
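
Both expressions work because text() on the p selects only the text nodes that are direct children of p, so "key" inside the span is skipped. As a rough illustration in Python's standard library (which has no text() step and instead stores text that follows a child element in that child's .tail), the same value can be read like this:

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p><span class="label">key</span>value</p>')
span = p.find("span[@class='label']")

# Text directly under <p>: anything before the span (p.text, empty here)
# plus the text that follows the span (span.tail).
direct_text = (p.text or "") + (span.tail or "")
print(direct_text)
```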

How to exclude a child node from xpath?

I have the following code :
<div class = "content">
<table id="detailsTable">...</table>
<div class = "desc">
<p>Some text</p>
</div>
<p>Another text</p>
</div>
I want to select all the text within the 'content' class, which I would get using this XPath:
doc.xpath('string(//div[@class="content"])')
The problem is that it selects all the text including text within the 'table' tag. I need to exclude the 'table' from the xPath. How would I achieve that?

XPath 1.0 solutions:
substring-after(string(//div[@class="content"]),string(//div[@class="content"]/table))
Or just use concat:
concat(//table/following::p[1]," ",//table/following::p[2])

The XPath expression //div[@class="content"] selects the div element, nothing more and nothing less, and applying the string() function gives you the string value of the element, which is the concatenation of all its descendant text nodes.
Getting all the text except that contained in one particular child is probably not possible in XPath 1.0. With XPath 2.0 it can be done as
string-join(//div[@class="content"]/(node() except table)//text(), '')
But for this kind of manipulation, you're really in the realm of transformation rather than pure selection, so you're stretching the limits of what XPath is designed for.
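
When the filtering is done in the host language instead of XPath, it comes down to walking the tree and skipping the unwanted subtree. The following is a minimal sketch using Python's standard library; text_excluding is a hypothetical helper name, and the markup is the question's snippet with assumed table contents and a properly closed final p.

```python
import xml.etree.ElementTree as ET


def text_excluding(element, skip_tag):
    """Concatenate descendant text, ignoring subtrees rooted at skip_tag."""
    parts = [element.text or ""]
    for child in element:
        if child.tag != skip_tag:
            parts.append(text_excluding(child, skip_tag))
        # Tail text belongs to the parent, so keep it even for skipped children.
        parts.append(child.tail or "")
    return "".join(parts)


html = (
    '<div class="content">'
    '<table id="detailsTable"><tr><td>cell</td></tr></table>'
    '<div class="desc"><p>Some text</p></div>'
    '<p>Another text</p>'
    '</div>'
)
content = ET.fromstring(html)
result = text_excluding(content, "table")
print(result)
```

The helper mirrors what string() computes (all descendant text nodes joined together), minus the excluded element.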

XPath Search inside span class following a specific text inside a li

I would like to find "How are you?" using XPath in this piece of HTML:
<li>Hello<span class="redS bold">How are you ?</span></li>
I tried with:
//span[contains(@class, 'redS bold') and text() = 'Hello']
Thanks in advance for your help

Maybe:
//span[contains(text(),'How are you')]
Or maybe:
//span[contains(@class,'redS bold') and contains(text(),'How are you')]

css/xpath selector to exclude the child node in the element when using selenium webdriver (java)

CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I call getText(), I get the complete text, i.e. "Text –text". Is it possible to exclude the muted span's text?
I tried cssSelector = "a:not([span='muted'])", but it doesn't work.
I also tried xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [object Text]. It should be an element.

AFAIK this cannot be done with a CSS selector alone. You can use JavascriptExecutor to get the required text.
Here is an example in Python:
from selenium.webdriver.common.by import By
link = driver.find_element(By.CSS_SELECTOR, 'a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
This returns just "Text" (with its surrounding whitespace), without " –text".

You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[@href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[@href='link']/span[@class='muted']")).getText();
// Remove the span text from the link text and trim any whitespace.
// Strings are immutable in Java, so the result must be assigned.
linkText = linkText.replace(spanText, "").trim();
