I have a very long web page with many script nodes. How can I use XPath to access the one that contains 'var config' in its text?
<script>
var config = {
locale: 'es',
userAuthenticated: false
}
</script>
If this is the only script node that has a config attribute, you can use the following XPath to locate it:
"//script[@config]"
If the config values contain something unique, like userAuthenticated here, an XPath like this could be used:
"//script[contains(@config,'userAuthenticated')]"
UPD
The element you are looking for can be located with the following XPath:
"//script[contains(.,'userAuthenticated')]"
Related
I'm trying to extract text contained within HTML tags in order to build a Python defaultdict.
To accomplish this I need to clean out all xpath and/or HTML data and get just the text, which I can accomplish with /text() , unless it's an href.
How I scrape the items:
for item in response.xpath(
"//*[self::h3 or self::p or self::strong or self::a[@href]]"):
How it looks if I print the above, without extraction attempts:
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>
I want to extract "Some text here" and "https://some.url.com"
How I try to extract the text:
item = item.xpath("./text()").get()
print(item):
The result:
Some text here
None
"None" is where I would expect to see https://some.url.com. After trying various methods suggested online, I cannot get this to work.
Try to use this line to extract either the text or the @href:
item = item.xpath("./text() | ./@href").get()
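The union can be tried without a running spider. The sketch below uses lxml instead of Scrapy selectors (an assumption made to keep it self-contained), and the FRAGMENT markup is hypothetical; the XPath expressions are the same ones used above:

```python
from lxml import html

# Hypothetical fragment standing in for the scraped page.
FRAGMENT = """
<div>
  <h3>Some text here</h3>
  <a href="https://some.url.com"></a>
</div>
"""

root = html.fromstring(FRAGMENT)

values = []
for node in root.xpath(".//*[self::h3 or self::a[@href]]"):
    # Each element here has either a text node or an href attribute,
    # so the first result of the union is the value we want.
    found = node.xpath("./text() | ./@href")
    values.append(str(found[0]).strip() if found else None)

print(values)
```

With lxml, xpath() returns a list, so taking the first item (or None for an empty list) plays the role of Scrapy's .get().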
CSS/xpath selector to get the link text excluding the text in .muted.
I have html like this:
<a href="link">
Text
<span class="muted"> –text</span>
</a>
When I call getText(), I get the complete text, i.e. Text–text. Is it possible to exclude the text of the muted span?
I tried cssSelector = "a:not([span='muted'])", but it doesn't work.
xpath = "//a/node()[not(name()='span')][1]"
ERROR: The result of the xpath expression "//a/node()[not(name()='span')][1]" is: [object Text]. It should be an element.
AFAIK this cannot be done with a CSS selector alone. You can use JavaScriptExecutor to get the required text.
As you didn't mention which programming language you use, here is an example in Python:
from selenium.webdriver.common.by import By
# Locate the link, then read only its first child node (the leading text node)
link = driver.find_element(By.CSS_SELECTOR, 'a[href="link"]')
driver.execute_script('return arguments[0].childNodes[0].nodeValue', link)
This will return just "Text" (plus any surrounding whitespace) without " –text"
You cannot do this using Selenium WebDriver's API. You have to handle it in your code as follows:
// Get the entire link text
String linkText = driver.findElement(By.xpath("//a[@href='link']")).getText();
// Get the span text only
String spanText = driver.findElement(By.xpath("//a[@href='link']/span[@class='muted']")).getText();
// Remove the span text from the link text and trim any whitespace (strings are immutable, so keep the result)
linkText = linkText.replace(spanText, "").trim();
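Selenium insists on elements, but the underlying XPath idea is sound: text() selects only the text nodes that are direct children of the link. A minimal sketch of that, assuming the markup is parsed offline with lxml rather than through WebDriver:

```python
from lxml import html

# The markup from the question.
LINK = """
<a href="link">
Text
<span class="muted"> –text</span>
</a>
"""

anchor = html.fromstring(LINK)

# text() matches only the text nodes that are direct children of <a>,
# so the text inside <span class="muted"> is left out.
own_text = "".join(anchor.xpath("./text()")).strip()

print(own_text)
```

This only helps when the page source is available as a string; inside a live browser session the JavaScript or string-replace approaches above still apply.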
I'm trying to scrape a web page. I'd like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing its id attribute value. Notice the @attribute_name syntax used to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector: it would reference a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child element <a> whose class attribute value equals 'cover-wrapper'
/img/@alt : then, from each such <a> element, find the child element <img> and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
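Since the original HTML is not shown, here is a sketch against hypothetical markup that follows the structure the XPath assumes (the PAGE string, file names and alt values are made up). It runs both the single-slide expression and the starts-with() variant with lxml:

```python
from lxml import html

# Hypothetical markup modeled on the structure the XPath above assumes.
PAGE = """
<div class="showcase">
  <div id="slide-1"><a class="cover-wrapper" href="/one"><img src="1.png" alt="First"></a></div>
  <div id="slide-2"><a class="cover-wrapper" href="/two"><img src="2.png" alt="Second"></a></div>
  <div id="slide-3"><a class="cover-wrapper" href="/three"><img src="3.png" alt="Third"></a></div>
</div>
"""

tree = html.fromstring(PAGE)

# Only the first slide.
first = tree.xpath("//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt")

# Every slide whose id starts with 'slide-'.
all_alts = tree.xpath(
    "//div[starts-with(@id, 'slide-')]/a[@class='cover-wrapper']/img/@alt"
)

print(first)
print(all_alts)
```

Note that /text() is not needed: selecting @alt already returns the attribute values as strings.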
Try this:
//a[@class="cover-wrapper"]/img/@alt
Here I first select the <a> node whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element:
//a[@class="cover-wrapper"]/img
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
How can I retrieve multiple DIVs (with a given class attribute "a") that contain a span tag with a class attribute "b" by using XPath?
<div class='a'>
<span class='b'/>
</div>
The structure of my XML is not defined so basically the span could be at any level of the div and the div itself could be at any level of the XML tree.
This should work:
//div[@class='a'][span/@class='b']
A leading // means the search starts anywhere in the document.
If the span is nested deeper inside the div, use the descendant axis, which can be abbreviated as .// inside the predicate:
//div[@class='a'][.//span/@class='b']
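To see the difference between the two predicates, here is a small sketch with lxml. The DOC string is a hypothetical document in which the span is a direct child in one div and nested deeper in another:

```python
from lxml import etree

# Hypothetical document: span as a direct child, span nested deeper,
# and two divs that should not match.
DOC = """
<root>
  <div class="a"><span class="b"/></div>
  <section><div class="a"><p><span class="b"/></p></div></section>
  <div class="a"><span class="c"/></div>
  <div class="x"><span class="b"/></div>
</root>
"""

tree = etree.fromstring(DOC)

# Matches only divs whose span is a direct child.
direct = tree.xpath("//div[@class='a'][span/@class='b']")

# Matches divs with a matching span at any depth below them.
nested = tree.xpath("//div[@class='a'][.//span/@class='b']")

print(len(direct), len(nested))
```

The second expression is the safer choice when, as in the question, the position of the span inside the div is not known in advance.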
I have an XML document which contains nodes like following:-
<a class="custom">test</a>
<a class="xyz"></a>
I was trying to get the nodes whose class is NOT "custom", and I wrote an expression like the following:-
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']]");
Now I want to get IMG tags as well, so I want to add the following expression to the one above:-
//*[self::IMG or self::img]
...so that I get all the IMG nodes as well and any tag other than having "custom" as value in the class attribute.
Any help will be appreciated.
EDIT :-
I tried the following, but it is invalid syntax, as it returns a boolean and not a node list:-
XmlNodeList nodeList = document.SelectNodes("//*[self::A[@class!='custom'] or self::a[@class!='custom']] && [self::IMG or self::img]");
I'm not sure what you are asking, but have you tried something like the following?
"//A[@class!='custom'] | //a[@class!='custom'] | //IMG | //img"