I'm trying to use xpath to get the raw value of an element. The element is a description and it can contain raw text or xhtml.
So it can be as follows:
<description>asdasdasd <a>Item1</a> asd <a> Price </a></description>
Based on the above XML, I just need this:
asdasdasd Item1 asd Price
I've tried //description/text(), //description/descendant::*/text() and some others with no success. Any suggestion?
Just use:
//description
The string value of an element is the concatenation of all its descendant text nodes.
Or if it must be a string and there is just one element:
string(//description)
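As a rough illustration of what that string value looks like, here is a minimal sketch using Python's standard-library xml.etree.ElementTree (an assumption, since the question does not name a language or library). ElementTree has no string() function, but itertext() walks every descendant text node in document order, which gives the same concatenation.

```python
import xml.etree.ElementTree as ET

xml = "<description>asdasdasd <a>Item1</a> asd <a> Price </a></description>"
description = ET.fromstring(xml)

# Concatenate all descendant text nodes, as XPath's string() would.
raw = "".join(description.itertext())

# Collapse runs of whitespace to get the single-spaced form from the question.
text = " ".join(raw.split())
print(text)
```

The whitespace cleanup is a separate step: the string value itself keeps the spaces exactly as they appear in the source.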
I'm trying to extract text contained within HTML tags in order to build a Python defaultdict.
To accomplish this I need to strip out all XPath and/or HTML data and get just the text, which I can do with /text(), unless it's an href.
How I scrape the items:
for item in response.xpath(
"//*[self::h3 or self::p or self::strong or self::a[@href]]"):
How it looks if I print the above, without extraction attempts:
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<h3> Some text here ...'>
<Selector xpath='//*[self::h3 or self::p or self::a[@href]]' data='<a href="https://some.url.com...'>
I want to extract "Some text here" and "https://some.url.com"
How I try to extract the text:
item = item.xpath("./text()").get()
print(item)
The result:
Some text here
None
"None" is where I would expect to see https://some.url.com. After trying various methods suggested online, I still cannot get this to work.
Try to use this line to extract either text or @href:
item = item.xpath("./text() | ./@href").get()
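The same "text or href" idea can be sketched without Scrapy. The snippet below is only an illustration using the standard-library xml.etree.ElementTree on made-up markup; the sample document and the extract helper are hypothetical and not part of the question. It returns the href for elements that have one and the text for everything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup standing in for the scraped page.
doc = ET.fromstring(
    "<body>"
    "<h3> Some text here </h3>"
    '<a href="https://some.url.com"/>'
    "</body>"
)

def extract(element):
    """Return the href attribute if present, otherwise the element's text."""
    href = element.get("href")
    if href is not None:
        return href
    return (element.text or "").strip()

# Keep only the tags the question is interested in.
values = [extract(el) for el in doc if el.tag in ("h3", "p", "strong", "a")]
print(values)
```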
From my xml, I can get this :
<home>
<creditors>
<count>2</count>
</creditors>
</home>
OR even this :
<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>
Which xpath expression can I use to get "<count>2</count>" instead of getting only "2" OR to get "<moreThan>2</moreThan>" instead of getting "2" ?
This XPath,
//creditors/count
will select all count child elements of all creditors elements in the XML document.
Update per OP's request in comments for a single XPath that selects both count and moreThan elements:
This XPath,
//creditors/*[self::count or self::moreThan]
will select all count or moreThan child elements of all creditors elements in the XML document.
Assuming that your xpath expression is OK, you just need to convert the element to string:
doc.xpath("home/creditors/*").to_s
=> "<count>2</count>"
Please check with queries returning more than one element, to make sure that it's desired behaviour.
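For comparison, here is a minimal Python sketch of the same "serialize the element instead of reading its text" approach, assuming the standard-library xml.etree.ElementTree rather than the library used in the answer above. The path creditors/* picks up whichever child is present, count or moreThan.

```python
import xml.etree.ElementTree as ET

xml = "<home><creditors><count>2</count></creditors></home>"
home = ET.fromstring(xml)

# Select every child of <creditors>, whatever its tag name is.
children = home.findall("creditors/*")

# Serialize each selected element back to markup instead of reading its text.
markup = [ET.tostring(child, encoding="unicode") for child in children]
print(markup)
```

Because the result is a list, a query that matches several elements yields one markup string per element.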
I am trying to fetch the numeric value after the strong tag. Because it's not a web element, I am not able to get the value 123456789 into a variable:
If I use Get Text xpath=//*[@id='referral-or-navinet-reference-number'] then the result is "Referral #: 123456789"
Please help me get only the numeric value into a variable.
HTML Code:
<td class="normal-text" id="referral-or-navinet-reference-number" align="right">
<strong>Referral #:</strong> 123456789
</td>
You can use Python's split method directly, like this:
x.split(":")[1].strip()  # x is the string returned by Get Text
http://www.tutorialspoint.com/python/string_split.htm
http://www.pythonforbeginners.com/dictionary/python-split
Hope it will help you :)
If the wanted text is the only text node directly inside your td, you may use the following XPath:
//*[@id='referral-or-navinet-reference-number']/text()
This should return 123456789 (perhaps with some surrounding whitespace).
You can use this XPath:
//td[@id="referral-or-navinet-reference-number"]/text()[normalize-space()]
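To see which text node these expressions are targeting, here is a minimal sketch with the standard-library xml.etree.ElementTree (an assumption; the question itself uses a Get Text keyword). ElementTree has no text() step, so the text node that follows the strong element is exposed as that element's tail.

```python
import xml.etree.ElementTree as ET

html = (
    '<td class="normal-text" id="referral-or-navinet-reference-number" align="right">'
    "<strong>Referral #:</strong> 123456789"
    "</td>"
)
td = ET.fromstring(html)

# The number is not inside <strong>; it is the text that follows it (its "tail").
number = td.find("strong").tail.strip()
print(number)
```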
I'm trying to scrape an HTML page. I'd like to fetch the three alternate texts (alt - highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol would change the meaning of the selector so that it references a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child <a> element whose class attribute equals 'cover-wrapper'
/img/@alt : then from each such <a> element, find the child <img> element and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
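The steps above can be tried out with a small sketch. It uses the standard-library xml.etree.ElementTree and made-up markup, since the page from the question is not shown; the div and img contents here are hypothetical. ElementTree supports [@attr='value'] predicates but cannot select an attribute as the final step, so the code selects the img elements and reads alt from each.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the page from the question is not shown.
page = ET.fromstring(
    "<body>"
    '<div class="showcase-wrapper" id="slide-1">'
    '<a class="cover-wrapper" href="#"><img src="1.png" alt="First cover"/></a>'
    "</div>"
    '<div class="showcase-wrapper" id="slide-2">'
    '<a class="cover-wrapper" href="#"><img src="2.png" alt="Second cover"/></a>'
    "</div>"
    "</body>"
)

# Select the <img> elements under slide-1, then read the alt attribute of each.
path = ".//div[@id='slide-1']/a[@class='cover-wrapper']/img"
alts = [img.get("alt") for img in page.findall(path)]
print(alts)
```

Dropping the [@id='slide-1'] predicate from the path would return the alt text of every slide.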
Try this:
//a[@class="cover-wrapper"]/img/@alt
This first selects the a element whose class is cover-wrapper, then its img child, and finally the alt attribute of that img.
To find the whole image element :
//a[@class="cover-wrapper"]/img
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
I have the plenty of links like this:
<b>Edit issue >></b>
To extract the href content I use this XPath expression:
//a[contains(@href,'/edit_flat')]
but it returns null. What am I doing wrong?
//a[contains(@href,'/edit_flat')] selects a elements anywhere in the document tree that have an href attribute containing the '/edit_flat' string.
These matching elements do have that href attribute, but the XPath expression you are using returns only the a elements themselves, if there are any.
To actually return the attribute values of the matching elements, you need an extra step, with / and @href. So what you want is:
//a[contains(@href,'/edit_flat')]/@href
Suggestion:
What you really want is probably to select links whose href begins with the substring "/edit_flat", so it's safer to use:
.//a[starts-with(@href,'/edit_flat')]/@href
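The difference between selecting the a elements and selecting their href values can be shown with a short sketch. It assumes the standard-library xml.etree.ElementTree and hypothetical markup, since the real links from the question are not shown. ElementTree has no starts-with() function, so the prefix test is done in Python after the elements are selected.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the real links from the question are not shown.
page = ET.fromstring(
    "<body>"
    '<a href="/edit_flat?id=1"><b>Edit issue</b></a>'
    '<a href="/view_flat?id=1"><b>View issue</b></a>'
    "</body>"
)

# Select the <a> elements first, then read the attribute value from each one.
hrefs = [
    a.get("href")
    for a in page.iter("a")
    if a.get("href", "").startswith("/edit_flat")
]
print(hrefs)
```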