I can match sometext and othertext in
<br>
sometext
<br>
othertext
using xpath selector '//br/following-sibling::text()'
but if there is only whitespace after the <br> element
<br>
<br>
othertext
only the second match occurs. Is it possible to match whitespace as well?
I tried
//br/following-sibling::matches(., "\s+")
to attempt to match whitespace without success.
'matches' is to match regular-expressions, not to match nodes. And it can't be used with an axis specifier. You could use it as condition like:
//br/following-sibling::text()[matches(., "\s+")]
Or without regexs (might be faster depending on the implementation), checking if it is all whitespace and not the empty string:
//br/following-sibling::text()[(normalize-space(.) = "") and (. != "")]
Related
I have html contents in following text.
"This is my text to be parsed which contains url
http://someurl.com?param1=foo¶ms2=bar
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w
</a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span>
"
Need a regex that will convert plain urls to hyperlink(without tampering existing hyperlink)
Expected result:
"This is my text to be parsed which contains url
<a href="http://someurl.com?param1=foo¶ms2=bar">
http://someurl.com?param1=foo¶ms2=bar</a>
<a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test
1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
<span>i have a link too http://someurlinsidespan.com?xyz=abc </span> "
Disclaimer: You shouldn't use regex for this task, use an html parser. This is a POC to demonstrate that it's possible if you expect a good formatted HTML (which you won't have anyway).
So here's what I came up with:
(https?:\/\/(?:w{1,3}.)?[^\s]*?(?:\.[a-z]+)+)(?![^<]*?(?:<\/\w+>|\/?>))
What does this mean ?
( : group 1
https? : match http or https
\/\/ : match //
(?:w{1,3}.)? : match optionally w., ww. or www.
[^\s]*? : match anything except whitespace zero or more times ungreedy
(?:\.[a-z]+)+) : match a dot followed by [a-z] character(s), repeat this one or more times
(?! : negative lookahead
[^<]*? : match anything except < zero or more times ungreedy
(?:<\/\w+>|\/?>) : match a closing tag or /> or >
) : end of lookahead
) : end of group 1
regex101 online demo
rubular online demo
Maybe you could do a search-and-replace first to remove the HTML elements. I don't know Ruby, but the regex would be something like /<(\w+).*?>.*?</\1>/. But it might be tricky if you have nested elements of the same type.
Maybe try http://rubular.com/ .. there are some Regex tips helps you get the desired output.
I would do something like this:
require 'nokogiri'
doc = Nokogiri::HTML.fragment <<EOF
This is my text to be parsed which contains url
http://someurl.com <a href="http://thisshouldnotbetampered.com">
some text and a url http://someotherurl.com test 1q2w </a> <img src="http://someasseturl.com/abc.jpeg"/>
EOF
doc.search('*').each{|n| n.replace "\n"}
URI.extract doc.text
#=> ["http://someurl.com"]
I have a text similar to this:
<p>some text ...</p><p>The post text... appeared first on some another text.</p>
I need to remove everything from <p>The post, so the results would be:
<p>some text ...</p>
I am trying ot do that this way:
text.sub!(/^<p>The post/, '')
But it returns just an empty string... how to fix that?
Your regex is incorrect. It matches every <p>The post that is in the beginning of the string. You want the opposite: match from its position to the end of the string. Check this out.
s = '<p>some text ...</p><p>The post text... appeared first on some another text.</p>'
s.sub(/<p>The\spost.*$/, '') # => "<p>some text ...</p>"
You have specified ^, which matches the beginning of a string. You should do
text.sub!(/<p>The post.*$/, '')
Play with this in http://rubular.com/r/c91EbHN0Af
'^' is matching the beginning of the whole string. try doing
text.sub!(/<p>The post/, '')
EDIT just read it more carefully...
text.sub!(/<p>The post.*$/, '')
I have this HTML:
<tr class="even expanded first>
<td class="score-time status">
<a href="/matches/2012/08/02/europe/uefa-cup/">
16 : 00
</a>
</td>
</tr>
I want to extract the (16 : 00) string without the extra whitespace. Is this possible?
I. Use this single XPath expression:
translate(normalize-space(/tr/td/a), ' ', '')
Explanation:
normalize-space() produces a new string from its argument, in which any leading or trailing white-space (space, tab, NL or CR characters) is deleted and any intermediary white-space is replaced by a single space character.
translate() takes the result produced by normalize-space() and produces a new string in which each of the remaining intermediary spaces is replaced by the empty string.
II. Alternatively:
translate(/tr/td/a, '

', '')
Please try the below xpath expression :
//td[#class='score-time status']/a[normalize-space() = '16 : 00']
You can use XPath's normalize-space() as in //a[normalize-space()="16 : 00"]
I came across this thread when I was having my own issue similar to above.
HTML
<div class="d-flex">
<h4 class="flex-auto min-width-0 pr-2 pb-1 commit-title">
<a href="/nsomar/OAStackView/releases/tag/1.0.1">
1.0.1
</a>
XPath start command
tree.xpath('//div[#class="d-flex"]/h4/a/text()')
However this grabbed random whitespace and gave me the output of:
['\n ', '\n 1.0.1\n ']
Using normalize-space, it removed the first blank space node and left me with just what I wanted
tree.xpath('//div[#class="d-flex"]/h4/a/text()[normalize-space()]')
['\n 1.0.1\n ']
I could then grab the first element of the list, and use strip() to remove any further whitespace
XPath final command
tree.xpath('//div[#class="d-flex"]/h4/a/text()[normalize-space()]')[0].strip()
Which left me with exactly what I required:
1.0.1
you can check if text() nodes are empty.
/path/text()[not(.='')]
it may be useful with axes like following-sibling:: if these are no containers, or with child::.
you can use string() or the regex() function of xpath 2.
NOTE: some comments say that xpath cannot do string manipulation... even if it's not really designed for that you can do basic things: contains(), starts-with(), replace().
if you want to check whitespace nodes it's much harder, as you will generally have a nodelist result set, and most xpath functions, like match or replace, only operate one node.
you can separate node and string manipulation
So you may use xpath to retrieve a container, or a list of text nodes, and then process it with another language. (java, php, python, perl for instance).
Basically I need to scrape some text that has nested tags.
Something like this:
<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>
And I want an expression that will produce this:
This is an example bolded text
I have been struggling with this for hour or more with no result.
Any help is appreciated
The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order.
You want to call the XPath string() function on the div element.
string(//div[#id='theNode'])
You can also use the normalize-space function to reduce unwanted whitespace that might appear due to newlines and indenting in the source document. This will remove leading and trailing whitespace and replace sequences of whitespace characters with a single space. When you pass a nodeset to normalize-space(), the nodeset will first be converted to it's string-value. If no arguments are passed to normalize-space it will use the context node.
normalize-space(//div[#id='theNode'])
// if theNode was the context node, you could use this instead
normalize-space()
You might want use a more efficient way of selecting the context node than the example XPath I have been using. eg, the following Javascript example can be run against this page in some browsers.
var el = document.getElementById('question');
var result = document.evaluate('normalize-space()', el, null ).stringValue;
The whitespace only text node between the span and b elements might be a problem.
Use:
string(//div[#id='theNode'])
When this expression is evaluated, the result is the string value of the first (and hopefully only) div element in the document.
As the string value of an element is defined in the XPath Specification as the concatenation in document order of all of its text-node descendants, this is exactly the wanted string.
Because this can include a number of all-white-space text nodes, you may want to eliminate contiguous leading and trailing white-space and replace any such intermediate white-space by a single space character:
Use:
normalize-space(string(//div[#id='theNode']))
XSLT - based verification:
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="/">
"<xsl:copy-of select="string(//div[#id='theNode'])"/>"
===========
"<xsl:copy-of select="normalize-space(string(//div[#id='theNode']))"/>"
</xsl:template>
</xsl:stylesheet>
when this transformation is applied on the provided XML document:
<div id='theNode'> This is an
<span style="color:red">example</span>
<b>bolded</b> text
</div>
the two XPath expressions are evaluated and the results of these evaluations are copied to the output:
" This is an
example
bolded text
"
===========
"This is an example bolded text"
If you are using scrapy in python, you can use descendant-or-self::*/text(). Full example:
txt = """<div id='theNode'>
This is an <span style="color:red">example</span> <b>bolded</b> text
</div>"""
selector = scrapy.Selector(text=txt, type="html") # Create HTML doc from HTML text
all_txt = selector.xpath('//div/descendant-or-self::*/text()').getall()
final_txt = ''.join( _ for _ in all_txt).strip()
print(final_txt) # 'This is an example bolded text'
How about this :
/div/text()[1] | /div/span/text() | /div/b/text() | /div/text()[2]
Hmmss I am not sure about the last part though. You might have to play with that.
normal code
//div[#id='theNode']
to get all text but if they become split then
//div[#id='theNode']/text()
Not sure but if you provide me the link I will try
I am trying to match the value of the following HTML snippet:
<input name="example" type="hidden" value="matchTextHere" />
with the following:
x = response.match(/<input name="example" type="hidden" value="^.+$" \/>/)[0]
why is this not working? it doesn't match 'matchTextHere'
edit:
when i use:
x = response.match(/<input name="example" type="hidden" value="(.+)" \/>/)[0]
it matches the whole html element, and not just the value 'matchTextHere'
^ matches start of a line and $ matches end of the line. Change ^.+$ to \w+ and it will work for values that doesn't contain any symbols. Make it a parenthetical group to capture the value - (\w+)
Update: to match anything between the quotes (assuming that there aren't any quotes in the value), use [^"]+. If there are escaped quotes in the value, it is a different ballgame. .+ will work in this case, but it will be slower due to backtracking. .+ first matches upto the end of the string (because . matches even a "), then looks for a " and fails. Then it comes back one position and looks for a " and fails again - and so on until it finds the " - if there was one more attribute after value, then you will get matchTextHere" nextAttr="something as the match.
x = response.match(/<input name="example" type="hidden" value="([^"]+)" \/>/)[1]
That being said, the regex will fail if there is an extra space between any of the attribute values. Parsing html with regex is not a good idea - and if you must use regex, you can allow extra spaces using \s+
/<input\s+name="example"\s+type="hidden"\s+value="([^"]+)"\s*\/>/
Because you have a start-of-line token (^) and an end-of-line token ($) in your regular expression. I think you meant to capture the value, this might solve your problem: value="(.+?)".
Beware, though, that processing html with regular expressions is not a good idea, it can even drive you crazy. Better use an html parser instead.
You don't need the ^ and $:
x = response.match(/<input name="example" type="hidden" value=".+" \/>/)[0]
you just need to change [0] to [1]
response='<input name="example" type="hidden" value="matchTextHere" />'
puts response.match(/<input name="example" type="hidden" value="(.*?)" \/>/)[1]
matchTextHere