Xpath text between tags - xpath

Any idea how I would get the text between two tags using XPath? Specifically the 3, bd, 1, ba.
<p class="MuiTypography-root RoofCard__RoofCardNameStyled-niegej-8 hukPZu MuiTypography-body1" xpath="1">
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md">$65,000</span></p>
**"3" == $0
" bd, " == $0
"1" == $0
" ba | " == $0**
<span class="NumberFormatWithStyle__NumberFormatStyled-sc-1yvv7lw-0 jVQRaZ inline-block md" xpath="1">926</span>
tried:

In your sample, that is simply a text() node following the p element:
//p/following-sibling::text()[1]
You will still need to parse it, of course. This returns almost what you need:
values = response.xpath('//p/following-sibling::text()[1]').re(r'"([^"]+)"')
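The same idea can be tried without Scrapy. This is only a sketch using Python's standard library, under the assumption that the served HTML contains the plain text `3 bd, 1 ba | ` right after the closing `</p>` (the quotes and `== $0` in the question look like browser DevTools output). The wrapper `<div>`, the shortened class names, and the regex are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; a wrapper <div> makes it well-formed XML.
snippet = (
    '<div>'
    '<p class="name"><span class="price">$65,000</span></p>'
    '3 bd, 1 ba | '
    '<span class="area">926</span>'
    '</div>'
)

root = ET.fromstring(snippet)
p = root.find('p')

# ElementTree stores the text that follows an element in its .tail attribute,
# which plays the role of following-sibling::text()[1] here.
tail = p.tail

# Pull the two numbers out of the text node.
match = re.search(r'(\d+)\s*bd,\s*(\d+)\s*ba', tail)
beds, baths = match.groups()
print(beds, baths)
```

On a real page the parsing regex would need to be adapted to whatever text the node actually contains.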

Clarification of Nokogiri::NodeSet XML Content based on 'puts node' and 'puts node.inspect'

I rarely use xpath(), but when I do I keep tripping myself up when interpreting the content of a Nokogiri::NodeSet, and I believe I now know where I have always gone wrong.
Simply put, when I do a 'puts NodeSet' I have always assumed that I could search the NodeSet based on the returned XML. But the first tag returned does not appear to actually be part of the node's XML.
'puts n1' returns XML that has a SPAN as the first element, but if I then search with n1.xpath('SPAN') or n1.xpath('SPAN/DIV'), no nodes are found. n1.xpath('DIV') returns the output I expect, which shows there is no SPAN tag in the XML being searched.
The only way I can logically explain this to myself is to assume that the first XML tag of a 'puts node' is the "node name" and not part of the node's XML. This works for me going forward, but am I missing something that is going to bite me elsewhere?
CODE:
require 'nokogiri'

docxml = Nokogiri::XML(<<EOT)
<DIV><SPAN><DIV id='1'><H1>-H1-</H1><h1>-h1-</h1></DIV>
<DIV id='2'><H2>-H2-</H2> <h2>-h2-</h2></DIV>
<DIV id='3'><H3>-H3-</H3><h3>-h3-</h3></DIV>
</SPAN></DIV>
EOT
n0 = docxml.xpath('DIV')
n1 = n0.xpath('SPAN')
n2 = n1.xpath('DIV')
n3 = n2.xpath('*')
n4 = n3.xpath('*')
puts "n1:xpath('SPAN'): \n#{n1.xpath('SPAN')}\n#{'^'*80} \nn1 XML:\n#{n1}\n#{'^'*80}\
\nn1:inspect \n#{n1.inspect}\n#{'^'*80}\n"
OUTPUT:
=begin
n1:xpath('SPAN'):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1 XML:
<SPAN>
<DIV id="1"> <H1>-H1-</H1> <h1>-h1-</h1> </DIV>
<DIV id="2"> <H2>-H2-</H2> <h2>-h2-</h2> </DIV>
<DIV id="3"> <H3>-H3-</H3> <h3>-h3-</h3> </DIV>
</SPAN>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
n1:inspect
[#<Nokogiri::XML::Element:0x1c10964 name="SPAN"
children=[
#<Nokogiri::XML::Element:0x1c10820 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x18fff90 name="id" value="1">]
children=[#<Nokogiri::XML::Element:0x1c1064c name="H1" children=[#<Nokogiri::XML::Text:0x1c1ffe8 "-H1-">]>,
#<Nokogiri::XML::Element:0x1c10604 name="h1" children=[#<Nokogiri::XML::Text:0x1c1fdcc "-h1-">]>
]>,
#<Nokogiri::XML::Element:0x1c107d8 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1fc10 name="id" value="2">]
children=[#<Nokogiri::XML::Element:0x1c105bc name="H2" children=[#<Nokogiri::XML::Text:0x1c1f874 "-H2-">]>,
#<Nokogiri::XML::Text:0x1c1f778 " ">,
#<Nokogiri::XML::Element:0x1c10574 name="h2" children=[#<Nokogiri::XML::Text:0x1c1f5f8 "-h2-">]
>]>,
#<Nokogiri::XML::Element:0x1c10790 name="DIV" attributes=[#<Nokogiri::XML::Attr:0x1c1f43c name="id" value="3">]
children=[#<Nokogiri::XML::Element:0x1c1052c name="H3" children=[#<Nokogiri::XML::Text:0x1c1f0a0 "-H3-">]>,
#<Nokogiri::XML::Element:0x1c104e4 name="h3" children=[#<Nokogiri::XML::Text:0x1c1ee90 "-h3-">]
>]
>]
>]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
=end
Now that I have had some sleep, this explanation works for me:
'nodeset = xpath(tag1/tag2)' returns a 'nodeset' whose member is the 'tag2' node itself
'puts nodeset' displays that 'tag2' node, including its own tag
'nodeset.xpath('*')' returns the children of 'tag2'
'nodeset.xpath('tag2')' finds nothing, because a relative path is evaluated from the 'tag2' node and 'tag2' is not a child of itself
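The same behaviour can be reproduced outside Nokogiri. The following is a minimal sketch using Python's standard xml.etree.ElementTree (not Nokogiri, so only the relative-path semantics carry over); the trimmed-down document is an assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the document from the question.
root = ET.fromstring(
    "<DIV><SPAN>"
    "<DIV id='1'><H1>-H1-</H1></DIV>"
    "<DIV id='2'><H2>-H2-</H2></DIV>"
    "<DIV id='3'><H3>-H3-</H3></DIV>"
    "</SPAN></DIV>"
)

# The found node is the SPAN element itself, so serialising it starts with <SPAN>.
span = root.find("SPAN")
print(span.tag)

# A relative path is evaluated against the children of the context node.
span_in_span = span.findall("SPAN")   # SPAN is not a child of SPAN
divs_in_span = span.findall("DIV")    # the three DIV children

print(len(span_in_span), len(divs_in_span))
print([d.get("id") for d in divs_in_span])
```

Searching for 'SPAN' from the SPAN node finds nothing, while searching for 'DIV' finds its three children, which mirrors what n1.xpath('SPAN') and n1.xpath('DIV') do in the question.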

replacing html tag and its content using ruby gsub

I am trying to replace a <p>..</p> tag and its content in an HTML string with an empty string, using the following.
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "
When I did
string.gsub!(/<p.*?>|<\/p>/, '')
it replaced only the <p> and </p> tags with an empty string, but the content remained. How can I remove both the tag and its content?
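Your regex matches only the <p> and </p> tags themselves, not the content between them. Match the whole element instead; the /m flag is needed so that . also matches the newlines inside the paragraph:
string.gsub!(/<p>.*?<\/p>/m, '')
For example (note that this test string is single-quoted, so \n stays a literal backslash followed by n):
test = '\n <img alt=\"testing artice breaking news\" src=\"something.com" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n "'
test.gsub(/<p>.*?<\/p>/m, '')
Returns: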
"\\n <img alt=\\\"testing artice breaking news\\\" src=\\\"something.com\" />\\n \\n \""
Also, please consider @Tom Lord's comment: you can use Nokogiri to manipulate HTML.
First of all, consider using HTML parsers when parsing HTML, see How do I remove a node with Nokogiri?.
If you want to do it with a regex, you can use
string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
See the Rubular regex demo. This will work with tags that cannot be nested. Details:
<p(?:\s[^>]*)?> - <p, and an optional sequence of a whitespace and zero or more chars other than > (as many as possible), and then >
.*? - due to /m, any zero or more chars as few as possible
<\/p> - </p> string.
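For comparison, here is a rough Python equivalent of the non-nested pattern. It is only a sketch: re.DOTALL plays the role of Ruby's /m flag, the sample string is copied from the question, and the name P_ELEMENT is made up for illustration.

```python
import re

html = (
    '\n <img alt="testing artice breaking news" src="something.com" />'
    '\n <p>\n \tnew vision content for testing rss feeds\n </p>\n'
)

# <p, optional attributes, >, then as few characters as possible up to </p>.
# re.DOTALL lets "." match newlines, like the /m flag in Ruby.
P_ELEMENT = re.compile(r'<p(?:\s[^>]*)?>.*?</p>', re.DOTALL)

cleaned = P_ELEMENT.sub('', html)
print(repr(cleaned))
```

As with the Ruby version, this is only safe when <p> elements cannot be nested.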
If the tags can be nested, you can still use a regex:
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"
See the Rubular regex demo. Details:
<#{tagname} - < and tag name
(?:\s[^>]*)?> - an optional sequence of a whitespace and then zero or more chars other than >, and then >
(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)* - zero or more occurrences of
[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)* - zero or more chars other than <, and then zero or more sequences of a < that is not followed by tag name + whitespace or >, or by / + tag name + >, followed by zero or more chars other than <
|
\g<0> - the whole regex pattern recursed
<\/#{tagname}> - </ + tag name + >.
See a Ruby demo:
string = "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n <p>\n \tnew vision content for testing rss feeds\n </p>\n"
p string.gsub(/<p(?:\s[^>]*)?>.*?<\/p>/m, '')
tagname = "p"
rx = /<#{tagname}(?:\s[^>]*)?>(?:[^<]*(?:<(?!#{tagname}[\s>]|\/#{tagname}>)[^<]*)*|\g<0>)*<\/#{tagname}>/m
p string.gsub(rx, '')
# => "\n <img alt=\"testing artice breaking news\" src=\"something.com\" />\n \n"

MVC Kendo Grid ClientTemplate string not working

I'm trying to make this work but I don't understand what's wrong with this template string.
"#=" + column.ColumnName + "# #if(NumeroGermi != \"0\" && #=" + column.ColumnName + "# == \"Positivo\") {# <span class=\"fa fa-cog\"> </span> #} #"
I've noticed that the part that breaks it is the second condition in the if.
Written this way it works, but I still need the second condition:
"#=" + column.ColumnName + "# #if(NumeroGermi != \"0\") {# <span class=\"fa fa-cog\"> </span> #} #"
Give this a go:
"#=" + column.ColumnName + "# # if(NumeroGermi != \"0\" && " + column.ColumnName + " == \"Positivo\") { # <span class=\"fa fa-cog\"> </span> # } #"
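I don't believe you need the #= ... # wrapper in the second condition: the code inside #if(...) {# is already evaluated as JavaScript, so the column can be referenced by name directly.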

How to set a label to a variable in Watir

I am trying to assign some HTML content to a variable so I can use it in some if statements. But I get this instead:
Fail #<Watir::Browser:0x00000004440b98>
It looks like my variable isn't set to the text I want.
my html:
<label class="col-lg-12 control-label ng-binding" ng-show="productionReport.Status == 2 && productionReport.ReadyForPublishDate" style="">Text 1</label>
My Watir code:
msgText = 'Text 1'
msgText2 = @browser.label(:xpath, '/html/body/div[1]/div[3]/div/div/div/div/div/form/div/div/div[2]/label')
if (msgText == msgText2)
  puts 'Pass' "#{msgText2}"
else
  puts 'Fail' "#{msgText2}"
end
The problem is that msgText2 (i.e. the result of @browser.label) is set to a Watir::Label element rather than to its text, so comparing it with a string never matches.
To get the text of the label, you need to call the text method. For example:
msgText2_element = @browser.label(:xpath, '/html/body/div[1]/div[3]/div/div/div/div/div/form/div/div/div[2]/label')
msgText2 = msgText2_element.text

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 with TAGS="tag1,tag2" and move it to the first line, inside the <A> tag.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been able to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time thing it should be sufficient.
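import re

def emit(dt_line, dd_line):
    # Print a finished entry, skipping empty parts.
    for part in (dt_line, dd_line):
        if part:
            print(part)

current_dt = ""
current_dd = ""
with open('bookmarks.xml', 'r') as f:
    for line in f:
        line = line.rstrip('\n')
        if line.startswith('<DT>'):
            # Flush the previous entry first, so a <DT> without a <DD> is not lost.
            emit(current_dt, current_dd)
            current_dt, current_dd = line.strip(), ""
        elif line.startswith('<DD>'):
            current_dd = line
            tags = [w for w in line[4:].split() if w.startswith('#')]
            # Add TAGS="..." inside the opening <A ...> tag of the title line.
            current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join(t[1:] for t in tags) + '">', current_dt)
            for t in tags:
                current_dd = current_dd.replace(t, '', 1)
            # Tidy the leftover spaces; drop the line if only tags were present.
            current_dd = '<DD>' + current_dd[4:].strip()
            if current_dd == '<DD>':
                current_dd = ""
        else:
            # Any other line ends the current entry (it is not copied to the output).
            emit(current_dt, current_dd)
            current_dt, current_dd = "", ""
emit(current_dt, current_dd)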
If some parts of the code are not clear, just tell me. You can of course use Python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descriptor: the first match of ">"
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland
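
The listings above have lost their <A ...> anchors, so the effect on the title line is not visible in them. Below is a small sketch of the per-entry transformation in Python, assuming each title line has the form <DT><A HREF="...">Title</A>. The URL and the function name move_tags are made up for illustration.

```python
import re

def move_tags(dt_line, dd_line):
    """Move words starting with '#' from a <DD> line into TAGS="..." on the <A> tag."""
    words = dd_line[len("<DD>"):].split()
    tags = [w[1:] for w in words if w.startswith("#")]
    description = " ".join(w for w in words if not w.startswith("#"))
    if tags:
        attr = ' TAGS="' + ",".join(tags) + '"'
        # Insert the attribute just before the ">" that closes the <A ...> tag.
        dt_line = re.sub(r"(<A\b[^>]*)>",
                         lambda m: m.group(1) + attr + ">",
                         dt_line, count=1)
    # An empty string means the <DD> line held only tags and can be dropped.
    return dt_line, ("<DD>" + description if description else "")

# Hypothetical entry; the URL is a placeholder.
dt, dd = move_tags(
    '<DT><A HREF="https://example.com/">Icelandic Farm Holidays</A>',
    "<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland",
)
print(dt)
print(dd)
```

Keeping the transformation in a function that takes and returns plain strings makes it easy to check on a few entries before running it over all 3500.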
