How can I reach this node with Nokogiri? - ruby

Here's the start of my html:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="Generator" content="Microsoft Word 12 (filtered medium)">
<!--[if !mso]><style>v\\:* {behavior:url(#default#VML);}\no\\:* {behavior:url(#default#VML);}\nw\\:* {behavior:url(#default#VML);}\n.shape {behavior:url(#default#VML);}\n</style><![endif]--><style><!--\n/* Font Definitions */\n#font-face\n\t{font-family:"Cambria Math";\n\tpanose-1:2 4 5 3 5 4 6 3 2 4;}\n#font-face\n\t{font-family:Calibri;\n\tpanose-1:2 15 5 2 2 2 4 3 2 4;}\n#font-face\n\t{font-family:Tahoma;\n\tpanose-1:2 11 6 4 3 5 4 4 2 4;}\n/* Style Definitions */\np.MsoNormal, li.MsoNormal, div.MsoNormal\n\t{margin:0in;\n\tmargin-bottom:.0001pt;\n\tfont-size:12.0pt;\n\tfont-family:"Times New Roman","serif";}\na:link, span.MsoHyperlink\n\t{mso-style-priority:99;\n\tcolor:blue;\n\ttext-decoration:underline;}\na:visited, span.MsoHyperlinkFollowed\n\t{mso-style-priority:99;\n\tcolor:purple;\n\ttext-decoration:underline;}\np\n\t{mso-style-priority:99;\n\tmso-margin-top-alt:auto;\n\tmargin-right:0in;\n\tmso-margin-bottom-alt:auto;\n\tmargin-left:0in;\n\tfont-size:12.0pt;\n\tfont-family:"Times New Roman","serif";}\nspan.EmailStyle18\n\t{mso-style-type:personal-reply;\n\tfont-family:"Calibri","sans-serif";\n\tcolor:#1F497D;}\n.MsoChpDefault\n\t{mso-style-type:export-only;\n\tfont-size:10.0pt;}\n#page WordSection1\n\t{size:8.5in 11.0in;\n\tmargin:1.0in 1.0in 1.0in 1.0in;}\ndiv.WordSection1\n\t{page:WordSection1;}\n--> </style>
<!--[if gte mso 9]><xml>\n<o:shapedefaults v:ext="edit" spidmax="1026" />\n</xml><![endif]--> <!--[if gte mso 9]> <xml>\n<o:shapelayoutv:ext="edit">\n<o:idmapv:ext="edit"data="1"/>\n</o:shapelayout></xml><![endif]-->
</head>
<body lang="EN-US" link="blue" vlink="purple">
<div class="WordSection1">
<p class="MsoNormal"><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><p> </p></span></p>
<p class="MsoNormal"><a name="_MailEndCompose"><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";color:#1F497D'><p> </p></span></a></p>
<div><div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in"><p class="MsoNormal"><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'> EMAIL SENDER NAME [mailto:EMAILADDRESS@FAKE.COM] <br><b>Sent:</b>!! DATE I NEED TO GRAB HERE !! <br><b>To:</b> EMAIL ADDRESS HERE <br><b>Subject:</b> SUBJECT LINE HERE <p></p></span></p></div></div>
I need to grab the date the email was sent. Here's what I've tried:
label_tag_name = 'div div p span br b'
if label_tag = @doc.at_css(%Q{#{label_tag_name}:contains("#{label}:")})
  @attributes[field] = label_tag.text.gsub("#{label}:",'').gsub("\\n", "").strip
end
I also tried some shorter paths in the label_tag_name, basically adding another HTML tag to the beginning.
Every time though, the sent date is coming back nil.

The bit of your source you're interested in is (I've removed attributes for clarity):
<div>
<div>
<p>
<b>
<span>From:</span>
</b>
<span> EMAIL SENDER NAME [mailto:EMAILADDRESS@FAKE.COM] <br>
<b>Sent:</b>!! DATE I NEED TO GRAB HERE !! <br>
<b>To:</b> EMAIL ADDRESS HERE <br>
<b>Subject:</b> SUBJECT LINE HERE <p></p>
</span></p></div></div>
Note that br tags in HTML are void (self-closing) elements, so they never have children and it's pointless looking for child elements of them. That is why a selector such as div div p span br b matches nothing.
The target could be described with the CSS selector div div p span, but note that two nodes match that, and at_css returns only the first. You could use div div p>span to restrict the match to spans that are immediate children of the p; only one span in the document matches that. The actual target is a text node inside this element. In particular, it's the node immediately after the first b tag. So if we expand the CSS selector to div div p>span b, we can use Nokogiri's next method to get the target text node (call text on it if you need a plain string):
date_string = @doc.at_css('div div p>span b').next
If you want the other fields, you could use css instead of at_css:
date_string = @doc.css('div div p>span b')[0].next
to_string = @doc.css('div div p>span b')[1].next
subject_string = @doc.css('div div p>span b')[2].next
I'll leave getting the sender name as an exercise for you!

There isn't much to navigate on in that document. Use a selector that gets you to the closest point reliably then grab the text with a regex:
> doc.css("div.WordSection1 p.MsoNormal span").text[/Sent:\n(.*)/, 1]
=> " !! DATE I NEED TO GRAB HERE !! To:"

I'd start with this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<head>
<title></title>
</head>
<body>
<div class="WordSection1">
<div>
<div>
<b>Sent:</b>!! DATE I NEED TO GRAB HERE !!<br>
<b>To:</b> EMAIL ADDRESS HERE<br>
<b>Subject:</b> SUBJECT LINE HERE
</div>
</div>
</div>
</body>
</html>
EOT
text = doc.at('div.WordSection1').text
sent_date = text[/Sent:(.+?)To:/m, 1].strip
puts sent_date
Which outputs this:
!! DATE I NEED TO GRAB HERE !!
The sample HTML is a mess so you can't easily see the particular trees you want in that forest. Strip out everything that isn't essential for navigation, then build your search.
And, while a parser is a great tool, sometimes it's easier to use it to get to the text you want, then grab the particular thing via a string search.
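As a rough sketch of that last point, here is one way the string-search step could pull out every labelled field at once, using only plain Ruby. The header_text string is a stand-in assembled from the placeholders in the question, and the LABELS list and FIELD_PATTERN are illustrative names and assumptions about the email layout, not anything Nokogiri provides.

```ruby
# Flattened header text, roughly what a parser's `text` method might return.
# This string is a stand-in built from the placeholders in the question.
header_text = 'From: EMAIL SENDER NAME [mailto:EMAILADDRESS@FAKE.COM] ' \
              'Sent: !! DATE I NEED TO GRAB HERE !! ' \
              'To: EMAIL ADDRESS HERE ' \
              'Subject: SUBJECT LINE HERE'

# Labels we expect to find (an assumption about the email layout).
LABELS = %w[From Sent To Subject].freeze
LABEL_ALTERNATION = LABELS.join('|')

# Capture a label, then lazily capture everything up to the next label
# or the end of the string. Matching is case-sensitive, so "mailto:" is
# not mistaken for the "To:" label.
FIELD_PATTERN = /\b(#{LABEL_ALTERNATION}):\s*(.*?)(?=\s*\b(?:#{LABEL_ALTERNATION}):|\z)/m

fields = header_text.scan(FIELD_PATTERN).to_h { |label, value| [label, value.strip] }

puts fields['Sent']
```

With real mail the labels may be localized or appear inside the body text, so treat the pattern as a starting point rather than a general parser.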

Related

UiPath selector not working in 'Click Button' Activity

I'm using UiPath Studio Pro 2020.10.6 with Chrome.
I can't find a common selector that works on both of these pages:
First case (the one that fails)
Second case (the one that works)
The selectors offered by the tool are the following:
Case 1:
<html app='chrome.exe' title='Sephora X Coach - Palette de fards à paupières Rexy de SEPHORA COLLECTION ≡ SEPHORA' />
<webctrl id='add-to-cart' tag='BUTTON' type='submit' />
Case 2:
<html app='chrome.exe' title='Kit maquillage des yeux de SEPHORA COLLECTION ≡ SEPHORA' />
<webctrl id='add-all-to-cart' tag='BUTTON' type='submit' />
Removing the title attribute does not solve the problem.
Using wildcards does not work either.
I looked into fuzzy search: https://docs.uipath.com/studio/docs/fuzzy-search-capabilities
I tested fuzzy matching on the first page's title in Python, following this tutorial: https://www.datacamp.com/community/tutorials/fuzzy-string-python
The result is like in the picture above:
But it doesn't work for the first case either, even after lowering the level to 0.1:
<html app='chrome.exe' title='"+SelectorString+ "' matching:title='fuzzy' fuzzylevel:title='0.3' /><webctrl id='add-all-to-cart' tag='BUTTON' type='submit' />
with
SelectorString = "Sephora X Coach"
I'm out of ideas. Could it be because in the first case the button is inside a form (judging from the page's HTML), while in the second case it is not?
Thank you in advance for your help.
I found the solution. I changed the selector to this:
<html app='chrome.exe' />
<webctrl id='add[-all]*-to-cart' matching:id='regex' tag='BUTTON' type='submit' />
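A side note on that pattern, sketched in Ruby purely to exercise the regular expression (UiPath evaluates the selector itself, and its regex engine may differ in details). [-all]* is a character class, so it accepts any run of the characters -, a and l, not just the literal text -all. The anchors, the stricter alternative with an optional group, and the third sample id below are my own assumptions for illustration.

```ruby
# The pattern from the selector, anchored so it must match the whole id.
LOOSE = /\Aadd[-all]*-to-cart\z/

# A stricter variant (an assumption, not from the post): "-all" is one
# optional group instead of a character class.
STRICT = /\Aadd(?:-all)?-to-cart\z/

# The first two ids come from the question; the third is a made-up id
# that shows the difference between the two patterns.
ids = %w[add-to-cart add-all-to-cart add-lala-to-cart]

ids.each do |id|
  puts format('%-18s loose=%-5s strict=%s', id, LOOSE.match?(id), STRICT.match?(id))
end
```

Both patterns accept the two real button ids; only the loose one also accepts the made-up third id.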

How to get the second to last script closing tag using Nokogiri

I need to get the second to last script closing tag using Nokogiri.
Example code:
<head>
<script src="first.js"></script>
<script src="second.js"></script>
<!-- How to place some scripts here? -->
<script>
// init load
</script>
</head>
I tried code like doc.css('/html/head/script')[-2], but that places the new code inside the script tag.
It's not completely clear what you want because you didn't give us an expected result, but this seems like what you're saying:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<head>
<script src="first.js"></script>
<script src="second.js"></script>
<!-- How to place some scripts here? -->
<script>
// init load
</script>
</head>
</html>
EOT
doc.css('script')[-2].add_next_sibling("\n<script src='new_script.js'></script>")
Which results in:
doc.to_html
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
# "<html>\n" +
# " <head>\n" +
# "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
# " <script src=\"first.js\"></script>\n" +
# " <script src=\"second.js\"></script>\n" +
# "<script src=\"new_script.js\"></script>\n" +
# " <!-- How to place some scripts here? -->\n" +
# " <script>\n" +
# " // init load\n" +
# " </script>\n" +
# " </head>\n" +
# "</html>\n"
Nokogiri's XML::Node documentation is full of useful methods. I'd recommend reading it many times.
Nokogiri doesn't know about closing tags. After parsing it knows there's an object and that object has siblings in the hierarchy, so we can search for the objects and then, in this case, insert a new node. If you ask it to output the HTML, then based on the rules for HTML it will supply closing tags, even if they were not there in the first place.

WebDriver Capture Text by XPath

I am attempting to capture a line of text for an automated WebDriver test so that I can use it in a comparison later on. However, I cannot find an XPath that will work with WebDriver. I have used the text() function before to capture text that is not in a tag, but in this instance that is not working. Here is the HTML. Note that this text will never be the same, so I cannot use contains or similar functions.
<div id="content" class="center ui-content" data-role="content" role="main">
<div data-iscroll="scroller">
<div class="ui-corner-all ui-controlgroup ui-controlgroup-vertical" data-role="controlgroup">
<a class="ui-btn ui-corner-top ui-btn-hover-c" style="text-align: left" data-role="button" onclick="onDocumentClicked(21228772, "document.php?loan=********&folderseq=0&itemnum=21228772&pageCount=3&imageTypeName=1003 Application - Final&firstInitial=&lastName=")" href="#" data-corners="true" data-shadow="true" data-iconshadow="true" data-wrapperels="span" data-theme="c">
<span class="ui-btn-inner ui-corner-top">
<span class="ui-btn-text">
<img class="checkMark checkMark21228772 notViewedCompletely" width="15" height="15" title="You have not yet viewed this document." src="../images/white_dot.gif"/>
1003 Application - Final. (Jan 11 2012 5:04PM)
</span>
</span>
</a>
In this example, the text I am attempting to capture is: 1003 Application - Final. (Jan 11 2012 5:04PM)
I have inspected the element with Firebug and I have tried the following XPaths with no success.
html/body/div[1]/div[2]/div/div/a[1]/span/span
html/body/div[1]/div[2]/div/div/a[1]/span/span/text()
The WebDriver test is being written in C#.
You can either use this
driver.FindElement(By.XPath("//div[@id='content']//span[@class='ui-btn-text']"));
or
var elem = driver.FindElement(By.Id("content"));
string text = string.Empty;
if (elem != null) {
    // The span is a descendant of the content div, so search below it.
    var textElem = elem.FindElement(By.XPath(".//span[@class='ui-btn-text']"));
    if (textElem != null) text = textElem.Text;
}
I was able to solve this issue by removing the span tags from the XPath.
GetText("html/body/div[3]/div[2]/div/div/a[1]", SelectorType.XPath);
Python WebDriver code looks something like
driver.find_element_by_xpath("//span[@class='ui-btn-text']").text
But the locator may not be unique, because I can't see all of the page's code.
PS Try to never use locators like html/body/div[1]/div[2]/div/div/a[1]/span/span
Approach:
Find the CSS Selector from the Given DOM
Derived CSS:css=#content div.ui-controlgroup > a[onclick*='onDocumentClicked'] > span > span
Use the C# Library Method to get the Text.

Get Text between two tags using nokogiri

My HTML structure is
<div class="line">
<h2>Header</h2>
<h3>Mailing Address</h3>
2349 Glorem ipsun lorem ipsum CA 95833<br>
<br>
Phone: 111-111-2111 Fax: 111-511-1111<br>
<a onfocus="blur()" target="_blank" href="">some text</a><br>
<a onfocus="blur()" target="_blank" href="">some address</a><br>
<div><p></p></div>
<h3>Contact(s)</h3>
</div>
The HTML page contains several <div class=line></div> elements. For each div I need to extract the phone and fax numbers into an array along with other data. I tried using
doc.css("div#ctl00_cphContent_divBrowseByMember").each do |div|
div.css("div.line").each do |line|
line.xpath('//text()[preceding-sibling::br and following-sibling::a]').text.strip
end
end
It returns nothing and eventually fails with a timeout error.
If I instead try
line.xpath('//text()[preceding-sibling::br and following-sibling::a]')[0].text.strip
it returns the same phone and fax for every div. Please suggest any other solution that will help me.
The easy way:
phone, fax = line.text.scan /\d{3}-\d{3}-\d{4}/
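A minimal sketch of that approach on one block of text, in plain Ruby. The line_text string stands in for what line.text might return for a single div.line, and the label-based lookups are an extra variant I'm assuming could be useful when a div lists only one of the two numbers. Separately, the reason the original attempt returned the same values for every div is that an XPath starting with // searches from the document root even when called on a node; writing .// keeps the search inside the current line.

```ruby
# Stand-in for the text of one <div class="line">, using the sample
# values from the question.
line_text = <<~TEXT
  Header
  Mailing Address
  2349 Glorem ipsun lorem ipsum CA 95833
  Phone: 111-111-2111 Fax: 111-511-1111
  Contact(s)
TEXT

# Positional version: the first match is the phone, the second the fax.
phone, fax = line_text.scan(/\d{3}-\d{3}-\d{4}/)

# Label-based version: each lookup returns nil if that label is missing,
# so a div with only a fax number does not shift the values.
labelled_phone = line_text[/Phone:\s*(\d{3}-\d{3}-\d{4})/, 1]
labelled_fax   = line_text[/Fax:\s*(\d{3}-\d{3}-\d{4})/, 1]

puts "phone=#{phone} fax=#{fax}"
puts "phone=#{labelled_phone} fax=#{labelled_fax}"
```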

XPath Query to select hyperlink

The following is a subset of xml from a twitter atom feed:
<entry>
<id>tag:search.twitter.com,2005:18232030105964545</id>
<published>2010-12-24T09:10:29Z</published>
<link type="text/html" rel="alternate" href="http://twitter.com/KTNKenya/statuses/18232030105964545"/>
<title>Synovate Poll: PM Raila Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx</title>
<content type="html">Synovate Poll: PM <b>Raila</b> Odinga remains the preffered presidential candidate at 42% while Uhuru Kenyatta is at 14%... <a href="http://fb.me/yjmMbmBx">http://fb.me/yjmMbmBx</a></content>
<updated>2010-12-24T09:10:29Z</updated>
<link type="image/png" rel="image" href="http://a3.twimg.com/profile_images/701825859/NEW_KTN_normal.png"/>
<google:location>nairobi, kenya</google:location>
<twitter:geo>
</twitter:geo>
<twitter:metadata>
<twitter:result_type>recent</twitter:result_type>
</twitter:metadata>
<twitter:source><a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a></twitter:source>
<twitter:lang>en</twitter:lang>
<author>
<name>KTNKenya (KTN Kenya)</name>
<uri>http://twitter.com/KTNKenya</uri>
</author>
</entry>
From the <title>...</title> element, I need to select the hyperlink http://fb.me/yjmMbmBx via an XPath query. How do I do it? Is it possible?
*I'm an XPath newbie.
Thanks.
You have two options:
Use <title> (xpath: "/entry/title/text()") and get the URL yourself (e.g. using a regex, or by finding the last instance of "http://" in the string).
Get the data first:
/entry/content[@type="html"]/text()
Then you need to parse this as HTML and extract any tags, and use the href attribute of those tags. How you do this last part depends on the language/environment you are doing this in.
Update: Added basic example code for option 1 above, as requested:
xmlpp::Element *node = parser.get_document()->get_root_node();
xmlpp::NodeSet results = node->find("/entry/title/text()");
xmlpp::ContentNode* content = dynamic_cast<xmlpp::ContentNode*>(results.front());
std::string text = content->get_content();
std::string link = "";
// Find the last URL in the title: try http:// first, then https://.
// rfind returns std::string::npos (an unsigned value) when nothing is found.
std::string::size_type res = text.rfind("http://");
if (res == std::string::npos)
    res = text.rfind("https://");
if (res != std::string::npos)
    link = text.substr(res);
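For comparison, the regex variant of option 1 might look like this in Ruby. This is only a sketch: it assumes the title text has already been extracted with whatever XML library you use, and URL_PATTERN is a deliberately simple illustrative pattern (scheme followed by non-whitespace), not a full URL grammar.

```ruby
# Title text as it appears in the sample entry from the question.
title = 'Synovate Poll: PM Raila Odinga remains the preffered presidential ' \
        'candidate at 42% while Uhuru Kenyatta is at 14%... http://fb.me/yjmMbmBx'

# Simple illustrative pattern: http:// or https:// followed by
# everything up to the next whitespace.
URL_PATTERN = %r{https?://\S+}

# Take the last URL in the title, mirroring the rfind approach.
link = title.scan(URL_PATTERN).last

puts link
```

If the title contains no URL, link is nil, so check for that before using it.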
With atom prefix bound to http://www.w3.org/2005/Atom namespace URI, use:
/atom:feed/atom:entry/atom:title[contains(.,'http://')]
This selects every atom:title element child of atom:entry, having the string "http://" contained in its string value.
