Help with regex / ruby - ruby

Hey guys, I'm making a script to fetch words/results from this site (http://grecni.com/texttwist.php). I already have the HTTP POST request ready, etc.
The only thing I need now is to extract the words. I'm working with an HTML source that looks like this:
<html>
<head>
<title>Text Twist Unscrambler</title>
<META NAME="keywords" CONTENT="Text,Twist,Text Twist,Unscramble,Free,Source,php">
</head>
<body>
<font face="arial,helvetica" size="3">
<p>
<b>3 letter words</b><br>sae sac ess aas ass sea ace sec <p>
<b>4 letter words</b><br>cess secs seas ceca sacs case asea casa aces caca <p>
<b>5 letter words</b><br>cacas casas caeca cases <p>
<b>6 letter words</b><br>access <br><br>
Found 23 words in 0.22962 seconds
<form action="texttwist.php" method="post">
enter scrambled letters and I'll return all word combinations<br>
<input type="text" name="l" value="asceacas" size="20" maxlength="20">
<input type="submit" name="button" value="unscramble">
<input type="button" name="clear" value="clear" onClick="this.form.l.value='';">
</form><p>
<a href=texttwist.phps>php source</a>
- it's kinda ugly, but it's fast<p>
<a href=/>back to my page</a>
</body>
</html>
I'm trying to fetch the words like "sae", "sac", "secs", "seas", "casas", etc.
This is the farthest I've gotten, and I don't know what to do from here.
Any suggestions? Help?

Use an HTML parser like Nokogiri.

If you want any kind of robustness, you really want a parser; as Adrian mentioned, Nokogiri is the most popular solution.
If you insist, and are aware of the madness you may be in for as the page becomes more complex, the following may help:
Search for a line that matches
/^<b>\d+ letter words/
and then you can dig out the bits like so:
a = line.split(/<br>/)[1] # the second half
a.gsub!('<p>', '') # take out the trailing <p>
res = a.split(' ') # this is your data
That being said, this isn't anything you want in production code. You'll be surprised how learning a parser will change how you see this problem.
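As a rough sketch of the line-matching approach described above, here is how the pieces might fit together using only the Ruby standard library. It assumes each "<b>N letter words</b><br>..." group sits on its own line, as in the question's source; the helper name extract_words and the shortened sample are made up for illustration.

```ruby
# Hypothetical helper: collects the words from every
# "<b>N letter words</b><br>word word ..." line of the page source.
def extract_words(html)
  html.each_line.flat_map do |line|
    # Skip lines that are not a word-group line.
    next [] unless line =~ /^<b>\d+ letter words<\/b>/

    words = line.split(/<br>/)[1].to_s    # the text right after the first <br>
    words.gsub(/<[^>]+>/, ' ').split      # drop leftover tags such as <p>, split on whitespace
  end
end

# Shortened sample in the same shape as the page in the question.
sample = <<~HTML
  <p>
  <b>3 letter words</b><br>sae sac ess <p>
  <b>6 letter words</b><br>access <br><br>
  Found 4 words
HTML

words = extract_words(sample)
puts words.inspect
```

This is still string surgery on HTML, so it breaks as soon as the markup moves a group onto several lines; a parser remains the sturdier choice.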

Related

How to get specific xpath tag value

<div class="container">
<span class="price">
<bdi> 140 </bdi>
</span>
<span class="price">
<del>
<bdi>90</bdi>
</del>
<ins>
<bdi> 120 </bdi>
</ins>
</span>
</div>
I want to scrape a site whose HTML is formatted like the snippet above. I don't want the bdi value that is under the del tag; I want the bdi values that are directly under the span and under the ins tag. Is there an XPath for this?
Doesn't the usual //span/ins/bdi/text() work for you?
That is, "the text of a <bdi> whose parent is an <ins> whose parent is a <span>".
The CSS variant span>ins>bdi::text should also work, I suppose.
Sorry, I hadn't noticed that you need two values. In that case .xpath('//bdi[not(parent::del)]/text()').extract() will work well.

XPath: extract plain email text

I'm trying to extract the email text from a list but without success.
In particular I've used this code
//li/div/p//*[contains(., '#')]
but strangely it doesn't work! Could you help me?
Here's the code example:
<li class="bgmp_list-item">
<h3 class="bgmp_list-placemark-title">
Name1
</h3>
<div class="bgmp_list-description">
<p class="">
<strong class="">Responsible:</strong> John Doe <br>
<strong class="">Site:</strong> <a title="www.exemple.com" href="http://www.exemple.com" onclick="javascript:_gaq.push(['_trackEvent','outbound-article','www.2ld.it']);" target="_blank" class="">www.2ld.it</a>
<br>
<strong class="">Email:</strong> some_email#email.com
<br><strong class="">Address:</strong> 3, Main Street 00000, London <br>
<strong>Tel:</strong> 00 000000 <strong>Fax:</strong> 0000000
</p>
</div>
You're almost there but not quite. For the sample code the correct xpath would be
//p/text()[contains(.,'#')]
Rather than reinvent the wheel: there is a very good explanation of this in another answer.
By using p//*[contains(., '#')] you apply the predicate to the individual descendant elements of <p>, but no such element matches, because the target email address text is a direct child text node of <p>. This is one of the reasons why the initial XPath didn't work. Applying the predicate to <p> directly should work:
//li/div/p[contains(., '#')]
but that will return the <p> element. If you need to return only the text node that contains the email address, then the predicate should be applied to the individual text nodes within <p>, as mentioned in the other answer:
//li/div/p/text()[contains(., '#')]

accessing a <p> in Watir, which doesn't have attribute

I have this html code.
<div class="main" data-reactid=".0.2.1.1">
<div contenteditable="true" data-reactid=".0.2.1.1.0" autocomplete="off">
<p>
<br>
</p>
</div>
</div>
I have to write into the <p> tag. For this I wrote:
paragraph(:article_title) {div_element(:class=>'main').div(:index=>1).paragraph(:index=>1)}
but it is giving an error. I don't understand what is wrong in this.
There are a couple of problems:
Watir uses a 0-based index. As a result, div(:index=>1) actually means to find the 2nd div tag. As this does not exist, you will get an unable to locate element error.
div and paragraph are not methods defined in the page-object gem. You will get deprecation errors when you try to use them. It should be div_element and paragraph_element respectively.
Try doing:
paragraph(:article_title) {div_element(:class=>'main').div_element(:index=>0).paragraph_element(:index=>0)}
More simply, since :index => 0 is implied:
paragraph(:article_title){div_element(:class=>'main').div_element.paragraph_element}
As there is only one paragraph element, you could further simplify it to:
paragraph(:article_title) {div_element(:class=>'main').paragraph_element}

Shortest match in Regex [duplicate]

This is my regex:
/<strong>.*ingredients.*<\/ul>/im
Assuming the source code:
<strong>Contest closes on Thursday May 10th 2012 at 9pm PST</strong></div>
<br />
<br />
<br />
* I am not affiliated with Blue Marble Brands or Ines Rosales Tortas in any way. I am not sponsored by them and did not receive any compensation to write this post...I just simply think the Tortas are wonderful!<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<img border="0" height="480" mea="true" src="http://1.bp.blogspot.com/-35J5vNrXkqE/T6htXTafrmI/AAAAAAAAA5E/g2mtiuSpSmw/s640/food+003.JPG" width="640" /></div>
<br />
<strong><span style="font-size: large;">Ingredients:</span></strong><br />
<ul>
<li>Ines Rosales Rosemary and Thyme Tortas</li>
<li>Pizza Sauce (ready made in a jar)</li>
<li>Roma Tomatoes</li>
<li>Roasted Red Peppers </li>
<li>Marinated Artichoke Hearts</li>
<li>Olives (I used Pitted Spanish Manzanilla Olives)</li>
<li>Daiya Vegan Mozzarella Cheese</li>
</ul>
<span style="font-size: large;"><strong>Directions:</strong></span><br />
<br />
Spread small amount of pizza sauce over Torta.
The regex is greedy and grabs everything from <strong>Contest... to </ul>, but the shortest match should yield <strong><span style="font-size: large;">Ingredients...</ul>.
this is my gist: https://gist.github.com/3660370
::EDIT::
Please allow flexibility between the strong tag and ingredients, and between ingredients and ul.
Try this:
/<strong><span.*ingredients.*<\/ul>/im
Please refrain from regex-ing html. Use Nokogiri or a similar library instead.
This should work:
/(?!<strong>.*<strong>.*<\/ul>)<strong>.*?ingredients.*?<\/ul>/im
Basically, the regex uses a negative lookahead to avoid matching multiple <strong> tags before </ul>, like this: (?!<strong>.*<strong>.*<\/ul>)
I think this is what you're looking for:
/<strong>(?:(?!<strong>).)*ingredients.*?<\/ul>/im
Replacing the first .* with (?:(?!<strong>).)* allows it to match anything except another <strong> tag before it finds ingredients. After that, the non-greedy .*? causes it to stop matching at the first instance of </ul> it sees. (Your sample only contains the one <UL> element, but I'm assuming the real data could have more.)
The usual warnings apply: there are many ways this regex can be fooled even in perfectly valid HTML, to say nothing of the dreck we usually see out there.
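To see the difference between the original greedy pattern and the (?:(?!<strong>).)* version, here is a small Ruby sketch. The html string is a cut-down stand-in for the blog source in the question (the second <ul> is added purely to show where the non-greedy .*? stops), and the variable names are illustrative.

```ruby
# Cut-down stand-in for the page source in the question.
html = <<~HTML
  <strong>Contest closes on Thursday</strong>
  <br />
  <strong><span style="font-size: large;">Ingredients:</span></strong><br />
  <ul>
  <li>Roma Tomatoes</li>
  </ul>
  <strong>Directions:</strong>
  <ul><li>Spread sauce</li></ul>
HTML

# Original pattern: both .* are greedy, so the match starts at the first
# <strong> and runs to the last </ul>.
greedy = /<strong>.*ingredients.*<\/ul>/im

# Pattern from the answer: no second <strong> may appear before "ingredients",
# and .*? stops at the first </ul> after it.
tempered = /<strong>(?:(?!<strong>).)*ingredients.*?<\/ul>/im

greedy_match   = html[greedy]
tempered_match = html[tempered]

puts tempered_match
```

In Ruby the /m flag makes . match newlines, which is why both patterns can span several lines here.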

how to access this element

I am using Watir to write some tests for a web application. I need to get the text 'Bishop' from the HTML below but can't figure out how to do it.
<div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;">
<div class="workprolabel wpFieldLabel">
<span title="Please select a courtesy title from the list.">Title</span> <span class="validationIndicator wpValidationText"></span>
</div>
<span class="wpFieldViewContent" id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value"><p class="wpFieldValue ">Bishop</p></span>
</div>
Firebug tells me the xpath is:
html/body/form/div[5]/div[6]/div[2]/div[2]/div/div/span/span/div[2]/div[4]/div[1]/span[1]/div[2]/span/p/text()
but I can't get element_by_xpath to pick it up.
You should be able to access the paragraph right away if it's unique:
my_p = browser.p(:class, "wpFieldValue ")
my_text = my_p.text
See HTML Elements Supported by Watir
Try
//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']//text()
EDIT:
Maybe this will work
path = "//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']/p"
ie.element_by_xpath(path).text
And check if the span's id is constant
Maybe you have an extra space in the end of the name?
<p class="wpFieldValue ">
Try one of these (worked for me, please notice trailing space after wpFieldValue in the first example):
browser.p(:class => "wpFieldValue ").text
#=> "Bishop"
browser.span(:id => "dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value").text
#=> "Bishop"
It seems that at run time the DIV's style changes from NONE to BLOCK.
In that case, we can grab the page source (the whole page or just the DIV) and pull the value out of it with a regex.
For example:
html = ie.html # page source; ie.text would drop the tags the regexes below rely on
particular_div = html.scan(%r{div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;(.*)</span>\s*</div>}im).flatten.first.to_s
value = particular_div.scan(%r{<p class="wpFieldValue ">(.*?)</p>}im).flatten.first
The code above is only a sample, but it should solve your problem.
