Remove indentation from code - ruby

I'm trying to create a function that removes extraneous starting tabs from code to make it display more neatly. As in, I would like my function to turn this:
		<div>
			<div>
				<p>Blah</p>
			</div>
		</div>
into this:
<div>
	<div>
		<p>Blah</p>
	</div>
</div>
(The goal of all this is to create a Rails partial into which I can paste formatted code to be displayed in a pre tag justified to the left).
So far, I've got this, but it's erroring, and I don't know why. Never used gsub before, so I'm guessing the problem is there (though the debugging notes also point at the first "end" line):
def tab_stripped(code)
  # find number of tabs in first line
  char_array = code.split(//)
  counter = 0
  char_array.each do |c|
    counter ++ if c == "\t"
    break if c != "\t"
  end
  # delete that number of tabs from the beginning of each line
  start_tabs = ""
  counter.times do
    start_tabs += "\t"
  end
  code.gsub!(start_tabs, '')
  code
end
Any ideas?

One from my personal library (with minor modifications):
class String
  def unindent; gsub(/^#{scan(/^\s+/).min}/, "") end
end
It is more general than what you are asking for. It takes care of not just tabs, but spaces as well, and it does not adjust to the first line, but to the least indented line.
puts <<X.unindent
        <div>
            <div>
                <p>Blah</p>
            </div>
        </div>
X
gives:
<div>
    <div>
        <p>Blah</p>
    </div>
</div>
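As for the error in the question's own code: Ruby has no ++ operator, so the counter line needs counter += 1, and gsub without a ^ anchor removes the run of tabs wherever it occurs, not only at the start of a line. Here is a minimal sketch of the first-line approach with both points addressed. Counting the tabs with a regex is my substitution, not the asker's code.

```ruby
# Sketch of the asker's tab_stripped: strip as many leading tabs from every
# line as the first line has. The regex-based counting is an assumption,
# swapped in for the character loop in the question.
def tab_stripped(code)
  # Number of tabs at the very start of the string (0 if there are none).
  leading_tabs = code[/\A\t*/].length
  # Remove at most that many tabs, and only at the start of each line.
  code.gsub(/^\t{0,#{leading_tabs}}/, "")
end

puts tab_stripped("\t\t<div>\n\t\t\t<p>Blah</p>\n\t\t</div>\n")
```

Unlike gsub!, this returns a new string and leaves the argument untouched.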

Related

Result of XPath is object text error: how do I get around this in Ruby on a site built around hiding everything?

My company hides most of the data on its website, and I'm trying to create a driver that will scan closed jobs to populate an array used to create new jobs, so that users need no input or database access.
I did some research, and it seems this can't be done the way I'm doing it:
# Scan page and place 4 different Users into an array
String name = [nil, nil, nil, nil]
String compare_name = nil
c = 0
tr = 1
while c < 4
  String compare_name = driver.find_element(:xpath, '//*[@id="job_list"]/tbody/tr['+tr.to_s+']/td[2]/span[1]/a/span/text()[2]').gets
  if compare_name != name[c]
    name[c] = compare_name
    c = +1
    tr = +1
  else if compare_name == name[c]
    tr = +1
  end
  end
end
I'm also a newbie learning as I go, so this might not be optimal; it's just how I've learned to do what I want.
Here is the website code for the item I want on the screen:
<span ng-if="job.customer.company_name != null && job.customer.company_name != ''" class="pointer capitalize ng-scope" data-toggle="tooltip" data-placement="top" title="" data-original-title="406-962-5835">
<a href="/#/edit_customer/903519" class="capitalize notranslate">
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
</a>
</span>
XPath can find the Diago Stein area, but because it is a text object, it doesn't work. Note that all the class titles, button names, etc. are the same as everything else on the page. They always do that, which makes it even harder to scan, because those same things are likely to appear elsewhere in places that have nothing to do with this area of the site.
Is there any way to grab this text, based on the HTML, without knowing what might be in the text area? Note that "Name Stuff" is a generic name I substituted for the real company name for privacy.
Thanks for any ideas, suggestions, and help.
EDIT: Clarification: I will NOT know the name of the company or the user name (in this case Diago Stein). The entire purpose of this part of the code is to populate an array with the customers' names from this table on the closed page.
You can back your XPath up one level to
//*[@id="job_list"]/tbody/tr[' + tr.to_s + ']/td[2]/span[1]/a/span
then grab the innerText. The SPAN is
<span class="ng-binding">Name Stuff<br>
<!-- ngIf: ::job.customer.is_cip_user --
<i ng-if="::job.customer.is_cip_user" class="fa fa-user-circle-o ng-scope">
::before == $0
</i>
> Diago Stein</span>
The problem is that this HTML has some conditionals in it, which make it hard to read and hard to figure out what's actually there. If we strip out the conditional, we are left with
<span class="ng-binding">Name Stuff<br>Diago Stein</span>
If we take the innerText of this, we get
Name Stuff
Diago Stein
You can then split that string on the newline: part 0 is 'Name Stuff' and part 1 is 'Diago Stein'. So use your locator to find the SPAN, get its innerText, split it on the newline, and take the second part to get the string you want.
This code isn't tested, but it should be something like
name = driver.find_element(:xpath => "//*[@id='job_list']/tbody/tr[#{tr}]/td[2]/span[1]/a/span").text.split("\n")[1]
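The split-and-collect step can be tried without a browser. Here is a minimal sketch in plain Ruby, assuming each table cell's visible text has already been fetched as a string, and assuming "4 different users" means four distinct names. The helper names customer_name and first_unique_names are hypothetical, not part of Selenium.

```ruby
# Hypothetical helper: the second non-empty line of a cell's visible text,
# e.g. "Name Stuff\nDiago Stein" -> "Diago Stein".
def customer_name(cell_text)
  cell_text.split("\n").map(&:strip).reject(&:empty?)[1]
end

# Hypothetical helper: walk the rows in order and keep the first `limit`
# distinct customer names.
def first_unique_names(cell_texts, limit = 4)
  names = []
  cell_texts.each do |text|
    name = customer_name(text)
    names << name if name && !names.include?(name)
    break if names.size >= limit
  end
  names
end

# Made-up sample rows standing in for the text of each table cell.
rows = ["Name Stuff\nDiago Stein", "Name Stuff\nDiago Stein", "Other Co\nAnn Lee"]
p first_unique_names(rows, 2)
```

In a real run, the strings in rows would come from calling .text on each located SPAN.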

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> with id="f5". With my current XPath selector that's not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to be directly underneath the div from the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
The XPath above finds every element named either div or span whose id attribute equals "f5", in document order.
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>
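The point about document order can be checked without Nokogiri. This is a rough sketch using REXML instead (a swapped-in library, so it needs well-formed XML with quoted attributes). It selects every element with id "f5" and then filters by element name in Ruby rather than with the self:: axis. The sample markup is a shortened, made-up stand-in for the file in the question.

```ruby
require "rexml/document"

# Shortened, well-formed stand-in for the question's markup (an assumption;
# the real file uses unquoted attributes, which REXML would reject).
F5_SAMPLE = <<~XML
  <root>
    <span id="f6">heading</span>
    <span id="f5">003874</span>
    <div id="f5">first div</div>
    <div id="f5">second div</div>
    <div>no id</div>
  </root>
XML

# Text of every div or span with id "f5", in the order they appear.
def f5_texts(xml)
  doc = REXML::Document.new(xml)
  REXML::XPath.match(doc, "//*[@id='f5']")
              .select { |el| %w[div span].include?(el.name) }
              .map(&:text)
end

p f5_texts(F5_SAMPLE)
```

With Nokogiri, the single XPath from the answer should make the Ruby-side filter unnecessary.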

Using AWK/Grep/Bash to extract data from HTML

I'm trying to make a Bash script to extract results from an HTML page.
I managed to get the content of the page with Curl, but the next step, parsing the output, is problematic.
The interesting content of the page looks like this:
<div class="result">
    ...
    <div class="item">
        <div class="item_title">ITEM 1</div>
    </div>
    ...
    <div class="item_desc">
        ITEM DESCRIPTION 1
    </div>
    ...
</div>
<div class="result">
    ...
    <div class="item">
        <div class="item_title">ITEM 2</div>
    </div>
    ...
    <div class="item_desc">
        ITEM DESCRIPTION 2
    </div>
    ...
</div>
I'd like to output something like:
ITEM1;ITEM DESCRIPTION 1
ITEM2;ITEM DESCRIPTION 2
I know a bit of Grep, but I can't wrap my mind around making it work here. Some people also told me to use Awk, which seems better suited to this kind of task.
I'd appreciate any help.
Thank you very much.
A bare minimal program to handle the HTML, loosely, with no validation, and easily confused by variations in the HTML, is:
sed.script
/ *<div class="item_title">\(.*\)<\/div>/ { s//\1/; h; }
/ *<div class="item_desc">/,/<\/div>/ {
    /<div class="item_desc">/d
    /<\/div>/d
    s/^ *//
    G
    s/\(.*\)\n\(.*\)/\2;\1/p
}
The first line matches item title lines. The s/// command captures just the part between the <div …> and </div>; the h copies that into the hold space (memory).
The rest of the script matches lines between the item description <div> and its </div>. The first two lines delete (ignore) the <div> and </div> lines. The s/// removes leading spaces; the G appends the hold space to the pattern space after a newline; the s///p captures the part before the newline (the description) and the part after the newline (the title from the hold space), and replaces them with the title and description, separated by a semi-colon, and prints the result.
Example
$ sed -n -f sed.script items.html
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2
$
Note the -n; that means "don't print unless told to do so".
You can do it without a script file, but there's less to worry about if you use one. You can probably even squeeze it all onto one line if you're careful. Beware that the ; after the h is necessary with BSD sed and harmless but not crucial with GNU sed.
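For readers who find the hold-space logic hard to follow, the same idea can be sketched in Ruby: remember the most recent title, and emit "title;description" for each line inside an item_desc block. This is a loose illustration under the same assumptions as the sed script (one tag per line, no validation). The method name pair_items and the sample data are made up.

```ruby
# Made-up sample in the shape shown in the question.
ITEMS_SAMPLE = <<~HTML
  <div class="result">
      ...
      <div class="item">
          <div class="item_title">ITEM 1</div>
      </div>
      <div class="item_desc">
          ITEM DESCRIPTION 1
      </div>
  </div>
  <div class="result">
      <div class="item">
          <div class="item_title">ITEM 2</div>
      </div>
      <div class="item_desc">
          ITEM DESCRIPTION 2
      </div>
  </div>
HTML

# Hypothetical helper mirroring the sed script line by line.
def pair_items(html)
  title = nil      # plays the role of sed's hold space
  in_desc = false  # true between the item_desc <div> and its </div>
  rows = []
  html.each_line do |line|
    if (m = line.match(%r{<div class="item_title">(.*)</div>}))
      title = m[1].strip
    elsif line.include?('<div class="item_desc">')
      in_desc = true
    elsif in_desc && line.include?("</div>")
      in_desc = false
    elsif in_desc
      text = line.strip
      rows << "#{title};#{text}" unless text.empty?
    end
  end
  rows
end

puts pair_items(ITEMS_SAMPLE)
```

Like the sed version, this emits one output line per description line, so a multi-line description would produce several rows.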
Modification
There are all sorts of ways to make it more nearly bullet-proof (but it is debatable whether they're worthwhile). For example:
/ *<div class="item_title">\(.*\)<\/div>/
could be revised to:
/^[[:space:]]*<div class="item_title">[[:space:]]*\(.*\)[[:space:]]*<\/div>[[:space:]]*$/
to deal with arbitrary sequences of white space before, in the middle, and after the <div> components. Repeat ad nauseam for the other regexes. You could arrange to have single spaces between words. You could arrange for a multi-line description to be printed just once as a single line, rather than each line segment being printed separately as it would be now.
You could also wrap the whole construct in the file inside:
/^<div class="result">$/,/^<\/div>$/ {
…script as before…
}
And you could repeat that idea so that the item title is only picked inside <div class="item"> and </div>, etc.
Just use awk:
awk -F '<[^>]+>' '
    found { sub(/^[[:space:]]*/,";"); print title $0; found=0 }
    /<div class="item_title">/ { title=$2 }
    /<div class="item_desc">/ { found=1 }
' file
ITEM 1;ITEM DESCRIPTION 1
ITEM 2;ITEM DESCRIPTION 2

preg_match_all skips one nested tag

If you look at this tag:
$text = '<div class="inner">
    <div class="left">
        <h4>text </h4>
        <p>Abdijstreet 42b<br>2000 city </p>
    </div>
    <div class="right">
        <span class="red">10:00 - 14:00</span>
    </div>
</div>';
I use this to preg_match:
preg_match_all("'<div class=\"inner\">(.*?)</div>'si", $text, $match); // the ul tags
$match[1] = array_splice($match[0], 0);
foreach ($match[1] as $val) // whole page
{
    echo $val;
}
I have tried many things, but I only get what is between the first pair of div tags and never the rest of what I need. What am I doing wrong?
Are you trying to get everything between the beginning and ending div tags? If so, then you're really close. All you need to do is remove the question mark ? from your expression. The question mark makes .* lazy: it matches as few characters as possible, so the match stops at the first closing div tag it reaches. If you leave it out, .* is greedy and keeps matching until the last closing div tag it can find.
$text = '<div class="inner">
    <div class="left">
        <h4>text </h4>
        <p>Abdijstreet 42b<br>2000 city </p>
    </div>
    <div class="right">
        <span class="red">10:00 - 14:00</span>
    </div>
</div>';
preg_match_all("'<div class=\"inner\">(.*)</div>'si", $text, $match);
print "<pre><font color=red>"; print_r($match); print "</font></pre>";
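The lazy-versus-greedy difference is easy to see side by side. Here is a small sketch in Ruby rather than PHP; Ruby's /m flag makes the dot match newlines, which is what the s flag does in the PHP pattern. The sample markup is abbreviated, and the method name inner_div is made up.

```ruby
GREEDY_SAMPLE = <<~HTML
  <div class="inner">
    <div class="left">
      <p>left</p>
    </div>
    <div class="right">
      <p>right</p>
    </div>
  </div>
HTML

# Hypothetical helper: first capture group, with a lazy or greedy quantifier.
def inner_div(text, greedy:)
  pattern = if greedy
              /<div class="inner">(.*)<\/div>/mi   # runs to the LAST </div>
            else
              /<div class="inner">(.*?)<\/div>/mi  # stops at the FIRST </div>
            end
  text[pattern, 1]
end

puts inner_div(GREEDY_SAMPLE, greedy: false)
puts "-----"
puts inner_div(GREEDY_SAMPLE, greedy: true)
```

The lazy version ends inside the left block, while the greedy version also includes the right block.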
If you're trying to pull out each item in a div, then you'd probably want to consider using DOM instead of REGEX to tackle this problem. But since you used the preg-match tag, then here it is in REGEX:
preg_match_all('~<div class="(?!inner).*?>\K(.*?)(?=</div>)~ims', $text, $matches);
print "<PRE><FONT COLOR=BLUE>"; print_r($matches[1]); print "</FONT></PRE>";
That gives you this:
Array
(
[0] =>
<h4>text </h4>
<p>Abdijstreet 42b<br>2000 city </p>
[1] =>
<span class="red">10:00 - 14:00</span>
)
Explanation of the REGEX:
<div class=" (?!inner) .*? > \K (.*?) (?=</div>)
^            ^         ^   ^ ^  ^     ^
1            2         3   4 5  6     7
1. <div class=" Looks for the literal text <div, followed by a space, the word class, an equals sign, and a quotation mark.
2. (?!inner) A negative lookahead (?!) that makes sure the word inner is not coming up next.
3. .*? Matches any character ., zero or more times *, but as few times as possible ?, so it stops as soon as the next item in the expression can match. In this case, that is the closing bracket of the tag.
4. > Matches the closing bracket of the tag.
5. \K Tells the expression to drop everything matched so far from the reported match and start the match from here. This makes sure the first part of the expression is present, but does not return it for us to work with.
6. (.*?) Same as number 3, except that the parentheses () capture it so we can do something with it later.
7. (?=</div>) A positive lookahead (?=) that makes sure the closing div tag </div> comes next, but does not include it in the match.
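The same pattern can be tried in Ruby, whose regex engine also supports lookaheads and \K. This is a sketch on abbreviated sample markup, not the PHP code above; the constant names are made up.

```ruby
LOOKAHEAD_SAMPLE = <<~HTML
  <div class="inner">
    <div class="left">
      <h4>text </h4>
    </div>
    <div class="right">
      <span class="red">10:00 - 14:00</span>
    </div>
  </div>
HTML

# Skip the div whose class starts with "inner", then capture the body of
# every other div up to (but not including) its closing tag.
INNER_PATTERN = /<div class="(?!inner).*?>\K(.*?)(?=<\/div>)/im

# scan returns one array of captures per match; flatten and trim whitespace.
INNER_BLOCKS = LOOKAHEAD_SAMPLE.scan(INNER_PATTERN).flatten.map(&:strip)

p INNER_BLOCKS
```

As with any regex over HTML, this only holds while the matched divs contain no nested divs of their own.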

how to access this element

I am using Watir to write some tests for a web application. I need to get the text 'Bishop' from the HTML below but can't figure out how to do it.
<div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;">
<div class="workprolabel wpFieldLabel">
<span title="Please select a courtesy title from the list.">Title</span> <span class="validationIndicator wpValidationText"></span>
</div>
<span class="wpFieldViewContent" id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value"><p class="wpFieldValue ">Bishop</p></span>
</div>
Firebug tells me the xpath is:
html/body/form/div[5]/div[6]/div[2]/div[2]/div/div/span/span/div[2]/div[4]/div[1]/span[1]/div[2]/span/p/text()
but I can't format the element_by_xpath call to pick it up.
You should be able to access the paragraph right away if it's unique:
my_p = browser.p(:class, "wpFieldValue ")
my_text = my_p.text
See HTML Elements Supported by Watir
Try
//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']//text()
EDIT:
Maybe this will work
path = "//span[@id='dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value']/p"
ie.element_by_xpath(path).text
Also check whether the span's id stays constant.
Maybe you have an extra space at the end of the class name?
<p class="wpFieldValue ">
Try one of these (worked for me, please notice trailing space after wpFieldValue in the first example):
browser.p(:class => "wpFieldValue ").text
#=> "Bishop"
browser.span(:id => "dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view_value").text
#=> "Bishop"
It seems that at run time the DIV's style changes from none to block.
In that case, we can collect the text (the entire page source or just the DIV source) and extract the value from it.
For example:
text=ie.text
particular_div=text.scan(%r{div id="dnn_ctr353_Main_ctl00_ctl00_ctl00_ctl07_Field_048b9dfa-bc64-42e4-8bd5-b45385e5f45b_view" style="display: block;(.*)</span></div>}im).flatten.to_s
particular_div.scan(%r{ <p class="wpFieldValue ">(.*)</p> }im).flatten.to_s
The code above is a sample that should solve your problem.
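The scan-the-source idea can be tried on a plain string first. Here is a minimal sketch, assuming the page source is already available as a Ruby string. The shortened ids and the helper name field_value are made up, and the pattern tolerates the trailing space in the class attribute.

```ruby
# Shortened stand-in for the page source (ids abbreviated for readability).
FIELD_SAMPLE = <<~HTML
  <div id="field_view" style="display: block;">
    <span class="wpFieldViewContent" id="field_view_value"><p class="wpFieldValue ">Bishop</p></span>
  </div>
HTML

# Hypothetical helper: contents of the first <p class="wpFieldValue ...">,
# or nil when the source has no such paragraph.
def field_value(source)
  source[%r{<p class="wpFieldValue\s*">(.*?)</p>}im, 1]
end

puts field_value(FIELD_SAMPLE)
```

If several fields share the wpFieldValue class, the pattern would need the surrounding span's id to pick the right one.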

Resources