Parsing class name in html of span class - ruby

I'm checking results of values to verify that they are correct.
Using watir-webdriver.
In this case javascript generates a color class:
eg:
<span class="storyEdit limeGreen"> x </span>
In Ruby, I'm currently trying to parse this information out of the element using .html.
This is roughly what I've parsed so far:
=> <span class=\"storyEdit limeGreen\"> x </span>
I'd like to only return limeGreen so I can say:
color = resultOfParsedSpan
This would be for a few different colours, so I was wondering is there a way to only pull the class name from the html?
If I haven't explained anything well enough, please feel free to let me know so I can add extra information!

Watir lets you do this directly; you do not need to parse the HTML manually. The Element#class_name method will give you the element's class.
Example (assuming it is the first span):
browser.span.class_name
#=> "storyEdit limeGreen"
From that, you would have to parse the string to figure out what color it is. Given that the classes might be in any order and the infinite number of possible colors, I do not believe there is a general way to get just the color. The solution would depend on what you want to do with color and if the possible colors are known ahead of time.
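If the possible colours are known ahead of time, one option is to intersect the tokens of the class string with that list. The following is a minimal sketch; KNOWN_COLORS and color_from are hypothetical names made up for illustration and are not part of Watir.

```ruby
# Hypothetical list of colour classes the page's JavaScript can generate.
KNOWN_COLORS = %w[limeGreen red amber].freeze

# Returns the first known colour found in a class string, or nil if none match.
# Works regardless of the order the classes appear in.
def color_from(class_name, known = KNOWN_COLORS)
  (class_name.split & known).first
end

color = color_from("storyEdit limeGreen")
```

In a Watir script the argument would be the string returned by browser.span.class_name.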

Well, a quick approach would be something like this:
span = '<span class="storyEdit limeGreen"> x </span>'
# Capture the value of the class attribute, then take its last token
color = $1.split.last if span =~ /class="([^"]*)"/
But it would generally be better to use an HTML parsing library such as Nokogiri or Hpricot for this sort of thing.

Related

Find a node with a specific 'style' attribute xpath in Ruby

So I'm using Mechanize in Ruby to do some website scraping and want to find all nodes with a specific style attribute.
I want to return all nodes whose style attribute has a specific top value on the webpage.
The HTML will look like this:
<div id="c11285" style="position:absolute;top:1px;left:333px;width:65px;height:226px;overflow:hidden;background-color:transparent;z-index:10;border: 1px solid #000" onclick="">
In this case I cannot use the id, because each variation of the page has different ids so I want to search by the top value in the style attribute which in this case is 1px.
I've tried using webPage.search("div['style=top: 1px;']")
However, this does not work as px seems to cause an error.
Any suggestions on how I could achieve this or is this even possible?
This scans all elements and returns those which have top:1px in their style attribute:
//*[contains(@style, 'top:1px')]
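Note that a substring test for 'top:1px' would also match a declaration such as margin-top:1px, and would miss top: 1px written with a space. One way to guard against that is to filter the returned nodes in Ruby by parsing the style string. This is only a sketch; parse_style is a hypothetical helper, and it assumes no property value contains a semicolon.

```ruby
# Turns an inline style string into a hash of property => value.
# Assumption: values never contain ';' (true for the example in the question).
def parse_style(style)
  style.split(';').each_with_object({}) do |declaration, props|
    name, value = declaration.split(':', 2)
    next if value.nil? # skip empty or malformed fragments
    props[name.strip.downcase] = value.strip
  end
end

style = 'position:absolute;top:1px;left:333px;width:65px;height:226px;' \
        'overflow:hidden;background-color:transparent;z-index:10;border: 1px solid #000'
props = parse_style(style)
top_is_1px = props['top'] == '1px'
```

With the nodes returned by the search, the same check could be applied to each node's style attribute value (node['style'] in Nokogiri).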

How to use Selenium using Xpath to determine the classes of an element?

I am trying to use xpath within selenium to select a div element that is within a td.
What I am really trying to do is determine the class of the div, and whether it is classed LOGO1, LOGO2, LOGO3 and so on. Originally I was going to just grab the image URL to determine which logo.jpg was used, but whoever made the target website used one image for all logo types and used CSS to determine which portion of the image is displayed. So imagine 4 images on one sprite image. This is the reason why I have to determine the class of the div instead of digging through the CSS paths.
In selenium I am using storeElementPresent | /html/body/form/center/table/tbody/tr/td[2]/div[3]/div[2]/fieldset/table/tbody/tr[2]/td/div/table/tbody/tr[${i}]/td[8]/div//class | cardLogo .
The div has multiple classes so I am thinking that this is the issue, but any help is appreciated. Below is the target source. This is source from within the table in the tbody. Selenium has no problems identifying all the way up to td[8] but then fails to gather the div. Please help!
<td class="togglehidefields" style="width:80px;">
<div class="cardlogo LOGO1" style="background-image:url(https://www.somesite.com/merchants/images/image.jpg)"></div>
<span id="ContentPlaceHolder1_grdCCChargebackDetail_lblCardNumber_0">7777</span>
</td>
I was fiddling with selenium.getAttribute() but it kept erroring out, any ideas there?
This <div/> element has one class attribute with one value, but this one is tokenized when parsed as HTML.
As Selenium only supports XPath 1.0, you will need to check for classes like this:
//div[contains(@class, "LOGO1") or contains(@class, "LOGO2")]
Extend that pattern as needed and embed it in your expression.
With XPath 2.0 and later, you could tokenize and use the = operator, which has set-based semantics:
//div[tokenize(@class, ' ') = ("LOGO1", "LOGO2")]
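One caveat with a plain contains() test in XPath 1.0 is that "LOGO1" also matches a class such as LOGO10. A common workaround is to pad the normalized class attribute with spaces and look for the padded token. Below is a small sketch that builds such an expression as a string; class_token and div_with_any_class are hypothetical helper names, not part of Selenium.

```ruby
# Builds an XPath 1.0 test that matches a whole class token rather than a substring.
def class_token(name)
  "contains(concat(' ', normalize-space(@class), ' '), ' #{name} ')"
end

# Hypothetical helper: XPath for any div carrying at least one of the given classes.
def div_with_any_class(*names)
  "//div[#{names.map { |n| class_token(n) }.join(' or ')}]"
end

xpath = div_with_any_class('LOGO1', 'LOGO2')
```

The resulting string can then be embedded in a longer locator in the same way as the shorter contains() version.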
Old post but I'll put the solution I used up just in case it can help anyone.
xpath=//div[contains(@class,'carouselNavNext ')]/.[contains(@class, 'disabled')]
Fire off your first contains, then follow it with /. (the current element itself) so that the second contains is checked against the same element.

Scrape website with Ruby based on embedded CSS styles

In the past, I have successfully used Nokogiri to scrape websites using a simple Ruby script. For a current project, I need to scrape a website that only uses inline CSS. As you can imagine, it is an old website.
What possibilities do I have to target specific elements on the page based on the inline CSS of the elements? It seems this is not possible with Nokogiri or have I overlooked something?
UPDATE: An example can be found here. I basically need the main content without the footnotes. The latter have a smaller font size and are grouped below each section.
I'm going to teach you how to fish. Instead of trying to find what I want, it's sometimes a lot easier to find what I don't want and remove it.
Start with this code:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
FOOTNOTE_ACCESSORS = [
'span[style*="font-size: 8.0pt"]',
'span[style*="font-size:8.0pt"]',
'span[style*="font-size: 7.5pt"]',
'span[style*="font-size:7.5pt"]',
'font[size="1"]'
].join(',')
doc = Nokogiri.HTML(URI.open(URL)) # URI.open: plain open no longer fetches URLs on Ruby 3.0+
doc.search(FOOTNOTE_ACCESSORS).each do |footnote|
footnote.remove
end
File.write(File.basename(URI.parse(URL).path), doc.to_html)
Run it, then open the resulting HTML file in your browser. Scroll through the file looking for footnotes you want to remove. Select part of their text, then use "Inspect Element", or whatever tool you have that will find that selected text in the source of the page. Find something unique in that text that makes it possible to isolate it from the text you want to keep. For instance, I locate footnotes using the font-sizes in <span> and <font> tags.
Keep adding accessors to the FOOTNOTE_ACCESSORS array until you have all undesirable elements removed.
This code isn't complete, nor is it written as tightly as I'd normally do it for this sort of task, but it will give you an idea how to go about this particular task.
This is a version that is a bit more flexible:
require 'nokogiri'
require 'open-uri'
URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
FOOTNOTE_ACCESSORS = [
'span[style*="font-size: 8.0pt"]',
'span[style*="font-size:8.0pt"]',
'span[style*="font-size: 7.5pt"]',
'span[style*="font-size:7.5pt"]',
'font[size="1"]',
]
doc = Nokogiri.HTML(URI.open(URL)) # URI.open: plain open no longer fetches URLs on Ruby 3.0+
FOOTNOTE_ACCESSORS.each do |accessor|
doc.search(accessor).each do |footnote|
footnote.remove
end
end
File.write(File.basename(URI.parse(URL).path), doc.to_html)
The major difference is the previous version assumed all entries in FOOTNOTE_ACCESSORS were CSS. With this change XPath can also be used. The code will take a little bit longer to run as the entries are iterated over, but the ability to dig in with XPath might make it worthwhile for you.
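Because each font size has to be listed twice (with and without a space after the colon), the list of accessors could also be generated instead of typed out. A minimal sketch, assuming a hypothetical helper named font_size_selectors:

```ruby
# Builds CSS attribute selectors for each font size, covering both
# "font-size: 8.0pt" and "font-size:8.0pt" spellings.
def font_size_selectors(tag, sizes)
  sizes.flat_map do |size|
    [
      %(#{tag}[style*="font-size: #{size}"]),
      %(#{tag}[style*="font-size:#{size}"])
    ]
  end
end

accessors = font_size_selectors('span', %w[8.0pt 7.5pt]) + ['font[size="1"]']
```

The resulting array has the same entries as the hand-written list, so it could be iterated over in the same way.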
You can do something like:
doc.css('*[style*="foo"]')
That will select any element with foo appearing anywhere in its style attribute.

Can I assign multiple values to the appliedCharacterStyle property of InDesign DOM's Text object?

I am working on an ExtendScript script which we use to prepare InDesign files for export to XHTML. Basically, we just go around applying character styles where we need them (have a look at this simplified example):
app.activeDocument.findGrep()[0].appliedCharacterStyle = "customStyle";
When we export the result to XHTML using InDesign's Export to XHTML feature, we get something like this:
<span class="customStyle">I</span>
which is exactly what we want. The problem arising now is that we sometimes want to apply many different styles to a single character, so we end up doing something like this:
var t = app.activeDocument.findGrep()[0];
t.appliedCharacterStyle = "customStyle1";
t.appliedCharacterStyle = "customStyle2";
Obviously, customStyle2 overrides customStyle1, which defeats the purpose. Is there any way around this?
Note: I tried using applyCharacterStyle instead, but that method doesn't take strings as parameter, only CharacterStyle objects.
Is "customStyle" just a CSS class or the name of a saved style? I don't really use InDesign, so this is speculation, but it looks like you could modify individual properties of the CharacterStyle object like this:
var myStyle = new CharacterStyle();
myStyle.fillColor = "blue";
myStyle.fontStyle = "verdana";
...
Or something similar; then you should be able to apply it like this:
t.applyCharacterStyle(myStyle);
This is just an educated guess based on my experience with ExtendScript and Photoshop. Sorry if it's way off base.

CodeIgniter sanitizing POST values

I have a text area in which I am trying to add YouTube embed code and other HTML tags. $this->input->post is converting the < and > of the <iframe> tags to &lt; and &gt; respectively, but not those of the <h1> and <h2> tags.
Any idea how I can store these values?
If you only have a small number of forms that you need to allow iframes in, I would just write a function to restore the iframe (while validating that it's a valid YouTube embed code).
You can also turn off global_xss_filtering in your config (or not implement it if you're using it), but that's not the ideal solution (turning off all of your security to get one thing to work is generally a horrible idea).
$config['global_xss_filtering'] = FALSE;
To see all of the tags that get filtered out, look in the CI_Input class and search for the '$naughty' variable. You'll see a pipe-delimited list (don't change anything in this class).
Why don't you avoid CI's auto-sanitizing and use something like htmlspecialchars($_POST['var']); ? Or make a helper function for sanitizing YouTube URLs...
Or you could either just ask for the video ID code or parse the code from what you are getting.
This would let you use both the URL or the embed code.
Also, storing just the ID takes less space in your database, and you could write a helper function to output the embed code/URL.
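The idea of keeping only the video ID can be sketched as follows. It is written in Ruby to match the other examples on this page rather than in PHP, and it rests on assumptions: that IDs are 11 characters of letters, digits, underscores and hyphens, and that the input uses one of the three URL shapes shown in the pattern. youtube_id and embed_code are hypothetical helper names.

```ruby
# Assumption: video IDs are 11 characters from [A-Za-z0-9_-] and the input
# contains an /embed/ID, watch?v=ID or youtu.be/ID style URL.
YOUTUBE_ID = %r{(?:youtube\.com/(?:embed/|watch\?v=)|youtu\.be/)([\w-]{11})}

# Returns the video ID found in a URL or embed snippet, or nil if none is found.
def youtube_id(input)
  match = YOUTUBE_ID.match(input)
  match && match[1]
end

# Rebuilds a clean embed snippet from a stored ID.
def embed_code(id)
  %(<iframe src="https://www.youtube.com/embed/#{id}" frameborder="0" allowfullscreen></iframe>)
end

id = youtube_id('<iframe width="560" src="https://www.youtube.com/embed/dQw4w9WgXcQ"></iframe>')
```

Since only the extracted ID is stored and the markup is regenerated on output, nothing else from the submitted text reaches the page.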
In this case, use $_POST instead of $this->input->post to get the original text area value, and then use HTML Purifier to clean the contents without losing the <iframe> tag you want.
You will need to check HTML Purifier documentation for details. Please, check this specific documentation page about "Embedding YouTube Videos".
