Get all attributes or xPath of element in Webdriver - ruby

I'm trying to write a simple monkey test for my web page that gets all active elements on the page and clicks on them in random order.
When I do this, I want to write a log so I know which element my test clicked on and on which element the test crashed.
So I want the log file to look like this:
01.01.11 11.01.01 Clicked on Element <span id='myspan' class ='myclass .....>
01.01.11 11.01.01 Clicked on Element <span id='button' class ='myclass title = 'Button'.....>
or
01.01.11 11.01.01 Clicked on Element //*[@id='myspan']
01.01.11 11.01.01 Clicked on Element //*[@id='button']
Is there any way to do this in WebDriver + Ruby?

I don't think there is a way but you could always do something like this (with watir-webdriver):
browser.divs.each do |div|
  # Build the attribute list, then print it with the element's real tag name.
  attrs = ['id','class','title'].map{|x| "#{x}='#{div.attribute_value(x)}'"}.join(' ')
  puts "<#{div.tag_name} #{attrs}>"
end

WebDriver does not provide this type of functionality; you would have to get the page source and do your own parsing. I've done this with Html Agility Pack in C#; you would need to find a similar library for Ruby (see: Options for HTML scraping?)

You can do this:
1) Get all elements that are clickable. For example, find all links and all clickable spans, and put those candidates in a list.
2) Randomly pick an element from that candidate list.
3) Click that element and write a log entry.
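The steps above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested WebDriver script: monkey_click, describe_element and FakeElement are hypothetical names, and FakeElement is only a stand-in so the snippet runs without a browser. With Selenium you would presumably pass something like driver.find_elements(css: 'a, button, span') instead; which selector counts as "clickable" depends on your page.

```ruby
require 'stringio'

# Hypothetical stand-in for a WebDriver element so the sketch runs without a
# browser. It mimics the three calls used below: tag_name, attribute, click.
FakeElement = Struct.new(:tag_name, :attrs) do
  def attribute(name)
    attrs[name]
  end

  def click
    true
  end
end

# Describe an element as an opening tag built from a few common attributes.
def describe_element(element, names = %w[id class title])
  pairs = names.map do |name|
    value = element.attribute(name)
    " #{name}='#{value}'" unless value.nil? || value.empty?
  end
  "<#{element.tag_name}#{pairs.join}>"
end

# Click every candidate once, in random order. The log line is written
# BEFORE the click, so if the test crashes, the last line in the log names
# the element that was being clicked.
def monkey_click(candidates, log, rng: Random.new)
  candidates.shuffle(random: rng).each do |element|
    stamp = Time.now.strftime('%d.%m.%y %H.%M.%S')
    log.puts "#{stamp} Clicking on Element #{describe_element(element)}"
    element.click
  end
end

elements = [
  FakeElement.new('span', { 'id' => 'myspan', 'class' => 'myclass' }),
  FakeElement.new('span', { 'id' => 'button', 'title' => 'Button' })
]

log = StringIO.new
monkey_click(elements, log)
puts log.string
```

Passing a fixed rng (for example Random.new(42)) makes a crashing run reproducible, which is usually what you want from a monkey test.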

I tweaked @pguardiario's answer to come up with this method:
def get_element_dom_info(e)
  # A method parameter cannot be an instance variable, so a plain local is used.
  if e.class != Selenium::WebDriver::Element
    raise "No valid element passed: #{e.class}"
  end
  attrs = ['id', 'class', 'title', 'href', 'src', 'type', 'name']
  # Keep only the attributes that are present and non-empty.
  present = attrs.map { |x| " #{x}='#{e.attribute(x)}'" if e.attribute(x) && e.attribute(x) != "" }
  return "<" + e.tag_name + present.join('') + ">"
end
Of course, it expects that the single parameter you pass into it is an actual Selenium element. Also, it doesn't include every possible attribute, but that's the majority of them (and you can always add your extra attributes if needed).
I suppose you can integrate this via some code like this:
def clickElement(*args)
  ... # parse vars
  @e = @driver.find_element(...)
  puts get_timestamp + " Clicked on Element: " + get_element_dom_info(@e)
end
UPDATE
I recently realized that I could get the full html of the element using native javascript (d'oh!). You have to use a hack to get the "outerHTML". Here is my new method:
def get_element_dom_info(how, what)
  e = @driver.find_element(how, what)
  # Use native javascript to return element dom info by creating a wrapper
  # that the element node is cloned to and we check the innerHTML of the parent wrapper.
  return @driver.execute_script("var f = document.createElement('div').appendChild(arguments[0].cloneNode(true)); return f.parentNode.innerHTML", e)
end

Related

Process Jekyll content to replace first occurrence of any post title with a hyperlink of the post with that title

What I'm trying to do
I am building a Jekyll ruby plugin that will replace the first occurrence of any word in the post copy text content with a hyperlink linking to the URL of a post by the same name.
The problems I'm having
I've gotten this to work but I can't figure out two problems in the process_words method:
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
If a post's URL is not at post.data['url'], where is it?
Also, is there a more efficient, cleaner way to do this?
The current code works but will replace the first occurrence even if it's the value of an HTML attribute, like an anchor or a meta tag.
Example result
We have a blog with 3 posts:
Hobbies
Food
Bicycles
And in the "Hobbies" post body text, we have a sentence with each word appearing in it for the first time in the post, like so:
I love mountain biking and bicycles in general.
The plugin would process that sentence and output it as:
I love mountain biking and bicycles in general.
My current code (UPDATED 1)
# _plugins/hyperlink_first_word_occurance.rb
require "jekyll"
require 'uri'

module Jekyll
  # Replace the first occurrence of each post title in the content with the post's title hyperlink
  module HyperlinkFirstWordOccurance
    POST_CONTENT_CLASS = "page__content"
    BODY_START_TAG = "<body"
    ASIDE_START_TAG = "<aside"
    OPENING_BODY_TAG_REGEX = %r!<body(.*)>\s*!
    CLOSING_ASIDE_TAG_REGEX = %r!</aside(.*)>\s*!

    class << self
      # Public: Processes the content and updates the
      # first occurrence of each word that also has a post
      # of the same title, into a hyperlink.
      #
      # content - the document or page to be processed.
      def process(content)
        @title = content.data['title']
        @posts = content.site.posts
        content.output = if content.output.include? BODY_START_TAG
                           process_html(content)
                         else
                           process_words(content.output)
                         end
      end

      # Public: Determines if the content should be processed.
      #
      # doc - the document being processed.
      def processable?(doc)
        (doc.is_a?(Jekyll::Page) || doc.write?) &&
          doc.output_ext == ".html" || (doc.permalink&.end_with?("/"))
      end

      private

      # Private: Processes html content which has a body opening tag.
      #
      # content - html to be processed.
      def process_html(content)
        content.output = if content.output.include? ASIDE_START_TAG
                           head, opener, tail = content.output.partition(CLOSING_ASIDE_TAG_REGEX)
                         else
                           head, opener, tail = content.output.partition(POST_CONTENT_CLASS)
                         end
        body_content, *rest = tail.partition("</body>")
        processed_markup = process_words(body_content)
        content.output = String.new(head) << opener << processed_markup << rest.join
      end

      # Private: Processes each word of the content and makes
      # the first occurrence of each word that also has a post
      # of the same title, into a hyperlink.
      #
      # html = the html which includes all the content.
      def process_words(html)
        page_content = html
        @posts.docs.each do |post|
          post_title = post.data['title'] || post.name
          post_title_lowercase = post_title.downcase
          if post_title != @title
            if page_content.include?(" " + post_title_lowercase + " ") ||
               page_content.include?(post_title_lowercase + " ") ||
               page_content.include?(post_title_lowercase + ",") ||
               page_content.include?(post_title_lowercase + ".")
              page_content = page_content.sub(post_title_lowercase, "#{ post_title.downcase }")
            elsif page_content.include?(" " + post_title + " ") ||
                  page_content.include?(post_title + " ") ||
                  page_content.include?(post_title + ",") ||
                  page_content.include?(post_title + ".")
              page_content = page_content.sub(post_title, "#{ post_title }")
            end
          end
        end
        page_content
      end
    end
  end
end

Jekyll::Hooks.register %i[posts pages], :post_render do |doc|
  # code to call after Jekyll renders a post
  Jekyll::HyperlinkFirstWordOccurance.process(doc) if Jekyll::HyperlinkFirstWordOccurance.processable?(doc)
end
Update 1
Updated my code with @Keith Mifsud's advice. Now using either the sidebar's aside element or the page__content class to select body content to work on.
Also improved checking and replacing the correct term.
PS: The code base example I started with when working on my plugin was @Keith Mifsud's jekyll-target-blank plugin
This code looks very familiar :) I suggest you look into the RSpec test file to test against your issues: https://github.com/keithmifsud/jekyll-target-blank
I'll try to answer your questions; sorry I couldn't test these myself at the time of writing.
How to only search for a post title in the main content copy text of the post, and not the meta tags before the post or the table of contents (which is also generated before main post copy text)? I can't get this to work with Nokogiri, even though that seems to be the tool of choice here.
Your requirements here are:
1) Ignore content outside the <body></body> tags.
This seems to already be implemented in the process_html() method. This method is meant to process only the body_content, and it should work as it is. Have you got tests for it? How are you debugging it? The same string splitting works in my plugin, i.e. only content inside the body is processed.
2) Ignore content inside the Table of Contents (TOC).
I suggest you extend the process_html() method by further splitting the body_content variable. Search for the content between the opening and closing tags of your TOC (by id, CSS class, etc.) and exclude it, then add it back in its position before or after the string processed by process_words.
3) Whether to use the Nokogiri plugin?
This plugin is great for parsing HTML. I think you are parsing strings and then creating HTML, so vanilla Ruby and the URI plugin should suffice. You can still use it if you want, but it won't be any faster than splitting strings in Ruby.
If a post's URL is not at post.data['url'], where is it?
I think you should have a method to get all post titles and then match the "words" against the array. You can get the posts collection from the doc itself with doc.site.posts and, for each post, return the title. Then the process_words() method can check each word to see if it matches an item from the array. But what if the title is made of more than one word?
Also, is there a more efficient, cleaner way to do this?
So far so good. I'd start by getting the issues fixed and then refactor for speed and coding standards.
Again I suggest you use testing to help you with this.
Let me know if I can help more :)
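One way to avoid touching attribute values, sketched in plain Ruby: split the HTML into tags and text runs, and only substitute inside the text runs. This is an illustration under assumptions, not the plugin's code: link_first_occurrences and the titles_to_urls hash are hypothetical names, and in a plugin the hash could presumably be built from site.posts.docs using each post's data['title'] and its url method. The sketch handles multi-word titles, but it does not exclude the TOC and does not skip text that is already inside a link or a script element.

```ruby
# Link the first occurrence of each title, looking only at text between tags.
# titles_to_urls maps a post title to the URL it should link to.
def link_first_occurrences(html, titles_to_urls)
  # Look titles up case-insensitively; a title is removed once it is linked.
  pending = titles_to_urls.transform_keys(&:downcase)
  return html if pending.empty?

  # Longest titles first, so "food trucks" wins over "food".
  alternatives = pending.keys.sort_by { |title| -title.length }
                        .map { |title| Regexp.escape(title) }
  pattern = /\b(?:#{alternatives.join('|')})\b/i

  # Splitting on a capture group keeps the tags in the result, so the array
  # alternates: text, tag, text, tag, ...
  html.split(/(<[^>]+>)/).each_with_index.map do |chunk, index|
    next chunk if index.odd? # odd positions hold tags: leave them untouched

    chunk.gsub(pattern) do |match|
      url = pending.delete(match.downcase)
      url ? %(<a href="#{url}">#{match}</a>) : match
    end
  end.join
end

html = '<p title="bicycles">I love mountain biking and bicycles in general.</p>'
puts link_first_occurrences(html, { 'Bicycles' => '/bicycles/', 'Food' => '/food/' })
```

Because gsub scans the original text run and never rescans its own replacements, a title cannot be matched inside a link that was just inserted.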

Posting data on website using Mechanize Nokogiri Selenium

I need to post data on a website through a program.
To achieve this, I am using Mechanize, Nokogiri, and Selenium.
Here's my code :
def aeiexport
# first Mechanize is submitting the form to identify yourself on the website
agent = Mechanize.new
agent.get("https://www.glou.com")
form_login_AEI = agent.page.forms.first
form_login_AEI.util_vlogin = "42"
form_login_AEI.util_vpassword = "666"
# this is suppose to submit the form I think
page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
#to be able to scrap the page you end up on after submitting form
body = page_compet_list.body
html_body = Nokogiri::HTML(body)
#tds give back an array of td
tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
# Checking my array of td with some condition
tds.each do |td|
link = td.children.first # Select the first children
if link.html = "2015 32 92 0076 012"
# Only consider the html part of the link, if matched follow the previous link
previous_td = td.previous
previous_url = previous_td.children.first.href
#following the link contained in previous_url
page_selected_compet = agent.get(previous_url)
# to be able to scrap the page I end up on
body = page_selected_compet.body
html_body = Nokogiri::HTML(body)
joueur_access = html_body.search('#tabs0head2 a')
# clicking on the link
joueur_access.click
rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
#following pure_link_rechercher_par_numéro_de_licence
page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
body_submit_licence = page_submit_licence.body
html_body = Nokogiri::HTML(body_submit_licence)
#posting my data in the right field
form.field_with(:name => 'lic_cno[0]') == "9511681"
1) So far, what do you think about this code? Do you think there is an error in it?
2) This is the part I am really not sure about: I have posted my data in the right field, but now I need to submit it. The problem is that the button I need to click looks like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function on click. I am trying to use Selenium to trigger the click event. Then I end up on another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me whether Selenium will let me do what I need to do? If so, how can I do it?
At a glance, your code could use less indentation and more white space/empty lines to separate the internal logic of AEIexport (which should be renamed aei_export, since Ruby uses snake case for method names). You can find more recommendations on how to style Ruby code here.
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (i.e. in Firebug) and understand what the JavaScript on the page does. Then use Mechanize to follow the link manually.

Parsing element text with capybara-webkit

I'm new to Ruby and Capybara and I'm trying to use capybara-webkit to scrape a website. All of the data I'm interested in lies in td tags with certain properties.
Where form is a particular form element I'm looking at, the following code works:
form.all('td').detect do |td|
  if td['valign'] == 'top' && td['nowrap'] != 'nowrap'
    print "#{td.text}\n"
  end
end
The contents of all of the td elements I'm interested in are printed out correctly. However, when I try to then parse the text with a regex:
form.all('td').detect do |td|
  if td['valign'] == 'top' && td['nowrap'] != 'nowrap'
    print "#{td.text}\n"
    val1, val2 = td.match(/(\d)(\d)/).captures # The real regex is more complex
  end
end
...suddenly only the first td element is read/parsed. I've tried even just pushing each td.text value into an array for later parsing, but the same thing occurs. I've even tried making a clone of the td.text string and operating on that—no luck. There doesn't seem to be any sort of timeout on the page that would change the HTML elements. Absolutely no clue what could be causing this.
Any thoughts?
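A possible explanation, offered as a guess since the page itself is not available: detect (an alias of find) stops at the first element for which the block returns a truthy value. In the first snippet the block ends with print, which returns nil, so every td is visited. In the second, the block ends with an assignment (or, in the array experiment, with a push), both of which return something truthy, so iteration stops after the first matching td. Using each instead would avoid that. Note also that match is called on the node in the snippet, where td.text.match was probably intended. The sketch below reproduces the effect with made-up strings standing in for the td texts; visited_by_detect and digit_pairs are hypothetical helper names.

```ruby
# Returns the items the block was actually called with when using detect.
def visited_by_detect(texts)
  visited = []
  # Array#<< returns the array itself, which is truthy, so detect stops here.
  texts.detect { |text| visited << text }
  visited
end

# With each, the block's return value is ignored and every item is visited.
def digit_pairs(texts)
  pairs = []
  texts.each do |text|
    match = text.match(/(\d)(\d)/)
    pairs << match.captures if match # skip texts the regex does not match
  end
  pairs
end

texts = ['12 apples', '34 pears', '56 plums']
p visited_by_detect(texts) # only the first item was visited
p digit_pairs(texts)       # one pair of digits per item
```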

how to get attribute values using nokogiri

I have a webpage whose DOM structure I do not know, but I know the text that I need to find in that webpage. So, in order to get its XPath, what I do is:
doc = Nokogiri::HTML(webpage)
path = [] # collects the XPath of every matching text node
doc.traverse { |node|
  if node.text?
    if node.content == "my text"
      path << node.path
    end
  end
}
puts path
Now suppose I get an output like:
html/body/div[4]/div[8]/div/div[38]/div/p/text()
so that later, when I access this webpage again, I can do this:
doc.xpath("#{path[0]}")
instead of traversing the whole DOM tree every time I want the text.
I want to do some further processing. For that, I need to know which of the element nodes in the above XPath output have attributes associated with them, and what their attribute values are. How would I achieve that? The output that I want is:
#=> output desired
{ p => p_attr_value , div => div_attr_value , div[38] => div[38]_attr_value.....so on }
I have no problem finding the nodes where "my text" lies. I wanted the full XPath of the "my text" node, which is why I did the whole traversal. Now, having found the full XPath, I want the attributes associated with each element node that I came across on the way to the "my text" node.
Constraint: I can't use any of the developer tools available in a web browser.
PS: I am a newbie in Ruby and Nokogiri.
To select all attributes of an element that is selected using the XPath expression someExpr, you need to evaluate a new XPath expression:
someExpr/@*
where someExpr must be substituted with the real XPath expression used to select the particular element.
This selects all attributes of all elements (we assume there is just one) that are selected by the XPath expression someExpr.
For example, if the element we want is selected by:
/a/b/c
then all of its attributes are selected by:
/a/b/c/@*

Extracting HTML5 data attributes from a tag

I want to extract all the HTML5 data attributes from a tag, just like this jQuery plugin.
For example, given:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
I want to get a hash like:
{ 'data-age' => '50', 'data-location' => 'London' }
I was originally hoping to use a wildcard as part of my CSS selector, e.g.
Nokogiri(html).css('span[@data-*]').size
but it seems that isn't supported.
Option 1: Grab all data elements
If all you need is to list all the page's data elements, here's a one-liner:
Hash[doc.xpath("//span/@*[starts-with(name(), 'data-')]").map{|e| [e.name,e.value]}]
Output:
{"data-age"=>"50", "data-location"=>"London"}
Option 2: Group results by tag
If you want to group your results by tag (perhaps you need to do additional processing on each tag), you can do the following:
tags = []
datasets = "@*[starts-with(name(), 'data-')]"
#If you want any element, replace "span" with "*"
doc.xpath("//span[#{datasets}]").each do |tag|
  tags << Hash[tag.xpath(datasets).map{|a| [a.name,a.value]}]
end
Then tags is an array containing key-value hash pairs, grouped by tag.
Option 3: Behavior like the jQuery datasets plugin
If you'd prefer the plugin-like approach, the following will give you a dataset method on every Nokogiri node.
module Nokogiri
  module XML
    class Node
      def dataset
        Hash[self.xpath("@*[starts-with(name(), 'data-')]").map{|a| [a.name,a.value]}]
      end
    end
  end
end
Then you can find the dataset for a single element:
doc.at_css("span").dataset
Or get the dataset for a group of elements:
doc.css("span").map(&:dataset)
Example:
The following is the behavior of the dataset method above. Given the following lines in the HTML:
<span data-age="50" data-location="London" class="highlight">Joe Bloggs</span>
<span data-age="40" data-location="Oxford" class="highlight">Jim Foggs</span>
The output would be:
[
{"data-location"=>"London", "data-age"=>"50"},
{"data-location"=>"Oxford", "data-age"=>"40"}
]
You can do this with a bit of xpath:
doc = Nokogiri.HTML(html)
data_attrs = doc.xpath "//span/@*[starts-with(name(), 'data-')]"
This gets all the attributes of span elements that start with 'data-'. (You might want to do this in two steps: first get all the elements you're interested in, then extract the data attributes from each in turn.)
Continuing the example (using the span in your question):
hash = data_attrs.each_with_object({}) do |n, hsh|
  hsh[n.name] = n.value
end
puts hash
produces:
{"data-age"=>"50", "data-location"=>"London"}
Try looping through element.attributes while ignoring any attribute that does not start with data-.
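A rough sketch of that loop. To keep it runnable without Nokogiri, a plain hash of attribute names to FakeAttr objects plays the role of element.attributes; FakeAttr and data_attributes are hypothetical names used only for this illustration. With a real node you would pass node.attributes instead, which in Nokogiri is likewise a hash keyed by attribute name whose values respond to value.

```ruby
# Hypothetical stand-in for an attribute object: anything with a value method.
FakeAttr = Struct.new(:value)

# Keep only the attributes whose name starts with "data-" and return them
# as a plain name => value hash.
def data_attributes(attributes)
  attributes.each_with_object({}) do |(name, attr), result|
    result[name] = attr.value if name.start_with?('data-')
  end
end

attrs = {
  'data-age' => FakeAttr.new('50'),
  'data-location' => FakeAttr.new('London'),
  'class' => FakeAttr.new('highlight')
}
p data_attributes(attrs)
```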
The Node#css docs mention a way to attach a custom pseudo-selector. This might look like the following for selecting nodes with attributes starting with 'data-':
Nokogiri(html).css('span:regex_attrs("^data-.*")', Class.new {
  def regex_attrs node_set, regex
    node_set.find_all { |node| node.attributes.keys.any? {|k| k =~ /#{regex}/ } }
  end
}.new)

Resources