I am attempting to log into a site via Mechanize that is proving to be a bit of a challenge. I can get past the first two form pages, but then after submitting an ID and password, I get a third page prior to being able to go into the main content that I'm attempting to scrape.
In my case I have the following Ruby source that gets me to the point where I'm hitting a roadblock:
agent = Mechanize.new
start_url = 'https://sub.domain.tld/action'
# Get first page
page = agent.get(start_url)
# Fill in the first ID field and submit form
login_form = page.forms.first
id_field = login_form.field_with(:name => "ctl00$PageContent$Login1$IdTextBox")
id_field.value = "XXXXXXXXXXX"
# Get the next password page & submit form:
page = agent.submit(login_form, login_form.buttons.first)
login_form = page.forms.first
password_field = login_form.field_with(:name => "ctl00$PageContent$Login1$PasswordTextBox")
password_field.value = "XXXXXXXXXXX"
# Get the next page and...
page = agent.submit(login_form, login_form.buttons.first)
# Try to go into the main content, just to hit a wall. :s
page = agent.click(page.link_with(:text => /Skip to main/))
The contents of the third page, as reported by the Mechanize agent, are:
https://gist.github.com/5ed57292c8f6532352fd
As you may note from this, it looks as though agent.click on the first link should take me into the main content. Unfortunately, it simply loops back to this page: each request loads a new page, but with precisely the same contents. Something is preventing me from getting through this multi-factor login to the main content, but I can't put my finger on what that might be.
Here is the page.content from that third request: http://f.cl.ly/items/252y261c303R0m2P1R0j/source.html
Any thoughts on what may be stopping me from going forward to content here?
Goal: In irb, open a series of hyperlinks in new tabs and save a screenshot of each.
Code:
require "rubygems"
require "selenium-webdriver"
browser = Selenium::WebDriver.for:firefox
browser.get 'https://company.com'
browser.find_element(:name, "username").send_keys("myUsername")
browser.find_element(:name, "password").send_keys("myPassword")
browser.find_element(:name, "ibm-submit").click
body = browser.find_element(:tag_name => 'body')
body.send_keys(:control, 't')
parent = browser.find_element(:xpath, "//div[@id='someid']")
children = parent.find_elements(:xpath,"//a")
children.each do |i| ;
body.send_keys(:control, 't')
i.click
browser.save_screenshot("{i}")
end
Problem:
Selenium::WebDriver::Error::StaleElementReferenceError: Element not found in the cache - perhaps the page has changed since it was looked up
Question: What am I doing wrong?
Basically you can't share WebElements across pages, yet you're trying to access body across multiple tabs. Try not to think of them as self-contained objects, but as proxies to something on a real page.
The solution is to only ever perform actions on the 'current' page. In your case that means sending the Ctrl-T event on the tab that you've created. Having done that the first time, you're switched to the new tab. You then need to re-perform the lookup:
newTabsBody = browser.find_element(:tag_name => 'body')
and then:
newTabsBody.send_keys(:control, 't')
to create the next one. Continue for each child.
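Another way to sidestep stale references is to copy what you need out of the elements into plain Ruby values before leaving the page. Here is a minimal sketch under that assumption. Note that it is a variation, not the tab-based approach described above: it reads every href up front and then visits each URL in the same tab. The method name screenshot_links, the out_dir argument and the file naming are hypothetical; driver is assumed to be a Selenium::WebDriver session.

```ruby
# Visit every link inside a container and save one screenshot per link.
# `driver` is assumed to respond to the Selenium::WebDriver calls used below.
def screenshot_links(driver, container_xpath, out_dir)
  parent = driver.find_element(:xpath, container_xpath)

  # ".//a" restricts the search to descendants of parent; a bare "//a"
  # would match every link in the document.
  hrefs = parent.find_elements(:xpath, ".//a")
                .map { |link| link.attribute("href") }
                .compact

  # From here on only strings are used, so navigating away cannot
  # invalidate anything we still hold.
  hrefs.each_with_index.map do |href, index|
    driver.get(href)
    path = File.join(out_dir, "link-#{index}.png")
    driver.save_screenshot(path)
    path
  end
end
```

With a real session this would be called after logging in, for example as screenshot_links(browser, "//div[@id='someid']", "."), and it returns the list of screenshot paths it wrote.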
I'm attempting to store a CodeIgniter captcha helper challenge word in a Session variable, as follows:
$cap = create_captcha($val /* array of params */); // Works fine
$this->session->set_userdata('cap-word', $cap['word']);
If I echo $cap['word'] before and after the session set, it is correct (i.e. what was displayed in the browser). If I retrieve the session variable immediately after setting it, it's also correct.
However what gets stored in the session is completely different - it's the right length (character count) but a totally different string. Hence, when I try to retrieve the userdata on the server side (captcha validation) it gets the wrong value.
(I'm configured for 'sess_use_database' and inspecting the user data values in phpMyAdmin after each page load. Cookie encryption is disabled.)
Debug attempts:
I've tried prepending / appending known strings to the captcha challenge word before storing it in the session. The known strings make it into the session user data just fine, but are prepended / appended to an incorrect captcha word.
I've tried replacing the challenge word entirely with a fixed string. This makes the round trip through the session no problem.
I've tried saving the captcha challenge word to a different string variable (rather than passing it directly from $cap['word']) with the same result; only the challenge portion gets "munged" on landing in the session.
After debugging, my code looks more like:
$cap = create_captcha($vals);
echo("<br>cap = ");
print_r($cap);
echo("<br>cap['word'] = " . $cap['word']);
$theword = $cap['word'];
echo("<br> Before theword = " . $theword);
$this->session->set_userdata('cap-word', $theword . 'abcdefg');
echo("<br> After theword = " . $theword);
echo("<br> Session output = " . $this->session->userdata('cap-word'));
This produces the following output in the browser:
cap = Array ( [word] => 5CZaDeHm [time] => 1436765602.678 [image] =>
{the image})
cap['word'] = 5CZaDeHm
Before theword = 5CZaDeHm
After theword = 5CZaDeHm
Session output = 5CZaDeHmabcdefg
However, what's stored in the session table userdata fields (and, thus, what pops out when I call $this->session->userdata('cap-word') on submit request) is:
a:2:{s:9:"user_data";s:0:"";s:8:"cap-word";s:15:"3g5hb1I3abcdefg";}
Hence, the substring '5CZaDeHm' within $theword has been seemingly replaced by '3g5hb1I3' during the call to $this->session->set_userdata. I have no idea why, or even how this is possible?!
Update 2015-07-13 07:50 EDT: As usual, Occam's Razor applies. Turns out on each page load, my controller is being called twice, generating 2 captcha images with 2 corresponding challenge words. One of these appeared in the browser, the other in the session's database row. Now to figure out why...
Turns out the CodeIgniter controller was being called twice because I used a relative path in the href parameter of a CSS tag. This resulted in the browser attempting to load the stylesheet relative to the page, which triggered a 2nd load of the page.
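To illustrate why a relative href can end up requesting something other than the intended stylesheet, here is a small sketch of how a relative reference is resolved against the page URL. Ruby's standard URI library is used purely for the demonstration, and the host and paths are made up. Whether the resolved URL is then routed back to the controller depends on the application's routing, which is not shown here.

```ruby
require "uri"

# Hypothetical URL of a page served by a controller.
page_url = "http://example.test/signup/form"

# A relative href is resolved against the page's own path...
relative = URI.join(page_url, "css/style.css").to_s
# ...while a root-relative href ignores the page's path.
root_relative = URI.join(page_url, "/css/style.css").to_s

puts relative       # http://example.test/signup/css/style.css
puts root_relative  # http://example.test/css/style.css
```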
I need to post data on a website through a program.
To achieve this I am using Mechanize, Nokogiri and Selenium.
Here's my code:
def aeiexport
# first Mechanize is submitting the form to identify yourself on the website
agent = Mechanize.new
agent.get("https://www.glou.com")
form_login_AEI = agent.page.forms.first
form_login_AEI.util_vlogin = "42"
form_login_AEI.util_vpassword = "666"
# this is supposed to submit the form, I think
page_compet_list = agent.submit(form_login_AEI, form_login_AEI.buttons.first)
# to be able to scrape the page you end up on after submitting the form
body = page_compet_list.body
html_body = Nokogiri::HTML(body)
#tds give back an array of td
tds = html_body.css('.L1').xpath("//table/tbody/tr[position()>1]/td")
# Checking my array of td with some condition
tds.each do |td|
link = td.children.first # Select the first children
if link.html = "2015 32 92 0076 012"
# Only consider the html part of the link, if matched follow the previous link
previous_td = td.previous
previous_url = previous_td.children.first.href
#following the link contained in previous_url
page_selected_compet = agent.get(previous_url)
# to be able to scrape the page I end up on
body = page_selected_compet.body
html_body = Nokogiri::HTML(body)
joueur_access = html_body.search('#tabs0head2 a')
# clicking on the link
joueur_access.click
rechercher_par_numéro_de_licence = html_body.css('.L1').xpath("//table/tbody/tr/td[1]/a[1]")
pure_link_rechercher_par_numéro_de_licence = rechercher_par_numéro_de_licence['href']
#following pure_link_rechercher_par_numéro_de_licence
page_submit_licence = agent.get(pure_link_rechercher_par_numéro_de_licence)
body_submit_licence = page_submit_licence.body
html_body = Nokogiri::HTML(body_submit_licence)
#posting my data in the right field
form.field_with(:name => 'lic_cno[0]') == "9511681"
1) What do you think of this code so far? Do you see any errors in it?
2) This part is the one I am really not sure about : I have posted my data in the right field but now I need to submit it. The problem is that the button I need to click is like this:
<input type="button" class="button" onclick="dispatchAndSubmit(document.JoueurRechercheForm, 'rechercher');" value="Rechercher">
It triggers a JavaScript function on click. I am trying to use Selenium to trigger the click event. That takes me to another page, where I need to click a few more times. I tried this:
driver.find_element(:value=> 'Rechercher').click
driver.find_element(:name=> 'sel').click
driver.find_element(:value=> 'Sélectionner').click
driver.find_element(:value=> 'Inscrire').click
But so far I have not succeeded in posting the data.
Could you please tell me whether Selenium will let me do what I need to do and, if so, how?
At a glance, your code could use less indentation and more white space/empty lines to separate the internal logic of aeiexport (which should be renamed aei_export, since Ruby uses snake case for method names). You can find more recommendations on how to style Ruby code here.
Besides the style of your code, an error I found at the beginning of your method is using an undefined variable page when defining form_login_AEI.
For your second question, I'm not familiar with Selenium; however since it does use a real web browser it can handle JavaScript. Watir is another possible solution.
An alternative would be to view the page source (e.g. in Firebug) and work out what the JavaScript on the page does, then use Mechanize to make the same request manually.
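A rough sketch of that Mechanize route for the dispatchAndSubmit button. The handler's source isn't shown in the question, so this assumes it sets one field on the form and then submits it, which needs to be confirmed by reading the page's JavaScript. The helper name submit_like_javascript and the field name "dispatch" are hypothetical; form_with, []= and submit are the usual Mechanize page and form calls.

```ruby
# Reproduce by hand what an onclick handler is ASSUMED to do:
# set one field on a named form, then submit that form.
# `page` is expected to behave like a Mechanize::Page.
def submit_like_javascript(page, form_name, field_name, field_value)
  form = page.form_with(name: form_name)
  raise ArgumentError, "no form named #{form_name}" if form.nil?

  # Mechanize adds the field if the form does not already have it.
  form[field_name] = field_value
  form.submit
end
```

In the code from the question this might be called as submit_like_javascript(page_submit_licence, "JoueurRechercheForm", "dispatch", "rechercher"), with "dispatch" replaced by whatever field the real JavaScript sets.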
I need to detect if a remote page changed. I wrote:
a = JSON.parse open('http://en.wikipedia.org/wiki/Main_Page').read
b = JSON.parse open('http://en.wikipedia.org/wiki/Main_Page').read
The page was not changed, but a == b returned false. Is it possible to detect if the page changed or not?
What have you put JSON.parse there for?? Do you expect the wikipedia mainpage to be json-encoded?
require 'open-uri'
# URI.open is the open-uri entry point; Ruby 3.0+ no longer accepts URLs in Kernel#open
a = URI.open('http://en.wikipedia.org/wiki/Main_Page').read
b = URI.open('http://en.wikipedia.org/wiki/Main_Page').read
puts a == b
# ⇒ true
If the pages are dynamically generated (by a CMS or similar), you need to examine the page content and explicitly reduce it to a canonical form: cut off all the transient information and compare only the static parts.
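A minimal sketch of that canonical-form comparison, assuming the transient parts can be described by a few regular expressions. The patterns and the method names canonical_form and page_fingerprint are made up for illustration; a real page needs its own list of what to strip.

```ruby
require "digest"

# Parts of a page ASSUMED to change between requests (illustrative only).
VOLATILE_PATTERNS = [
  /<!--.*?-->/m,               # HTML comments, which often carry render timestamps
  /<script\b.*?<\/script>/mi   # inline scripts, which often carry tokens
].freeze

# Reduce a page to the parts that are expected to stay stable.
def canonical_form(html)
  stripped = VOLATILE_PATTERNS.reduce(html) { |text, pattern| text.gsub(pattern, "") }
  stripped.gsub(/\s+/, " ").strip
end

# Digest of the canonical form; pages with equal digests are treated as unchanged.
def page_fingerprint(html)
  Digest::SHA256.hexdigest(canonical_form(html))
end
```

You would store page_fingerprint(body) from one fetch and compare it with the fingerprint of the next fetch, instead of comparing the raw bodies with ==.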
I have been trying to write a script that lets me comment from the command line. (The sole reason I want to do this is that it's vacation time here and I want to kill time.)
I often visit and post on this site, so I am starting with it.
For example, to comment on this post I used the following script:
require "uri"
require 'net/http'
def comment()
  response = Net::HTTP.post_form(
    URI.parse("http://www.geeksforgeeks.org/wp-comments-post.php"),
    {'author'=>"pikachu",'email'=>"saurabh8c#gmail.com",'url'=>"geekinessthecoolway.blogspot.com",
     'submit'=>"Have Your Say",'comment_post_ID'=>"18215",'comment_parent'=>"0",
     'akismet_comment_nonce'=>"70e83407c8",'bb2_screener_'=>"1330701851 117.199.148.101",
     'comment'=>"How can we generalize this for a n-ary tree?"}
  )
  return response.body
end
puts comment()
Obviously the values were not hardcoded originally, but for the sake of clarity and to keep the post focused I am hardcoding them here.
Besides the regular fields that appear on the form, I found the values for the hidden fields with Wireshark when I posted a comment the normal way. I can't figure out what I am missing. Maybe some JS event?
Edit:
As a few people suggested using mechanize, I switched to Python. Now my updated code looks like:
import sys
import mechanize
uri = "http://www.geeksforgeeks.org/"
request = mechanize.Request(mechanize.urljoin(uri, "archives/18215"))
response = mechanize.urlopen(request)
forms = mechanize.ParseResponse(response, backwards_compat=False)
response.close()
form=forms[0]
print form
control = form.find_control("comment")
#control=form.find_control("bb2_screener")
print control.disabled
# ...or readonly
print control.readonly
# readonly and disabled attributes can be assigned to
#control.disabled = False
form.set_all_readonly(False)
form["author"]="Bulbasaur"
form["email"]="ashKetchup#gmail.com"
form["url"]="9gag.com"
form["comment"]="Y u no put a captcha?"
form["submit"]="Have Your Say"
form["comment_post_ID"]="18215"
form["comment_parent"]="0"
form["akismet_comment_nonce"]="d48e588090"
#form["bb2_screener_"]="1330787192 117.199.144.174"
request2 = form.click()
print request2
try:
    response2 = mechanize.urlopen(request2)
except mechanize.HTTPError, response2:
    pass
# headers
for name, value in response2.info().items():
    if name != "date":
        print "%s: %s" % (name.title(), value)
print response2.read() # body
response2.close()
Now the server returns me this. On going through the HTML of the original page, I found that there is one more field, bb2_screener, that I need to fill in if I want to look like a browser to the server. The problem is that the field is not written inside the form tag, so mechanize won't treat it as a field.
Assuming you have all the params correct, you're still missing the session information that the site stores in a cookie. Consider using something like mechanize, that'll deal with the cookies for you. It's also more natural in that you tell it which fields to fill in with which data. If that still doesn't work, you can always use a jackhammer like selenium, but then technically you're using a browser.
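To make the cookie point concrete, here is a small sketch of the bookkeeping that a library like Mechanize does for you: taking the Set-Cookie values from one response and turning them into the Cookie header for the next request. The method name cookie_header is made up, and the sketch is deliberately simplified: it ignores attributes such as Path, Domain and Expires that a real cookie jar honours.

```ruby
# Build a Cookie request header from the Set-Cookie values of a response.
# Simplified: only the leading name=value pair of each cookie is kept.
def cookie_header(set_cookie_values)
  Array(set_cookie_values)
    .map { |value| value.split(";").first.to_s.strip }
    .reject(&:empty?)
    .join("; ")
end
```

In a Net::HTTP flow the input would come from response.get_fields("set-cookie") on the GET of the article page, and the result would be assigned to the Cookie header of the POST to wp-comments-post.php.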