How do I log in to a site remotely using a script? - ruby

I'm trying to write a script to automate a repetitive task I currently do manually. I would like to log in to a site remotely and inspect the returned page.
I've been doing this by sending a direct POST request (the site is PHP; I'm pretty sure it's Joomla) with my login details and data for the other fields of the form on the front page. But I'm getting either getaddrinfo socket errors from Ruby's Net::HTTP library when I try HTTP.post() (with the data as a param1=val1&param2=val2 string), or a rejected login that redirects to the home page if I use HTTP.post_form (with the data in a Hash).
I'm willing to do this in any language, really, I just picked Ruby since it's a favorite for quick scripting. If you have any ideas in bash, Python, etc. I'd be happy to try it.
I've tried variations on some examples, to no avail. Have any of you tried something like this with success? Any stumbling blocks we beginners run into frequently?
Thanks for your time ^_^
-Paul

Try mechanize:
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html
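For example, here's a minimal login sketch with the Ruby Mechanize gem. The form matcher and field names are placeholders (older Joomla login forms typically use username and passwd), so adjust them to whatever the site's HTML actually contains:

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://www.example.com/')   # fetch the page containing the login form
form = page.form_with(action: /login/)        # placeholder: match the form by its action URL
form['username'] = 'myuser'                   # field names depend on the site's HTML
form['passwd'] = 'mypassword'
result = form.submit                          # Mechanize keeps cookies and follows redirects
puts result.body                              # inspect the returned page

Because Mechanize parses the real form, it also carries along hidden fields such as Joomla's session token, which a hand-built POST string omits; that's a common reason a raw POST gets bounced back to the home page.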

Have a look at mechanize (Python), which is written with your problem in mind:
from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")
# follow the second link (nr is zero-based) whose text matches the regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)

Related

HtmlUnit. How can I get site content updated by Ajax and WebSockets?

I need to fetch comments from this site, https://russian.rt.com/, for example for this article: https://russian.rt.com/sport/article/486467-rossiya-hokkei-zoloto-olimpiady
So I try this:
String url = "https://russian.rt.com/sport/article/486467-rossiya-hokkei-zoloto-olimpiady";
try (WebClient client = new WebClient(BrowserVersion.FIREFOX_52)) {
    client.getOptions().setJavaScriptEnabled(true);
    client.getOptions().setThrowExceptionOnScriptError(false);
    client.getOptions().setThrowExceptionOnFailingStatusCode(false);
    client.setAjaxController(new NicelyResynchronizingAjaxController());
    HtmlPage rtPage = client.getPage(url);
    HtmlElement comBlock = rtPage.getFirstByXPath("//ul[@class='sppre_messages-list']");
} ...
But HtmlElement comBlock is always null.
I've tried:
- waiting for the JavaScript to complete:
client.waitForBackgroundJavaScript(10 * 1000);
- scrolling the page:
client.getCurrentWindow().setInnerHeight(60000);
or
rtPage.executeJavaScript("window.scrollBy(0,600)");
- getting elements at the bottom of the page and clicking them.
None of that helped; HtmlElement comBlock is still null after all these operations.
Maybe the comments module uses some kind of WebSockets and this is not even possible?
Can anyone help me, please?
I have done some short tests with this site. At first I saw an NPE when calling the site; this is fixed now in HtmlUnit. I usually announce new snapshot builds via Twitter (www.twitter.com/HtmlUnit). After that fix I faced many more JavaScript problems. It looks like the page does a lot of JavaScript, including some ugly things. If you would like to get this fixed, it would be a great help if you could isolate simple cases that show the problems, to give us a chance to fix HtmlUnit (there is more information about this on the HtmlUnit home page).
Sorry for not having a direct solution, but as with many open source projects, we need help from the community to do all the work.

Python Selenium web scraping multiple pages (increment page number in URL)

I have little to no programming experience, and I just started learning Python two weeks ago. It was a pain getting it running under Windows (e.g. environment variables, something I didn't really know about until two weeks ago).
I am using Selenium to try to scrape information.
Basically, the URLs of the (mainly JavaScript-driven) pages change incrementally:
e.g.
http://sssss.proxy.sssss/sample#/detail/0/1
http://sssss.proxy.sssss/sample#/detail/0/2
http://sssss.proxy.sssss/sample#/detail/1/1
http://sssss.proxy.sssss/sample#/detail/1/2
http://sssss.proxy.sssss/sample#/detail/2/1
http://sssss.proxy.sssss/sample#/detail/2/2
http://sssss.proxy.sssss/sample#/detail/50/1
http://sssss.proxy.sssss/sample#/detail/50/2
I want to programmatically and systematically scrape specific content (found by XPath) from each page and its subpages. However, I don't know how a loop (e.g. for i in ...) works in this case, because every URL has to be fetched with "get" by the browser WebDriver. (Does that mean the loop will make the browser open every single page, or will it happen within the Python shell, as with the requests package?)
There are methods for scraping content from a fixed URL, but in my case the URL changes, so I assume it has to be done differently.
Please enlighten me
With thanks,
Iverson
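Yes, the loop really drives the browser: each driver.get makes the WebDriver load that URL in the actual browser window, unlike the requests package, which fetches pages without a browser. Here's a minimal sketch of the loop pattern, written in Ruby with the selenium-webdriver gem since this thread is Ruby-centric (the same shape works in Python, and the XPath is a placeholder for the content you actually want):

require 'selenium-webdriver'

driver = Selenium::WebDriver.for :firefox   # opens a real browser window

(0..50).each do |section|                   # first number in the URL
  (1..2).each do |subpage|                  # second number in the URL
    driver.get("http://sssss.proxy.sssss/sample#/detail/#{section}/#{subpage}")
    # placeholder XPath: point it at the elements you want to scrape
    driver.find_elements(:xpath, "//div[@class='content']").each do |element|
      puts element.text
    end
  end
end

driver.quit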

Use of link-checker (Ruby)

Has anyone used the link-checker gem?
I don't want to use it in a project; I want to write a small script to test links on a web app.
I can't seem to figure out how to use it. Trying to require it doesn't work, but running gem 'link-checker' does return true.
I'm getting nowhere trying to play with it in IRB. Can someone let me know what I am missing?
Did you read the documentation? link-checker is already a small script designed to check links.
That page shows examples of it running from the command line, not from inside IRB or Ruby code. In other words, it is a command-line app, not code you require:
Usage:
Just give it the target that you want it to scan. For example, if you have an Octopress site then your output HTML is in the public directory, so call it with:
check-links 'public'
Or if you want to check the links on a live site, then give it a URL instead:
check-links 'http://www.ryanalynporter.com'
If you don’t pass any target, then the default is to scan the “./” directory. If you have a Jekyll site that you deploy to GitHub Pages, then you can check the links with just:
check-links
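If you'd still like to drive it from your own Ruby script, one option is simply to shell out to that same command-line tool. A minimal sketch, assuming check-links reports failures through its exit status (worth verifying against the gem's docs):

# Hypothetical wrapper around the check-links CLI; system returns true only on exit status 0.
ok = system('check-links', 'http://www.example.com/')
puts(ok ? 'check-links exited cleanly' : 'check-links reported problems or failed to run')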

Ruby: How to screen-scrape the result of an Ajax request

I have written a Ruby script to screen scrape something using the open-uri and hpricot libraries; everything works great so far.
But now I have to screen scrape something that is returned after a form is submitted via a JavaScript function (called by an onchange event handler on a drop-down menu):
function submit_form() {
    document.list.action = "/some/sort/of/path";
    document.list.submit();
}
AFAIK, open-uri lets you submit only GET requests. And if I'm not mistaken, a POST request would be needed here.
So my question is: what do I need to install and require, and what would the Ruby code look like (to make that POST request)? Sorry, I'm still pretty much of a n00b...
Thank you very much for your help!
Tom
I think you should definitely use Mechanize. It provides a nifty interface for interacting with remote pages, the forms on them, and so forth (see this example).
The Ruby standard library has the Net::HTTP class, which naturally supports the POST operation:
require 'net/http'
# the form field names here are placeholders; use the names from the actual form
res = Net::HTTP.post_form(URI.parse('http://www.example.com/some/sort/of/path'), 'param1' => 'value1', 'param2' => 'value2')
If you find the API there less than optimal, take a look at the httparty gem.
Finally, while hpricot is a great library, it isn't actively developed any longer. You should consider moving to nokogiri, which practically replaces hpricot and improves upon it.
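Putting the pieces together, here's a minimal sketch that POSTs the form and parses the response with nokogiri. The form field and CSS selector are placeholders; copy the real names out of the page's HTML:

require 'net/http'
require 'nokogiri'

# placeholder form data: same POST as above
res = Net::HTTP.post_form(URI.parse('http://www.example.com/some/sort/of/path'), 'param1' => 'value1')

doc = Nokogiri::HTML(res.body)
doc.css('div.results').each do |node|   # placeholder selector for the content you want
  puts node.text
end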

How to receive the value from an HTML textbox using Ruby?

I am creating a chat client/server system in Ruby.
My server will be hosted on a laptop or something (this is a class project, so not much processing power will be needed), and I plan for the client to run in a web browser.
I will feed it the HTML for two textboxes: one in which the user can type, and another that displays the chat history.
My problem is that while I can easily feed the HTML to the browser and get it to display the chat (by navigating to the IP address and port), I can't figure out how to send what is typed in the textbox back to the server.
Does anybody know how I could do this?
I'd suggest using a lightweight framework like Sinatra to handle this. It's simple enough to get things done quickly without a lot of required reading, but powerful enough to expand your chat application significantly, should you want.
The downside of using a web-based client is that the chat log will only be refreshed on the client after they ask the server for the newest information; namely, at each page refresh, instead of in real time.
You can get around this with some slick JavaScript (mostly XMLHttpRequest) that asks for new content at a regular interval, like how Stack Overflow shows you when new answers have been posted as you're typing an answer of your own.
It sounds like you need a basic knowledge of how CGIs work. Once you know that, you will find it easier to work with Sinatra, as @echoback recommended, or with Padrino or Rails, or to work in other languages.
This is a pretty basic CGI. It generates a simple form, along the lines of what you were talking about, then walks through the environment table passed to Ruby by the web server, sorts the keys, and outputs a table in sorted order. Most of the fields apply either to the web server itself or to the CGI, such as the query sent by the browser, along with the headers telling the server what the browser's capabilities are:
#!/usr/bin/env ruby
# Emit a simple form, then dump the CGI environment as an HTML table.
puts "Content-Type: text/html"
puts
puts "<html><head><style type='text/css'>body{font-family: monospace;}</style></head><body>"
puts "<form name='foo' action='test_cgi.rb'>"
puts "<input type='text' name='inputbox'><br />"
puts "<textarea name='textareabox'></textarea><br />"
puts "<input type='submit'>"
puts "</form>"
puts "<h4>ENVIRONMENT:</h4>"
puts "<table>"
ENV.keys.sort.each do |k|
  puts "<tr><td>#{k}</td><td>#{ENV[k]}</td></tr>"
end
puts "</table>"
puts "</body></html>"
Copy that code, store it into a Ruby file called test_cgi.rb, then set the executable bit on the file. Move that file into the cgi-bin directory of your web server on your machine. Use your browser to access the file (http://localhost:8080/cgi-bin/test_cgi.rb or something similar), and watch the output in the table change as you enter different values in the form and submit them.
Once you understand that round-trip from server to browser to server you'll be in a good place to learn how Sinatra builds on Rack to supply more features, more easily, than doing it all yourself with a CGI.
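To give a taste of that, here's a minimal sketch of the same round trip in Sinatra. The route and field names are made up for illustration, the history lives in memory (lost on restart), and user input isn't HTML-escaped the way a real app should:

require 'sinatra'

messages = []  # in-memory chat history; illustration only

get '/' do
  # Same form-plus-history page the CGI above produced, now served by a route.
  "<form action='/say' method='post'>" \
  "<input type='text' name='msg'>" \
  "<input type='submit'></form>" \
  "<pre>#{messages.join("\n")}</pre>"
end

post '/say' do
  messages << params['msg'].to_s  # params carries the submitted form fields
  redirect '/'
end

Save it as chat.rb, run ruby chat.rb, and browse to http://localhost:4567/.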
