A script to log into a webpage - bash

I want to write a script to log in and interact with a web page, and I'm a bit at a loss as to where to start. I can probably figure out the HTML parsing, but how do I handle the login part? I was planning on using bash, since that is what I know best, but am open to any other suggestions. I'm just looking for some reference materials or links to help me get started. I'm not really sure whether the password ends up stored in a cookie or whatnot, so how do I assess the situation as well?
Thanks,
Dan

Take a look at cURL, which is generally available in a Linux/Unix environment. It lets you script a call to a web page, including POST parameters (say, a username and password), and it lets you manage the cookie store, so that a subsequent call (to get a different page within the site) can use the same cookie and your login will persist across calls.
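Since you say you're open to other languages, here is roughly the same flow sketched with Ruby's Net::HTTP; the URL and form field names are hypothetical placeholders for whatever the real login form uses:
require 'net/http'
require 'uri'

login = URI('http://example.com/login')                 # hypothetical login URL
res = Net::HTTP.post_form(login, 'username' => 'dan', 'password' => 'secret')
cookie = res['Set-Cookie']                              # session cookie set at login

page = URI('http://example.com/members/index.html')     # a page behind the login
req = Net::HTTP::Get.new(page)
req['Cookie'] = cookie                                  # replay the cookie so the login persists
puts Net::HTTP.start(page.host, page.port) { |http| http.request(req) }.body
curl's -c (write cookie jar) and -b (read cookie jar) options do the same bookkeeping for you from a bash script.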

I did something like that at work some time ago; I had to log in to a page and post the same data over and over...
Take a look here. I used wget because I couldn't get it working with curl.

Search this site for screen scraping. It can get hairy, since you will need to deal with cookies, JavaScript, and hidden fields (ViewState!). Usually you will need to scrape the login page to get the hidden fields and then post them back to the login page along with your credentials. Have fun :D
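As a rough Ruby illustration of that scrape-then-post dance (the URL and field names are hypothetical, and a real ASP.NET form carries more hidden fields than just __VIEWSTATE):
require 'net/http'
require 'uri'

login = URI('http://example.com/login.aspx')   # hypothetical login page
html  = Net::HTTP.get(login)

# Fish the hidden field out of the login form's markup
viewstate = html[/name="__VIEWSTATE"[^>]*value="([^"]*)"/, 1]

# Post it back along with the credentials
res = Net::HTTP.post_form(login,
  'username'    => 'dan',
  'password'    => 'secret',
  '__VIEWSTATE' => viewstate)
session_cookie = res['Set-Cookie']             # keep this for subsequent requests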

Having a Chrome extension pull data from a Ruby db

I would like to build a Chrome extension (CE) that pulls data from a Ruby db for a specific user. So, in a basic example, if a user submits their favorite color as 'red' and sport as 'tennis' into the db from the core website, when they click the CE, 'red' and 'tennis' will show up no matter where they are on the internet.
Any guidance on how to build something like this? It seems quite simple, but I'm not sure how the CE files fit in with the Ruby folder framework.
Also, is it possible to write to a Ruby database from a popped-out CE? i.e., submitting 'red' and 'tennis' from the CE to the Ruby database, to go along with the previous example. Any guidance?
Cheers
This is a very general question so it sounds like you will need to learn a lot. Which can be a good thing :)
Here are the general steps you need:
Look into building an API for your Ruby application. This will allow you to get data from your database. For example, you can make an endpoint like http://yoursite.com/api/favorites that returns a list of all favorites as JSON. Then in your Chrome extension you can parse the JSON and display the results to the user. You will probably want to do this using an Ajax call (see jquery.ajax for an easy way to use Ajax). There's a Sinatra-flavored sketch at the end of this answer.
Assuming you want user accounts, your user will need to be logged in. Then you can use your user's cookies to verify that they are logged in and show them custom info, i.e. going to http://yoursite.com/api/favorites will show just the favorites for that user, not for everyone.
Finally, submitting things to the database... you can have another route where users can send data. For example, if you go to http://yoursite.com/api/favorites/add?color=red then it will add the color red to that user's favorites. You will need to write all the logic for adding things to the database; again, it might help you to go through a Rails tutorial and then look at building an API.
Related to #3, look into RESTful APIs. A good convention is that if you issue a GET request, you're asking for data, but if you issue a POST request, you are adding data (in your case, creating a new favorite).
Finally, for terminology: it's not a "ruby" database, it's just a database. You can access a database using almost any language, and it sounds like you are accessing it using ruby right now :)
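A minimal Sinatra-flavored sketch of steps 1, 3, and 4 (the routes and data are made up; a real app would query the database and identify the user from their session cookie):
require 'sinatra'
require 'json'

# Stand-in for a real database lookup keyed by the logged-in user
FAVORITES = { 'dan' => { 'color' => 'red', 'sport' => 'tennis' } }

get '/api/favorites' do                 # GET asks for data...
  content_type :json
  FAVORITES['dan'].to_json
end

post '/api/favorites' do                # ...POST adds data (the REST convention above)
  FAVORITES['dan'][params['kind']] = params['value']
  status 201
end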
If you only need to store data for one machine browsing anywhere online, Chrome has a storage API that would work great.
If you do need a Ruby server, I would recommend looking at Sinatra.

How to download a secured webpage

I wish to programmatically download a webpage that requires a login to view. Is there any sane way of doing this? By looking at the HTTP headers and such, I can see the username/password being passed as POST data, but requesting a page with this info attached isn't good enough. I think cookies are involved too, and it looks like they contain some kind of encrypted authorisation data.
Is there any way of faking this? Language isn't too important here, but something like Perl that can be run on Linux with relative ease would be nice. Or maybe a command-line browser could be scripted?
Yes, you can do this via the curl command-line tool or the libcurl library. You need to figure out what's supposed to be in the cookies, and then pass them with curl's -b option or the equivalent libcurl API.
You can also perform HTTP Basic authentication via cURL.
If the page is really sophisticated, you'll have to do HTML parsing or even JS interpretation to extract the cookie data beforehand. That's still doable, but not with cURL alone.
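For what it's worth (you said language isn't too important), the same two tricks look like this with Ruby's Net::HTTP; the URL and cookie value are placeholders:
require 'net/http'
require 'uri'

uri = URI('http://example.com/protected/page.html')   # hypothetical protected page
req = Net::HTTP::Get.new(uri)
req.basic_auth('dan', 'secret')                        # HTTP Basic authentication
req['Cookie'] = 'session=abc123'                       # or replay a cookie captured from the browser
res = Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
puts res.body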
As a general note, anything a web browser can do can be scripted. Turing-completeness and all that. "Unscriptable" captive portals like the ones BlueSocket sells are a load of bunk; they're basically just obfuscated web pages. They'll slow you down, but they can never, ever stop you, because they have to give you the keys in order to work!
PHP's cURL bindings would do it. Also check here to see if this solution is right for you.

With Google's #! mess, what effect would a redirect on the converted URL have?

So Google takes:
http://www.mysite.com/mypage/#!pageState
and converts it to:
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
...So... would it be fair game to redirect that, with a 301 status, to something like:
http://www.mysite.com/mypage/pagestate/
and then return an HTML snapshot?
My thought is that if you have an existing HTML structure and you just want to add Ajax as a progressive enhancement, this would be a fair way to do it, if Google just skipped over _escaped_fragment_ and indexed the redirected URL. Then your Ajax links are wired up by JavaScript, and underneath them are the regular links that go to your regular site structure.
So then when a user comes in on a static URL (i.e. http://www.mysite.com/mypage/pagestate/ ), the first link they click takes them to the Ajax interface if they have JavaScript, and then it's all Ajax.
On a side note, does anyone know if Yahoo/MSN are on board with this 'spec' (loosely used)? I can't seem to find anything that says for sure.
If you redirect the "?_escaped_fragment_" URL, it will likely result in the final URL being indexed (which might result in a suboptimal user experience, depending on how you have your site set up). There might be a reason to do it like that, but it's hard to say in general.
As far as I know, other search engines are not yet following the AJAX-crawling proposal.
You've pretty much got it. I recently ran some tests and experimented with sites like Twitter (which uses #!) to see how they handle this. From what I can tell, they handle it the way you're describing.
If this is your primary URL
http://www.mysite.com/mypage/#!pageState
Google/Facebook will go to
http://www.mysite.com/mypage/?_escaped_fragment_=pageState
You can set up a server-side 301 redirect to a prettier URL, perhaps something like
http://www.mysite.com/mypage/pagestate/
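Server-side, that redirect can be small; a hypothetical Sinatra sketch of the idea:
require 'sinatra'

get '/mypage/' do
  if params['_escaped_fragment_']
    # Send the crawler's ?_escaped_fragment_= form of the URL to the static path
    redirect "/mypage/#{params['_escaped_fragment_'].downcase}/", 301
  else
    erb :mypage    # the normal dynamic page
  end
end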
On these HTML snapshot pages you can add a client-side redirect to send most people back to the dynamic version of the page. This ensures most people share the dynamic URL. For example, if you try to go to http://twitter.com/brettdewoody it'll redirect you to the dynamic (https://twitter.com/#!/brettdewoody) version of the page.
To answer your last question, both Google and Facebook use the _escaped_fragment_ method right now.

Automate website log-in and form filling?

I'm trying to log in to a website and save an HTML page automatically (I want to be able to do this on a regular time interval). On the surface, this is a typical modern website: if the user navigates directly to a "locked" URL, a log-in form pops up, and after logging in, the user is redirected to the intended page.
I gave mechanize a shot (http://wwwsearch.sourceforge.net/mechanize/) but it wasn't finding some form elements which were needed for login (hidden elements that have some values put in by a javascript function that runs when the user clicks the "log in" button).
I played a bit with the "web browser" control in .NET but quickly lost interest because I couldn't even get it to submit a query on the Google page.
I don't care what the language is; I'll learn it to solve this problem. At a minimum it has to work in Windows.
A simple example, say, typing in a query into the Google search box would be a great bonus.
In my experience, the most reliable way is to use JavaScript. It works well from .NET. To test, browse to the following addresses one after another in Firefox or Internet Explorer:
http://www.google.com
javascript:function f(){document.forms[0]['q'].value='stackoverflow';}f();
javascript:document.forms[0].submit()
That performs a search for "stackoverflow" on Google. To do it in VB .Net using the webbrowser control, do this:
WebBrowser1.Navigate("http://www.google.com")
Do While WebBrowser1.IsBusy OrElse WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
    Threading.Thread.Sleep(1000)
    Application.DoEvents()
Loop
WebBrowser1.Navigate("javascript:function%20f(){document.forms[0]['q'].value='stackoverflow';}f();")
Threading.Thread.Sleep(2000) 'wait for javascript to run
WebBrowser1.Navigate("javascript:document.forms[0].submit()")
Threading.Thread.Sleep(2000) 'wait for javascript to run
Notice how the space in the URL is converted to %20. I'm not certain this is necessary, but it can't hurt. It is important that the first piece of JavaScript be wrapped in a function. The calls to Sleep() wait for Google to load and for the JavaScript to run. The Do While loop might run forever if the page fails to load, so for automation purposes add a counter that times out after, say, 60 seconds.
Of course, for Google you can just navigate directly to www.google.com/search?q=stackoverflow, but if your site has hidden input fields, etc., then this is the way to go. It only works for HTML sites; Flash is a whole other matter.
If I understand you right, you want to log in to only one webpage, and that form always stays the same. You could either reverse engineer the JavaScript, or step through it in a JavaScript debugger in the browser (e.g. Firebug for Firefox). Or you can fill in the form in your browser and look at the HTTP request with a network packet sniffer. Once you have all the form data required to submit, you can do the same from your program (that's what I did the last time I had a pretty similar task). Don't forget to store all the cookie data the webserver sends back, and send it with each subsequent request, to 'stay logged in'.
It's already being discussed here.
Basically, the gist is that you can use Selenium, an open-source web automation tool, which has API libraries available in various languages like Java, Ruby, etc.
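For example, your Google-search bonus case looks roughly like this with the Ruby bindings (a sketch assuming the selenium-webdriver gem and Firefox are installed):
require 'selenium-webdriver'   # gem install selenium-webdriver

driver = Selenium::WebDriver.for :firefox   # drives a real browser, so the page's JavaScript just runs
driver.get 'http://www.google.com'
box = driver.find_element(:name, 'q')       # Google's search box is named "q"
box.send_keys 'stackoverflow'
box.submit                                  # submits the enclosing form
driver.quit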
Neoload can handle the form filling with authentication, assuming you don't want to collect data, just perform actions. It's a web stress tool, so it's not really meant to be used as a time-based service, but you COULD just leave it running.
I've used Ruby and Watir (a web app testing suite) for something similar, but it was a very small task (basically visiting URLs from a text file and downloading an image).
There's also an extension called iMacros that can do some automation, but I'm not personally familiar with it (just aware of it).
"I'm trying to log in to a website and save an HTML page automatically"
SAVEAS TYPE=HTM FOLDER=C: FILE=page.html
https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/?src=search
This command, played in the iMacros addon, will save the page to the C: drive and name it page.html.
Also,
URL GOTO=www.website.com
goes to the particular website you want to save. You can also use scripting in iMacros and target different websites in the macro.

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) that uses a GridView to display data. I can successfully read the data displayed on page 1 of the grid, but I can't figure out how to move to the next page in the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks that cause a postback to the same page.
An example of the hyperlink:
6
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to run frequently and quietly in the background in a completely automated way, this may not be the best approach. On the other hand, if it's OK to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP, you've already got the tools to find the definition of __doPostBack (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags); see the sketch after this list.
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
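As a tiny sketch of option 1b (the URL is hypothetical), the calls are often visible right in the page body:
require 'net/http'
require 'uri'

html = Net::HTTP.get(URI('http://example.com/grid.aspx'))   # fetch the page source
puts html.scan(/__doPostBack\([^)]*\)/).uniq                # list every distinct postback call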
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by just replacing your get with a post, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url)
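For the GridView paging specifically, a hedged sketch of what that POST tends to look like (the control name and paging argument are placeholders; capture the real ones with Firebug or a proxy as described above):
require 'net/http'
require 'uri'

uri  = URI('http://example.com/grid.aspx')    # hypothetical GridView page
html = Net::HTTP.get(uri)                     # page 1, which carries the hidden fields
viewstate = html[/name="__VIEWSTATE"[^>]*value="([^"]*)"/, 1]

# Replay what __doPostBack('GridView1', 'Page$2') would submit
res = Net::HTTP.post_form(uri,
  '__EVENTTARGET'   => 'GridView1',           # placeholder control name
  '__EVENTARGUMENT' => 'Page$2',              # placeholder paging argument
  '__VIEWSTATE'     => viewstate)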
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire events or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're making the request inline yourself.
You will have to perform the postback. The data is passed back to the server with a form POST. Like Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a WebForms page using the bloated GridView, so you're in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and cookie handling. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
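A minimal Mechanize sketch of the login-then-scrape flow (the URL, field names, and table selector are hypothetical):
require 'mechanize'   # gem install mechanize

agent = Mechanize.new
page  = agent.get('http://example.com/login')     # hypothetical login URL
form  = page.forms.first
form['username'] = 'dan'                          # placeholder field names
form['password'] = 'secret'
page = agent.submit(form)                         # cookies and redirects are handled for you
puts page.search('#GridView1 td').map(&:text)     # Nokogiri-style searching of the result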
