How to download a secured webpage - https

I wish to programmatically download a webpage which requires a log in to view. Is there any sane way of doing this? By looking at HTTP headers and such, I can see the username / password being passed as POST data, but requesting a page with this info attached isn't good enough. I think cookies are involved too, and it looks like they contain some kind of encrypted authorisation data.
Is there any way of faking this? Language isn't too important here, but something like Perl that can be run on Linux with relative ease would be nice. Or maybe a command line browser could be scripted?

Yes, you can do this via the curl command-line tool or the libcurl library. You need to figure out what's supposed to be in the cookies, and then pass them with curl's -b option or the equivalent libcurl API.
You can also perform HTTP Basic authentication via curl (the -u option).
If the page is really sophisticated, you'll have to do HTML parsing or even JS interpretation to extract the cookie data beforehand. That's still doable, but not with curl alone.
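For example (the URL and form field names here are placeholders; take the real ones from the site's login form):

# POST the credentials once and save whatever cookies the server sets
curl -c cookies.txt -d 'username=me&password=secret' https://www.example.com/login
# reuse the saved cookies on later requests for protected pages
curl -b cookies.txt https://www.example.com/members/report.html
# or, if the site uses HTTP Basic authentication, -u is enough on its own
curl -u me:secret https://www.example.com/members/report.html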
As a general note, anything a web browser can do can be scripted. Turing-completeness and all that. "Unscriptable" captive portals like BlueSocket sells are a load of bunk; they're basically just obfuscated web pages. They'll slow you down but can never, ever stop you - they have to give you the keys in order to work!

PHP's cURL extension would do it. Also check here if this solution is right for you.

Related

Is there such a thing as an http shell?

I'll soon be doing a presentation on the basics of HTTP for colleagues where I work.
I've done this sort of thing a number of times, and one thing I like to do is telnet directly to an http server and send the various headers that way. The idea is to show the simplicity of the protocol, and remove browsers from the discussion.
In the past, I've copied the headers from a text document, to avoid typos and timeouts. So, it goes something like this:
telnet to somewebserver.com:80
For the first go, simply type in GET, etc. This emphasizes the fact that it's simple, just text, etc.
For later requests copy and paste the request from a text document.
Etc...
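So the raw exchange ends up looking something like this (somewebserver.com is just the placeholder from above; HTTP/1.1 requires the Host header, and the blank line after the headers is what ends the request):

$ telnet somewebserver.com 80
GET / HTTP/1.1
Host: somewebserver.com
Connection: close

HTTP/1.1 200 OK
...response headers and body follow...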
It would be nice if there was a way to replay previous commands, similar to the way various shells' history works. However, searching for http [interactive] shell is a bleak wasteland of irrelevance.
Does such a thing exist? Or am I off base in my search terms? Any advice is welcome, including suggestions about other tools or tips for building my own.
I'll likely be doing the presentation on a Macintosh.
Thanks!
Greg
My answer is based on BenjaminW's.
HTTPie appears to do what I want.
https://httpie.org/
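For example (the host is a placeholder): -v prints the request HTTPie sends as well as the response, which fits the "show the raw protocol" goal, and -f sends the fields as an ordinary form POST:

http -v GET http://somewebserver.com/
http -v -f POST http://somewebserver.com/login username=me password=secret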
I suggest you use Postman. I've used it to develop RESTful APIs for several years; it can send any kind of HTTP request and sync your browser's cookies, and I think it's also a great tool for managing your HTTP requests.

A plugin for manipulating JavaScript/HTML code

I need a tool that can parse and insert code into the JavaScript/HTML code before the browser starts to interpret it. I've been thinking of using a proxy to do it, but now I'd like to know whether I could implement such functionality in a Firefox plug-in.
Sounds like Greasemonkey to me.
What does Greasemonkey do?
Greasemonkey lets you add JavaScript code (called "user scripts") to any web page, which will run when its HTML code has loaded. Compared to writing extensions, user scripts often offer a light-weight alternative, requiring no browser restart on user script installation nor removal, and work with the common DOM API familiar to any web developer (with somewhat elevated privileges for doing cross domain XMLHttpRequest requests and storing small portions of private data). User scripts work more or less like bookmarklets automatically invoked for any URLs matching one or more glob patterns.
http://wiki.greasespot.net/FAQ
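As a rough sketch, a user script is just a JavaScript file with a metadata block on top telling Greasemonkey which URLs it applies to (the name and glob pattern below are placeholders):

// ==UserScript==
// @name        Example tweak
// @namespace   http://example.com/userscripts
// @include     http://example.com/*
// ==/UserScript==

// ordinary DOM code; it runs once the page's HTML has loaded
document.title = 'modified by a user script';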
I'm pretty sure something like Tamper Data might work. Or maybe Fiddler, but that's an application with additional hooks that enable it to work with Firefox.
Tamper Data: https://addons.mozilla.org/en-US/firefox/addon/966/
Fiddler: http://www.fiddler2.com/fiddler2/
Of course both work on a network level, so they may be a bit more arcane than what you'd need.

A script to log in to a web page

I want to write a script to log in and interact with a web page, and I'm a bit at a loss as to where to start. I can probably figure out the HTML parsing, but how do I handle the login part? I was planning on using bash, since that is what I know best, but am open to any other suggestions. I'm just looking for some reference materials or links to help me get started. I'm not really sure whether the password ends up stored in a cookie or whatnot, so how do I assess the situation?
Thanks,
Dan
Take a look at cURL, which is generally available in a Linux/Unix environment, and which lets you script a call to a web page, including POST parameters (say, a username and password), and lets you manage the cookie store, so that a subsequent call (to get a different page within the site) can use the same cookie (so your login will persist across calls).
I did something like that at work some time ago; I had to log in to a page and post the same data over and over...
Take a look here. I used wget because I could not get it working with curl.
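Roughly what that looks like with wget (the URL and field names are made up; the real ones come from the login form):

# log in, keeping the session cookies the server hands back
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=me&password=secret' \
     -O /dev/null https://www.example.com/login

# then fetch the protected page with those cookies
wget --load-cookies cookies.txt https://www.example.com/members/data.csv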
Search this site for screen scraping. It can get hairy since you will need to deal with cookies, JavaScript and hidden fields (ViewState!). Usually you will need to scrape the login page to get the hidden fields and then post to the login page. Have fun :D
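For example, in Ruby the "scrape the hidden fields, then post" step looks roughly like this (the URL and field names are invented):

require 'net/http'
require 'nokogiri'

login_uri  = URI('https://www.example.com/login.aspx')
login_page = Net::HTTP.get(login_uri)

# copy every hidden input (__VIEWSTATE and friends) so the POST looks like a real form submit
fields = {}
Nokogiri::HTML(login_page).css('input[type=hidden]').each do |input|
  fields[input['name']] = input['value']
end
fields['username'] = 'me'
fields['password'] = 'secret'

response = Net::HTTP.post_form(login_uri, fields)
# any session cookie shows up in response['Set-Cookie'] and has to be sent on later requests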

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I am successfully able to read the data displayed on page 1 of the grid but unable to figure out how I can move to the next page in the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks which cause a postback to the same page.
An example of such a hyperlink (the anchor markup did not survive here; only its link text, "6", remains - the href is a javascript: call that triggers the postback).
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click   # link text is a string, so convert the page number
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in a completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags); there's a sketch of this after the list of options.
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
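For example, a minimal sketch of option 1b (the URL is a placeholder):

require 'net/http'
require 'uri'

uri  = URI('http://www.example.com/report.aspx')
body = Net::HTTP.get(uri)

# look for the postback plumbing: the __doPostBack definition and the hidden fields it submits
puts body.each_line.grep(/__doPostBack|__VIEWSTATE|__EVENTTARGET/)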
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post and supplying the form data, e.g. my_http.post(my_url, post_data) instead of my_http.get(my_url), where post_data is the URL-encoded form fields.
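Concretely, a hedged sketch: the event target and argument below mimic what a GridView pager link passes to __doPostBack, and the hidden-field values have to be scraped from the page you already fetched (a dummy placeholder is used here):

require 'net/http'
require 'uri'

uri = URI('http://www.example.com/report.aspx')

# __VIEWSTATE (and often __EVENTVALIDATION) must be copied from the current page's hidden inputs
hidden_fields = { '__VIEWSTATE' => 'value scraped from the current page' }

response = Net::HTTP.post_form(uri, hidden_fields.merge(
  '__EVENTTARGET'   => 'GridView1',  # first argument of the __doPostBack call in the link
  '__EVENTARGUMENT' => 'Page$2'      # second argument, typically which page to show
))
puts response.body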
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or go through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. Like Markus said, use something like Firebug or the Developer Tools in IE 8, plus Fiddler, to watch the traffic. But honestly, this is a web form using the bloated GridView, and you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the JavaScript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
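A rough sketch of what the Mechanize side looks like (the URL, form and field names are invented; older versions expose the class as WWW::Mechanize):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('https://www.example.com/login')

form = page.forms.first        # or page.form_with(action: /login/i) to be more precise
form['username'] = 'me'        # field names come from the real login form
form['password'] = 'secret'
agent.submit(form)

# the agent keeps the session cookies, so later requests stay logged in
report = agent.get('https://www.example.com/report.aspx')
puts report.title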

Reverse Engineer A Web Form

I have a web site which I download 2-3 MB of raw data from that then feeds into an ETL process to load it into my data mart. Unfortunately the data provider is the US Dept. of Ag (USDA) and they do not allow downloading via FTP. They require that I use a web form to select the elements I want, click through 2-3 screens and eventually click to download the file. I'd like to automate this download process. I am not a web developer but somehow it seems that I should be able to use some tool to tell me exactly what put/get/magic goes from the final request to the server. If I had a tool that said, "pass these parameters to this url and wait for a response" I could then hack something together in Perl to automate this process.
I realize that if I deconstructed all 5 of their pages, read through the JavaScript includes, and tapped my heels together 3 times, I could get this info from what I have access to. But I want a faster and more direct path that does not require me to manually parse all their JS.
Restatement of the final question: Is there a tool or method that will show clearly what the final request sent from a web form was and how it was structured?
A tamperer's best friends (these are Firefox extensions; you could also use something like Wireshark):
HTTPFox
Tamper Data
Best of luck
Use Fiddler2 as a proxy to see what is being passed back and forth. I've done this with success in other similar circumstances
Home page is here: http://www.fiddler2.com/fiddler2/
Same as the other responses, except my tool of choice is Charles.
What about using a web testing toolkit, like Watir and Ruby?
Easy to fill in the forms... just use the output.
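Something along these lines with classic Watir, which drives Internet Explorer (the URL, field name and button labels are guesses standing in for the real USDA form):

require 'watir'

browser = Watir::IE.start('http://www.example.com/usda-report-form')
browser.select_list(:name, 'commodity').select('Cattle')   # pick the data elements you want
browser.button(:value, 'Continue').click                   # click through the intermediate screens
browser.button(:value, 'Download').click                   # fire the final download request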
Use WatiN and combine it with WatiN TestRecorder (Google for it)
It can "simulate" a user sitting in front of the browser punching in values which you can supply from your own C# code...
