Reverse Engineer A Web Form - etl

I have a web site from which I download 2-3 MB of raw data that then feeds into an ETL process to load it into my data mart. Unfortunately the data provider is the US Dept. of Agriculture (USDA) and they do not allow downloading via FTP. They require that I use a web form to select the elements I want, click through 2-3 screens, and eventually click to download the file. I'd like to automate this download process. I am not a web developer, but it seems that I should be able to use some tool to tell me exactly what PUT/GET/magic goes into the final request to the server. If I had a tool that said, "pass these parameters to this URL and wait for a response," I could then hack something together in Perl to automate this process.
I realize that if I deconstructed all 5 of their pages, read through the JavaScript includes, and tapped my heels together 3 times, I could get this info from what I already have access to. But I want a faster and more direct path that does not require me to manually parse all their JS.
Restatement of the final question: Is there a tool or method that will show clearly what the final request sent from a web form was and how it was structured?

A tamperer's best friends (these are Firefox extensions; you could also use something like Wireshark):
HTTPFox
Tamper Data
Best of luck

Use Fiddler2 as a proxy to see what is being passed back and forth. I've done this with success in other similar circumstances.
Home page is here: http://www.fiddler2.com/fiddler2/
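
Once one of these proxies shows you the final URL and the form fields it carries, replaying that request from a script is usually straightforward. Below is a minimal sketch in Ruby (the question mentions Perl, where LWP::UserAgent plays the same role); the URL, field names, and values are placeholders to be replaced with whatever the proxy actually captured:

require 'net/http'
require 'uri'

# Placeholder URL and parameters -- substitute what Fiddler/HTTPFox captured
uri = URI('https://example.com/path/to/report')
params = {
  'commodity' => 'CORN',   # hypothetical field name and value
  'year'      => '2013',   # hypothetical field name and value
  'format'    => 'csv'     # hypothetical field name and value
}

# Replay the final form submission as a POST and save the response body
response = Net::HTTP.post_form(uri, params)
File.open('usda_raw_data.csv', 'wb') { |f| f.write(response.body) }

If the download only works after stepping through the earlier screens, you would also need to carry the session cookie between requests, which some of the answers further down touch on.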

As with the other responses, except my tool of choice is Charles.

What about using a web testing toolkit, like Watir and Ruby?
Easy to fill in the forms... just use the output.

Use WatiN and combine it with WatiN TestRecorder (Google for it).
It can "simulate" a user sitting in front of the browser punching in values, which you can supply from your own C# code...

Related

Monitoring AJAX requests between a Flash applet and a server via a Google Chrome extension

I am playing a Flash-only game that uses AJAX to communicate with the server. The problem is that all the data is "drawn" and most of it not copy/pastable, so I end up retyping URLs and similar stuff from parts of it (i.e., from the chat).
I thought I'd make a simple page action extension for Chrome that would intercept all the AJAX communication between the game and the server, the way Developer tools can do it, and display only the data I'm interested in (parsing URLs and similar stuff is a no-brainer).
However, looking around the internet, I've found no info on how to do this. Many sites (including answers to some questions here) mention using Developer Tools (I'd prefer having a page action extension, simple enough to share with other players, but any other automation is welcome as well), some mention chrome.webRequest (which seems to be able to provide only the headers),...
I also thought of making a content script along the lines of this answer, but since I'm trying to read the data between a Flash applet (not a web page) and a server, I don't think injecting a JavaScript code is possible.
So, my question is: can this be done and, if yes, how?
In case anyone got the wrong idea, the aim of this is only to monitor the communication and extract the parts I'd want to be able to copy/paste, not change any data (i.e., the purpose is simplification of the game play, not cheating).

How to verify an Omniture tag fired correctly using Tamper Data

Sorry for asking a newbie question here. I've searched on the web, but everyone else seems to know the answer, so I can't find any definite words. I need to verify my Omniture tag fires correctly in Tamper Data. Should I look for a call from adobetag.com? Or what other URL call should I look for?
If your implementation is Omniture's 3rd-party cookie, the request will be sent to 2o7.net or omtrdc.net, depending on the version of SiteCatalyst you are using (and whether or not it is through their TagManager). If it is a first-party implementation, it will be to your own domain, which you worked with Omniture ClientCare to set up.
You can certainly look for those domains in any number of addons or other programs (built-in net console, Firebug, HTTPFox, Charles proxy - basically anything that can see requests being made), but FYI there are a couple of options that make it easier to see Omniture requests.
If you are using Firefox and have the Firebug addon, there is an extension to Firebug called Omnibug. It adds an extra tab to Firebug that shows you the requests made to Omniture, and even presents them in an easy-to-read format (it also reports other ones like GA and WebTrends).
Alternatively, Omniture provides a debugging tool called DigitalPulse. Basically, you create a bookmark and for the location/URL you put the JavaScript code snippet. Then on the page you click the bookmark and it pops up a window showing info about any Omniture requests made.
Filter to requests containing /b/ss; those are the requests being sent to Omniture SiteCatalyst.
See these references if you need more information -
http://blogs.adobe.com/digitalmarketing/analytics/validate-your-mobile-app-measurement-implementation/
http://blogs.adobe.com/digitalmarketing/analytics/custom-link-tracking-capturing-user-actions/
http://emptymind.org/validating-page-tags-with-httpfox/
http://tech.groups.yahoo.com/group/webanalytics/message/24232
In short, look for "/b/ss". However, this guy wrote a very in-depth article on how to do that with HTTPFox; it's a good read with screenshots: http://www.mikewebguy.com/2013/08/21/httpfox-helps-you-verify-what-data-is-being-sent-to-your-analytics-provider/
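
If you would rather script the check than eyeball it in an addon, one option is to export the captured traffic as a HAR file from the browser's dev tools and filter it for the beacon. A rough sketch in Ruby, assuming a HAR export named traffic.har in the current directory:

require 'json'

# Load a HAR export (DevTools > Network > save as HAR) -- the file name is an assumption
har = JSON.parse(File.read('traffic.har'))

# Print every request whose URL contains the SiteCatalyst beacon path /b/ss
har['log']['entries'].each do |entry|
  url = entry['request']['url']
  puts url if url.include?('/b/ss')
end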

Ways to programmatically check if a website is up and functioning as expected

I know this is an open ended question, but hopefully it will get some good answers before the thread is locked...
I'm wondering what methods there are to programmatically check (language agnostic) if a website is online from a client perspective (assume you can't make changes to the site/server, but you can rely on certain behaviours of the site.)
The result of each method could stack to provide a measure of certainty that the site is up/down - that is, a method does not have to provide a definite indication if the site is up/down on its own.
Some common tests just to check 'upness' may be:
Ping the site (which in the case of shared hosting isn't very indicative)
Send an HTTP HEAD/GET request and check the status
Others I can think of to check that the site is up and functioning:
Check you received a well-formed HTML response, i.e. an opening <html> tag through a closing </html> tag; if the site is experiencing trouble it may spit out an error and exit without writing the rest of the page (not all that reliable though, because the site may handle most errors in a better way)
Check certain content is or is not on the page, i.e. perhaps there is some content that is always present on your pages, or always present in the case of an error
Can anybody think of any other methods that could be used to help determine if a site is in fact up/down and functioning/not functioning correctly from within a program?
If your GET request on a page that displays info from the database comes back with status 200 and matching keywords are found, you can be pretty certain that your site is up and running.
And you don't really need to write your own script to do that. There are free services such as GotSiteMonitor, Pingdom, UptimeRobot, etc. that allow you to monitor your site.
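A rough sketch of that status-plus-keyword check in Ruby (the URL and keyword are placeholders):

require 'net/http'
require 'uri'

# Placeholder URL and keyword -- point this at a page backed by your database
uri = URI('https://example.com/products')
keyword = 'Add to cart'

response = Net::HTTP.get_response(uri)

# Consider the site "up" only if the status is 200 and the expected content is present
up = response.is_a?(Net::HTTPOK) && response.body.include?(keyword)
puts up ? 'Site looks up and functioning' : 'Site appears down or misbehaving'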
Base your set of tests on the unit testing principle. It is normally used in programming to test classes, modules or other artefacts after changes have been made. You can use any of the available frameworks, so you don't have to reinvent the wheel. You must describe (implement) the tests to be run; in your case a typical test should request a URL inside the page and then do some evaluations like the following (a rough sketch follows below):
call result (for example return code of curl execution)
http return code
http headers
response mime type
response size
response content (test against a regular expression)
This way you can add, remove and modify single tests without having to care about the framework once you are set up. You can also chain tests, so perform a login in one test and virtually click a button in a subsequent test.
There are also tools to handle such test runs automatically, including visualization of results, statistics and the like.
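As a rough sketch of this approach using Ruby's Minitest (the URL, minimum size, expected content type, and regular expression below are assumptions to replace with values that fit your own site):

require 'minitest/autorun'
require 'net/http'
require 'uri'

class SiteUpTest < Minitest::Test
  URL = URI('https://example.com/')  # placeholder URL

  def setup
    @response = Net::HTTP.get_response(URL)
  end

  def test_http_return_code
    assert_equal '200', @response.code
  end

  def test_response_mime_type
    assert_includes @response['Content-Type'], 'text/html'
  end

  def test_response_size
    assert @response.body.bytesize > 1_000, 'response suspiciously small'
  end

  def test_response_content
    assert_match(/<\/html>/i, @response.body, 'page appears truncated')
  end
end

Each test can be added, removed, or tightened independently, and the framework takes care of running them and reporting failures.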
OK, it sounds like you want to test and monitor your website from a customer-experience perspective rather than purely establishing whether a server is up (using ping, for example). An effective way to replicate the customer experience is to run tests against the site using one of the headless browser testing tools (PhantomJS is a great choice), as they will render the page fully (including images, CSS, JS, etc.), giving you a real page load time. These tools also allow you to make assertions on all aspects of the HTML content and HTTP response.
Pingdom recently started offering a (paid-for) service to perform these exact types of checks alongside their existing monitoring solution. The demo is worth looking at; their interface for writing the actual tests is very nice.

Automate website log-in and form filling?

I'm trying to log in to a website and save an HTML page automatically (I want to be able to do this on a regular time interval). From the surface, this is a typical modern website where, if the user navigates directly to a "locked" URL, a log-in form pops up, and after logging in, the user is redirected to the intended page.
I gave mechanize a shot (http://wwwsearch.sourceforge.net/mechanize/) but it wasn't finding some form elements which were needed for login (hidden elements that have some values put in by a javascript function that runs when the user clicks the "log in" button).
I played a bit with the "web browser" control in .NET but quickly lost interest because I couldn't even get it to submit a query on the Google page.
I don't care what the language is; I'll learn it to solve this problem. At a minimum it has to work in Windows.
A simple example, say, typing in a query into the Google search box would be a great bonus.
In my experience, the most reliable way is to use JavaScript. It works well in .NET. To test, browse to the following addresses one after another in Firefox or Internet Explorer:
http://www.google.com
javascript:function f(){document.forms[0]['q'].value='stackoverflow';}f();
javascript:document.forms[0].submit()
That performs a search for "stackoverflow" on Google. To do it in VB .Net using the webbrowser control, do this:
WebBrowser1.Navigate("http://www.google.com")
' Wait until the page has finished loading
Do While WebBrowser1.IsBusy OrElse WebBrowser1.ReadyState <> WebBrowserReadyState.Complete
    Threading.Thread.Sleep(1000)
    Application.DoEvents()
Loop
' Fill in the search box via javascript, then submit the form
WebBrowser1.Navigate("javascript:function%20f(){document.forms[0]['q'].value='stackoverflow';}f();")
Threading.Thread.Sleep(2000) 'wait for javascript to run
WebBrowser1.Navigate("javascript:document.forms[0].submit()")
Threading.Thread.Sleep(2000) 'wait for javascript to run
Notice how the space in the URL is converted to %20. I'm not certain if this is necessary, but it can't hurt. It is important that the first javascript be in a function. The calls to Sleep() are there to wait for Google to load and also for the javascript to run. The Do While loop might run forever if the page fails to load, so for automation purposes add a counter that will time out after, say, 60 seconds.
Of course, for Google you can just navigate directly to www.google.com?q=stackoverflow but if your site has hidden input fields, etc, then this is the way to go. Only works for HTML sites - flash is a whole other matter.
If I understand you right, you want to log in to only one webpage, and that form always stays the same. You could either reverse engineer the JavaScript, or debug it via a JavaScript debugger in the browser (e.g. Firebug for Firefox). Or you can fill in the form in your browser and look at the HTTP request via a network packet sniffer. Once you have all the required form data to submit, you can do the same from your program (that's what I did the last time I had a pretty similar task). Don't forget to store all the cookie data the web server sends back and send it with the next request, to 'stay logged in'.
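A minimal sketch of that flow in Ruby (the URLs and field names are placeholders; a real site will have its own fields, possibly including hidden ones you have to copy from the sniffed request):

require 'net/http'
require 'uri'

login_uri = URI('https://example.com/login')         # placeholder
page_uri  = URI('https://example.com/members/data')  # placeholder

Net::HTTP.start(login_uri.host, login_uri.port, use_ssl: true) do |http|
  # Submit the same fields the packet sniffer showed the browser sending
  login = Net::HTTP::Post.new(login_uri)
  login.set_form_data('username' => 'me', 'password' => 'secret')
  login_response = http.request(login)

  # Carry the session cookie so the next request is "logged in"
  # (keeps only the first Set-Cookie header -- enough for a single session cookie)
  cookie = login_response['Set-Cookie']
  page = Net::HTTP::Get.new(page_uri)
  page['Cookie'] = cookie
  File.open('saved_page.html', 'w') { |f| f.write(http.request(page).body) }
end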
It's already being discussed here.
Basically, the gist is that you can use Selenium, an open source web automation tool, which has API libraries available in various languages like Java, Ruby, etc.
Neoload can handle the form filling with authentication, assuming you don't want to collect data, just perform actions. It's a web stress tool, so it's not really meant to be used as a time-based service, but you COULD just leave it running.
I've used Ruby and Watir (a web app testing suite) for something similar, but it was a very small task (basically visiting URLs from a text file and downloading an image).
There's also an extension called iMacros that can do some automation, but I'm not personally familiar with it (just aware of it).
"I'm trying to log in to a website and save an HTML page automatically"
SAVEAS TYPE=HTM FOLDER=C: FILE=page.html
https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/?src=search
These commands, played in the iMacros addon, will save the page to the C: drive and name it page.html.
Also,
URL GOTO=www.website.com
Goes to the particular website you want to save. You can also use scripting in iMacros and set different websites in the macro.

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I am successfully able to read the data displayed on page 1 of the grid but unable to figure out how I can move to the next page in the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but instead are JavaScript hyperlinks which cause a postback to the same page.
An example of the hyperlink:
6
I recommend using Watir, a Ruby library designed for browser testing, if you're already using Ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags).
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post and supplying the form data, e.g. my_http.post(my_path, form_data) instead of my_http.get(my_path).
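For an ASP.NET page specifically, the postback usually means re-submitting the page's hidden state fields along with the event target and argument that __doPostBack would have set. A rough sketch with Net::HTTP; the URL, control name, and page argument are assumptions you'd read out of the actual page source:

require 'net/http'
require 'uri'

uri = URI('http://example.com/GridPage.aspx')  # placeholder URL

# First GET the page to capture the hidden ASP.NET state fields
first = Net::HTTP.get(uri)
viewstate  = first[/id="__VIEWSTATE" value="([^"]*)"/, 1]
validation = first[/id="__EVENTVALIDATION" value="([^"]*)"/, 1]

# Then POST them back with the target/argument the pager link would send
form = {
  '__EVENTTARGET'     => 'GridView1',  # assumed control name -- take it from the __doPostBack call
  '__EVENTARGUMENT'   => 'Page$2',     # assumed argument for "go to page 2"
  '__VIEWSTATE'       => viewstate,
  '__EVENTVALIDATION' => validation
}
response = Net::HTTP.post_form(uri, form)
puts response.body  # should contain page 2 of the grid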
Edit (in response to danieltalsky's answer):
Watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or jump through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed with a form POST back to the server. Like Markus said, use something like Firebug or the Developer Tools in IE 8 and Fiddler to watch the traffic. But honestly, this is a web form using the bloated GridView, and you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
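
For the parts of a site that don't depend on JavaScript, a Mechanize session might look roughly like this (the URL and CSS selector are placeholders):

require 'mechanize'

agent = Mechanize.new
page = agent.get('http://example.com/GridPage.aspx')  # placeholder URL

# Mechanize keeps cookies across requests and parses the HTML with Nokogiri
page.search('table.gridview tr').each do |row|        # assumed selector
  puts row.css('td').map(&:text).join(' | ')
end

# Ordinary forms can be filled in and submitted without hand-rolling the POST
form = page.forms.first
# form['someField'] = 'someValue'
# page = agent.submit(form)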
