The data on the webpage is displayed dynamically. Checking for every change in the HTML and extracting the data seems like a daunting task, and it would force me to rely on very unreliable XPaths. I would therefore like to extract the data directly from the XHR packets.
I hope to be able to extract information from XHR packets as well as generate XHR packets to be sent to the server.
Extracting information is the more important part for me, because sending information can be handled easily by automatically triggering HTML elements with CasperJS.
I'm attaching a screenshot of what I mean.
The text in the response tab is the data I need to process afterwards. (This XHR response has been received from the server.)
This is not easily possible, because the resource.received event handler only provides metadata such as the URL, headers, or status, but not the actual data. The underlying PhantomJS event handler behaves the same way.
Stateless AJAX Request
If the AJAX call is stateless, you can repeat the request:
casper.on("resource.received", function(resource){
// somehow identify this request, here: if it contains ".json"
// it also only does something when the stage is "end", otherwise this would be executed twice
if (resource.url.indexOf(".json") != -1 && resource.stage == "end") {
var data = casper.evaluate(function(url){
// synchronous GET request
return __utils__.sendAJAX(url, "GET");
}, resource.url);
// do something with data, you might need to JSON.parse(data)
}
});
casper.start(url); // your script
You may want to add the event listener to resource.requested instead. That way you don't need to wait for the call to complete.
You can also do this right inside of the control flow like this (source: A: CasperJS waitForResource: how to get the resource i've waited for):
casper.start(url);
var res, resData;
casper.waitForResource(function check(resource){
res = resource;
return resource.url.indexOf(".json") != -1;
}, function then(){
resData = casper.evaluate(function(url){
// synchronous GET request
return __utils__.sendAJAX(url, "GET");
}, res.url);
// do something with the data here or in a later step
});
casper.run();
Stateful AJAX Request
If it is not stateless, you would need to replace the implementation of XMLHttpRequest. You will need to inject your own implementation of the onreadystatechange handler, collect the information in the page window object and later collect it in another evaluate call.
You may want to look at the XHR faker in sinon.js or use the following complete proxy for XMLHttpRequest (I modeled it after method 3 from How can I create a XMLHttpRequest wrapper/proxy?):
function replaceXHR(){
(function(window, debug){
function args(a){
var s = "";
for(var i = 0; i < a.length; i++) {
s += "\t\n[" + i + "] => " + a[i];
}
return s;
}
var _XMLHttpRequest = window.XMLHttpRequest;
window.XMLHttpRequest = function() {
this.xhr = new _XMLHttpRequest();
}
// proxy ALL methods/properties
var methods = [
"open",
"abort",
"setRequestHeader",
"send",
"addEventListener",
"removeEventListener",
"getResponseHeader",
"getAllResponseHeaders",
"dispatchEvent",
"overrideMimeType"
];
methods.forEach(function(method){
window.XMLHttpRequest.prototype[method] = function() {
if (debug) console.log("ARGUMENTS", method, args(arguments));
if (method == "open") {
this._url = arguments[1];
}
return this.xhr[method].apply(this.xhr, arguments);
}
});
// proxy change event handler
Object.defineProperty(window.XMLHttpRequest.prototype, "onreadystatechange", {
get: function(){
// this will probably never be called
return this.xhr.onreadystatechange;
},
set: function(onreadystatechange){
var that = this.xhr;
var realThis = this;
that.onreadystatechange = function(){
// request is fully loaded
if (that.readyState == 4) {
if (debug) console.log("RESPONSE RECEIVED:", typeof that.responseText == "string" ? that.responseText.length : "none");
// there is a response and filter execution based on url
if (that.responseText && realThis._url.indexOf("whatever") != -1) {
window.myAwesomeResponse = that.responseText;
}
}
onreadystatechange.call(that);
};
}
});
var otherscalars = [
"onabort",
"onerror",
"onload",
"onloadstart",
"onloadend",
"onprogress",
"readyState",
"responseText",
"responseType",
"responseXML",
"status",
"statusText",
"upload",
"withCredentials",
"DONE",
"UNSENT",
"HEADERS_RECEIVED",
"LOADING",
"OPENED"
];
otherscalars.forEach(function(scalar){
Object.defineProperty(window.XMLHttpRequest.prototype, scalar, {
get: function(){
return this.xhr[scalar];
},
set: function(obj){
this.xhr[scalar] = obj;
}
});
});
})(window, false);
}
If you want to capture the AJAX calls from the very beginning, you need to add this to one of the first event handlers:
casper.on("page.initialized", function(resource){
this.evaluate(replaceXHR);
});
or evaluate(replaceXHR) when you need it.
The control flow would look like this:
function replaceXHR(){ /* from above*/ }
casper.start(yourUrl, function(){
this.evaluate(replaceXHR);
});
function getAwesomeResponse(){
// use casper explicitly so the function also works when it is called directly
return casper.evaluate(function(){
return window.myAwesomeResponse;
});
}
// stops waiting if window.myAwesomeResponse is something that evaluates to true
casper.waitFor(getAwesomeResponse, function then(){
var data = JSON.parse(getAwesomeResponse());
// Do something with data
});
casper.run();
As described above, I create a proxy for XMLHttpRequest so that I can do something with it every time it is used on the page. The page that you scrape uses the xhr.onreadystatechange callback to receive data. The proxying is done by defining a setter function for that property, which writes the received data to window.myAwesomeResponse in the page context. The only thing left to do is to retrieve this text.
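The setter trick described above can be sketched in isolation. This is only a minimal illustration, not the replaceXHR code itself: FakeXHR, ProxyXHR, and captured are hypothetical names, and FakeXHR stands in for the browser's XMLHttpRequest so the snippet runs without a page.

```javascript
// Stand-in for the browser's XMLHttpRequest so the sketch runs anywhere.
// (Hypothetical: a real XHR would perform network I/O.)
function FakeXHR() {
  this.readyState = 0;
  this.responseText = "";
  this.onreadystatechange = null;
}
// Simulates the arrival of a complete response.
FakeXHR.prototype.finish = function (text) {
  this.readyState = 4;
  this.responseText = text;
  if (this.onreadystatechange) this.onreadystatechange();
};

var captured = []; // plays the role of window.myAwesomeResponse

// Proxy constructor: wraps a real instance, as replaceXHR() does.
function ProxyXHR() {
  this.xhr = new FakeXHR();
}
Object.defineProperty(ProxyXHR.prototype, "onreadystatechange", {
  set: function (pageHandler) {
    var inner = this.xhr;
    inner.onreadystatechange = function () {
      if (inner.readyState === 4) {
        captured.push(inner.responseText); // record before the page sees it
      }
      pageHandler.call(inner);
    };
  }
});

// "Page" code assigns its handler as usual and is unaware of the proxy.
var seenByPage = null;
var request = new ProxyXHR();
request.onreadystatechange = function () {
  if (this.readyState === 4) seenByPage = this.responseText;
};
request.xhr.finish('{"ok":true}');

console.log(captured[0], seenByPage);
```

The page handler still receives the response unchanged; the proxy only takes a copy on the way through, which is what makes the approach usable for stateful requests.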
JSONP Request
Writing a proxy for JSONP is even easier if you know the prefix (the function that is called with the loaded JSON, e.g. insert({"data":["Some", "JSON", "here"],"id":"asdasda"})). You can overwrite insert in the page context
after the page is loaded
casper.start(url).then(function(){
this.evaluate(function(){
var oldInsert = insert;
insert = function(json){
window.myAwesomeResponse = json;
oldInsert.apply(window, arguments);
};
});
}).waitFor(getAwesomeResponse, function then(){
// JSONP hands the callback an already-parsed object, so no JSON.parse is needed
var data = getAwesomeResponse();
// Do something with data
}).run();
or before the request is received (if the function is registered just before the request is invoked)
casper.on("resource.requested", function(resource){
// filter on the correct call
if (resource.url.indexOf(".jsonp") != -1) {
this.evaluate(function(){
var oldInsert = insert;
insert = function(json){
window.myAwesomeResponse = json;
oldInsert.apply(window, arguments);
};
});
}
});
casper.start(url).waitFor(getAwesomeResponse, function then(){
// JSONP hands the callback an already-parsed object, so no JSON.parse is needed
var data = getAwesomeResponse();
// Do something with data
}).run();
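The wrapping of the JSONP callback can be tried outside of CasperJS as well. The following is a sketch under assumptions: page is a hypothetical stand-in for the page's window object, and insert is the example callback name used above.

```javascript
// Stand-in for the page's global scope; in the browser this would be `window`.
var page = {
  rendered: null,
  // The callback named in the JSONP URL ("insert" is the example prefix).
  insert: function (json) { page.rendered = json.data.join(" "); }
};

var captured = null;

// Wrap the original callback: record the payload, then pass it through.
var oldInsert = page.insert;
page.insert = function (json) {
  captured = json; // already a parsed object, not a JSON string
  return oldInsert.apply(page, arguments);
};

// A JSONP response is a script that calls the callback with an object literal.
page.insert({ data: ["Some", "JSON", "here"], id: "asdasda" });

console.log(captured.id, page.rendered);
```

Because the wrapper forwards its arguments to the original function, the page keeps working exactly as before while the payload is also kept for later retrieval.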
I may be late to the party, but this answer may help someone who runs into this problem in the future.
I started with PhantomJS, then moved to CasperJS, and finally settled on SlimerJS. SlimerJS offers a PhantomJS-compatible API, works with CasperJS, and can give you the response body through the same onResourceReceived callback, in the response.body property.
Reference: https://docs.slimerjs.org/current/api/webpage.html#webpage-onresourcereceived
@Artjom's answer doesn't work for me in recent Chrome and CasperJS versions.
Based on @Artjom's answer and on gilly3's answer on how to replace XMLHttpRequest, I have composed a new solution that should work in most, if not all, versions of the different browsers. It works for me.
SlimerJS does not work with newer versions of Firefox, so it is no good for me.
Here is the generic code to add a listener for the load of an XHR (not dependent on CasperJS):
var addXHRListener = function (XHROnStateChange) {
var XHROnLoad = function () {
if (this.readyState == 4) {
XHROnStateChange(this)
}
}
var open_original = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method, url, async, unk1, unk2) {
this.requestUrl = url
open_original.apply(this, arguments);
};
var xhrSend = XMLHttpRequest.prototype.send;
XMLHttpRequest.prototype.send = function () {
var xhr = this;
if (xhr.addEventListener) {
xhr.removeEventListener("readystatechange", XHROnLoad);
xhr.addEventListener("readystatechange", XHROnLoad, false);
} else {
function readyStateChange() {
if (handler) {
if (handler.handleEvent) {
handler.handleEvent.apply(xhr, arguments);
} else {
handler.apply(xhr, arguments);
}
}
XHROnLoad.apply(xhr, arguments);
setReadyStateChange();
}
function setReadyStateChange() {
setTimeout(function () {
if (xhr.onreadystatechange != readyStateChange) {
handler = xhr.onreadystatechange;
xhr.onreadystatechange = readyStateChange;
}
}, 1);
}
var handler;
setReadyStateChange();
}
xhrSend.apply(xhr, arguments);
};
}
Here is CasperJS code to emit a custom event on load of XHR:
casper.on("page.initialized", function (resource) {
var emitXHRLoad = function (xhr) {
window.callPhantom({eventName: 'xhr.load', eventData: xhr})
}
this.evaluate(addXHRListener, emitXHRLoad);
});
casper.on('remote.callback', function (data) {
casper.emit(data.eventName, data.eventData)
});
Here is code to listen for the "xhr.load" event and get the XHR response body:
casper.on('xhr.load', function (xhr) {
console.log('xhr load', xhr.requestUrl)
console.log('xhr load', xhr.responseText)
});
Alternatively, you can download the content directly and manipulate it later.
Here is an example of the script I use to retrieve a JSON response and save it locally:
var casper = require('casper').create({
pageSettings: {
webSecurityEnabled: false
}
});
var url = 'https://twitter.com/users/username_available?username=whatever';
casper.start('about:blank', function() {
this.download(url, "hop.json");
});
casper.run(function() {
this.echo('Done.').exit();
});
I'm having a problem with Prototype Ajax and setTimeout. Here is my code, shortened:
//new ajax request
....onComplete: function (transport) { //json as this -> array[$i].something
var json = transport.responseJSON;
var $i = 0;
window.setTimeout(function () {
SLOW();
},
500); //display every json[$i] with custom delay
function SLOW() {
$i++;
if (json[$i].something !== null) { //insert in proper div id in the html document
window.setTimeout(function () {
$('document_div' + json[$i].something).innerHTML = json[$i].something_to_display;
},
500);
window.setTimeout(function () {
$('document_div' + json[$i].something).innerHTML = json[$i].something_to_display;
},
1000);...window.setTimeout(function () {
SLOW();
},
500);
} else {
//stop and continue
}
Getting this error: json[$i] is undefined.
EDIT: It looks like I'm getting this error on the second timeout; the first one changes the div correctly.
Done.
The solution was to copy the values from json into new variables before using them in setTimeout.
var json_something = json[$i].something; //and so on...
var json_something_to_display = json[$i].something_to_display
window.setTimeout(function() { $('document_div'+json_something).innerHTML = json_something_to_display; }, 500);
Can somebody explain why this is needed? Why is declaring json with var not enough, and why does it seem to disappear somewhere after the first window.setTimeout call?
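A likely explanation is that json itself never disappears. The timeout callbacks read json[$i] when they run, not when they are created, and by then the recursive SLOW() call may already have incremented $i past the end of the array. The sketch below illustrates this with a hypothetical later/runAll pair standing in for window.setTimeout so that it runs synchronously; the data is made up.

```javascript
// Tiny stand-in for window.setTimeout so the example runs synchronously.
// (Hypothetical helper: callbacks are queued and run later, in delay order.)
var queue = [];
function later(fn, delay) { queue.push({ fn: fn, delay: delay }); }
function runAll() {
  queue.sort(function (a, b) { return a.delay - b.delay; });
  queue.forEach(function (entry) { entry.fn(); });
  queue = [];
}

var json = [{ text: "first" }, { text: "second" }];
var shared = [];
var copied = [];

var i = 0;
// Reads json[i] when the callback runs, i.e. after i has changed.
later(function () { shared.push(json[i] ? json[i].text : "undefined"); }, 500);

// Copies the value now, so later changes to i do not matter.
var textNow = json[i].text;
later(function () { copied.push(textNow); }, 500);

i = 2; // the counter moves past the end of the array before the timers fire
runAll();

console.log(shared[0], copied[0]);
```

Copying the needed values into their own variables, as in the solution above, freezes them at the moment the timeout is scheduled.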
Hi, I can see this has been discussed, but after perusing the issues and answers I still can't get even this simple AJAX call to move past ready state 1.
Here's the Javascript I have:
<script language="javascript" type="text/javascript">
var request;
function createRequest()
{
try
{
request = new XMLHttpRequest();
} catch (trymicrosoft) {
try {
request = new ActiveXObject("Msxml2.XMLHTTP");
} catch (othermicrosoft) {
try {
request = new ActiveXObject("Microsoft.XMLHTTP");
} catch (failed) {
request = false;
}
}
}
if (!request)
alert("Error initializing XMLHttpRequest!");
}
function loadClassesBySchool()
{
//get require web form pieces for this call
createRequest(); // function to get xmlhttp object
var schoolId = getDDLSelectionValue("ddlSchools");
var grade = getDDLSelectionValue("ddlGrades");
var url = "courses.php?grades=" + escape(grade) + "&schoolId=" + escape(schoolId);
//open server connection
request.open("GET", url, true);
//Setup callback function for server response
//+++read on overflow that some fixed the issue with an onload event this simply had
//+++the handle spitback 2 readystate = 1 alerts
request.onload = updateCourses();
request.onreadystatechanged = updateCourses();
//send the result
request.send();
}
function updateCourses()
{
alert('ready state changed' + request.readyState);
}
function getDDLSelectionValue(ddlID)
{
return document.getElementById(ddlID).options[document.getElementById(ddlID).selectedIndex].value;
}
</script>
The PHP here is just a simple print, which loads fine if I navigate to it in the browser (IE/Chrome):
<?php
print "test";
?>
I'm quite new at this, but it seems I can't get even the most bare-bones AJAX call to work. Any help on how to get past this would be greatly appreciated.
All I get out of my callback function 'updateCourses' is a 1...
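For what it's worth, a likely cause is visible in the two handler assignments above: request.onload = updateCourses() calls the function immediately (while readyState is still 1) and assigns its return value, and the standard property is named onreadystatechange, not onreadystatechanged. The sketch below reproduces the effect with a hypothetical FakeRequest object instead of a real XMLHttpRequest, so it is an illustration rather than a verified diagnosis of the page.

```javascript
// Minimal stand-in for an XHR object (hypothetical; no network involved).
function FakeRequest() {
  this.readyState = 1; // OPENED, as after request.open(...)
  this.onreadystatechange = null;
}
// Simulates the response arriving and the state change being dispatched.
FakeRequest.prototype.complete = function () {
  this.readyState = 4;
  if (typeof this.onreadystatechange === "function") this.onreadystatechange();
};

var log = [];
function makeHandler(request, label) {
  return function updateCourses() {
    log.push(label + ":" + request.readyState);
  };
}

// Broken: the parentheses call the handler immediately (readyState is still 1)
// and assign its return value, undefined, to a misspelled property.
var broken = new FakeRequest();
broken.onreadystatechanged = makeHandler(broken, "broken")();
broken.complete();

// Fixed: assign the function itself to the correctly spelled property.
var fixed = new FakeRequest();
fixed.onreadystatechange = makeHandler(fixed, "fixed");
fixed.complete();

console.log(log.join(", "));
```

In the original code this would correspond to writing request.onreadystatechange = updateCourses; without parentheses.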
Well, after more digging I actually gave up and switched over to jQuery, which for all intents and purposes should be doing the exact same thing, except that jQuery works. I was just less comfortable with it, but so be it.
Here's the jQuery to accomplish the same:
function loadCoursesBySchool(){
var grades = getDDLSelectionValue("ddlGrades");
var schoolId = getDDLSelectionValue("ddlSchools");
jQuery.ajax({
url: "courses.php?grades=" + grades + "&schoolId=" + schoolId,
success: function (data) {
courseDisplay(data);
}
});
}
function courseDisplay(response)
{
//check if anything was sent back!?
if(!response)
{
$("#ddlCourses").html("");
//do nothing?
}
else
{
//empty DDL
$("#ddlCourses").html("");
//add entries
$(response).appendTo("#ddlCourses");
}
}
I am getting this one error when I use the Mozilla validator:
This is the JS file:
const STATE_START = Components.interfaces.nsIWebProgressListener.STATE_START;
const STATE_STOP = Components.interfaces.nsIWebProgressListener.STATE_STOP;
// Version changes:
// It used to get the lists from a PHP file, but that was putting too much of a strain on the servers
// now it uses xml files.
// Randomizes the servers to load balance
// Mozilla editor suggested no synchronous file gets, so changed it to asynchronous
// Added one more server to help with the updates (Ilovemafiaafire.net)
// Edited some redirect code that some idiots were spreading FUD about.
var xmlDoc = null;
var quickFilter_100_count_redirect_url='http://www.mafiaafire.com/help_us.php';
var countXmlUrl = 0;
//var xmlUrl = 'http://elxotica.com/xml-update/xml-list.php';
var xmlUrl = new Array(4);
xmlUrl[0] = 'http://mafiaafire.com/xml-update/mf_xml_list.xml';
xmlUrl[1] = 'http://ifucksexygirls.com/xml-update/mf_xml_list.xml';
xmlUrl[2] = 'http://ezee.se/xml-update/mf_xml_list.xml';
xmlUrl[3] = 'http://ilovemafiaafire.net/mf_xml_list.xml';
xmlUrl.sort(function() {return 0.5 - Math.random()})
var realXmlUrl = xmlUrl[countXmlUrl];
var notificationUrl = 'http://mafiaafire.com/xml-update/click_here_for_details.php';
var root_node = null;
var second_node = null;
var timervar = null;
var mafiaafireFilterUrl = '';
//Calling the interface for preferences
var prefManager = Components.classes["@mozilla.org/preferences-service;1"].getService(Components.interfaces.nsIPrefBranch);
var quickfilter_mafiaafire =
{
// get the domain name from the current url
get_domain_name:function()
{
var urlbar = window.content.location.href;
domain_name_parts = urlbar.match(/:\/\/(.[^/]+)/)[1].split('.');
if(domain_name_parts.length >= 3){
domain_name_parts[0] = '';
}
var dn = domain_name_parts.join('.');
if(dn.indexOf('.') == 0)
return dn.substr(1);
else
return dn;
},
// send ajax request to server for loading the xml
request_xml:function ()
{
//alert(countXmlUrl);
http_request = false;
http_request = new XMLHttpRequest();
if (http_request.overrideMimeType) {
http_request.overrideMimeType('text/xml');
}
if (!http_request)
{
return false;
}
http_request.onreadystatechange = this.response_xml;
http_request.open('GET', realXmlUrl, true);
http_request.send(null);
xmlDoc = http_request.responseXML;
},
// receive the ajax response
response_xml:function ()
{
if (http_request.readyState == 4)
{
if(http_request.status == 404 && countXmlUrl<=3)
{
countXmlUrl++;
//alert(xmlUrl[countXmlUrl]);
realXmlUrl = xmlUrl[countXmlUrl];
quickfilter_mafiaafire.request_xml();
}
if (http_request.status == 200)
{
xmlDoc = http_request.responseXML;
}
}
},
filterUrl:function()
{
var urlBar = window.content.location.href;
//check if url bar is blank or empty
if (urlBar == 'about:blank' || urlBar == '' || urlBar.indexOf('http')<0)
return false;
//1. get domain
processing_domain = this.get_domain_name();
//alert(processing_domain);
//Couldn't fetch the XML config, so returning gracefully
if(xmlDoc == null)
return false;
try
{
root_node = '';
// Parsing the xml
root_node = xmlDoc.getElementsByTagName('filter');
for(i=0;i<=root_node.length;i++)
{
second_node = '';
second_node = root_node[i];
if(second_node.getElementsByTagName('realdomain')[0].firstChild.nodeValue == processing_domain)
{
this.notificationBox();
mafiaafireFilterUrl = '';
mafiaafireFilterUrl = second_node.getElementsByTagName('filterdomain')[0].firstChild.nodeValue;
timervar = setTimeout("quickfilter_mafiaafire.redirectToAnotherUrl()",1500);
//window.content.location.href = second_node.getElementsByTagName('filterdomain')[0].firstChild.nodeValue;
//this.redirectToAnotherUrl(this.filterUrl);
//timervar = setInterval("quickfilter_mafiaafire.redirectToAnotherUrl(quickfilter_mafiaafire.filterUrl)",1000);
}
}
}
catch(e){
//alert(e.toString());
}
},
// This function is called for showing the notification
notificationBox:function()
{
try{
// Firefox default notification interface
var notificationBox = gBrowser.getNotificationBox();
notificationBox.removeAllNotifications(false);
notificationBox.appendNotification('You are being redirected', "", "chrome://quickfilter/content/filter.png", notificationBox.PRIORITY_INFO_HIGH, [{
accessKey: '',
label: ' click here for details',
callback: function() {
// Showing the notification Bar
window.content.location.href = notificationUrl;
}
}]);
}catch(e){}
},
redirectToAnotherUrl:function()
{
var qucikFilterRedirectCount = '';
//Read the value from preferrences
qucikFilterRedirectCount = prefManager.getCharPref("extensions.quickfilter_redirect_count");
//alert(qucikFilterRedirectCount);
if(qucikFilterRedirectCount % 15 == 0)
{
// Disable for now, can comment this entire section but this is the easier fix incase we decide to enable it later
//window.content.location.href = quickFilter_100_count_redirect_url+"?d="+mafiaafireFilterUrl;
window.content.location.href = mafiaafireFilterUrl;
}
else
{
window.content.location.href = mafiaafireFilterUrl;
}
qucikFilterRedirectCount = parseInt(qucikFilterRedirectCount)+1;
prefManager.setCharPref("extensions.quickfilter_redirect_count",qucikFilterRedirectCount);
}
}
var quickfilter_urlBarListener = {
QueryInterface: function(aIID)
{
if (aIID.equals(Components.interfaces.nsIWebProgressListener) ||
aIID.equals(Components.interfaces.nsISupportsWeakReference) ||
aIID.equals(Components.interfaces.nsISupports))
return this;
throw Components.results.NS_NOINTERFACE;
},
//Called when the location of the window being watched changes
onLocationChange: function(aProgress, aRequest, aURI)
{
// This fires when the location bar changes; that is load event is confirmed
// or when the user switches tabs. If you use myListener for more than one tab/window,
// use aProgress.DOMWindow to obtain the tab/window which triggered the change.
quickfilter_mafiaafire.filterUrl();
},
//Notification indicating the state has changed for one of the requests associated with aWebProgress.
onStateChange: function(aProgress, aRequest, aFlag, aStatus)
{
if(aFlag & STATE_START)
{
// This fires when the load event is initiated
}
if(aFlag & STATE_STOP)
{
// This fires when the load finishes
}
},
//Notification that the progress has changed for one of the requests associated with aWebProgress
onProgressChange: function() {},
//Notification that the status of a request has changed. The status message is intended to be displayed to the user.
onStatusChange: function() {},
//Notification called for security progress
onSecurityChange: function() {},
onLinkIconAvailable: function() {}
};
var quickfilter_extension = {
init: function()
{
//Initiating the progressListerner
gBrowser.addProgressListener(quickfilter_urlBarListener, Components.interfaces.nsIWebProgress.NOTIFY_STATE_DOCUMENT);
//Load the block list xml form server
quickfilter_mafiaafire.request_xml();
},
uninit: function()
{
// Remove the progressListerner
gBrowser.removeProgressListener(quickfilter_urlBarListener);
}
};
// window.addEventListener("load", function () { TheGreatTest1.onFirefoxLoad(); }, false);
// this function is Called on window Onload event
window.addEventListener("load", function(e) {
quickfilter_extension.init();
}, false);
window.addEventListener("unload", function(e) {
quickfilter_extension.uninit();
}, false);
Can you tell me how to squash that error please?
It looks like the offending line is setTimeout("quickfilter_mafiaafire.redirectToAnotherUrl()",1500);
The setTimeout function can take a string (which then essentially gets eval'd) or a function (which gets called). Using a string is not recommended, for all the same reasons that using eval is not recommended. See https://developer.mozilla.org/en/DOM/window.setTimeout
In this case, the simplest fix would be to change it to setTimeout(function() { quickfilter_mafiaafire.redirectToAnotherUrl(); },1500);
XmlHttpRequest works through callbacks. So how can I return a value? I tried to set a global variable, but that doesn't seem to be working.
var response = null; // contains the most recent XmlHttpRequest response
// loads the info for this username on the page
function loadUsernameInfo(username) {
getUserInfo(username);
var profile = response;
if (profile) {
// do stuff
}
else {
indicateInvalidUsername(username);
}
}
getUserInfo() can't return a result, because of the callback:
function getUserInfo(username) {
var request = createRequest();
request.onreadystatechange = userObjFromJSON;
var twitterURL = "http://twitter.com/users/show/" + escape(username) + ".json";
var url = "url.php?url=" + twitterURL;
request.open("GET", url, true);
request.send(null);
}
The callback:
function userObjFromJSON() {
if (this.readyState == 4) {
alert(this.responseText);
response = this.responseText;
}
}
How can I get the response back to loadUsernameInfo()?
You can do synchronous requests, though it is not recommended - the A is for Asynchronous... But the general idea to implement this correctly would be:
var response = null; // contains the most recent XmlHttpRequest response
// loads the info for this username on the page
function loadUsernameInfo(username) {
getUserInfo(username, onLoadUsernameComplete);
}
function getUserInfo(username, oncomplete) {
var request = createRequest();
request.__username = username;
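// the browser passes an event (not the request) to the handler, so hand the request over explicitly
request.onreadystatechange = function () { oncomplete(request); };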
var twitterURL = "http://twitter.com/users/show/" + escape(username) + ".json";
var url = "url.php?url=" + twitterURL;
request.open("GET", url, true);
request.send(null);
}
function onLoadUsernameComplete(req) {
if (req.readyState == 4) {
// only if "OK"
if (req.status == 200) {
var profile = req.responseXML;
if (profile) {
// do stuff
}
else {
indicateInvalidUsername(req.__username);
}
}
}
}
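The same idea, passing a callback that receives the result instead of returning it, can be shown as a self-contained sketch. Everything here is assumed for illustration: fakeFetch is a hypothetical stand-in for the XMLHttpRequest round trip, the URL and data are made up, and the callback(error, profile) signature is just one common convention.

```javascript
// Hypothetical transport standing in for XMLHttpRequest. It invokes the
// callback right away here; a real request would invoke it later.
function fakeFetch(url, done) {
  var known = { "/users/alice.json": '{"name":"alice"}' };
  if (Object.prototype.hasOwnProperty.call(known, url)) done(200, known[url]);
  else done(404, "");
}

// The result is delivered to callback(error, profile) instead of being returned.
function getUserInfo(username, callback) {
  var url = "/users/" + encodeURIComponent(username) + ".json";
  fakeFetch(url, function (status, body) {
    if (status !== 200) return callback(new Error("HTTP " + status), null);
    callback(null, JSON.parse(body));
  });
}

var results = [];
function loadUsernameInfo(username) {
  getUserInfo(username, function (error, profile) {
    // Everything that depends on the response lives inside the callback.
    results.push(error ? "invalid:" + username : "ok:" + profile.name);
  });
}

loadUsernameInfo("alice");
loadUsernameInfo("nobody");
console.log(results.join(", "));
```

The key design point is that loadUsernameInfo never tries to read a return value or a global; the code that needs the profile is moved into the function handed to getUserInfo.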