Node request, cheerio - how to handle additional ajax load

I'm using node, request and cheerio to fetch data from an HTML page. This hasn't been a problem so far, but one page loads additional data through ajax to fill different containers. These are empty (and come back undefined) when the initial request is done. What is the best way to handle this?
request(url, function (error, response, html) {
  if (!error && response.statusCode == 200) {
    var $ = cheerio.load(html);
    // Empty at this point, because the content is loaded separately via ajax:
    forum_url = $('.this.url.is.loaded.separatly.with.ajax').eq(1).attr('href');
  }
});

Cheerio isn't really designed with ajax in mind. If you are able to extract the URLs that need to be downloaded, you would likely have to maintain multiple separate $ objects, since it's unlikely they can be merged easily.
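A minimal sketch of that approach, assuming you can discover the endpoint the page calls to fill its containers (the ajax URL below is hypothetical; find the real one in your browser's network tab):
var request = require('request');
var cheerio = require('cheerio');

request(url, function (error, response, html) {
  if (!error && response.statusCode == 200) {
    // One $ object for the static page...
    var $page = cheerio.load(html);
    // ...and a second request for the fragment the page loads via ajax.
    request('http://example.com/ajax/forum-list', function (err2, res2, fragment) {
      if (!err2 && res2.statusCode == 200) {
        // A separate $ object just for the ajax fragment.
        var $fragment = cheerio.load(fragment);
        var forum_url = $fragment('a').first().attr('href');
      }
    });
  }
});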
Usually, in cases where you need to execute JavaScript found on a scraped page, we would turn to Phantom.js. Phantom is a headless browser that you control using JavaScript; it's pretty cool.
You can check out some Phantom.js web scraping code here: http://code4node.com/snippet/web-scraping-with-node-and-phantomjs
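A minimal sketch of the Phantom approach, assuming phantomjs is installed and the script is run as phantomjs script.js (the fixed timeout is a crude stand-in for a real readiness check):
var page = require('webpage').create();
page.open('http://example.com/forum', function (status) {
  // Give the page's ajax calls a moment to finish before grabbing the DOM.
  window.setTimeout(function () {
    var html = page.evaluate(function () {
      return document.documentElement.outerHTML;
    });
    console.log(html);
    phantom.exit();
  }, 2000);
});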


Get HTML from DOM after JavaScript has executed

I may be using incorrect terminology, so please correct me if I'm wrong.
Here's what I want to do: I'm trying to scrape a website's comments section, but the comments are loaded via an ajax call after the page has fully loaded. When I try to scrape the HTML from the site via:
res, err := http.Get(url)
if err != nil {
    // handle error
}
defer res.Body.Close()
But it obviously gets the HTML before the ajax call. How do I go about getting the HTML after the ajax call?
This is completely off the top of my head, but would I basically need to create a JS renderer in code for this? My guess is that the JS needs to execute somehow. Any suggestions / libraries / examples on how to go about this? I'd prefer this to be in Go, but realistically it could be in any language.
You can use a headless browser like http://phantomjs.org/ to load the page, execute all the JavaScript on it, and scrape the comments.
This example can help: https://github.com/ariya/phantomjs/blob/master/examples/phantomwebintro.js
But PhantomJS is a separate binary application, and installing it may not be trivial.
Alternatively, you can inspect the page using Firebug, see the requests being sent to fetch the comments, and emulate those calls in Go.
For example, maybe the page loads comments via JavaScript code like this:
$.get( "/ajax/comments", function( data ) {
$( ".comments" ).html( data );
});
so you can fetch and parse the /ajax/comments response directly in Go.
Recently I had the same issue, and GoQuery helped a lot.
I tried the first site I found on the net where comments are loaded by a JS event, and wrote you a small snippet. You may try it and check it out.
doc, _ := goquery.NewDocument("http://www.ihg.com/holidayinn/hotels/us/en/san-francisco/sfocc/hoteldetail/hotel-reviews?scmisc=hotel_details_reviews_link_bottom")
html_contents, _ := doc.Html()
fmt.Println(html_contents)
This will show all the comments below the main content of the page, which are loaded by a JS event.
Good Luck!
If you own the site or can easily determine (or generate) the URI of the call that loads the comments, it's probably easier to make that same AJAX call yourself rather than bother with DOM parsing or arbitrary JS execution.
At that point Go would actually be a good language to use, since its JSON and XML standard libraries are excellent for unmarshalling that kind of data.

Does Ajax always require the use of node.js?

I'm learning about the use of AJAX in web development, and I need to know: does AJAX always require the use of Node.js, or jQuery?
Thanks.
That is a very broad question, so the answer might be broad as well:
The short answer: Ajax does not require jQuery nor Node.js.
In practice, Ajax is a technique for asynchronous operations, used by JavaScript to send data to and retrieve data from a server asynchronously. Ajax is fully available in plain, vanilla JavaScript, and it works as follows (example taken from Wikipedia, see the sources below):
// This is the client-side script.
// Initialize the Http request.
var xhr = new XMLHttpRequest();
xhr.open('get', 'send-ajax-data.php');

// Track the state changes of the request.
xhr.onreadystatechange = function() {
  var DONE = 4; // readyState 4 means the request is done.
  var OK = 200; // status 200 is a successful return.
  if (xhr.readyState === DONE) {
    if (xhr.status === OK) {
      alert(xhr.responseText); // 'This is the returned text.'
    } else {
      alert('Error: ' + xhr.status); // An error occurred during the request.
    }
  }
};

// Send the request to send-ajax-data.php
xhr.send(null);
This is a classic example, showing both how to use Ajax with vanilla JavaScript and why it's much easier with other means such as jQuery, which shortens the same snippet to just:
$.ajax({
  url: "http://fiddle.jshell.net/favicon.png",
}).done(function(data) {
  // Do something with data.
});
Sources (including vanilla Ajax examples):
Wikipedia: Ajax
A Guide to Vanilla Ajax Without jQuery
jQuery: ajax()
There is no need to use Node.js to perform an Ajax request. You can make an Ajax request even using vanilla JavaScript. However, jQuery makes Ajax requests very easy and cross-browser compatible with just a few lines of code, so I recommend you stick with jQuery instead of using vanilla JavaScript.
You can find more information regarding the jQuery Ajax feature here: http://api.jquery.com/jquery.ajax/
You can also find more information about the vanilla Javascript Ajax request feature here:
http://www.w3schools.com/ajax/
No. Most browsers supply a means to perform asynchronous JavaScript requests, but libraries such as jQuery partly came about to smooth over the differences between browsers, making Ajax a lot more portable.
Modern browsers don't differ as greatly, so portability is probably less of an issue, but using libraries has become common practice.
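For instance, newer browsers also ship the native fetch() API, which makes the same kind of request without any library (this sketch reuses the send-ajax-data.php endpoint from the example above):
// Request the same endpoint as the XMLHttpRequest example above.
fetch('send-ajax-data.php')
  .then(function(response) {
    // Non-2xx statuses are not network errors, so check them explicitly.
    if (!response.ok) {
      throw new Error('Error: ' + response.status);
    }
    return response.text();
  })
  .then(function(text) {
    console.log(text); // 'This is the returned text.'
  })
  .catch(function(err) {
    console.error(err);
  });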

How to use NodeJS with node-rest-client methods to post dynamic data to front end HTML

I am rather new to NodeJS so hopefully I am able to articulate my question(s) properly. My goal is to create a NodeJS application that will use the node-rest-client to GET data and asynchronously display it in HTML on client side.
I have several node-rest-client methods created and currently I am calling my GET data operation when a user navigates to the /getdata page. The response is successfully logged to the console but I'm stumbling on the best method to dynamically populate this data in an HTML table on the /getdata page itself. I'd like to follow Node best practices, ensure durability under high user load and ultimately make sure I'm not coding a piece of junk.
How can I bind data returned from my Express routes to the HTML front end?
Should I use separate "router.get" routes for each node-rest-method?
How can I bind a GET request to a button and have it GET new data when clicked?
Should I consider using socket.io, angularjs and ajax to pipe data from the server side to client side?
Thank you for reading.
This is an example of the route that is currently rendering the getdata page as well as calling my getDomains node-rest-client method. The page renders correctly, and the data returned by getDomains is successfully printed to the console; however, I'm having trouble getting the data piped to the /getdata page.
router.get('/getdata', function(req, res) {
  res.render('getdata', {title: 'This is the get data page'});
  console.log("Rendering:: Starting post requirement");
  var args = {
    headers: {"Cookie": req.session.qcsession, "Accept": "application/xml"},
  };
  qcclient.methods.getDomains(args, function(data, response) {
    var theProjectsSTRING = JSON.stringify(data);
    var theProjectsJSON = JSON.parse(theProjectsSTRING);
    console.log('Processing JSON.Stringify on DATA');
    console.log(theProjectsSTRING);
    console.log('Processing JSON.Parse on theProjectsSTRING');
    console.log('');
    console.log('Parsing the array ' + theProjectsJSON.Domains.Domain[0].$.Name);
  });
});
I've started to experiment with creating several routes for my different node-rest-client methods that use res.send to return the data; then perhaps I could bind an AJAX call, or use AngularJS, to parse the data and display it to the user.
router.get('/domaindata', function(req, res) {
  var theProjectsSTRING;
  var theProjectsJSON;
  var args = {
    headers: {"Cookie": req.session.qcsession, "Accept": "application/xml"},
  };
  qcclient.methods.getDomains(args, function(data, response) {
    //console.log(data);
    theProjectsSTRING = JSON.stringify(data);
    theProjectsJSON = JSON.parse(theProjectsSTRING);
    console.log('Processing JSON.Stringify on DATA');
    console.log(theProjectsSTRING);
    console.log('Processing JSON.Parse on theProjectsSTRING');
    console.log('');
    console.log('Parsing the array ' + theProjectsJSON.Domains.Domain[0].$.Name);
    res.send(theProjectsSTRING);
  });
});
I looked into your code. You are using res.render(..) and res.send(..). First of all, you should understand the basic request-response cycle. The request object gives us the values passed from the routes, and the response returns values after doing some kind of processing on them. More particularly, in Express you will be using req.params, and req.body if values are passed through the body of the HTML.
So all response-related statements (res.send(..), res.json(..), res.jsonp(..), res.render(..)) should come at the end of your function(req, res) {...}, once no other processing is left to be done; otherwise you will get errors.
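For example, here is a minimal rework of the /domaindata route from your question (keeping your qcclient and args setup) that defers the response until the REST call has returned:
router.get('/domaindata', function(req, res) {
  var args = {
    headers: {"Cookie": req.session.qcsession, "Accept": "application/xml"},
  };
  qcclient.methods.getDomains(args, function(data, response) {
    // Respond only after the asynchronous REST call has completed.
    res.json(data);
  });
});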
As per modern web application development practice in JavaScript, frameworks such as Ruby on Rails, ExpressJS, Django, Play, etc. all work as REST engines, and the front-end routing logic is written in JavaScript. If you are using AngularJS, then ngRoute and the open-source ui-router make the work really easy. If you look closely at some of the popular MEAN seed projects such as mean.io and mean.js, even they use ExpressJS as the REST engine while AngularJS does the heavyweight job on the front end.
Very often you will be sending JSON data from the backend, and for that you can use res.json(..). To consume the data from your endpoints you can use the AngularJS ngResource service.
Let's take the simplest case: you have a GET /domaindata endpoint:
router.get('/domaindata', function(req, res) {
  ..
  ..
  res.json({somekey: 'somevalue'});
});
On the front end you can access this using the AngularJS ngResource service:
var MyResource = $resource('/domaindata');
MyResource.query(function(results) {
  $scope.myValue = results;
  // The myValue variable is now bound to the view.
});
I would suggest you have a look at ui-router for front-end routing.
If you are looking for a sample implementation, you can look at this project, which I wrote some time back; it can also give you an overview of implementing login and session management using JSON Web Tokens.
There are a lot of things to understand; let me know if you need help with anything.

How to get an HTTPRequest JSON response without using any kind of template?

I am new to Django, but I am an advanced programmer in other frameworks.
What I intend to do:
Press a form button, triggering JavaScript that fires an Ajax request, which is processed by a Django view (which creates a file) that returns plain, simple JSON data (the name of the file); that name is then appended as a link to a DOM element named 'downloads'.
What I achieved so far instead:
Press the button, triggering JS that fires an Ajax request, which is processed by a Django view (which creates a file) that returns the whole page, appended as a duplicate to the DOM element named 'downloads' (instead of simple JSON data).
Here is the extracted code from the corresponding Django view:
context = {
    'filename': filename
}
data['filename'] = render_to_string(current_app + '/json_download_link.html', context)
return HttpResponse(json.dumps(data), content_type="application/json")
I tried several variants (like https://stackoverflow.com/a/2428119/850547), with and without a RequestContext object, and different rendering strategies. I am out of ideas now.
It seems to me that there is NO way to answer an Ajax request without using a template in the response :-/ (but I hope I am wrong).
But even then: why is Django returning the main template (with the full DOM), which I have NOT passed to the context?
I just want JSON data, nothing more!
I hope my problem is understandable; if you need more information, let me know and I will add it.
EDIT:
To answer the upcoming questions: json_download_link.html looks like this:
Download
But I don't even want to use that!
The corresponding jQuery:
$.post(url, form_data)
  .done(function(result) {
    $('#downloads').append(' Download CSV')
  })
I don't understand your question. Of course you can make an Ajax request without using a template. If you don't want to use a template, don't use a template. If you just want to return JSON, then do that.
Without having any details of what's going wrong, I would imagine that your Ajax request is not hitting the view you think it is, but is going to the original full-page view. Try adding some logging in the view to see what's going on.
There is no need to return the full template. You can return parts of a template and render/append them on the front end.
A template can be as small as you want. For example, this is a template:
name.html
<p>My name is {{name}}</p>
You can return only this template with json.dumps() and append it on the front end.
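On the front end you can then append the returned fragment directly. A sketch using the jQuery setup from the question, assuming the view puts the rendered fragment under the 'filename' key as shown above:
$.post(url, form_data)
  .done(function(result) {
    // result.filename holds the rendered fragment returned by the view.
    $('#downloads').append(result.filename);
  });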
What is your json_download_link.html?
Assuming 'example.csv' is a string:
data = {}
data['filename'] = u'example.csv'
return HttpResponse(simplejson.dumps(data), content_type="application/json")
Is this what you are looking for?

How can I prevent IE Caching from causing duplicate Ajax requests?

We are using the Dynamic Script Tag with JsonP mechanism to achieve cross-domain Ajax calls. The front end widget is very simple. It just calls a search web service, passing search criteria supplied by the user and receiving and dynamically rendering the results.
Note: for those that aren't familiar with the Dynamic Script Tag with JsonP method of performing Ajax-like requests to a service that returns Json formatted data, I can explain how to utilise it if you think it could be relevant to the problem.
The service is WCF hosted on IIS. It is Restful so the first thing we do when the user clicks search is to generate a Url containing the criteria. It looks like this...
https://.../service.svc?criteria=john+smith
We then use a dynamically created Html Script Tag with the source attribute set to the above Url to make the request to our service. The result is returned and we process it to show the results.
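As a sketch of how that mechanism looks in code (the callback parameter name and handler here are illustrative, not the actual service contract):
// Build a script tag whose src is the JSONP-enabled service URL.
function search(criteria) {
  var script = document.createElement('script');
  script.src = 'https://example.com/service.svc?criteria=' +
      encodeURIComponent(criteria) + '&callback=handleResults';
  document.getElementsByTagName('head')[0].appendChild(script);
}

// The service wraps its JSON payload in a call to this global function.
function handleResults(results) {
  // ... render the search results ...
}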
This all works fine, but we noticed that when using IE the service receives the request from the client twice. I used Fiddler to monitor the traffic leaving the browser and, sure enough, I saw two requests with the following URLs...
Request 1: https://.../service.svc?criteria=john+smith
Request 2: https://.../service.svc?criteria=john+smith&_=123456789
The second request has some kind of ID appended. This ID is different for every request.
My immediate thought was that it had something to do with caching. Adding a random number to the end of the URL is one of the classic approaches to disabling browser caching. To prove this I adjusted the cache settings in IE.
I set “Check for newer versions of stored pages” to “Never” – This resulted in only one request being made every time. The one with the random number on the end.
I set this setting value back to the default of “Automatic” and the requests immediately began to be sent twice again.
Interestingly, I don't receive both responses on the client. I found this reference where someone suggests this could be a bug with IE. The fact that this doesn't happen for me on Firefox supports this theory.
Can anyone confirm if this is a bug with IE? It could be by design.
Does anyone know of a way I can stop it happening?
Some of the more vague searches that my users will run take up enough processing resource to make doubling up anything a very bad idea. I really want to avoid this if at all possible :-)
I just wrote an article on how to avoid caching of ajax requests :-)
It basically involves adding no-cache headers to any Ajax request that comes in:
public abstract class MyWebApplication : HttpApplication
{
    protected MyWebApplication()
    {
        this.BeginRequest += new EventHandler(MyWebApplication_BeginRequest);
    }

    void MyWebApplication_BeginRequest(object sender, EventArgs e)
    {
        string requestedWith = this.Request.Headers["x-requested-with"];
        if (!string.IsNullOrEmpty(requestedWith) && requestedWith.Equals("XMLHttpRequest", StringComparison.InvariantCultureIgnoreCase))
        {
            this.Response.Expires = 0;
            this.Response.ExpiresAbsolute = DateTime.Now.AddDays(-1);
            this.Response.AddHeader("pragma", "no-cache");
            this.Response.AddHeader("cache-control", "private");
            this.Response.CacheControl = "no-cache";
        }
    }
}
I eventually established the reason for the duplicate requests. As I said, the mechanism I chose for making Ajax calls was Dynamic Script Tags. I built the request URL, created a new script element and assigned the URL to the src property...
var script = document.createElement("script");
script.src = "https://...";
Then I executed the script by appending it to the document head. Crucially, I was using the jQuery append function...
$("head").append(script);
Inside the append function, jQuery was anticipating that I was trying to make an Ajax call. If the type of element being appended is a script, it executes a special routine that makes an Ajax request using the XMLHttpRequest object. But the script was still being appended to the document head, and being executed there by the browser too. Hence the double request.
The first came directly from the script, the one I intended to happen.
The second came from inside the jQuery append function. This was the request suffixed with the randomly generated query string argument in the form "&_=123456789".
I simplified things by preventing the jQuery library side effect. I used the native append function...
document.getElementsByTagName("head")[0].appendChild(script);
One request now happens in the way I intended. I had no idea that the jQuery append function had such a significant side effect built in.
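As an aside, if you do want jQuery to manage the script injection, its ajax transport can at least be told not to append the cache-busting "_=" parameter. A sketch, which addresses only the suffix, not the duplicate request described above:
$.ajax({
  url: 'https://example.com/service.svc?criteria=john+smith',
  dataType: 'script', // jQuery loads this via a script tag
  cache: true         // do not append the _=timestamp parameter
});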
See www.enhanceie.com/redir/?id=httpperf for further discussion.
