how to crawling it make about jquery site when I use node - ajax

I should crawl https://motul.lubricantadvisor.com/Default.aspx?data=1&lang=ENG&lang=eng
but how can I do crawl the this website. I think it use jQuery. some people say you should use ajax. but I'll contain database by mongodb so I'll use node.js how can I do that?

Instead of using NodeJS (designed for other purposes), use PhantomJS, which is specially designed for testing/scraping of webpages. Since it uses JavaScript, it should be pretty easy to learn for you.
Another method (if you want to use Node) is to figure out how this webpage communicates to the underlying backend and connect directly to the backend using a library such as node-XMLHttpRequest.
Yet another option is to scrape data directly from the webpage using artoo.js, which injects directly into the rendered webpage and allows you to scrape the webpage using jQuery selectors.
Ethics note: However, as with all scraping, please be careful and only scrape websites for which you have explicit permission to. Not only may you be stealing their data, you may be wasting their bandwidth (and therefore their money), so please be considerate when using any sort of scraping tool.

Related

Web application design with/without ajax

Let's say I am creating a webapp for a library. My base url is http://mylibrary.com. I want to use "pretty" URLs as follows:
http://mylibrary.com/books (list all books)
http://mylibrary.com/books/book1 (details of a particular book)
At present my approach is to create a single page app and use history api to manage the URLs. i.e I load all CSS and JS files when the user visits the home page. From then I just get data from server using AJAX, in JSON format and then create the required HTML using Javascript.
But I have learnt that this is not so good from SEO point of view.If a crawler were to visit http://mylibrary.com/books it will not see booklist at all because AJAX calls would not take place.
My question is what is the other approach to design this kind of app ? Specifically:
Should the server create entire web page and send it to browser? I mean will the response from server include everything from <html> to </html> or only the required parts?
Do programming languages like PHP efficiently manage to send the HTML to clients ? I would rather have the webserver do it ..
It appears to me that in this scenario AJAX would have very little role to play other than may be change minor parts of the page. Is that a correct understanding ? ..and here I was thinking AJAX is the modern way of doing things
A library would have many books.
So the list would be long..
Using ajax allows you to fetch only the part of it the user is trying to read, without having to retrieve the entire list, or navigate by reloading.
so for low bandwidth, and impatient users, ajax is a godsend.
for crawlers that need the entire page to collect data from, not so much..
so really you want to provide different content depending on the visistor.
How to identify web-crawler?
IMHO: Provide the page from php, if the user agent is a robot, provide the list, otherwise provide the fancy ajax based site, that shows only what you want, when you want..

100% Ajax Websites

I have been looking at Hulu's new website and I am very impressed from a developer's standpoint (as well as a designer's).
I have found that, unless you switch between http/https, you are served content entirely from json requests. That is a HUGE feat to have this level of ajax while maintaining browse back button support as well as allowing each url to be visited directly.
I want to create a website like this as a learning experience. Is there any type of framework out there that can give me this kind of support?
I was thinking I could...
leverage jQuery
use clientside MVVM frameworks like KnockoutJS?
use ASP.NET MVC content negotiation to serve html or json determined by an accept header.
using the same codebase.
use the same template for client side and server side rendering
provide ways to update pagetitle/meta tags/etc.
Ajax forms/widgets/etc would still be used, by I am thinking about page level ajax using json and client side templates.
What do you think? Any frameworks out there? Any patterns I could follow?
It is always best to first build a website without AJAX support, then add AJAX on top of that. Doing this means that:
users without javascript can already access your website
users can already visit any URL directly.
Adding AJAX support can be accomplished by various javascript libraries. So that you can render json content, you will want to look at javascript templating. You will want to use javascript templating even on your server side for when you add AJAX support (file extension .ejs). This will probably require some appropriate libraries to run javascript on the server.
When you add AJAX support, you will want to use the "History.js" library for browser back/forward/history support.
Make no mistake. This is a HUGE project (unless your website only has a few pages). So it is going to take a LONG time to add all the AJAX support to the best possible standard.
to answer your bullet point about using the same template server side as well as client side: check outdust. It was originally developed by akdubya, but has since been adopted and enhanced by linkedin. They use it to render templates on their mobile app client side. Personally I've used it on the server side and it works great.

How do we do AJAX programming

I have no idea about AJAX programming features. I just know that it is Asynchronous Javascript and XML.
Please help me in knowing about this language.
I have gone through many AJAX tutorials. But none of the programs are running. Why I don't know.
Do we save the file with .HTML extension?
Read:
AJAX Tutorial by W3Schools.
AJAX Programming by Google Code University
To start coding you can get the Ajax Control Toolkit by Microsoft. You should read Ajax Control Toolkit Tutorials to get a grasp of it.
You can use the free Microsoft Visual Web Developer 2010 Express Edition as your IDE.
Aside from the correct responses that the others gave you, judging from your question I think you first need to learn about client-side and server-side code.
Do we save the file with .HTML extension?
Yes and no. You will have an HTML frontend, that for instance contains a button. This will be interpreted from the client's (=user) browser. In fact it may be rendered differently depending on the browser/OS/etc.
Now, you attach some Javascript code to this button. This also runs on the client's browser, and creates a XMLHttpRequest object, either directly or through the use of a library (JQuery & Co.). Note that a library is not necessary to do an AJAX request. It will make your life easier if you do a lot of AJAX calls, but it is not essential.
And here's where the magic happens: the XMLHttpRequest object will call asynchronously (i.e.: without reloading the page) a server-side page. This may be a PHP, ASP, Perl etc etc file that does something on the server, for instance queries a database. This part of the operation is absolutely independent from the client. The user can close the browser before the server-side code finishes to load and the server will not know about it.
Once the server-side code has finished executing it returns to the client with some response data (e.g. a piece of XML, JSON, HTML or whatever you like). Finally the client executes (or not) some other Javascript code in response to this, for example to write on the screen, again with no reloading of the page, something based on what the server has returned.
Maybe I can help you understand AJAX by clarifying the concepts a bit.
Please help me in knowing about this language.
AJAX is not a language, it is a way of using existing techniques to improve the user experience of a web site. The language is Javascript in the browser but you can use any server side technique that you feel comfortable with (ASP.NET, Java, PHP, Ruby etc.)
Do we save the file with .HTML extension?
Well, that is not really the point. What you have to grasp here is that there is a server and a browser that interact with each other. Yes, you can use static HTML files for your pages (and save them as .html files), but you'll need a server to respond to the requests of the browser. This may be why your sample code is not working; you need to set up a server that works with your pages.
The whole idea behind AJAX is to improve the user experience by not reloading the entire page when a user interacts with it. You request the data you need and update the page by using Javascript to update the HTML. This is called an out-of-band or asynchronous request.
I just know that it is Asynchronous Javascript and XML.
That is what the acronym stands for but it doesn't quite cover what the technique is for, nor is it accurate any more. In the beginning XML was used to transfer data from the server to the client. People found that XML is not really that easy to work with in Javascript so now it's more common to use JSON. JSON is a snippet of javascript that can be evaluated in the browser. The snippet creates javascript object(s) that represent the data.
If you use a Javascript library, like others have suggested here, you won't have to worry about many of the details though.
Before you get into AJAX you should make sure that you understand:
HTML and CSS
Javascript
how to modify HTML with Javascript
how a browser requests information from a server
how to handle requests on the server
If you are not comfortable with all of these concepts, stick with 'regular' web pages and try to improve your knowledge step by step.
Once you get the basic knowledge from W3school, I suggest you use a framework. Usually developers do not use XMLHttpRequest at all. Instead, javascript frameworks like ExtJS, jQuery and other frameworks make your work simple. I suggest you learn bit of javascript as well. check out jQuery.
Just to add that AJAX is rarely used in its pure form with XMLHttpRequest. You will often use it as a part of AJAX UI libraries which make your life easier. If you are from the Java world - such an AJAX library is Richfaces.
Instead of worrying about how to do AJAX, use something that allows you to forget about it. Frameworks like NOLOH do AJAX (and Comet) for you automatically without you having to do a thing. Just concentrate on your application, and business logic and it does the rest.
Really, everything is done via AJAX if available, automatically. No work on your part. If you're don't want to spend much time researching it, check out this short video that was demonstrated at Confoo PHP Conference this past March http://www.youtube.com/phpframework#p/u/11/cdD9hSuq7aw.
For all those worried about, well, if it's all AJAX, what about search engines? No need to worry, http://dev.noloh.com/#/articles/Search-Engine-Friendly/.
So instead of having to worry about all these different technologies, or the client-server relationship, you can sit down, code and have your website/WebApp working in no time.
You can read about NOLOH is this month's cover story of php|architect magazine, http://www.phparch.com/magazine/2010/may/.
Enjoy.
Disclaimer: I'm a co-founder of NOLOH.
It is easy one. Ajax getting data from server side by client side execution. We have to do use XMLHttpRequest to get the result.

Ajax architecture in Django application

I am trying to find the optimal architecture for an ajax-heavy Django application I'm currently building. I'd like to keep a consistent way of doing forms, validation, fetching data, JSON message format but find it exceedingly hard to find a solution that can be used consistently.
Can someone point me in the right direction or share their view on best practice?
I make everything as normal views which display normally in the browser. That includes all the replies to AJAX requests (sub pages).
When I want to make bits of the site more dynamic I then use jQuery to do the AJAX, or in this case AJAH and just load the contents of one of the divs in the sub page into the requesting page.
This technique works really well - it is very easy to debug the sub pages as they are just normal pages, and jQuery makes your life very easy using these as part of an AJA[XH]ed page.
For all the answers to this, I can't believe no one's mentioned django-piston yet. It's mainly promoted for use in building REST APIs, but it can output JSON (which jQuery, among others, can consume) and works just like views in that you can do anything with a request, making it a great option for implementing AJAX interactions (or AJAJ [JSON], AJAH, etc whatever). It also supports form validation.
I can't think of any standard way to insert ajax into a Django application, but you can have a look to this tutorial.
You will also find more details on django's page about Ajax
Two weeks ago I made a write up how I implement sub-templates to use them in "normal" and "ajax" request (for Django it is the same). Maybe it is helpful for you.
+1 to Nick for pages displaying normally in the browser. That seems to be the best starting point.
The problem with the simplest AJAX approaches, such as Nick and vikingosegundo propose, is that you'll have to rely on the innerHTML property in your Javascript. This is the only way to dump the new HTML sent in the JSON. Some would consider this a Bad Thing.
Unfortunately I'm not aware of a standard way to replicate the display of forms using Javascript that matches the Django rendering. My approach (that I'm still working on) is to subclass the Django Form class so it outputs bits of Javascript along with the HTML from as_p() etc. These then replicate the form my manipulating the DOM.
From experience I know that managing an application where you generate the HTML on the server side and just "insert" it into your pages, becomes a nightmare. It is also impossible to test using the Django test framework. If you're using Selenium or a similar tool, it's ok, but you need to wait for the ajax request to go return so you need tons of sleeps in your test script, which may slow down your test suite.
If the purpose of using the Ajax technique is to create a good user interface, I would recommend going all in, like the GMail interface, and doing everything in the browser with JavaScript. I have written several apps like this using nothing but jQuery, state machines for managing UI state and JSON with ReST on the backend. Django, IMHO, is a perfect match for the backend in this case. There are even third party software for generating a ReST-interface to your models, which I've never used myself, but as far as I know they are great at the simple things, but you of course still need to do your own business logic.
With this approach, you do run into the problem of duplicating code in the JS and in your backend, such as form handling, validation, etc. I have been thinking about solving this with generating structured information about the forms and validation logic which I can use in JS. This could be compiled at deploy-time and be loaded as any other JS file.
Also, avoid XML. The browsers are slow at parsing it, it is a pain to generate and a pain to work with in the browser. Use JSON.
Im currently testing:
jQuery & backbone.js client-side
django-piston (intermediate layer)
Will write later my findings on my blog http://blog.sserrano.com

Reverse Engineer A Web Form

I have a web site which I download 2-3 MB of raw data from that then feeds into an ETL process to load it into my data mart. Unfortunately the data provider is the US Dept. of Ag (USDA) and they do not allow downloading via FTP. They require that I use a web form to select the elements I want, click through 2-3 screens and eventually click to download the file. I'd like to automate this download process. I am not a web developer but somehow it seems that I should be able to use some tool to tell me exactly what put/get/magic goes from the final request to the server. If I had a tool that said, "pass these parameters to this url and wait for a response" I could then hack something together in Perl to automate this process.
I realize that if I deconstructed all 5 of their pages and read through the JavaScript includes and tapped my heals together 3 times I could get this info from what I have access to. But I want a faster and more direct path that does not require me to manually parse all their JS.
Restatement of the final question: Is there a tool or method that will show clearly what the final request request sent from a web form was and how it was structured?
A tamperer's best friends (these are firefox extensions, you could also use something like Wireshark)
HTTPFox
Tamper Data
Best of luck
Use Fiddler2 as a proxy to see what is being passed back and forth. I've done this with success in other similar circumstances
Home page is here: http://www.fiddler2.com/fiddler2/
As with the other responses, except my tool of choice is Charles
What about using a web testing toolkit, like Watir and Ruby ?
Easy to fill in the forms.. just use the output..
Use WatiN and combine it with WatiN TestRecorder (Google for it)
It can "simulate" a user sitting in front of the browser punching in values which you can supply from your own C# code...

Resources