Why use Puppeteer for JS webscraping instead of just Ajax? - ajax

I have been using regular Ajax requests to pull HTML from a desired location using my Node server. However, my peer has recently been using Puppeteer (a headless browser) to achieve roughly the same thing. Why use something like this?

If you only need the raw HTML source, you usually do not need a headless browser. But if you need the fully rendered document, with all scripts executed and all dynamically generated data in place, you do need a headless browser.
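A minimal sketch of the difference, assuming Node 18+ (for the global fetch) and a placeholder URL:

const puppeteer = require('puppeteer');

// 1) Plain request: returns only the raw HTML, before any scripts run.
async function fetchRawHtml(url) {
  const res = await fetch(url); // global fetch, Node 18+
  return res.text();
}

// 2) Headless browser: returns the DOM after scripts have executed.
async function fetchRenderedHtml(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' }); // wait for XHR/fetch traffic to settle
  const html = await page.content();                   // serialized, post-JS DOM
  await browser.close();
  return html;
}

For a static page the two results are essentially identical; for a JS-driven page, only the second contains the generated content.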

Related

How to crawl a jQuery site using Node

I need to crawl https://motul.lubricantadvisor.com/Default.aspx?data=1&lang=ENG&lang=eng
but how can I crawl this website? I think it uses jQuery. Some people say I should use Ajax, but I'm storing the data in MongoDB, so I'll use Node.js. How can I do that?
Instead of using Node.js (which is designed for other purposes), use PhantomJS, which is designed specifically for testing and scraping web pages. Since it uses JavaScript, it should be pretty easy for you to learn.
Another method (if you want to use Node) is to figure out how this webpage communicates with its underlying backend and connect directly to the backend using a library such as node-XMLHttpRequest.
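A minimal sketch of that approach, using Node's built-in https module and the official MongoDB driver instead of node-XMLHttpRequest; the endpoint URL and collection names are placeholders you would replace after inspecting the site's network traffic:

// Hit the site's underlying data endpoint directly and store the results in MongoDB.
// 'https://example.com/api/products' and the 'products' collection are placeholders.
const https = require('https');
const { MongoClient } = require('mongodb');

function fetchJson(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let body = '';
      res.on('data', (chunk) => (body += chunk));
      res.on('end', () => resolve(JSON.parse(body)));
    }).on('error', reject);
  });
}

async function crawl() {
  const records = await fetchJson('https://example.com/api/products'); // placeholder endpoint
  const client = await MongoClient.connect('mongodb://localhost:27017');
  await client.db('scraper').collection('products').insertMany(records);
  await client.close();
}

crawl().catch(console.error);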
Yet another option is to scrape data with artoo.js, which injects itself directly into the rendered page and lets you scrape it using jQuery selectors.
Ethics note: as with all scraping, please be careful and only scrape websites you have explicit permission to scrape. Not only may you be stealing their data, you may also be wasting their bandwidth (and therefore their money), so please be considerate when using any sort of scraping tool.

Store JS and CSS on the client side without any framework

I am making a website in CodeIgniter and not using any client-side framework like AngularJS. However, I need some features of AngularJS, like downloading the JS and CSS once on the client rather than downloading them for each page. As my website content depends heavily on the server, should I use AngularJS? I read that it makes the application slower.
Your question is not about Angular at all!
I recommend reading about build tools like RequireJS, Grunt, and Yeoman.
What you want to do is Ajaxify your website; as Stever said, it's not about Angular at all.
You can use RequireJS to load a script only when a page needs it.
For the best performance, use Grunt to run tasks like minifying and compressing your stylesheets, and so on.
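A minimal RequireJS sketch of the lazy-loading idea above; the module names and paths are placeholders:

// main.js: referenced once via <script data-main="js/main" src="js/require.js"></script>,
// so shared libraries are downloaded a single time and cached by the browser.
requirejs.config({
  baseUrl: 'js/lib',
  paths: {
    app: '../app',
    jquery: 'jquery.min' // assumes js/lib/jquery.min.js exists
  }
});

// Load the (hypothetical) profile-page module only on pages that actually need it.
requirejs(['jquery', 'app/profilePage'], function ($, profilePage) {
  profilePage.init($('#profile'));
});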

How to load a page in the background?

I have a project where we're using an iframe. However, the project specs have changed and we can no longer use the iframe. Instead we need to request the HTML page in the background and display it on the page once it's loaded.
Any ideas on how to do this via Ruby (Rails)? I thought it best to ask for general direction before diving in.
Thanks!
Load it with Ajax and append it to the body.
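A minimal jQuery sketch of that suggestion; the URL is a placeholder:

// Request the page in the background and append its markup to the body when it arrives.
$.get('/pages/widget.html', function (html) {
  $('body').append(html);
});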
It depends on where you want the work to occur: on the back-end in your Rails code, or in the user's browser via JavaScript, as @stunaz suggests.
Keeping it in the browser and loading via JavaScript will expose the HTML page's location to the user, which might not be desirable. Loading it from the back-end and including it in the HTML emitted to the browser will hide the source entirely.
If you want to do it on the back-end, the simplest thing is to load the file from the local drive using File.read, if it is local. If it's on a different machine, you can use OpenURI to pull it in. Either way, you'd then insert it into the HTML in the right spot. How you do that depends on what you are using to generate the outgoing HTML.

?_escaped_fragment_= - headless browser

What do I have to do to add ?_escaped_fragment_= support to my server? I want Google to be able to crawl my Ajax site. My hashes are already in #! form.
But I have no idea how to tell my server that when I enter mywebsite.com/?_escaped_fragment_=section in my browser, it should serve the same content as mywebsite.com/#!section.
Thanks!
Simple answer: my method (soon to be used for a site with ca. 50,000 AJAX-generated URLs) is to have a Node.js server use a headless environment (try Zombie, PhantomJS, or any other) to load the site, making sure it can execute JavaScript and read the DOM. Then, at runtime, if it's Google requesting the fragment, fire a request to the Node.js server, which loads the site, executes the JavaScript, waits for the page to finish, and delivers back the HTML, which is output to the browser.
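A minimal sketch of that setup, using Puppeteer here as the headless environment (any of the ones mentioned works the same way); the port and the mapping back to mywebsite.com are placeholders:

// Snapshot server: load the requested URL in a headless browser, let its
// JavaScript run, and return the rendered HTML to the caller (e.g. Googlebot).
const http = require('http');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();

  http.createServer(async (req, res) => {
    // Map ?_escaped_fragment_=section back to the hash-bang URL.
    const fragment = new URL(req.url, 'http://localhost')
      .searchParams.get('_escaped_fragment_') || '';
    const target = `https://mywebsite.com/#!${fragment}`; // placeholder site

    const page = await browser.newPage();
    await page.goto(target, { waitUntil: 'networkidle0' }); // wait for AJAX to settle
    const html = await page.content();                      // post-JS DOM snapshot
    await page.close();

    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end(html);
  }).listen(8040); // placeholder port
})();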
If that sounds like a lot of work: I'm about 90% finished on code that does it all for you, where you simply drop one line of (PHP) code at the top of your site/app and it handles the rest, using a remote Node.js server.
The code will be open source, so if you want to set it up yourself on a Node server, you can; or, if it's a PITA to set up yourself, I'll probably have a live server up and running that your app/website would fire ?_escaped_fragment_ requests to and get the HTML snapshot back. It also implements caching so that snapshots are only requested once every X days.
Watch this space: just got a few kinks to work out, and it'll be on my site (josscrowcroft.com), and I'll put it in a GitHub repo too.

Can any scripting language read AJAX/JavaScript? (Linux)

Is there any way I can scrape web pages that use AJAX?
By using something like Ruby + Mechanize on a Linux server that doesn't have a monitor attached (linode.com, for example)?
http://watir.com/ would be a solution, but I guess it's not applicable to Linode.
Check out TestPlan. It can do testing without a monitor by using the HTMLUnit backend. It handles quite a lot of JavaScript, including AJAX; I use it to scrape several pages and have built several AJAX tests with it.
You can also run TestPlan with a browser if you want. This gives you the best of both worlds: develop tests and visually see what is happening, and then switch to the display-less mode.

Resources