I may be saying this with incorrect terminology, so correct me if I'm wrong, please.
Here's what I want to do: I'm trying to scrape a website's comments section, but the comments are loaded via an AJAX call after the page has fully loaded. When I try to scrape the HTML from the site via:
res, err := http.Get(url)
if err != nil {
// handle error
}
defer res.Body.Close()
But this obviously gets the HTML before the AJAX call runs. How do I go about getting the HTML after the AJAX call?
This is completely off the top of my head, but would I basically need to create a JS renderer in code for this? My guess is that the JS needs to execute somehow. Any suggestions / libraries / examples on how to go about this? I'd prefer this to be in Go, but it could realistically be in any language.
You can use a headless browser like PhantomJS (http://phantomjs.org/) to get the page, execute all the JavaScript on it, and scrape the comments.
This example can help: https://github.com/ariya/phantomjs/blob/master/examples/phantomwebintro.js
But PhantomJS is a separate binary application, and installing it may not be trivial.
Also, you can inspect the page using Firebug, see the requests being sent to fetch the comments, and emulate that call in Go.
Maybe the page loads comments via JavaScript code like this:
$.get( "/ajax/comments", function( data ) {
$( ".comments" ).html( data );
});
so you can fetch and parse the /ajax/comments page using Go.
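For example, here is a minimal sketch of that approach in Go (the full URL is a made-up placeholder; substitute whatever endpoint Firebug actually shows for the site):

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Hypothetical endpoint; replace it with the URL you see
	// in Firebug's network panel when the comments load.
	res, err := http.Get("https://example.com/ajax/comments")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	body, err := ioutil.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	// This is the HTML fragment the jQuery call above would have
	// injected into .comments, ready for parsing.
	fmt.Println(string(body))
}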
Recently I had the same issue and GoQuery helped a lot.
I tried the first site that came up on the net where comments are loaded by a JS event, and wrote you a small snippet. You can try it and check it out.
doc, _ := goquery.NewDocument("http://www.ihg.com/holidayinn/hotels/us/en/san-francisco/sfocc/hoteldetail/hotel-reviews?scmisc=hotel_details_reviews_link_bottom")
htmlContents, _ := doc.Html()
fmt.Println(htmlContents)
This will initially show all the comments below the main content of the page, which are loaded by a JS event.
Good Luck!
If you own the site or can easily determine (or generate) the URI of the call that loads the comments, it's probably easier to make that same AJAX call yourself than to bother with DOM parsing or arbitrary JS execution.
At that point Go would actually be a good language to use, since its JSON and XML standard libraries are excellent for unmarshalling that kind of data.
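For instance, if the endpoint returned JSON, a minimal sketch of the unmarshalling could look like this (the Comment fields and the URL are assumptions for illustration, not any real site's schema):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Comment is a hypothetical shape; match the struct tags to the
// JSON the real endpoint returns.
type Comment struct {
	Author string `json:"author"`
	Body   string `json:"body"`
}

func main() {
	res, err := http.Get("https://example.com/ajax/comments")
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()

	var comments []Comment
	if err := json.NewDecoder(res.Body).Decode(&comments); err != nil {
		log.Fatal(err)
	}
	for _, c := range comments {
		fmt.Printf("%s: %s\n", c.Author, c.Body)
	}
}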
Related
I have a listing page for an e-commerce website with various items (item_list.php). This page is generated with a PHP loop and displays each item inside a <li> element. Every item is a link to the same page, called item_details.php.
When clicking on the link I want to run a script that changes a SESSION var to a certain $id (which will be extracted from the <li> itself with the .innerHTML function) and then allows the browser to move on to the next page (item_details).
This is needed so I can display the proper information about each item.
I think this is possible with AJAX, but I would prefer a solution that uses JS and PHP only.
(P.S. This is for a university project and I'm still a PHP newbie; I tried searching for an answer for a good while but couldn't find a solution.)
No JS or other client-side code can set session values, so you need either an AJAX call to PHP, or some workaround. This is not a complete answer, but something to get you thinking and hopefully going on the project again.
The obvious answer is to just include the id in the link, get it in PHP from the $_GET array, and filter it properly:
<a href="item_details.php?id=123">item title</a>
If, however, there is some reason this is not a question with an obvious answer:
1.) The closest to what you're after can be achieved with a callback and an AJAX call. The idea is to have the actual link with a click handler, returning false so the link doesn't fire at once, which also sends an AJAX POST request that finally uses document.location to redirect your browser.
I strongly advise against this, as it will break ctrl-clicks, causing a flawed user experience.
Check out some code and examples here, which you could modify. You will also need an ajax.php file which will actually set the session value. https://developers.google.com/analytics/devguides/collection/analyticsjs/enhanced-ecommerce#product-click
2.) Now, a perhaps slightly better approach, if you truly need to do this client-side, would be a click handler which, instead of performing an AJAX call or setting the session directly, uses jQuery to set a cookie; you then access that data on the item_details.php page.
See more information and instructions here: https://www.electrictoolbox.com/jquery-cookies/
<script>
$('.product_li a').click(function(){
$.cookie("li_click_data", $(this).parent().html());
return true;
});
</script>
......
<li class="product_li"><a href="item_details.php">your product title</a></li>
And in your target PHP file you check for the cookie. Remember that this cookie can be set to anything, so never, ever trust user data. Test and filter it in order to make sure your code is not compromised; I don't know what you want to do with this data.
$_COOKIE['li_click_data'];
3.) Finally, as the best approach, you should look at your current code and see if there is something you can re-engineer. Here's a quick example.
You could do the following in PHP to save an array of the values in the session on each page load, and then read a value back, provided you have some kind of id or other usable identifier for your items:
// for item_list.php
foreach($items as $i) {
    // Do what you normally do, but also set an array in the session.
    // Presuming you have an id or some other means (here item_id) to identify
    // an item, you can then also access this array on the item_details page.
    $_SESSION['mystic_item_data_array'][$i['item_id']] = $i['thedata'];
}
// For item_details.php
$item_id = $_GET['id']; // or whatever means you use to identify items, get that id
$data_you_need = $_SESSION['mystic_item_data_array'][$item_id];
Finally.
All the above ways are usable for small data like the previous page, filters, keys and similar.
Basically, 1 and 2 (client-side) should only be used if the data is actually generated client-side. If you have it in PHP already, then process it in PHP as well.
If your intention is to store actual HTML, then just regenerate it on the other page and use one of the above ways to store the small data you need for that.
I hope this gets you going and at least thinking of how to solve your project. Good luck!
I want to extract a number provided by a JavaScript object on a site, but I really don't understand what I am doing.
I tried different versions using similar examples and guidelines from the import.io site and other tutorial sites, but I got only one of two results: extracting all the numbers on the given page, or nothing at all.
I tried e.g. //*[contains(.,"Unikālo apmeklējumu skaits:")]/@type and //*[contains(.,"Unikālo apmeklējumu skaits:")]. Most likely it's necessary to add something else there, but I just don't know what.
The link I am interested in extracting from is: https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html and the information I need is the number after the text "Unikālo apmeklējumu skaits:" ("unique visit count"), which is provided by JavaScript.
Hopefully someone will be able to help me with this problem.
For someone who is new to web scraping this may be a hard task; I'll try to explain it. First of all, the XPath to get to that location could be something like this:
'//td[@class="msg_footer" and contains(text(), "Unik")]'
Now you have that tag (and what it contains), but if you check, it doesn't contain the number you need; that content is being dynamically loaded with JavaScript, and the JavaScript is this one:
<script type="text/javascript"><!--
var ss_w='rādīt numuru';
document.write( '<scr'+'ipt id="contacts_js" src="/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t='+new Date()+'"></scr'+'ipt>' );
--></script>
which could be gotten from the response with this xpath:
'//script[contains(text(), "contacts_js")]/text()'
From that string, you should replicate the URL that comes in src, so for example this URL:
/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=
and add the current date to the end, as the JavaScript creates it with new Date(). Then you should make a request to that URL (prepending the previous response's domain), so something like:
https://www.ss.lv/js/2015-10-27/37863/VHoBGkpqSV8bfwkdTX9AXEpZXCVDlASIQ1ZV3kK.js?t=Wed%20Oct%2028%202015%2020:56:42%20GMT-0500%20(PET)
Note that the date is URL-encoded. It should return a response like:
var PHONE_CNT=-1;var PHONE_CNT2=-1;var PHONE_CNT3=-1;var EMAIL_CNT=-1;var SHOW_CNT=22;var PH_c="";var PH_1=0;var PH_2=0;var PH_3=0;
pcc_id=0;PH_1=gpzd("JTg3aCU3QyU1QnolN0MlN0JYcWh6JTVCdCU5NSU4QyU5MnV4ayU5QXElN0IlOTQlNUNweiU5MGtvJTdCJThFJTVF","55937369");
where you can check that the value assigned to SHOW_CNT is the number you want.
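If you'd rather script those steps than use import.io, here is a rough Go sketch of the two requests described above (the regexes, the date format, and the hard-coded domain are simplifying assumptions for illustration, not a tested implementation):

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
	"net/url"
	"regexp"
	"time"
)

// fetch downloads a URL and returns the body as a string.
func fetch(u string) string {
	res, err := http.Get(u)
	if err != nil {
		log.Fatal(err)
	}
	defer res.Body.Close()
	body, err := ioutil.ReadAll(res.Body)
	if err != nil {
		log.Fatal(err)
	}
	return string(body)
}

func main() {
	page := fetch("https://www.ss.lv/msg/lv/clothes-footwear/womens-clothes/trousers/ikcbb.html")

	// Pull the src of the contacts_js script out of the document.write call.
	srcRe := regexp.MustCompile(`id="contacts_js" src="([^"?]+)\?t=`)
	m := srcRe.FindStringSubmatch(page)
	if m == nil {
		log.Fatal("contacts_js script not found")
	}

	// Append the current date, URL-encoded, mimicking the page's new Date().
	t := url.QueryEscape(time.Now().Format("Mon Jan 02 2006 15:04:05 GMT-0700 (MST)"))
	js := fetch("https://www.ss.lv" + m[1] + "?t=" + t)

	// The number we want is the value assigned to SHOW_CNT.
	cntRe := regexp.MustCompile(`SHOW_CNT=(\d+)`)
	if c := cntRe.FindStringSubmatch(js); c != nil {
		fmt.Println("unique visits:", c[1])
	}
}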
If you want to know how I figured out which request and which script were populating that tag: I did it using Firebug, searching for SHOW_CNT inside all of the responses involved in calling your URL, which pointed to the request I specified, and then checking what was requesting that.
Hope it helped.
support@import.io are the guys to speak to; they give free advice and help troubleshoot problems just like this all the time.
There are all kinds of tips and tricks you can use... for example, import.io provide an (undocumented, beta) JavaScript pre-render service that would likely work for you in this scenario. API publish failures are sometimes caused by timeouts while waiting for sites to render JS; this would fix that.
http://support.import.io/knowledgebase/articles/623235-infinite-scroll-and-javascript-prerender-beta
I hope this helps.
I'm using node, request and cheerio to fetch data from an HTML page. This has not been a problem, but one page loads additional data through AJAX to fill different containers. These are empty and undefined when the initial request is done; how do I handle this the best way?
request(url, function (error, response, html) {
if (!error && response.statusCode == 200) {
var $ = cheerio.load(html);
forum_url = $('.this.url.is.loaded.separately.with.ajax').eq(1).attr('href');
}
});
Cheerio isn't really designed with AJAX in mind. If you are able to extract the URLs that need to be downloaded, you would likely have to maintain multiple separate $ objects, as it's unlikely they can be merged easily.
Usually, in cases where you need to execute JavaScript found on a scraped page, we turn to Phantom.js. Phantom is a headless browser that you control using JavaScript; it's pretty cool.
You can check out some Phantom.js web scraping code here: http://code4node.com/snippet/web-scraping-with-node-and-phantomjs
I am new to Django, but I am an advanced programmer in other frameworks.
What i intend to do:
Press a form button, triggering JavaScript that fires an AJAX request, which is processed by a Django view (which creates a file) and returns plain simple JSON data (the name of the file) - and that is appended as a link to a DOM element named 'downloads'.
What I achieved so far instead:
Press the button, triggering JS that fires an AJAX request, which is processed by a Django view (which creates a file) and returns the whole page, appended as a duplicate to the DOM element named 'downloads' (instead of simple JSON data).
Here is the extracted code from the corresponding Django view:
context = {
'filename': filename
}
data['filename'] = render_to_string(current_app+'/json_download_link.html', context)
return HttpResponse(json.dumps(data), content_type="application/json")
I tried several variants (like https://stackoverflow.com/a/2428119/850547), with and without a RequestContext object, and different rendering strategies... I am out of ideas now.
It seems to me that there is NO possibility to make AJAX requests without using a template in the response... :-/ (but I hope I am wrong)
But even then: why does Django return the main template (with the full DOM) that I have NOT passed to the context?
I just want JSON data - not more!
I hope my problem is understandable... if you need more information, let me know and I will add it.
EDIT:
For the upcoming questions - json_download_link.html looks like this:
<a href="{{ filename }}">Download</a>
But I don't even want to use that!
The corresponding jQuery:
$.post(url, form_data)
.done(function(result){
$('#downloads').append('<a href="' + result.filename + '">Download CSV</a>')
})
I don't understand your question. Of course you can make an Ajax request without using a template. If you don't want to use a template, don't use a template. If you just want to return JSON, then do that.
Without having any details of what's going wrong, I would imagine that your Ajax request is not hitting the view you think it is, but is going to the original full-page view. Try adding some logging in the view to see what's going on.
There is no need to return the full template. You can return parts of a template and render/append them on the frontend.
A template can be as small as you want. For example this is a template:
name.html
<p>My name is {{name}}</p>
You can return only this template with json.dumps() and append it on the front end.
What is your json_download_link.html?
Assuming example.csv is a string:
data = {}
data['filename'] = u'example.csv'
return HttpResponse(simplejson.dumps(data), content_type="application/json")
Is this what you are looking for?
I'm looking for a simple example of a Node/Express/Jade page being updated using an Ajax call with both the client and server side code.
I'm having a bit of trouble putting it all together in my head.
There are a great many ways this could be done, and it's not immediately apparent which approach you want to take.
I suppose the simplest scenario would be to add some client-side logic to fetch pieces of HTML from the server and update the client. This is easily achieved using jQuery (put it inside a document-ready block to wire up the event):
$('#button').click(function() {
    $.get('/some/url', {foo: 42}, function(result) {
        $('#target').html(result);
    });
});
This way all your HTML is generated on the server and you simply fetch and insert it into the page as needed.
You could also fetch JSON from the server and render the HTML on the client, but that is one of the alternative approaches. I highly recommend giving TodoMVC a look - it's a todo-list application with many different implementations (each using a different framework) and therefore a great learning resource for the various approaches and helper libraries.
I'd also recommend the Hands-on Node.js book. It will help you understand routing and how to get started with Node.