Scraping content from an AJAX/Javascript web page - ajax

I need to do some screen scraping on a web page where the content I need is generated by AJAX. On the initial page there is a table with 4 tabs. When you click on any of the tabs the content of the table changes. I need the content from the 3rd tab only.
I have used the google chrome 'Inspect Element' tool to see what the requests and post data was and I can get the information I need when I put the information (session id and a lot of other cookie data as well as post data) from the inspect element result into a PHP curl request. But this only works for the 30 minutes that the session lasts. Does anyone know of a way I can get to this information?

I wont reproduce the code here but I will point you to the answer.
Its within this book:
http://www.amazon.com/Webbots-Spiders-Screen-Scrapers-Developing/dp/1593273975/ref=dp_ob_image_bk
A must buy for someone doing what your doing.

In the end I used htmlunit to get the content I needed. I also found the HTMLUnit Scripter very useful to help generate the Java code required.

Related

Getting the JSON file when triggering AJAX

I'm writing a crawler to get the content from a website which uses AJAX.
There is a "show more" button at the bottom of the page, and my origin approach is to use Selenium.PhantomJS to pretend a web browser but it works in some website and some don't.
I'm wondering if there is some way i can directly get the underly JSON file of the AJAX action. Please give me some details, thanks.
By the way, I'm using Python.
I understand this is less of a python than a scraping problem in general (and I understand you meant "scraping" instead of "crawling" as a scraper reads/parses/processes one page whereas a crawler processes multiple pages and they're relation to each other).
You can get the JSON file immediately given you know it's URL. If you don't (for example because the URL changes from time to time), you might need to search through javascript files on the page manually to find out how the URL is generated.
Once you know the JSON file's URL, it's quite simple. As you already seem to know how to get the HTML of the "main" page, you can use your existing code to get the JSON file.
I'm not familiar with PhantomJS, but I reckon it's easier to get the JSON file immediately instead of simulating an AJAX request (if that's even possible with Phantom).

how does usatoday display URI for news docs?

I am developing a web app/message board in AJAX. Ive come to the part where I need to decide how to display threads.
Should I refresh a completely new page for each thread? Or load it via AJAX. Obviously, I want each thread to be crawlable, linkable, and saveable as a favorite in your browser.
Then I saw USAToday's website (www.usatoday.com/news). Its very interesting how they load the page through a popup window, change the URI, and keep the data in the background.
This is exactly what I want, but I don't know what they are doing.
Can anyone else decipher this or lead me down the right path?
My impeccable googling skills has led me to believe that the answer lies in pushState.
http://www.seomoz.org/blog/create-crawlable-link-friendly-ajax-websites-using-pushstate
Essentially, it appears they are...
using the HREF of the provided link to change the URI via pushState.
using AJAX to load the contents of the page accessed via the link.
on close, they most likely use data from the newly loaded page to figure out what section its was under(sports, entertainment, etc), and reload that page.

Ajax generated pages with different URLs

I couldn't really word the title very well, but here's my problem: I've got a webpage that reads from a database each time the user clicks a button, the content is then replaced for part of the page.
Because it is an ajax load, everything is done in the background, and so the URL stays the same. This wasn't be a problem at all until I realised that I will want to have a different Facebook comments box for each set of content that is loaded - so if someone comments, it is posted to their facebook profile, people click on the link and are then taken to different content.
So... what I need is some way of referencing each set of content, and I've found a site that does exactly that (I'm sure there are a lot of them).
Here's the link.
Each set of content has a different 'hash code' (because I don't know the actual name for it) which is appended to the URL - in this case the code is "#1922934", this allows people to post links to it that specific set of content on Facebook etc. - and also allows a different Facebook comment box for each set of content.
Does anyone know how such a set-up can be achieved or how these 'hash codes' work?
Here's a document from wikipedia on it.
[http://en.wikipedia.org/wiki/Fragment_identifier][1]
The main idea is that URI fragments are used because they don't cause a page reload. They also can be used to refer to anchors on a web page.
What I would do is on page load use JavaScript to read the URI fragment (location.hash) then make a request to your server to load the comments etc. The URI fragment cannot be read by a server and is only found through a client (browser)
Sounds like you want something like SammyJS.

Loading Ajax Response Data with Adsense Codes inside

I'm 10000000% sure that this question has been asked before, however, the majority of the responses that I came across were from back in 2005, 2006 and so on. Not to mention, almost all of the questions themselves were too general. Therefore, I'm asking this so that for anyone else needs to find this out, then they won't need to dig through about 50 webpages to get an idea.
My question is simply that I have a webpage that has Google Ads embedded into the HTML of the website. The website was first developed as a static HTML site where each link reloaded a new page. Nevermind the backend technology of the website - the website itself produces purely dynamic content. The website is close to completion and now a fully-ajax listener has been added to all the links. When any of the links are clicked, JavaScript takes over, parses the link and sets that using popstate or the hashbang. The page itself is then queried to the server via AJAX and the content is updated using document.getElementByID('container').innerHTML=ajax.responseText; This way, there is almost a 100% method of accessing content that was replaced by AJAX.
This all works fine, but the responseText itself may, WILL contain Google Ads, and I was just wondering how to display them as if it were a static page. Clearly this doesn't work. Here are the options that I've come across:
Use an IFrame:
An IFrame seems to be an effective way to load the content; just stick the adsense codes a simple adsense.html iframe file and let the browser and
directly into page, it isn't possible
it's against their TOS
there is document.write() omitted in ajax request
Your chance is:
Create simple iframe
<iframe src="advert.html"></iframe>
and in advert.html, add your advert code
It's then loaded fine without problems.
Good luck

Ajax generated content, crawling and black listing

My website uses ajax.
I've got a user list page which list users in an ajax table (with paging and more information stuff...).
The url of this page is :
/user-list
User list is created by ajax. When the user click on one user, he is redirected to a page which url is : /member/memberName
So we can see here that ajax is used to generate content and not to manage navigation (with the # character).
I want to detect bot to index all pages.
So, in ajax I want to display an ajax table with paging and cool ajax effetcs (more info...) and when I detect a bot I want to display all users (without paging) with a link to the member page like this :
JohnBob...
Do you think I can be black listed with this technique ? If you think so, could you please provide an alternative solution by keeping these clean urls and without redeveloping the user-list (without ajax) ?
Google support a specification to make AJAX crawlable:
http://code.google.com/web/ajaxcrawling/docs/specification.html
I did an experiment and it works:
http://seo-website-designer.com/SEO-Ajax-Google-Solution
As this is a Google specification, you won't get penalised (unless you abuse it).
Saying that, only Google support it at the moment (AFAIK).
Also, I believe following the concept of Progressive Enhancement is a better approach. That is, create a working html website then make the JavaScript enhance it
Maybe use the urls with an onclick to trigger your AJAX scripting? Like
Some URL
I don't think Google would punish you for this, you primarily use JScript, but you do provide a fall back for their bot, so your site doesn't get any less accessible.
EDIT
Ok, I misunderstood. Then my guess would be you basically have two options:
1. Write a different part of your site where bots end up, or,
2. Rewrite your current site to for example always give a 'full' page, with an option to only get, say, the content div. Then you can get only the content with JavaScript, but bots will always get a nice page.

Resources