How to load a page in the background? - ruby

I have a project where we're using an iframe. However, the project specs have changed and we can no longer use the iframe. Instead we need to request the HTML page in the background and display it on the page when loaded.
Any ideas on how to do this via Ruby (Rails)? Thought it best to ask for general direction before diving in.
Thanks!

load it with ajax, and do a body append
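On the Rails side, the action answering that AJAX call would just return the fragment without a layout so the client can append it to the body. A minimal sketch, with hypothetical controller and partial names (nothing here comes from the question):

class PagesController < ApplicationController
  def background
    # Render just the fragment, no layout, so the AJAX caller can append
    # the response straight into the page body.
    render partial: 'pages/background_content', layout: false
  end
end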

It depends on where you want the work to occur: on the back-end in your Rails code, or in the user's browser via JavaScript, as @stunaz suggests.
Keeping it in the browser and loading via JavaScript will expose the HTML page's location to the user, which might not be desirable. Loading it from the back-end and including it in the HTML emitted to the browser will hide the source entirely.
If you want to do it on the back-end, the simplest thing is to load the file from the local drive using File.read if it is local. If it's on a different machine, you can use OpenURI (open-uri) to pull it in. Either way, you'd then insert it into the HTML in the right spot. How you do that depends on what you are using to generate the outgoing HTML.
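A minimal sketch of that back-end approach, assuming a placeholder local path and URL (neither comes from the question):

require 'open-uri'

local_path = '/var/www/shared/snippet.html'                 # hypothetical local file
remote_url = 'http://other-host.example.com/snippet.html'   # hypothetical remote page

html = if File.exist?(local_path)
         File.read(local_path)          # file on the same machine
       else
         URI.open(remote_url).read      # open-uri fetch (older Rubies: open(remote_url).read)
       end

# In Rails you would then drop `html` into the outgoing page, for example by
# assigning it to an instance variable and printing it with raw(@html) in the view.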

Related

Web application design with/without ajax

Let's say I am creating a webapp for a library. My base url is http://mylibrary.com. I want to use "pretty" URLs as follows:
http://mylibrary.com/books (list all books)
http://mylibrary.com/books/book1 (details of a particular book)
At present my approach is to create a single-page app and use the History API to manage the URLs, i.e. I load all CSS and JS files when the user visits the home page. From then on I just get data from the server using AJAX, in JSON format, and then create the required HTML using JavaScript.
But I have learnt that this is not so good from an SEO point of view. If a crawler were to visit http://mylibrary.com/books it would not see the book list at all, because the AJAX calls would not take place.
My question is what is the other approach to design this kind of app ? Specifically:
Should the server create the entire web page and send it to the browser? I mean, will the response from the server include everything from <html> to </html>, or only the required parts?
Can programming languages like PHP efficiently send the HTML to clients? I would rather have the web server do it.
It appears to me that in this scenario AJAX would have very little role to play other than maybe changing minor parts of the page. Is that a correct understanding? ...and here I was thinking AJAX is the modern way of doing things.
A library would have many books, so the list would be long.
Using AJAX allows you to fetch only the part of it the user is trying to read, without having to retrieve the entire list or navigate by reloading.
So for low bandwidth and impatient users, AJAX is a godsend.
For crawlers that need the entire page to collect data from, not so much.
So really you want to provide different content depending on the visitor.
How to identify a web crawler?
IMHO: provide the page from PHP; if the user agent is a robot, provide the plain list, otherwise provide the fancy AJAX-based site that shows only what you want, when you want.
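The answer above suggests PHP; since the original question in this thread is Rails, here is the same user-agent check as a rough Ruby sketch (the regexp and template names are illustrative, not a complete bot list):

CRAWLER_UA = /googlebot|bingbot|slurp|baiduspider/i

def books
  if request.user_agent.to_s =~ CRAWLER_UA
    render :full_list    # plain, complete HTML list the crawler can read
  else
    render :spa_shell    # JS-driven page that fetches JSON and renders client-side
  end
end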

Google Search optimisation for ajax calls

I have a page on my site which has a list of things which gets updated frequently. This list is created by calling the server via jsonp, getting json back and transforming it into html. Fast and slick.
Unfortunately, Google isn't able to index it. After reading up on how to get this done according to Google's AJAX crawling guide, I am a bit confused and need some clarification and confirmation:
Only the AJAX pages need to implement the rules, right?
I currently have a rest url like
[site]/base/junkets/browse.aspx?page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
this would need to become something like:
[site]/base/junkets/browse.aspx#page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
And when google calls it like this
[site]/base/junkets/browse.aspx#!page=1&rows=18&sidx=ScoreAll&sord=desc&callback=jsonp1295964163067
I would have to deliver the html snapshot.
Why replace the ? with # ?
Creating HTML snapshots seems very cumbersome. Would it suffice to just serve simple links? In my case I would be happy if Google would only index the 'things' pages.
It looks like you've misunderstood the AJAX crawling guide. The #! notation is to be used on links to the page your AJAX application lives within, not on the URL of the service your application makes calls to. For example, if I access your app by going to example.com/app/, then you'd make the page crawlable by instead linking to example.com/app/#!page=1.
Now when Googlebot sees that URL in a link, instead of going to example.com/app/#!page=1 – which means issuing a request for example.com/app/ (recall that the hash is never sent to the server) – it will request example.com/app/?_escaped_fragment_=page=1. If _escaped_fragment_ is present in a request, you know to return the static HTML version of your content.
Why is all of this necessary? Googlebot does not execute script (nor does it know how to index your JSON objects), so it has no way of knowing what ends up in front of your users after your scripts run and content is loaded. So, your server has to do the heavy lifting of producing an HTML version of what your users ultimately see in the AJAXy version.
So what are your next steps?
First, either change the links pointing to your application to include #!page=1 (or whatever), or add <meta name="fragment" content="!"> to your app's HTML. (See item 3 of the AJAX crawling guide.)
When the user changes pages (if this is applicable), you should also update the hash to reflect the current page. You could simply set location.hash='#!page=n';, but I'd recommend using the excellent jQuery BBQ plugin to help you manage the page's hash. (This way, you can listen to changes to the hash if the user manually changes it in the address bar.) Caveat: the currently released version of BBQ (1.2.1) does not support AJAX crawlable URLs, but the most recent version in the Git master (1.3pre) does, so you'll need to grab it here. Then, just set the AJAX crawlable option:
$.param.fragment.ajaxCrawlable(true);
Second, you'll have to add some server-side logic to example.com/app/ to detect the presence of _escaped_fragment_ in the query string, and return a static HTML version of the page if it's there. This is where Google's guidance on creating HTML snapshots might be helpful. It sounds like you might want to pursue option 3. You could also modify your service to output HTML in addition to JSON.
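The asker's app is ASP.NET, but the _escaped_fragment_ check itself is the same in any stack. A rough Sinatra-style sketch for illustration (the route and template names are made up):

require 'sinatra'

get '/app/' do
  if params['_escaped_fragment_']
    # Googlebot asking for the snapshot: /app/?_escaped_fragment_=page=1
    page = params['_escaped_fragment_'][/page=(\d+)/, 1] || '1'
    erb :static_snapshot, locals: { page: page.to_i }   # plain HTML version
  else
    erb :ajax_shell                                      # normal JS-driven page
  end
end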
I've more or less given up on this. There really seems to be no alternative to generating the HTML on the server and delivering it in the HTML body if you want Google to index your directory.
I even tried adding a section wrapped in a .NET user control which implemented a simple HTML version of the directory, but Google managed to ignore that too.
So in the end my directory has been de-ajaxified. :(

Javascript: achieving the Google Ad AJAX effect

I need to create a portable script to give to others to implement on their websites that will dynamically show content from my database (MySQL).
I know AJAX has a cross-site problem, but it seems that Google's ad's somehow manage the effect in a cross-browser / cross-site fashion.
Knowing that I have to give people a simple cut/paste snippet to put in their website...how can I achieve this? How did Google?
They use an <iframe>, so the ad is served from their server, and can talk to their database. I'm not actually sure that they use any sort of AJAX from their ads, though; they appear to just be mostly static content, with a few scripts for tweaking the formatting (which are optional, since they want their ads to be visible even if users have JS turned off).
Remember, you can always look into this on your own, and see what they did. On Firefox, use Firebug to explore the html, css, and scripts on a site. On WebKit based browsers (Safari, Chrome, and others), you can use the Web Inspector.
Google's ad code is loaded via a script tag that calls a remote javascript file. The AJAX restrictions that are generally enforced with xmlhttp, iframe, and similar AJAX requests don't apply when it comes to loading remote javascript files.
Once you've loaded the javascript file, you can create iframes in your page that link back to the actual hosted content on your server (and feed them any data about the current page that you wish).
jQuery has built-in support for JSONP in its AJAX calls. You may want to look into using that if you really need to use AJAX.
http://api.jquery.com/
http://docs.jquery.com/Ajax
You don't need iFrames and you don't need AJAX. It's really, really simple!
You pull in a remote JS file that is actually a constructed file from php/asp/whatever. In your JS file you have a document.write script that writes the content. It's that simple.
We do this all the time with media stored on separate sites. Here's an example.
YOUR SERVER: file.php (which outputs js)
<script>
document.write("I'm on a remote server");
</script>
OTHER SITE:
<script src='http://www.yourserver.com/file.php'></script>
And it will output the content generated by the script. To customize the content, you can set script vars above the script tag that adjust what your file pulls out. From there it's pretty straightforward.
I realize this question is a year old, but I've written a library that can help with the document.write part of the problem (whether this is a TOS violation, I don't know): writeCapture.js. It's pretty simple:
$('#ads').writeCapture().html('<script src="whatever-your-adsense-code-is"> </script>');
The example uses jQuery, but you can use it standalone as well.

Screen scraping an ASP.NET web page to retrieve data displayed in the grid view

I am using Ruby to screen scrape a web page (created in ASP.NET) which uses a GridView to display data. I am able to read the data displayed on page 1 of the grid, but I'm unable to figure out how I can move to the next page in the grid to read all the data.
The problem is that the page-number hyperlinks are not normal hyperlinks (with a URL) but JavaScript hyperlinks which cause a postback to the same page.
An example of the hyperlink:
6
I recommend using Watir, a ruby library designed for browser testing, if you're already using ruby for processing. For one thing, it gives you a much nicer interface to the DOM elements on the page, and it makes clicking links like this easier:
ie.link(:text, '6').click
Then, of course you have easier methods for navigating the table as well. It's easy enough to automate this process:
(1..total_number_of_pages).each do |next_page|
  ie.link(:text, next_page.to_s).click
  # table processing goes here
end
I don't know your use case, but this approach has its advantages and disadvantages. For one thing, it actually runs a browser instance, so if this is something you need to frequently run quietly in the background in completely automated way, this may not be the best approach. On the other hand, if it's ok to launch a browser instance, then you don't have to worry about all that postback nonsense, and you can just click the link as if you were a user.
Watir: http://wtr.rubyforge.org/
You'll need to figure out the actual URL.
Option 1a: Open the page in a browser with good developer support (e.g. Firefox with the web development tools) and look through the source to find where __doPostBack is defined. Figure out what URL it's constructing. Note that it might not be in the main page source, but instead in something that the page loads.
Option 1b: Ditto, but have Ruby do it. If you're fetching the page with Net::HTTP you've got the tools to find the definition of __doPostBack already (the body as a string, Ruby's grep, and the ability to request additional files, such as those in script tags) - see the sketch after this list of options.
Option 2: Monitor the traffic between a browser and the page (e.g. with a logging proxy) to find out what the URL is.
Option 3: Ask the owner of the web page.
Option 4: Guess. This may not be as bad as it sounds (e.g. if the original URL ends with "...?page=1" or something) but in general this is the least likely to work.
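For option 1b, a rough sketch of the kind of digging involved (the URL is a placeholder, not one from the question):

require 'net/http'
require 'uri'

body = Net::HTTP.get(URI.parse('http://example.com/grid.aspx'))

# Where is the postback wired up, and which hidden fields does it submit?
puts body.lines.grep(/__doPostBack/)
puts body.scan(/<input type="hidden" name="(__\w+)"/).flatten.uniq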
Edit (in response to your comment on the other question):
Assuming you're using the Net::HTTP library, you can do a postback by replacing your get with a post and sending the form data along, e.g. my_http.post(my_url, form_data) instead of my_http.get(my_url).
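A hedged sketch of what that postback tends to look like for an ASP.NET GridView: the hidden-field names are standard ASP.NET ones, but the target/argument values and the viewstate have to be scraped from the page you fetched first (none of the values below come from the question):

require 'net/http'
require 'uri'

uri = URI.parse('http://example.com/grid.aspx')               # placeholder URL
res = Net::HTTP.post_form(uri,
  '__EVENTTARGET'     => 'GridView1',                         # control that raises the postback
  '__EVENTARGUMENT'   => 'Page$2',                            # e.g. "go to page 2"
  '__VIEWSTATE'       => viewstate_from_first_page,           # copied out of page 1's hidden fields
  '__EVENTVALIDATION' => eventvalidation_from_first_page)     # likewise
page_two_html = res.body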
Edit (in response to danieltalsky's answer):
watir may be a really good solution for you (I'm kicking myself for not having thought of it), but be aware that you may have to manually fire the event or go through other hoops to get what you want. As a specific gotcha, with any asynchronous fetch like this you need to make sure that the full response has come back before you scrape it; that isn't a problem when you're doing the request inline yourself.
You will have to perform the postback. The data is passed back to the server with a form POST. Like Markus said, use something like Firebug, the Developer Tools in IE 8, or Fiddler to watch the traffic. But honestly, this is a Web Forms page using the bloated GridView, so you will be in for a fun adventure. ;)
You'll need to do some investigation in order to figure out what HTTP request the javascript execution is performing. I've used the Mozilla browser with the Firebug plugin and also the "Live HTTP Headers" plugin to help determine what is going on. It will likely become clear to you which requests you will need to make in order to traverse to the next page. Make sure you pay attention to any cookies getting set.
I've had really good success using Mechanize for scraping. It wraps all of the HTTP communication, HTML parsing and searching (using Nokogiri), redirection, and holding onto cookies. But it doesn't know how to execute JavaScript, which is why you will need to figure out what HTTP request to perform on your own.
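A minimal Mechanize sketch along those lines (the URL, form selector, field values and table id are all placeholders, not taken from the question):

require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://example.com/grid.aspx')

# Mechanize keeps cookies and existing hidden fields (__VIEWSTATE etc.) for you;
# you still have to fill in the fields the javascript link would have set.
form = page.form_with(action: /grid\.aspx/i)
form['__EVENTTARGET']   = 'GridView1'
form['__EVENTARGUMENT'] = 'Page$2'
page_two = agent.submit(form)

page_two.search('table#GridView1 tr').each { |row| puts row.text.strip }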

Pass information back from an iframe?

Right now I'm building a Firefox plugin that duplicates some functionality on my website. It takes in an email address and then returns information to the user. The easiest way to do this in the plugin is to use an iframe and render that super simple form from my website. All of this works great, but to make the plugin really useful, I would like the plugin to have access to the information that the iframe renders, so it can use it in the current window that the user is in.
Is it possible to pass information back through an iframe in this manner? I know there are quite a few domain access restrictions with iframes, so any help or insight is appreciated!
I've done this two ways.
If the iframe is on the same domain as the parent website, you can just, in javascript, access window.parent.
If it isn't, however...I've done a dirty trick. I'll share it here, though, as it may help.
We created a page on the other domain which would call to window.parent.parent. We put that in a hidden iframe inside the iframed page, and sent it a querystring argument or two. It's not pretty, but it gets around cross-domain scripting problems.
This basically means that you have this sort of thing:
admin.example.com
content.example.com - iframe
admin.example.com?contentid=350 - hidden iframe that makes a window.parent.parent call.
Is the point of this whole exercise functional testing of your website? If so, instead of your custom Firefox plugin, consider using Selenium to automate interactions with websites. It works with all major browsers and supports the inspection of page elements you are trying to do (using XPath). It also features a Firefox plugin called Selenium IDE that allows you to conveniently "record" your interactions with a website for automated playback later.
