Screen scraping and proxies using Ruby - ajax

I know there are several screen scraping threads on here but none of the answers quite satisfied me.
I am trying to scrape the HTML from an external web page using JavaScript. I am using $.ajax and expected everything to work. Here is my code:
$.ajax({
url: "my.url/path",
dataType: 'text',
success: function(data) {
// "data" already holds the response text, so no second request is needed
alert(data);
}
});
The only problem is that it is looking for the specified url within my web server. How do I use a proxy to get to an external web page?

Due to cross-site scripting restrictions, you're going to have to pass the desired URL to a page on your server that will query that URL server-side and then return the results to you. Take a look at the thread below, then incorporate that into your application and have it return the source when that page is hit by your AJAX function.
How to get the HTML source of a webpage in Ruby
A GET request is the easiest way to pass the URL of the page you want to fetch to your server, so you'll be able to call something like:
$.ajax("fetchPage.rb?url=" + encodeURIComponent("http://www.google.com"))
Because you can't access the site in question directly from the server, you're going to have to route the server-side script through a proxy for the request to work, and how you do that depends on your setup. Take a look at the Proxy class in Ruby:
http://ruby-doc.org/stdlib-1.9.3/libdoc/net/http/rdoc/Net/HTTP.html#method-c-Proxy
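As a minimal sketch of the client side of this approach, the snippet below builds the request URL for the server-side page. The fetchPage.rb endpoint comes from the answer above; the `url` query parameter name and the `buildProxyUrl` helper are assumptions made for illustration. `encodeURIComponent` is used rather than `encodeURI` because the target URL travels as a single query value, so characters such as `:`, `/`, `?` and `&` have to be escaped.

```javascript
// Hypothetical helper: builds the URL of the same-origin proxy page.
// "url" as the query parameter name is an assumption, not part of the original answer.
function buildProxyUrl(proxyPath, targetUrl) {
  return proxyPath + "?url=" + encodeURIComponent(targetUrl);
}

var requestUrl = buildProxyUrl("fetchPage.rb", "http://www.google.com/search?q=ruby&hl=en");
console.log(requestUrl);
// fetchPage.rb?url=http%3A%2F%2Fwww.google.com%2Fsearch%3Fq%3Druby%26hl%3Den
```

The resulting string can be passed to $.ajax with dataType 'text', and the server-side script then reads the target from that parameter.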

Related

Curl works but ajax not working in Shopify private app

I have created a private app from my store and am trying to hit the https://API_KEY:PASS@STORE_NAME/admin/orders.json URL using ajax and curl. It works if I use curl but not with ajax. Can anyone explain what the issue is?
This might be a Cross origin problem. If you are using jQuery try to make an ajax call with dataType set to jsonp as shown here:
$.ajax("url", {
dataType: "jsonp",
success: function(data) {
console.log(data);
}
})
Like the other answer said, it's a cross origin problem (See CORS)
Best way to deal with it normally is Shopify App Proxy, but this isn't available to private apps, only custom apps. Best bet is to build a custom app and authenticate with OAuth2, assuming there's no other reason you've chosen to build a private app instead.
If the nature of your app permits the change to a custom app, the App Proxy will give you a {store-name}.myshopify.com/{resource} end point that will bypass the cross-origin issue, but forward the request to your remote server.
Also, when you're working with JS and something is not working, check the console, and share any errors. No one can really tell you why it's not working without seeing either the code, the error, or both, but this is a common enough stumbling block with AJAX since all this cross-origin security stuff got put into place that I'm 90% sure it's the answer.

cross domain request with dojo

I am attempting a cross-domain request with dojo. The external URL is of MIME type text/html, and the only content on the page is something like 1236. I tried:
dojo.require("dojo.io.script");
dojo.ready(function() {
dojo.io.script.get({
url: "theexternalurl",
callbackParamName: "jsoncallback",
load: function(data) {
console.log(data);
}
});
});
But that was no good. Any ideas on how this can be done with dojo?
I suspect you are bumping into the browser security here. Cross-domain requests will only work when using iframes or injecting scripts (as you have done) and when the content of that script is valid "text/javascript".
If you are trying to load "text/html" into the script, it won't work as it isn't a valid script. It is something most of us have tried to do at some point. I have spent hours trying to get around cross-domain restrictions and found the security blocking it to be solid.
See my answer here for more details.
If all you are trying to do is load the content onto the page, then you could use an <iframe>. However, if you are trying to parse the loaded content in some way, then I'm afraid it is a dead end. Probably not the answer you were hoping for, but it'll save you hours of frustration.

Cross domain javascript ajax request - status 200 OK but no response

Here is my situation:
I'm creating a widget that site admins can embed in their site, and the data is stored on my server. So the script basically has to make an ajax request to a PHP file on my server to update the database. Right? Right :)
The ajax request works fine when I run it on my local server, but it does not work when the PHP file is on my ONLINE server.
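This is the code I'm using:
var url = "http://www.mydomain.net/ajax_php.php";
var params = "com=ins&id=1&mail=mymail#site.net";
var http = new XMLHttpRequest();
http.open("POST", url, true); // third argument true = asynchronous
// Send the body as form data so PHP can read it from $_POST
http.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
http.onreadystatechange = function() {
if (http.readyState == 4 && http.status == 200) {
// do my things here
alert(http.responseText);
}
};
http.send(params);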
In Firebug it shows: http://www.mydomain.net/ajax_php.php 200 OK X 600ms.
When I check the ajax responseText, I always get a status of 0.
Now my question is: can I do cross-domain ajax requests by default? Might this be a cross-domain ajax problem? Since it works when the requested file resides on my local server but DOESN'T work when the requested file is on another server, I'm thinking ajax requests to a remote server might be denied. Can you help me clear this up?
Thanks..
Cross-domain requests are not directly allowed. However, there is a commonly-used technique called JSONP that will allow you to avoid this restriction through the use of script tags. Basically, you create a callback function with a known name:
function receiveData(data) {
// ...
}
And then your server wraps JSON data in a function call, like this:
receiveData({"the": "data"});
And you "call" the cross-domain server by adding a script tag to your page. jQuery elegantly wraps all of this up in its ajax function.
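The server-side wrapping step can be sketched as follows. This is only an illustration: `wrapJsonp` is a hypothetical helper name, and the identifier check is an added precaution (not something the answer above requires) so that a caller-supplied callback name cannot inject arbitrary script.

```javascript
// Hypothetical server-side helper: wraps a value in a call to the named callback.
function wrapJsonp(callbackName, payload) {
  // Only allow plain identifiers so the callback name cannot inject script.
  if (!/^[A-Za-z_$][A-Za-z0-9_$]*$/.test(callbackName)) {
    throw new Error("Invalid JSONP callback name: " + callbackName);
  }
  return callbackName + "(" + JSON.stringify(payload) + ");";
}

var body = wrapJsonp("receiveData", { the: "data" });
console.log(body); // receiveData({"the":"data"});
```

The response should be served as JavaScript rather than JSON, since the browser executes it as a script.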
Another technique that I've had to use at times is cross-document communication through iframes. You can have one window talk to another, even cross-domain, in a restricted manner through postMessage. Note that only recent browsers have this functionality, so that option is not viable in all cases without resorting to hackery.
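For the postMessage route, the receiving window should check where each message came from before trusting it. The sketch below assumes the trusted origin is http://www.mydomain.net (taken from the question); `createMessageHandler` is a hypothetical helper, written as a plain function so that the origin check is easy to see.

```javascript
// Hypothetical handler factory: accepts messages only from one trusted origin.
function createMessageHandler(trustedOrigin, onData) {
  return function (event) {
    // event.origin is filled in by the browser, not by the sending page.
    if (event.origin !== trustedOrigin) {
      return false; // ignore messages from any other origin
    }
    onData(event.data);
    return true;
  };
}

var received = [];
var handler = createMessageHandler("http://www.mydomain.net", function (data) {
  received.push(data);
});

// In a browser this would be registered with:
// window.addEventListener("message", handler, false);
```

The sending side would call postMessage on the target window and pass the same origin string as the second argument, so the message is only delivered to the intended site.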
You're going to need to have your response sent back to your client via a JSONP call.
What you'll need to do is to have your request for data wrapped in a script tag. Your server will respond with your data wrapped in a function call. By downloading the script as an external resource, your browser will execute the script (just like adding a reference to an external JS file like jQuery) and pass the data to a known JS method. Your JS method will then take the data and do whatever you need to do with it.
Lots of steps involved. Using a library like jQuery provides a lot of support for this.
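Without a library, the client-side steps described above might look roughly like this. It is a sketch under assumptions: `loadJsonp` is a hypothetical name, and the `callback` query parameter is only a common convention, so the real parameter name depends on what the server expects. The `doc` and `globalObj` parameters default to the browser's document and window and exist only so the function can be exercised outside a page.

```javascript
// Hypothetical JSONP loader: injects a script tag and waits for the callback.
function loadJsonp(url, callbackName, onData, doc, globalObj) {
  doc = doc || document;
  globalObj = globalObj || window;

  var script = doc.createElement("script");

  // The server is expected to respond with: callbackName(<json>);
  globalObj[callbackName] = function (data) {
    delete globalObj[callbackName]; // clean up the temporary global
    if (script.parentNode) {
      script.parentNode.removeChild(script);
    }
    onData(data);
  };

  // "callback" as the parameter name is an assumption; servers differ.
  var separator = url.indexOf("?") === -1 ? "?" : "&";
  script.src = url + separator + "callback=" + encodeURIComponent(callbackName);
  doc.head.appendChild(script);
  return script;
}
```

In a page this would be called as loadJsonp(url, "receiveData", function (data) { ... }), which is essentially what jQuery does when dataType is set to "jsonp".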
Hope this helps.

Issue with METHOD in prototype / Ajax.Request

I am trying to call the Yahoo API via Ajax to find the current weather:
var query = "select * from weather.forecast where location in ('UKXX0085','UKXX0061','CAXX0518','CHXX0049') and u='c'";
var url = 'http://query.yahooapis.com/v1/public/yql?q=' + encodeURIComponent(query) +'&rnd=1344223&format=json&callback=jsonp1285353223470';
new Ajax.Request(url, {
method: 'get',
onComplete: function(transport) {
alert(transport.Status); // say 'null'
alert(transport.responseText); // say ''
}
});
I noticed that instead of GET, Firebug says OPTIONS. What is it, and how can I force Prototype to use GET?
Here is the functionality which I am trying to recreate.
And here is the full URL which I am trying to access:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20location%20in%20(%27UKXX0085%27%2C%27UKXX0061%27%2C%27CAXX0518%27%2C%27CHXX0049%27)%20and%20u%3D%27c%27&rnd=1344223&format=json&callback=jsonp1285353223470
After hours of trying to debug the same issue myself, I came to the following conclusion.
I believe this happens because of XSS counter-measures in newer browsers.
You can find very detailed information about these new counter-measures here:
https://developer.mozilla.org/en/http_access_control
Basically, a site can specify how "careful" the browser should be about allowing scripts from other domains. If your site, or a site from which you're loading external JavaScript code, includes one of these pieces of "browser advice", newer browsers will react by enforcing a stronger XSS policy.
For some reason, Prototype's Ajax.Request, under Firefox, seems to react by attempting to do an OPTIONS request, rather than a GET or POST, so perhaps Prototype has not been updated to correctly handle these new security conditions.
At least that was the conclusion in my case. Maybe this clue can help with your case...
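The OPTIONS request seen here is what the access-control mechanism linked above calls a preflight: before sending certain cross-origin requests, the browser first asks the server whether they are allowed. The sketch below is a simplified version of that decision, covering only the method and header rules; `needsPreflight` is a hypothetical helper, not a browser API. As far as I know, Prototype's Ajax.Request adds custom headers such as X-Requested-With by default, which would be enough to trigger the preflight even for a plain GET.

```javascript
// Simplified check: would a cross-origin request trigger a preflight (OPTIONS)?
// Covers only the method, header-name, and Content-Type rules.
var SIMPLE_METHODS = ["GET", "HEAD", "POST"];
var SIMPLE_HEADERS = ["accept", "accept-language", "content-language", "content-type"];
var SIMPLE_CONTENT_TYPES = [
  "application/x-www-form-urlencoded",
  "multipart/form-data",
  "text/plain"
];

function needsPreflight(method, headers) {
  if (SIMPLE_METHODS.indexOf(method.toUpperCase()) === -1) {
    return true;
  }
  return Object.keys(headers).some(function (name) {
    var lower = name.toLowerCase();
    if (SIMPLE_HEADERS.indexOf(lower) === -1) {
      return true; // any custom header forces a preflight
    }
    if (lower === "content-type") {
      var type = headers[name].split(";")[0].trim().toLowerCase();
      return SIMPLE_CONTENT_TYPES.indexOf(type) === -1;
    }
    return false;
  });
}

console.log(needsPreflight("GET", {}));                                       // false
console.log(needsPreflight("GET", { "X-Requested-With": "XMLHttpRequest" })); // true
```

Since the URL in the question already carries a callback parameter, loading it through a script tag (JSONP) instead of XMLHttpRequest avoids the preflight entirely.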

Crawling Ajax.request url directly ... permission error

I need to crawl a web board, which uses ajax for dynamic update/hide/show of comments without reloading the corresponding post.
I am blocked by this comment area.
In Ajax.request, url is specified with a path without host name like this :
new Ajax('/bbs/comment_db/load.php', {
update : $('comment_result'),
evalScripts : true,
method : 'post',
data : 'id=work_gallery&no=i7dg&sno='+npage+'&spl='+splno+'&mno='+cmx+'&ksearch='+$('ksearch').value,
onComplete : function() {
$('cmt_spinner').setStyle('display','none');
try {
$('cpn'+npage).setStyle('fontWeight','bold');
$('cpf'+npage).setStyle('fontWeight','bold');
} catch(err) {}
}
}).request();
If I try to access the URL with the full host name, as below, I just get the message "Permission Error":
new Ajax('http://host.name.com/bbs/comment_db/load.php', {
update : $('comment_result'),
evalScripts : true,
method : 'post',
data : 'id=work_gallery&no=i7dg&sno='+npage+'&spl='+splno+'&mno='+cmx+'&ksearch='+$('ksearch').value,
onComplete : function() {
$('cmt_spinner').setStyle('display','none');
try {
$('cpn'+npage).setStyle('fontWeight','bold');
$('cpf'+npage).setStyle('fontWeight','bold');
} catch(err) {}
}
}).request();
I get the same error even when I open the actual PHP URL directly in the web browser, like this:
http://host.name.com/bbs/comment_db/load.php?'id=work_gallery&..'
I guess that the PHP module is restricted so that it can only be called from a URL on the same host.
Any ideas for crawling this data?
Thanks in advance.
-- Shin
Cross-site XMLHttpRequests are forbidden by most browsers. If you want to crawl different sites, you will need to do it in a server-side script.
As mentioned by darin, the XMLHttpRequest object (which is the essence of Ajax requests) has security restrictions on cross-site HTTP requests; I believe it's called the "Same Origin Policy for JavaScript".
While a working group within the W3C has proposed a new Access Control for Cross-Site Requests recommendation, the restriction remains in effect in most mainstream browsers.
I found some information on the Mozilla Developer Network that may provide a better explanation.
In your case, it appears that you are using the Prototype JavaScript framework, where Ajax.Request still uses the XMLHttpRequest object for its Ajax requests.
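To make the same-origin rule concrete: two URLs share an origin only when their scheme, host, and port all match. The sketch below shows that comparison; `isSameOrigin` is a hypothetical helper, the view.php path is invented for the example, and the code assumes an environment where the standard URL class is available.

```javascript
// Compares scheme, host, and port, which together define an origin.
function isSameOrigin(urlA, urlB) {
  var a = new URL(urlA);
  var b = new URL(urlB);
  return a.protocol === b.protocol &&
         a.hostname === b.hostname &&
         a.port === b.port;
}

// A page on the board's own host may call load.php...
console.log(isSameOrigin("http://host.name.com/bbs/view.php",
                         "http://host.name.com/bbs/comment_db/load.php")); // true

// ...but a page served from anywhere else may not.
console.log(isSameOrigin("http://localhost/crawler.html",
                         "http://host.name.com/bbs/comment_db/load.php")); // false
```

This is why the relative path works from the board's own pages, while the same request made from a page on another host is refused by the browser.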
method:'post'
might well be your problem: the host serving the request likely rejects GET requests, which is all you can throw at it from a browser address bar. If this is what's happening, you'll need to find or install some sort of scripting tool capable of doing the job (Perl would be my choice, and unless you're running Windows, you'll already have that).
I do have to wonder whether what you're trying to do is legit, though: trawling other sites' comment databases isn't usually encouraged.
I would solve this by running a PHP script locally that will do the crawling from outside pages. That way jQuery doesn't have to go to an outside domain.

Resources