How to crowd-source my web crawling with AJAX?

My web application requires downloading content from a user-specified URL.
Currently this request goes through my server, which is inefficient and could get my server IP blocked.
Is there a way to let the user download the URL content directly?
The same-origin policy seems to prevent using AJAX or an iframe to download and reuse this content.
Any ideas? For example, is there a way via Flash to download and reuse URL content?

You could use Tor to mask your requests, but if you're going to such lengths to crawl a website, perhaps you shouldn't be doing it?
Also, with your approach the iframe request will include your page URL as the referrer, which makes identifying these requests on the server end pretty straightforward...

If it's a specific website, I recommend talking to the website operators rather than trying to crawl it anonymously.

Related

How do I proxy API requests in a JAMstack solution?

I'm developing a site that's virtually entirely static. I use a generator to create all the HTML.
However, my site is a front end to a store embedded in its pages. I have a little node.js server proxying requests on behalf of the browser to the back-end store. All it does is provide the number of items in the shopping cart so I can keep that number updated on all pages of my site. That's needed because the browser doesn't allow cross-domain scripting, so my server has to act as a proxy between the client and the store.
(The embedded store is loaded from the store's web site and so itself does not require proxying.)
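Roughly, the proxy amounts to something like this (a simplified sketch; the store URL and route name are made up for illustration, and Node 18+ is assumed for the global fetch):

const express = require('express');
const app = express();

// Hypothetical endpoint: fetch the cart count server-side, so the browser
// never has to make a cross-origin request itself
app.get('/api/cart-count', async (req, res) => {
  const response = await fetch('https://store.example.com/cart/count');
  res.json(await response.json());
});

app.listen(3000);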
I was hoping to eventually deploy to Netlify or some similar JAMstack provider. But I don't see how I'd proxy on Netlify.
What is the standard solution to this problem? Or is proxying unavailable to JAMstack solutions? Are there JAMstack providers that solve this problem?
Netlify does allow proxy rewrites, using redirect rules with status code 200.
You can store your proxy redirects in a _redirects file at the root of your deployed site. In other words, the file needs to exist at the root of the publish directory after a build.
_redirects
/api/* https://api.example.com/:splat 200
So a call to:
/api/v1/gifs/random?tag=cat&api_key=your_api_key
will be proxied to:
https://api.example.com/v1/gifs/random?tag=cat&api_key=your_api_key
If the API supports standard HTTP caching mechanisms like ETag or Last-Modified headers, the responses will even get cached by CDN nodes.
NOTE: you can also set up your redirects in your netlify.toml.
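For reference, the equivalent rule in netlify.toml would look something like this (same hypothetical API host as above):

[[redirects]]
  from = "/api/*"
  to = "https://api.example.com/:splat"
  status = 200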

How to allow API calls only from the same origin/domain?

So I have run into a problem. I am building a JavaScript file which appends an iframe into a client's page (within a div). Let's say this iframe is loaded from http://example.com/iframe. The iframe's backend module (to handle form submission; written in Spring) has some endpoints like http://example.com/url1 and http://example.com/url2.
Now, I want only the iframe to be able to communicate with the backend APIs. Currently, I can hit the iframe's backend APIs from my local machine too.
I have come across the Referer HTTP field, and initially I was planning to set a Referer filter on the APIs, but later found out that this can easily be spoofed. Will I get any benefit from setting a CORS header on the APIs with the origin set to http://example.com? Will it work, and if yes, is this a safe and dependable solution? Are there any better alternatives?
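For example, I was considering something like the following (an illustrative Express sketch just to show the idea; my real backend is Spring):

const express = require('express');
const app = express();

app.use((req, res, next) => {
  // Lets browser scripts served from http://example.com read these responses;
  // this governs what the browser exposes to scripts, not who can send
  // requests (curl etc. can still call the endpoints directly)
  res.set('Access-Control-Allow-Origin', 'http://example.com');
  next();
});

app.post('/url1', (req, res) => res.json({ ok: true }));
app.listen(8080);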

Google shows some of my pages in https

I had SSL installed on my site for a day and then uninstalled it. Now in the SERPs, Google shows some of the pages in the HTTPS version. I don't think this is a duplicate content issue, because only one version of each page shows up in the SERPs. I'm not sure how to fix this because a mix of HTTP and HTTPS shows up.
Why does Google randomly show the HTTPS version of my pages?
How can I tell Google to use the HTTP version of those random pages?
Thanks in advance!
I think more information is needed to fully answer this question; however, most likely Google is showing the HTTPS version of your page because your site was using HTTPS at the time Google visited and indexed it, so the HTTPS version was cached.
You can partially manage your search presence and request that Google re-index your site from Google Webmaster Tools: https://www.google.com/webmasters/tools/. However, the change isn't guaranteed and won't take place immediately.
Furthermore, you can control whether a visitor uses HTTPS or HTTP when visiting your site via standard redirects. In other words, update your web server or web application so that if someone visits http://mysite they are redirected (preferably with a 301 status code, for SEO reasons) to https://mysite.
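As a sketch, in an Express app (assumed here purely for illustration; Apache or Nginx have equivalent rewrite rules) the redirect could look like:

const express = require('express');
const app = express();

// Send all plain-HTTP requests to the HTTPS equivalent with a permanent redirect
app.use((req, res, next) => {
  if (!req.secure) {
    return res.redirect(301, 'https://' + req.headers.host + req.originalUrl);
  }
  next();
});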
Finally, as a general practice, you should really use HTTPS as browsers may highlight warnings for non-secured traffic in the future. For sites I manage, "http" requests are automatically redirected to the "https" of the same URL to ensure all traffic is always over a secure connection.

Can we read/write localStorage data using JavaScript from an HTTPS web page?

I need to write and read data from localStorage when opening an HTTPS web page. Is this possible, or does Safari's security not allow it?
Yes.
Local Storage works regardless of the protocol used to load the page.
Note that HTTPS and non-HTTPS pages cannot interact with each other due to the same-origin policy.
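A quick sketch (plain browser JavaScript; the key name is arbitrary). The try/catch is worth having because some configurations, notably older Safari in private-browsing mode, throw on setItem:

try {
  // Works the same on http:// and https:// pages, but each origin has its own store
  localStorage.setItem('lastVisit', new Date().toISOString());
  console.log('Stored:', localStorage.getItem('lastVisit'));
} catch (e) {
  console.warn('localStorage unavailable:', e);
}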

HTTP site with JSONP API over HTTPS?

Given all the coverage FireSheep has been getting, I have been trying to work out the best practices for balancing HTTP / HTTPS usage for some sites I manage (e.g. blogging sites, magazine sites with user-contributed comments).
To me, it's overkill to deliver all pages over HTTPS just because the user is logged in. If a page is public (e.g. a blog), there is little point encrypting it. All I want to do is prevent session hijacking by sniffing cookies over HTTP channels.
So, one plan is:
Login form is over HTTPS
Issue two cookies: one cookie is 'public' and identifies the user for read-only aspects (e.g. 'Welcome, Bob!'). The second cookie is private and 'HTTPS only'. This is the cookie that is verified whenever the user makes a change (e.g. adds a comment, deletes a post).
This means that all 'changing' requests must be issued over HTTPS.
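In cookie terms, the plan would look something like this (an Express-style sketch; the 'HTTPS only' part is the cookie's Secure flag, and the names are illustrative):

const express = require('express');
const app = express();

app.post('/login', (req, res) => {
  // After a successful login over HTTPS:
  res.cookie('publicId', 'bob'); // sent over HTTP too; display purposes only
  res.cookie('authToken', 'opaque-session-token', {
    secure: true,   // the browser only sends this cookie over HTTPS
    httpOnly: true  // and never exposes it to page JavaScript
  });
  res.redirect('/');
});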
We use a lot of AJAX. Indeed, many comment forms use AJAX to post the content.
Obviously, I can't use AJAX directly to post content to an HTTPS backend from an HTTP frontend.
My question is: can I use script injection (I think this is commonly called 'JSONP'?) to access the API? So in this case there would be an HTTP public page that sends data to the private backend by injecting a script accessed via HTTPS (so that the private cookie is visible in the request).
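In other words, something like this sketch (the endpoint and callback names are hypothetical):

// On the HTTP page: inject a script tag that points at the HTTPS API; the
// browser attaches the HTTPS origin's cookies to the request
function postComment(text) {
  const script = document.createElement('script');
  script.src = 'https://example.com/api/comment?text=' +
    encodeURIComponent(text) + '&callback=onCommentSaved';
  document.head.appendChild(script);
}

// The server would reply with a script body like: onCommentSaved({"ok": true})
function onCommentSaved(result) {
  console.log('Saved?', result.ok);
}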
Can you have HTTPS content inside a HTTP page? I know you get warnings the other way around, but I figure that HTTPS inside HTTP is not a security breach.
Would that work? It seems to work in Chrome and Firefox, but it's IE that would be the party pooper!
Another way is to have an iframe which points to an HTTPS page that can make all kinds of Ajax calls (GET, POST, PUT, etc.) to the server over HTTPS (same domain, as the iframe is on HTTPS too). Once the response is back inside the iframe, you can post a message back to the main window using the HTML5 postMessage API.
Pseudo code:
<iframe src="https://<hostname>/sslProxy"></iframe>

Inside sslProxy:
MakeAjaxyCall('GET', 'https://<hostname>/endpoint', function (response) {
  // Relay the response to the parent (HTTP) window; 'domain' should be the
  // parent page's origin, so the data can't be read by an unexpected window
  top.postMessage(response, domain);
});
This works in all modern browsers except IE <= 7, for which you'll have to resort to either JSONP or cross-domain communication using Flash.
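On the receiving end, the main (HTTP) window listens for the relayed message; a sketch, where checking event.origin is the important part:

// In the main page: accept messages only from the HTTPS proxy frame
window.addEventListener('message', function (event) {
  // Replace with the HTTPS origin the proxy iframe is served from
  if (event.origin !== 'https://<hostname>') return; // ignore other senders
  console.log('Response from the HTTPS proxy:', event.data);
});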
The problem with JSONP is that you can only use it for GETs.
"Can you have HTTPS content inside a HTTP page? I know you get warnings the other way around, but I figure that HTTPS inside HTTP is not a security breach."
Including HTTPS content inside a regular HTTP page won't raise any alerts in any browser.
However, I don't think JSONP will help you out of this one. Using GETs to post content and modify data is a very bad idea, and prone to other attacks like CSRF.
