How can I use Golang's net/http http.Get(url string) but block certain URLs and resources from the request?
E.g.
http.Get("https://google.com") // But somehow block the main CSS file.
You do not need to block URLs and resources because net/http Get() will not automatically perform fetching of included links or resources.
You are probably confusing it with how a browser fetches a URL. A browser issues a request and then follows up by fetching all the referenced resources (JavaScript/CSS/images/videos etc.). Go's net/http is much lower level - it is more like a curl fetch: it follows redirects by default, but otherwise it fetches only the single response to the GET request. You can think of the result of `http.Get("https://google.com")` as similar to what you see in the browser as the page source (plus HTTP headers and the response code). This response will likely include a number of other URLs for links and resources - if you like, you can parse them out and request some or all of them (leaving out what you would want to "block"), the way low-level web crawlers do.
Related
I am building an application that will allow users to upload images. Mostly, it will be used from mobile browsers on slow internet connections. I was wondering if there are best practices for this. Is doing some encoding/encryption before the transfer and then decoding on the server a trick worth trying? Or something else?
You would want something that preferably supports resumable uploads. Since your connection is slow, you'd need something that can resume where you left off. A library I've come across over the years is the Nginx upload module:
http://www.grid.net.ru/nginx/upload.en.html
According to the site:
The module parses request body storing all files being uploaded to a directory specified by upload_store directive. The files are then being stripped from body and altered request is then passed to a location specified by upload_pass directive, thus allowing arbitrary handling of uploaded files. Each of file fields are being replaced by a set of fields specified by upload_set_form_field directive. The content of each uploaded file then could be read from a file specified by $upload_tmp_path variable or the file could be simply moved to ultimate destination. Removal of output files is controlled by directive upload_cleanup. If a request has a method other than POST, the module returns error 405 (Method not allowed). Requests with such methods could be processed in alternative location via error_page directive.
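Based only on the directives quoted above, a configuration sketch might look like the following; the location names and paths are placeholders, so check the module's own documentation for the exact syntax:

```nginx
# Hypothetical sketch using the directives described above.
location /upload {
    # Store uploaded files in this directory (upload_store).
    upload_store /tmp/nginx_uploads;
    # Pass the altered request (files stripped out) to this location.
    upload_pass /internal_handler;
    # Replace each file field with fields describing the stored file.
    upload_set_form_field $upload_field_name.name "$upload_file_name";
    upload_set_form_field $upload_field_name.path "$upload_tmp_path";
    # Remove temporary files when the backend answers with these codes.
    upload_cleanup 400 404 499 500-505;
}
```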
I created an ASP.NET MVC4 Web API service (REST) with a single GET action. The action currently needs 11 input values, so rather than passing all of those values in the URL, I opted to encapsulate those values into a single class type and have it passed as Content-Body. When I test in Fiddler, I specify the verb as GET, and enter the JSON text in the "Request Body" input box. This works great!
The problem is when I attempt to perform load testing in Visual Studio 2010 Ultimate. I am able to specify the GET action and the JSON content body just fine. But when I run the load test, VS reports exceptions of type ProtocolViolationException (Cannot send a content-body with this verb-type) in the test results. The test executes in 1ms, so I suspect the exceptions are causing the test to immediately abort. What can I do to avoid those exceptions? I'd prefer not to change my API to use URL arguments just to work around the test tooling. If I should change the API for other reasons, let me know. Thanks!
I found it easier to put this answer rather than carry on the discussions.
Sending content with GET is not defined in RFC 2616, yet it is not prohibited either. So as far as the spec is concerned, we are in territory where we have to use our own judgement.
GET is canonically used to get a resource. So you are retrieving this resource using this verb with the parameters you are sending. Since GET is both safe and idempotent, it is ideal for caching. Caching usually takes place based on the resource URI - and sometimes based on various headers. The point is that cache implementations - AFAIK - would not use the GET content (and, to be honest, I have not seen a GET with a body in the real world). And it would not make sense to include the content in the cache-key generation, since that would reduce the scalability of caches.
If you have parameters to send, they should be in the URI, since they are part of what identifies the resource. As such, I strongly believe sending content with GET is wrong.
Even implementations such as OData put the query criteria in the URI. I cannot imagine your (or any) application's requirements being beyond OData's query requirements.
The following question is about a caching framework - one to be implemented, or an existing one - for the REST-inspired behaviour described below.
The goal is that GET and HEAD requests should be handled as efficiently as requests to static pages.
In terms of technology, I think of Java Servlets and MySQL to implement the site. (But emergence of good reasons may still impact my choice of technology.)
The web pages should support GET, HEAD and POST; GET and HEAD being much more frequent than POST. The page content will not change with GET/HEAD, only with POST. Therefore, I want to serve GET and HEAD requests directly from the file system and only POST requests from the servlet.
A first (slightly incomplete) idea is that the POST request would pre-calculate the HTML for successive GET/HEAD requests and store it into the file system. GET/HEAD then would always obtain the file from there. I believe that this could easily be implemented in Apache with conditional URL rewriting.
The more refined approach is that GET would serve the HTML from the file system (and HEAD use it, too), if there is a pre-computed file, and otherwise would invoke the servlet machinery to generate it on the fly. POST in this case would not generate any HTML, but only update the database appropriately and delete the HTML file from the file system as a flag to have it generated anew with the next GET/HEAD. The advantage of this second approach is that it handles more gracefully the “initial phase” of the web pages, where no POST has been called yet. I believe that this lazy-generate-and-store approach could be implemented in Apache by providing an error-handler, which would invoke the servlet in case of “file-not-found-but-should-be-there”.
In a later round of refinement, to save bandwidth, the cached HTML files should also be available in a gzip-ed version which is served when the client understands that. I believe that the basic mechanisms should be the same as for the uncompressed HTML files.
Since there will be many such REST-like pages, both approaches might occasionally need some mechanism to garbage-collect rarely used HTML files in order to save file space.
To summarise, I am confident that my GET/HEAD-optimised architecture can be cleanly implemented. I would like to have opinions on the idea as such in the first place (I believe it is good, but I may be wrong) and whether somebody has already experience with such an architecture, perhaps even knows a free framework implementing it.
Finally, I'd like to note that client caching is not the solution I am after, because multiple different clients will GET or HEAD the same page. Moreover, I want to absolutely avoid the servlet machinery during GET/HEAD requests in case the pre-computed file exists. It should not even be invoked to provide cache-related HTTP headers in GET/HEAD requests nor dump a file to output.
The questions are:
Are there better (standard) mechanisms available to reach the goal stated at the beginning?
If not, does anybody know about an existing framework like the one I consider?
I think that an HTTP cache does not reach my goal. As far as I understand it, the HTTP cache would still need to invoke the servlet with a HEAD request in order to learn whether a POST has meanwhile changed the page. Since page changes will come at unpredictable points in time, an HTTP header stating an expiration time is not good enough.
Use the Expires HTTP header and/or HTTP conditional requests.
Expires
The Expires entity-header field gives the date/time after which the response is considered stale. A stale cache entry may not normally be returned by a cache (either a proxy cache or a user agent cache) unless it is first validated with the origin server (or with an intermediate cache that has a fresh copy of the entity). See section 13.2 for further discussion of the expiration model.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Conditional Requests
Decorate cacheable responses with Expires, Last-Modified and/or ETag headers. Make requests conditional with the If-Modified-Since, If-None-Match, and other If-* headers (see the RFC).
e.g.
Last response headers:
...
Expires: Wed, 15 Nov 1995 04:58:08 GMT
...
Don't perform a new request for the resource before the expiration date (the Expires header); after that, perform a conditional request:
...
If-Modified-Since: Wed, 15 Nov 1995 04:58:08 GMT
...
If the resource wasn't modified, a 304 Not Modified response code is returned and the response has no body. Otherwise, 200 OK is returned along with the body.
Note: HTTP RFC also defines Cache-Control header
See Caching in HTTP
http://www.w3.org/Protocols/rfc2616/rfc2616-sec13.html
As some of you may already know, there are caching issues in Firefox/Chrome for requests initiated by the XmlHttpRequest object. These issues mean that the browser does not strictly follow the rules and does not go to the server for a new XSLT file (for example). The response does not have an Expires header (for performance reasons we can't use one).
Firefox has an additional parameter in the XHR object, "channel", on which you set the value Components.interfaces.nsIRequest.LOAD_BYPASS_CACHE to go to the server explicitly.
Does something like that exist for Chrome?
Let me immediately stop everyone who would recommend adding a timestamp or a random integer as the value of a GET parameter - I don't want the server to get different URL requests. I want it to get the original URL. The reason is that I want to protect the server from getting too many different requests for simple static files, and from sending too much data to clients when it is not needed.
If you hit a static file with a generated GET parameter (like '?forcenew=12314'), it renders a 200 response the first time and a 304 for every following request with that same value of the random integer. I want to make requests that will always return 304 if the target static file is identical to the client's version. This is, BTW, how web browsers should work out of the box, but XHR objects tend not to go to the server at all to ask whether the file has changed.
In my main project at work I had the exact same problem. My solution was not to append random strings or timestamps to GET requests, but to append a specific string to GET requests.
If you have a revision number, e.g. a Subversion revision or the equivalent from git/Mercurial or whatever you are using, append that. Static files will get 304 responses until the moment a new revision is released. When the new release happens, a single 200 response is granted and then it is back to happily generating 304 responses. :-)
This has the added bonus of being browser independent.
Should you be unlucky and not have a revision number, then make one up and increment it each time you make a release.
You should look into ETags. ETags are keys that can be generated from the contents of a file, so once the file on the server changes there will be a new ETag. Obviously this will be a server-side change, which is something you will need to do given that you want a 200 and then subsequent 304s. Chrome and FF should respect these ETags, so you shouldn't need to do any crazy client-side hacks.
Chrome now supports Cache-Control: max-age=0 request HTTP header. You can set it after you open an XMLHttpRequest instance:
xhr.setRequestHeader( "Cache-Control", "max-age=0" );
This instructs Chrome not to use a cached response without revalidation.
For more information check The State of Browser Caching, Revisited by Mark Nottingham and RFC 7234 Hypertext Transfer Protocol (HTTP/1.1): Caching.
I'm doing an AJAX download that is being redirected. I'd like to know the final target URL the request was redirected to. I'm using jQuery, but also have access to the underlying XMLHttpRequest. Does anyone know a way to get the final URL?
It seems like I'll need to have the final target insert its URL into a known location in the headers or response body, then have the script look for it there. I was hoping to have something that would work regardless of the target though.
Additional note: I'm asking how my code can get the full url from production code, which will run from the user's system. I'm not asking how I can get the full url when I'm debugging.
The easiest way to do this is to use Fiddler or Wireshark to examine the HTTP traffic. Use Fiddler at the client if your interface uses a browser, otherwise use Wireshark to capture the traffic on the wire.
One word - Firebug. It is a Firefox plugin. Never do any kind of AJAX development without it.
Activate Firebug and select Net, then perform your AJAX request. This will show the URL that is called, the entire request (header and body) and the entire response (once again, header and body). It also allows you to step through your JavaScript and debug it - breakpoints, watches, etc.
I'll second the Firebug suggestion. You'll see the url as the "Location" header in the http response.
It sounds like you also want to get this URL in JS? If so, you can get it off the XHR response object in the callback (which you can also inspect using Firebug!). :)