Remove all external resources from HTML with Nokogiri - ruby

I want to remove all external resources from an HTML file.
I am using wget to make local copies of a page. Wget has options to convert links to the local file system, and that works reasonably well, but some links (at the end of the download depth, I believe) keep their external src, so they still contain http.
The closest I could get to finding everything that contains http is this:
doc.search("//*[starts-with(@href, 'http')]")
But that only finds elements by their href attribute, and http can also appear in images, videos and anything else.
Any ideas what the right Nokogiri instructions would be to find everything that contains http?
Thanks.

If you simply want to expand your search to elements with any attribute starting with 'http', you can do this:
doc.search("//*[@*[starts-with(.,'http')]]")
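
From there, a minimal Ruby sketch of removing the external references, assuming the page was saved locally as page.html (the file name, and the choice to strip the offending attributes rather than remove whole elements, are just illustrative):

require 'nokogiri'

doc = Nokogiri::HTML(File.read('page.html'))

# Find every element that has at least one attribute value starting with "http"
doc.search("//*[@*[starts-with(.,'http')]]").each do |node|
  # Drop only the attributes that still point at external resources
  node.attribute_nodes.each do |attr|
    attr.remove if attr.value.start_with?('http')
  end
end

File.write('page_local.html', doc.to_html)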

Related

404 Page Not Loading Images (File Source / Layers Problem)

I'm facing what could be a simple problem, but haven't been able to find any solutions yet.
I just recently built a simple 404 page on my website (nofound.html) which, through a .htaccess redirection (ErrorDocument 404 /nofound.html), lets me catch all URL errors and such. Basic.
The problem is that, since the 404 response can be triggered from different directories (for example the index, /dir1, /dir1/dir2, etc.), the page does not load its styles and images correctly, because the relative source paths no longer resolve (image.png resolves correctly from the root nofound.html, but would need to become ../image.png and so on from deeper directory levels).
I've managed to make the styles load correctly by loading the same stylesheet twice (styles.css AND ../styles.css), but for the images, I have not found any workaround yet (other than duplicating the image within layered directories, which is cumbersome and redundant).
Any thoughts? Thanks in advance!
Although it does not follow best practices, my final solution was to simply use absolute hyperlinks (such as https://website/img.png or something like that) rather than relative links (../../img.png) within the website's anchors.
That way, every element on the page is loaded from its absolute (actual) location, regardless of the relative relationship between the nofound.html result and the rest of the site architecture.
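For example (the domain below is only a placeholder), the 404 page references its assets like this, no matter which directory triggered it:

<link rel="stylesheet" href="https://www.example.com/styles.css">
<img src="https://www.example.com/img.png">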

Recursive wget: alter links

I am trying to optimize my AJAX fragment links for the Google crawler (which substitutes "#!..." links with "?_escaped_fragment_=..." as described here). I want to check if the entire site is accessible via the _escaped_fragment_ links I have implemented.
I am curious if I can use wget's recursive site download to this end and make it substitute "#!" links with "_escaped_fragment_", so that wget sees
abc.com?_escaped_fragment_=arg=value
instead of
abc.com#!arg=value
No, you can't. The string after # is never sent to the server; it only exists on the client, for JavaScript routing.
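
Since wget never even sees the fragment, one workaround is to generate the _escaped_fragment_ URLs yourself and feed them to wget. A rough Ruby sketch, assuming you have collected the hash-bang URLs in a file called urls.txt (the file name and the wget flags are only illustrative):

File.readlines('urls.txt').map(&:chomp).each do |url|
  base, fragment = url.split('#!', 2)
  next unless fragment
  # Google's scheme: the part after #! moves into _escaped_fragment_
  # (special characters such as %, #, & and + may also need percent-encoding)
  escaped = "#{base}?_escaped_fragment_=#{fragment}"
  system('wget', '-p', escaped)
end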

How do I set caching headers for my CSS/JS but ensure visitors always have the latest versions?

I'd like to speed up my site's loading time in part by ensuring all CSS/JS is being cached by the browser, as recommended by Google's PageSpeed tool. But I'd also like to ensure that visitors get the latest CSS/JS files if they have been updated and the browser's cache still contains old code.
From my research so far, appending something like "?459454" to the end of the CSS/JS url is popular. But wouldn't that force the visitor's browser to re-download the CSS/JS file every time?
Is there a way to set the files to be cached by the browser, but ensure the browser knows about updated versions of the cached files?
If you're using Apache, you can use mod_pagespeed (mentioned earlier by symcbean) to do this automatically.
It works best if you also use the ModPagespeedLoadFromFile directive, since that creates a new URL as soon as it detects that the resource has changed on disk; however, it works fine without it (it will use the cache expiry time returned when it fetches the resource to decide when to rewrite it).
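A sketch of what that configuration might look like (the URL prefix and filesystem path are placeholders):

ModPagespeedLoadFromFile "http://www.example.com/static/" "/var/www/static/"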
If you're using nginx, you could use ngx_pagespeed.
If you're using IIS, you could use IISpeed, which is not a Google product and whose full feature set I don't know.
Version numbers will work, but you can also append a hash of the file to the filename with your web framework or asset build script:
<script src="script-5054a101c8b164cbfa570d97fe23cc0d.js"></script>
That way, once your HTML changes to reflect this new version, browsers will just download and cache the updated version of your script.
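A rough sketch of how such a build step might produce the fingerprinted name in Ruby (the file names are only illustrative):

require 'digest'
require 'fileutils'

source = 'script.js'
digest = Digest::MD5.file(source).hexdigest
versioned = "script-#{digest}.js"

# Copy the asset to its fingerprinted name and emit the matching tag
FileUtils.cp(source, versioned)
puts %(<script src="#{versioned}"></script>)

Whenever script.js changes, the hash (and therefore the URL) changes, so far-future cache headers are safe on the fingerprinted file.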
As you say, append a query string to the URL of the asset, but only change it if the content is different, or change it when you deploy a new version.
appending something like "?459454" to the end of the CSS/JS url is popular. But wouldn't that force the visitor's browser to re-download the CSS/JS file every time?
No, it won't force them to download it each time. However, there are a lot of intermediate proxies out there which ignore query strings on cacheable content, hence many tools (including mod_pagespeed, which does automatic URL rewriting based on file contents, plus on-the-fly content merging and lots of other cool tricks) move the version information into the path / filename.
If you've only got .htaccess-type access, you can strip the version information out to map directly to a file, or use a scripted 404 redirector (but this is probably only a good idea if you're behind a caching reverse proxy).
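
For example, a purely illustrative .htaccess rewrite that maps a fingerprinted name such as script-5054a101c8b164cbfa570d97fe23cc0d.js back to the real file on disk might look something like this:

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Strip a 32-character hex fingerprint so the URL can change without renaming the file
  RewriteRule ^(.+)-[0-9a-f]{32}\.(js|css)$ $1.$2 [L]
</IfModule>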

extract URL from .swf file

I am trying to extract images from the Flash content on the following website: http://meijer.shoplocal.com/meijer/default.aspx?action=entryflash&storeref=120
I noticed that every time I click on "Next image", an image is requested from the server. A sample URL is http://akimages.shoplocal.com/dyn_rppi/740.0.75.0/meijer/large/110206os_o_003_T1C1_2pw26.jpg
So, this URL is exactly what I need, but I don't know how to extract all these URLs from the .swf file I have. I don't have any experience with Flash, but I think the URLs should be in the .swf file. I tried "grep '110206os_o_003_T1C1_2pw26' adspage_slider-2.swf", but didn't get any result :(((
Ivan,
Did you try a Flash decoder? It should allow you to access the code and the respective resources. Another, possibly easier, way would be to use Fiddler2 to capture the URLs that the swf requests as you click through it. Still, before you go any further, make sure that you're not breaking any of the site's Terms and Conditions.
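One likely reason the grep found nothing is that most .swf files are compressed: a file starting with "CWS" has a zlib-compressed body, while "FWS" is uncompressed. A rough Ruby sketch (using the file name from the question; the URL pattern is only a guess, and URLs built dynamically in ActionScript still won't show up) that inflates the body before searching:

require 'zlib'

data = File.binread('adspage_slider-2.swf')
# In a compressed swf only the first 8 bytes (signature, version, length) are
# stored uncompressed; the rest is a zlib stream.
body = data.start_with?('CWS') ? Zlib::Inflate.inflate(data[8..-1]) : data
puts body.scan(%r{https?://[^\s"']+})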

Is it safe to serve an image on the web without an extension?

I'm treating all *.jpg files as static, but I need to serve a few dynamically. Can I simply omit the extension so I don't have to get fancy with my url rules? Is it enough to just set the file type in the header?
I've never had a problem serving dynamic images with a strange extension or no extension at all. Querystrings are also fine.
It will be enough for the headers to be correct and the binary file to be correctly formed. When you do this, make sure you also set the Content-Disposition header to a reasonable file name, so people don't end up downloading your files under crazy query-string names (which Windows users will be unable to save, since they will most likely contain a "?").
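
A minimal sketch of the idea in Ruby (Sinatra is used purely for illustration; the route and the way the file is looked up are assumptions):

require 'sinatra'

# Serve /img/12345 with correct headers even though the URL has no extension
get '/img/:id' do
  path = File.join('images', "#{params[:id]}.jpg")  # however you actually look the file up
  halt 404 unless File.file?(path)
  send_file path,
            type: 'image/jpeg',                      # the correct Content-Type is what matters
            disposition: :inline,
            filename: "#{params[:id]}.jpg"           # sane name if someone saves it
end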
Instead of omitting the extension on your server, activate content negotiation (e.g. Options +MultiViews if you're using Apache) and omit the extensions in your URIs. That way, Apache will decide which file to serve; you could have an image in both PNG and SVG format and serve the one accepted by the browser.
Generally, a correct Content-type header is enough.
