Are wildcards allowed in sitemap.xml file? - sitemap

I have a website that has a directory that contains 100+ html files.
I want crawlers to crawl all the html files in that directory.
I have already added the following line to my robots.txt:
Allow: /DirName/*.html$
Is there any way to include the files in the directory in sitemap.xml file so that all html files in the directory will get crawled?
Something like this:
<url>
<loc>MyWebsiteName/DirName/*.html</loc>
</url>

The sitemap protocol neither restricts nor allows the use of wildcards; to be honest, this is the first time I have heard of the idea. Also, I'm fairly sure that search engines can't make use of wildcards in sitemaps.
Please take a look at Google's recommendation of sitemap generators. There are tons of tools you can create a sitemap with in a blink of an eye.

Wildcards are not allowed. If you run PHP on your server, you could list all files in the directory and generate sitemap.xml automatically using DirectoryIterator.
// This assumes you already have a Sitemap class.
$sitemap = new Sitemap;
// iterate the directory
foreach(new DirectoryIterator('/MyWebsiteName/DirName') as $directoryItem)
{
// Filter the item
if(!$directoryItem->isFile()) continue;
// New basic sitemap.
$url = new Sitemap_URL;
// Set arguments.
$url->set_loc(sprintf('/DirName/%1$s', $directoryItem->getBasename()))
->set_last_mod($directoryItem->getMTime())
->set_change_frequency('daily')
->set_priority(1);
// Add it to sitemap.
$sitemap->add($url);
}
// Render the output.
$response = $sitemap->render();
// Cache the output for 24 hours.
$cache->set('sitemap', $response, 86400);
// Output the sitemap.
echo $response;
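For readers not running PHP, the same idea can be sketched in Python: enumerate the .html files in the directory and emit one <url> entry per file. The function name build_sitemap and the base_url argument are illustrative assumptions, not part of any sitemap library.

```python
import os
from datetime import datetime, timezone
from urllib.parse import quote
from xml.sax.saxutils import escape


def build_sitemap(directory, base_url):
    """Return sitemap XML listing every .html file found directly in `directory`.

    `base_url` is assumed to be the public URL that maps to `directory`.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not (os.path.isfile(path) and name.endswith(".html")):
            continue
        # Use the file's real modification time rather than a fixed value.
        modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
        location = base_url.rstrip("/") + "/" + quote(name)
        lines.append("  <url>")
        lines.append("    <loc>%s</loc>" % escape(location))
        lines.append("    <lastmod>%s</lastmod>" % modified.strftime("%Y-%m-%d"))
        lines.append("  </url>")
    lines.append("</urlset>")
    return "\n".join(lines)
```

Calling build_sitemap('/path/to/DirName', 'http://www.example.com/DirName/') returns the XML as a string, which can then be written to sitemap.xml or cached as in the PHP version.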

Related

When Migrated Catalyst Website Leaves Module Code

A client wants to move a website they made using Adobe Catalyst to a different hosting provider. I was able to copy the entire website via FTP and move it to the new host. Everything looks fine except for many of the links leaving code that looks like this:
{module_contentholder, name="_U309"} {module_contentholder, name="_U299"}
Does anyone know what this is or how to fix it?
Those are references to Content Holders. They work similarly to PHP's include statement, but the file they reference is fixed to a single path: /_System/ContentHolders/.
You will likely come across more tags like that, such as {module_menu} and {tag_pagecontent}. You'll need to manually adapt them to whatever the new host uses. The documentation will help: http://docs.businesscatalyst.com/reference/
The opaque names of the content holders shown in your example indicate the site was likely generated by Adobe Muse, a WYSIWYG editor. I strongly recommend that you find the original .muse project files, and use those to update the site. Muse can compile the site for platforms other than Business Catalyst.
Through research I found out there is no way to get those codes to display properly without editing each page individually. Fortunately, I wrote a PHP script that goes through the code of each page and replaces it automatically.
Step 1: Make a file in the index directory called replacement.php
Step 2: Put this code in
$file = $_GET['file'];
$path = '/path/to/public_html/' . $file;
$file_contents = file_get_contents($path);
preg_match_all("/{module(.*?)}/", $file_contents, $matches);
foreach($matches[0] as $match) {
if(preg_match('/\"([^\"]*?)\"/', $match, $query)) {
$queryNew = str_replace("\"", "", $query[0]);
$queryPath = '/path/to/public_html/_System/ContentHolders/' . strtolower($queryNew) . '.html';
$queryContents = file_get_contents($queryPath);
$file_contents = str_replace($match, $queryContents, $file_contents);
}
}
file_put_contents($path, $file_contents);
Step 3: Replace /path/to/public_html/ with the location of your domain's files.
Step 4: Go to http://www.yourdomain.com/replacement.php?file=index.html to convert the index file. You can change "index.html" in the URL to any other page you want converted.
Hopefully this helps someone else in the future.
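The same replacement can be sketched offline in Python, which avoids exposing a script that rewrites files based on a URL parameter. This is only a sketch under assumptions: the function name expand_content_holders is made up, and it assumes, as the PHP script does, that each holder lives in a lowercased .html file named after the tag's quoted name.

```python
import os
import re

# Matches a {module...} tag and captures the first quoted value inside it.
MODULE_TAG = re.compile(r'\{module[^}]*?"([^"]*)"[^}]*\}')


def expand_content_holders(html, holders_dir):
    """Replace each {module_... name="X"} tag with the contents of holders_dir/x.html.

    Tags whose holder file is missing are left untouched.
    """
    def replace(match):
        # basename() keeps a crafted name from escaping holders_dir.
        holder_name = os.path.basename(match.group(1)).lower() + ".html"
        holder_path = os.path.join(holders_dir, holder_name)
        if not os.path.isfile(holder_path):
            return match.group(0)
        with open(holder_path, encoding="utf-8") as handle:
            return handle.read()

    return MODULE_TAG.sub(replace, html)
```

Run it over each page's text with holders_dir pointing at the copied _System/ContentHolders directory, then write the result back to the page.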

Magento, Split sitemap.xml and cron job

I am trying to split my sitemap.xml because Google webmaster tools only allows sitemap.xml to be less than 50k urls. I have the following code placed in:
app\code\local\Mage\Sitemap\Model\Sitemap.php to split sitemap.xml if the file contains more than 50k urls.
public function check_counter(&$io) {
static $counter;
$counter++;
$tRec = 50000; // total record per file
if ( ($counter % $tRec) == 0 ){
$io->streamWrite('</urlset>');
$io->streamClose();
$filename = preg_replace('/\.xml/', '-'.
round($counter/$tRec).
'.xml', $this->getSitemapFilename());
$io->streamOpen($filename);
$io->streamWrite('<?xml version="1.0" encoding="UTF-8"?>'."\n");
$io->streamWrite('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">');
}
}
Everything works fine if I manually generate sitemap from Magento admin panel. It will create 2 files: Sitemap.xml (50,000 urls) and Sitemap-1.xml (12,312 urls)
I have set up a cron job to generate sitemaps every night. The problem is that the cron job doesn't seem to follow the code. It generates 2 files: sitemap-1.xml (I don't know how many URLs, but definitely more than 50,000, because Google gives me an error saying I have too many URLs in this file) and sitemap.xml (a couple of hundred URLs).
What's wrong with the code? Or what's wrong with my cron job?
EDIT:
I put
$this->check_counter($io);
after each
$io->streamWrite($xml);
in
public function generateXml()
A little late to the party, but here is our solution for this issue.
We created a separate module to house the changes (not just moving it to local) ex: Namespace_Modulename
In the app/etc/modules/Namespace_Modulename.xml file we added a depends statement as follows:
<depends>
<Mage_Sitemap/>
</depends>
Once we did this, the cron job worked properly.
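Independently of Magento, the splitting rule itself is simple enough to sketch in Python: cut the URL list into chunks of at most 50,000 (the per-file limit in the sitemap protocol) and list the chunks in a sitemap index. The function name split_sitemap and the file naming scheme are assumptions for illustration; this is not Magento's implementation.

```python
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 50000  # per-file limit in the sitemap protocol
XML_HEADER = '<?xml version="1.0" encoding="UTF-8"?>\n'
NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_sitemap(urls, base_url, max_urls=MAX_URLS_PER_FILE):
    """Return {filename: xml} for numbered sitemap files plus a sitemap index."""
    files = {}
    for start in range(0, len(urls), max_urls):
        chunk = urls[start:start + max_urls]
        name = "sitemap-%d.xml" % (start // max_urls + 1)
        entries = "".join("<url><loc>%s</loc></url>\n" % escape(u) for u in chunk)
        files[name] = (
            XML_HEADER + '<urlset xmlns="%s">\n' % NAMESPACE + entries + "</urlset>\n"
        )
    # The index is built before it is added, so it lists only the chunk files.
    index_entries = "".join(
        "<sitemap><loc>%s</loc></sitemap>\n" % escape(base_url.rstrip("/") + "/" + name)
        for name in files
    )
    files["sitemap.xml"] = (
        XML_HEADER
        + '<sitemapindex xmlns="%s">\n' % NAMESPACE
        + index_entries
        + "</sitemapindex>\n"
    )
    return files
```

Because the chunking is computed from the full list rather than from a counter kept between calls, the result does not depend on how many times the generator has run in the same process.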

Magento: Which files are used to generate the Package Layout XML object?

There are several layout files in the layout directory of which page.xml gets processed first and local.xml gets processed last. However, it is unclear to me how to determine which of the other xml files in that directory are used in generating the Package Layout XML object. Presumably, for different frontName + controller name + action name a different subcollection of these files is bunched together. Furthermore the order in which these are bunched together may affect blocks like Mage_Core_Block_Text_List which just displays the blocks as they are added.
So, how can I determine whether a specific page request will bunch up a given foo.xml layout file in generating the Package Layout XML object from which the eventual Page Layout XML is derived?
Thanks.
While there are numerous files involved (since module developers add a layout file to the configuration, the configuration is loaded and merged, and then the layout update object reads from the merged configuration to determine which files to load), you're probably looking for this:
#File: app/code/core/Mage/Core/Model/Layout/Update.php
public function getFileLayoutUpdatesXml($area, $package, $theme, $storeId = null)
{
//...
$fileStr = file_get_contents($filename);
$fileStr = str_replace($this->_subst['from'], $this->_subst['to'], $fileStr);
$fileXml = simplexml_load_string($fileStr, $elementClass);
//...
}
Log or var_dump the $filename variable and you'll be able to see which files Magento is trying to load.

Why are images uploaded through the admin panel of my custom module, having the file name of Array?

Here is my BrandController.php
https://gist.github.com/a958926883b9e7cc68f7#file-brandcontroller-php-L53
I've gone through all the files of my custom module and compared them to the ones provided by the custom module maker, and I couldn't find many differences.
Are you attempting to upload multiple files? If you're using multiple fileupload elements with the same name you'll get an array of items.
So when the following line is called,
//this way the name is saved in DB
$data['filename'] = $_FILES['filename']['name'];
It will have the value
["name"]=>array(2) {
[0]=>string(9)"file0.txt"
[1]=>string(9)"file1.txt"
}
you'll need to update the code to loop through each $_FILES['filename']['name'] and upload and save the files separately.
You may have unknowingly uploaded multiple files. If that is not your intention, check the name attribute of the file input in your HTML. It must not be an array, like this:
<input type="file" name="my_files[]" />
If you only see Array in your database, it means you are indeed uploading multiple files. You can process them using loops.
If you are really sure that you are uploading one image, you may follow @Palanikumar's suggestion: use print_r() to display $_FILES and paste the output here. If you don't want to use that, you can use
echo json_encode($data); // $data stands for whatever you are about to insert into the database
If you don't know where to put the print_r() call, you may put it after line 56 of this file.
https://gist.github.com/desbest/a958926883b9e7cc68f7#file-brandcontroller-php-L53
if(isset($_FILES['filename']['name']) && $_FILES['filename']['name'] != '') {
print_r($_FILES);
die;
}
If saveAction() is being called from an Ajax request, you need to log the Ajax response. Assuming you are using jQuery:
var ajaxResponse = $.ajax({type: 'POST', ...});
console.log(ajaxResponse.responseText);
Then you can view it in the browser's console. If nothing appears, you may use a non-async request:
var ajaxResponse = $.ajax({
type: 'POST',
// your options,
// your other options,
async: false
});
Usually file upload will return in array format. So that each uploaded file will have the information like name, type, size, temporary name, error. You can get the file information using print function (print_r($_FILES)). So if you want to display name of the file you have to use something like this $_FILES['filename']['name']
Use print function and debugging tool then save file information using loops.
For more info please check here.
You aren't setting the enctype of the form, so the image will never be sent. Update the code to:
$form = new Varien_Data_Form(array( 'enctype' => 'multipart/form-data'));

How to force browsers to reload cached CSS and JS files?

I have noticed that some browsers (in particular, Firefox and Opera) are very zealous in using cached copies of .css and .js files, even between browser sessions. This leads to a problem when you update one of these files, but the user's browser keeps on using the cached copy.
What is the most elegant way of forcing the user's browser to reload the file when it has changed?
Ideally, the solution would not force the browser to reload the file on every visit to the page.
I have found John Millikin's and da5id's suggestion to be useful. It turns out there is a term for this: auto-versioning.
I have posted a new answer below which is a combination of my original solution and John's suggestion.
Another idea that was suggested by SCdF would be to append a bogus query string to the file. (Some Python code, to automatically use the timestamp as a bogus query string, was submitted by pi..)
However, there is some discussion as to whether or not the browser would cache a file with a query string. (Remember, we want the browser to cache the file and use it on future visits. We only want it to fetch the file again when it has changed.)
This solution is written in PHP, but it should be easily adapted to other languages.
The original .htaccess regex can cause problems with files like json-1.3.js. The solution is to only rewrite if there are exactly 10 digits at the end. (Because 10 digits covers all timestamps from 9/9/2001 to 11/20/2286.)
First, we use the following rewrite rule in .htaccess:
RewriteEngine on
RewriteRule ^(.*)\.[\d]{10}\.(css|js)$ $1.$2 [L]
Now, we write the following PHP function:
/**
* Given a file, e.g. /css/base.css, replaces it with a string containing the
* file's mtime, e.g. /css/base.1221534296.css.
*
* @param $file The file to be loaded. Must be an absolute path (i.e.
* starting with slash).
*/
function auto_version($file)
{
if(strpos($file, '/') !== 0 || !file_exists($_SERVER['DOCUMENT_ROOT'] . $file))
return $file;
$mtime = filemtime($_SERVER['DOCUMENT_ROOT'] . $file);
return preg_replace('{\\.([^./]+)$}', ".$mtime.\$1", $file);
}
Now, wherever you include your CSS, change it from this:
<link rel="stylesheet" href="/css/base.css" type="text/css" />
To this:
<link rel="stylesheet" href="<?php echo auto_version('/css/base.css'); ?>" type="text/css" />
This way, you never have to modify the link tag again, and the user will always see the latest CSS. The browser will be able to cache the CSS file, but when you make any changes to your CSS the browser will see this as a new URL, so it won't use the cached copy.
This can also work with images, favicons, and JavaScript. Basically anything that is not dynamically generated.
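As a rough illustration of how the two halves fit together, here is the same idea sketched in Python: one function inserts the modification time into the URL, and a second mirrors the rewrite rule by stripping a 10-digit timestamp again. The names auto_version and strip_version and the document_root argument are choices made for this sketch.

```python
import os
import re


def auto_version(url_path, document_root):
    """Insert the file's mtime before the extension.

    /css/base.css becomes /css/base.<mtime>.css; paths that are not absolute
    or do not exist under document_root are returned unchanged.
    """
    if not url_path.startswith("/"):
        return url_path
    file_path = os.path.join(document_root, url_path.lstrip("/"))
    if not os.path.isfile(file_path):
        return url_path
    mtime = int(os.path.getmtime(file_path))
    return re.sub(
        r"\.([^./]+)$",
        lambda match: ".%d.%s" % (mtime, match.group(1)),
        url_path,
    )


def strip_version(url_path):
    """Mirror of the rewrite rule: drop a 10-digit timestamp before .css or .js."""
    return re.sub(r"^(.*)\.\d{10}\.(css|js)$", r"\1.\2", url_path)
```

Round-tripping a path through auto_version and then strip_version gives back the original path, which is what lets the server keep a single physical file while the browser sees a new URL after every change.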
Simple Client-side Technique
In general, caching is good... So there are a couple of techniques, depending on whether you're fixing the problem for yourself as you develop a website, or whether you're trying to control cache in a production environment.
General visitors to your website won't have the same experience that you're having when you're developing the site. Since the average visitor comes to the site less frequently (maybe only a few times each month, unless you're a Google or hi5 Networks), then they are less likely to have your files in cache, and that may be enough.
If you want to force a new version into the browser, you can always add a query string to the request, and bump up the version number when you make major changes:
<script src="/myJavascript.js?version=4"></script>
This will ensure that everyone gets the new file. It works because the browser looks at the URL of the file to determine whether it has a copy in cache. If your server isn't set up to do anything with the query string, it will be ignored, but the name will look like a new file to the browser.
On the other hand, if you're developing a website, you don't want to change the version number every time you save a change to your development version. That would be tedious.
So while you're developing your site, a good trick would be to automatically generate a query string parameter:
<!-- Development version: -->
<script>document.write('<script src="/myJavascript.js?dev=' + Math.floor(Math.random() * 100) + '"><\/script>');</script>
Adding a query string to the request is a good way to version a resource, but for a simple website this may be unnecessary. And remember, caching is a good thing.
It's also worth noting that the browser isn't necessarily stingy about keeping files in cache. Browsers have policies for this sort of thing, and they are usually playing by the rules laid down in the HTTP specification. When a browser makes a request to a server, part of the response is an Expires header... a date which tells the browser how long it should be kept in cache. The next time the browser comes across a request for the same file, it sees that it has a copy in cache and looks to the Expires date to decide whether it should be used.
So believe it or not, it's actually your server that is making that browser cache so persistent. You could adjust your server settings and change the Expires headers, but the little technique I've written above is probably a much simpler way for you to go about it. Since caching is good, you usually want to set that date far into the future (a "Far-future Expires Header"), and use the technique described above to force a change.
If you're interested in more information on HTTP or how these requests are made, a good book is "High Performance Web Sites" by Steve Souders. It's a very good introduction to the subject.
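To make the "far-future Expires header" concrete, here is a small Python sketch that computes the header values a server could send for a versioned static file. The function name far_future_headers and the one-year default are assumptions; the Cache-Control header is not mentioned above, but it is the HTTP/1.1 companion to Expires and is commonly sent alongside it.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def far_future_headers(now=None, days=365):
    """Return caching headers that keep a response fresh for `days` days."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=days)
    return {
        # max-age is expressed in seconds.
        "Cache-Control": "public, max-age=%d" % (days * 24 * 60 * 60),
        # usegmt=True yields the HTTP date format ending in "GMT".
        "Expires": format_datetime(expires, usegmt=True),
    }
```

Headers like these only make sense together with a URL that changes when the file changes; otherwise visitors would keep the stale copy for the whole period.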
Google's mod_pagespeed plugin for Apache will do auto-versioning for you. It's really slick.
It parses HTML on its way out of the webserver (works with PHP, Ruby on Rails, Python, static HTML -- anything) and rewrites links to CSS, JavaScript, image files so they include an id code. It serves up the files at the modified URLs with a very long cache control on them. When the files change, it automatically changes the URLs so the browser has to re-fetch them. It basically just works, without any changes to your code. It'll even minify your code on the way out too.
Instead of changing the version manually, I would recommend you use an MD5 hash of the actual CSS file.
So your URL would be something like
http://mysite.com/css/[md5_hash_here]/style.css
You could still use the rewrite rule to strip out the hash, but the advantage is that now you can set your cache policy to "cache forever", since if the URL is the same, that means that the file is unchanged.
You can then write a simple shell script that would compute the hash of the file and update your tag (you'd probably want to move it to a separate file for inclusion).
Simply run that script every time CSS changes and you're good. The browser will ONLY reload your files when they are altered. If you make an edit and then undo it, there's no pain in figuring out which version you need to return to in order for your visitors not to re-download.
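The hashing step could look roughly like this in Python rather than shell. The function name hashed_url and its arguments are invented for the sketch; only hashlib.md5 is a real library call.

```python
import hashlib


def hashed_url(file_path, url_dir, file_name):
    """Build a URL of the form <url_dir>/<md5 of file contents>/<file_name>."""
    with open(file_path, "rb") as handle:
        digest = hashlib.md5(handle.read()).hexdigest()
    return "%s/%s/%s" % (url_dir.rstrip("/"), digest, file_name)
```

Since the hash depends only on the bytes of the file, editing the file and then undoing the edit yields the original URL again, which is the property described above.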
I am not sure why you guys/gals are taking so much pain to implement this solution.
All you need to do is get the file's modified timestamp and append it as a query string to the file.
In PHP I would do it as:
<link href="mycss.css?v=<?= filemtime('mycss.css') ?>" rel="stylesheet">
filemtime() is a PHP function that returns the file modified timestamp.
You can just put ?foo=1234 at the end of your CSS / JavaScript import, changing 1234 to be whatever you like. Have a look at the Stack Overflow HTML source for an example.
The idea there being that the ? parameters are discarded / ignored on the request anyway and you can change that number when you roll out a new version.
Note: There is some argument with regard to exactly how this affects caching. I believe the general gist of it is that GET requests, with or without parameters should be cachable, so the above solution should work.
However, it is down to both the web server to decide if it wants to adhere to that part of the spec and the browser the user uses, as it can just go right ahead and ask for a fresh version anyway.
I've heard this called "auto versioning". The most common method is to include the static file's modification time somewhere in the URL, and strip it out using rewrite handlers or URL configurations:
See also:
Automatic asset versioning in Django
Automatically Version Your CSS and JavaScript Files
The 30 or so existing answers are great advice for a circa 2008 website. However, when it comes to a modern, single-page application (SPA), it might be time to rethink some fundamental assumptions… specifically the idea that it is desirable for the web server to serve only the single, most recent version of a file.
Imagine you're a user that has version M of a SPA loaded into your browser:
Your CD pipeline deploys the new version N of the application onto the server
You navigate within the SPA, which sends an XMLHttpRequest (XHR) to the server to get /some.template
(Your browser hasn't refreshed the page, so you're still running version M)
The server responds with the contents of /some.template — do you want it to return version M or N of the template?
If the format of /some.template changed between versions M and N (or the file was renamed or whatever) you probably don't want version N of the template sent to the browser that's running the old version M of the parser.†
Web applications run into this issue when two conditions are met:
Resources are requested asynchronously some time after the initial page load
The application logic assumes things (that may change in future versions) about resource content
Once your application needs to serve up multiple versions in parallel, solving caching and "reloading" becomes trivial:
Install all site files into versioned directories: /v<release_tag_1>/…files…, /v<release_tag_2>/…files…
Set HTTP headers to let browsers cache files forever
(Or better yet, put everything in a CDN)
Update all <script> and <link> tags, etc. to point to that file in one of the versioned directories
That last step sounds tricky, as it could require calling a URL builder for every URL in your server-side or client-side code. Or you could just make clever use of the <base> tag and change the current version in one place.
† One way around this is to be aggressive about forcing the browser to reload everything when a new version is released. But for the sake of letting any in-progress operations complete, it may still be easiest to support at least two versions in parallel: v-current and v-previous.
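A minimal Python sketch of the versioned-directory scheme, assuming release tags are kept in a list ordered from oldest to newest. Both function names are hypothetical helpers, not part of any framework.

```python
def asset_url(release_tag, path):
    """Build the URL of `path` inside the directory of one release."""
    return "/v%s/%s" % (release_tag, path.lstrip("/"))


def releases_to_keep(deployed_tags, keep=2):
    """Return the newest `keep` tags (v-current and v-previous by default).

    `deployed_tags` is assumed to be ordered oldest to newest.
    """
    return deployed_tags[-keep:]
```

A page rendered by release N would build every link with asset_url(N, ...), while the server keeps the directories named by releases_to_keep available so that browsers still running the previous release can finish what they are doing.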
In Laravel (PHP) we can do it in the following clear and elegant way (using file modification timestamp):
<script src="{{ asset('/js/your.js?v='.filemtime('js/your.js')) }}"></script>
And similar for CSS
<link rel="stylesheet" href="{{asset('css/your.css?v='.filemtime('css/your.css'))}}">
Example HTML output (filemtime returns the time as a Unix timestamp)
<link rel="stylesheet" href="assets/css/your.css?v=1577772366">
Don’t use foo.css?version=1!
Browsers aren't supposed to cache URLs with GET variables. According to http://www.thinkvitamin.com/features/webapps/serving-javascript-fast, though Internet Explorer and Firefox ignore this, Opera and Safari don't! Instead, use foo.v1234.css, and use rewrite rules to strip out the version number.
Here is a pure JavaScript solution
(function(){
// Match this timestamp with the release of your code
var lastVersioning = Date.UTC(2014, 11, 20, 2, 15, 10);
var lastCacheDateTime = localStorage.getItem('lastCacheDatetime');
if(lastCacheDateTime){
if(lastVersioning > lastCacheDateTime){
var reload = true;
}
}
localStorage.setItem('lastCacheDatetime', Date.now());
if(reload){
location.reload(true);
}
})();
The above will look for the last time the user visited your site. If the last visit was before you released new code, it uses location.reload(true) to force page refresh from server.
I usually have this as the very first script within the <head> so it's evaluated before any other content loads. If a reload needs to occur, it's hardly noticeable to the user.
I am using local storage to store the last visit timestamp on the browser, but you can add cookies to the mix if you're looking to support older versions of IE.
The RewriteRule needs a small update for JavaScript or CSS files that contain a dot notation versioning at the end. E.g., json-1.3.js.
I added a dot negation class [^.] to the regex, so .number. is ignored.
RewriteRule ^(.*)\.[^.][\d]+\.(css|js)$ $1.$2 [L]
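To see how this pattern behaves next to the 10-digit rule, both can be tried on a few file names in Python. This is only an approximation: Python's re module stands in for mod_rewrite's PCRE engine, which should treat these particular patterns the same way, and the sample file names are invented.

```python
import re

# The 10-digit rule and the dot-negation rule, as regular expressions.
TEN_DIGITS = re.compile(r"^(.*)\.[\d]{10}\.(css|js)$")
DOT_NEGATION = re.compile(r"^(.*)\.[^.][\d]+\.(css|js)$")


def rewrite(pattern, name):
    """Apply a rule the way the RewriteRule would: keep $1 and $2 only."""
    return pattern.sub(r"\1.\2", name)


for sample in ("base.1221534296.css", "json-1.3.js", "json-1.35.js", "jquery.v2.js"):
    print(sample, "->", rewrite(TEN_DIGITS, sample), "|", rewrite(DOT_NEGATION, sample))
```

Both rules rewrite base.1221534296.css and leave json-1.3.js alone, but the dot-negation rule also rewrites names such as json-1.35.js and jquery.v2.js, because [^.] accepts any single character followed by at least one digit. The 10-digit rule leaves those untouched.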
Interesting post. Having read all the answers here combined with the fact that I have never had any problems with "bogus" query strings (which I am unsure why everyone is so reluctant to use this) I guess the solution (which removes the need for Apache rewrite rules as in the accepted answer) is to compute a short hash of the CSS file contents (instead of the file datetime) as a bogus querystring.
This would result in the following:
<link rel="stylesheet" href="/css/base.css?[hash-here]" type="text/css" />
Of course, the datetime solutions also get the job done in the case of editing a CSS file, but I think it is about the CSS file content and not about the file datetime, so why get these mixed up?
For ASP.NET 4.5 and greater you can use script bundling.
The request http://localhost/MvcBM_time/bundles/AllMyScripts?v=r0sLDicvP58AIXN_mc3QdyVvVj5euZNzdsa2N1PKvb81 is for the bundle AllMyScripts and contains a query string pair v=r0sLDicvP58AIXN_mc3QdyVvVj5euZNzdsa2N1PKvb81. The query string v has a value token that is a unique identifier used for caching. As long as the bundle doesn't change, the ASP.NET application will request the AllMyScripts bundle using this token. If any file in the bundle changes, the ASP.NET optimization framework will generate a new token, guaranteeing that browser requests for the bundle will get the latest bundle.
There are other benefits to bundling, including increased performance on first-time page loads with minification.
For my development, I find that Chrome has a great solution.
https://superuser.com/a/512833
With developer tools open, simply long click the refresh button and let go once you hover over "Empty Cache and Hard Reload".
This is my best friend, and is a super lightweight way to get what you want!
Thanks to Kip for his perfect solution!
I extended it for use as a Zend_View_Helper. Because my client runs his site on a virtual host, I also extended it for that.
/**
* Extend filepath with timestamp to force browser to
* automatically refresh them if they are updated
*
* This is based on Kip's version, but now
* also works on virtual hosts
* @link http://stackoverflow.com/questions/118884/what-is-an-elegant-way-to-force-browsers-to-reload-cached-css-js-files
*
* Usage:
* - extend your .htaccess file with
* # Route for My_View_Helper_AutoRefreshRewriter
* # which extends files with their timestamp so if these
* # are updated an automatic refresh should occur
* # RewriteRule ^(.*)\.[^.][\d]+\.(css|js)$ $1.$2 [L]
* - then use it in your view script like
* $this->headLink()->appendStylesheet( $this->autoRefreshRewriter($this->cssPath . 'default.css'));
*
*/
class My_View_Helper_AutoRefreshRewriter extends Zend_View_Helper_Abstract {
public function autoRefreshRewriter($filePath) {
if (strpos($filePath, '/') !== 0) {
// Path has no leading '/'
return $filePath;
} elseif (file_exists($_SERVER['DOCUMENT_ROOT'] . $filePath)) {
// File exists under normal path
// so build path based on this
$mtime = filemtime($_SERVER['DOCUMENT_ROOT'] . $filePath);
return preg_replace('{\\.([^./]+)$}', ".$mtime.\$1", $filePath);
} else {
// Fetch the path of index.php (the file from which all others are included)
// and keep only its directory
$indexFilePath = dirname(current(get_included_files()));
// Check if the file exists relative to the index file
if (file_exists($indexFilePath . $filePath)) {
// Get the timestamp based on this relative path
$mtime = filemtime($indexFilePath . $filePath);
// Write the generated timestamp into the path,
// but use the original path, not the relative one
return preg_replace('{\\.([^./]+)$}', ".$mtime.\$1", $filePath);
} else {
return $filePath;
}
}
}
}
I have not seen the client-side DOM approach mentioned, which creates the script (or CSS) node dynamically:
<script>
var node = document.createElement("script");
node.type = "text/javascript";
node.src = 'test.js?' + Math.floor(Math.random()*999999999);
document.getElementsByTagName("head")[0].appendChild(node);
</script>
Say you have a file available at:
/styles/screen.css
You can either append a query parameter with version information onto the URI, e.g.:
/styles/screen.css?v=1234
Or you can prepend version information, e.g.:
/v/1234/styles/screen.css
IMHO, the second method is better for CSS files, because they can refer to images using relative URLs which means that if you specify a background-image like so:
body {
background-image: url('images/happy.gif');
}
Its URL will effectively be:
/v/1234/styles/images/happy.gif
This means that if you update the version number used, the browser will treat this as a new resource and not use a cached version. If you base your version number on the Subversion, CVS, etc. revision this means that changes to images referenced in CSS files will be noticed. That isn't guaranteed with the first scheme, i.e. the URL images/happy.gif relative to /styles/screen.css?v=1235 is /styles/images/happy.gif which doesn't contain any version information.
I have implemented a caching solution using this technique with Java servlets and simply handle requests to /v/* with a servlet that delegates to the underlying resource (i.e. /styles/screen.css). In development mode I set caching headers that tell the client to always check the freshness of the resource with the server (this typically results in a 304 if you delegate to Tomcat's DefaultServlet and the .css, .js, etc. file hasn't changed) while in deployment mode I set headers that say "cache forever".
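The difference between the two schemes can be checked with Python's standard URL resolver, which follows the same relative-reference rules that browsers apply to url() in CSS. The host name example.com and the helper name resolve are placeholders.

```python
from urllib.parse import urljoin


def resolve(stylesheet_url, reference):
    """Resolve a relative reference found in a stylesheet against its URL."""
    return urljoin(stylesheet_url, reference)


# Version in the path: the image URL inherits the version.
print(resolve("http://example.com/v/1234/styles/screen.css", "images/happy.gif"))
# Version in the query string: the version is lost during resolution.
print(resolve("http://example.com/styles/screen.css?v=1235", "images/happy.gif"))
```

The first call prints http://example.com/v/1234/styles/images/happy.gif and the second prints http://example.com/styles/images/happy.gif, matching the behaviour described above.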
You could simply add some random number with the CSS and JavaScript URL like
'example.css?randomNo=' + Math.random()
Google Chrome has the Hard Reload as well as the Empty Cache and Hard Reload option. You can click and hold the reload button (in Inspect Mode) to select one.
I recently solved this using Python. Here is the code (it should be easy to adapt to other languages):
import os
def import_tag(pattern, name, **kw):
    if name[0] == "/":
        name = name[1:]
    # Additional HTML attributes
    attrs = ' '.join(['%s="%s"' % item for item in kw.items()])
    try:
        # Get the file's modification time
        mtime = os.stat(os.path.join('/documentroot', name)).st_mtime
        include = "%s?%d" % (name, mtime)
        # This is the same as sprintf(pattern, attrs, include) in other
        # languages
        return pattern % (attrs, include)
    except OSError:
        # In case of error return the include without the added query
        # parameter.
        return pattern % (attrs, name)
def script(name, **kw):
    return import_tag('<script %s src="/%s"></script>', name, **kw)
def stylesheet(name, **kw):
    return import_tag('<link rel="stylesheet" type="text/css" %s href="/%s">', name, **kw)
This code basically appends the files time-stamp as a query parameter to the URL. The call of the following function
stylesheet("/main.css")
will result in
<link rel="stylesheet" type="text/css" href="/main.css?1221842734">
The advantage, of course, is that you never have to change your HTML content again; touching the CSS file will automatically trigger a cache invalidation. It works very well and the overhead is not noticeable.
You can force a "session-wide caching" if you add the session-id as a spurious parameter of the JavaScript/CSS file:
<link rel="stylesheet" href="myStyles.css?ABCDEF12345sessionID" />
<script language="javascript" src="myCode.js?ABCDEF12345sessionID"></script>
If you want a version-wide caching, you could add some code to print the file date or similar. If you're using Java you can use a custom-tag to generate the link in an elegant way.
<link rel="stylesheet" href="myStyles.css?20080922_1020" />
<script language="javascript" src="myCode.js?20080922_1120"></script>
For ASP.NET I propose the following solution with advanced options (debug/release mode, versions):
Include JavaScript or CSS files this way:
<script type="text/javascript" src="Scripts/exampleScript<%=Global.JsPostfix%>"></script>
<link rel="stylesheet" type="text/css" href="Css/exampleCss<%=Global.CssPostfix%>" />
Global.JsPostfix and Global.CssPostfix are calculated by the following way in Global.asax:
protected void Application_Start(object sender, EventArgs e)
{
...
string jsVersion = ConfigurationManager.AppSettings["JsVersion"];
bool updateEveryAppStart = Convert.ToBoolean(ConfigurationManager.AppSettings["UpdateJsEveryAppStart"]);
int buildNumber = System.Reflection.Assembly.GetExecutingAssembly().GetName().Version.Revision;
JsPostfix = "";
#if !DEBUG
JsPostfix += ".min";
#endif
JsPostfix += ".js?" + jsVersion + "_" + buildNumber;
if (updateEveryAppStart)
{
Random rand = new Random();
JsPostfix += "_" + rand.Next();
}
...
}
If you're using Git and PHP, you can reload the script from the cache each time there is a change in the Git repository, using the following code:
exec('git rev-parse --verify HEAD 2> /dev/null', $gitLog);
echo '<script src="/path/to/script.js?v='.$gitLog[0].'"></script>'.PHP_EOL;
Simply add this code where you want to do a hard reload (force the browser to reload cached CSS and JavaScript files):
$(window).load(function() {
location.reload(true);
});
Do this inside the .load, so it does not refresh like a loop.
For development: use a browser setting: for example, Chrome network tab has a disable cache option.
For production: append a unique query parameter to the request (for example, ?q= followed by Date.now()) with a server-side rendering framework or pure JavaScript code.
// Pure JavaScript unique query parameter generation
//
//=== myfile.js
function hello() { console.log('hello') };
//=== end of file
<script type="text/javascript">
// document.write is considered bad practice!
// We can't use hello() yet
document.write('<script type="text/javascript" src="myfile.js?q=' + Date.now() + '"><\/script>');
</script>
<script type="text/javascript">
hello();
</script>
For developers with this problem while developing and testing:
Remove caching briefly.
"keep caching consistent with the file" .. it's way too much hassle ..
Generally speaking, I don't mind loading more - even loading again files which did not change - on most projects - is practically irrelevant. While developing an application - we are mostly loading from disk, on localhost:port - so this increase in network traffic issue is not a deal breaking issue.
Most small projects are just playing around - they never end-up in production. So for them you don't need anything more...
As such, if you use Chrome DevTools, you can tick the "Disable cache" option in the Network tab.
Firefox's developer tools offer a similar setting if you have caching issues there.
Do this only in development. You also need a mechanism to force a reload in production, since your users will keep using stale cached modules if you update your application frequently and don't provide a dedicated cache synchronisation mechanism like the ones described in the answers above.
Yes, this information is already in previous answers, but I still needed to do a Google search to find it.
It seems all answers here suggest some sort of versioning in the naming scheme, which has its downsides.
Browsers should be well aware of what to cache and what not to cache by reading the web server's response, in particular the HTTP headers - for how long is this resource valid? Was this resource updated since I last retrieved it? etc.
If things are configured 'correctly', just updating the files of your application should (at some point) refresh the browser's caches. You can for example configure your web server to tell the browser to never cache files (which is a bad idea).
A more in-depth explanation of how that works is in How Web Caches Work.
Just use server-side code to add the date of the file... that way it will be cached and only reloaded when the file changes.
In ASP.NET:
<link rel="stylesheet" href="~/css/custom.css?d=@(System.Text.RegularExpressions.Regex.Replace(File.GetLastWriteTime(Server.MapPath("~/css/custom.css")).ToString(),"[^0-9]", ""))" />
<script type="text/javascript" src="~/js/custom.js?d=@(System.Text.RegularExpressions.Regex.Replace(File.GetLastWriteTime(Server.MapPath("~/js/custom.js")).ToString(),"[^0-9]", ""))"></script>
This can be simplified to:
<script src="<%= Page.ResolveClientUrlUnique("~/js/custom.js") %>" type="text/javascript"></script>
By adding an extension method to your project to extend Page:
public static class Extension_Methods
{
public static string ResolveClientUrlUnique(this System.Web.UI.Page oPg, string sRelPath)
{
string sFilePath = oPg.Server.MapPath(sRelPath);
string sLastDate = System.IO.File.GetLastWriteTime(sFilePath).ToString();
string sDateHashed = System.Text.RegularExpressions.Regex.Replace(sLastDate, "[^0-9]", "");
return oPg.ResolveClientUrl(sRelPath) + "?d=" + sDateHashed;
}
}
You can use SRI to break the browser cache. You only have to update your index.html file with the new SRI hash every time. When the browser loads the HTML and finds out the SRI hash on the HTML page didn't match that of the cached version of the resource, it will reload your resource from your servers. It also comes with a good side effect of bypassing cross-origin read blocking.
<script src="https://jessietessie.github.io/google-translate-token-generator/google_translate_token_generator.js" integrity="sha384-muTMBCWlaLhgTXLmflAEQVaaGwxYe1DYIf2fGdRkaAQeb4Usma/kqRWFWErr2BSi" crossorigin="anonymous"></script>
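The integrity value is the base64-encoded digest of the file's bytes, prefixed with the hash name. A short Python sketch of how it could be computed for sha384 (the function name sri_hash is made up for this example):

```python
import base64
import hashlib


def sri_hash(content):
    """Return an SRI integrity value (sha384) for the given bytes."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")
```

Pass it the exact bytes the server will deliver, for example by reading the script with open(path, 'rb').read(), and paste the result into the integrity attribute each time the file changes.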

Resources