How do I set up Page Rules to bypass caching for an entire image directory?

I'm working with Cloudflare, and one of its benefits is caching. However, certain areas of my site shouldn't be cached, because users need to see current results. I run an online store, and product images are stored in directories named after the product ID, which means there are quite a few directories whose images I don't want cached.
Here are a few example URLs of product images on my website.
www.mysite.co.uk/images/products/87/image300.jpg
www.mysite.co.uk/images/products/88/image300.jpg
www.mysite.co.uk/images/products/94/image300.jpg
www.mysite.co.uk/images/products/27/image300.jpg
To stop these directories being cached I tried adding page rules in the Cloudflare dashboard, but despite my best efforts I cannot get Cloudflare to stop caching the images within the products directory.
My first attempt was to use a wildcard to prevent caching across all pages on my site; the page rule I used was:
mysite.co.uk/*
However, the above rule didn't seem to do anything. I then attempted to get more granular and opted for a rule like this:
mysite.co.uk/images/products/*
This rule didn't seem to work either. I then looked at more advanced wildcard use but I fear I got out of my depth:
mysite.co.uk/images/products/*/$1.jpg
Needless to say, the above rule didn't work either. So, my question is, what rule should I use to prevent caching of my product images?

Assuming your site is published at www.mysite.co.uk and the images you don't want cached live under www.mysite.co.uk/images/products/..., you would create a page rule matching:
www.mysite.co.uk/images/products/*
with the setting Cache Level: Bypass. This rule tells Cloudflare not to store the resources matching the expression on the CDN. You can also narrow the match to www.mysite.co.uk/images/products/*.jpg if you only want it to apply to jpg images under that folder.
Finally: if more page rules are defined, it is recommended to order them from most to least specific, as only one rule is matched for each request.
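If you manage the zone through Cloudflare's v4 API rather than the dashboard, the same rule can also be created programmatically. This is only a rough sketch using the third-party requests library: the zone ID and API token are placeholders, and the endpoint and action names should be checked against the current API reference.

```python
import requests

# Placeholders: substitute your real zone ID and an API token with Page Rules edit permission.
ZONE_ID = "your-zone-id"
API_TOKEN = "your-api-token"

payload = {
    "targets": [{
        "target": "url",
        "constraint": {"operator": "matches",
                       "value": "www.mysite.co.uk/images/products/*"},
    }],
    # Cache Level: Bypass, i.e. do not cache matching URLs.
    "actions": [{"id": "cache_level", "value": "bypass"}],
    "status": "active",
}

resp = requests.post(
    f"https://api.cloudflare.com/client/v4/zones/{ZONE_ID}/pagerules",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```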

Related

Google Custom Search API returning HTML documents instead of images

I started using the Google Custom Search API for a project. The idea is to search for images, and I wanted to use Custom Search because the Google Images API is deprecated.
I have already enabled image search in the CSE console.
My query is like this:
https://www.googleapis.com/customsearch/v1?key=APIKEY&cx=CSECX&q=flower&alt=json&searchType=image&num=1&start=NUMBER
Where NUMBER is a random value between 1 and 20
Sometimes, it returns results like this:
{u'kind': u'customsearch#result', u'title': u'Flower Wallpaper Tumblr #6790199', u'displayLink': u'7-themes.com', u'htmlTitle': u'<b>Flower</b> Wallpaper Tumblr #6790199', u'snippet': u'Flower Wallpaper Tumblr', u'htmlSnippet': u'<b>Flower</b> Wallpaper Tumblr', u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg', u'image': {u'thumbnailWidth': 150, u'byteSize': 808360, u'height': 1200, u'width': 1920, u'contextLink': u'http://7-themes.com/6790199-flower-wallpaper-tumblr.html', u'thumbnailLink': u'https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSad0z_Wla0nRHAcQrjO5jLQkFjcoqnNHhejjuGmdA1AW2BqIVEpLARAk0s', u'thumbnailHeight': 94}}
Highlighting the interesting part:
u'link': u'http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg', u'mime': u'image/jpeg'
So it seems that the URL is http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg and the mimetype is image/jpeg, but if you go to that URL you'll see it isn't an image, it's an HTML document.
Of course, I could catch this as an exception, but I don't want to waste daily API requests (out of a limit of 100 per day) just because the API didn't give me an image when I explicitly asked for one.
So, the question is: is this normal behaviour, or a misconfiguration/misuse on my part? If so, how can I fix it?
Thanks for your attention
After a little bit of reading, my best guess is that some servers are doing a resource redirect to prevent external sources from hotlinking directly to a resource. The file in question is advertised as an image, but accessing it from an external server will provide an HTML document instead. This is not a URL redirect, so it isn't detected by clients (including the Google crawler) until the resource is downloaded.
This sort of resource redirect is done on Apache servers using the .htaccess file and the RewriteEngine, with a technique similar to the one described here, although that particular technique can't be used to bait-and-switch images for HTML documents.
In short, if a server is lying about what type of file it's hosting, Google can't do anything about that. You can confirm that this is not an issue with the custom search API by performing the same query on the normal web search interface -- notice that clicking the image loads an HTML document rather than the image itself.
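If you want to guard against this on the client side without burning extra Custom Search quota on retries, you can check the Content-Type the link actually serves before treating it as an image. A minimal sketch using the third-party requests library (the URL is just the example from the question):

```python
import requests

def is_real_image(url, timeout=10):
    """Return True if the URL actually serves an image Content-Type."""
    try:
        # Some hosts block HEAD, so fall back to a streamed GET
        # that never downloads the body.
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code >= 400 or not resp.headers.get("Content-Type"):
            resp = requests.get(url, stream=True, allow_redirects=True, timeout=timeout)
        return resp.headers.get("Content-Type", "").startswith("image/")
    except requests.RequestException:
        return False

print(is_real_image("http://7-themes.com/data_images/out/7/6790199-flower-wallpaper-tumblr.jpg"))
```

Note that this only trusts the response headers; a host that also lies in its Content-Type header would still slip through.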

How to crawl/index the links on a single page: Google Search Appliance

I'm new to the GSA and don't have full admin access to the system, so I have to forward requests to ICT Services to have changes made to our crawls and collections.
I hope someone can help with this question:
I have a single web page with a list of links to about 180 documents, most of which are stored in the same subdirectory /docs/ (which contains some 2,400 documents). The rest are scattered across the site in a number of other subdirectories, e.g. /finance/, /hr/, etc.
At the moment, I either get the single web page indexed and none of the 180 links, or I get the one page plus ALL of the 2,400 documents in the /docs/ subdirectory.
I want to crawl/index just this page and the 180 links, and create a separate collection from them.
Is there a simple way to do this?
Regards
Henry
Another possible solution is to use a robots.txt file to disallow crawling of the other pages you don't want. This would be a lot of work if you have to enumerate all of them though.
Your best bet is to see if there is some common URL pattern you can use to specify only the 180 pages you do want. For example, are the pages you want all PDFs, while the files you don't want are all some other type? If you can find something common to all the pages you want that isn't true for the other pages, you can use that to formulate a pattern (perhaps a regex) to do what you want.
Instead of configuring a URL pattern under the start URLs and follow patterns, configure the complete URLs. Take the 180 document URLs plus the single web page URL and put all 181 URLs under both start URLs and follow patterns. By configuring complete URLs you prevent the GSA from crawling the other URLs in the application, because there is no common URL pattern under the follow patterns.
Create a new collection and place all 180 document URLs plus the single web page URL (or a generic pattern matching the 181 URLs) in that collection under "Include Content Matching the Following Patterns".
I assume that you do not want to index the other 2,400 documents on the GSA.
Hope it helps.
Regards,
Mohan.
You would be better off using a metadata-and-URL feed for this.
It will allow you to control whether the GSA follows the links in your 180 pages (if you feed them in) or whether your list page gets indexed (if you just feed that one). You do this by specifying noindex or nofollow.
You'll still need to have your follow and crawl patterns and collections set up correctly but it's the easiest way to control what gets indexed.
You don't necessarily need to write code for this either; you can use curl and hand-craft the XML.
The documentation is pretty good and easy to follow. Feeds Protocol Developers Guide
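As a rough illustration only, here is what building and pushing a metadata-and-url feed might look like if scripted in Python instead of curl. The GSA hostname, datasource name, and URLs are assumptions, so verify the exact DTD, fields, and port against the Feeds Protocol Developers Guide before relying on it.

```python
import requests

GSA_FEED_URL = "http://your-gsa-host:19900/xmlfeed"  # assumed feed host and default feed port
DATASOURCE = "single_page_docs"                      # assumed datasource name

urls = [
    "http://www.example.org/listpage.html",          # the single list page (placeholder URL)
    # ...plus the 180 document URLs
]

records = "\n".join(
    f'    <record url="{u}" action="add" mimetype="text/html"/>' for u in urls
)
feed = f"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
  <header>
    <datasource>{DATASOURCE}</datasource>
    <feedtype>metadata-and-url</feedtype>
  </header>
  <group>
{records}
  </group>
</gsafeed>"""

# The GSA feed interface accepts a multipart POST with these three fields.
resp = requests.post(
    GSA_FEED_URL,
    files={
        "feedtype": (None, "metadata-and-url"),
        "datasource": (None, DATASOURCE),
        "data": ("feed.xml", feed, "text/xml"),
    },
    timeout=30,
)
print(resp.status_code, resp.text)
```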

Moved Category By Mistake and Now URL key has changed

I wanted to move a category called "trinkets" before one called "widgets". Instead, it somehow ended up inside of "widgets." When I moved it back out, the url key was changed to "trinkets-1". How do I get it back to "trinkets"?
Here is something I found - http://www.yireo.com/tutorials/magento/magento-administration/664-fixing-url-rewrites-with-magento
Quoted from the above website:
Sometimes when you make changes to your products, or enable a certain extension, Magento might start to rewrite all your URLs to include a suffix "-1" or some other number. Within the URL Rewrites, Magento differentiates between System URLs and Custom URLs. If the System URLs are broken like this, you should not fix this by adding new Custom URLs.
Instead, open up phpMyAdmin, create a backup of your Magento database, and flush the Magento table core_url_rewrite (so that it becomes totally empty). Immediately afterwards, refresh the Catalog Url Rewrites under Index Management. This will regenerate all System URLs.
If you are comfortable taking a backup and then removing all the records from the above table (allowing for any table prefix), it sounds like a quick fix.
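If you would rather script the flush than use phpMyAdmin, a minimal sketch follows; the connection details and table prefix are assumptions, and you should take a backup first.

```python
import pymysql

# Assumed connection details; replace with your own, and back up the database
# (e.g. with mysqldump) before running this.
conn = pymysql.connect(host="localhost", user="magento",
                       password="secret", database="magento")
try:
    with conn.cursor() as cur:
        prefix = ""  # set this if your tables use a prefix, e.g. "mage_"
        cur.execute(f"TRUNCATE TABLE {prefix}core_url_rewrite")
    conn.commit()
finally:
    conn.close()
```

Afterwards, refresh the Catalog URL Rewrites index under Index Management so Magento regenerates the system URLs.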
When you rename a category, a URL rewrite rule is generated so that you don't lose the traffic coming in on the original category URL (see Catalog -> URL Rewrite Management and search for Request Path: trinkets).
Now, when you move it back, Magento checks whether the URL key "trinkets" is already in use (which it is, because a redirect was generated).
Delete the records matching "trinkets" from URL Rewrite Management and modify the URL key (edit the category). Also, when you modify the URL key for the category, make sure the "Create Permanent Redirect for old URL" checkbox is unchecked.

Detecting URL rewrites (SEO urls)

How could a client detect whether a server is using search engine optimization techniques, such as mod_rewrite, to implement "SEO friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the generated HTML dynamic, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As mentioned elsewhere: are there <link rel="canonical"> tags? With those, the website tells the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1".
If there is no mod_rewrite, a broken path will return a plain 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases will return a 200 page printing an error.
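A small sketch of those client-side checks, using the example URL from the question (which will not actually respond; substitute a real site). The requests library is assumed, and none of the clues is conclusive on its own.

```python
import requests

def rewrite_clues(url, broken_url):
    """Gather rough hints that a path like /pic/1 is a rewritten 'SEO' URL."""
    real = requests.get(url, timeout=10)
    broken = requests.get(broken_url, timeout=10)
    return {
        # A rewritten path is often handled by a script that answers 200
        # with an error page instead of a plain 404.
        "broken_variant_not_404": broken.status_code != 404,
        # A canonical tag suggests several URLs map to the same content.
        "has_canonical_tag": 'rel="canonical"' in real.text.lower(),
        # Dynamic HTML that changes between calls is another hint.
        "body_changes_between_calls": real.text != requests.get(url, timeout=10).text,
    }

print(rewrite_clues("http://somedomain.com/pic/1", "http://somedomain.com/pic/999999"))
```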
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually in the words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with full URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
Such that the SEO aspect of the URL is the suffix. The algorithm to apply is: typify each "folder" after the common base, assigning it a "datatype" (numeric, text, alphanumeric), and then score as follows:
HTTP response code is 200: should be obvious, but note that you can get a 404 page such as www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non-numeric, with separators, spell-checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check; score if the words are valid, including proper names.
Spell-checked URL text on page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell-checked URL text on page inside a tag: if the prior check passes, score again if the text in its entirety is inside an HTML tag.
Tag is important: if the prior check passes and the tag is a <title> or <h#> tag.
Usually with this approach you'll have a maximum of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. You can probably improve this with a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus some other clever featurizations. But then you've got to train the algorithm, which may not be worth it.
Now, based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because the query parameters are now part of the URL instead. In that case you can still typify the suffix's folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as SEO friendly as well.
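A minimal sketch of that scoring heuristic, with a tiny hard-coded word list standing in for a real spell checker and without the Bayesian extension; it illustrates the shape of the idea rather than a production classifier.

```python
import re

# A toy "dictionary"; a real implementation would use a proper spell checker.
KNOWN_WORDS = {"man", "bites", "dog", "beauty", "not", "just", "skin", "deep"}

def score_url(url, page_html, status_code):
    """Score how 'SEO friendly' a URL looks, roughly following the steps above."""
    score = 1 if status_code == 200 else 0
    lower_html = page_html.lower()

    # Examine each "folder" after the domain.
    path = re.sub(r"^https?://[^/]+/?", "", url)
    for folder in path.strip("/").split("/"):
        words = [w for w in re.split(r"[-_ ]+", folder) if w and not w.isdigit()]
        if not words:
            continue  # purely numeric folders carry no SEO signal
        if not all(w.lower() in KNOWN_WORDS for w in words):
            continue  # fails the (toy) spell check
        score += 1                      # non-numeric, separated, spell-checked
        phrase = " ".join(words).lower()
        if phrase not in lower_html:
            continue
        score += 1                      # URL text appears on the page
        # Crude check: is the phrase the entire content of some tag?
        if re.search(rf"<([a-z][a-z0-9]*)[^>]*>\s*{re.escape(phrase)}\s*</\1>", lower_html):
            score += 1                  # the text sits inside an HTML tag
            if re.search(rf"<(title|h\d)[^>]*>\s*{re.escape(phrase)}\s*</(?:title|h\d)>", lower_html):
                score += 1              # and the tag is important (<title> or <h#>)
    return score

html = "<html><title>Man bites dog</title><body><p>man bites dog</p></body></html>"
print(score_url("http://www.domain.ext/article/2011/06/15/man-bites-dog", html, 200))  # 5
```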
I presume you would be using one of the curl variants.
You could try sending the same request but with different User-Agent values:
send the request once using the user agent "Mozilla/5.0" and a second time using the user agent "Googlebot". If the server is doing something special for web crawlers, then there should be a different response.
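A quick sketch of that comparison, again assuming the requests library and the example URL from the question (which won't actually respond; substitute a real site).

```python
import requests

URL = "http://somedomain.com/pic/1"  # example URL from the question

browser = requests.get(URL, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
crawler = requests.get(
    URL,
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
    timeout=10,
)

# If the server special-cases crawlers, the two responses will usually differ.
print(browser.status_code, crawler.status_code)
print("identical bodies:", browser.text == crawler.text)
```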
With today's frameworks and the URL routing they provide, I don't even need mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change. Dunno how legal that is, tbh.
For dynamic URL patterns, it's better to use a <link rel="canonical" href="..." /> tag on the other duplicates.

SEO URL Structure

Based on the following example URL structure:
mysite.com/mypage.aspx?a=red&b=green&c=blue
Pages in the application use ASP.NET user controls, and some of these controls build a query string. To prevent duplicate keys being created (e.g. &pid=12&pid=10), I am researching methods of rewriting the URL:
a)
mysite.com/mypage.aspx/red/green/blue
b)
mysite.com/mypage.aspx?controlname=a,red|b,green|c,blue
Pages using this structure would be publishing content that I would like to get indexed and ranked: articles and products (8,000 products to start, with thousands more being added later).
My gut instinct tells me to go with the first method, but it may be overkill to add all that infrastructure if the second method will accomplish my goal of getting pages indexed AND ranked.
So my question is: looking at the pros and cons (Google ranking, time to implement, etc.), which method should I use?
Thanks!
From an SEO perspective you want to avoid the query string, so getting the values into the URL in a short form is going to get you a better "bang for the buck" on the implementation side of things.
Therefore, I'd recommend the first.
Why not use the MVC pattern? That way all your links will be SEO-ready. Check here; you will find what MVC is and also some implementations in .NET!
You can easily make SEO-friendly URLs with the help of Helicon Ape (software which provides basic Apache functionality on your IIS server). You'll need mod_rewrite, I guess.
If you get interested, I can help you with the rules.
Can you explain in more detail your current architecture and what the parameters all mean? There's nothing really wrong with query strings if it's truly dynamic content. Rewriting ?a=red&b=green&c=blue to /red/green/blue is kinda pointless and it's unclear from the URL what might be on the page.
The key is to simplify as much as possible. Split the site into categories and give each "entity" one URL.
For example, if you are selling products, use one URL per product, with keywords in the URL - e.g. mysite.com/products/red-widget or mysite.com/products/12-red-widget if you need the product ID.
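As a small illustration, generating that kind of keyword slug from a (hypothetical) product name and ID might look like this:

```python
import re

def product_url(product_id, name, base="mysite.com/products"):
    """Build a keyword-rich product URL like mysite.com/products/12-red-widget."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    return f"{base}/{product_id}-{slug}"

print(product_url(12, "Red Widget"))  # mysite.com/products/12-red-widget
```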
