How many rewrite rules should I expect to manage? - url-rewriting

I'm dealing with a hosting team that is fairly skittish about managing many rewrite rules. What are your experiences with the number of rules your sites currently manage?
I can see dozens (if not more) coming up as the site grows and contracts, and I need to set expectations that this isn't out of the norm.
Thanks

It is normal to have a lot of rewrite rules on a site. As the site gets bigger, the number of pages you need to rewrite grows, and depending on what those pages do, you could have multiple rewrites per page. It also depends on how much you are locking the site down: more security means more precautions, and more rules.
The mod_rewrite module gives you the ability to transparently redirect one URL to another, without the user's knowledge. This opens up all sorts of possibilities, from simply redirecting old URLs to new addresses, to cleaning up the 'dirty' URLs coming from a poor publishing system, giving you URLs that are friendlier to both readers and search engines.
So it's pretty much at your discretion, and a question of how locked down you want the site to be.
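As a rough illustration (the paths and parameter names here are made up, not from the question), a single friendly URL usually needs only one line in .htaccess, so a count in the dozens is nothing unusual:

    RewriteEngine On
    # map a friendly URL like /products/blue-widget to the real script behind it
    RewriteRule ^products/([a-z0-9-]+)/?$ product.php?slug=$1 [L,QSA]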

Related

How to measure web assets browser cache efficiency

Do you (smart guys :) have any idea what the best way is to measure the cache efficiency of the assets used on a website (JS, CSS, fonts, etc.)?
How should I decide whether it's better for my website to (for example) put all my JS files into one file or to split them into several smaller files?
Many websites do it with one, two or three big files, but Facebook, for example, uses many, many small files. It's clear that Facebook has a lot of returning visitors, so that strategy works better for them. But how do you measure it?
I know I can check the returning/new visitors ratio in GA, but that's not a very deep check. After that I still don't know which users arrived at my website with some of my assets already in cache and which didn't.
My suggestion would be to run an A/B test: implement several solutions, measure the impact on your users' loading speed, look at the 98th or 99th percentile, and you'll see which works better.
Looking at returning/new visitor data is a good approach, but it's not reliable if you don't have many users.
Facebook can afford many small files partly because they use HTTP/2 (which grew out of SPDY). It solves many of the problems with delivering multiple resources; for example, the protocol compresses HTTP headers, not only bodies as good old HTTP 1.1 does.
Look at the other benefits of HTTP/2 as well.
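If you want to try HTTP/2 on your own server, and assuming you run Apache 2.4.17 or newer with TLS already configured (an assumption, not something stated in the question), enabling it is roughly:

    # load the HTTP/2 module and prefer h2 over HTTP/1.1 (browsers require TLS for h2)
    LoadModule http2_module modules/mod_http2.so
    Protocols h2 http/1.1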

Magento - prevent from browsing without rewrite

I have a problem with someone (using many IP addresses) browsing all over my shop using:
example.com/catalog/category/view/id/$i
I have URL rewrite turned on, so the usual human browsing looks "friendly":
example.com/category_name.html
Therefore, the question is: how can I prevent the shop from being browsed through the "old" (non-rewritten) URLs, so that only the "friendly" URLs work?
This is pretty important, since the crawler is using hundreds of threads, which makes the shop really slow.
Since there are many random IP addresses, clearly you can't just block access from a single or small group of addresses. You may need to implement some logging that somehow identifies this crawler uniquely (maybe by browser agent, or possibly with some clever use of the Modernizr javascript library).
Once you've been able to distinguish some unique identifiers of this crawler, you could probably use a rule in .htaccess (if it's a user agent thing) to redirect or otherwise prevent them from consuming your server's oomph.
This SO question provides details on rules for user agents.
Block all bots/crawlers/spiders for a special directory with htaccess
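If the crawler does identify itself with a distinctive user-agent string, a rule along these lines would work ("BadBot" is a placeholder pattern, not the actual agent):

    RewriteEngine On
    # deny any request whose User-Agent matches the placeholder pattern
    RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
    RewriteRule ^ - [F,L]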
If the spider crawls all the urls of the given pattern:
example.com/catalog/category/view/id/$i
then you can just kill these URLs in .htaccess. The rewrite is made internally from category.html -> /catalog/category/view/id/$i, so you only block the bots (see the sketch below).
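A minimal sketch of such a block, assuming the URL pattern from the question:

    RewriteEngine On
    # match only what the client actually requested, so Magento's internal
    # routing of the friendly URLs keeps working
    RewriteCond %{THE_REQUEST} \s/catalog/category/view/id/ [NC]
    RewriteRule ^ - [F,L]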
Once the rewrites are there, they are there. They are stored in the Magento database for several reasons: one is crawlers like the one hitting your site, another is users who might have the old page bookmarked. People have come up with a number of methods to go through and clean up the redirects (Google it), but as it stands, once they exist they are not easily managed from within Magento.
I might suggest generating a new sitemap and submitting it to the crawler affecting your site. Not only is this crawler crawling tons of pages it doesn't need to, it's also going to see duplicate content (bad ju ju).

Possible to use one codebase for a subdomain for multiple sites?

I don't even know if this is even possible, but I thought I'd ask.
I am creating a small CRUD application, but I have multiple sites, and each site would use the CRUD. The application would have common CRUD methods and styling, but each individual site would apply different forms. I want to avoid creating multiple CRUD applications that vary only in specific content (just different forms).
I want to have something like this:
mycrud.website1.com
mycrud.website2.com
mycrud.website3.com
I can create a subdomain for each individual site, no problem. But is it feasible to point all the subdomains at one MVC application directory? And if it is possible, any suggestions for how I might go about preventing users of website1 from seeing website2 or website3 content? Is that something "roles" could take care of (after authenticating the user)?
Thanks.
There are a lot of websites that do this, not just with MVC. Some content farms point *.mydomain.com to a single IP and have a wild card mapping in IIS.
From there, your application should look at the URL to determine what it should be doing. Some CMS systems operate in this manner, using the domain as a key to deciding what pages to load.
I've built a private-labelable SaaS application (Software as a Service) that lets us host all of our clients in a single application. Some clients have customizations to pages or features; we handle that by creating custom plugins for each client that override the Controllers or Views when needed.
All clients share a common code base, and aside from each client's custom theme/template they are the same. Only when a client had us customize a feature did we need to build out their plugin DLL. Now, this is advanced stuff, so it would require heavy modifications to your code base, but in the end, if it's what your application needs, it is 100% possible.
First, the easy part: having one web site serve all three domains. You can do that simply with DNS entries; no problem. All three domains should point at the same IP.
As far as the content goes, you could handle that in a number of ways. I think your idea of roles is pretty solid. It also leaves open the possibility of a given user seeing content from both site1 and site2, if that would ever be necessary.
If you don't want to force users to authenticate, you should look at other options. You could wrap your CRUD logic and data access logic into separate libraries and use them across three different sites in IIS. You could have one site and display content based on the request URL. There's probably a lot of other options too.
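The answers above describe this in IIS terms; purely to illustrate the same wildcard-host idea on Apache (the hostnames are the ones from the question, the directory path is made up), a single virtual host can serve all three subdomains while the application branches on the Host header:

    <VirtualHost *:80>
        # one application directory serves every subdomain
        ServerName mycrud.website1.com
        ServerAlias mycrud.website2.com mycrud.website3.com
        DocumentRoot /var/www/mycrud
    </VirtualHost>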

Crawlers/SEO Friendly/Mod Rewrite/It doesn't make any sense

So I am attached to this rather annoying project where a client's client is all nit-picky about the little things, and he's giving my guy hell, who is gladly returning the favor by following the good old rule of shoving shi* down the chain of command.
Now my question. The application consists basically of 3 different mini projects. The backend interface for the administrator, backend interface for the client and the frontend for everyone.
I was specifically asked to apply MOD_REWRITE rules to make things SEO friendly. That was the ultimate aim, so this was basically an exercise in making things more search friendly rather than making the links aesthetically better looking.
So I worked on the frontend, which is basically the landing page for everyone. It looks beautiful; the links contain at worst a single slash.
My client's issue: he wants to know why the backend interfaces for the admin and the user are still displaying those gigantic ugly links. And these are very, very ugly links. I am talking three to four slashes followed by various GET sequences and what not, so you can probably understand the complexities behind MOD_REWRITING something like that.
In the spur of the moment I said that I left it the way it was to make sure the backend interface wouldn't be sniffed up by any crawlers.
But I am not sure that's necessarily true. Where do crawlers stop? When do they give up on trying to parse links? I know I can use a robots.txt file to specify rules. But, as indigenous creatures, what are their instincts?
I know this is more of a rant than anything and I am running a very high risk of having my first question rejected :| But hey, it feels good to have this off my chest.
Cheers!
Where do crawlers stop? When do they give up on trying to parse links?
Robots.txt does not work for all bots.
You can use basic authentication or IP-restricted access to hide the back end, provided none of its files are needed by the front end.
If that's not practicable, try sending 404 or 401 headers for the back-end files. But this is just an idea, no guarantee.
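A minimal sketch of the basic-auth option, assuming an Apache back end and a .htaccess file dropped into the admin directory (the password-file path and IP range are placeholders):

    # require a login for everything in this directory
    AuthType Basic
    AuthName "Back end"
    AuthUserFile /etc/apache2/.htpasswd
    Require valid-user
    # or restrict by address instead: Require ip 203.0.113.0/24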
But, as indigenous creatures, what are their instincts?
Hyperlinks, toolbars, and browser-side, pre-activated functions for malware, spam and fraud warnings...

How effective is ajaxcrawling compared to serverside created website SEO?

I'm looking for real world experiences in regards to ajaxcrawling:
http://code.google.com/web/ajaxcrawling/index.html
I'm particularly concerned about the infamous Gizmodo failure of late. I know I can find them via Google now, but it's not clear to me how effective this method of ajax crawling is compared to server-side generated sites.
I would like to make a wiki that lives mostly on the client side and is populated by ajax JSON. It just feels more fluid, and I think it would be a plus point over my competition (wikipedia, wikimedia).
Obviously, for a wiki it's incredibly important to have working SEO.
I would be very happy for any experiences you have had dealing with clientside development.
My research shows that the general consensus on the web right now is that you should absolutely avoid building ajax-driven sites unless you don't care about SEO (for example, a portfolio site, a corporate site, etc.).
Well, these SEO problems arise when you have a single page that loads content dynamically based on sophisticated client-side behavior. Spiders aren't always smart enough to know what JavaScript is injecting, so if they can't follow links to get to your content, most of them won't understand what's going on in a predictable way, and thus won't be able to fully index your site.
If you could have the option of unique URLs that lead to static content, even if they all route back to a single page by a URL rewriting scheme, that could solve the problem. Also, it will yield huge benefits down the road when you've got a lot of traffic -- the whole page can be cached at the web server/proxy level, leading to less load on your servers.
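A minimal sketch of that kind of rewriting scheme in .htaccess (the index.php entry point is an assumption; the same idea works with any server-side router):

    RewriteEngine On
    # let real files and directories through untouched
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    # route every other "pretty" URL to a single entry point that renders the page server-side
    RewriteRule ^ index.php [L]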
Hope that helps.
