We have 2 million businesses in our database. My client wants to create 1 static page for each business so that the pages are indexed by Google. He wants each page to have the following URL structure:-
http://website/business-name/category/city/zip/
So an example page could have the URL:-
http://example.com/pizza-hut/restaurant/newyork/12345/
What I want to know is the disadvantages of making this much static pages and any advantages they have? Thanks.
A clear disadvantage is that this makes a lot of files to manage, for you, the server, the cache system, the file system and the source control system.
And that's totally useless as you can can let the rest of the world, including Google bots, think there are different pages without making different pages, using htaccess rules : they allow URL rewriting.
There is no difference between a static and generated web page to Google. It will crawl the page if there is a link to it. If there is no link to a static page it has no way of finding it. So there is no advantage of a static page unless it is faster to read from disk then to generate on the fly.
Related
I'm working on a site where we are using the slide function from jquery-ui.
The Google-hosted minified version of jquery-ui weighs 63KB - this is for the whole library. The custom download of just the slide function weighs 14KB.
Obviously if a user has cached the Google hosted version its a no-brainer, but if they haven't it will take longer to load as I could just lump the custom jquery-ui slide function inside of my main.js file.
I guess it comes down to how many other sites using jquery-ui (if this was just for the normal jquery the above would be a no-brainer as loads of sites use jquery, but I'm a bit unsure as per the usage of jquery-ui)...
I can't work out what's the best thing to do in the above scenario?
I'd say if the custom selective build is that small, both absolutely and relatively, there's a good reasons to choose that path.
Loading a JavaScript resource has several implications, in the following order of events:
Loading: Request / response communication or, in case of a cache hit - fetching. Keep in mind that CDN or not, the communication only affects the first page. If your site is built in a traditional "full page request" style (as opposed to SPA's and the likes), this literally becomes a non-issue.
Parsing: The JS engine needs to parse the entire resource.
Executing: The JS engine executes the entire resource. That means that any initialization / loading code is executed, even if that's initialization for features that aren't used in the hosting page.
Memory usage: The memory usage depends on the entire resource. That includes static objects as well as function (which are also objects).
With that in mind, having a smaller resource is advantageous in ways beyond simple loading. More so, a request for such a small resource is negligible in terms of communication. You wouldn't even think twice about it had it been a mini version of the company logo somewhere on the bottom of the screen where nobody even notices.
As a side note and potential optimization, if your site serves any proprietary library, or a group of less common libraries, you can bundle all of these together, including the jQuery UI subset, and your users will only have a single request, again making this advantageous.
Go with the Google hosted version
It is likely that the user would have recently visited a website that loads jQuery-UI hosted on Google servers.
It will take load off from your server and make other elements load faster.
Browsers load a fixed number of resources from one domain. Loading the jQuery-UI from Google servers will make sure it is downloaded concurrently with other resource that reside on your servers.
The Yahoo developer network recommends using a CDN. Their full reasons are posted here.
https://developer.yahoo.com/performance/rules.html
This quote from their site really seals it in my mind.
"Deploying your content across multiple, geographically dispersed servers will make your pages load faster from the user's perspective."
I am not an expert but my two cents are these anyway. With a CDN you can be sure that there is reduced latency, plus as mentioned, user is most likely to have picked it up from some other website hosted by googleAlso the thing I always care about, save bandwidth.
I am building a single-page website that brings text html fragments. The fragments are in static html files stored at the server. Do I have to provide parallel pages for SEO indexing? I am hoping this technology improved.
Please be more specific, so we can help you better.
Escaped fragments is the recommended way to get Google to index your AJAX-site. Read more about them here.
I don't know what you mean by parallel, but ?_escaped_fragment_=key=value should result in a HTML snapshot of the page #!key=value
I have a problem with someone (using many IP addresses) browsing all over my shop using:
example.com/catalog/category/view/id/$i
I have URL rewrite turned on, so the usual human browsing looks "friendly":
example.com/category_name.html
Therefore, the question is - how to prevent from browsing the shop using "old" (not rewritten) URLs, leaving only "friendly" URLs allowed?
This is pretty important, since it is using hundreds of threads which is causing the shop to work really slow.
Since there are many random IP addresses, clearly you can't just block access from a single or small group of addresses. You may need to implement some logging that somehow identifies this crawler uniquely (maybe by browser agent, or possibly with some clever use of the Modernizr javascript library).
Once you've been able to distinguish some unique identifiers of this crawler, you could probably use a rule in .htaccess (if it's a user agent thing) to redirect or otherwise prevent them from consuming your server's oomph.
This SO question provides details on rules for user agents.
Block all bots/crawlers/spiders for a special directory with htaccess
If the spider crawls all the urls of the given pattern:
example.com/catalog/category/view/id/$i
then you can just kill these urls in a .htaccess. The rewrite is made internally from category.html -> /catalog/category/view/id/$i so, you only block the bots.
Once the rewrites are there ... They are there. They are stored in the Mage database for many reasons. One is crawlers like the one crawling your site. Another is users that might have the old page bookmarked. There are a number of methods individuals have come up with to go through and clean up your redirects (Google) ... But as it stands, in Magento, once they are there, they are not easily managed using Magento.
I might suggest generating a new site map and submitting it to the crawler affecting your site. Not only is this crawler going to be crawling tons of pages it doesn't need to, it's going to see duplicate content (bad ju ju).
We are currently reworking a WebForms based application to a MVC3/Razor system, and have hit a minor issue.
In our current solution, we have a large number of resources held within an external CMS that are compiled into our .aspx pages via a GlobalResource handler; this means that the hit on the CMS is low, and that the resources are only ever needed to be collected once for each individual page.
We can't seem to find any mechanism within Razor/MVC3 that would allow us to do the same thing - any pointers?
I'm not 100% sure I understand your question, but it sounds like you are using an HttpHandler to serve up a bundle of resources. There's no reason that code can't continue to work in MVC3 as-is. If you're doing something else, I'll need a little more clarification :). What's the underlying goal? To reduce the number of Http Requests per page?
I'm in the process of creating a sitemap for my website. I'm doing this because I have a large number of pages that can only be reached via a search form normally by users.
I've created an automated method for pulling the links out of the database and compiling them into a sitemap. However, for all the pages that are regularly accessible, and do not live in the database, I would have to manually go through and add these to the sitemap.
It strikes me that the regular pages are those that get found anyway by ordinary crawlers, so it seems like a hassle manually adding in those pages, and then making sure the sitemap keeps up to date on any changes to them.
Is it a bad to just leave those out, if they're already being indexed, and have my sitemap only contain my dynamic pages?
Google will crawl any URLs (as allowed by robots.txt) it discovers, even if they are not in the sitemap. So long as your static pages are all reachable from the other pages in your sitemap, it is fine to exclude them. However, there are other features of sitemap XML that may incentivize you to include static URLs in your sitemap (such as modification dates and priorities).
If you're willing to write a script to automatically generate a sitemap for database entries, then take it one step further and make your script also generate entries for static pages. This could be as simple as searching through the webroot and looking for *.html files. Or if you are using a framework, iterate over your framework's static routes.
Yes, I think it is not a good to leave them out. I think it would also be advisable to look for a way that your search pages can be found by a crawler without a sitemap. For example, you could add some kind of advanced search page where a user can select in a form the search term. Crawlers can also fill in those forms.