Are clean URLs a backend or a frontend thing - mod-rewrite

What do you think: are clean URLs a backend or a frontend 'discipline'?

The answer is BOTH.
For example:
https://stackoverflow.com/questions/203278/are-clean-urls-a-backend-or-a-frontend-thing
The number in that URL (203278) is a database id, a back-end thing. Chop off the pretty part and it still goes to the same page, so the "are-clean-urls-a-backend-or-a-frontend-thing" slug is the front-end part.
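As an illustration only (this is not Stack Overflow's actual implementation), a minimal Flask-style sketch of that pattern might look like the following, where load_question and render_question are hypothetical helpers:

    # Hedged sketch: the numeric id drives the database lookup (back end);
    # the slug is decorative and only used to normalise the pretty URL (front end).
    from flask import Flask, redirect, url_for

    app = Flask(__name__)

    @app.route("/questions/<int:question_id>")
    @app.route("/questions/<int:question_id>/<slug>")
    def question(question_id, slug=None):
        q = load_question(question_id)  # hypothetical lookup by database id
        if slug != q.slug:
            # Requests without (or with a stale) slug are redirected to the canonical pretty URL.
            return redirect(url_for("question", question_id=question_id, slug=q.slug), code=301)
        return render_question(q)       # hypothetical rendering helper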

If we're talking about URLs being 'clean' from an end-user experience, then I'm going to break the mould a bit and say that URLs in general are not intuitive and never will be; they are intended to be machine readable.
There is no standard for the format of a URL, so when navigating from site to site humans will never remember how to reach a resource purely by recalling URLs and their 'friendly syntax'. We can argue the toss about whether '?' and '&' or '/' is the better way to identify a resource in a URL, but it doesn't matter: at the end of the day a machine parses it and sends back the result.
We should stop deluding ourselves that people actually type these things in and accept that URIs are for machines, not people.
I have yet to use or remember a URI beyond the first few characters of the http://domain.com/ part of an address, and I've been using the web for a long time. That's what bookmarks are for. Nowhere on a website does it say 'change this part of our URL to view such-and-such resource', because URLs are usually undocumented and opaque.
Yes, make your URIs SEO friendly (hell, even those change periodically), but forget about the whole 'human/clean' resource identifier thing; it's a mystical pipe dream.
I agree with Vlion that URLs should provide a unique mechanism to bookmark a resource and return to it (unlike some of these abominable Web 2.0 Ajax/Silverlight/Flash creations), but the bookmark will never be something humans comprehend and understand. There seems to be a lot of preoccupation and energy spent dreaming up URL strategies that humans can remember and type in; it's a waste of energy. Let's get on and solve real problems.
Sorry for the rant, but there's a lot of Web 2.0 nonsense related to URLs going on in certain circles that is just a total waste of time.

Now that Firefox's Awesome Bar and Google Chrome's Omnibox can search browsing history, it is much easier for users to find previously visited sites, so having clean URLs may help users locate sites in their history more easily.
Making sure the page has an appropriate title is important (both browsers search the title as well as the URL), but if the URL contains relevant keywords too, it is more likely to show up high in the suggestions when those keywords are typed into the address bar, since each keyword is matched twice: once in the URL and once in the title.
Also, once a user has typed the name of a site they will be presented with example URLs from that site, which they can then use as a template for narrowing down their search. So using verbs and nouns in the URL for different sections or actions of the site helps the user narrow their search to just the part of the site they are interested in, e.g. the /questions/ or /tag/ sections of Stack Overflow, or the "/doc" at the end of docs.google.com/doc that can be used to view just document pages on Google Docs*.
Since both Firefox and Chrome search for each space-separated word typed into the address bar, it could be argued that the URL doesn't need to be completely human readable for searching; but for the user to actually pick out the keywords they care about, the amount of "noise" in the URL should be kept to a minimum.
* which are of the form http://docs.google.com/Doc?id=gibberish

My perspective is simple:
Every place I visit with my browser (with various edge-case exceptions) should be bookmarkable, and Forward/Back should be usable without destroying any data entry.

Backend for sure. Your server is the one that has to take care of the routing to the resources requested by the URL.

I think the main reasons for using friendly URLs are:
Ease of linking / sharing
Presentation
SEO
So I think it's purely a client-side pleasure. While they're nice on the server as well, they're not mission critical.

Related

Getting most relevant content from page

I need to create a universal web scraper to parse articles on different websites. Of course, I know about XPath, but I want to try to make it universal for any website, regardless of the HTML markup of the page.
I need to determine whether there is an article on the page and, if there is, parse the title, body and tags (if they exist).
Frankly speaking, my knowledge of data science is not very deep, but I assume this task (determining whether the page is an article, and parsing only the needed parts) is possible to solve.
What tools should I use? Any help?
Actually, for the second task, I need to implement something similar to what Google Chrome does on mobile: when a page is not optimised for mobile, it offers to show the page in a simplified, adaptive mode (just the title and main content).
If you are using Python, some libraries to look at are:
Scrapy, which crawls pages and can extract some of the results, and
BeautifulSoup, which is geared more towards the extraction part itself.
It is possible to request a particular version of a website (e.g. the one served to Chrome, Safari, mobile browsers, or old-school systems) by setting a custom header in your scraper.
Have a look at the relevant documentation, and you can get an idea of how to use headers in Scrapy here.
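As a rough, hedged sketch of those two pieces together (assuming the requests and beautifulsoup4 packages; the mobile User-Agent string and the tag heuristics are illustrative, not a recommendation):

    # Hedged sketch: fetch a page as a mobile browser and pull out a title plus paragraph text.
    import requests
    from bs4 import BeautifulSoup

    headers = {
        # Illustrative mobile User-Agent; substitute whatever browser you want to impersonate.
        "User-Agent": "Mozilla/5.0 (Linux; Android 10) AppleWebKit/537.36 Mobile Safari/537.36"
    }
    response = requests.get("https://example.com/some-article", headers=headers, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else None
    # Crude heuristic: prefer an <article> element if present, otherwise fall back to all <p> tags.
    container = soup.find("article") or soup
    body = "\n".join(p.get_text(strip=True) for p in container.find_all("p"))

    print(title)
    print(body[:500])

In Scrapy the equivalent is to set the USER_AGENT setting (or DEFAULT_REQUEST_HEADERS), or to pass headers on each Request.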
I do not know of any more specialised tools. Your tasks are more analytical and are not typically performed with models that estimate, e.g., what content is where on a webpage. This might be an interesting research direction, though: to see if you can create a model that generalises across many websites to extract the desired content.
That leads me to my last point, which is that creating a single scraper that works for any website (containing your article type) is not usually possible. People create websites differently, however they see fit, which means they also change them. This usually leads to a good scraper requiring constant updates as time (and developers) move on.
EDIT:
Then, if you have lots of labelled examples, it might be possible to train a model. The challenge might be the look-back range of the model. For example, a typical LSTM model is given a parameter that tells it how far back it can look, and that context is stored internally in its memory. In your case, you might be looking for the start and end HTML tags of an article in order to extract just that part, and those tags could be thousands of words apart - something a standard LSTM might not be able to retain and use.
If you could pose your problem a little differently, other approaches might be plausible. E.g. you could frame it as a "question-answering" problem: "I have this HTML, where is the article content?" If that sounds OK for your use case, have a look here for some model-based approaches.
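For what it's worth, a minimal sketch of that question-answering framing with the Hugging Face transformers library might look like the following (the default model choice and the assumption that you first strip the HTML down to plain text are mine, not the answer's):

    # Hedged sketch: treat "where is the article content?" as extractive QA over the page text.
    from transformers import pipeline

    qa = pipeline("question-answering")  # downloads a default extractive QA model

    page_text = "...plain text extracted from the HTML..."  # assumption: tags stripped beforehand
    result = qa(question="What is the main article content?", context=page_text)
    print(result["answer"], result["score"])

Note that extractive QA models have a limited input length, so long pages would need to be chunked; treat this as a direction to explore rather than a turnkey solution.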

Internationalizing Title/Meta Tags ok or bad practice?

Is there a problem if I have both English and Chinese versions of the same title/meta tags under the exact same URL? I detect the language the user has set for their browser (through the HTTP "Accept-Language" header) and change the titles/meta tags based on that language. I get a large percentage of my traffic from China and felt this was a better-localized user experience for those users, BUT I have no idea how Google would view this. My gut feeling tells me that this is not good for SEO.
Baidu.com, a major Chinese search engine, does in fact pick up my translated tags; however, for other US-based sites it does not translate their English title/meta tags into Chinese, and I would think Chinese users are less likely to click on those.
Creating subdomains and/or separate domains for other countries is not an option at this point. That being said, should I stick to only one language (English) for my title/meta tags to avoid any search engine issues?
Thanks for any advice / wisdom you can offer. Really hoping to get clarity on best practices.
Thanks all!
Yes, it probably is a problem: search engines see mixed-language content. You are not describing how you "detect and change the titles/meta tags based on the user's browser language", but you are probably doing it client-side using the "browser language", which is the wrong signal however you interpret it (it does not specify the user's preferred language).
To get a more targeted answer, ask a more concrete question, with a URL.
If you want to get search traffic in both English and Chinese, you should have two URLs instead of one.
When Googlebot crawls a page, it does not even send an "Accept-Language" header, so it will only ever see your default language. With a single URL there is no way to get your second language indexed, and you won't rank in search engines in multiple languages.
For best SEO, use separate top-level domains, subdomains, or folders for different languages:
http://example.de/
http://example.es/
http://example.com/
http://de.example.com/
http://es.example.com/
http://www.example.com/
http://example.com/de/
http://example.com/es/
http://example.com/en/
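As a hedged sketch of the folder approach (Flask is assumed here purely for illustration, and render_home is a hypothetical helper): each language lives at its own crawlable URL, and Accept-Language is only used at the bare root as a convenience redirect.

    # Hedged sketch: one indexable URL per language; Accept-Language only picks a default.
    from flask import Flask, request, redirect

    app = Flask(__name__)
    SUPPORTED = ["en", "zh"]

    @app.route("/")
    def root():
        # Convenience only: send visitors to a language folder based on their browser preference.
        lang = request.accept_languages.best_match(SUPPORTED) or "en"
        return redirect(f"/{lang}/", code=302)

    @app.route("/<lang>/")
    def home(lang):
        if lang not in SUPPORTED:
            return "Not found", 404
        return render_home(lang)  # hypothetical: serves the title/meta tags in that language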
I think there is no problem when you use English and Chinese in the same meta tags.

formatting a string with html for localization in spring

I have a string with a link in it,
e.g.: Under maintenance. <a href="...">Try again</a> after sometime
I want to pass the complete string to the translators.
I see that there are two options
1. label.maintenance = Under maintenance. {0}Try again{1} after sometime (pass the anchor tags as variables)
2. label.maintenance = Under maintenance. <a href="...">Try again</a> after sometime (pass the string with the HTML included)
In the first case, if the translator misplaces {0} and {1}, that may break the formatting of my page.
In the second case, the translator has to understand HTML and could possibly inject incorrect links.
What is the best way to achieve this?
I have had this problem multiple times before, and I came to the conclusion that the second option is best. Although it requires that the translator knows what HTML is all about, it turns out to give the best results.
It communicates clearly what is a link, and if the translator needs to re-order the sentence (which happens all too often), (s)he knows which part belongs to the link.
In the first example, the translator wouldn't know why the sentence contains placeholders or what to do with them. At best, (s)he will ask; at worst, the translator could re-order them or move the translated "Try again" somewhere else and put other words between the placeholders.
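To make the placeholder risk concrete, here is a framework-agnostic sketch (plain Python string formatting standing in for Java's MessageFormat; the /retry link is made up):

    # Hedged sketch: option 1 substitutes the anchor tags into numbered placeholders.
    open_tag, close_tag = '<a href="/retry">', '</a>'

    good = "Under maintenance. {0}Try again{1} after sometime"
    print(good.format(open_tag, close_tag))
    # -> Under maintenance. <a href="/retry">Try again</a> after sometime

    # If a translation swaps the placeholders, the markup silently breaks:
    bad = "Under maintenance. {1}Try again{0} after sometime"
    print(bad.format(open_tag, close_tag))
    # -> Under maintenance. </a>Try again<a href="/retry"> after sometime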
The conclusion is: leave the HTML in the translatable resources. At the same time, try to hire experienced software translators rather than general-purpose translators. Chances are high that they will understand localization concepts and use a common glossary of software terms (e.g. the one from Microsoft). That will make your software easier for end users to understand and reduce the number of localization defects you need to fix. In the end, hiring a software translator may cost less than a "cheap" regular one...

Pretty URLs Vs. Duplicate Content

I'm trying to clear up a grey area about this much talked about topic...
Like most devs, I've made some pretty URLs with mod_rewrite. My site's internal links point to the pretty URLs and things are working nicely.
But, I can still access the old URL if I point to it directly.
Now, this is most certainly going to cause duplicate content issues so after doing some research it seems that 301 redirects are the way to go.
But.... and here's the grey bit...
If you are working on a site with thousands of URLs, what's the best practice to achieve this? I don't want to list 1k+ lines in .htaccess. I thought of a regexp in my rewrite rule, but my pretty URLs contain names from the database... and I can't access that from .htaccess :)
Have I hit a dead end? Is there a way around this? Would Google's canonical tag be a possibility?
Well, I don't know if this is the "definitive" answer, but I have a bunch of "functional" URLs like:
http://www.flipscript.com/product.aspx?cid=7&pid=42&ds=asdjlf8i7sdfkhsjfd978
but I remap the URLs, link to them and list them in my site map as:
http://www.flipscript.com/ambigram-ring.aspx
I haven't seen ANY evidence that multiple URLs pointing to the same content within the same domain have any negative impact on SEO.
In fact, over the past year, I have climbed to the #1 position on Google with this in place for my primary keyword.
My theory about why this should be so is that Google applies the duplicate content penalty for entire "clone sites", not for just linking with different URLs to the same content within a single site.
A quick and dirty way would be to re-route everything on the site through a PHP file that checks whether the path is still valid, querying the database if necessary, and issues a 301 redirect if the path has permanently moved. Soon enough these "grey" URLs should hardly ever come up, and the indexes should be updated across search engines, at which point you can remove the router.
If you could show what your "grey" URLs look like, I may be able to suggest a better alternative.
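The answer above suggests PHP; as a hedged sketch of the same idea, shown in Flask purely for illustration, with lookup_by_any_known_path and render_page as hypothetical helpers:

    # Hedged sketch: route every request through one handler that knows both old and pretty paths.
    from flask import Flask, abort, redirect

    app = Flask(__name__)

    @app.route("/<path:requested_path>")
    def router(requested_path):
        page = lookup_by_any_known_path(requested_path)  # hypothetical database lookup
        if page is None:
            abort(404)
        if requested_path != page.pretty_path:
            # Old/ugly URL: permanently redirect to the canonical pretty URL.
            return redirect("/" + page.pretty_path, code=301)
        return render_page(page)  # hypothetical rendering helper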
"Would Google's canonical tag be a possibility??" -- Why not?
--> It automatically transfers PageRank.
--> Google recommends the canonical tag even if the content differs slightly, as long as it is more or less similar.
--> Too many 301 redirects to pages within a site are bad for SEO (my personal experience with Bing).
--> Too many 301 redirects also increase the effective load time of content for your users (especially bad if the ping time from their location to your server is high).

mod_rewrite and redundant / old urls, some SEO best practices needed

Having a look at how google perceives our site at the moment and coming up short...
Basically, we use a bog-standard URL-rewriting structure to make URLs look SEO friendly.
For instance, a product URL takes the shape of any-string_([0-9]+).html and so on. Of course, this allows us to link to whatever we want before the product id... which we have done. In the past, a product page was Product_Name_79.html and then became Brand_Name_Product_Name_79.html. Apache does not really care, and id 79 gets passed on in either case. However, Google now has two versions of this product cached under different URLs, and that's not a good thing, as it keeps arriving at the first URL and spidering it.
The same applies to our rewrite rules for brands and categories, some of which have been dropped and some of which have been modified.
There are over 11k URLs in site:domain, whereas our sitemap contains only some 5.8k. How would you prevent spiders from fetching older versions of URLs that you no longer link to (considering it's not a manual process and such URLs can often be very dynamic)?
E.g. Mens_Merrell_Trail_Running_Shoes__50-100__10____024/ is a dynamic URL for the Merrell brand, narrowed down to items in trail running shoes that cost between 50 and 100, in size 10, with gender set to men's.
If we decide to nofollow any size and price filter URLs, that still leaves Google able to reach them through its old cache...
What is the best practice for disallowing a particular type of URL? As the combinations above are nearly infinite, I cannot produce a list, and it certainly cannot be backdated against whatever brands and categories Google may hold for us historically.
Shall we add noindex when such filters are applied? Shall we list them in robots.txt? Or do nothing in the hope that Google stops returning them?
To put it into perspective, we have 2,600 product page URLs that are now redundant/disabled. What would you do with them: redirect to the homepage, the brand page, a 404, or nothing?
thanks for any advice
I think you're looking for rel="canonical"; Google should start ignoring your old links if they're really not linked to any more. You can check for incoming links with a tool like this: http://www.seomoz.org/linkscape.
Also, if your old URLs match (or don't match) a consistent pattern, you could set up a 301 redirect in Apache, either for pages matching the old pattern or for pages not matching the new pattern...
hope this helps!
Just be sure to set up redirects for any URL you change. Also, I don't recommend using rel=nofollow since it indicates to Google that your site is not trustworthy.
