I just want to deindex or remove the page whose URL ends with #!.
It is a dynamic page that fetches data from a database and renders at two URLs:
example.com/detail/297#!
example.com/detail/297
If you want to deindex both of those URLs you should return a "410 Gone" HTTP status for them. (Or even a "404 Not Found" status would work but it isn't as descriptive.)
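For instance, here is a minimal sketch of returning a 410 for that page, using Node.js purely as an illustration (the question doesn't say what the server stack actually is). Because the browser strips the #! fragment before sending the request, this covers both URLs at once:

const http = require('http');

http.createServer(function (req, res) {
  // "/detail/297" is the path from the question; the "#!" fragment
  // never reaches the server, so both URL variants match here.
  if (req.url === '/detail/297') {
    res.writeHead(410, { 'Content-Type': 'text/plain' });
    res.end('Gone');
    return;
  }
  // ...normal handling for every other URL
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('OK');
}).listen(8080);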
If you want to deindex only the version with the hash bang, that is a little harder. URLs with the hash share the same HTTP request, response, and status code as the version without the hash, which means that handling the hash portion of the URL can only be done with JavaScript. You will need to insert JavaScript into the page that detects the presence of the hash bang and implements one of the following (a sketch of the last option follows this list):
Changes the location.href to remove the hash bang
Changes the location.href to the URL of a "410 Gone" error page
Dynamically inserts a <meta name="robots" content="noindex"> tag into the <head> of the page.
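A minimal sketch of the last option, assuming the script runs early in the <head> and that the hash bang is the bare "#!" from the question:

if (window.location.hash.indexOf('#!') === 0) {
  // Tell crawlers that execute JavaScript not to index this variant.
  var meta = document.createElement('meta');
  meta.name = 'robots';
  meta.content = 'noindex';
  document.head.appendChild(meta);
}

For the first option, window.location.replace(window.location.pathname + window.location.search) inside the same if block would drop the fragment without adding an extra entry to the browser history. Either way, bear in mind this only affects crawlers that execute JavaScript.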
I have an HTML page which picks a random .txt file as input and displays a graph using amCharts (a JS framework). Every 10 minutes the old .txt file is removed and a new one is created automatically. My HTML keeps picking up the old .txt from cache instead of the newly generated one.
I've tried
<meta http-equiv='cache-control' content='no-cache'>
<meta http-equiv='expires' content='0'>
<meta http-equiv='pragma' content='no-cache'>
but they're not working.
The meta tags only determine how the page itself is cached, not how any files that you load from the page are cached.
If you can use server scripting to handle the requests for the text file, you could add the cache settings to the response as HTTP headers, to set the cacheability of the request.
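A minimal sketch of that approach, assuming Node.js as the server scripting (the question doesn't say what the server runs; the same headers can be set from any server-side language):

const http = require('http');
const fs = require('fs');

http.createServer(function (req, res) {
  if (req.url === '/data.txt') {
    // These headers tell the browser never to reuse a cached copy.
    res.writeHead(200, {
      'Content-Type': 'text/plain',
      'Cache-Control': 'no-cache, no-store, must-revalidate',
      'Pragma': 'no-cache',
      'Expires': '0'
    });
    fs.createReadStream('data.txt').pipe(res);
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);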
Otherwise you would need to make the URL that you use to request the file unique each time by adding a parameter to it. You can, for example, use JavaScript to generate a random number and add it as a parameter, so that you request for example data.txt?8973624895723405 instead of just data.txt.
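A minimal sketch of that cache-busting approach, assuming the file is loaded with XMLHttpRequest ("data.txt" stands in for whatever file name the page really uses):

// Appending the current timestamp makes every request URL unique,
// so the browser cannot satisfy it from its cache.
var url = 'data.txt?' + new Date().getTime();

var xhr = new XMLHttpRequest();
xhr.open('GET', url);
xhr.onload = function () {
  // hand xhr.responseText to the charting code here
};
xhr.send();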
We've implemented a new AJAX-based search on our website. When the user refines the results by applying additional filters, we add the parameters and their values after a # at the end of the main URL.
This was done to enable our users to share the URL of what they were viewing. The way it currently works, the page gets redirected and the content is generated first for the base URL; then a JavaScript function that executes onload looks at the parameters after the # and makes another AJAX hit.
Questions:
Why don't browsers send the # part to the server? The HTTP server doesn't even receive it. It's interesting, actually: browsers don't send it at all.
What is the best way to get the # values? I'm mainly looking to avoid the double hit we've implemented right now, where the content is loaded first and then another AJAX call applies the refinements.
The # value is an instruction to the browser to look for a named anchor in the document it is to load from the server. It is interpreted and actioned by the browser. The server can do nothing with it, so there's no point in sending it. If you're trying to use this for some other purpose then you'll run into difficulties - as you have found.
There is a mechanism for sending data to the server: the querystring. Append your parameters to the URL prefixed by a ?, in the form variablename=data, with successive variables separated by a &.
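For example, a minimal sketch of building such a querystring from the active filters (the filter names here are made up):

// Hypothetical filters the user has applied.
var params = { category: 'shoes', color: 'red' };

var query = Object.keys(params).map(function (key) {
  return encodeURIComponent(key) + '=' + encodeURIComponent(params[key]);
}).join('&');

// Produces "/search?category=shoes&color=red"; unlike a # fragment,
// the server receives the whole thing in one request.
var url = '/search?' + query;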
I have a form with a fileupload control in it, and I call form.submit with a success function.
My server side does the usual trick of setting the content type to text/html to get it to arrive in one piece.
In the success function, action.response.responseText does contain the JSON which I sent.
When it leaves the server, it looks like:
{
    html: "<div>a div</div>"
}
When it arrives in the success function, the tags are missing. What's going on? Do I need to put some sort of html cdata wrapper around the entire response on the server to avoid this?
A string in a string in JSON. As long as it is well-formed you are allowed to put HTML in string values (making sure you escape quotes etc.).
It's probably the function you're using to insert the HTML that's stripping the tags.
Here's the situation. When you ask ExtJS or jQuery to do Ajax for a form with a file upload, it has to use an iframe. For the response to come back correctly, it has to be of content type text/html, whatever is in it. So it has to have its HTML characters escaped, which I accomplished with a function from Commons Lang.
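For completeness, a minimal sketch of the client side of that trick (a generic illustration, not the ExtJS internals): because the server HTML-escaped the JSON, the browser parses the entities back into plain text instead of into stray elements, and the iframe's text content yields the original JSON intact.

// "uploadFrame" is assumed to be the hidden iframe the form targeted.
var frame = document.getElementById('uploadFrame');

// textContent returns the entity-decoded text, i.e. the JSON string
// exactly as it was before the server escaped it.
var json = frame.contentDocument.body.textContent;
var data = JSON.parse(json);
// data.html now contains "<div>a div</div>"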
How could a client detect if a server is using search engine optimization techniques such as mod_rewrite to implement "SEO friendly URLs"?
For example:
Normal URL:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the HTML generated dynamically, so that it changes between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As said before: are there <link rel="canonical"> tags? With those, the website tells the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1".
If there is no mod_rewrite, the server will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases will return a 200 page printing an error. (A sketch of this probe follows.)
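A minimal sketch of that probe, assuming Node.js and using the question's placeholder host:

const http = require('http');

// Request /pic/999999 instead of /pic/1 and look at the status code.
http.get('http://somedomain.com/pic/999999', function (res) {
  if (res.statusCode === 404) {
    console.log('likely a real path, no rewriting involved');
  } else {
    console.log('status ' + res.statusCode +
      ': possibly rewritten to a script that handles the error itself');
  }
  res.resume(); // discard the body
});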
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually the words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with fully URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
Such that the SEO aspect of the URL is the suffix. The algorithm to apply is to typify each "folder" after the common base, assigning it a "datatype" - numeric, text, alphanumeric - and then score as follows (a rough code sketch appears at the end of this answer):
HTTP Response Code is 200: should be obvious, but you can get a 404 at www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non-Numeric, with Separators, Spell-Checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check; score this if the words are valid - including proper names.
Spell-Checked URL Text on Page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell-Checked URL Text on Page Inside a Tag: if the prior is true, score again if the text in its entirety appears inside an HTML tag.
Tag is Important: if the prior is true and the tag is a <title> or <h#> tag.
Usually with this approach you'll have a max of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. You could probably improve this by using a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus come up with some other clever featurizations. But then you've got to train the algorithm, which may not be worth it.
Now, based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because the query parameters have become part of the URL instead. In that case you can still typify the suffix's folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as being SEO friendly as well.
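Here is that rough sketch of the scoring heuristic in code. The dictionary and the inputs are stand-ins; a real implementation would use an actual spell checker and the fetched page HTML, and would keep the four text checks separate rather than collapsing the last two:

// Stand-in dictionary; a real spell checker would replace this.
var DICTIONARY = ['man', 'bites', 'dog', 'beauty', 'not', 'just', 'skin', 'deep'];

function scoreUrl(path, statusCode, pageText, importantTagText) {
  var score = 0;
  if (statusCode === 200) score++; // check 1: the page actually exists

  var folders = path.split('/').filter(Boolean);
  var slug = folders[folders.length - 1]; // e.g. "man-bites-dog"
  var words = slug.split(/[-_ ]+/);

  var spelled = !/^\d+$/.test(slug) && words.every(function (word) {
    return DICTIONARY.indexOf(word.toLowerCase()) !== -1;
  });
  if (spelled) score++; // check 2: non-numeric, separated, spell-checked

  var phrase = words.join(' ').toLowerCase(); // "man bites dog"
  if (spelled && pageText.toLowerCase().indexOf(phrase) !== -1) {
    score++; // check 3: the URL text appears on the page
  }
  if (spelled && importantTagText.toLowerCase().indexOf(phrase) !== -1) {
    score += 2; // checks 4 and 5 collapsed: inside a tag, and an important one
  }
  return score; // 0-5, higher means more likely SEO friendly
}

// e.g. scoreUrl('/article/2011/06/15/man-bites-dog', 200, bodyText, titleText)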
I presume you would be using one of the curl variants.
You could try sending the same request but with different "User-Agent" values.
That is, send the request once using user agent "Mozilla/5.0" and a second time using user agent "Googlebot". If the server is doing something special for web crawlers, there should be a different response.
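A minimal sketch of that comparison, assuming Node.js rather than curl (the host is the question's placeholder):

const http = require('http');

function fetchAs(userAgent, done) {
  http.get({ host: 'somedomain.com', path: '/pic/1',
             headers: { 'User-Agent': userAgent } }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () { done(body); });
  });
}

// Fetch the same URL as a browser and as a crawler, then compare.
fetchAs('Mozilla/5.0', function (browserBody) {
  fetchAs('Googlebot', function (crawlerBody) {
    console.log('server varies by user agent:', browserBody !== crawlerBody);
  });
});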
With today's frameworks and the URL routing they provide, I don't even need to use mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change. I don't know how legal that is, to be honest.
For the dynamic URL pattern, it's better to use a <link rel="canonical" href="..." /> tag for the other duplicates.
I have read a lot about URL rewriting but I still don't get it.
I understand that a URL like
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=19
can be replaced with a friendlier one like
http://www.example.com/Blog/2006/12/19/
and the server code can remain unchanged because there is some filter which transforms the new URL into the old one, but does it also replace the URLs in the HTML of the response?
If the server code remains unchanged then it is possible that in my returned HTML code I have links like:
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=20
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=21
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=22
This defeats the purpose of having the nice URLs if in my page I still have the old ones.
Does URL rewriting (with a filter or something) also replace this content in the HTML?
Put another way... do the rewrite rules apply for the incoming request as well as the HTML content of the response?
Thank you!
The URL rewriter simply takes the incoming URL and if it matches a certain pattern it converts it to a URL that the server understands (assuming your rewrite rules are correct).
It does mean that a specific resource can be accessed multiple ways, but this does not "defeat the point", as the point is to have nice looking URLs, which you still do.
They do not rewrite the outgoing content, only the incoming URL. If you want the friendly URLs to appear in your pages, your server code has to generate its links in the friendly form itself.
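A minimal sketch of that incoming-only behaviour, with Node.js standing in for the rewrite filter (in practice this would be mod_rewrite rules or a servlet filter):

const http = require('http');

http.createServer(function (req, res) {
  // Incoming: /Blog/2006/12/19/ -> /Blog/Posts.php?Year=2006&Month=12&Day=19
  var m = req.url.match(/^\/Blog\/(\d{4})\/(\d{1,2})\/(\d{1,2})\/?$/);
  if (m) {
    req.url = '/Blog/Posts.php?Year=' + m[1] + '&Month=' + m[2] + '&Day=' + m[3];
  }
  // The unchanged server code now handles req.url as before. Whatever HTML
  // it produces is sent back untouched - old-style links included.
  res.end('would dispatch to ' + req.url);
}).listen(8080);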