Why does Twitter use a hash and exclamation mark in URLs, and how do they rewrite search URLs?

We understand the hash is for AJAX searches, but the exclamation mark? Anyone know?
Also, the "action" attribute for their search form points to "/search," but when you conduct a search, the hash exclamation mark appears in the URL. Are they simply redirecting from "/search" to "/#!/search"?
Thanks!
Note: the second part of the question remains unanswered: that is, are they redirecting the user from "/search" to "/#!/search", or do they send the user to "/search" and use JS on the page to rewrite the URL? – Crashalot Jan 26 at 23:51

It's the de facto standard that Google established to keep AJAX URLs consistent and crawlable.
See http://code.google.com/web/ajaxcrawling/docs/getting-started.html
I believe they are using history.pushState. You can do history.back() in the console and it'll lead you back to the page.
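If so, a minimal sketch of that pattern looks like the following (the #content ID and render function are made up for illustration; the real Twitter code is far more involved):

function render(url, push) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", url, true);
    xhr.onreadystatechange = function () {
        if (xhr.readyState === 4 && xhr.status === 200) {
            // Swap in the fetched content without a full page load
            document.getElementById("content").innerHTML = xhr.responseText;
            // Rewrite the address bar; the page does not reload
            if (push) history.pushState({ url: url }, "", url);
        }
    };
    xhr.send();
}
// Back/Forward (or history.back() in the console) fires popstate:
window.onpopstate = function (e) {
    if (e.state) render(e.state.url, false);
};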

Yes, it redirects with HTTP 302.
By the way, the "!" eliminates the empty-hash case: a bare "http://url#" makes the browser scroll to the top of the page.

To answer the second part then: It is redirecting you to /#!/search.
If you look at the response headers when going to http://twitter.com/britishdev (plug plug) you are returned a 302 (temporary redirect) with the Location header set as "Location: http://twitter.com/#!/britishdev"
Yes, JavaScript then pulls all your details in on the destination page, but regardless, that is where you are redirected.

Related

Adding a UNIQUE ID to the URL parameter in the XMLHttpRequest open method - what does this mean and how is it done?

I need help understanding AJAX. I am going through the tutorial on W3Schools (creating a button that opens a text file on the server and displays the result in a div).
One part of the tutorial seems abstract to me, without sufficient explanation. I am sure it's a prerequisite that I have missed or am not aware of; details below.
To avoid getting a cached result in response to an XMLHttpRequest made to the server, the tutorial says one needs to ADD A UNIQUE ID to the URL parameter in the XMLHttpRequest open method, which it does by adding a "?", a character ("t") and an "=" after the file extension, then joining a random number to the URL (using Math.random()). See the code below.
A simple GET request would be like:
xmlhttp.open("GET","demo_get.asp,true); \\I can understand this
Unique ID added to URL
xmlhttp.open("GET","demo_get.asp?t=" + Math.random(),true); \\ I can't undersatnd this
'?' , 't' & a random number generator added to demo_get.asp - Why T, why not P Q R Z ?? Why "?" after .asp
Should the compiler not go bonkers and report an error if arbitary characters are added to the file location. How is the part of the URL after the file extention handled as in this case ?t= + Math.random()
This has been a case of much agony and frustration for the last 3 days cause I don't get which part of JS i have missed here, what do you call this concept and where can I read it ??
Apart from this: specifying message headers while sending data - what are HTTP headers and what do they mean? How do I decide what the parameters of the setRequestHeader() method should be?
Please help. The rest of AJAX is clear to me.
(I haven't read up on the second part - the message headers. I have asked that query here to avoid posting another question later, just in case it turns out to be as elusive and enigmatic as the UNIQUE ID concept - apologies in case it's a simple question I ought to read up on myself.)
The cache compares the requested URL with those it already has stored; if a unique ID is added to the URL, it does not match, and the browser treats it as a fresh GET request, which is then forwarded to the server. This is a standard way to bypass/disable browser caching.
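As a minimal sketch of the idea (the helper name addCacheBuster is made up; a timestamp works just as well as Math.random()):

// The parameter name "t" is arbitrary - any name the server-side script
// ignores will do, as long as the value changes on every request.
function addCacheBuster(url) {
    var sep = url.indexOf("?") === -1 ? "?" : "&";
    return url + sep + "t=" + new Date().getTime();
}
var xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", addCacheBuster("demo_get.asp"), true);
xmlhttp.send();
// Requested URL: demo_get.asp?t=1388534400000 - one the cache has never seen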
Please refer to this document for a better understanding of browser caching.
See page 4, which explains the same thing as stated above.
http://www.f5.com/pdf/white-papers/browser-behavior-wp.pdf

Detecting URL rewrites (SEO urls)

How could a client detect whether a server is using search engine optimization techniques, such as mod_rewrite, to implement "SEO-friendly URLs"?
For example:
Normal url:
http://somedomain.com/index.php?type=pic&id=1
SEO friendly URL:
http://somedomain.com/pic/1
Since mod_rewrite runs server side, there is no way a client can detect it for sure.
The only thing you can do client side is to look for some clues:
Is the generated HTML dynamic, changing between calls? Then /pic/1 would need to be handled by some script and is most likely not the real URL.
As said before: are there <link rel="canonical"> tags? With those, the website tells the search engine which of several URLs with the same content it should use.
Modify parts of the URL and see if you get a 404. In /pic/1 I would modify the "1" (see the sketch after this list).
If there is no mod_rewrite, the server will return a 404. If there is, the error is handled by the server-side scripting language, which can return a 404 but in most cases returns a 200 page printing an error.
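A hypothetical client-side sketch of that last probe (the function name and mutated segment are made up; cross-origin targets would also need CORS):

// Replace the last path segment with something that surely doesn't exist,
// then check the status code. A plain static path usually 404s; a rewritten
// route often answers 200 with an error page rendered by a script.
async function looksRewritten(url) {
    var mutated = url.replace(/[^\/]+\/?$/, "does-not-exist-" + Date.now());
    var res = await fetch(mutated, { method: "HEAD" });
    return res.status === 200;
}
looksRewritten("http://somedomain.com/pic/1").then(console.log);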
You can use a <link rel="canonical" href="..." /> tag.
The SEO aspect is usually about words in the URL, so you can probably ignore any parts that are numeric. Usually SEO is applied over a group of like content, such that it has a common base URL, for example:
Base www.domain.ext/article, with fully URL examples being:
www.domain.ext/article/2011/06/15/man-bites-dog
www.domain.ext/article/2010/12/01/beauty-not-just-skin-deep
Such that the SEO aspect of the URL is the suffix. The algorithm to apply is: typify each "folder" after the common base, assigning it a "datatype" - numeric, text, alphanumeric - and then score as follows (a code sketch appears after this list):
HTTP response code is 200: should be obvious, but note you can get a 404 at www.domain.ext/errors/file-not-found that would pass the other checks listed.
Non-numeric, with separators, spell-checked: separators are usually dashes, underscores or spaces. Take each word and perform a spell check; score if the words are valid - including proper names.
Spell-checked URL text on page: if the text passes a spell check, analyze the page content to see if it appears there.
Spell-checked URL text on page inside a tag: if the prior is true, score again if the text in its entirety is inside an HTML tag.
Tag is important: if the prior is true and the tag is <title> or an <h#> tag.
Usually with this approach you'll have a max of 5 points, unless multiple folders in the URL meet the criteria, with higher values being better. You can probably improve this by using a Bayesian probability approach that uses the above to featurize URLs (i.e. detect the occurrence of some phenomenon), plus come up with some other clever featurizations. But then you've got to train the algorithm, which may not be worth it.
Now, based on your example, you also want to capture situations where the URL has been designed so that a crawler will index it because the query parameters are now part of the URL instead. In that case you can still typify the suffix's folders to arrive at patterns of data types - in your example's case, a common prefix always trailed by an integer - and score those URLs as SEO-friendly as well.
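A rough sketch of the typify-and-score idea (the tiny word list stands in for a real spell checker, and the weights are illustrative only):

// Score the path segments of a URL for "SEO-friendliness".
function scoreSeoUrl(url, pageText) {
    var dictionary = ["man", "bites", "dog", "beauty", "skin", "deep"];
    var score = 0;
    var segments = url.split("/").slice(3); // drop scheme and domain
    segments.forEach(function (seg) {
        if (/^\d+$/.test(seg)) return;      // numeric folders carry no words
        var words = seg.split(/[-_ ]+/);
        if (words.length < 2) return;       // no separators: weak signal
        var spelled = words.every(function (w) {
            return dictionary.indexOf(w.toLowerCase()) !== -1;
        });
        if (spelled) {
            score++;                        // words pass the "spell check"
            var phrase = words.join(" ");
            if (pageText.indexOf(phrase) !== -1) score++; // appears on page
        }
    });
    return score; // add further points for HTTP 200, <title>/<h#> hits, etc.
}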
I presume you would be using one of the curl variants.
You could try sending the same request but with different "user agent" values.
I.e. send the request once using user agent "Mozilla/5.0" and a second time using user agent "Googlebot". If the server is doing something special for web crawlers, then there should be a different response.
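For example (the exact Googlebot UA string below is illustrative; diff the two responses):

curl -i -A "Mozilla/5.0" http://somedomain.com/pic/1
curl -i -A "Googlebot/2.1 (+http://www.google.com/bot.html)" http://somedomain.com/pic/1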
With the frameworks today and the URL routing they provide, I don't even need mod_rewrite to create friendly URLs such as http://somedomain.com/pic/1, so I doubt you can detect anything. I would create such URLs for all visitors, crawlers or not. Maybe you can spoof some bot headers to pretend you're a known crawler and see if there's any change. Dunno how legal that is, tbh.
For dynamic URL patterns, it's better to use a <link rel="canonical" href="..." /> tag on the other, duplicate URLs.

how does URL rewriting work in plain English

I have read a lot about URL rewriting but I still don't get it.
I understand that a URL like
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=19
can be replaced with a friendlier one like
http://www.example.com/Blog/2006/12/19/
and the server code can remain unchanged because there is some filter which transforms the new URL back into the old one - but does it replace the URLs in the HTML of the response too?
If the server code remains unchanged then it is possible that in my returned HTML code I have links like:
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=20
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=21
http://www.example.com/Blog/Posts.php?Year=2006&Month=12&Day=22
This defeats the purpose of having the nice URLs if in my page I still have the old ones.
Does URL rewriting (with a filter or something) also replace this content in the HTML?
Put another way... do the rewrite rules apply for the incoming request as well as the HTML content of the response?
Thank you!
The URL rewriter simply takes the incoming URL and if it matches a certain pattern it converts it to a URL that the server understands (assuming your rewrite rules are correct).
It does mean that a specific resource can be accessed multiple ways, but this does not "defeat the point", as the point is to have nice-looking URLs, which you still have.
They do not rewrite the outgoing content, only the incoming URL.
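For the blog example above, a minimal .htaccess sketch of such an incoming-only rule might look like this (assuming Apache with mod_rewrite enabled):

RewriteEngine On
# Map /Blog/2006/12/19/ onto the real script; the server code stays unchanged
RewriteRule ^Blog/(\d{4})/(\d{1,2})/(\d{1,2})/?$ /Blog/Posts.php?Year=$1&Month=$2&Day=$3 [L,QSA]

Note the rule only touches the incoming request; any old-style links in the generated HTML would have to be updated in the templates themselves.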

htaccess redirection for #! in urls

I am using ajax to load pages on my website.
Each time a page is loaded I change the URL in the browser to
http://www.example.com/old/page/#!/new/page/
by setting it through window.location using JavaScript.
Now what I want to do is: when someone comes to my website by entering the URL
http://www.example.com/old/page/#!/new/page/
he should automatically get redirected to
http://www.example.com/new/page/
This is somewhat like what happens on Facebook too.
Can someone help me out with the required .htaccess code to achieve the same.
Thanks in advance
I don't think anything past the # symbol in your URL is even visible on the server side, so .htaccess, PHP, etc. won't even know the hash is there to begin with. I think in order to pull this off you're going to have to use a client-side redirect.
window.onload = function () {
    // First check whether the hash starts with the #! pattern
    if (/^#!/.test(window.location.hash)) {
        // If it does, strip the leading "#!" and redirect to what remains
        window.location = window.location.hash.substring(2);
    }
};
This example will redirect the user immediately on page load. Unfortunately, doing a redirect in this manner won't help the search engines reindex your site, but that's just one of the pitfalls of using fancy JavaScript or hash-based URLs.

IE8 XSS filter: what does it really do?

Internet Explorer 8 has a new security feature, an XSS filter that tries to intercept cross-site scripting attempts. It's described this way:
The XSS Filter, a feature new to Internet Explorer 8, detects JavaScript in URL and HTTP POST requests. If JavaScript is detected, the XSS Filter searches for evidence of reflection, information that would be returned to the attacking Web site if the attacking request were submitted unchanged. If reflection is detected, the XSS Filter sanitizes the original request so that the additional JavaScript cannot be executed.
I'm finding that the XSS filter kicks in even when there's no "evidence of reflection", and am starting to think that the filter simply notices when a request is made to another site and the response contains JavaScript.
But even that is hard to verify because the effect seems to come and go. IE has different zones, and just when I think I've reproduced the problem, the filter doesn't kick in anymore, and I don't know why.
Anyone have any tips on how to combat this? What is the filter really looking for? Is there any way for a good-guy to POST data to a 3rd-party site which can return HTML to be displayed in an iframe and not trigger the filter?
Background: I'm loading a JavaScript library from a 3rd-party site. That JavaScript harvests some data from the current HTML page, and posts it to the 3rd-party site, which responds with some HTML to be displayed in an iframe. To see it in action, visit an AOL Food page and click the "Print" icon just above the story.
What does it really do? It allows third parties to link to a messed-up version of your site.
It kicks in when [a few conditions are met and] it sees a string in the query submission that also exists verbatim in the page, and which it thinks might be dangerous.
It assumes that if <script>something()</script> exists in both the query string and the page code, then it must be because your server-side script is insecure and reflected that string straight back out as markup without escaping.
But of course, apart from the fact that it's a perfectly valid query someone might have typed that matches by coincidence, it's also just as possible that they match because someone looked at the page and deliberately copied part of it out. For example:
http://www.bing.com/search?q=%3Cscript+type%3D%22text%2Fjavascript%22%3E
Follow that in IE8 and I've successfully sabotaged your Bing page so it'll give script errors, and the pop-out result bits won't work. Essentially it gives an attacker whose link is being followed license to pick out and disable parts of the page he doesn't like — and that might even include other security-related measures like framebuster scripts.
What does IE8 consider ‘potentially dangerous’? A lot more, and a lot stranger, things than just this script tag. What's more, it appears to match against a set of ‘dangerous’ templates using a text pattern system (presumably regex), instead of any kind of HTML parser like the one that will eventually parse the page itself. Yes, use IE8 and your browser is pařṣinͅg HT̈́͜ML w̧̼̜it̏̔h ͙r̿e̴̬g̉̆e͎x͍͔̑̃̽̚.
‘XSS protection’ by looking at the strings in the query is utterly bogus. It can't be ‘fixed’; the very concept is intrinsically flawed. Apart from the problem of stepping in when it's not wanted, it can't ever really protect you from anything but the most basic attacks - and attackers will surely work around such blocks as IE8 becomes more widely used. If you've been forgetting to escape your HTML output correctly you'll still be vulnerable; all XSS “protection” has to offer you is a false sense of security. Unfortunately, Microsoft seems to like this false sense of security; there is similar XSS “protection” in ASP.NET too, on the server side.
So if you've got a clue about webapp authoring and you've been properly escaping output to HTML like a good boy, it's definitely a good idea to disable this unwanted, unworkable, wrong-headed intrusion by outputting the header:
X-XSS-Protection: 0
in your HTTP responses. (And using ValidateRequest="false" in your pages if you're using ASP.NET.)
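If you happen to serve the site through Apache with mod_headers enabled (an assumption about your stack), a single directive sets it everywhere:

Header set X-XSS-Protection "0"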
For everyone else, who still slings strings together in PHP without taking care to encode properly... well you might as well leave it on. Don't expect it to actually protect your users, but your site is already broken, so who cares if it breaks a little more, right?
To see it in action, visit an AOL Food page and click the "Print" icon just above the story.
Ah yes, I can see this breaking in IE8. It's not immediately obvious where IE has made the hack to the content that's stopped it executing, though... the only cross-domain request I can see that's a candidate for the XSS filter is this one to http://h30405.www3.hp.com/print/start:
POST /print/start HTTP/1.1
Host: h30405.www3.hp.com
Referer: http://recipe.aol.com/recipe/oatmeal-butter-cookies/142275?
csrfmiddlewaretoken=undefined&characterset=utf-8&location=http%253A%2F%2Frecipe.aol.com%2Frecipe%2Foatmeal-butter-cookies%2F142275&template=recipe&blocks=Dd%3Do%7Efsp%7E%7B%3D%25%3F%3D%3C%28%2B.%2F%2C%28%3D3%3F%3D%7Dsp%7Ct#kfoz%3D%25%3F%3D%7E%7C%7Czqk%7Cpspm%3Db3%3Fd%3Do%7Efsp%7E%7B%3D%25%3F%3D%3C%7D%2F%27%2B%2C.%3D3%3F%3D%7Dsp%7Ct#kfoz%3D%25%3F%3D%7E%7C%7Czqk...
That "blocks" parameter continues with pages more gibberish. Presumably there is something in there that (by coincidence?) is reflected in the returned HTML and triggers one of IE8's messed-up ideas of what an XSS exploit looks like.
To fix this, HP need to make the server at h30405.www3.hp.com include the X-XSS-Protection: 0 header.
You should send me (ericlaw#microsoft) a network capture (www.fiddlercap.com) of the scenario you think is incorrect.
The XSS filter works as follows:
1. Is XSSFILTER enabled for this process?
If yes - proceed to the next check.
If no - bypass the XSS Filter and continue loading.
2. Is it a "document" load (like a frame, not a subdownload)?
If yes - proceed to the next check.
If no - bypass the XSS Filter and continue loading.
3. Is it an HTTP/HTTPS request?
If yes - proceed to the next check.
If no - bypass the XSS Filter and continue loading.
4. Does the RESPONSE contain an x-xss-protection header?
Yes:
Value = 1: XSS Filter enabled (no URLAction check).
Value = 0: XSS Filter disabled (no URLAction check).
No: proceed to the next check.
5. Is the site loading in a Zone where the URLAction enables XSS filtering? (By default: Internet, Trusted, Restricted.)
If yes - proceed to the next check.
If no - bypass the XSS Filter and continue loading.
6. Is it a cross-site request? (Referer header: does the final (post-redirect) fully-qualified domain name in the HTTP request's Referer header match the fully-qualified domain name of the URL being retrieved?)
If yes - bypass the XSS Filter and continue loading.
If no - the URL in the request should be neutered.
7. Does the heuristic indicate that the RESPONSE data came from unsafe REQUEST DATA?
If yes - modify the response.
Now, the exact details of #7 are quite complicated, but basically you can imagine that IE matches request data (URL/POST body) against response data (script bodies), and if they match, the response data will be modified.
In your site's case, you'll want to look at the body of the POST to http://h30405.www3.hp.com/print/start and the corresponding response.
Actually, it's worse than it might seem. The XSS filter can make safe sites unsafe. Read here:
http://www.h-online.com/security/news/item/Security-feature-of-Internet-Explorer-8-unsafe-868837.html
From that article:
However, Google disables IE's XSS filter by sending the X-XSS-Protection: 0 header, which makes it immune.
I don't know enough about your site to judge if this may be a solution, but you can probably try.
A more in-depth, technical discussion of the filter, and of how to disable it, is here: http://michael-coates.blogspot.com/2009/11/ie8-xss-filter-bug.html
