I have a problem with someone (using many IP addresses) browsing all over my shop using:
example.com/catalog/category/view/id/$i
I have URL rewriting turned on, so normal human browsing uses the "friendly" URLs:
example.com/category_name.html
So the question is: how can I prevent the shop from being browsed via the "old" (non-rewritten) URLs, so that only the "friendly" URLs remain accessible?
This is pretty important, since the crawler is using hundreds of threads, which is making the shop run really slowly.
Since there are many random IP addresses, you clearly can't just block access from a single address or a small group of them. You may need to implement some logging that identifies this crawler uniquely (perhaps by its user-agent string, or possibly with some clever use of the Modernizr JavaScript library).
Once you've been able to distinguish some unique identifiers of this crawler, you could probably use a rule in .htaccess (if it's a user agent thing) to redirect or otherwise prevent them from consuming your server's oomph.
This SO question provides details on rules for user agents.
Block all bots/crawlers/spiders for a special directory with htaccess
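For example, if the crawler identifies itself with a recognisable user-agent string, a rule along these lines in your .htaccess would refuse its requests (the "BadBot" string is just a placeholder for whatever you find in your access logs):
RewriteEngine On
# "BadBot" is a placeholder for the user-agent string you identify in your logs
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
RewriteRule .* - [F,L]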
If the spider crawls all the URLs matching the pattern:
example.com/catalog/category/view/id/$i
then you can simply block those URLs in .htaccess. The rewrite from category_name.html to /catalog/category/view/id/$i happens internally, so blocking external requests for the raw URLs only hits the bots.
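A sketch of such a block (it rejects any request whose original request line contains the raw catalog path, while internally rewritten requests pass through untouched):
RewriteEngine On
# %{THE_REQUEST} holds the original request line, so this only matches direct
# external requests for the raw URL; the internal rewrite from the .html URL is unaffected
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/catalog/category/view/id/ [NC]
RewriteRule .* - [F,L]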
Once the rewrites are there, they are there. They are stored in the Magento database for several reasons: one is crawlers like the one crawling your site, another is users who might have the old pages bookmarked. There are a number of methods people have come up with to go through and clean up the redirects (Google it), but as it stands, once they are in Magento they are not easily managed from within Magento.
I might suggest generating a new sitemap and submitting it to the search engine whose crawler is affecting your site. Not only is this crawler crawling tons of pages it doesn't need to, it is also going to see duplicate content (bad juju).
I am hoping some of you might have some answers here. I've been at this for many hours and I'm not making much headway. From doing an extensive amount of research, I see this is a common problem without many solutions.
I have my login at domain.com, which then goes to the user profile at domain.com/profile?u=username, rewritten with .htaccess to username.domain.com. I need access to the SESSION username across all subdomains so I can tell whether the user is on their own profile or not. I have tried all the basic solutions for getting SESSIONs working across subdomains with .htaccess etc. (I do not have direct access to php.ini), but nothing seemed to work, with the exception of session_set_cookie_params(0, '/', '.domain.com'); at the top of the script that sets the SESSION. That suddenly worked - but the problem is, it also suddenly stopped working, and it has continued to work and not work intermittently without me touching the code.
My questions here are...
Does anyone have any idea why this would be intermittently working and then not working?
Does anyone have any other simple, cross-platform solutions to this problem?
Failing that, I believe I can store the SESSION in the database and recreate it on all the subdomains. This seems inefficient, but it might be the only solution. What are your thoughts, and what would be the best way to do this?
I would really appreciate any help in this. This has proven a real challenge.
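For what it's worth, here is a rough sketch of the database approach I have in mind, using PHP's SessionHandlerInterface with PDO (the table name, columns and credentials are just placeholders):
<?php
// Assumed table: CREATE TABLE sessions (id VARCHAR(128) PRIMARY KEY, data BLOB, last_access INT);
class DbSessionHandler implements SessionHandlerInterface
{
    private $pdo;

    public function __construct(PDO $pdo) { $this->pdo = $pdo; }

    public function open($savePath, $sessionName) { return true; }
    public function close() { return true; }

    public function read($id)
    {
        // Return the stored payload, or an empty string for a new session
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        return $row ? $row['data'] : '';
    }

    public function write($id, $data)
    {
        // Insert or update the session row (MySQL REPLACE) and stamp the access time
        $stmt = $this->pdo->prepare('REPLACE INTO sessions (id, data, last_access) VALUES (?, ?, ?)');
        return $stmt->execute([$id, $data, time()]);
    }

    public function destroy($id)
    {
        return $this->pdo->prepare('DELETE FROM sessions WHERE id = ?')->execute([$id]);
    }

    public function gc($maxlifetime)
    {
        // Remove sessions that have been idle longer than the configured lifetime
        return $this->pdo->prepare('DELETE FROM sessions WHERE last_access < ?')->execute([time() - $maxlifetime]);
    }
}

$pdo = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials
session_set_save_handler(new DbSessionHandler($pdo), true);
session_set_cookie_params(0, '/', '.domain.com'); // share the cookie across all subdomains
session_start();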
I was able to fix this by simply adding a text file called php.ini to my main directory, with session.cookie_domain = ".domain.com" inside. That was it. (You also have to relaunch your browser.)
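In other words, the entire fix was this one-line php.ini in the document root (this only works if your host honours per-directory php.ini files):
; php.ini in the site's main directory
session.cookie_domain = ".domain.com"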
So I am attached to this rather annoying project where a client's client is nitpicking about all the little things and giving my guy hell, and my guy is gladly returning the favour by following the good old rule of shoving shi* down the chain of command.
Now, my question. The application basically consists of 3 different mini-projects: the backend interface for the administrator, the backend interface for the client, and the frontend for everyone.
I was specifically asked to apply MOD_REWRITE rules to make things SEO friendly. That was the ultimate aim, so this was basically an exercise in making things more search friendly rather than making the links aesthetically better looking.
So I worked on the frontend, which is basically the landing page for everyone. It looks beautiful; the links are at worst followed by one slash.
My client's issue: he wants to know why the backend interfaces for the admin and the user are still displaying those gigantic, ugly links. And these are very, very ugly links - I am talking three to four slashes followed by various GET sequences and whatnot - so you can probably understand the complexity behind mod_rewriting something like this.
On the spur of the moment I said that I had left it the way it was to make sure the backend interface wouldn't be sniffed out by any crawlers.
But I am not sure that's necessarily true. Where do crawlers stop? When do they give up on trying to parse links? I know I can use a robots.txt file to specify rules. But, as indigenous creatures, what are their instincts?
I know this is more of a rant than anything and I am running a very high risk of having my first question rejected :| But hey, it feels good to have this off my chest.
Cheers!
Where do crawlers stop? When do they give up on trying to parse links?
Robots.txt does not work for all bots.
You can use basic authentication or IP-restricted access to hide the back-end, provided none of its files are needed by the front-end.
If that is not practicable, try sending 404 or 401 headers for the back-end files. But this is just an idea, no guarantee.
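For instance, a sketch of the first option in the back-end directory's .htaccess (the password file path and the IP address are placeholders):
# Option 1: basic authentication
AuthType Basic
AuthName "Back-end"
AuthUserFile /path/to/.htpasswd
Require valid-user

# Option 2 (Apache 2.4): allow only a known IP instead
# Require ip 203.0.113.42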
But, as indigenous creatures, what are their instincts?
Hyperlinks, toolbars, and browser-side, pre-activated functions for malware, spam and fraud warnings...
Hi guys, I just want to ask this after hours of mod_rewrite frustration and reading tons of questions about it on Stack Overflow, because I tried everything and it didn't work for me. I don't know why, but I'd had enough, so I looked for an alternative and I'm asking here today for opinions. I came up with the following method.
First assume I have this URL
http://www.domain.com/articles/6
and I have an articles.php page that takes the ID from this URL and pulls the article content from the database (this is where mod_rewrite fails for me), so this is a little solution:
// Split the request path on "/" and pass the numeric ID to show_article()
// (for /articles/6 the ID ends up at index 2; adjust if the script lives in a subfolder)
$article_id = explode("/", parse_url($_SERVER["REQUEST_URI"], PHP_URL_PATH));
show_article((int) $article_id[2]);
The show_article() function simply takes the ID and queries the database for the article content. I also read that the server will not understand that articles (with no extension) is a PHP page, so there is a little solution for that too:
<FilesMatch "^articles$">
    ForceType application/x-httpd-php
</FilesMatch>
So, two questions:
1 - Will this solution affect how search engine spiders index my website's pages?
2 - Is this a good solution, or is mod_rewrite better?
Note: I'm sorry if the question is not well formatted; I'm not good at formatting. If you can make it look better I would really appreciate it. Really sorry.
Don't give up on mod_rewrite; it's a bit non-intuitive, but a VERY powerful and useful piece of software! You'll never get such a clean solution for URL manipulation inside the application itself. To your questions:
1) No, it will not affect indexing. Both your solution and the one involving mod_rewrite are transparent to web spiders.
2) mod_rewrite is definitely better
I do recommend that you ask a question about the problems you're having with mod_rewrite not doing what you want; I'm pretty sure you'll sort it out with someone's help.
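For example, a rule along these lines should cover the URL you describe (it assumes articles.php reads the ID from $_GET['id'], which is just one way to wire it up):
RewriteEngine On
# Map /articles/6 internally to articles.php?id=6
RewriteRule ^articles/([0-9]+)/?$ articles.php?id=$1 [L,QSA]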
From a technical perspective, the only issue is traffic and incoming links (one of them should redirect to the other).
Now I need to choose which one should be the primary. Some sites use www (google, microsoft, ruby-lang) and some go without it (stackoverflow, github). It seems to me that the newer ones do not use www.
Which should I choose?
Please give some explanation.
UPDATE: This is a programming-related question. The site is actually for programmers, so I expect to see what techy people think.
UPDATE: The site without www is the clear winner. Thank you, guys!
It doesn't matter which you choose, but you should pick one and be consistent. It is more a matter of style, but it is important to note that search engines consider these two URLs to be different sites:
http://www.example.com
http://example.com
So whichever you choose for aesthetic reasons should be consistently used for SEO reasons.
Edit: My personal opinion is to forgo the www as it feels archaic to me. I also like shorter URLs. If it were up to me I would redirect all traffic from www.example.com to example.com.
Don't use WWW. It's an unnecessary tongue-twister, and a pain in the arse for graphic designers.
There are some issues you should consider. See for example Use Cookie-free Domains for Components for a cookie validity issue.
But regardless of how you decide: use just one of those domains as your canonical domain name and use a 301 redirect to correct the other. For an Apache web server, you can use mod_rewrite to do that.
Configure both, obviously. I would make the www redirect to the normal URL, as it only exists to make the people who habitually type it at the beginning of every address happy anyway. Just don't, whatever you do, require the www to be typed manually. Ever.
It depends on your audience, I think. A non-technical audience will assume that the www is there, whereas a technical audience will not instinctively expect it, and will appreciate the shorter URLs.
(Edit: I recently set up a domain for my family to use, including webmail. My wife asked what the address of the webmail was. I said "mail.ourdomain.com". She typed "www.mail.ourdomain.com".)
Either way, make sure the one you don't use cleanly does a 301 Redirect to the one you do use - then neither users nor search engines will need to care.
One aspect of this question deals with CDNs and some web hosts (eg. Google Sites). Such hosts require that you add a CNAME record for your site name that points to the host servers. However, due to the way DNS is designed, CNAME records cannot coexist with other records for the same name, such as NS or SOA records. So, you cannot add a CNAME for your example.com name, and must instead add the CNAME for a subdomain. Of course people normally choose "www" for their subdomain.
Despite this technical limitation, I prefer to omit the www on my sites where possible.
I'd redirect to without www. In Apache 2.x:
RewriteEngine On
RewriteBase /
# Permanently (301) redirect the www host to the bare domain
RewriteCond %{HTTP_HOST} ^www\.yourdomain\.com$ [NC]
RewriteRule ^(.*)$ http://yourdomain.com/$1 [R=301,L]
I think the www is meaningless; we all know we're on the world wide web. It would be much better to use subdomains for load balancing or for device specific sites (like m.google.com for mobiles, for example, even though there is a .mobi top level domain now).
www is used as a standard subdomain - effectively a subfolder for the website within the main domain.
http://no-www.org/ is trying to get it deprecated, although http://www.w3.org/ includes www.
Both of those sites are worth checking.
It seems to have become a matter of taste and something of a religious issue at the moment rather than a standard. Whatever you choose, make sure you register or redirect the www form, since shortcuts like Ctrl+Enter fill in the www automatically.
Would you have other subdomains? If so, that may make using www more sensible to my mind, since some sites have various subdomains for other purposes, like a store or internationalization subdomains.
I normally go with www.sitename.com because it makes explicit that this is the main part of the site. testing.sitename.com is testing; house.sitename.com is my home PC. I like being explicit, but I don't mind when sites don't use www. I am not a purist. :)
Use without the www. The general rationale behind this is that since you are writing an address to a web browser, it's already implicit that you are accessing a web site (what else would you do with a browser?) - using the extra www is therefore useless.
To be specific, when you receive an HTTP request you already know the user wants to access the website. The web browser adds the http:// prefix implicitly, so the user only needs to worry about the address. The same goes for other services: if you host FTP, it should be enough to point the FTP client at the domain without the ftp. prefix.
If I understand correctly, the reasons for using the different www., ftp., etc. subdomains are mostly historical, and are no longer relevant these days since traffic is simply directed to the correct server/service - the redundant prefixes have just stuck because of their popularity.
I always make the non-www one redirect to www and refer to the site as www.mysite. Think of the various forums and instant-messaging apps that only convert links correctly when they begin with www.
You want your URL to be memorable, and you want Google et al. to register the same URL for rankings and the like.
Best practice appears to be to handle the www but always redirect it (with an HTTP 301) to the non-www variant. That way the search engines know to rank links to both variants as the same site.
Whatever you use, stick to one, or else you'll have to set two sets of cookies, one for each domain, to make your sessions/cookies work properly.