I have a problem with someone (using many IP addresses) browsing all over my shop using:
example.com/catalog/category/view/id/$i
I have URL rewrite turned on, so the usual human browsing looks "friendly":
example.com/category_name.html
So the question is: how can I prevent the shop from being browsed via the "old" (non-rewritten) URLs, leaving only the "friendly" URLs allowed?
This is pretty important, since the crawler is using hundreds of threads, which is making the shop really slow.
Since there are many random IP addresses, you clearly can't just block access from a single address or a small group of addresses. You may need to implement some logging that identifies this crawler uniquely (maybe by user agent, or possibly with some clever use of the Modernizr JavaScript library).
Once you've been able to distinguish some unique identifiers of this crawler, you could probably use a rule in .htaccess (if it's a user agent thing) to redirect or otherwise prevent them from consuming your server's oomph.
This SO question provides details on rules for user agents.
Block all bots/crawlers/spiders for a special directory with htaccess
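As a rough sketch (assuming Apache with mod_rewrite enabled, and with "BadCrawler" standing in for whatever User-Agent string you actually find in your access logs), a rule like this in .htaccess would return 403 to that agent:
RewriteEngine On
# "BadCrawler" is a placeholder -- replace it with the identifier you found in your logs
RewriteCond %{HTTP_USER_AGENT} BadCrawler [NC]
RewriteRule ^ - [F]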
If the spider crawls all the urls of the given pattern:
example.com/catalog/category/view/id/$i
then you can simply block those URLs in .htaccess. The rewrite from category_name.html to /catalog/category/view/id/$i happens internally, so you only block the bots.
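For instance, a minimal sketch (assuming Apache with mod_rewrite, and that the friendly URLs are rewritten internally to the /catalog/category/view/id/ pattern) could deny only the requests that name the old path directly:
RewriteEngine On
# THE_REQUEST holds the raw request line from the client, so it is not changed
# by internal rewrites from category_name.html -- only direct hits on the
# old-style URLs receive a 403.
RewriteCond %{THE_REQUEST} \s/catalog/category/view/id/ [NC]
RewriteRule ^ - [F]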
Once the rewrites are there... they are there. They are stored in the Magento database for a number of reasons: one is crawlers like the one crawling your site; another is users who might have the old page bookmarked. People have come up with a number of methods to go through and clean up your redirects (Google it), but as it stands, once they are in Magento, they are not easily managed from within Magento.
I might suggest generating a new sitemap and submitting it to the search engine whose crawler is affecting your site. Not only is this crawler crawling tons of pages it doesn't need to, it is also going to see duplicate content (bad juju).
Related
I am trying to decode and encode Joomla URLs, but Joomla doesn't seem to have a consistent API for that (as far as I can tell). The main problem arises when another SEO plugin is installed and the operation is performed as a background process (i.e. not while rendering in a browser through Joomla).
The other big problem is that users copy and paste SEO URLs of their own site directly into the content.
Does anyone know a solution for this? Supporting all sorts of SEO plugins individually is a total no-go and practically impossible.
I actually thought it was the job of the CMS to guarantee, at the API level, that SEO URLs can be decoded and encoded without knowledge of the plugins, but no. I also had a look at some plugins and indeed, some plugins handle other plugins' URLs, which shouldn't be the case.
Well, thanks.
You can't. JRoute won't work reliably in the administrator; I even tried hacking it, and it's a no-go.
Moreover, sh404 (one of the leading SEF extensions) does a curl call to the frontend in order to get the paths right. You can find a commented-out attempt to route in the backend in their code.
Are you trying to parse content when it's saved, find SEF URLs and replace them with their non-SEF equivalents? If you create a simple component to handle this in the frontend (just take what you need from Xmap), then you can query the frontend from the backend with curl/wget and possibly achieve this with a decent rate of success; but I wouldn't expect it to work 100% of the time (sometimes parameters are added by components, or the order of parameters differs from call to call, and the router.php in extensions can be very fragile or even plain wrong).
I found a "bulk import and export url rewrites extension" for Magento when looking on the internet on how to bulk redirect urls from my current live site to the new urls based on the new site which is on a development server still.
I’ve asked my programmer to help me out and they’ve sent me 2 CSV files, one with all request and target urls from the current live site (these are often different as well, probably due to earlier redirects), and one similar to that for the new site. The current live site comes with 2500 urls, the future site with 3500 (probably because some old, inactive and unnecessary categories are still in the new site as well).
I was thinking to paste the current site’s urls into an Excel sheet and to then insert the future urls per url. A lot of work… Then I thought: can’t I limit my work to the approx. 300 urls that Google has indexed (which can be found through Google Webmaster Tools as you probably know)?
What would you recommend? Would there be advantages to using such an extension? Which advantages would that be? (if you keep in mind that my programmer would upload all of my redirects into a .htaccess file for me?)
Thanks in advance.
Kind regards, Bob.
Axel is giving HORRIBLE advice. 301 redirects tell Google to transfer the old page's PageRank and authority to the new page in the redirect. Moreover, if other websites linked to those pages, you don't want a bunch of dead links; someone MIGHT just remove them.
Even worse, if you don't handle the 404s correctly, Google can and will penalize you.
ALWAYS set up 301 redirects when changing platforms; only someone who either doesn't understand or doesn't care about SEO would suggest otherwise.
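For what it's worth, a bulk redirect in .htaccess usually ends up as one 301 rule per old/new pair, something like this (the paths here are made-up placeholders, not your actual URLs):
Redirect 301 /old-category.html /new-category.html
Redirect 301 /old-category/old-product.html /new-category/new-product.html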
I don't even know if this is possible, but I thought I'd ask.
I am creating a small CRUD application, but I have multiple sites, and each site would use the CRUD. The application would have common CRUD methods and styling, but each individual site would apply different forms. I want to avoid creating multiple CRUD applications that vary only in specific content (just different forms).
I want to have something like this:
mycrud.website1.com
mycrud.website2.com
mycrud.website3.com
I can create a subdomain for each individual site no problem. But is it feasible to point all the subdomains to one MVC application directory? And if it is possible any suggestions for how I might go about restricting users from website1 from seeing website2 or website3 content? Is that something "roles" could take care of (after authenticating user)?
Thanks.
There are a lot of websites that do this, not just with MVC. Some content farms point *.mydomain.com to a single IP and have a wildcard mapping in IIS.
From there, your application should look at the URL to determine what it should be doing. Some CMS systems operate in this manner, using the domain as a key to deciding what pages to load.
I've built a private-labelable SaaS (Software as a Service) application that allows us to host all of our clients in a single application. Some clients have customizations to pages or features; we handle that by creating custom plugins for each client that override the Controllers or Views when needed.
All clients share a common code base, and aside from each client's custom theme/template they are the same. Only when a client had us customize a feature did we need to build out their plugin DLL. Now, this is advanced stuff, so it would require heavy modifications to your code base, but in the end, if it's what your application needs, it is 100% possible.
First, the easy part: having one web site for all three domains. You can do that simply with DNS entries; no problem. All three domains should point at the same IP.
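(On Apache, rather than IIS, the equivalent would just be one virtual host answering for all three hostnames; a rough sketch, assuming a hypothetical /var/www/mycrud document root:)
<VirtualHost *:80>
    ServerName mycrud.website1.com
    # All three subdomains are served by the same application directory
    ServerAlias mycrud.website2.com mycrud.website3.com
    DocumentRoot /var/www/mycrud
</VirtualHost>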
As far as the content goes, you could handle that in a number of ways. I think your idea of roles is pretty solid; it also leaves open the possibility of a given user seeing content from both site1 and site2, if that should ever be necessary.
If you don't want to force users to authenticate, you should look at other options. You could wrap your CRUD logic and data access logic into separate libraries and use them across three different sites in IIS. You could have one site and display content based on the request URL. There's probably a lot of other options too.
I'm dealing with a hosting team that is fairly skittish about managing many rewrite rules. What are your experiences with the number of rules your sites currently manage?
I can see dozens (if not more) coming up as the site grows and contracts, and I need to set expectations that this isn't out of the norm.
Thanks
It should be normal to have a lot of rewrite rules on your site. As the site gets bigger, the number of pages you need to rewrite grows, and depending on what the pages do, you could have multiple rewrites per page. It also depends on how secure you are making your site: more security means more precautions.
The mod_rewrite module gives you the ability to transparently redirect one URL to another, without the user's knowledge. This opens up all sorts of possibilities, from simply redirecting old URLs to new addresses, to cleaning up the 'dirty' URLs coming from a poor publishing system, giving you URLs that are friendlier to both readers and search engines.
So it's pretty much at your discretion: how secure do you want it to be?
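For a sense of scale, each friendly URL usually needs only a rule or two. A hypothetical example that cleans up a query-string URL into something readable (the page names are made up):
RewriteEngine On
# Visitors see /about while the application still receives index.php?page=about
RewriteRule ^about/?$ index.php?page=about [L,QSA]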
I haven't had a huge opportunity to research the subject but I figure I'll just ask the question and see if we can create a knowledge base on the subject here.
1) Using subdomains will force a client-side cache. Is this the default, or is there an easy way for a client to disable it? I'm mostly curious about what percentage of users I should expect this to affect.
2) What all will be cached? Images? Stylesheets? Flash SWFs? Javascripts? Everything?
3) I remember reading that you must use a subdomain or www in your URL for this to work, is this correct? (and does this mean SO won't allow it?)
I plan on integrating this into all of my websites eventually, but first I am going to try it on a network of Flash game websites. I'm thinking www.example.com for the website will remain the same, but instead of using www.example.com/images, www.example.com/stylesheets, www.example.com/javascript, and www.example.com/swfs, I will just create subdomains that point to them (img.example.com, css.example.com, js.example.com and swf.example.com respectively) -- is this the best course of action?
Using subdomains for content elements isn't so much to force caching, but to trick a browser into opening more connections than it might otherwise do. This can speed up page load time.
Caching of those elements is entirely down to the HTTP headers delivered with that content.
For static files like CSS and JS, a server will typically tell the client when the file was last modified, which allows a browser to re-request the file with an If-Modified-Since header for that timestamp. The specifics of improving on this with extra caching headers depend on which web server you use. For example, with Apache you can use the mod_expires module to set the Expires header, or the Header directive to output other types of cache-control headers.
As an example, if you had a subdirectory containing your CSS files and wanted to ensure they were cached for at least an hour, you could place a .htaccess in that directory with these contents:
ExpiresActive On
ExpiresDefault "access plus 1 hours"
Check out YSlow's documentation. YSlow is a plugin for Firebug, the amazing Firefox web development plugin. There is lots of good info on a number of ways to speed up your page loads, one of which is using one or more subdomains to encourage the browser to do more parallel object loads.
One thing I've done on two Django sites is to use a custom template tag to create pseudo-paths to images, css, etc. The path contains the time-last-modified as a pseudo directory. This path component is stripped out by an Apache .htaccess mod_rewrite rule. The object is then given a 10 year time-to-live (ExpiresDefault "now plus 10 years") so the browser will only load it once. If the object changes, the pseudo path changes and the browser will fetch the updated object.
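A rough .htaccess sketch of that idea (the static/t<timestamp>/ path layout is a made-up placeholder for whatever the template tag actually emits):
RewriteEngine On
# Strip the /t<timestamp>/ pseudo-directory so the real file is served;
# the timestamp only exists to change the URL, and thus bust the cache,
# whenever the file is modified.
RewriteRule ^static/t[0-9]+/(.*)$ static/$1 [L]

ExpiresActive On
# A far-future expiry is safe here because a changed file gets a new URL.
ExpiresDefault "now plus 10 years"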