I have a migration project from a legacy system to a new system. The move to the new system will create new unique IDs for the objects being migrated; however, my users and search indexes will have URLs containing the old IDs. I would like to set up an Apache redirect or rewrite to handle this, but I am concerned about performance given the large number of objects (I expect approximately 500K old-id-to-new-id mappings).
Has anyone implemented this at this scale? Or does anyone know whether Apache can stand up to a redirect mapping this large?
If you have a fixed set of mappings, you should give a mod_rewrite rewrite map of the "Hash File" type a try.
I had the very same question recently. As I found no practical answer, we implemented an .htaccess file with 6 rules, 3 of which had 200,000 conditions each.
That meant an .htaccess file roughly 150 MB in size. It was actually fine for half a day, while no one was using this particular website, even though page load times were in the seconds. The next day, however, our whole server got hammered, with load averages well above 400 (the machine has 8 cores, 16 GB RAM and SAS RAID 5, so resources are not usually a problem).
If you need to implement anything like this, I suggest designing your rules so they don't need conditions and putting the mappings in a DBM rewrite map. That easily solved the performance issues for us.
http://httpd.apache.org/docs/current/rewrite/rewritemap.html#dbm
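As a rough sketch of what that can look like in practice (the map location, URL layout and id format below are assumptions, not details from the original setup; the plain-text source file holds "old_id new_id" pairs):

    # Build the DBM map once from the plain-text mapping file:
    #   httxt2dbm -i old-to-new.txt -o old-to-new.map
    # RewriteMap itself must be defined in the server or virtual-host config, not in .htaccess.
    RewriteEngine On
    RewriteMap legacyid "dbm:/etc/apache2/maps/old-to-new.map"

    # Look the old id up in the map and capture the result;
    # the redirect only fires when a mapping exists.
    RewriteCond ${legacyid:$1} ^(.+)$
    RewriteRule ^/objects/([0-9]+)$ /objects/%1 [R=301,L]

Because the lookup goes against a hashed on-disk file rather than a long list of rules Apache has to scan on every request, the per-request cost stays small even with 500K entries.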
Can you phrase the rewrites using a smaller number of rules? Is there a pattern that links the old URLs to the new ones?
If not, I'd be concerned about Apache with 500K+ rewrite mappings; that's just way past its comfort zone. Still, it might surprise you.
It sounds to me like you need to write a database-backed application just to handle the redirects, with the mapping itself stored in the database. That would scale much better.
I see this is an old topic, but did you ever find a solution?
I have a case where the developers are redirecting more than 30,000 URLs with RedirectMatch rules in a .htaccess file.
I am concerned about performance and management errors given the size of this file.
What I recommended is that, since all of the old URLs have the form
/sub/####
they move the mapping to the database and create
/sub/index.php
then redirect all requests for
www.domain.com/sub/###
to
www.domain.com/sub/index.php
Then have index.php send the redirect, since the new URLs and old ids can be looked up in the database.
This way, only HTTP requests for the old URLs hit the rewrite process, instead of every single HTTP request.
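A minimal sketch of what that index.php could look like, assuming a single rewrite rule such as RewriteRule ^sub/([0-9]+)$ /sub/index.php?id=$1 [L] passes the old id along; the table and column names (legacy_map, old_id, new_url) and the connection details are placeholders:

    <?php
    // Hypothetical lookup table: legacy_map(old_id, new_url).
    $oldId = isset($_GET['id']) ? (int) $_GET['id'] : 0;

    $pdo = new PDO('mysql:host=localhost;dbname=site', 'dbuser', 'dbpass');
    $stmt = $pdo->prepare('SELECT new_url FROM legacy_map WHERE old_id = ?');
    $stmt->execute(array($oldId));
    $newUrl = $stmt->fetchColumn();

    if ($newUrl !== false) {
        // 301 so browsers and search engines update to the new URL.
        header('Location: ' . $newUrl, true, 301);
    } else {
        header('HTTP/1.1 404 Not Found');
    }
    exit;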
Developing using MVC-3, Razor, C#
I've been searching around and cannot find the advice I'm looking for. My site will contain user-uploaded images (possibly a large number). What is the best practice for managing these pictures (placement, breakdown into sub-folders, etc.)? Where do I place them so that they won't get accidentally blown away when I republish my site periodically?
If there are any good articles or blog posts, that would be helpful. Also, any advice/tips anyone wants to add would be great.
Thanks for your time!
Rob
EDIT
I would also like to know what people do to prevent hot-linking.
A site that I run with a high volume of images stores all of the images in a date-based folder structure, e.g. 2010/Dec/31/image.jpg
There are two reasons for this.
The first is the limited amount of DB space (200 MB) that came with my hosting plan. Obviously, if I had gigabytes of space I would have stored them in the DB.
The second reason is to keep the number of images per folder to a minimum. Directory listings take longer the more files a folder contains, so a new directory every 24 hours was my workaround.
Can you perhaps tell us more about what resources you have or how many images you estimate will be uploaded daily?
If you are using SQL Server 2008 or above you can use FILESTREAM storage. If the files are under 1 MB in size, you might even get better performance by storing them as VARBINARY(MAX). The best part about storing them in the database is that you can easily use transactions.
As for replication and backup, standard database replication and backup will include the files.
If you have the space in your DB, then I recommend that, as backup/restore becomes much easier. If you have limited space for your DB, then a folder structure would work, though I would not store more than 1,000 files in a single folder, so you'll want a scheme that keeps any one folder from holding more than 1,000 images or sub-folders. If you think you'll have fewer than 1,000 images per day, then a variation on what Sir Psycho suggested would probably work well: a folder for each year, then a sub-folder under the year for each month and day, holding all the images for that day.
To answer your question about hot-linking: your best bet is to check the referrer (which is found in the headers of the request for the image) and make sure it's coming from your domain. If it's not, you can either send back nothing, or send back an image that lets the user know they cannot see the image from the third-party site.
The header data can be spoofed, but the odds are that random visitors coming from the third-party site will not have done this, and probably wouldn't know or care how to.
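To make that concrete, here is a minimal sketch of the idea (written in PHP purely as an illustration, since the question's stack is ASP.NET MVC; the domain, paths and script name are placeholders):

    <?php
    // serve_image.php?file=... : a hypothetical image front door that checks the Referer.
    $allowedHost = 'www.example.com';                  // your own domain (placeholder)
    $file = basename(isset($_GET['file']) ? $_GET['file'] : '');
    $path = '/var/www/uploads/' . $file;               // upload root (placeholder)

    $refererHost = isset($_SERVER['HTTP_REFERER'])
        ? (string) parse_url($_SERVER['HTTP_REFERER'], PHP_URL_HOST)
        : '';

    // An empty referrer is allowed so direct visits and privacy-conscious browsers still work.
    if ($refererHost !== '' && $refererHost !== $allowedHost) {
        // Hot-link attempt: serve a placeholder image instead of the real one.
        $path = '/var/www/static/hotlink-notice.png';
    }

    if (is_file($path)) {
        header('Content-Type: image/png');             // a real version would derive this from the extension
        readfile($path);
    } else {
        header('HTTP/1.1 404 Not Found');
    }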
We want to make use of HTTP caching on our website - in particular, content validation.
Because our CMS constructs pages from smaller fragments of content, the last-modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of ETags. Because page construction is based on lots of other page fragments, we think the only real way to provide an accurate ETag is by performing some sort of digest on the content stream itself. This seems a little overcooked, as caching is supposed to take load off the servers, but a content digest is obviously CPU intensive.
I'm looking for the fastest algorithm to create a unique ETag that is relevant to the content stream (inode etc. is just a kludge and won't work). An MD5 hash is obviously going to give the best unique result, but is anybody else making use of other algorithms that are faster in a similar situation?
Sorry, I forgot the important details... We're using Java Servlets, running on WebSphere 6.1 on Windows Server 2003.
I forgot to mention that there are also live database feeds (we're a bank and need to make sure interest rates are up to date) that can also change the content. So figuring out when content has changed can be tricky.
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.
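The question's environment is Java Servlets on WebSphere, but the approach is easy to sketch; here it is in PHP purely as an illustration, with the fragment table, checksum column and helper names all hypothetical:

    <?php
    // On save: store a checksum alongside each fragment (a one-time cost per change).
    function saveFragment(PDO $pdo, $id, $content) {
        $stmt = $pdo->prepare('UPDATE fragments SET content = ?, checksum = ? WHERE id = ?');
        $stmt->execute(array($content, md5($content), $id));
    }

    // On render: the page ETag is a digest of the small per-fragment checksums,
    // so no full-content hashing happens on the request path.
    function pageEtag(PDO $pdo, array $fragmentIds) {
        $placeholders = implode(',', array_fill(0, count($fragmentIds), '?'));
        $stmt = $pdo->prepare("SELECT checksum FROM fragments WHERE id IN ($placeholders) ORDER BY id");
        $stmt->execute($fragmentIds);
        return '"' . md5(implode('', $stmt->fetchAll(PDO::FETCH_COLUMN))) . '"';
    }

    // Usage: compare the result against the If-None-Match request header and answer
    // 304 Not Modified when it matches, skipping page rendering entirely.

The same trick covers the live database feeds: as long as whatever updates the rates also updates the stored checksum, the page ETag changes with them.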
Let me thank you all in advance!! You really help a lot. When I finish my website and have plenty of time to watch the user base grow, I will come here again and again to answer other people's questions (if I can).
So here is the problem.
I built a website on CodeIgniter: a social network engine, something like phpFox, classmates.com or Facebook.
Right now it is not multilingual, so the UI strings sit in the view files; the next step will be to move them into language files.
I want the user to be able to change the language, so I assume the user table will have a column "lang_local" which defaults to en and can then be changed to any other language.
What is eating at my nerves and energy is the following.
I will build several demographic social networks on this engine, and I would like to manage these websites in a centralized manner with one backend. So whenever I want to create a new network, I just add the domain settings, install the script in a new folder and add it to the sites table in the database.
I see it like this:
every table in the database (users, comments, messages, categories, etc.) will have a site_id column, and to every add/update/delete query I add WHERE SITE_ID=XXX,
and the table sites(site_id, site_name, domain_name) will hold all the domains, so that in the backend I can filter data by website.
Is this a good way? What if I later need to go multi-server; what about load balancing? Who can tell me what the right, PROFESSIONAL way would be? My maximum user count for a database is something like 10,000 to start and 100,000 users in one to two years.
There are loads of ways to do multi-site, but this is a perfectly good way to handle things. I use this approach in my internal work CMS.
The only downside is that it could potentially become massive and have performance issues. You may need to write an export script so you can grab everything belonging to a site and move it to its own install.
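To make the pattern concrete, here is a minimal sketch of the per-site scoping (plain PDO rather than CodeIgniter's query builder, and column names such as created_at are assumptions):

    <?php
    // Resolve the current site from the Host header once per request.
    function currentSiteId(PDO $pdo) {
        $stmt = $pdo->prepare('SELECT site_id FROM sites WHERE domain_name = ?');
        $stmt->execute(array($_SERVER['HTTP_HOST']));
        $siteId = $stmt->fetchColumn();
        if ($siteId === false) {
            exit('Unknown domain');
        }
        return (int) $siteId;
    }

    // Every read/write is scoped by site_id, exactly as described above.
    function latestComments(PDO $pdo, $siteId, $limit = 20) {
        $stmt = $pdo->prepare(
            'SELECT * FROM comments WHERE site_id = ? ORDER BY created_at DESC LIMIT ' . (int) $limit);
        $stmt->execute(array($siteId));
        return $stmt->fetchAll(PDO::FETCH_ASSOC);
    }

A composite index such as (site_id, created_at) on the big tables keeps the extra WHERE clause cheap as the data grows.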
I'm working on a little project of mine and need your help in deciding which is more performance-friendly: mod_rewrite, or parsing each URL in PHP.
URLs will almost always follow a fixed pattern; very few will have a different pattern.
For instance, most URLs would look like:
dot.com/resource
Some others would be:
dot.com/other/resource
I expect around 1,000 visitors a day to the site. Will server load be an issue?
Intuitively, I think mod_rewrite would work better, but just for peace of mind I'd like input from you guys. If anyone has carried out any tests, or can point me towards some, I'd be obliged.
Thanks.
You may want to check out the following Stack Overflow post:
Any negative impacts when using Mod-Rewrite?
Quoting the accepted answer:
I've used mod_rewrite on sites that get millions of hits per month without any significant performance issues. You do have to know which rewrites get applied first, depending on your rules.
Using mod_rewrite is most likely faster than parsing the URL with your current language.
If you are really worried about performance, don't use .htaccess files; those are slow. Put all your rewrite rules in your Apache config, which is only read once, on startup. .htaccess files get re-parsed on every request, along with every .htaccess file in parent folders.
To add my own, mod_rewrite is definitely capable of handling 1,000 visitors per day.
I know there are a lot of positive things mod_rewrite accomplishes. But are there any negatives? Obviously, if you have poorly written rules you're going to have problems. But what if you have a high-volume site and you're constantly using mod_rewrite; is it going to have a significant impact on performance? I did a quick search for benchmarks on Google and didn't find much.
I've used mod_rewrite on sites that get millions of hits per month without any significant performance issues. You do have to know which rewrites get applied first, depending on your rules.
Using mod_rewrite is most likely faster than parsing the URL with your current language.
If you are really worried about performance, don't use .htaccess files; those are slow. Put all your rewrite rules in your Apache config, which is only read once, on startup. .htaccess files get re-parsed on every request, along with every .htaccess file in parent folders.
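For example, a rule that might otherwise live in a .htaccess file can sit in the virtual host definition instead (the hostname, paths and rule here are placeholders):

    <VirtualHost *:80>
        ServerName www.example.com
        DocumentRoot /var/www/example

        RewriteEngine On
        # Parsed once at startup/reload, not on every request.
        RewriteRule ^/articles/([0-9]+)$ /article.php?id=$1 [L]

        # With no per-directory overrides needed, Apache can skip
        # looking for .htaccess files altogether.
        <Directory /var/www/example>
            AllowOverride None
        </Directory>
    </VirtualHost>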
To echo what Ryan says above, rules in a .htaccess file can really hurt your load times on a busy site compared with having the rules in your config file. We initially tried this (~60 million pages/month) but it didn't last very long before our servers started smoking :)
The obvious downside to having the rules in your config is that you have to reload the config whenever you modify your rules.
The "last" flag ([L]) is useful for speeding up execution of your rules, provided your more frequently accessed rules are towards the top and assessed first. It can make maintenance much trickier if you have a long set of rules, though - I wasted a couple of very frustrating hours one morning because I was editing midway down my list of rules and had one near the top that was trapping more than intended!
We had difficulty finding relevant benchmarks also, and ended up working out our own internal suite of tests. Once we got our rules sorted out, properly ordered and into our Apache conf, we didn't find much of a negative performance impact.
If you're worried about Apache's performance, one thing to consider when you have a lot of rewrite rules is the "skip" flag. It is a way to skip matching on a block of rules, so whatever overhead would have been spent on matching them is saved.
Be careful, though: I was on a project which used the "skip" flag a lot, and it made maintenance painful, since its behaviour depends on the order in which things are written in the file.
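A small illustration of the flag (the rules themselves are made up, and the patterns are written in per-directory/.htaccess style): [S=n] tells mod_rewrite to skip the next n rules when the current rule matches, so a cheap early test can bypass a whole block of later rules.

    RewriteEngine On

    # If the request maps to a real file, skip the next 3 rules entirely.
    RewriteCond %{REQUEST_FILENAME} -f
    RewriteRule ^ - [S=3]

    RewriteRule ^blog/([0-9]+)$ /blog.php?id=$1 [L]
    RewriteRule ^shop/(.+)$ /shop.php?path=$1 [L]
    RewriteRule ^(.*)$ /index.php?route=$1 [L]

As the caveat above suggests, the skip count has to be kept in sync whenever rules are added or removed inside the skipped block.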