Should sitemap requests load quickly?

Given that the only clients accessing a site's sitemaps are search engine spiders, does it make any sense to make these sitemap requests load quickly?

If you are concerned about it, you can always choose to compress them (.gz); that reduces the load a lot, since compressed XML sitemap files are much smaller.
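For example, a sitemap written to disk can be gzipped after it is generated and served as sitemap.xml.gz. A minimal Java sketch, assuming the file names used here:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.util.zip.GZIPOutputStream;

    // Writes sitemap.xml.gz alongside sitemap.xml; crawlers accept the .gz form directly.
    public class CompressSitemap {
        public static void main(String[] args) throws Exception {
            try (FileInputStream in = new FileInputStream("sitemap.xml");
                 GZIPOutputStream out = new GZIPOutputStream(new FileOutputStream("sitemap.xml.gz"))) {
                in.transferTo(out); // transferTo is available from Java 9 onward
            }
        }
    }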

Related

Nutch retrieves too many duplicate images

I am trying to retrieve images with Nutch. The plugin just searches for the required images and retrieves their URLs. What I get at the end contains too many duplicate URLs: it retrieved 43 thousand URLs and 39 thousand of them were duplicates.
Is this normal, or could there be some fault in the code I wrote (which I don't think is the case), or otherwise some problem with Nutch itself?
Could it be, for instance, that the same images are referenced multiple times? In that case your results could be perfectly normal. Running a test on a given/known set of URLs should give you a better answer: limit your crawl to only the URLs in the seed file, run a test, and check which images are being crawled. What is the size of your crawl? Are you fetching already-fetched pages or focusing on not-yet-visited pages? Are you ignoring small images like icons?
Keep in mind that a lot of image assets on a website are usually reused over and over again, especially if the website isn't …
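If the duplicates really are the same asset referenced from many pages, a post-processing pass can collapse them before you count results. A hedged sketch (plain Java over the extracted URL list, not Nutch API code):

    import java.net.URI;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Normalize image URLs and drop exact duplicates, so the same asset
    // referenced from many pages is only counted once.
    public class ImageUrlDedup {
        public static Set<String> dedupe(List<String> urls) {
            Set<String> unique = new LinkedHashSet<>();
            for (String u : urls) {
                try {
                    URI uri = new URI(u).normalize();          // resolves "./" and "../" segments
                    unique.add(uri.toString().toLowerCase());  // crude case-folding for comparison
                } catch (Exception ignored) {
                    // skip malformed URLs
                }
            }
            return unique;
        }
    }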

BlogEngine XML format practicality

How practical is using XML format storage for BlogEngine's posts, without adversely affecting its performance?
For a very small number of posts, it's definitely a good choice.
XML storage is fine for websites that have fewer than 100 posts and/or pages.
If you have more than 100, the site is going to slow down the server.
The same goes for a members-based site: stay under roughly 300 members if you use XML storage.
http://blogengine.codeplex.com/discussions/257560
http://blogengine.codeplex.com/discussions/253252

MVC-3 User-Image Management - Best Practices

Developing using MVC-3, Razor, C#
I've been searching around and cannot find the advice I'm looking for. My site will contain user-uploaded images (possibly a high number). What is the best practice for managing these pictures (placement, breakdown into sub-folders, etc.)? Where should I place them so they won't be accidentally blown away when I republish my site periodically?
If there are any good articles or blog posts, that would be helpful. Also, any advice/tips anyone wants to add would be great.
Thanks for your time!
Rob
EDIT
I would also like to know what people do to prevent hot linking.
A site that I run with a high volume of images stores all of the images in a date-based folder structure, i.e. 2010/Dec/31/image.jpg.
There are two reasons for this.
The first is the limited amount of DB space (200 MB) that came with my hosting plan. Obviously, if I had gigabytes of space I would have stored the images in the DB.
The second reason is to keep the number of images in each folder to a minimum. Directory listings take longer the more files they contain, so a new directory every 24 hours was my workaround.
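As a rough illustration of the date-folder idea (shown in Java; the same logic ports directly to C#, and the "uploads" root is a placeholder):

    import java.io.File;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;
    import java.util.Locale;

    public class UploadPaths {
        // Builds a per-day target such as uploads/2010/Dec/31/image.jpg
        static File targetFor(String fileName) {
            String day = LocalDate.now().format(DateTimeFormatter.ofPattern("yyyy/MMM/dd", Locale.ENGLISH));
            File dir = new File("uploads/" + day);
            dir.mkdirs(); // creates the new directory the first time a file arrives that day
            return new File(dir, fileName);
        }
    }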
Can you perhaps tell us more about what resources you have or how many images you estimate will be uploaded daily?
If you are using SQL Server 2008 or above, you can use FILESTREAM. If the files are under 1 MB in size, you might even get better performance by storing them as VARBINARY(MAX). The best part about storing them in the database is that you can easily use transactions.
As for replication and backup, you can use standard database replication and backup, and the files come along with it.
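To make the transaction point concrete, here is a hedged sketch (Java/JDBC rather than C#; the UserImages table and column names are hypothetical, and a SQL Server JDBC driver is assumed on the classpath):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    // Writing the bytes inside a transaction means the row and the image
    // either both commit or both roll back.
    public class StoreImage {
        public static void store(String jdbcUrl, Path image) throws Exception {
            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO UserImages (FileName, Data) VALUES (?, ?)")) {
                    ps.setString(1, image.getFileName().toString());
                    ps.setBytes(2, Files.readAllBytes(image));
                    ps.executeUpdate();
                    conn.commit();
                } catch (Exception e) {
                    conn.rollback();
                    throw e;
                }
            }
        }
    }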
If you have the space in your DB, then I recommend that, as backup/restore becomes much easier. If you have limited space for your DB, then a folder structure would work, though I would not store more than 1,000 files in a single folder. So you'll want a scheme that keeps any one location from holding more than 1,000 images and folders. If you expect fewer than 1,000 images per day, a variation on what Sir Psycho suggested would probably work well: a folder for each year, then a sub-folder under the year with month and day to store all the images for that day.
To answer your question about hot linking: your best bet is to check the referrer (sent in the headers of the request for the image) and make sure it's coming from your domain. If it's not, you can either send back nothing, or send back an image that lets the user know they cannot view the image from the third-party site.
The header data can be spoofed, but the odds are that random visitors coming from the third-party site will not have done this, and probably wouldn't know or care how to.
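The same referrer check in servlet-filter form (Java shown only for illustration; "www.example.com" and the placeholder image path are assumptions, and the placeholder must live outside the filtered path so the redirect does not loop):

    import java.io.IOException;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    // Map this filter to /images/* (e.g. in web.xml). A missing Referer is allowed so
    // direct visits and privacy-conscious browsers still get the image.
    public class HotlinkFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String referer = ((HttpServletRequest) req).getHeader("Referer");
            if (referer == null || referer.contains("//www.example.com")) {
                chain.doFilter(req, res); // own site or no referrer: serve the image as usual
            } else {
                // placeholder lives under /static/, not /images/, to avoid a redirect loop
                ((HttpServletResponse) res).sendRedirect("/static/hotlink-placeholder.png");
            }
        }
        public void init(FilterConfig config) {}
        public void destroy() {}
    }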

Fastest ETag algorithm

We want to make use of http caching on our website - in particular content validation.
Because our CMS constructs pages from smaller fragments of content, the last-modified date of the actual page is not always an accurate indicator that the page has changed. Hence we also want to make use of ETags. Because page construction is based on lots of other page fragments, we think the only real way to provide an accurate ETag is by performing some sort of digest on the content stream itself. This seems a little overcooked, since caching is supposed to ease the load on the servers and a content digest is obviously CPU intensive.
I'm looking for the fastest algorithm to create a unique ETag that is relevant to the content stream (an inode-based value is just a kludge and won't work). An MD5 hash is obviously going to give the best unique result, but is anybody making use of other, faster algorithms in a similar situation?
Sorry, I forgot the important details: we're using Java Servlets, running in WebSphere 6.1 on Windows 2003.
I also forgot to mention that there are live database feeds (we're a bank and need to make sure interest rates are up to date) that can change the content as well, so figuring out when the content has changed can be tricky.
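For reference, the content-digest approach the question describes is only a few lines; a minimal sketch using the JDK (CRC32 or Adler32 from java.util.zip are faster alternatives if the weaker collision resistance is acceptable for change detection):

    import java.math.BigInteger;
    import java.security.MessageDigest;

    public final class ETags {
        // Digest the rendered page bytes and format them as a quoted ETag value.
        public static String forContent(byte[] body) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            return "\"" + new BigInteger(1, digest).toString(16) + "\"";
        }
    }

The result can then be compared against the If-None-Match request header to decide whether to return 304 Not Modified.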
I would generate a checksum for each fragment, but compute it when the fragment is changed, not when you render the page.
This way, you pay a one-time cost, which should be relatively small, unless we're talking hundreds of changes per second, and there is no additional cost per request.
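A hedged sketch of that idea: if each fragment stores a checksum that is updated whenever the fragment (or its database feed) changes, the per-request work is just combining those stored values (the names below are hypothetical):

    import java.util.List;
    import java.util.zip.CRC32;

    public final class PageETag {
        // Combine the fragments' stored checksums into one page-level ETag.
        // No page content is re-digested at request time.
        public static String combine(List<Long> fragmentChecksums) {
            CRC32 crc = new CRC32();
            for (long c : fragmentChecksums) {
                crc.update(Long.toString(c).getBytes());
            }
            return "\"" + Long.toHexString(crc.getValue()) + "\"";
        }
    }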

How to generate a sitemap on a highly dynamic website?

Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.
On Stack Overflow (and all Stack Exchange sites), a sitemap.xml file is created which contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. It isn't that resource-intensive to add to the end of the file, but the file is quite large.
That is the only way search engines like Google can effectively crawl the site.
Jeff Atwood talks about it in a blog post: The Importance of Sitemaps
This is from Google's webmaster help page on sitemaps:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler just to have the XML file generated on demand directly from the database (with a little caching).
To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
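A hedged sketch of such a partitioned index (the domain and URL pattern are placeholders); each month's sitemap can then be generated from the database on demand and cached:

    import java.time.YearMonth;

    public final class SitemapIndex {
        // Emits a <sitemapindex> with one entry per month; crawlers can fetch the
        // current month's sitemap often and the older ones only occasionally.
        public static String build(YearMonth oldest) {
            StringBuilder xml = new StringBuilder(
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
            for (YearMonth m = oldest; !m.isAfter(YearMonth.now()); m = m.plusMonths(1)) {
                xml.append("  <sitemap><loc>https://example.com/sitemap-")
                   .append(m) // YearMonth prints as e.g. 2011-03
                   .append(".xml</loc></sitemap>\n");
            }
            return xml.append("</sitemapindex>\n").toString();
        }
    }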
I'd like to share my solution here just in case it helps someone as well.
It took me reading this question and many others to decide what to do.
My site structure.
Static pages
Home (Highly dynamic. Cached for 30 mins)
Artists, Albums, Songs and Playlists (Paginated List)
Legal (Static page with Terms etc)
...etc
Dynamic Pages
Artists, Albums, Songs and Playlists detail pages
My approach.
sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. is counted and divided by 1,000 (the number of URLs I want in each sitemap; the protocol limit is 50,000), and I round this number up.
So, for example, 1,900 songs = 1.9 = 2.
I add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index, and repeat this for all other items. Basically, I am paginating.
The output is returned uncached. I want this to always be fresh.
sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.
sitemap-songs-0.xml, sitemap-albums-0.xml, etc: I use a single route for this in SlimPhp 2.
$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) { ... });
I use a simple switch statement to generate the relevant files. If the page contains 1,000 items (the limit specified above), I cache the file for two weeks; otherwise, I only cache it for a few hours.
I guess this can help anyone else implement their own system.
Even on something like Stack Overflow, there is a certain amount of static organization: there are FAQs, tag pages, question pages, user pages, badge pages, and so on. I'd say that on a very dynamic site, the best way to approach a sitemap is to map the categorizations; each node in the sitemap can point to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).
Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.
For a highly dynamic site, I wrote a cron job on my server that runs on a daily basis. It makes a REST call to my backend, generates a new sitemap from all newly created content, and returns the sitemap as an XML file. The new sitemap overwrites the previous one and keeps my website up to date with all the changes. Regenerating the sitemap for every newly added piece of dynamic content is not a good approach, I think.
