How to improve search performance on an e-commerce site?

I have an e-commerce website built on ASP.NET MVC3. It has approximately 200k products. We are currently querying the product table directly to provide search on the site.
The issue is that it is painfully slow, and by analyzing performance in the profiler we found that the SQL search itself is the main problem.
What are the alternatives that could be used to speed up the search? Do we need to build a separate cache for search, or does something else need to be done?
If you look at other large e-commerce sites like eBay, Amazon or Flipkart, they are very fast. How do they do it?

They usually build a full text index of whatever is searchable, using for example Lucene.NET or Solr (which runs on Java, but can be queried from .NET using SolrNet).
The index is built from your SQL database, and any searches you do would need to make use of that index by sending queries to it much like you would to a SQL database, just with a different syntax. The index needs to be updated or recreated periodically to stay in sync with the SQL database.
Such a text index is built for querying large amounts of string data and can easily handle hundreds of thousands of products in a product search function on your website. Aside from speed, there are other benefits that would be very hard to do without a text index, such as spelling corrections and fuzzy searches.
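For example, if you went the Solr route, a product query from ASP.NET through SolrNet could look roughly like the sketch below. The Product class, the field names and the Solr core URL are assumptions for illustration; they would have to match whatever schema you define in Solr.

    // Minimal SolrNet sketch (core URL and field names are assumptions).
    using System.Collections.Generic;
    using Microsoft.Practices.ServiceLocation;
    using SolrNet;
    using SolrNet.Attributes;
    using SolrNet.Commands.Parameters;

    public class Product
    {
        [SolrUniqueKey("id")] public string Id { get; set; }
        [SolrField("name")]   public string Name { get; set; }
        [SolrField("brand")]  public string Brand { get; set; }
        [SolrField("price")]  public decimal Price { get; set; }
    }

    public class ProductSearch
    {
        public static void Init()
        {
            // Point SolrNet at the Solr core holding the product index (called once at startup).
            Startup.Init<Product>("http://localhost:8983/solr/products");
        }

        public ICollection<Product> Search(string term, int page, int pageSize)
        {
            var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Product>>();

            // Free-text query against the index; paging is done by Solr, not by SQL Server.
            return solr.Query(new SolrQuery(term), new QueryOptions
            {
                Start = (page - 1) * pageSize,
                Rows = pageSize
            });
        }
    }

Keeping that index in sync, for example with a scheduled job that re-posts changed products, is the main moving part you take on in exchange for the speed.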

Related

Text Search Engine for Mailing List Archive Cataloging and Search

I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.
I see both commercial products and some open-source projects (like Lucene.NET).
Has anyone else done any similar kind of work?
I'm working on a Win2k3 server now, so the immediate thought was to use ASP Classic or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider that as well. I'm not going to rule something out just because of that ;)
Since you are setting up mail search you will need two things: a search engine and a database.
There are many search engines that offer what you need:
Sphinx
Solr (the Lucene and Solr projects have merged now)
PostgreSQL (built-in full text search)
They provide advanced search tools like keywords, field-restricted search, boolean queries, phrase search and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Sphinx and Solr are pretty fast at search. Sphinx does full database search and also does partial indexing. Solr uses index-based search, and is scalable with almost linear performance.
The second most important choice is the database where you store your mails. The mails will be in some format (schema), like fields in a table. It would be plain crazy not to use any format; it is not file search, right? Some search engines require particular databases to work: Sphinx works with SQL databases only, while Solr can also be integrated with NoSQL databases.
If you are not worried about scaling issues (thousands of users, GBs of data, real time performance requirements), then you are fine with a SQL database. Otherwise you will have to use a NoSQL database with Solr.
SQL databases (like PostgreSQL) are the simplest to work with, do what you need and require minimal setup/effort. Connectors will allow you to send a query (the mail search) from the browser to your database; a quick sketch follows.
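If you go with PostgreSQL's built-in search, a minimal sketch of querying it from .NET via the Npgsql connector could look like this. The mails table, its columns and the search_vector column are assumptions; the point is just that a plain SQL connector is all you need to run a full text query.

    // Hedged sketch: assumes a "mails" table with a tsvector column "search_vector"
    // (built from subject and body, with a GIN index on it).
    using Npgsql;

    public class MailSearch
    {
        private const string ConnString =
            "Host=localhost;Database=archive;Username=search;Password=changeme";

        public void Search(string term)
        {
            using (var conn = new NpgsqlConnection(ConnString))
            using (var cmd = new NpgsqlCommand(
                @"SELECT subject, sender, sent_on
                    FROM mails
                   WHERE search_vector @@ plainto_tsquery('english', @term)
                   ORDER BY sent_on DESC
                   LIMIT 50", conn))
            {
                cmd.Parameters.AddWithValue("term", term);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Render the hit however the front end needs it.
                        System.Console.WriteLine("{0} | {1} | {2}",
                            reader[0], reader[1], reader[2]);
                    }
                }
            }
        }
    }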
Also, you said you use Win2k3; you would have to switch to a Linux distribution to take full advantage of these search engines. Win2k3 is slow and does not offer performance comparable to Linux distros.
First, you should think about what you need.
What do you want to search in your e-mail archive? Just full text search in the e-mails' plain text? You will not get matches in mails that are base64 encoded then, for example. Do you need 'fielded' search, e.g. search only in 'subject', 'from', 'to', 'body', 'attachments'?
How do you want to provide access to search the mails? Via a web page? On a command line? In some Windows program?
If you haven't yet, you should examine what your data looks like. Maybe 'mbox' format (one file with the mails' plain text concatenated), 'maildir' (a directory with many files, each containing one mail), or something else?
Setting up a search engine means to think about how data needs to be prepared:
E-mails can contain different kinds of data. You will have to deal with base64 encoded parts, character encodings such as UTF-8, and attachments.
Newsgroup posts may even be split across multiple e-mail messages.
If you want to search different 'fields' ('subject', 'date', 'body') they need to be extracted (see the sketch after this list).
Data needs to be prepared by linguistic means. You will need to find out which language the mails are in (if there are several) and process the data, e.g. to make a search on mouse also match mice and, perhaps, rats, or cursor and pointing device, depending on the topic of your mailing list.
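As an illustration of the extraction step, here is a sketch of pulling fields out of an mbox archive with the MimeKit library (the file path and the fields chosen are assumptions; MimeKit takes care of the MIME decoding, including base64 parts).

    // Sketch of field extraction from an mbox file using MimeKit.
    using System;
    using System.IO;
    using MimeKit;

    public static class MailExtractor
    {
        public static void Extract(string mboxPath)
        {
            using (var stream = File.OpenRead(mboxPath))
            {
                var parser = new MimeParser(stream, MimeFormat.Mbox);
                while (!parser.IsEndOfStream)
                {
                    MimeMessage message = parser.ParseMessage();

                    // MimeKit decodes transfer encodings (base64, quoted-printable)
                    // and character sets, so these come out as plain .NET strings.
                    string subject = message.Subject ?? string.Empty;
                    string from    = message.From.ToString();
                    string body    = message.TextBody ?? message.HtmlBody ?? string.Empty;
                    DateTimeOffset date = message.Date;

                    // Hand the fields to whichever indexer you chose
                    // (Sphinx, Solr, PostgreSQL, ...).
                    Console.WriteLine("{0:u} {1}: {2}", date, from, subject);
                }
            }
        }
    }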
Also think about:
Will there be updates to the data in future?
Are there deletes (including messages being relabeled later)?
Then compare the products (commercial or open source) that you favour on how much of this they already provide and what you will have to write yourself. Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.

Can Sphinx help with user-driven reporting requirements?

I'm creating a MySQL-centric site for our client, who wants to do custom reporting on their data (via HTTP), something like doing selects in phpMyAdmin with read-only access (*). Perhaps like reporting in Pentaho, though I've never used it.
My boss is saying Sphinx is the answer, but reading through the Sphinx docs, I don't see that it will make user-driven reporting any easier. Does Sphinx help with user-driven reporting? If so, what do they call it?
(*) The system has about 25 tables that would likely be used in reporting, with anywhere between 2 and 50 fields. Reports would likely need up to maybe 5 joins per report.
Update: I found http://code.google.com/p/sphinx-report/ so I guess Sphinx doesn't natively do it.
I can only answer for Sphinx Search; I don't really know much about the other Sphinxes.
It doesn't itself contain features specifically for writing reports; it's a general purpose search backend. Just like MySQL is not specifically designed for reports, but can be used for them.
In general, think of Sphinx as providing a very flexible, fast and scalable 'index' over a database table, almost like creating a materialized view in MySQL.
You can decide very carefully what data to include in this index, and the index can be 'denormalized' to include all the data (via complex joins if required), so you can then run very fast queries against it.
Sphinx also supports "GROUP BY", which makes it very useful for creating reports, and because the attribute data is held entirely in memory it is generally very fast.
Basically, Sphinx is very good at providing the backend for a 'dynamic' and 'interactive' reporting system where speed is required, particularly if combined with allowing the user to filter reports via keywords; that is where Sphinx shines.
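To make that concrete: Sphinx exposes its index through SphinxQL, a MySQL-compatible protocol (on port 9306 by default), so from .NET you can run a grouped, keyword-filtered report with the ordinary MySQL connector. The index and attribute names below are assumptions, and you should check that your connector version talks happily to SphinxQL.

    // Hedged sketch: a grouped report over a Sphinx index via SphinxQL,
    // using the standard MySQL .NET connector. Index/attribute names are made up.
    using MySql.Data.MySqlClient;

    public class SphinxReport
    {
        public void SalesByRegion(string keyword)
        {
            // SphinxQL listens on its own port (9306 by default), not MySQL's 3306.
            using (var conn = new MySqlConnection("Server=127.0.0.1;Port=9306;"))
            using (var cmd = new MySqlCommand(
                @"SELECT region_id, COUNT(*) AS orders, SUM(amount) AS total
                    FROM orders_index
                   WHERE MATCH(@kw)
                   GROUP BY region_id
                   ORDER BY total DESC
                   LIMIT 20", conn))
            {
                cmd.Parameters.AddWithValue("@kw", keyword);
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        System.Console.WriteLine("{0}: {1} orders, {2} total",
                            reader["region_id"], reader["orders"], reader["total"]);
                    }
                }
            }
        }
    }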
Because of the upfront work required in designing this index, it is less suited to 'flexible' reports. Building the index will probably involve a number of compromises, so it might be limiting in what reports are possible (at least without creating lots of different indexes).
Short version: lots of upfront work to build the system, in exchange for very fast queries down the line.
Sphinx won't really do anything that MySQL can't, but using Sphinx as part of the system will allow performance to be improved over a pure MySQL solution.

Adding Advanced Search in ASP.NET MVC 3 / .NET

On a website I am working on, there is an advanced search form with several fields, some of them dynamic, which show or hide depending on what is selected on the search form.
The data in the database is expected to be big, and records are spread over several tables in a very normalized fashion.
Is there a recommendation on using a 3rd party search engine, SQL Server full text search, Lucene.NET, etc., rather than plain SELECT/JOIN queries?
Thank you
Thinking a little outside the box here -
Check out CSLA.NET; using this framework you can create business objects and "denormalise" your search algorithm.
Either way, be sure the database has proper indexes in place for better performance.
On the front end you're going to need some JavaScript to map which top-level fields show which sub-level fields. It's pretty straightforward.
For the actual search, I would recommend some flavor of Lucene.
You have the option of Lucene.NET (the .NET port, which Stack Overflow uses), Solr (which is arguably easier to set up and get running than raw Lucene), or the newest kid on the block, ElasticSearch, which aims to be schema-free and scalable simply by dropping more instances into the cluster.
I have only used Solr myself, and it has a nice .NET client (SolrNet).
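If you went with Lucene.NET directly, an advanced search with several optional criteria boils down to composing a BooleanQuery. A rough sketch follows; the field names and index location are assumptions, and the price field would have to be indexed as a numeric field for the range clause to work.

    // Hedged sketch of an "advanced search" with Lucene.NET 3.x.
    using System.IO;
    using Lucene.Net.Analysis.Standard;
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Version = Lucene.Net.Util.Version;

    public class AdvancedSearch
    {
        public TopDocs Search(string text, string brand, double? minPrice, double? maxPrice)
        {
            var dir = FSDirectory.Open(new DirectoryInfo(@"C:\indexes\products"));
            var searcher = new IndexSearcher(dir, true); // read-only

            var query = new BooleanQuery();

            // Free-text part, parsed against the "name" field.
            if (!string.IsNullOrEmpty(text))
            {
                var parser = new QueryParser(Version.LUCENE_30, "name",
                    new StandardAnalyzer(Version.LUCENE_30));
                query.Add(parser.Parse(text), Occur.MUST);
            }

            // Optional exact filter on brand.
            if (!string.IsNullOrEmpty(brand))
                query.Add(new TermQuery(new Term("brand", brand.ToLowerInvariant())), Occur.MUST);

            // Optional range on price.
            if (minPrice.HasValue || maxPrice.HasValue)
                query.Add(NumericRangeQuery.NewDoubleRange("price",
                    minPrice, maxPrice, true, true), Occur.MUST);

            return searcher.Search(query, 40);
        }
    }

Each optional form field simply adds or skips a clause, which maps naturally onto the dynamic show/hide fields in the question.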
First, index the database fields that are important and heavily used.
For search it is better to use full text search; I tried it and the results are very different from when I don't use full text.
It is also better to put your SELECT and JOIN queries in a stored procedure and call the stored procedure from your program (a sketch follows).
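To make that concrete, here is a hedged sketch. The stored procedure name, its parameter and the table are assumptions, but the shape is a full text CONTAINS search wrapped in a procedure and called through ADO.NET.

    // Hedged sketch: assumes a stored procedure dbo.SearchProducts wrapping a
    // full text CONTAINS query over Products(Name, Description), e.g.:
    //
    //   CREATE PROCEDURE dbo.SearchProducts @Term nvarchar(200) AS
    //     SELECT TOP 50 p.ProductId, p.Name, p.Price, b.BrandName
    //       FROM Products p JOIN Brands b ON b.BrandId = p.BrandId
    //      WHERE CONTAINS((p.Name, p.Description), @Term);
    //
    using System.Data;
    using System.Data.SqlClient;

    public class ProductSearchSql
    {
        public DataTable Search(string term, string connectionString)
        {
            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("dbo.SearchProducts", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                // CONTAINS expects a quoted search term; "term*" gives prefix matching.
                cmd.Parameters.AddWithValue("@Term", "\"" + term + "*\"");

                var table = new DataTable();
                conn.Open();
                new SqlDataAdapter(cmd).Fill(table);
                return table;
            }
        }
    }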

How to access data in Dynamics CRM?

What is the best way, in terms of platform speed and maintainability, to access data (read only) in Dynamics CRM 4? I've done all three, but I'm interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders you'll have to load both up individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages used by the API and web services.
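For example, the read-only 'customers and their orders' case above is a single joined query against the filtered views. The view and column names below follow the usual CRM conventions, but treat them as assumptions and check them against your organization database.

    // Hedged sketch: read-only query joining CRM filtered views with plain ADO.NET.
    using System.Data.SqlClient;

    public class CrmReadOnly
    {
        public void AccountsWithOrders(string connectionString)
        {
            const string sql =
                @"SELECT a.name, o.ordernumber, o.totalamount
                    FROM FilteredAccount a
                    JOIN FilteredSalesOrder o ON o.customerid = a.accountid
                   ORDER BY a.name";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        System.Console.WriteLine("{0} | {1} | {2}",
                            reader["name"], reader["ordernumber"], reader["totalamount"]);
                    }
                }
            }
        }
    }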
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending upon your application it could make a huge difference in performance.
I'll add to jake's comment by saying that querying against the tables directly instead of the views (*base & *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
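As an illustration of that last case, a direct update of a boolean field might look like the sketch below. The table and column names are assumptions for a custom entity, and this is of course unsupported, so back up first and keep it out of normal operation.

    // Hedged, UNSUPPORTED sketch: flipping a boolean flag on many records directly.
    using System.Data.SqlClient;

    public class DirectCrmUpdate
    {
        public int MarkAllExported(string connectionString)
        {
            const string sql =
                @"UPDATE new_truckloadExtensionBase
                     SET new_isexported = 1
                   WHERE new_isexported = 0";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                conn.Open();
                return cmd.ExecuteNonQuery(); // number of rows touched
            }
        }
    }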
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.

High performance product catalog in ASP.NET?

I am planning a high performance e-commerce project in ASP.NET and need help selecting the optimal data retrieval model for the product catalog.
Some details:
products in 10-20 categories
1000-5000 products in every category
products listed with name, price, brand and image, 15-40 on every page
products need to be listed without table tags
product info in 2-4 tables that will be joined together (product images not stored in the db)
web server and SQL database on different hardware
MS SQL 2005 on a shared db server (pretty bad performance to start with...)
enable users to search products combining different criteria such as price range, brand, with/without image.
My questions are,
what technique shall I use to retrieve the products?
what technique shall I use to present the products?
what cache strategy do you recommend?
how do I solve filtering, sorting, paging in the most efficient way?
do you have any recommendations for further reading on this subject?
Thanks in advance!
Let the SQL server retrieve the data.
With fairly good indexing the SQL server should be able to cope.
In SQL 2005 you can do paging in the results; that way you have less data to shuffle back and forth. A sketch of that follows.
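A minimal sketch of SQL 2005 style paging from ADO.NET, using ROW_NUMBER(), is below; the table and column names are assumptions.

    // Hedged sketch: ROW_NUMBER() paging (SQL Server 2005+) so only one page of
    // products ever crosses the wire. Table/column names are assumptions.
    using System.Data;
    using System.Data.SqlClient;

    public class CatalogPage
    {
        public DataTable GetPage(string connectionString, int categoryId, int page, int pageSize)
        {
            const string sql =
                @"WITH Numbered AS (
                      SELECT p.ProductId, p.Name, p.Price, p.Brand, p.ImagePath,
                             ROW_NUMBER() OVER (ORDER BY p.Name) AS RowNum
                        FROM Products p
                       WHERE p.CategoryId = @CategoryId
                  )
                  SELECT ProductId, Name, Price, Brand, ImagePath
                    FROM Numbered
                   WHERE RowNum BETWEEN @First AND @Last";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(sql, conn))
            {
                cmd.Parameters.AddWithValue("@CategoryId", categoryId);
                cmd.Parameters.AddWithValue("@First", (page - 1) * pageSize + 1);
                cmd.Parameters.AddWithValue("@Last", page * pageSize);

                var table = new DataTable();
                conn.Open();
                new SqlDataAdapter(cmd).Fill(table);
                return table;
            }
        }
    }

The same WHERE clause can take the filter criteria (price range, brand, has image), and the ORDER BY column can be switched for sorting.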
I think you will end up with a lot of text searching. Give either Lucene or Solr (an HTTP server on top of Lucene) a try. CNET developed Solr for their product catalog search.
Have you thought about looking at an existing shopping cart platform that allows you to purchase the source code?
I've used www.aspdotnetstorefront.com
They have lots of examples of major e-commerce stores running on this platform. I built www.ElegantAppliance.com on this platform. Several thousand products, over 100 categories/sub-categories.
Make sure your database design is normalised as much as possible - use lookup tables where necessary to make sure you are not repeating data unnecessarily.
Store your images on the server filesystem and store a relative (not full) path reference to them in the database.
Use stored procedures wherever possible, and always retrieve the least amount of data you can from the server, to help with memory and network traffic efficiency (see the sketch below).
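For instance, the listing page only needs a handful of columns, so map just those into a small object rather than pulling whole product rows. The procedure and column names here are assumptions.

    // Hedged sketch: read only the columns the listing page actually displays.
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;

    public class ProductListItem
    {
        public string Name { get; set; }
        public decimal Price { get; set; }
        public string Brand { get; set; }
        public string ImagePath { get; set; } // relative path, resolved against the web root
    }

    public class CatalogRepository
    {
        public List<ProductListItem> GetListing(string connectionString, int categoryId, int page)
        {
            var items = new List<ProductListItem>();

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand("dbo.GetProductListing", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@CategoryId", categoryId);
                cmd.Parameters.AddWithValue("@Page", page);

                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        items.Add(new ProductListItem
                        {
                            Name      = (string)reader["Name"],
                            Price     = (decimal)reader["Price"],
                            Brand     = (string)reader["Brand"],
                            ImagePath = (string)reader["ImagePath"]
                        });
                    }
                }
            }
            return items;
        }
    }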
Don't bother with caching, your database should be fast enough to produce results immediately, and if not, make it faster.
