On a website I am working on, there is an advanced search form with several fields, some of which are dynamic and are shown or hidden depending on what is selected elsewhere in the form.
The data in the database is expected to be large, and the records are spread over several tables in a highly normalized fashion.
Is there a recommendation on using a third-party search engine, SQL Server full-text search, Lucene.NET, etc., rather than plain SELECT/JOIN queries?
Thank you
Thinking a little outside the box here -
Check out CSLA.NET; using this framework you can create business objects and "denormalise" your search algorithm.
Either way, be sure the database has proper indexes in place for better performance.
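If you stay with plain SQL, a covering nonclustered index on the columns the search form filters on is usually the first step. A rough sketch only - the table and column names below are made up for illustration:
-- Hypothetical example: index the columns most often used in the
-- search form's WHERE clauses, covering a commonly returned column.
CREATE NONCLUSTERED INDEX IX_Products_CategoryId_Name
ON dbo.Products (CategoryId, Name)
INCLUDE (Price);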
On the frontend you're going to need some JavaScript to map which top-level fields show which sub-level fields. It's pretty straightforward.
For the actual search, I would recommend some flavor of Lucene.
You have the option of Lucene.NET (the .NET flavor of Lucene, which Stack Overflow uses), Solr (arguably easier to set up and get running than raw Lucene), or the newest kid on the block, Elasticsearch, which aims to be schema-free and horizontally scalable simply by adding more instances to the cluster.
I have only used Solr myself, and it has a nice .NET client (SolrNet).
First, index the database fields that matter most for your searches.
For the search itself, use full-text search; when I tried it, the results were dramatically better than without full-text indexing.
It is also better to keep the SELECT/JOIN queries in a stored procedure and call that stored procedure from your program.
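For SQL Server, the full-text setup looks roughly like this - a sketch only, where the table, key index, and column names are assumptions:
-- One-time setup: a full-text catalog and an index on the searchable column.
CREATE FULLTEXT CATALOG SearchCatalog;
CREATE FULLTEXT INDEX ON dbo.Products (Description)
KEY INDEX PK_Products ON SearchCatalog;

-- Queries then use CONTAINS instead of LIKE '%term%':
SELECT Id, Name
FROM dbo.Products
WHERE CONTAINS(Description, 'shopping');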
I have a situation where I need to write a search query in Elasticsearch, with data as follows:
{id:"p1",person:{name:"name",age:"12"},relatedTO:{id:"p2"}}
{id:"p2",person:{name:"name2",age:"15"},relatedTO:{id:"p3"}}
{id:"p3",person:{name:"name3",age:"17"},relatedTO:{id:"p1"}}
Scenario: users want to search for people related to p2 and then, using each related person, find who they are related to.
1. First find who is related to p2 - answer: p1.
2. Now find people related to p1 - answer: p3. (The requirement as of now is to go only one level deep, so there is no need to find people related to p3.) The final result should be p2, p1, p3.
In a normal scenario we would write a nested SQL query to get these results. How do we achieve this with the Elasticsearch query language in one shot?
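(For reference, the "nested SQL" mentioned above would be a self-join along these lines - the relations table and its column names are assumptions for illustration only:)
-- Level 1: who is related to p2; level 2: who is related to those people.
SELECT r1.id AS level1, r2.id AS level2
FROM relations r1
LEFT JOIN relations r2 ON r2.related_to = r1.id
WHERE r1.related_to = 'p2';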
For one shot you would need to use parent-child relationships, but I wouldn't recommend that to you in the first place, because it is not very performant. By the way: grandparents and grandchildren are also supported.
You could also use application-side joins, meaning you execute several queries until you get what you want. (Be aware that the first result sets should be very small, otherwise this can get costly.)
What I would really recommend is that you read this documentation and rethink your use case.
In case you want to model relationships like in Facebook or Google+, I would tell you to look at a NoSQL graph database.
Note: Ideally in Elasticsearch the data is flat, which means denormalized.
I have a table that I've created a Full Text Catalog on. The table has just over 6000 rows. I've added two columns to the index. The first could be considered a unique identifier of sorts and the second could be considered the content for that item (there are 11 other columns in my table that aren't part of the Full Text Catalog). Here is an example of a couple of rows:
TABLE: data_variables
ROW unique_id label
1 A100d1 Personal preference of online shopping sites
2 A100d2 Shopping behaviors for adults in household
In my web application on the front end, I have a text box that the user can type into to get a list of items that match whatever terms they're searching for in the UNIQUE ID or LABEL columns. So, for example, if the user typed in sho or a100 then a list would be populated with both of the rows above. If they typed in behav then a list would be populated with only row 2 above.
This is done via an Ajax request on each keyup. PHP calls a Stored Procedure on the SQL server that looks like:
SELECT TOP 50 dv.id, dv.id + ': ' + dv.label,
dv.type_id, dv.grouping, dv.friendly_label
FROM data_variables dv
WHERE (CONTAINS((dv.unique_id, dv.label), @search))
(@search is the text from the user that is passed into the Stored Procedure.)
I've noticed that this gets pretty sluggish, especially when I wasn't using TOP 50 in the query.
What I'm looking for is a way to speed this up either directly on the SQL Server or by abandoning the full-text indexing idea and using jQuery to search through an array of the searchable items on the client-side. I've looked a bit into the jQuery AutoComplete stuff and some other jQuery plugins for AutoComplete, but haven't yet tried to mock up anything. That would be my next step, but I wanted to check here first to see what advice I would get.
Thanks in advance.
Several suggestions, based around the fact that you have only 6000 rows, so the database should eat this alive.
A. Try using the LIKE operator, just in case it helps. I'm not expecting it to, but it's trivial to try. Something else is going on overall for this to be slow at such small volumes.
B. Can you cache queries in advance? With 6000 rows, there are only about 36*36 combinations of two-character queries, which should take virtually no memory and would save the database any work.
C. Moving the selection out to the client is a good idea; it depends on how big the 6000 rows are overall versus the network latency of individual lookups.
D. Combining B and C would give you really good performance, I suspect, but with some coding effort required. If the server maintains a cached result list for every single-character query, and clients download that letter's cache set after the first keystroke, then they already hold a subset of all rows locally and won't need any more network IO for additional keystrokes.
I would advise against a LIKE, unless you're using a linear index (left-to-right) and you're doing queries like LIKE 'work%'. If you're doing something like LIKE '%word%' a regular index isn't going to help you. You typically want to use a Full-Text index when you want to search for words inside a paragraph.
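To illustrate, assuming an ordinary B-tree index exists on the label column of the question's data_variables table:
-- A left-anchored pattern can use the index (an index seek):
SELECT id FROM data_variables WHERE label LIKE 'work%';
-- A leading wildcard cannot, and forces a scan of every row:
SELECT id FROM data_variables WHERE label LIKE '%word%';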
With a lot of data, the built-in full-text engines in databases typically aren't very stellar. For the best performance you typically have to go with an external solution that is built specifically for full-text search.
Some options are Sphinx, Solr, and Elasticsearch, just to name a few. I wouldn't say that any one of these options is better than the others. There are definitely pros and cons to consider:
What kind of data do you have?
What language support do these solutions have?
What database engines do these solutions support?
The best thing you can do is benchmark these solutions against your existing data. Testing each and every individual component (unit testing) can help you identify the real problems and help you find good solutions.
I had the same problem and went for the LIKE solution. I also found the OR operator to be too taxing, so I divided the query into two SELECTs combined with a UNION ALL (fastest, and in my scenario it was impossible for the same text to appear in both the ID column and the label).
Yours would look like this:
SELECT TOP 50 *
FROM (
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.unique_id LIKE '%' + @search + '%'
    UNION ALL
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.label LIKE '%' + @search + '%'
) AS results
Oh!! And test the performance in SQL Server, not the web!
If you plan to grow the amount of data, the best approach for full-text searching is an inverted index.
Look at Apache Solr - the best full-text search engine at the moment.
You can simply index your database data periodically and use Solr as the search engine;
it provides a simple Ajax API and can be queried directly from the frontend.
If you really need performance, you may want to look at FTS3 and FTS4.
A snippet from another forum:
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both an FTS table and an ordinary SQLite table created using the following SQL script:
Code:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);  /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                     /* Ordinary table */
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 seconds for querying the ordinary table.
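The two queries referred to look roughly like this (reconstructed from the linked page, so treat the exact form as approximate):
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* ~0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* ~22.5 seconds */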
see...
http://www.sqlite.org/fts3.html
I'm working with a mailing list archive and am tasked with setting up basic search, boolean search, and ultimately some sort of more intelligent tag-based searching.
I see both commercial products and some open-source projects (like Lucene.NET)
Has anyone else done any similar kind of work?
I'm working on a Win2k3 server now, so the immediate thought was to use ASP Classic or ASP.NET. However, if there were another platform that was orders of magnitude better for the purpose, then I'd consider it as well. I'm not going to throw something out just because of that ;)
Since you are setting up mail search you will need two things: a search engine and a database.
There are many search engines that offer what you need.
Sphinx
Solr (Lucene and Solr are merged now)
PostgreSQL (built-in full-text search)
They provide advanced search tools like keywords, field-restricted search, boolean queries, phrase search and more. Here is another SO post looking into various text search engines: Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?
Sphinx and Solr are both pretty fast at searching. Sphinx searches across the full database and also does partial indexing. Solr uses index-based search and is scalable with almost linear performance.
The second most important choice is the database where you store your mails. The mails will be in some format (schema), like fields in a table. It would be plain crazy not to use any format - it is not file search, right? Some search engines require particular databases to work: Sphinx uses SQL databases only, while Solr can be integrated with NoSQL databases.
If you are not worried about scaling issues (thousands of users, GBs of data, real-time performance requirements), then you are fine with a SQL database. Otherwise you will have to use a NoSQL database with Solr.
SQL databases (like PostgreSQL) are the simplest to work with, do what you need, and require minimal setup/effort. Connectors will allow you to send a query (mail search) from the browser to your database.
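A rough sketch of what PostgreSQL's built-in full-text search looks like (the messages table and its columns are assumptions for illustration):
-- Match mails whose subject or body contain the search words.
SELECT id, subject
FROM messages
WHERE to_tsvector('english', subject || ' ' || body)
      @@ plainto_tsquery('english', 'boolean search');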
Also, you said you use Win2k3; you may want to switch to a Linux distribution to take full advantage of these search engines, since that is where they are primarily developed and tuned.
First, you should think about what you need.
What do you want to search in your e-mail archive? Just full-text search of the e-mails' plain data? You will not get matches in mails that are base64 encoded then, for example. Do you need ‘fielded’ search? E.g.: search only in ‘subject’, ‘from’, ‘to’, ‘body’, ‘attachments’?
How do you want to provide access to search in the mails? Via a web page? On a command line? In some windows program?
If you haven't yet, you should examine what your data looks like. Maybe ‘mbox’ format (one file with the mails' plain text concatenated), ‘maildir’ (a directory with many files, each containing one mail), or something else?
Setting up a search engine means to think about how data needs to be prepared:
E-mails can contain different kinds of data inside. You will have to deal with base64-encoded data, character encodings such as UTF-8, and attachments.
Newsgroup messages may even be split across multiple e-mail messages.
If you want to search different ‘fields’ (‘Subject’, ‘date’, ‘body’) they need to be extracted.
Data needs to be prepared by linguistic means. You will need to find out which language the mails are in (if there are several) and process the data, e.g. to make a search on mouse match notions of mice and, perhaps, rats; or cursor and pointing device, depending on the topic of your mailing list.
Also think about:
Will there be updates to the data in future?
Are there deletes (including messages being relabeled later)?
Then compare the products (commercial or open source) you favour on how much of this they already provide and what you will have to write yourself. Be aware that providing a search experience is more than downloading a search engine and dropping in a ton of data.
I have an e-commerce website built upon ASP.net MVC3. It has appx. 200k products. We are currently searching in product table for providing search on site.
The issue is that it is deadly slow, and of course, by analyzing performance in the profiler we found that the SQL search is the main issue.
What other alternatives could be used to speed up the search? Do we need to build a separate cache for search, or does anything else need to be done?
If you look at the other large e-commerce sites like ebay, amazon or flipkart, they are very fast. How do they do it?
They usually build a full text index of what is searchable, using for example Lucene.NET or Solr (uses Java, but an instance of it can be called through .NET using SolrNet).
The index is built from your SQL database and any searches you do would need to make use of that index by sending queries to it like you would on a SQL database, but with a different syntax of course. The index needs to be updated/recreated periodically to stay up to date with the SQL database.
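The feed into such an index is typically just a flat, denormalized SELECT over your product tables - something along these lines, where the table and column names are assumptions:
-- One row per product, flattened for the indexer (e.g. Solr's DataImportHandler).
SELECT p.ProductId, p.Name, p.Description, c.CategoryName, p.Price
FROM Products p
JOIN Categories c ON c.CategoryId = p.CategoryId;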
Such a text index is built for querying large amounts of string data and can easily handle hundreds of thousands of products in a product search function on your website. Aside from speed, there are other benefits that would be very hard to do without a text index, such as spelling corrections and fuzzy searches.
What is the best way in terms of speed of the platform and maintainability to access data (read only) on Dynamics CRM 4? I've done all three, but interested in the opinions of the crowd.
Via the API
Via the webservices directly
Via DB calls to the views
...and why?
My thoughts normally center around DB calls to the views but I know there are purists out there.
Given both requirements I'd say you want to call the views. Properly crafted SQL queries will fly.
Going through the API is required if you plan to modify data, but it isn't the fastest approach around because it doesn't allow deep loading of entities. For instance, if you want to look at customers and their orders, you'll have to load both up individually and then join them manually, whereas a SQL query will already have the data joined.
Never mind that the TDS stream is a lot more efficient than the SOAP messages used by the API and web services.
UPDATE
I should point out, in regard to the views and the CRM database in general: CRM does not optimize the indexes on the tables or views for custom entities (how could it?). So if you have a truckload entity that you look up by destination all the time, you'll need to add an index for that property. Depending on your application it could make a huge difference in performance.
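A hedged sketch of such an index - the entity, table, and column names here are hypothetical, following the usual new_*ExtensionBase naming for custom entities:
-- Add an index supporting frequent lookups of truckloads by destination.
CREATE NONCLUSTERED INDEX IX_truckload_destination
ON dbo.new_truckloadExtensionBase (new_destination);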
I'll add to Jake's comment by saying that querying against the tables directly instead of the views (*base & *extensionbase) will be even faster.
In order of speed it'd be:
direct table query
view query
filtered view query
api call
Direct table updates:
I disagree with Jake that all updates must go through the API. The correct statement is that going through the API is the only supported way to do updates. There are in fact several instances where directly modifying the tables is the most reasonable option:
One time imports of large volumes of data while the system is not in operation.
Modification of specific fields across large volumes of data.
I agree that this sort of direct modification should only be a last resort when the performance of the API is unacceptable. However, if you want to modify a boolean field on thousands of records, doing a direct SQL update to the table is a great option.
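For example, a bulk flag change of that sort is a one-statement affair in SQL - the table and column names here are hypothetical:
-- Flip a boolean flag across a large set of records in one statement.
UPDATE dbo.new_truckloadExtensionBase
SET new_ishazardous = 1
WHERE new_destination = 'Chicago';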
Relative Speed
I agree with XVargas as far as relative speed.
Unfiltered Views vs Tables: I have not found the performance advantage to be worth the hassle of manually joining the base and extension tables.
Unfiltered views vs Filtered views: I recently was working with a complicated query which took about 15 minutes to run using the filtered views. After switching to the unfiltered views this query ran in about 10 seconds. Looking at the respective query plans, the raw query had 8 operations while the query against the filtered views had over 80 operations.
Unfiltered Views vs API: I have never compared querying through the API against querying views, but I have compared the cost of writing data through the API vs inserting directly through SQL. Importing millions of records through the API can take several days, while the same operation using insert statements might take several minutes. I assume the difference isn't as great during reads but it is probably still large.
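For context, the filtered vs. unfiltered comparison above boils down to which set of views the same query hits. The view names below follow the standard CRM naming convention; the column list is an assumption:
-- Filtered view: enforces CRM security per user, much heavier query plan.
SELECT accountid, name FROM dbo.FilteredAccount;
-- Unfiltered view: the raw base + extension data, far fewer plan operators.
SELECT AccountId, Name FROM dbo.Account;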