How to implement an in-process full-text search engine - Windows

In one of our commercial applications (Win32, written in Delphi) we'd like to implement full-text search. The application stores user data in a binary format that is not directly recognizable as text.
Ideally, I'd like to find either an in-process solution (a DLL would be OK) or a local server that I could access via TCP (preferred). The API should allow me to submit textual information to the server (along with metadata representing the binary blob it came from) and, of course, it should allow me to run a full-text search with at least minimal support for logical operators and substring searching. Unicode support is required.
I found an extensive list of search engines on Stack Overflow (What are some Search Servers out there?), but I don't really understand which of those engines could satisfy my needs. I thought I'd ask The Collective for opinions before spending a day or two testing each of them.
Any suggestions?

There are a number of options on the market, either fully fledged commercial products or open-source variants. Your choice of a search provider depends very much on the customers you are targeting.
Microsoft has a free Express version of their Search Server. As far as I know, the Express edition is limited to running the Application Tier on one server.
There is also the Apache Lucene project, which is open source. It has a nice API that's easy to use and a large community of users. The original project is written in Java, but there are also other implementations, such as NLucene for .NET, which I have used personally.

I'd recommend having a look at SQLite -- full-text search is included in the latest version.

I suppose the answer depends on your DB. For example, SQL Server has full-text search, and also English Language Queries if you ever need them.

Take a look at using PostgreSQL and tsearch.

Try using PostgreSQL with tsearch.
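For reference, a minimal sketch of what tsearch-style full-text search looks like, shown from Python with the psycopg2 driver; the connection settings, table, and column names are made up for illustration, and the same SQL works from any client:

# Sketch only: connection settings, table and column names are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=myapp user=myuser host=localhost")
cur = conn.cursor()

# One-time setup: a table with a tsvector column and a GIN index on it.
cur.execute("""
    CREATE TABLE documents (
        id      serial PRIMARY KEY,
        blob_id integer,   -- metadata pointing back at the original record
        body    text,
        tsv     tsvector
    )
""")
cur.execute("CREATE INDEX documents_tsv_idx ON documents USING gin(tsv)")

# Index a piece of extracted text.
cur.execute(
    "INSERT INTO documents (blob_id, body, tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))",
    (42, "the quick brown fox", "the quick brown fox"),
)
conn.commit()

# Query with boolean operators: & (AND), | (OR), ! (NOT).
cur.execute(
    "SELECT blob_id, body FROM documents "
    "WHERE tsv @@ to_tsquery('english', 'quick & fox')"
)
for blob_id, body in cur.fetchall():
    print(blob_id, body)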

Sphinx is probably the most efficient and scalable option, while SQLite's FTS3 is the most straightforward one.
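To give an idea of how little is involved, here is a minimal FTS3 sketch, shown with Python's built-in sqlite3 module; the same SQL goes through the SQLite DLL from Delphi or any other binding, assuming the library was compiled with FTS3. The table and column names are made up, and note that tokenization of non-ASCII text depends on the tokenizer your build offers (the default tokenizer only does ASCII case-folding; an ICU-enabled build handles Unicode better).

# Minimal SQLite FTS3 sketch; table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect("search_index.db")

# An FTS3 virtual table: "blob_id" carries the metadata pointing back at the
# original binary record, "content" holds the extracted text.
conn.execute("CREATE VIRTUAL TABLE docs USING fts3(blob_id, content)")

conn.execute("INSERT INTO docs (blob_id, content) VALUES (?, ?)",
             ("record-0001", "Quarterly report for the Vienna office"))
conn.commit()

# MATCH joins terms with an implicit AND; OR and prefix (*) queries are
# also supported by the FTS3 query syntax.
for blob_id, content in conn.execute(
        "SELECT blob_id, content FROM docs WHERE docs MATCH 'report vien*'"):
    print(blob_id, content)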

While not in-process, Solr is very fast (it's based on Lucene) and easily accessible from any platform over HTTP.
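As a rough illustration of how simple the HTTP access is, here is a sketch of a query against a default single-core Solr install on port 8983, using only Python's standard library; the URL path and the field names are assumptions that vary with your Solr version and schema:

# Sketch only: URL layout and field names depend on your Solr setup.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": 'title:"full text" AND body:search',  # Lucene query syntax
    "rows": 10,
    "wt": "json",                              # ask for a JSON response
})
url = "http://localhost:8983/solr/select?" + params

with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read().decode("utf-8"))

for doc in result["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))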

Related

Query files from a SQL database and display the results in Windows Search

Warning: novice programmer here. I have a SQL database with file paths stored in it, organized by different criteria. I want to display the files associated with the desired criteria as a Windows Search result. Any guidelines on how to achieve this? I'm programming in C#.
I believe there is something useful here: http://msdn.microsoft.com/en-us/library/windows/desktop/ff628790(v=vs.85).aspx It's just that I can't quite understand it. Thanks in advance for your time.
You are really looking for extensions to Windows Search:
http://technet.microsoft.com/en-us/library/cc725753%28v=ws.10%29.aspx
Or you are looking to federate Windows Search with another, external service - in this case a SQL database:
http://msdn.microsoft.com/en-gb/library/windows/desktop/dd940456%28v=vs.85%29.aspx
I'm afraid that neither of these is a simple task; both require significant Windows system and programming skills.
The most common way of doing this integration is to do it the other way round: call out to the Windows Search index from MS SQL Server. Of course, you would then need your own search front-end as well.
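For the "other way round" approach, the query itself is plain SQL against the SYSTEMINDEX catalog via the Search.CollatorDSO OLE DB provider. A rough sketch follows, shown in Python over COM (pywin32) purely to illustrate the shape of the query; from C# the same connection string and SQL go through System.Data.OleDb, and the search term here is just an example:

# Sketch: query the local Windows Search index via its OLE DB provider.
import win32com.client

CONN_STR = "Provider=Search.CollatorDSO;Extended Properties='Application=Windows';"

conn = win32com.client.Dispatch("ADODB.Connection")
conn.Open(CONN_STR)

# SYSTEMINDEX is the machine's Windows Search catalog; CONTAINS does the
# full-text matching.
sql = ("SELECT System.ItemName, System.ItemPathDisplay "
       "FROM SYSTEMINDEX WHERE CONTAINS('\"budget\"')")

rs = win32com.client.Dispatch("ADODB.Recordset")
rs.Open(sql, conn)

while not rs.EOF:
    print(rs.Fields.Item("System.ItemPathDisplay").Value)
    rs.MoveNext()

rs.Close()
conn.Close()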

Indexing and searching text files in Ruby and Sinatra

I'm building a wiki with Ruby and Sinatra, and I need to search the wiki pages, which are stored as text files with markup for a few HTML renderers (redcarpet, markdown, creole, slim, haml, ...).
Ruby offers plenty of options, such as the ferret, solr or lucene gems, but they are geared towards structured data in a database, not towards free-text search over files.
For now I open the text files and search them with a regexp, but as the wiki grows that will soon become too slow. Are there any gems that index all the text files in a directory and give me an index I can then use to search the files? It needs to be a Ruby-only solution, or something that can easily be used from Ruby.
I'm not using one of the common wiki engines since none of them has the features I need.
I do use the Windows Indexing Service in a few old ASP apps, but I'm far from satisfied with that solution.
My OSes are Windows Vista, 7 and Windows Server 8.
EDIT: a solution that needs no database installation and no always-running server is preferable, e.g. SQLite, file-based storage, or something like that.
Personally I would choose ElasticSearch: http://www.elasticsearch.org/
It's very easy to get running, and there are gems which make it really easy to talk to from Ruby (for instance Tire).
I'm not aware of any performant text-file-based full-text search engines, so I really think you would be best off looking for a simple server, which ElasticSearch provides, imho.
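Because ElasticSearch is just HTTP and JSON, the round trip is tiny. Here is a hedged sketch using only the Python standard library to show the request shapes that Tire wraps for you from Ruby; the "wiki" index, "page" type, and field names are invented, and the URL layout is the older type-based API of that era:

# Sketch only: index name, type and fields are hypothetical.
import json
import urllib.request

BASE = "http://localhost:9200"

def put_json(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Index one wiki page (ElasticSearch creates the index on first use).
put_json(BASE + "/wiki/page/home", {
    "title": "Home",
    "content": "Welcome to the wiki. Full text search should find this page.",
})

# Simple query-string search across the indexed pages.
# Note: ElasticSearch refreshes its index about once a second, so a freshly
# indexed document may take a moment to become searchable.
with urllib.request.urlopen(BASE + "/wiki/page/_search?q=content:welcome") as resp:
    hits = json.loads(resp.read().decode("utf-8"))["hits"]["hits"]

for hit in hits:
    print(hit["_id"], hit["_source"]["title"])
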
Take a look at ThinkingSphinx, a Ruby bridge between Sphinx and ActiveRecord. Using this gem will allow you to index your models in an easy way and to perform simple searches and full-text searches on all the models of your application.
Home page and documentation of the project:
http://freelancing-god.github.com/ts/en/
Nice little intro of how to use it:
http://www.synbioz.com/blog/2012/05/18/full_text_search_with_sphinx
I believe the integration with Sinatra would be seamless if you work with ActiveRecord.
I have been using Solr with both SQL and Mongo databases in Rails 2.3 - 3.2 and it has been doing great for me so far. Take a look at this railscast. Solr is a full-text search server developed in Java by Apache; it can index Microsoft documents, text files, and rich text documents, and even do OCR on images.

Should I use Lucene.Net for full text search with SQL Compact Edition 4, or is there a better option?

I'm trying to create a full text search facility for a small blog which is running against a SQL Compact Edition 4 database.
There seems to be almost no information out there about this (though I'd be happy if someone can prove me wrong), but as far as I can gather, SQL CE doesn't support the normal SQL Server full-text indexing.
I have briefly looked into using Lucene.Net, but it seems quite complex at first glance; would this be my best option here, or is there a simpler solution which I'm missing?
Lucene.Net would be a good choice even if you had the option of full text search.
Lucene.Net goes beyond what SQL full-text search (FTS) offers, including boosting terms, fuzzy queries, simple faceted search (which can be found in a contrib project on the 2.9.4g branch), and so on.
It's open source, so you don't have to wait on someone else's cycle to modify it, extend it, or add features.
There are a couple of posts and even FOSS contrib projects to help circumvent the higher barrier to entry. I'd recommend the content in the list below over starting with Lucene in Action.
The book is a great resource, but the latest edition is aimed at Lucene 3.0, the Java version, which includes newer APIs that have not yet made it into the .NET version.
Simple Lucene - http://blogs.planetcloud.co.uk/mygreatdiscovery/post/SimpleLucene-e28093-Lucenenet-made-easy.aspx
Lucene.Net Tutorial (covers version 2.9.2) - http://www.d80.co.uk/post/2011/03/29/LuceneNet-Tutorial.aspx
Lucene.Net will also pay off as a decent library to add to your overall programming repertoire. Search is pretty much a part of most applications these days.
The Lucene.Net project has gone back into incubation with a new set of committers and goals. One of those goals is to make it more idiomatic for .NET and easier to use. However, it's definitely going to take time and cycles to reach that point.
In the meantime, you can always hit up the mailing lists or the IRC channel #lucene.net for help.
Lucene is the way to go - a colleague of mine recommends the free "Lucene in Action" PDF book, and after the first 3 chapters you are up and running.
If it's a small blog, you may want to use IndexTank because it's free. There's a WordPress plugin that gives you instant search like this:
http://bothsidesofthetable.com

Are Pentaho ETL and Data Analyzer a good choice?

I was looking for an ETL tool and found a lot on Google about Pentaho Kettle.
I also need a data analyzer to run on a star schema so that business users can play around and generate any kind of report or matrix. Again, Pentaho Analyzer is looking good.
The other part of the application will be developed in Java, and the application should be database agnostic.
Is Pentaho good enough, or are there other tools I should check?
Pentaho seems to be pretty solid, offering the whole suite of BI tools, with improved integration reportedly on the way. But... the chances are that companies wanting to go the open-source route for their BI solution are also most likely to end up using open-source database technology... and in that sense "database agnostic" can easily be a double-edged sword. For instance, you can develop a cube in Microsoft's Analysis Services in the comfortable knowledge that whatever MDX/XMLA your cube sends to the database will be interpreted consistently, holding very little in the way of nasty surprises.
Compare that to the Pentaho stack, which will typically end up interacting with PostgreSQL or MySQL. I can't vouch for how PostgreSQL performs in the OLAP realm, but I do know from experience that MySQL - for all its undoubted strengths - has "issues" with the types of SQL that typically crop up all over the place in an OLAP solution (you can't get far in a cube without using GROUP BY or COUNT DISTINCT). So part of what you save in licence costs will almost certainly be spent solving issues arising from the fact that Pentaho doesn't always know which database it is talking to - robbing Peter to (at least partially) pay Paul, so to speak.
Unfortunately, more info is needed. For example:
will you need to exchange data with well-known apps (Oracle Financials, Remedy, etc)? If so, you can save a ton of time & money with an ETL solution that has support for that interface already built-in.
what database products (and versions) and file types do you need to talk to?
do you need to support querying of web-services?
do you need near real-time trickling of data?
do you need rule-level auditing & counts to account for every single row?
do you need delta processing?
what kinds of machines do you need this to run on? linux? windows? mainframe?
what kind of version control, testing and build processes will this tool have to comply with?
what kind of performance & scalability do you need?
do you mind if the database ends up driving the transformations?
do you need this to run in userspace?
do you need to run parts of it on various networks disconnected from the rest? (not uncommon for extract processes)
how many interfaces and of what complexity do you need to support?
You can spend a lot of time deploying and learning an ETL tool - only to discover that it really doesn't meet your needs very well. You're best off taking a couple of hours to figure that out first.
I've used Talend before with some success. You create your transformation by chaining operations together in a graphical designer. There were definitely some WTFs, and it was difficult to deal with multi-line records, but it worked well otherwise.
Talend also generates Java and you can access the ETL processes remotely. The tool is also free, although they provide enterprise training and support.
There are lots of choices. Look at BIRT, Talend and Pentaho, if you want free tools. If you want much more robustness, look at Tableau and BIRT Analytics.

Which full-text search package should I use for SQLite3?

SQLite3 appears to come with three different full-text search engines, called FTS1, FTS2, and FTS3. The documentation available on the website mentions that FTS1 is stable, that FTS2 is in development, and that you should use FTS2. Examples I find online use FTS3, which is only in CVS and is not documented, unlike FTS2. None of the full-text search engines come with the amalgamated source, as near as I can tell.
So, my question: which of these three engines, if any, should I use for full-text indexing in SQLite? Or should I simply use a third-party tool like Sphinx, or a custom solution in Lucene, instead?
As of 3.6.21, FTS3 is well documented and has gained a more officially visible status.
FTS3 is part of the standard SQLite DLL build on Windows; I'm not sure about the amalgamated source.
We've been using it in production for about a year with no particular issues.
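Whether FTS3 is actually available depends on how the SQLite library you link against was compiled, so it can be worth probing at runtime. A small sketch, shown with Python's sqlite3 module; the same trick of attempting to create a throwaway FTS3 table works from any binding:

# Probe whether the linked SQLite library was built with FTS3.
import sqlite3

def has_fts3():
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE VIRTUAL TABLE fts_probe USING fts3(content)")
        return True
    except sqlite3.OperationalError:
        # "no such module: fts3" means the build lacks FTS3
        return False
    finally:
        conn.close()

print("SQLite version:", sqlite3.sqlite_version)
print("FTS3 available:", has_fts3())
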
I've looked into full-text solutions recently too. It seems like SQLite has no de facto choice right now. No matter what you choose, it's inevitable that you'll have to re-architect as the various FTS2, FTS3, etc. solutions mature. So bite the bullet and assume you'll need to do more development in the future to keep pace with changing full-text technology.
Sphinx Search has no direct support for SQLite yet. It supports only MySQL and PostgreSQL right now (ca. August 2009). So you'd have to hack your own SQLite connector or else migrate SQLite data to MySQL or PostgreSQL and then index the data with Sphinx Search. I think someone is working on a Sphinx Search patch to support Firebird, so maybe it's not so hard if you're willing to roll up your sleeves.
Also be aware that Sphinx Search has some limitations about incrementally adding data to the index. You should spend an hour or so reading the doc before you decide to use it.
I don't know of any direct way to index SQLite data in Lucene either. You'd probably have to write your own code to process batches of SQLite data, adding rows to the Lucene index one at a time. That seems to be how Lucene is used no matter what the database is.
Update: Solr is a great companion technology for Lucene. Solr gives that search engine many features, including the ability to bulk-load query result data from any JDBC data source.
