I'm planning to implement free text search using Lucene.NET, and I'm new to Lucene. Our project uses ASP.NET MVC 3.0 and Entity Framework 4.1.
Is it a good decision to use Lucene over free text search in MS SQL Server?
What are the implications that I need to take care of?
Is it possible to use MS SQL Server instead of the file system to store the Lucene index?
Is it a good decision to use Lucene over free text search in MS SQL Server?
It depends on the amount of data and the query flexibility you want. If you have a large amount of data and you want very flexible queries, then yes, it is.
What are the implications that I need to take care of?
You will need to manually keep your Lucene indexes up to date with the database, and you will need to handle the free text search yourself.
Is it possible to use MS SQL Server instead of the file system to store the Lucene index?
See http://wiki.apache.org/lucene-java/LuceneFAQ#Can_I_store_the_Lucene_index_in_a_relational_database.3F
I'd recommend taking a look at the Lucene Java FAQ. Pretty much everything there applies to Lucene.NET as well, and it addresses many other questions you may have.
I am a member of an analytics team that recently moved its data warehouse into Elasticsearch. The DW is accessed through Dremio.
However, I am having second thoughts about whether Elasticsearch is the appropriate DB for a team that performs a lot of day-to-day analytics. I would prefer that we keep our DW in one of BigQuery/Snowflake/Redshift and use the "dbt" tool for transforming data and writing it back into the DB.
I can't find a "dbt"-like tool to perform quick data transformations after reading from Elasticsearch, and Dremio is not a mature enough tool for that. I would like to solicit your thoughts on Elasticsearch and whether it is an appropriate DB for day-to-day analytics.
I appreciate your responses.
Edit:
I work at an online retailer. Our data is not "big data" in any sense. It is on the order of a few thousand orders per day. Most of our work is responding to inquiries from various teams/departments. Some of these questions go beyond a simple query: we have to build customized data marts that involve multiple intermediate steps. As a result, we need a tool that allows us to transform data quickly and put the result set into a database. One such tool is "dbt", but it doesn't support Elasticsearch. So the question is whether there is an appropriate tool for this job, or whether Elasticsearch is simply not appropriate for our use case.
Taking into account
Our data is not "big data" in any sense.
most likely Elasticsearch is not an appropriate choice. The only reason to use ES is a heavy load of search-like queries with 'contains' filtering over text fields, and only if the dataset is too large for a SQL-compatible DB to handle those queries fast enough.
It looks like PostgreSQL can do the job. If you're looking for a columnar DB for lightning-fast OLAP queries (aggregations), you can check out the open-source ClickHouse.
Finally, Dremio is not the only BI tool that can work with Elasticsearch (or PostgreSQL, ClickHouse, etc.). Some BI tools allow you to use 'painless' scripts for dimensions/measures, so you can calculate a lot of things directly in ES queries.
It depends on which specific metrics you need; ES aggregations can support a lot of basic metrics. For lower cost, less infrastructure to support, and less complexity, I generally advise companies to start with that before over-engineering or prematurely optimizing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
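As a rough illustration of the kind of basic metric ES aggregations can cover, the sketch below builds a search request body for daily order counts and revenue. The field names `order_date` and `total_amount` are hypothetical and not taken from the question, and `calendar_interval` assumes a reasonably recent Elasticsearch version. The body is only constructed here; it would be sent to the index's `_search` endpoint.

```python
import json

# Hypothetical field names; adjust them to the real index mapping.
DATE_FIELD = "order_date"
AMOUNT_FIELD = "total_amount"


def daily_revenue_query(start: str, end: str) -> dict:
    """Build a search body that buckets orders by day and sums revenue."""
    return {
        "size": 0,  # return aggregation results only, no individual hits
        "query": {"range": {DATE_FIELD: {"gte": start, "lte": end}}},
        "aggs": {
            "orders_per_day": {
                "date_histogram": {
                    "field": DATE_FIELD,
                    "calendar_interval": "day",
                },
                # Sub-aggregation: total revenue inside each daily bucket.
                "aggs": {"revenue": {"sum": {"field": AMOUNT_FIELD}}},
            }
        },
    }


body = daily_revenue_query("2020-01-01", "2020-01-31")
print(json.dumps(body, indent=2))
```

Each daily bucket in the response carries a document count (the number of orders) alongside the `revenue` sum, which covers many routine reporting questions without a separate transformation step.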
I'm creating a microservice to handle the contacts that are created in the software. I'll need to create contacts and also check whether a contact exists based on some information (name, last name, email, phone number). The idea is the following:
A customer calls; if the contact doesn't exist, we create it by asking for all their personal information. The second time they call, we search for matches by name, last name, and email to detect that the contact already exists in our DB.
My idea is to use MongoDB as primary storage and Elasticsearch to perform the queries, but I don't know whether there is really a big difference between this and querying a common relational database.
EDIT: Imagine a call center that is getting calls all the time from mostly different people, and we want to search fast (by name, email, last name) to see whether that person is in our DB. Wouldn't Elasticsearch be good for this?
A relational database can store data and also index it.
A search engine can index data but also store it.
Relational databases are better at read-what-was-just-written performance. Search engines are better at really quick search with additional tricks such as all kinds of normalization: lowercasing, ä->a or ae, prefix matches, n-gram matches (if indexed accordingly). Whether it's 1 million or 10 million entries in the store is not a big deal nowadays, but what is your query load? There are only so many service center workers, so your query load is likely far less than 1 QPS. That is no problem for a relational DB at all. The search engine would start to make sense if you want some normalization, as described above, or if you start indexing free text comments or descriptions of customers.
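A minimal sketch of the normalization tricks mentioned above, using only the Python standard library. The function names are made up for illustration; a real search engine applies equivalent steps through its analyzers at index and query time, and the ä->ae variant would need a language-specific mapping that is not shown here.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase the text and strip diacritics (ä -> a)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks that NFKD split off from base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token: str, min_len: int = 2) -> list[str]:
    """All prefixes of a token, the basis for prefix (autocomplete) matching."""
    return [token[:n] for n in range(min_len, len(token) + 1)]


def ngrams(token: str, n: int = 3) -> list[str]:
    """Sliding-window character n-grams, useful for partial matches."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(normalize("Jäger"))    # -> jager
print(edge_ngrams("jager"))  # -> ['ja', 'jag', 'jage', 'jager']
print(ngrams("jager"))       # -> ['jag', 'age', 'ger']
```

Indexing the normalized form and its n-grams is what lets a query such as "jag" or "JAGER" find "Jäger"; a plain equality lookup in a relational column would miss both.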
If you don't have a problem with performance, then keep it simple and use a single datastore (maybe with some caching in your application).
Elasticsearch is not meant to be a primary datastore, so my advice is to use a simple relational database like Postgres with simple SQL queries or an ORM. If the dataset is not really large, it should be fast enough.
If you run into performance issues on searches, you can use a combination of a relational DB and Elasticsearch. You can use Elasticsearch feeders to update ES with the data in your relational DB.
Indexed RDBMS works well for search
If your data is structured, i.e. columns are clearly defined, searching 1 million records will not be a problem in an RDBMS either.
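To make the point concrete, here is a small sketch of an indexed contact lookup. It uses SQLite from the Python standard library as a stand-in for whatever RDBMS is chosen; the table, column, and index names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, "
    "email TEXT, phone TEXT)"
)
# Indexes on the columns used for lookups keep exact-match searches fast.
conn.execute("CREATE INDEX idx_contacts_email ON contacts (email)")
conn.execute("CREATE INDEX idx_contacts_name ON contacts (last_name, first_name)")
conn.executemany(
    "INSERT INTO contacts (first_name, last_name, email, phone) VALUES (?, ?, ?, ?)",
    [
        ("Ada", "Lovelace", "ada@example.com", "555-0100"),
        ("Grace", "Hopper", "grace@example.com", "555-0101"),
    ],
)


def find_contacts(conn, first_name, last_name, email):
    """Return contacts matching the email or the full name."""
    return conn.execute(
        "SELECT id, first_name, last_name, email FROM contacts "
        "WHERE email = ? OR (last_name = ? AND first_name = ?) "
        "ORDER BY id",
        (email, last_name, first_name),
    ).fetchall()


print(find_contacts(conn, "Ada", "Lovelace", "unknown@example.com"))
```

This covers exact matches only. Tolerating typos or partial names is where the normalization features of a search engine start to matter.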
When to use Elastic
Text Search: Searching words across multiple properties (e.g. description, name etc.)
JSON store and search: If the data being stored is in JSON format and later needs to be searched
Auto Suggestions: Elastic is better at providing autocomplete suggestions
Elastic as an application data provider
Elastic should not be seen as a data store, even if you are storing data in it. It is about how you perceive Elastic: it should be used to store and set up data for the application, and it is the application that decides how and when to use Elastic (search and suggestions). Elastic is not a NoSQL storage alternative to an RDBMS; if that is what you need, use a NoSQL database instead.
This perception puts Elastic in line with Redis and Kafka. These tools are key components of an application design, and they serve applications as event stores, search engines, caches, etc.
Database with Elastic
Your design should use both. Use the database for storing the contacts, and index them there for querying. Also make the data available in Elastic for searching, autocomplete, and related matches.
As always, it depends on your specific use case. You briefly described it, but how are you actually going to use the data?
If it's just something simple like checking whether a customer exists and then creating a new customer, then use the RDBMS option. Moreover, if you don't expect a large dataset, so that scaling isn't an issue (hence the designation that Elasticsearch is for big data), but you have transactions and data integrity is important, then an RDBMS will be the right fit. Some examples could be tax, leasing, or financial reporting systems.
However, if you have a large dataset and need a wide range of query capabilities, such as fuzzy search or searches where the user can select multiple filters on the data, or if you want to do some predictive analysis on the data, then Elasticsearch is the clear choice.
For example, I worked on a web-based "find a doctor" application with a large customer base: 11 million customers, with 200+ hits per second at peak time. The customer could tick checkboxes to filter by specialty, spoken languages, ratings, hospitals, etc., with results sorted by distance from the user's location and a response time of 2 seconds or less. It would be very difficult for an RDBMS to match that.
I want to search for multiple strings in a very large database. These strings are part of different attributes of a database table. I have tried string search using LIKE in a SQL query, but it takes a lot of time to get results. I am using an Oracle database.
Should I use indexing of database? I found that Lucene can be used for it.
I also got some suggestions of using big data concepts. Which approach should I use?
The easiest way is:
1.) adding an index to the columns you would like to search through
2.) using Oracle Text as #lalitKumarB wrote
The most powerful way is:
3.) use a separate search engine (Solr, Elasticsearch).
But you will probably have to change your application so that it explicitly uses the search index for searching through the data.
I was in the same situation some years ago, trying to search text in a big database. After a while I found out that database-based search will never reach the performance of a dedicated search engine. And you get many more search features out of the box if you use Solr (for example), such as spelling correction, "More like this", ...
One option is to hold the data in Oracle, search in Solr, and return only the ID of the document, so that you load just the one row from Oracle that is referenced by that ID.
The second option is to keep Oracle as the base data pool for your search engine and search in Solr (or Elasticsearch), returning the whole document/row from Solr rather than only the ID. Then you don't need to load the data from the database any more.
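The first option can be sketched as follows. This is only an illustration under loud assumptions: a toy in-memory inverted index stands in for Solr, SQLite stands in for Oracle, and the table and function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# SQLite stands in for the Oracle database that holds the full rows.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
db.executemany(
    "INSERT INTO documents (id, title, body) VALUES (?, ?, ?)",
    [
        (1, "Invoice", "Payment overdue for March"),
        (2, "Contract", "Lease contract renewal terms"),
        (3, "Reminder", "Second payment reminder"),
    ],
)

# Toy inverted index (term -> set of document IDs) standing in for Solr.
index = defaultdict(set)
for doc_id, title, body in db.execute("SELECT id, title, body FROM documents"):
    for term in f"{title} {body}".lower().split():
        index[term].add(doc_id)


def search_ids(query):
    """Return the IDs of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    return sorted(set.intersection(*(index.get(t, set()) for t in terms)))


def load_rows(ids):
    """Fetch the full rows for the matching IDs from the database."""
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    return db.execute(
        f"SELECT id, title, body FROM documents WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()


print(load_rows(search_ids("payment reminder")))
```

With the second option, `search_ids` would return the stored documents themselves and `load_rows` would disappear, at the cost of keeping a full copy of each row in the search engine.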
The best option depends on your needs.
You have the choice between Elasticsearch, Solr, or Lucene.
I am learning NoSQL and looking at different options for one of my client's requirements. I went through various resources before posting this question (as someone with little knowledge of NoSQL).
I need to store data at a fast rate and read it back.
Fully fail-safe and easily scalable.
Able to search through data for Analytics.
I ended up with a short list of: Cassandra and Elasticsearch
What I do understand is that Cassandra is a perfect NoSQL storage solution for me, as I can write data and read it using indexes. Where it fails, or could fail, is analytics. In the future, if I want to get data from from_date to to_date, or query the data in other ways for analytics, I may be stuck if I haven't designed the data model properly with the long term in view, which is quite hard in an ever-changing world.
Elasticsearch, meanwhile, is best at indexing (backed by Lucene) and can search the data with arbitrary text. But does it work the same way if I want to retrieve data from from_date to to_date? (I expect it might.) The real question is: is it a search engine, or a perfect NoSQL data store like Cassandra? If it is the latter, why do we still need Cassandra?
If the two belong to different worlds, please explain how. And how do we combine them to get a more effective solution?
One of our applications uses data that is stored in both Cassandra and ElasticSearch. We use Cassandra to access those records whenever we can, and have data duplicated into query tables designed to serve specific application-side requests. For a more liberal search than our query tables allow, ElasticSearch does the job nicely.
We have asked that same question (of ourselves)..."Why don't we just get everything from ElasticSearch?"
The answer is that ElasticSearch was designed to be a search engine, and not a persistent data store. Sometimes ElasticSearch loses writes. Schema changes are difficult to do in ElasticSearch without blowing everything away and reloading. For that reason, I have written jobs that are designed to keep ElasticSearch in sync with our Cassandra cluster. There was also a fairly recent discussion on Quora about this topic that yielded similar points.
That being said, ElasticSearch works great as a search engine. And Cassandra works great as a scalable, high-performance datastore. But querying data is different from searching for data. There are times that we need one or the other, and a combination of the two works well for our application. It may (or it may not) work well for yours.
As for analytics, I have had some success in using the Cassandra Spark connector, to serve more complex OLAP queries.
Edit 20200421
I've written a newer answer to a similar question:
ElasticSearch vs. ElasticSearch+Cassandra
Cassandra + Lucene is a great option. There are different initiatives for this issue, for example:
Stratio’s Cassandra Lucene Index - Derived from Stratio Cassandra, is a plugin for Apache Cassandra that extends its index functionality. (https://github.com/Stratio/cassandra-lucene-index)
Stratio Cassandra, it's a native integration with Apache Lucene, it is very interesting. (https://github.com/Stratio/stratio-cassandra) - THIS PROJECT HAS BEEN DISCONTINUED IN FAVOUR OF Stratio’s Cassandra Lucene Index
Tuplejump Calliope, it's like Stratio Cassandra, but it's less active. (https://github.com/tuplejump/stargate-core)
DSE Search by Datastax. It allows using Cassandra with Apache Solr, but it's a proprietary option. (http://www.datastax.com/what-we-offer/products-services/datastax-enterprise)
After working on this problem myself, I have realized that NoSQL databases like Cassandra are good when you want to preserve your data schema with reliable write operations and don't need the indexing operations that Elasticsearch offers. If you want to keep indexed data, Elasticsearch is a good fit, provided you trust your schema and are going to do far more reads than writes.
My case was data analytics. So I preserved a lot of my Latices in Elasticsearch, since later I wanted to traverse the data a lot to see what my next step should be. I would have used Cassandra if I had wanted to make a lot of changes to the schema of the data in my analytics pipelines.
There are also many nice visualization tools like Kibana that you can use to present your data with good graphics. Maybe I am lazy, but they look very good and they helped me.
Storing data in a combination of Cassandra and ElasticSearch gives you the most functionality. It allows you to look up key-value tables, and also allows you to search data in indexes.
The combination gives you a lot of flexibility, ideal for your application.
Elassandra is the combined solution of Cassandra + Elasticsearch. It uses Elasticsearch to index the data and Cassandra as the data store. I'm not sure about the performance, but as per this article, its performance is good.
If your application needs a search feature, then Elassandra is the best open-source option. DSE Search is available, but it's expensive.
We developed an application in which we used Elasticsearch and Cassandra.
Similar data was stored in Cassandra and indexed into Elasticsearch.
Our application's UI had features such as searches, aggregations, data export, etc.
The back-end microservices were continuously receiving huge volumes of data (on Kafka topics) and storing it in Cassandra. Once the data was stored in Cassandra, the services would make sure it was indexed into Elasticsearch.
Cassandra was acting as the "source of truth" for Elasticsearch. In cases where reindexing of the ES index was required, we queried Cassandra and reindexed the data into ES.
This solution helped us, as it was very easy to scale and the searches and aggregations were much faster.
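The "source of truth" pattern can be sketched roughly like this. Everything here is a stand-in for illustration: a plain dict plays the role of Cassandra, the hypothetical `SearchIndex` class plays the role of Elasticsearch, and no real client library is involved.

```python
class SearchIndex:
    """Stand-in for the search engine: holds a derived copy of each document."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, doc):
        self.docs[doc_id] = dict(doc)

    def clear(self):
        self.docs.clear()


def save(primary, index, doc_id, doc):
    """Write to the primary store first, then index the same document."""
    primary[doc_id] = dict(doc)
    index.index(doc_id, doc)


def reindex(primary, index):
    """Rebuild the index from scratch using only the primary store."""
    index.clear()
    for doc_id, doc in primary.items():
        index.index(doc_id, doc)
    return len(index.docs)


primary_store = {}  # stand-in for the source-of-truth database
search_index = SearchIndex()
save(primary_store, search_index, "order-1", {"status": "shipped"})
save(primary_store, search_index, "order-2", {"status": "pending"})

search_index.clear()  # simulate a lost or outdated index
restored = reindex(primary_store, search_index)
print(restored, search_index.docs == primary_store)
```

Because the index is always derivable from the primary store, losing it or changing its mapping costs a rebuild but never costs data.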
Cassandra is great at retrieving data by ID. I don't know much about secondary index performance, but I doubt it's as fast as Elasticsearch. Certainly Elasticsearch wins when it comes to full text search functionality (text analysis, relevancy scoring, etc).
Cassandra wins on update performance, too. Elasticsearch supports updates, but an update is really a reindex + soft delete in an atomic operation.
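The "reindex + soft delete" behaviour can be pictured with a toy append-only index. This is a simplified model written for illustration, not Lucene's or Elasticsearch's actual data structures; the class and method names are hypothetical.

```python
class AppendOnlyIndex:
    """Toy model: entries are never modified in place, only appended."""

    def __init__(self):
        self.segment = []     # list of (doc_id, doc) entries
        self.deleted = set()  # positions of soft-deleted entries
        self.live = {}        # doc_id -> position of the current version

    def index(self, doc_id, doc):
        """Insert or update: mark any old version deleted, append the new one."""
        old = self.live.get(doc_id)
        if old is not None:
            self.deleted.add(old)  # soft delete, the old entry stays on disk
        self.segment.append((doc_id, doc))
        self.live[doc_id] = len(self.segment) - 1

    def get(self, doc_id):
        pos = self.live.get(doc_id)
        return None if pos is None else self.segment[pos][1]

    def merge(self):
        """Physically drop soft-deleted entries, similar to a segment merge."""
        kept = [e for pos, e in enumerate(self.segment) if pos not in self.deleted]
        self.segment = kept
        self.deleted = set()
        self.live = {doc_id: pos for pos, (doc_id, _) in enumerate(kept)}


idx = AppendOnlyIndex()
idx.index("c1", {"name": "Ada"})
idx.index("c1", {"name": "Ada L."})  # update = new entry + soft delete of the old one
print(len(idx.segment), idx.get("c1"))
```

Until a merge runs, every update leaves a dead entry behind, which is one way to see why update-heavy workloads cost more in a search index than in a store designed for in-place writes.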
Cassandra has a very nice replication model (if you need to be extra fail-safe). Elasticsearch is OK, too; I'm not in the camp that says ES is particularly unreliable (it has issues sometimes, like all software).
Elasticsearch also has aggregations for real-time analytics. And because searches are so fast, analytics on a subset of data will be fast, too.
If your requirements are satisfied well enough by one of them (like here it seems like ES would work well), I would just use one. If you have requirements from both worlds, then you can either:
use one of them and work around the downsides. For example, you may be able to handle many updates with Elasticsearch, but with more shards and more hardware
use both and make sure they're in sync
Because Elasticsearch is built on the Lucene index, keeping your index in Elasticsearch performs better for retrieving data than indexing in Cassandra itself.
If your requirements do not involve real-time retrieval, you can also use Elasticsearch as a NoSQL database. Some argue that Elasticsearch loses writes and that schema changes are difficult, but if your data volume is not too big, you can easily use Elasticsearch both as a search engine with excellent indexing and as a NoSQL database. There are several ways to prevent those problems. I have worked on schema changes in Elasticsearch; if your data structure is consistent, they will not create any issues.
I am a supporter of both Elasticsearch and Solr. I have worked with both search engines, and in my experience both work smoothly if you configure them correctly.
The only drawback I can think of applies if you are targeting real-time results and can't tolerate even a delay of milliseconds in your responses. In that case it's better to use another NoSQL database such as Cassandra or Couchbase.
Cassandra with Solr works better than Cassandra with Elasticsearch.
I have been working with Elasticsearch for the past 2 months. I have used both the REST approach and the client APIs in different languages to index, get, and search data. I have also read a lot about Elasticsearch and found that it is not considered a good option as a data store. Why is this? I'm also curious about how Elasticsearch internally stores the indexed data. Any good link or explanation?
Elasticsearch is built on top of Apache Lucene. Here's a reference doc on the Lucene index file structure:
http://lucene.apache.org/core/4_7_2/core/org/apache/lucene/codecs/lucene46/package-summary.html#package_description
Regarding whether or not it's a good option as a data store, I think that's more a matter of individual opinion and specific use cases than a fact that can be proved. It does not have the transaction support that something like MySQL does, if that's what you are looking for. In that respect it's somewhat on a par with other NoSQL solutions. This is a pretty decent writeup on the trade-offs and issues: https://www.found.no/foundation/elasticsearch-as-nosql/
In the end it depends on what you are doing with your data and what level of robustness you require.