I have some confusion about Elasticsearch's index.
In some places I read that it's the equivalent of an RDBMS database, and in other places that an index is like what we have at the end of a book: a list of words with the corresponding documents that contain each word.
Can someone clarify?
Thanks
An Elasticsearch cluster can contain multiple Indices (databases). These indices hold multiple Documents (rows), and each document has Properties or Fields (columns).
You can check the list of your available indices with http://localhost:9200/_cat/indices?v .
But in general (in computer science and databases), indexing means what you said:
list of words with corresponding documents that contain the word
This structure improves the speed of data retrieval operations on a database table, and the concept is used in many databases such as MySQL or Oracle. In Elasticsearch, by default, every document is indexed (you can change this setting so that some columns/fields are not indexed).
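For example, here is a minimal sketch using Python's requests library against a local cluster (assuming Elasticsearch 7+ with the typeless _doc endpoint; the "library" index and its fields are made up for illustration):

```python
import requests

ES = "http://localhost:9200"

# List the indices ("databases") in the cluster.
print(requests.get(f"{ES}/_cat/indices?v").text)

# Index a document (a "row") into an index called "library"; the index is
# created on the fly and every field ("column") is indexed automatically.
doc = {"title": "Elasticsearch basics", "pages": 42}
print(requests.post(f"{ES}/library/_doc", json=doc).json())
```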
My task is a full-text search system for a really large number of documents. Right now I have the documents as RTF files plus their metadata, and all of this will be indexed in Elasticsearch. These documents are unchangeable (they can only be deleted) and I don't really expect many new documents per day. So is it a good idea to use Elasticsearch as the primary DB in this case?
Maybe I'll store the RTF files separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here, so it's a good case for Elasticsearch as the primary DB.
Elastic is better known as a distributed full-text search engine, not as a database.
If you preserve the document _source it can be used as a database, since almost any time you decide to apply document changes or mapping changes you need to re-index the documents in the index (the equivalent of a table in the relational world). There is no way to update parts of the Lucene inverted index; you need to re-index the whole document.
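As a rough illustration, here is a minimal Python sketch (requests library against a local Elasticsearch 7+ cluster; the docs_v1 / docs_v2 index names and the mapping are made up) of re-indexing every document from its preserved _source into a new index after a mapping change, via the _reindex API:

```python
import requests

ES = "http://localhost:9200"

# Create a new index carrying the changed mapping; an existing field's
# mapping cannot be changed in place.
new_mapping = {"mappings": {"properties": {"title": {"type": "keyword"}}}}
requests.put(f"{ES}/docs_v2", json=new_mapping)

# Copy every document from the old index into the new one.
# This only works because each document's _source was preserved.
body = {"source": {"index": "docs_v1"}, "dest": {"index": "docs_v2"}}
print(requests.post(f"{ES}/_reindex", json=body).json())
```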
Elastic's index survival mechanism is one of the best: if you lose a node, the index's lost replicas are automatically replicated to some of the other available nodes, so you don't need to do any manual operations.
If you do regular backups and have no requirement for the data to be available 24/7, it is completely acceptable to hold the data and the full-text index in Elasticsearch, as in a database.
But if you need a highly available combination, I would recommend keeping the documents in MongoDB (known as one of the best distributed document stores), for example, and using Elasticsearch only for its original purpose as a full-text search engine.
When I search for, say, car engine (the first time any user has searched for this keyword) in Elasticsearch/Lucene, does the search engine look up the individual words in the index first and then find the intersection? For example: say the engine found 10 documents for car, and then searching for engine it got 5 documents. Now, within those 5 documents (the smaller set), it searches for car and finds 2 documents.
The search engine then ranks based on the above results. Is this, at a high level, how multiple words are searched in the index table?
For future searches against the same keyword, does the search engine make a new entry for the key car engine in the index table?
Yes, it does search for the individual terms and takes the intersection or union of the results, according to your query. It uses something called an "inverted index", which it builds as the documents to be searched are "indexed" into Elasticsearch.
Indexing operations are different from searching, so no, it wouldn't index users' searches unless you tell it to (in your application).
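To make the idea concrete, here is a toy inverted index in plain Python (the documents are made up); it shows how an AND query for "car engine" boils down to intersecting two posting lists:

```python
# Toy inverted index: term -> set of document IDs (a posting list).
docs = {
    1: "the car engine roared",
    2: "a quiet electric car",
    3: "jet engine maintenance",
    4: "car engine repair guide",
}

inverted = {}
for doc_id, text in docs.items():
    for term in text.split():
        inverted.setdefault(term, set()).add(doc_id)

# An AND query looks up each term's posting list and intersects them;
# an OR query would take the union instead.
print(inverted["car"] & inverted["engine"])   # {1, 4}
print(inverted["car"] | inverted["engine"])   # {1, 2, 3, 4}
```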
The basic functioning of Elasticsearch can be split into two parts:
Indexing. You create an index of documents by indexing all the documents that you want to search. These documents could come from your MySQL store, from Logstash, etc., or could even be users' search queries that your application indexes into a relevant Elasticsearch index.
Searching. You search the indexed documents using keywords that could be user generated, application generated, or a mixture, using Elasticsearch queries (the query DSL). If a result is found (according to your query), Elasticsearch returns the relevant records.
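As a rough sketch of the searching side (Python requests library against a local cluster, with a made-up "articles" index and "body" field): a match query for "car engine" with "operator": "and" asks for the intersection of the two terms' results, while the default "or" asks for the union.

```python
import requests

ES = "http://localhost:9200"

# "operator": "and" requires every term to match (intersection);
# the default "or" would return the union of the per-term results.
query = {
    "query": {
        "match": {
            "body": {"query": "car engine", "operator": "and"}
        }
    }
}
resp = requests.post(f"{ES}/articles/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```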
I'd encourage you to read this doc for a better understanding of how Elasticsearch searches documents:
https://www.elastic.co/blog/found-elasticsearch-from-the-bottom-up
In the context of ELK (Elasticsearch, Logstash, Kibana), I learned that Logstash has a FILTER stage that can use grok to divide log messages into different fields. According to my understanding, this only helps to turn unstructured log data into more structured data. But I have no idea how Elasticsearch can make use of the fields (produced by grok) to improve querying performance. Is it possible to build indices on these fields, like in a traditional relational database?
From Elasticsearch: The Definitive Guide
Inverted index
Relational databases add an index, such as a B-tree index, to specific columns in order to improve the speed of data retrieval. Elasticsearch and Lucene use a structure called an inverted index for exactly the same purpose.
By default, every field in a document is indexed (has an inverted index) and thus is searchable. A field without an inverted index is not searchable. We discuss inverted indexes in more detail in Inverted Index.
So you do not need to do anything special; Elasticsearch already indexes all the fields by default.
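If you do want per-field control, here is a minimal sketch of a mapping (made-up index and field names, sent via Python's requests library) that keeps the grok-produced fields searchable but switches off the inverted index for one field:

```python
import requests

ES = "http://localhost:9200"

# Fields parsed out by grok end up as ordinary document fields, so they
# are indexed (searchable) by default. Setting "index": false on a field
# skips building an inverted index for it, making it unsearchable but
# still available in _source.
mapping = {
    "mappings": {
        "properties": {
            "client_ip": {"type": "ip"},
            "status":    {"type": "integer"},
            "raw_line":  {"type": "text", "index": False},
        }
    }
}
print(requests.put(f"{ES}/weblogs", json=mapping).json())
```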
This is not a specific coding problem, but I would like to get an understanding of the use of terms in ES. Can someone give me a detailed definition of:
Repository
Cluster
Index (indices)
Types
Buckets
I've seen them again and again but could never grasp which ones belong to which.
Repository - Assuming you are coming from Spring land, this is an abstract term used to refer to a data store. A repository can be a database, a file system, or an inverted-index-based system like ES.
Cluster - A collection of Elasticsearch instances running on the same machine or on different machines. They are linked together using the cluster name, declared either in elasticsearch.yml or with -Des.cluster.name.
Index - The logical storage structure used by Elasticsearch to refer to stored data.
Types - A specific kind of document that is stored in an index.
Buckets - These come into play when you are doing aggregations like min/max/count/avg on the indexed data. Think of a bucket as a group of documents that match the query criteria. Buckets can contain nested buckets.
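For example, here is a minimal sketch of a terms aggregation with a nested sub-aggregation (made-up "products" index and field names, using Python's requests library): each bucket groups the documents that share one category value, and a nested sub-aggregation computes the average price inside each bucket.

```python
import requests

ES = "http://localhost:9200"

# One bucket per distinct "category" value; inside each bucket, a nested
# sub-aggregation computes the average "price" of the documents it holds.
body = {
    "size": 0,
    "aggs": {
        "by_category": {
            "terms": {"field": "category"},
            "aggs": {"avg_price": {"avg": {"field": "price"}}}
        }
    }
}
resp = requests.post(f"{ES}/products/_search", json=body)
for bucket in resp.json()["aggregations"]["by_category"]["buckets"]:
    print(bucket["key"], bucket["doc_count"], bucket["avg_price"]["value"])
```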
I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each metadata field is individually small (up to 64 characters). Common queries would involve a search term along with multiple metadata fields used to filter the data. So my question is which would provide better performance with respect to search response time (indexing time is not a concern):
a. Index the text data and push all the metadata fields into Solr as stored fields, and query Solr for all the fields using a single query. (Effectively, Solr does the filtering with metadata as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full text, and then use the document IDs returned from Solr as an input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine; it's much more. Its filter queries are at least as good/fast as a MySQL select.
b) is just silly. Fetch many IDs from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that ID list, then fetch the documents from MySQL or Solr (if you choose to store data in it, not just indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially since indexing time and disk space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
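For approach a), a single Solr request carries both the full-text query (q) and the metadata filters (fq). A minimal Python sketch using the requests library follows; the core name, field names, and filter values are made up for illustration:

```python
import requests

SOLR = "http://localhost:8983/solr/documents"   # made-up core name

# One request does both jobs: q is the full-text search, fq holds the
# metadata filters (each fq is cached independently of the main query).
params = {
    "q": "body:(contract renewal)",
    "fq": ["department:legal", "year:2015"],
    "fl": "id,title,score",
    "rows": 20,
    "wt": "json",
}
resp = requests.get(f"{SOLR}/select", params=params)
print(resp.json()["response"]["numFound"])
```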
The exception would be if you had a large amount of text to store (and retrieve) in each document. In that case it would be faster to fetch it from the RDB after you get your search results back. Anyway, no one can tell for sure which one would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about index time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors