Data Archival in Elastic Search - elasticsearch

Can you help me with how to archive data in Elasticsearch? I don't know what Curator and data shrink are. I am new to Elasticsearch. Where should I start studying, and what do I need to do?

Elasticsearch is a full-text index. You can use it to index data and get fast, powerful access to that data.
But it is an index.
I don't think Elasticsearch is the right place to archive data,
especially not if the archive has to fulfill certain archival standards.
You can archive your data somewhere else and use Elasticsearch to search over the archived data.
If I were you, I would use a specialized tool for storing and archiving the data, and index that data with Elasticsearch for powerful search.

You need to look at using ILM (index lifecycle management) for this. It's the replacement for Curator and handles this much more cleanly.
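As a rough illustration of what an ILM policy looks like, the sketch below builds a policy body in Python and prints it as JSON. The function name, the phase timings, and the single-shard shrink target are assumptions chosen for the example, not recommendations. The resulting body is what you would send with `PUT _ilm/policy/<policy-name>`.

```python
import json

def build_ilm_policy(rollover_age="30d", warm_after="30d", delete_after="365d"):
    """Return an ILM policy body. All timings are illustrative assumptions."""
    return {
        "policy": {
            "phases": {
                # Hot phase: keep writing to the index until it is old enough to roll over.
                "hot": {
                    "actions": {
                        "rollover": {"max_age": rollover_age}
                    }
                },
                # Warm phase: shrink to fewer shards and merge segments to save resources.
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        "shrink": {"number_of_shards": 1},
                        "forcemerge": {"max_num_segments": 1},
                    },
                },
                # Delete phase: remove the index once it is past the retention period.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }

if __name__ == "__main__":
    print(json.dumps(build_ilm_policy(), indent=2))
```

Note that the delete phase permanently removes the index, so if the data must be kept, an archive copy (for example a snapshot) needs to exist somewhere else before that phase runs.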

Related

ElasticSearch as primary DB for document library

My task is to build a full-text search system for a really large number of documents. I currently have the documents as RTF files plus their metadata, and all of this will be indexed in Elasticsearch. The documents are immutable (they can only be deleted) and I don't expect many new documents per day. So is it a good idea to use Elasticsearch as the primary DB in this case?
Maybe I'll store the RTF file separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here. So it's a good case for Elasticsearch as the primary DB.
Elasticsearch is better known as a distributed full-text search engine than as a database.
If you preserve the document _source, it can be used as a database, because almost any time you apply document changes or mapping changes you need to re-index the documents in the index (roughly the equivalent of a table in the relational world), and re-indexing needs the original source. There is no way to update part of a document in the underlying Lucene inverted index; the whole document has to be re-indexed.
Elasticsearch's index recovery mechanism is one of the best: if you lose a node, the replicas that were lost are automatically re-created on other available nodes, so no manual operations are needed.
If you take regular backups and the data does not have to be available 24/7, it is completely acceptable to hold both the data and the full-text index in Elasticsearch, as you would in a database.
But if you need a highly available combination, I would recommend keeping the documents in, for example, MongoDB (often regarded as a strong distributed document store) and using Elasticsearch only for its original purpose as a full-text search engine.

How to use Elasticsearch to make files in a directory searchable?

I am very new to search engines and Elasticsearch, so please bear with me and apologies if this question sounds vague. I have a large directory with lots of .csv and .hdr files, and I want to be able to search text within these files. I've done the tutorials and read some of the documentation but I'm still struggling to understand the concept of indexing. It seems like all the tutorials show you how to index one document at a time, but this will take a long time as I have lots of files. Is there an easier way to make elasticsearch index all the documents in this directory and be able to search for what I want?
Elasticsearch can only search documents it has indexed. Indexed means Elasticsearch has consumed the documents one by one and stored them internally.
Normally the internal structure matters, and you should understand what you're doing to get the best performance.
So you need a way to get your files into Elasticsearch. I'm afraid there is no "one-click" way to achieve this.
You need:
1) A running cluster
2) An index designed for the documents
3) A way to get the documents from the filesystem into Elasticsearch
Your question is focused on 3).
For this, search for script examples or tools that can crawl your directory and feed the documents to Elasticsearch.
A quick Google search brought me to:
https://github.com/dadoonet/fscrawler
https://gist.github.com/stevehanson/7462063
Theoretically it could be done with Logstash (https://www.elastic.co/guide/en/logstash/current/plugins-inputs-file.html), but I would give fscrawler a try.
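If you want to see what such a crawl script does in principle, here is a minimal sketch using only the Python standard library. It walks a directory, reads every .csv and .hdr file, and builds a newline-delimited JSON body in the format the `_bulk` API expects (one action line followed by one document line per file). The index name `files`, the field names `path` and `content`, and the one-document-per-file choice are all assumptions for illustration; sending the body to the cluster is left out.

```python
import json
import os

def iter_text_files(root, extensions=(".csv", ".hdr")):
    """Yield paths of files under root whose extension matches."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)

def build_bulk_body(root, index_name="files"):
    """Build an NDJSON bulk body: an action line plus a document line per file."""
    lines = []
    for path in iter_text_files(root):
        # errors="replace" keeps the crawl going if a file is not valid UTF-8.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            content = handle.read()
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps({"path": path, "content": content}))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

For a large directory you would build and send the body in batches rather than all at once, and for CSV files it may make more sense to index one document per row instead of one per file.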

Comparison of Handling Logs and PDFs in Solr & Elasticsearch and Data Visualization in Banana & Kibana

How do Elasticsearch and Solr compare in respect to the following:
Indexing logs.
Indexing events.
Indexing PDF documents.
Ease of creating and distributing visualizations. Kibana vs Banana.
Support and documentation for developers.
Any help is appreciated.
EDIT
More specifically, I am trying to figure out how exactly a PDF document or an event can be indexed at all. I have worked a little with Elasticsearch, and since I am a fan of JSON, I found it quite useful when I tried to index structured data.
For example, logs are mostly structured and thus, I guess, easier to index and search. But what if I want to index the whole log file itself?
Follow up
Is Kibana the only visualization tool available for Elasticsearch?
Is Banana the only visualization tool available for Solr?
Here is an answer to try to address just the Elasticsearch aspect of the post.
Take a look at https://github.com/elastic/elasticsearch-mapper-attachments for handling PDFs (note that this plugin was later deprecated in favor of the ingest attachment processor).
For events/logs, you would need to transform them into structured data to index in Elasticsearch. You can include a field for the source (the log file the data came from, and similar information); that way all the data from the whole log file is indexed. You can then use ES aggregations to group results by log file, calculate statistics, etc.
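A minimal sketch of that transformation step, assuming a simple "timestamp level message" log layout (the pattern, the field names, and the `parse_log_line` helper are made up for this example; real logs will need their own pattern):

```python
import re

# Assumed log layout: "2024-01-15 10:23:45 ERROR Disk full on /dev/sda1"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)$"
)

def parse_log_line(line, source_file):
    """Turn one raw log line into a dict ready to index, or None if it does not match."""
    match = LOG_PATTERN.match(line.rstrip("\n"))
    if match is None:
        return None
    doc = match.groupdict()
    # The source field lets you aggregate or filter by originating log file.
    doc["source"] = source_file
    return doc
```

Each returned dict is one JSON document; indexing every line of a file this way gives you the whole log file in Elasticsearch, searchable per field.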
The ELK stack is definitely worth a look.
I don't know if Kibana is the only visualization tool, but it is probably the most popular and likely offers more than the alternatives.

Solr HBase search engine

I need to use SolrCloud as the search engine on top of HBase and HDFS for searching a very large number of documents.
Currently these docs are in different data sources. I am confused about whether Solr should search, index and store these docs within itself, or whether Solr should just be used for indexing while the docs and their metadata reside in the HBase/HDFS layer.
I have tried searching for how the Solr HBase integration works best (meaning what should be done at the Solr level and what at the Hadoop level), but in vain. Has anyone done this kind of Big Data search before and can give some pointers? Thanks
Solr provides fast search via its indexes, which are inverted indexes. You index documents into Solr, and it builds the indexes. How the indexes are built depends on how you have defined schema.xml. The indexes and the stored field values can be kept in HDFS (depending on your configuration in solrconfig.xml).
As for HBase, you can run your queries directly on HBase without using Solr. SolrBase is one available Solr and HBase integration. Also have a look at Lily.
A good design is to search in Solr, quickly get the ids of the matching records, and then, if needed, fetch the entire record from HBase. Make sure the entire data set is in HBase and that only the data needed for search is indexed. Needless to say, Solr and HBase must be kept in sync. One ready-made framework for this is NGDATA/hbase indexer here.
Solr works wonders for counts, grouped counts, and stats. Once you have those numbers and the ids, HBase can take over. Once you have the row key (id) in HBase, you get low-latency lookups, which suits web applications well too.
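To make the "search for ids, then fetch by key" design concrete, here is a toy in-memory sketch. It does not call Solr or HBase at all: the `index` dict stands in for the Solr inverted index and the `store` dict stands in for the HBase table, and the class and method names are invented for this example.

```python
from collections import defaultdict

class SearchThenFetch:
    """In-memory stand-in: `index` plays the role of Solr, `store` the role of HBase."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of row keys
        self.store = {}                # row key -> full record

    def put(self, row_key, record, indexed_fields):
        # The full record goes to the store; only the selected fields are indexed.
        self.store[row_key] = record
        for field in indexed_fields:
            for term in str(record.get(field, "")).lower().split():
                self.index[term].add(row_key)

    def search_ids(self, term):
        # Step 1: the index answers "which records match?" with ids only.
        return sorted(self.index.get(term.lower(), set()))

    def fetch(self, row_keys):
        # Step 2: fetch the full records by key from the store.
        return [self.store[key] for key in row_keys if key in self.store]
```

Writing to both structures in a single `put` is what keeps them in sync here; in a real deployment that job falls to whatever indexing pipeline sits between the two systems.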

couchbase data replication elasticsearch

I went through the Couchbase XDCR replication documentation, but failed to understand the points below:
1. Couchbase replicates all the data in a bucket to Elasticsearch in batches, and Elasticsearch indexes this data for real-time statistical data. My question is: if all the data is replicated to Elasticsearch, then Elasticsearch is like a database which can hold a huge amount of data. So can we replace Couchbase with Elasticsearch?
2. How is the data, in JSON form, sent to d3.js to display statistical graphs?
All of the data is replicated to Elastic Search, but is not held there by default. The indexes and such are created, but the documents are discarded. Elastic Search is not a database and does not perform like one and certainly not on the level of Couchbase. Take a look at this presentation where it talks about performance and stuff and why Cochbas
If your data are not critical or if you have another source of truth, you can use Elasticsearch only.
Otherwise, I'd keep Couchbase and Elasticsearch.
There is a resiliency page on Elastic.co website which describes potential known problems. https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
My 2 cents.

Resources