Currently my organization holds semi-structured data in Elasticsearch, which we query for fast text search and aggregation. But we have other products whose data lives in other databases, so we want to put all of the data in a data lake such as HDFS.
If I use HDFS as a data lake to hold the raw data, how would I use Elasticsearch with it? Elasticsearch indexes data before it can be searched, so is it possible to keep the data in the data lake and have Elasticsearch query it directly, without storing the data in Elasticsearch itself? Or would I hold the data in the data lake, process it, and then store it in Elasticsearch again so it can be indexed?
To summarize: I want to understand the concepts of Elasticsearch and Hadoop integration.
Both Spark and Hive offer Elasticsearch connectors; there's no need to export documents into HDFS, except possibly for backup purposes.
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/reference.html
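For example, here is a minimal PySpark sketch of querying an Elasticsearch index straight into a DataFrame through the connector. The index name "logs" and the fields "level" and "service" are made up for illustration, and it assumes the elasticsearch-hadoop jar is on the Spark classpath:

```python
# Minimal sketch: query Elasticsearch directly from Spark via the
# elasticsearch-hadoop connector -- no HDFS export step in between.
# Assumes the elasticsearch-spark jar is on the classpath; the index
# name "logs" and the fields "level"/"service" are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-read-example").getOrCreate()

# Each Elasticsearch document becomes one DataFrame row.
df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "localhost")
      .option("es.port", "9200")
      .load("logs"))

# Aggregations run in Spark; the connector pushes filters down to
# Elasticsearch where it can.
df.filter(df["level"] == "ERROR").groupBy("service").count().show()
```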
Related
Use Case:
An application uses Spark to process data for 5 minutes; the data to be processed could be several hundred thousand records in the data store.
The choice for the data store is Elasticsearch.
Issue:
Is there a Spark connector for Elasticsearch, similar to the one for MongoDB?
https://www.mongodb.com/products/spark-connector.
Investigation:
I spent a lot of time on this, but the best I could find was a solution using the search API with scroll (which fetches a limited number of records per request), and that does not fit my use case.
Please note that my Elasticsearch holds JSON data, and we do not want to save a binary RDD.
As mentioned here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
you can use the Spark connector for ES. The data is not saved in any binary form: the RDD/DataFrame is serialized as JSON, and that is what goes into Elasticsearch.
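As a hedged sketch of that write path (the index name "records" is hypothetical, and the elasticsearch-hadoop jar is assumed to be on the classpath):

```python
# Sketch: write a DataFrame to Elasticsearch through the connector.
# Each row is serialized as a JSON document -- nothing is stored in a
# binary RDD form. The index name "records" is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-write-example").getOrCreate()

df = spark.createDataFrame([(1, "hello"), (2, "world")], ["id", "msg"])

(df.write
   .format("org.elasticsearch.spark.sql")
   .option("es.nodes", "localhost")
   .option("es.port", "9200")
   .option("es.mapping.id", "id")  # reuse the "id" column as the document _id
   .mode("append")
   .save("records"))
```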
I am a newcomer to Elasticsearch, and my question is: I want to store a large number of log files in an Elasticsearch database. How are the data files stored? Which types of files can be stored in Elasticsearch? Does Elasticsearch store only structured data (JSON or some other structured format), or does it store unstructured data as well?
Thanks.
Elasticsearch stores nothing itself; it relies on Apache Lucene for that. Each Elasticsearch shard is itself a fully functional, independent "index" that can be hosted on any node in the cluster.
https://lucene.apache.org/core/ "Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java."
More about what elasticsearch stores: https://www.elastic.co/blog/found-dive-into-elasticsearch-storage
To understand how the data is stored, read about the inverted index: https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
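To make that concrete, here is a toy Python sketch of the idea (an illustration of the concept only, not Lucene's actual implementation): map every unique term to the set of documents that contain it.

```python
# Toy inverted index: term -> set of document ids containing the term.
# This only illustrates the concept; Lucene's real storage is far richer.
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dogs and quick foxes",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Finding the documents for a term is now a single lookup, not a scan.
print(sorted(index["quick"]))  # [1, 3]
```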
I have a few GB of data to transfer to ES. I know one way: first dump the data from Redshift to S3, and then load it from S3 into ES.
Is there any other way?
You can check out elasticdump.
For reference: how to move elasticsearch data from one server to another
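If you stay with the S3 route, the final load step can be a short script. A hedged sketch with the official Python client, assuming the Redshift UNLOAD produced newline-delimited JSON and using made-up file and index names:

```python
# Sketch: bulk-load newline-delimited JSON (e.g. a Redshift UNLOAD fetched
# from S3) into Elasticsearch. "dump.json" and "myindex" are placeholders.
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

def actions(path, index):
    with open(path) as f:
        for line in f:
            # One JSON object per line becomes one document.
            yield {"_index": index, "_source": json.loads(line)}

helpers.bulk(es, actions("dump.json", "myindex"), chunk_size=1000)
```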
I went through the Couchbase XDCR replication documentation, but failed to understand the points below:
1. Couchbase replicates all the data in a bucket to Elasticsearch in batches, and Elasticsearch indexes that data for real-time statistical queries. My question: if all the data is replicated to Elasticsearch, then Elasticsearch is acting like a database that can hold a huge amount of data. So can we replace Couchbase with Elasticsearch?
2. How is the JSON data sent to d3.js to display statistical graphs?
All of the data is replicated to Elasticsearch, but it is not held there by default: the indexes are built, but the document bodies are discarded. Elasticsearch is not a database and does not perform like one, certainly not on the level of Couchbase. Take a look at the presentations on the Couchbase/Elasticsearch integration; they cover the performance trade-offs and why the two are used together.
If your data are not critical or if you have another source of truth, you can use Elasticsearch only.
Otherwise, I'd keep Couchbase and Elasticsearch.
There is a resiliency page on the Elastic.co website that describes the known potential problems: https://www.elastic.co/guide/en/elasticsearch/resiliency/current/index.html
My 2 cents.
My goal here is to get all documents from an index on one ES cluster and insert them into another ES cluster, keeping the same metadata.
I had a look at the mget API to retrieve the data and the Bulk API to insert it, but the Bulk API needs a special structure:
action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
So my idea is to retrieve the data from EScluster1 into a file, rearrange it to match the Bulk API structure, and index it into EScluster2.
Do you see a better and/or faster way to proceed?
elasticdump does this. If you want to do it manually, you'll want to query using scroll and then bulk-index what comes out of that; it's not too hard to script together (see the sketch below). With elasticdump you can pump the data around without writing to a file. However, it is somewhat limited when you have e.g. parent/child relations in your index.
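A hedged sketch of that scroll-plus-bulk loop with the official Python client (the cluster URLs and the index name "myindex" are placeholders):

```python
# Sketch: copy every document from EScluster1 to EScluster2, preserving
# the _index and _id metadata. URLs and index name are placeholders.
from elasticsearch import Elasticsearch, helpers

source = Elasticsearch("http://escluster1:9200")
target = Elasticsearch("http://escluster2:9200")

def actions():
    # helpers.scan wraps the scroll API and yields every hit in the index.
    for hit in helpers.scan(source, index="myindex",
                            query={"query": {"match_all": {}}}):
        yield {
            "_index": hit["_index"],   # keep the same index name
            "_id": hit["_id"],         # keep the same document id
            "_source": hit["_source"],
        }

# helpers.bulk batches the generator into _bulk requests -- no temp file needed.
helpers.bulk(target, actions(), chunk_size=1000)
```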