Should logs, metrics, and analytics all go to one data lake or be stored separately? - elasticsearch

Background:
I am setting up my first Elastic Stack, and while I will be starting simple, I want to make sure I'm starting with good architecture. I would eventually like to have a solution for the following: hosting metrics, server logs (Express.js APM), single-page app monitoring (APM RUM JS agent), Redis metrics, MongoDB metrics, and custom event analytics (e.g. sale, customer cancelled, etc.).
Question:
Should I store all of this on one Elasticsearch cluster and use search to filter out the different cases, or should I create a separate instance for each and keep them clearly scoped to their roles?
(I would prefer the single data lake)

For the logging use case:
You can store all the logs on a file system share before ingesting them into any search solution, so that you can re-ingest them if needed (see the sketch below).
After storage, you can either ingest them into just one cluster with different indices, or into multiple clusters; it is an open choice, but it depends on the amount of data.
If the size and compute of each use case justify a separate ES cluster, then do it; otherwise, use a single cluster with a failover cluster.
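As an illustration of the re-ingest idea, here is a minimal sketch using the elasticsearch-py client; the file-share path, the NDJSON file format, and the index naming are assumptions, not anything prescribed above.

```python
import glob
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def archived_log_actions(pattern="/mnt/logshare/app-logs/*.ndjson"):
    """Yield one bulk action per JSON line found on the file share."""
    for path in glob.glob(pattern):
        with open(path) as f:
            for line in f:
                doc = json.loads(line)
                # Route each document to a per-day index (field name assumed).
                yield {"_index": "app-logs-%s" % doc.get("date", "unknown"), "_source": doc}

# helpers.bulk batches the generator into bulk requests for us.
helpers.bulk(es, archived_log_actions())
```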
For metrics:
You can directly ingest them into one cluster with different index patterns.
If the size and compute requirements justify it, make separate clusters.
Make a failover/backup cluster if needed.
In both cases, you will also need to store cluster snapshots.
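Since snapshots come up in both cases, here is a rough sketch of the snapshot workflow with the elasticsearch-py client; the repository name, location, snapshot name, and index patterns are placeholders, and an "fs" repository requires path.repo to be set in elasticsearch.yml on every node.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Register a shared-filesystem snapshot repository once.
es.snapshot.create_repository(
    repository="nightly_backups",
    body={"type": "fs", "settings": {"location": "/mnt/es-snapshots"}},
)

# Take a snapshot of the logging and metrics indices.
es.snapshot.create(
    repository="nightly_backups",
    snapshot="snapshot-2019-01-01",
    body={"indices": "logs-*,metrics-*"},
    wait_for_completion=True,
)
```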
I personally recommend the ELK stack for the logging use case, and Prometheus for metrics.
Reporting/Analytics:
For some use cases like monthly and yearly reporting/analytics, the log data will be huge, and you will need to ingest the data from the file share into Hadoop to summarize it / roll it up based on some fields, and then ingest the reduced data into ELK; this can cut the size and compute requirements by a factor of 1000.
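As a hedged illustration of that roll-up step, here is a small PySpark sketch (the field names timestamp, region, and status are invented) that reduces raw request logs to daily counts before they are indexed into ELK.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-rollup").getOrCreate()

# Raw logs previously archived on the file share (NDJSON assumed).
raw = spark.read.json("/mnt/logshare/app-logs/*.ndjson")

# Roll up to one row per day/region/status instead of one row per request.
daily = (
    raw.withColumn("day", F.to_date("timestamp"))
       .groupBy("day", "region", "status")
       .count()
)

# The reduced output is what gets ingested into ELK for monthly/yearly reports.
daily.write.mode("overwrite").json("/mnt/logshare/rollups/daily")
```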

Related

Can logstash perform statistical analysis on the data coming from filebeat?

OK, my problem is whether it is possible to use Logstash to perform statistical analysis on the collected log data. I have used Filebeat to collect nginx logs into the ES cluster and put the required labels on these logs. I plan to read these logs from the ES cluster and write a program to compute statistics on them, such as the traffic from a certain region over a period of time. Now I want to know whether the logs collected by Filebeat can be passed to Logstash for these statistics.
After a short period of research, I haven't found that Logstash has this function. I hope you can help me. Thanks.
I want to know whether Logstash can provide the functions I need.
Logstash basically ingests, transforms, and ships your data regardless of format or complexity: it can derive structure from unstructured data with grok, resolve geo coordinates from IP addresses, anonymize or exclude sensitive fields, and ease overall processing. It is not a statistics engine in itself, though. I don't know your specific use case in detail, but you can use Kibana (or Elasticsearch aggregations directly) for analyzing statistical data. You don't even need Logstash if you have a node in your Elasticsearch cluster with the ingest node role.
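For example, the "traffic from a certain region over a period of time" statistic can be computed directly with an Elasticsearch aggregation. Here is a hedged sketch with the Python client; the index pattern and the field names (@timestamp, geoip.region_name) depend on how your Filebeat setup labels the documents, and the exact date_histogram parameter names vary between ES versions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

resp = es.search(
    index="filebeat-*",
    body={
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-7d/d"}}},
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "@timestamp", "interval": "day"},
                "aggs": {
                    "per_region": {"terms": {"field": "geoip.region_name", "size": 10}}
                },
            }
        },
    },
)

# Print request counts per day and region straight from the aggregation buckets.
for day in resp["aggregations"]["per_day"]["buckets"]:
    for region in day["per_region"]["buckets"]:
        print(day["key_as_string"], region["key"], region["doc_count"])
```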

Elasticsearch API vs Spring Data vs Logstash

I am planning to use Elasticsearch for our dashboard, with Spring Boot based REST services. After research I see these top options:
Option A:
Use the Elasticsearch Java API (from the comments, it looks like this is going away)
Use the Elasticsearch Java REST client
Use spring-data-elasticsearch (planning to use ES 5.6, but this is challenging for the latest ES 6 as I don't see support for it right now)
Option B:
Or shall I use the Logstash approach, i.e. sync data between PostgreSQL and Elasticsearch using Logstash?
Which of these will be the better long-term approach for getting near real-time data from ES in a high-load scenario?
Use case: I need to save some data from a PostgreSQL table to Elasticsearch for my dashboard (near real time).
Updates are frequent on both the tables and ES, to maintain the current state.
Load is going to increase in a couple of weeks.
The options you listed, in essence, come down to: should you go with a ready-to-use solution (Logstash) or should you implement your own?
Try Logstash first to see if it works for you; it'll take less time than implementing your own solution, and you can get a working setup in minutes (if it's not hundreds of tables).
If you want near-real time, then you need to figure out if it allows you to:
handle incremental updates, i.e. whether the JDBC input's 'tracking_column' configuration works for your data structure so that each run only loads updated records, not the whole table
run it at the desired frequency
and in general, satisfies your latency requirements
If you decide to go with your own solution, keep in mind that spring-data-elasticsearch is a higher-level wrapper over the underlying Elasticsearch client. If there are latency goals, then working at the lower level (the Elasticsearch clients) may give you better control and more options to tune the pipeline.
Otherwise, the client choice will matter less than the data feed characteristics (volume/update frequency) and the DB/ES cluster configuration.
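If you do end up hand-rolling it, a minimal incremental-sync sketch could look like the following (psycopg2 plus elasticsearch-py; the table, columns, and index name are invented, and the high-water mark would need to be persisted between runs, which Logstash's JDBC input otherwise handles for you).

```python
import psycopg2
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])
pg = psycopg2.connect("dbname=app user=app")

def sync(last_seen):
    """Index every row updated since last_seen; return the new high-water mark."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT id, name, updated_at FROM dashboard_items "
            "WHERE updated_at > %s ORDER BY updated_at",
            (last_seen,),
        )
        rows = cur.fetchall()

    actions = (
        {
            "_index": "dashboard-items",
            "_id": row[0],
            "_source": {"name": row[1], "updated_at": row[2].isoformat()},
        }
        for row in rows
    )
    helpers.bulk(es, actions)
    return rows[-1][2] if rows else last_seen
```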

Index a large amount of data into Elasticsearch

I've got more than 6 billion social media records in HBase (including content/time/author and other possible fields), spread across 4100 regions on 48 servers, and I need to flush this data into Elasticsearch now.
I'm clear about the bulk API of ES, but using bulk in Java with MapReduce still costs many days (at least a week or so). I can use Spark instead, but I don't think it will help a lot.
I'm wondering whether there are any other tricks to write this much data into Elasticsearch, like manually writing ES index files and using some kind of recovery mechanism to load the files from the local file system?
Appreciate any possible advice, thanks.
==============
Some details about my cluster environments:
Spark 1.3.1 standalone (I can switch it to YARN to use Spark 1.6.2 or 1.6.3)
Hadoop 2.7.1 (HDP 2.4.2.258)
ElasticSearch 2.3.3
AFAIK Spark is the better option for indexing out of the two options below.
Along with that, below are the approaches I'd offer:
Divide (by input scan criteria) and conquer the 6 billion social media records:
I'd recommend creating multiple Spark/MapReduce jobs with different scan criteria (to divide the 6 billion records into, say, 6 pieces based on category or something else) and triggering them in parallel.
For example, splitting by data capture time range (scan.setTimeRange(t1, t2)) or with some fuzzy row logic (FuzzyRowFilter) should definitely speed things up.
OR
A kind of streaming approach:
You can also consider creating the indexes at the same time as you insert the data through Spark or MapReduce.
For example, in the case of Solr, Cloudera has the NRT HBase Lily indexer: as the HBase table is populated, it simultaneously creates Solr indexes based on WAL (write-ahead log) entries. Check whether anything like that exists for Elasticsearch.
Even if it doesn't exist for ES, don't worry; you can create the indexes yourself from the Spark/MapReduce program while ingesting the data.
Option 1:
If you are okay with Spark, I'd suggest it as a good solution.
Spark has native integration with ES via elasticsearch-hadoop from version 2.1 onwards.
See:
"elasticsearch-hadoop provides native integration between Elasticsearch and Apache Spark, in the form of an RDD (Resilient Distributed Dataset) (or Pair RDD to be precise) that can read data from Elasticsearch. The RDD is offered in two flavors: one for Scala (which returns the data as Tuple2 with Scala collections) and one for Java (which returns the data as Tuple2 containing java.util collections)."
Samples are available here for different Spark versions from 1.3 onwards.
More samples, beyond HBase, are available as well.
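To make the Spark path concrete, here is a hedged PySpark sketch using the elasticsearch-hadoop connector; the connector jar must be on the executor classpath (e.g. via --jars), the tiny sample RDD stands in for your HBase scan output, and "social/posts" is a made-up index/type.

```python
import json
from pyspark import SparkContext

sc = SparkContext(appName="hbase-to-es")

# Stand-in for the RDD you would build from your HBase scan job.
docs = sc.parallelize([
    {"author": "a", "time": "2016-01-01T00:00:00", "content": "hello"},
    {"author": "b", "time": "2016-01-02T00:00:00", "content": "world"},
])

es_conf = {
    "es.nodes": "es-node-1:9200",
    "es.resource": "social/posts",    # target index/type
    "es.input.json": "yes",           # the values below are JSON strings
    "es.batch.size.entries": "3000",  # bulk size per task, tune as needed
}

# EsOutputFormat expects (key, value) pairs; the key is ignored here.
docs.map(lambda d: (None, json.dumps(d))).saveAsNewAPIHadoopFile(
    path="-",
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)
```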
Option 2: As you are aware, this is a bit slower than Spark.
Writing data to Elasticsearch
With elasticsearch-hadoop, Map/Reduce jobs can write data to Elasticsearch making it searchable through indexes. elasticsearch-hadoop supports both (so-called) old and new Hadoop APIs.
I found a practical trick myself to improve bulk index performance.
I can calculate the routing hash in my client and make sure that each bulk request contains only index requests with the same routing. Using the routing result and the shard info (with node IPs), I send the bulk request directly to the corresponding shard node. This trick avoids the bulk reroute cost and cuts down on the bulk thread pool occupation that can cause EsRejectedExecutionException.
For example, I have 48 nodes on different machines. Assume I send a bulk request containing 3000 index requests to an arbitrary node: those index requests will be rerouted to other nodes (usually all of them) according to routing, and the client thread has to wait for the whole process to finish, including processing the local bulk and waiting for the other nodes' bulk responses. Without the reroute phase, however, those network costs are gone (except for forwarding to the replica shards), and the client just needs to wait less time. Meanwhile, assuming I have only 1 replica, the total occupation of bulk threads is only 2 (client -> primary shard, and primary shard -> replica shard).
Routing hash:
shard_num = murmur3_hash (_routing) % num_primary_shards
Take a look at org.elasticsearch.cluster.routing.Murmur3HashFunction.
The client can get the shards and index aliases through the cat APIs:
shard info: the cat shards API (_cat/shards)
alias mapping: the cat aliases API (_cat/aliases)
A few caveats:
ES may change the default hash function between versions, which means the client code may not be version compatible.
This trick is based on the assumption that the hash results are reasonably balanced.
The client should handle fault tolerance, such as connection timeouts to the corresponding shard node.
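Here is a rough Python sketch of the trick described above (elasticsearch-py plus the mmh3 package). The hash shown mirrors what ES 2.x's Murmur3HashFunction did (murmur3 x86_32, seed 0, over the UTF-16-LE bytes of the routing value), but as the caveats say, verify it against your own ES version; the index name, routing field, and default port 9200 are assumptions.

```python
from collections import defaultdict

import mmh3                        # pip install mmh3
from elasticsearch import Elasticsearch, helpers

coordinating = Elasticsearch(["http://es-node-1:9200"])

def shard_for(routing, num_primary_shards):
    # Must match org.elasticsearch.cluster.routing.Murmur3HashFunction for your version.
    return mmh3.hash(routing.encode("utf-16-le"), 0) % num_primary_shards

# Map each primary shard number to the IP of the node currently holding it.
shard_to_ip = {
    int(s["shard"]): s["ip"]
    for s in coordinating.cat.shards(index="social", format="json")
    if s["prirep"] == "p"
}

def routed_bulk(docs, index="social", num_primary_shards=48):
    """Group docs by primary shard and send each group straight to its node."""
    groups = defaultdict(list)
    for doc in docs:
        groups[shard_for(doc["id"], num_primary_shards)].append(doc)

    for shard, shard_docs in groups.items():
        node = Elasticsearch(["http://%s:9200" % shard_to_ip[shard]])
        helpers.bulk(
            node,
            (
                {"_index": index, "_id": d["id"], "_routing": d["id"], "_source": d}
                for d in shard_docs
            ),
        )
```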

Use Elasticsearch as a backup store

My application receives and parses thousands of small JSON snippets (each about 1 KB) every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I have never read about anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I rather use different storage for this use case?
(Writing to the local file system isn't safe, as our hard disks crash often. At first I thought about using HDFS, but it isn't made for small files.)
First you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It increases availability and provides failover support, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set at backup time. It is used to restore the data when the system crashes.
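To make the distinction concrete, here is a tiny elasticsearch-py sketch; the index name echoes the question's idea, while "json_backup_repo" is a placeholder repository that must already be registered via the snapshot APIs.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Replicas: extra runtime copies for availability; an accidental delete or a
# bad bulk update still hits every copy at once.
es.indices.create(
    index="json-snippets",
    body={"settings": {"number_of_replicas": 4}},
)

# Backup: a point-in-time snapshot into a pre-registered repository, which you
# can restore from after a crash or a bad delete.
es.snapshot.create(
    repository="json_backup_repo",
    snapshot="snippets-2016-01-01",
    body={"indices": "json-snippets"},
)
```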
Elasticsearch for backup: it's not a good idea. Elasticsearch is a search engine, not a database. If you have not configured the ES cluster carefully, you can end up with data loss.
So, in my opinion:
To store JSON objects, we have lots of databases. For example, MongoDB is a NoSQL DB that can easily be configured with more replicas, which means high availability of data and failover support. As you asked, it is also open source and, for this purpose, more reliable.
For more info about MongoDB, refer to https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with multiple shards, they will be distributed among the nodes; if a node fails and there are no replicas of its shards, that data will be lost. In MongoDB, more nodes means each node in a replica set contains its own copy of the data, so if one MongoDB node fails we can retrieve the data from the replicas. With Elasticsearch we need to be more conscious about replica setup and shard allocation; in MongoDB it is easier, and a good architecture too.
Note: I am not saying that storing data in Elasticsearch is unsafe. I mean that, compared to MongoDB, it is harder to configure and maintain replication in Elasticsearch.
Hope it helps..!

How to build distributed search based on Hadoop and Lucene

I'm preparing to build a distributed search module with Lucene and Hadoop, but I'm confused about a few things:
As we know, HDFS is a distributed file system; when I put a file into HDFS, the file is divided into several blocks and stored on different slave machines in the cluster. But if I use Lucene to write an index on HDFS, I want to see the index on each machine; how is that achieved?
I have read some of hadoop/contrib/index and some Katta, but I don't understand the idea of "shards, which look like parts of the index": is a shard stored on the local disk of one computer, or is it a single directory distributed across the cluster?
Thanks in advance.
As for your question 1:
You can implement the Lucene "Directory" interface to make it work with Hadoop and let Hadoop handle the files you submit to it. You could also provide your own implementations of "IndexWriter" and "IndexReader" and use your Hadoop client to write and read the index; that way you have more control over the format of the index you write. You can "see" or access the index on each machine via your Lucene/Hadoop implementation.
For your question 2:
A shard is a subset of the index. When you run your query, all shards are processed at the same time and the results of the index search on all shards are combined. On each machine of your cluster you will have a part of your index: a shard. So a part of the index is stored on a local machine, but it appears to you as a single index distributed across the cluster.
I can also suggest that you check out the distributed search solution SolrCloud (or here).
It runs on Lucene as its indexing/search engine and already gives you a clustered index. It also provides an API for submitting files to index and for querying the index. Maybe it is sufficient for your use case.
