I have a problem building a real-time dashboard that visualizes my log data from HDFS. Below is a sketch of my design. My log data is generated on the fly, so I use Kafka and Spark Streaming to handle it. But after the data leaves Spark Streaming, I don't know how to get it into a local website for visualization. Could you give me any ideas on how to do this?
Data input in HDFS --> Kafka --> Spark Streaming --> ? --> Front-end web (d3.js, ...)
Thank you
P.S.: Kafka and Spark Streaming are managed by an Ambari server (HDP 2.4).
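One common way to fill in the `?` is to have the streaming job publish small, pre-aggregated snapshots somewhere the browser can poll, e.g. a JSON file behind a web server, and let d3.js re-render on each poll. A minimal sketch, assuming Spark 1.6 (as shipped with HDP 2.4), a Kafka topic named `logs`, and a broker on the default HDP port; all names and paths are placeholders:

```python
# Minimal sketch: aggregate log events per micro-batch and overwrite a small
# JSON file that the d3.js front end polls over HTTP. The topic, broker, and
# output path below are assumptions, not from the original post.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LogDashboardFeed")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["logs"], {"metadata.broker.list": "broker1:6667"})

# Count events per log level in each batch (assumes "LEVEL message..." lines).
counts = (stream.map(lambda kv: kv[1].split(" ", 1)[0])
                .map(lambda level: (level, 1))
                .reduceByKey(lambda a, b: a + b))

def publish(time, rdd):
    # Runs on the driver: dump the latest snapshot where a web server
    # (nginx, a small Flask app, ...) can serve it to the browser.
    snapshot = {"time": str(time), "counts": dict(rdd.collect())}
    with open("/var/www/dashboard/latest.json", "w") as f:
        json.dump(snapshot, f)

counts.foreachRDD(publish)
ssc.start()
ssc.awaitTermination()
```

Other options in the same spirit: push each snapshot over a WebSocket, or write it into Elasticsearch and let the page query that; the polled file is just the simplest thing that works.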
What is the difference between Apache Flume and Apache Storm?
Is it possible to ingest log data into a Hadoop cluster using Storm?
Both are used for streaming data, so can Storm be used as an alternative to Flume?
Apache Flume is a service for collecting large amounts of streaming data, particularly logs. Flume pushes data to consumers using mechanisms it calls data sinks. Flume can push data to many popular sinks right out of the box, including HDFS, HBase, Cassandra, and some relational databases.
Apache Storm involves streaming data. It is the bridge between batch processing and stream processing, which Hadoop is not natively designed to handle. Storm runs continuously, processing a stream of incoming data and dicing it into batches, so Hadoop can more easily ingest it. Data sources are called spouts and each processing node is a bolt. Bolts perform computations and processes on the data, including pushing output to data stores and other services.
If you need something that works out of the box, choose Flume, once you have decided whether pushing or pulling makes more sense for you. If streaming data is, for now, just a small add-on to your already-developed Hadoop environment, Storm is a good choice.
It is possible to ingest log data into a Hadoop cluster using Storm.
Storm can be used as an alternative to Flume.
I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi? (See the monitoring sketch after Edit-1 below.)
Should we monitor queue/process performance?
Should we use something like Ambari?
Regards.
Edit-1:
#James: Actually, I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to a FlowFile attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge FlowFiles, so the HDFS writes happen in batches.
I have around 50k events/s.
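For the monitoring questions, one lightweight option besides Ambari is to poll NiFi's REST API from a Nagios check. A hypothetical sketch, assuming NiFi 1.x with the API on port 8080 and no authentication; the host name and threshold are made up:

```python
# Hypothetical Nagios-style check: poll NiFi's REST API for the total queued
# FlowFile count and heap usage. Host, port, and threshold are assumptions.
import sys
import requests

NIFI = "http://nifi-node1:8080/nifi-api"
MAX_QUEUED = 100000  # alert threshold; tune for your ~50k events/s load

status = requests.get(NIFI + "/flow/status").json()["controllerStatus"]
diag = requests.get(NIFI + "/system-diagnostics").json()["systemDiagnostics"]

queued = int(status["flowFilesQueued"])
heap = diag["aggregateSnapshot"]["heapUtilization"]

print("flowfiles_queued=%d heap=%s" % (queued, heap))
sys.exit(2 if queued > MAX_QUEUED else 0)  # exit 2 = CRITICAL for Nagios
```

NiFi also ships reporting tasks that can push metrics out (including to Ambari), so either route works; the REST polling above is just the smallest thing that plugs into an existing Nagios setup.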
I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream the logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka the correct choice, or Apache Flume?
2) After streaming, is it better to use Apache Storm to ingest the data into Hive?
Please help with any suggestions, and also any information about this kind of problem statement.
Thanks
You can use either Kafka or Flume, or combine both, to get data into HDFS, but you would need to write code for this. There are open-source data flow management tools available with which you don't need to write code, e.g. NiFi and StreamSets.
You don't need any separate ingestion tools; you can use those data flow tools directly to put data into a Hive table. Once the table is created in Hive, you can do your analytics with queries.
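To make that second point concrete, here is a minimal sketch of the analytics step, assuming the flow tool (NiFi or StreamSets) has already landed parsed IIS logs in a Hive table; the table and column names (`iis_logs`, `site`, `log_time`, `status`) are placeholders:

```python
# Minimal sketch: query the Hive table from Spark once the flow tool has
# populated it. Table and column names below are assumptions.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("IISLogAnalytics")
         .enableHiveSupport()   # talk to the Hive metastore
         .getOrCreate())

# Example analytic: hits per site, per day, per HTTP status.
spark.sql("""
    SELECT site, to_date(log_time) AS day, status, COUNT(*) AS hits
    FROM iis_logs
    GROUP BY site, to_date(log_time), status
    ORDER BY hits DESC
""").show(20)
```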
Let me know if you need anything else on this.
I am working on building a distributed real-time cluster system to supervise and analyze a network. I did some research on the internet and came up with a few technologies:
for real-time processing: Logstash, Storm, and Apache Spark Streaming
for storage: Elasticsearch
for analysis: Apache Spark over Hadoop (I will use ES-Hadoop to connect with Elasticsearch)
for data visualization: Kibana, D3.js, C3.js
However, Logstash is not mentioned as often as Spark Streaming and Storm. I found on the internet the following architecture, presented in the picture below:
I have two questions:
I don't understand why Logstash is not often mentioned as a real-time processing system like Spark Streaming and Storm. What are the main reasons? I have been using it and it is very powerful.
Regarding the analysis part, can I use the machine learning libraries in that configuration?
Logstash is not a clustered stream processing system; it is simply a JVM-based process. The latest version supports an on-disk buffer, but it does not have nearly the same delivery guarantees as Spark or Storm. Take a look at http://storm.apache.org/releases/1.0.3/Guaranteeing-message-processing.html
Yes, but I'm not sure why you'd use Elastic for storing the data first. Why not HDFS -> SparkML -> Elastic? The main thing to think about here is managing models, training, and testing.
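A rough sketch of that HDFS -> SparkML -> Elastic flow, assuming parsed logs stored as JSON on HDFS, two numeric feature columns, and the es-hadoop connector on the classpath; all paths, fields, and the index name are assumptions:

```python
# Rough sketch of HDFS -> SparkML -> Elastic: train offline on logs stored in
# HDFS, score them, and index the results via es-hadoop. The input path,
# feature columns, ES node, and index name are all assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("NetLogScoring").getOrCreate()

logs = spark.read.json("hdfs:///data/network/logs/")  # parsed log records
features = VectorAssembler(
    inputCols=["bytes", "duration"], outputCol="features").transform(logs)

model = KMeans(k=5, seed=42).fit(features)  # offline training
scored = model.transform(features)          # adds a "prediction" column

# Requires the es-hadoop jar on the classpath, e.g.
#   spark-submit --jars elasticsearch-spark_2.11-<version>.jar ...
(scored.drop("features")  # ES cannot index the ML vector type
       .write.format("org.elasticsearch.spark.sql")
       .option("es.nodes", "es-node1:9200")
       .save("netlogs/scored"))
```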
We are building a real-time big data tool with open-source tools. Our main goal is to supervise and analyze a network by getting logs from a Kafka server in real time. We saw in tutorials that we have to divide our tool into two sections, Analytics and Supervision, as shown below.
For the supervision section we chose Elasticsearch and Logstash.
Regarding the analytics section, my team and I are comparing Apache Spark Streaming and Apache Storm in order to use one of them with Elasticsearch. Even though Apache Storm is a true real-time data processing tool and faster than Apache Spark Streaming, it does not provide machine learning libraries like Apache Spark does. That's why we are thinking of choosing Apache Spark. The Elastic website indicates that there is a connector, ES-Hadoop, to connect an Elasticsearch database to a Hadoop ecosystem, as we can see in the figure below.
However, we are a little confused by this picture, because it shows only Spark SQL and not all the Spark frameworks (MLlib, Spark Streaming, ...). We made some assumptions and came up with two final possible architectures. We only wanted to know whether they are technically correct and whether we are heading in the wrong direction.
With Apache Spark Streaming:
With Apache Storm:
Both your architectural diagrams are OK. Keep in mind that Spark Streaming will not work in this scenario. ES-Hadoop provides you with easy-access APIs to get data out of, and put data into, Elastic. It also provides the methods to get the data into the Spark framework (RDDs), or DataFrames in the case of Spark SQL. Once the data is in the framework, all ML libraries can be applied to it for ML or analytics generation. Elastic is not capable of streaming data, so Spark Streaming in the strict sense is not possible. So in the diagram, the optional arrow to HDFS and then to Spark Streaming can be removed, and the arrow should just point to HDFS.

My concern, however, would be running MLlib algorithms on the data in real time and expecting real-time performance. A typical use case might be to generate the model offline and use the model in real time for analysis; a rough sketch of that pattern follows.
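A hedged sketch of that last pattern: pull documents from Elasticsearch into a DataFrame with es-hadoop (the Spark SQL path shown in the Elastic picture), then score them with a model that was trained offline. The index, field names, and model path are placeholders:

```python
# Sketch: read from Elasticsearch via es-hadoop, apply an offline-trained
# model. Index, fields, ES node, and model path below are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeansModel

spark = SparkSession.builder.appName("ESAnalytics").getOrCreate()

# es-hadoop exposes the index as a DataFrame.
df = (spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-node1:9200")
      .load("netlogs/raw"))

vec = VectorAssembler(inputCols=["bytes", "duration"],
                      outputCol="features").transform(df)

# The model was generated offline (e.g. a nightly batch job); at analysis
# time we only load and apply it, which keeps the near-real-time path cheap.
model = KMeansModel.load("hdfs:///models/netlog-kmeans")
model.transform(vec).select("bytes", "duration", "prediction").show()
```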