Where to run the flume agent that writes to HDFS?

Where to run the flume agent that writes to HDFS? - hadoop

I have 25-20 agents sending the data to couple of collector agents and and these collector agents then have to write it to the HDFS.
Where to run these collector agents? On the Data node of the Hadoop cluster or outside of the cluster? What are the pros/cons of each and how are people currently running them?

tier 2 flume agents use hdfsSink write directly to HDFS. what's more , Tier1 can use failover sinkgroup. In case of one of tier 2 flume agent is down.

I assume your using something like Flume. If that's the case, the Flume agent (at least the first tier) runs where ever the data is being sourced from. IE: Web Server for Web Logs..
Flume does support other protocols, like JMS, so the location will vary in those scenario's.
For production clusters, you don't want to run "agents" like flume on Datanodes. Best to level the resources of that hardware for the cluster.
If you have a lot of agents, you'll want to use a tiered architecture to consolidate and funnel the numerous sources into a smaller set of agents that will write to HDFS. This helps control visibility and exposure of the cluster to external servers.

Related

Ingest log files from edge nodes to Hadoop

I am looking for a way to stream entire log files from edge nodes to Hadoop. To sum up the use case:
We have applications that produce log files ranging from a few MB to hundreds of MB per file.
We do not want to stream all the log events as they occur.
Pushing the log files in their entirety after they have written completely is what we are looking for (written completely = got moved into another folder for example... this is not a problem for us).
This should be handled by some kind of lightweight agents on the edge nodes to the HDFS directly or - if necessary - an intermediate "sink" that will push the data to HDFS afterwards.
Centralized Pipeline Management (= configuring all edge nodes in a centralized manner) would be great
I came up with the following evaluation:
Elastic's Logstash and FileBeats
Centralized pipeline management for edge nodes is available, e.g. one centralized configuration for all edge nodes (requires a license)
Configuration is easy, WebHDFS output sink exists for Logstash (using FileBeats would require an intermediate solution with FileBeats + Logstash that outputs to WebHDFS)
Both tools are proven to be stable in production-level environments
Both tools are made for tailing logs and streaming these single events as they occur rather than ingesting a complete file
Apache NiFi w/ MiNiFi
The use case of collecting logs and sending the entire file to another location with a broad number of edge nodes that all run the same "jobs" looks predestined for NiFi and MiNiFi
MiNiFi running on the edge node is lightweight (Logstash on the other hand is not so lightweight)
Logs can be streamed from MiNiFi agents to a NiFi cluster and then ingested into HDFS
Centralized pipeline management within the NiFi UI
writing to a HDFS sink is available out-of-the-box
Community looks active, development is lead by Hortonworks (?)
We have made good experiences with NiFi in the past
Apache Flume
writing to a HDFS sink is available out-of-the-box
Looks like Flume is more of a event-based solution rather than a solution for streaming entire log files
No centralized pipeline management?
Apache Gobblin
writing to a HDFS sink is available out-of-the-box
No centralized pipeline management?
No lightweight edge node "agents"?
Fluentd
Maybe another tool to look at? Looking for your comments on this one...
I'd love to get some comments about which of the options to choose. The NiFi/MiNiFi option looks the most promising to me - and is free to use as well.
Have I forgotten any broadly used tool that is able to solve this use case?

I experience similar pain when choosing open source big data solutions, simply that there are so many paths to Rome. Though "asking for technology recommendations is off topic for Stackoverflow", I still want to share my opinions.
I assume you already have a hadoop cluster to land the log files. If you are using an enterprise ready distribution e.g. HDP distribution, stay with their selection of data ingestion solution. This approach always save you lots of efforts in installation, setup centrol managment and monitoring, implement security and system integration when there is a new release.
You didn't mention how you would like to use the log files once they lands in HDFS. I assume you just want to make an exact copy, i.e. data cleansing or data trasformation to a normalized format is NOT required in data ingestion. Now I wonder why you didn't mention the simplest approach, use a scheduled hdfs commands to put log files into hdfs from edge node?
Now I can share one production setup I was involved. In this production setup, log files are pushed to or pulled by a commercial mediation system that makes data cleansing, normalization, enrich etc. Data volume is above 100 billion log records every day. There is an 6 edge nodes setup behind a load balancer. Logs are firstly land on one of the edge nodes, then hdfs command put to HDFS. Flume was used initially but replaced by this approach due to performance issue.(it can very likely be that engineer was lack of experience in optimizing Flume). Worth to mention though, the mediation system has a managment UI for scheduling ingestion script. In your case, I would start with cron job for PoC then use e.g. Airflow.
Hope it helps! And would be glad to know your final choice and your implementation.

Flink Cluster Performance is much worse than standalone

I use flink to process HDFS files or local files.
When I use standalone setup, the server can process the data at 500k/s.
But when I use cluster，the server can only process the data at 100k/s.
It is so weird，I can not figure out what is going on.
I found that when I use cluster（2 servers), there is always one server which has low speeds to read/write data. The flink cluster is based on hadoop.
Can anyone help me？

What is the best practice for nifi production deployment

I have a three node nifi cluster. We just installed nifi packages on linux machines and cluster with separate zookeeper cluster. I am planning to monitor nifi performance via nagios but we saw hortonworks ambari provides fetures for management and monitoring also.
What is the best practice for nifi deployment on prod
how should we scale up
how can we monitor nifi
Should we monitor queue/process performance
Should use something like ambari
regards..
Edit-1:
#James actually I am collecting user event logs from several sources within company.
All events are first written to Kafka. Nifi consumes kafka, does simple transformations like getting a field from payload to attribute.
After transformations data is written to both elasticsearch and hdfs. Before writing to hdfs we are merging flowfiles so writing to hdfs is in batches.
I have around 50k/s event.

Running spark cluster on standalone mode vs Yarn/Mesos

Currently I am running my spark cluster as standalone mode. I am reading data from flat files or Cassandra(depending upon the job) and writing back the processed data to the Cassandra itself.
I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?

Spark standalone cluster manager can also give you cluster mode capabilities.
Spark standalone cluster will provide almost all the same features as the other cluster managers if you are only running Spark.
When you submit your application in cluster mode all you job related files would be copied on to one of the machines on the cluster which would then submit the job on your behalf, if you submit the application in client mode the machine from which the job is being submitted would be taking care of driver related activities. This means that the machine from which the job has been submitted cannot go offline, whereas in cluster mode the machine from which the job has been submitted can go offline.
Having a Cassandra cluster would also not change any of these behaviors except it can save you network traffic if you can get the nearest contact point for the spark executor(Just like Data locality).
The failed stages gets rescheduled if you use either of the cluster managers.

I was wondering if I switch to Hadoop and start using a Resource manager like YARN or mesos, does it give me an additional performance advantage like execution time and better resource management?
In Standalone cluster model, each application uses all the available nodes in the cluster.
From spark-standalone documentation page:
The standalone cluster mode currently only supports a simple FIFO scheduler across applications. However, to allow multiple concurrent users, you can control the maximum number of resources each application will use. By default, it will acquire all cores in the cluster, which only makes sense if you just run one application at a time.
In other cases (when you are running multiple applications in the cluster) , you can prefer YARN.
Currently sometime when I am processing huge chunk of data during shuffling with a possibility of stage failure. If I migrate to a YARN, can Resource manager address this issue?
Not sure since your application logic is not known. But you can give a try with YARN.
Have a look at related SE question for benefits of YARN over Standalone and Mesos:
Which cluster type should I choose for Spark?

How to decide the flume topology approach?

I am setting up flume but very not sure of what topology to go ahead with for our use case.
We basically have two web servers which can generate logs at the speed of 2000 entries per second. Each entry of size around 137Bytes.
Currently we have used rsyslog( writing to a tcp port) to which a php script writes these logs to. And we are running a local flume agent on each webserver , these local agents listen to a tcp port and put data directly in hdfs.
So localhost:tcpport is the "flume source " and "hdfs" is the flume sink.
I am not sure about the above approach and am confused between three approaches:
Approach 1: Web Server, RSyslog & Flume Agent on each machine and a Flume collector running on the Namenode in hadoop cluster, to collect the data and dump into hdfs.
Approach 2: Web Server, RSyslog on same machine and a Flume collector (listening on a remote port for events written by rsyslog on web server)running on the Namenode in hadoop cluster, to collect the data and dump into hdfs.
Approach 3: Web Server, RSyslog & Flume Agent on same machine. And all agents writing directly to the hdfs.
Also, we are using hive, so we are writing directly into partitioned directories. So we want to think of an approach that allows us to write on Hourly partitions.
Basically I just want to know If people have used flume for similar purposes and if it is the right and reliable tool and if my approach seems sensible.
I hope that's not too vague. Any help would be appreciated.

The typical suggestion for your problem would be to have a fan-in or converging-flow agent deployment model. (Google for "flume fan in" for more details). In this model, you would ideally have an agent on each webserver. Each of those agents forward the events to few aggregator or collector agents. The aggregator agents then forward the events to a final destination agent that writes to HDFS.
This tiered architecture allows you to simplify scaling, failover etc.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio