How to decide the Flume topology approach?

I am setting up Flume but am not sure what topology to go with for our use case.
We basically have two web servers which can generate logs at a rate of 2000 entries per second, each entry around 137 bytes in size.
Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs, and we run a local Flume agent on each web server; these local agents listen on a TCP port and put the data directly into HDFS.
So localhost:tcpport is the Flume source and HDFS is the Flume sink.
I am not sure about the above approach and am torn between three alternatives:
Approach 1: Web server, rsyslog, and a Flume agent on each machine, plus a Flume collector running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 2: Web server and rsyslog on the same machine, with a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 3: Web server, rsyslog, and a Flume agent on the same machine, with all agents writing directly to HDFS.
Also, we are using Hive and writing directly into partitioned directories, so we want an approach that lets us write into hourly partitions.
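For the hourly partitions, this is the kind of per-server agent config I have in mind; a rough sketch only, where the host names, ports, and paths are illustrative rather than our actual setup:
# Local agent: rsyslog TCP source -> memory channel -> HDFS sink with hourly Hive-style partitions
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = syslogtcp
a1.sources.r1.host = localhost
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/warehouse/weblogs/dt=%Y-%m-%d/hr=%H
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300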
Basically I just want to know whether people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.
I hope that's not too vague. Any help would be appreciated.

The typical suggestion for your problem would be a fan-in or converging-flow agent deployment model (Google "flume fan in" for more details). In this model, you would ideally have an agent on each web server. Each of those agents forwards its events to a few aggregator or collector agents, and the aggregator agents then forward the events to a final destination agent that writes to HDFS.
This tiered architecture simplifies scaling, failover, and so on.
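As a rough sketch of what that fan-in looks like in configuration terms (agent names, hosts, and ports here are illustrative, not a definitive config), each web-server agent ships events over Avro to a collector, and the collector writes to HDFS:
# Tier 1, on each web server: local syslog source -> Avro sink pointing at a collector
web.sources = syslog
web.channels = mem
web.sinks = to-collector
web.sources.syslog.type = syslogtcp
web.sources.syslog.port = 5140
web.sources.syslog.channels = mem
web.channels.mem.type = memory
web.sinks.to-collector.type = avro
web.sinks.to-collector.hostname = collector.example.com
web.sinks.to-collector.port = 4545
web.sinks.to-collector.channel = mem
# Tier 2, on the collector: Avro source -> file channel -> HDFS sink
coll.sources = avro-in
coll.channels = file-ch
coll.sinks = hdfs-out
coll.sources.avro-in.type = avro
coll.sources.avro-in.bind = 0.0.0.0
coll.sources.avro-in.port = 4545
coll.sources.avro-in.channels = file-ch
coll.channels.file-ch.type = file
coll.sinks.hdfs-out.type = hdfs
coll.sinks.hdfs-out.channel = file-ch
coll.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/%Y-%m-%d/%H
coll.sinks.hdfs-out.hdfs.useLocalTimeStamp = true
Adding more collectors later is then mostly a matter of pointing subsets of the web-server agents (or a sink group) at them.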

Related

Can StreamSets be used to fetch data onto a local system?

Our team is exploring options for fetching data from HDFS to a local system. StreamSets was suggested to us, but no one on the team knows it. Could anyone help me understand whether it fits our requirement, which is to fetch data from HDFS onto our local system?
Just an additional question.
I have set up StreamSets locally, for example on a local IP like xxx.xx.x.xx:18630, and it works fine on that machine. But when I try to access this URL from another machine on the network, it doesn't work, while my other applications such as Shiny Server work fine with the same mechanism.
Yes - you can read data from HDFS to a local filesystem using StreamSets Data Collector's Hadoop FS Standalone origin. As cricket_007 mentions in his answer, though, you should carefully consider if this is what you really want to do, as a single Hadoop file can easily be larger than your local disk!
Answering your second question: Data Collector listens on all addresses by default. There is an http.bindHost setting in the sdc.properties config file that you can use to restrict the addresses that Data Collector listens on, but it is commented out by default.
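For example, to restrict it to a single interface you would uncomment and set that property in sdc.properties, something along these lines (the address is illustrative):
# sdc.properties
http.bindHost=192.168.1.10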
You can use netstat to check - this is what I see on my Mac, with Data Collector listening on all addresses:
$ netstat -ant | grep 18630
tcp46 0 0 *.18630 *.* LISTEN
That wildcard, * in front of the 18630 in the output means that Data Collector will accept connections on any address.
If you are running Data Collector directly on your machine, then the most likely problem is a firewall setting. If you are running Data Collector in a VM or on Docker, you will need to look at your VM/Docker network config.
I believe that by default StreamSets only exposes its services on localhost. You'll need to go through the config files to find where you can set it to listen on external addresses.
If you are using the CDH Quickstart VM, you'll need to externally forward that port.
Anyway, StreamSets is really designed to run as a cluster, on dedicated servers, for optimal performance. Its production deployments are comparable to Apache NiFi as offered in Hortonworks HDF.
So no, it wouldn't make sense to use the local FS destinations for anything other than testing/evaluation purposes.
If you want HDFS exposed as a local device, look into installing an NFS Gateway. Or you can probably use StreamSets to write to FTP/NFS.
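If you do go the NFS Gateway route, mounting HDFS looks roughly like this once the gateway service is running (host and mount point are illustrative, and the exact options depend on your setup):
sudo mkdir -p /hdfs_mount
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync namenode.example.com:/ /hdfs_mount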
It's not clear what data you're trying to get, but many BI tools can perform CSV exports, or Hue can be used to download files from HDFS. At the very least, hdfs dfs -getmerge is a minimalist way to get data from HDFS to local disk. However, Hadoop typically stores many TB of data in the ideal case; if you're working with anything smaller, dumping those results into a database is usually a better option than moving flat files around.
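For example, pulling an entire HDFS directory down into a single local file looks like this (paths are illustrative):
# Concatenate everything under an HDFS directory into one local file
hdfs dfs -getmerge /user/hive/warehouse/mytable /tmp/mytable_dump.txt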

Flink Cluster Performance is much worse than standalone

I use Flink to process HDFS files or local files.
When I use the standalone setup, the server can process the data at 500k/s.
But when I use the cluster, the server can only process the data at 100k/s.
It is so weird; I cannot figure out what is going on.
I found that when I use the cluster (2 servers), there is always one server with low read/write speeds. The Flink cluster is based on Hadoop.
Can anyone help me?

What is the best practice for NiFi production deployment?

I have a three-node NiFi cluster. We just installed the NiFi packages on Linux machines and clustered them with a separate ZooKeeper cluster. I am planning to monitor NiFi performance via Nagios, but we saw that Hortonworks Ambari also provides features for management and monitoring.
What is the best practice for NiFi deployment in production?
How should we scale up?
How can we monitor NiFi?
Should we monitor queue/process performance?
Should we use something like Ambari?
regards..
Edit-1:
@James: actually, I am collecting user event logs from several sources within the company.
All events are first written to Kafka. NiFi consumes from Kafka and does simple transformations, like promoting a field from the payload to an attribute.
After the transformations, the data is written to both Elasticsearch and HDFS. Before writing to HDFS we merge flowfiles, so writes to HDFS happen in batches.
I have around 50k events/s.

Where to run the flume agent that writes to HDFS?

I have 20-25 agents sending data to a couple of collector agents, and these collector agents then have to write it to HDFS.
Where should these collector agents run? On the data nodes of the Hadoop cluster, or outside the cluster? What are the pros and cons of each, and how are people currently running them?
Tier 2 Flume agents use the HDFS sink to write directly to HDFS. What's more, tier 1 can use a failover sink group, in case one of the tier 2 Flume agents goes down.
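A rough sketch of such a tier 1 failover sink group (agent names, hosts, and ports are illustrative):
# Two Avro sinks to two collectors, grouped so the second takes over if the first fails
agent1.sinks = c1 c2
agent1.sinkgroups = sg1
agent1.sinkgroups.sg1.sinks = c1 c2
agent1.sinkgroups.sg1.processor.type = failover
agent1.sinkgroups.sg1.processor.priority.c1 = 10
agent1.sinkgroups.sg1.processor.priority.c2 = 5
agent1.sinkgroups.sg1.processor.maxpenalty = 10000
agent1.sinks.c1.type = avro
agent1.sinks.c1.hostname = collector1.example.com
agent1.sinks.c1.port = 4545
agent1.sinks.c1.channel = ch1
agent1.sinks.c2.type = avro
agent1.sinks.c2.hostname = collector2.example.com
agent1.sinks.c2.port = 4545
agent1.sinks.c2.channel = ch1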
I assume you're using something like Flume. If that's the case, the Flume agent (at least the first tier) runs wherever the data is being sourced from, e.g. the web server for web logs.
Flume does support other protocols, like JMS, so the location will vary in those scenarios.
For production clusters, you don't want to run "agents" like Flume on DataNodes. It's best to leave the resources of that hardware to the cluster.
If you have a lot of agents, you'll want to use a tiered architecture to consolidate and funnel the numerous sources into a smaller set of agents that will write to HDFS. This helps control visibility and exposure of the cluster to external servers.

hadoop logging facility?

If I am to use ZooKeeper as a work queue and connect individual consumers/workers to it, what would you recommend as a good distributed setup for logging these workers' activities?
Assume the following:
1) At any time we could be down to a single computer housing the Hadoop cluster. The system will autoscale up and down as needed, but it has a lot of downtime where only a single computer is needed.
2) I just need the ability to access all of the workers' logs without accessing the individual machine each worker runs on. Bear in mind that by the time I get to read one of these logs, that machine might very well be terminated and long gone.
3) We'll need easy access to the logs, i.e. being able to cat/grep and tail them, or alternatively query them in a more SQL-ish manner. We'll need the real-time ability to both query and monitor output over short periods (e.g. tail -f /var/log/mylog.1).
I appreciate your expert ideas here!
Thanks.
Have you looked at using Flume, Chukwa, or Scribe? Ensure that your Flume (etc.) process has access to the log files that you are trying to aggregate onto a centralized server.
Flume reference: http://archive.cloudera.com/cdh/3/flume/Cookbook/
Chukwa: http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html
Scribe: https://github.com/facebook/scribe/wiki/_pages
hope it helps.
The Fluentd log collector just released its WebHDFS plugin, which lets users stream data into HDFS instantly. It's really easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
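The plugin configuration is a short match block along these lines (host, port, and path are illustrative; check the article for the exact options):
<match access.**>
  type webhdfs
  host namenode.example.com
  port 50070
  path /log/%Y%m%d_%H/access.log.${hostname}
  flush_interval 10s
</match>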
Of course, you can import data directly from your applications. Here's a Java example that posts logs to Fluentd. Fluentd's Java library is clever enough to buffer locally when the Fluentd daemon is down, which lessens the possibility of data loss.
Fluentd: Data Import from Java Applications
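A minimal sketch of that kind of client code with the fluent-logger-java library (the tag, host, port, and field names below are illustrative):
import java.util.HashMap;
import java.util.Map;
import org.fluentd.logger.FluentLogger;
public class FluentdExample {
    // Tag prefix "app", plus the Fluentd host/port (24224 is the default forward port)
    private static final FluentLogger LOG =
        FluentLogger.getLogger("app", "localhost", 24224);
    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("user", "demo");
        event.put("action", "login");
        // Emitted under the tag "app.access"; buffered locally if the daemon is down
        LOG.log("access", event);
    }
}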
A high-availability configuration is also available, which basically lets you build a centralized log aggregation system.
Fluentd: High Availability Configuration
