Why does Flume need to have an AMQP source? - hadoop

Flume has several third-party plugins that provide an AMQP source.
Why would we want to send messages to RabbitMQ or Qpid and then on to Flume, rather than directly to Flume?
Am I missing something?
Also, in which cases should I use message queues like Qpid or RabbitMQ, and when something like Flume?
I read that Qpid and RabbitMQ guarantee ordered delivery, which is not important in my case.
Any other differences?
Also, can we add channels and sinks dynamically to a running Flume agent? Adding a new channel with a file roll sink to a source requires no code change, just a configuration file change and a restart. Is there a way to do it dynamically, i.e. without restarting the Flume agent?

It basically depends on your use case. As you mentioned, ordered delivery is not important in your case, so Flume may well fit. Flume is actually faster because of this, and it has a cheaper fault-tolerance setup. Check this link for more details.
In addition, Flume fits well in a Hadoop environment (HDFS as a sink), since it actually evolved from there. For the same reason you also see use cases where RabbitMQ (as a source) messages are pushed through Flume.
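To make that last point concrete, a RabbitMQ-to-HDFS pipeline in Flume is just a configuration file. Below is a rough sketch; the AMQP source class and its properties are placeholders, since they depend on which third-party plugin you install:

agent1.sources = amqp1
agent1.channels = ch1
agent1.sinks = hdfs1

# Third-party AMQP/RabbitMQ source (class name is plugin-specific and shown here as a placeholder)
agent1.sources.amqp1.type = com.example.flume.source.rabbitmq.RabbitMQSource
agent1.sources.amqp1.hostname = rabbitmq.example.com
agent1.sources.amqp1.queue = logs
agent1.sources.amqp1.channels = ch1

agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Standard Flume HDFS sink
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.channel = ch1
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/flume/events/%Y/%m/%d
agent1.sinks.hdfs1.hdfs.useLocalTimeStamp = true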

Related

Kafka Connect separated logging

Currently we are using a couple of custom connector plugins for our Confluent Kafka Connect distributed worker cluster. One thing that has bothered me for a long time is that Kafka Connect writes all logs from all deployed connectors to one file/stream. This makes debugging an absolute nightmare. Is there a way to let Kafka Connect log each connector to a different file/stream?
Via connect-log4j.properties I am able to have a specific class log to a different file/stream, but this means that with every additional connector I have to adjust connect-log4j.properties.
Thanks
Kafka Connect does not currently support this. I agree that it is not ideal.
One option would be to split out your connectors and have a dedicated worker cluster for each, and thus separate log files.
Kafka Connect is part of Apache Kafka so you could raise a JIRA to discuss this further and maybe contribute it back via a PR?
Edit April 12, 2019: See https://cwiki.apache.org/confluence/display/KAFKA/KIP-449%3A+Add+connector+contexts+to+Connect+worker+logs
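For reference, the per-class routing mentioned in the question looks roughly like this in connect-log4j.properties (the connector package and file path are placeholders):

# Route one connector's classes to a dedicated file and stop them propagating to the root logger
log4j.logger.com.example.connect.MyConnector=INFO, myConnectorFile
log4j.additivity.com.example.connect.MyConnector=false

log4j.appender.myConnectorFile=org.apache.log4j.RollingFileAppender
log4j.appender.myConnectorFile.File=/var/log/kafka-connect/my-connector.log
log4j.appender.myConnectorFile.MaxFileSize=10MB
log4j.appender.myConnectorFile.MaxBackupIndex=5
log4j.appender.myConnectorFile.layout=org.apache.log4j.PatternLayout
log4j.appender.myConnectorFile.layout.ConversionPattern=[%d] %p %m (%c:%L)%n

As the question notes, a block like this has to be repeated for every connector, which is exactly the pain point KIP-449 addresses.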

Spark Architecture for processing small binary files saved in HDFS

I don't know how to build an architecture for the following use case:
I have a web application where users can upload files (PDF and PPTX) and directories to be processed. After the upload is complete, the web application puts these files and directories into HDFS, then sends a message on Kafka with the paths to these files.
A Spark application reads the messages from Kafka Streaming, collects them on the master (driver), and after that processes them. I collect the messages first because I need to move the code to the data, not move the data to where the message is received. I understood that Spark assigns a job to an executor that already has the file locally.
I have issues with Kafka because I was forced to collect the messages first for the above reason, and when I want to create a checkpoint the app crashes "because you are attempting to reference SparkContext from a broadcast variable", even though the code ran before I added checkpointing (I use the SparkContext there because I need to save data to Elasticsearch and PostgreSQL). I don't know how exactly I can do code upgrades under these conditions.
I read about the Hadoop small-files problem, and I understand what the problems are in this case. I read that HBase is a better solution for saving small files than just saving them in HDFS. Another aspect of the small-files problem is the big number of mappers and reducers created for the computation, but I don't understand whether that problem exists in Spark.
What is the best architecture for this use case?
How should I do job scheduling? Is Kafka good for that, or do I need to use another service like RabbitMQ or something else?
Is there some method to add jobs to a running Spark application through a REST API?
What is the best way to save the files? Is it better to use HBase because I have small files (<100 MB)? Or do I need to use SequenceFile? I think SequenceFile isn't for my use case because I need to reprocess some files randomly.
What do you think is the best architecture for this use case?
Thanks!
There is no single "best" way to build an architecture. You need to make decisions and stick to them. Make the architecture flexible and decoupled so that you can easily replace components if needed.
Consider the following stages/layers in your architecture:
Retrieval/Acquisition/Transport of source data (files)
Data processing/transformation
Data archival
As a retrieval component, I would use Flume. It is flexible and supports a lot of sources, channels (including Kafka) and sinks. In your case you can configure a source that monitors a directory and picks up newly received files.
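As a rough illustration (paths, hosts and the topic name are placeholders, and the Kafka channel properties assume a reasonably recent Flume release), a spooling-directory source feeding a Kafka channel and an HDFS sink could look like this:

ingest.sources = upload-dir
ingest.channels = kafka-ch
ingest.sinks = hdfs-out

# Watch a landing directory for newly uploaded files
ingest.sources.upload-dir.type = spooldir
ingest.sources.upload-dir.spoolDir = /data/incoming
ingest.sources.upload-dir.channels = kafka-ch

# Buffer events in Kafka so other consumers (e.g. Spark Streaming) can read them too
ingest.channels.kafka-ch.type = org.apache.flume.channel.kafka.KafkaChannel
ingest.channels.kafka-ch.kafka.bootstrap.servers = kafka1:9092
ingest.channels.kafka-ch.kafka.topic = uploaded-files

ingest.sinks.hdfs-out.type = hdfs
ingest.sinks.hdfs-out.channel = kafka-ch
ingest.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/staging/%Y/%m/%d
ingest.sinks.hdfs-out.hdfs.useLocalTimeStamp = true

Keep in mind that the spooling-directory source is line-oriented by default, so binary files such as PDFs would need a BLOB-style deserializer rather than the default one.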
For data processing/transformation, it depends on what task you are solving. You have probably decided on Spark Streaming. Spark Streaming can be integrated with a Flume sink (http://spark.apache.org/docs/latest/streaming-flume-integration.html). There are other options available, e.g. Apache Storm; Flume combines very well with Storm. Some transformations can also be applied in Flume itself.
For data archival, do not store/archive the files directly in Hadoop unless they are bigger than a few hundred megabytes. One solution would be to put them in HBase.
Make your architecture more flexible. I would place processed files in a temporary HDFS location and have some job regularly archive them into a zip, HBase, a Hadoop Archive (there is such an animal) or any other solution.
Consider using Apache NiFi (aka HDF, Hortonworks Data Flow). It uses queues internally and provides a lot of processors. It can make your life easier and get the workflow developed in minutes. Give it a try. There is a nice Hortonworks tutorial which, combined with the HDP Sandbox running in a virtual machine/Docker, can bring you up to speed in a very short time (1-2 hours?).

Does Apache Kafka Store the messages internally in HDFS or Some other File system

We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka is the same.
Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive uses internally, i.e. a single folder per table?
Kafka stores data in local files (i.e., the local file system of each running broker). For those files, Kafka uses its own storage format, which is based on a partitioned append-only log abstraction.
The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks but also to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage
Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated, and if you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
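If you just want to inspect topics, partitions and their leaders programmatically, a rough sketch with the Kafka AdminClient (bootstrap server and topic name are placeholders) could look like this:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class TopicMetadataExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Describe the topic and print leader/replica information per partition
            Map<String, TopicDescription> topics =
                    admin.describeTopics(Collections.singletonList("my-topic")).all().get();
            for (TopicDescription desc : topics.values()) {
                for (TopicPartitionInfo p : desc.partitions()) {
                    System.out.printf("topic=%s partition=%d leader=%s replicas=%s%n",
                            desc.name(), p.partition(), p.leader(), p.replicas());
                }
            }
        }
    }
}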
For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index
I think you can, but you have to do that manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the following:
Assuming all your servers are running (check the Confluent website)
Create your connector:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics='your topic'
hdfs.url=hdfs://localhost:9000
flush.size=3
Note: this approach assumes that you are using their platform (Confluent Platform), which I haven't used.
Fire up the kafka-hdfs streamer.
Also you might find more useful details in this Stack Overflow discussion.
This happens to most beginners. Let's first understand that a component you see in big data processing may not be related to Hadoop at all.
YARN, MapReduce and HDFS are the three main core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase etc. work on top of Hadoop.
Frameworks like Kafka or Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode and HDFS for storage.
In the same way Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system.
log.dirs=/tmp/kafka-logs
You can check this at $KAFKA_HOME/config/server.properties
Hope this helps.

How to decide the flume topology approach?

I am setting up Flume but am not very sure what topology to go with for our use case.
We basically have two web servers which can generate logs at the rate of 2000 entries per second. Each entry is around 137 bytes in size.
Currently we use rsyslog (writing to a TCP port), to which a PHP script writes these logs. And we run a local Flume agent on each web server; these local agents listen on a TCP port and put the data directly into HDFS.
So localhost:tcpport is the Flume source and HDFS is the Flume sink.
I am not sure about the above approach and am confused between three approaches:
Approach 1: Web server, rsyslog and a Flume agent on each machine, and a Flume collector running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 2: Web server and rsyslog on the same machine, and a Flume collector (listening on a remote port for events written by rsyslog on the web server) running on the NameNode of the Hadoop cluster to collect the data and dump it into HDFS.
Approach 3: Web server, rsyslog and a Flume agent on the same machine, with all agents writing directly to HDFS.
Also, we are using Hive, so we write directly into partitioned directories. We therefore want an approach that allows us to write into hourly partitions.
Basically I just want to know whether people have used Flume for similar purposes, whether it is the right and reliable tool, and whether my approach seems sensible.
I hope that's not too vague. Any help would be appreciated.
The typical suggestion for your problem would be a fan-in or converging-flow agent deployment model (google "flume fan in" for more details). In this model, you would ideally have an agent on each web server. Each of those agents forwards its events to a few aggregator or collector agents. The aggregator agents then forward the events to a final destination agent that writes to HDFS.
This tiered architecture allows you to simplify scaling, failover etc.
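As a rough sketch of that fan-in model (hostnames, ports and paths are placeholders), each web server would run a tier-1 agent along these lines:

# Tier 1: runs on each web server
web.sources = syslog-in
web.channels = file-ch
web.sinks = to-collector

web.sources.syslog-in.type = syslogtcp
web.sources.syslog-in.host = localhost
web.sources.syslog-in.port = 5140
web.sources.syslog-in.channels = file-ch

web.channels.file-ch.type = file

web.sinks.to-collector.type = avro
web.sinks.to-collector.channel = file-ch
web.sinks.to-collector.hostname = collector1.example.com
web.sinks.to-collector.port = 4545

and a collector agent near the Hadoop cluster would terminate the flow in HDFS, using time escapes in the path for the hourly Hive partitions you mentioned:

# Tier 2: collector/aggregator agent
coll.sources = avro-in
coll.channels = file-ch
coll.sinks = hdfs-out

coll.sources.avro-in.type = avro
coll.sources.avro-in.bind = 0.0.0.0
coll.sources.avro-in.port = 4545
coll.sources.avro-in.channels = file-ch

coll.channels.file-ch.type = file

coll.sinks.hdfs-out.type = hdfs
coll.sinks.hdfs-out.channel = file-ch
# Hourly Hive-style partitions
coll.sinks.hdfs-out.hdfs.path = hdfs://namenode:8020/logs/dt=%Y-%m-%d/hr=%H
coll.sinks.hdfs-out.hdfs.useLocalTimeStamp = true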

Flume: Send files to HDFS via APIs

I am new to Apache Flume NG. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795. This is the best I have found so far, but it works via a script, not via APIs. I want to do it via the Flume APIs. Please help me in this regard, and tell me how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to be just fine for your needs. You need to start a Flume agent on your client to read the files and send them using the Avro sink. Then you need a Flume agent on your server which uses an Avro source to read the events and write them where you want.
If you want to send events directly from an application then have a look at the embedded agent in Flume 1.4 or the Flume appender in log4j2 or (worse) the log4j appender in Flume.
Check the Flume Developer Guide: http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client to send events, or use the embedded agent.
As for the code organization, it is up to you.
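If you go the client route, a minimal sketch using Flume's RPC client API could look like this (the host and port are placeholders for wherever your server agent's Avro source listens):

import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeRpcClientExample {
    public static void main(String[] args) {
        // Assumes a Flume agent with an Avro source listening on this host/port
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-server.example.com", 41414);
        try {
            Event event = EventBuilder.withBody("hello from the client", StandardCharsets.UTF_8);
            client.append(event); // throws EventDeliveryException if delivery fails
        } catch (EventDeliveryException e) {
            // A real client would retry or rebuild the client here
            e.printStackTrace();
        } finally {
            client.close();
        }
    }
}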
