I am new to Apache Flume-ng. I want to send files from a client agent to a server agent, which will ultimately write the files to HDFS. I have seen http://cuddletech.com/blog/?p=795 . It is the best resource I have found so far, but it is done via scripts rather than via APIs. I want to do it via the Flume APIs. Please help me in this regard, and tell me the steps for how to start and organize the code.
I think you should maybe explain more about what you want to achieve.
The link you posted appears to be just fine for your needs. You need to start a Flume agent on your client to read the files and send them using the Avro sink. Then you need a Flume agent on your server which uses an Avro source to read the events and write them where you want.
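For reference, a minimal pair of configs for that setup might look something like this (the agent names, hosts, ports and paths are placeholders, and a memory channel is used just to keep the sketch short):

    # client agent: pick up files from a spooling directory, forward them over Avro
    client.sources = src1
    client.channels = ch1
    client.sinks = avroOut

    client.sources.src1.type = spooldir
    client.sources.src1.spoolDir = /var/log/outgoing
    client.sources.src1.channels = ch1

    client.channels.ch1.type = memory

    client.sinks.avroOut.type = avro
    client.sinks.avroOut.hostname = collector.example.com
    client.sinks.avroOut.port = 4141
    client.sinks.avroOut.channel = ch1

    # server agent: receive Avro events and write them to HDFS
    collector.sources = avroIn
    collector.channels = ch1
    collector.sinks = hdfsOut

    collector.sources.avroIn.type = avro
    collector.sources.avroIn.bind = 0.0.0.0
    collector.sources.avroIn.port = 4141
    collector.sources.avroIn.channels = ch1

    collector.channels.ch1.type = memory

    collector.sinks.hdfsOut.type = hdfs
    collector.sinks.hdfsOut.hdfs.path = hdfs://namenode:8020/flume/files
    collector.sinks.hdfsOut.hdfs.fileType = DataStream
    collector.sinks.hdfsOut.channel = ch1

You would then start each side with flume-ng agent --conf-file <file> --name client (or --name collector).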
If you want to send events directly from an application, then have a look at the embedded agent in Flume 1.4, the Flume appender in Log4j 2, or (worse) the Log4j appender in Flume.
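If you do go the embedded-agent route, the sketch below follows the pattern shown in the Flume Developer Guide; the property values, host and port are placeholders, and as far as I know the embedded agent only supports Avro sinks with memory or file channels:

    import java.nio.charset.Charset;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.agent.embedded.EmbeddedAgent;
    import org.apache.flume.event.EventBuilder;

    public class EmbeddedAgentExample {
        public static void main(String[] args) throws EventDeliveryException {
            // Memory channel plus one Avro sink pointing at the server-side agent
            Map<String, String> properties = new HashMap<String, String>();
            properties.put("channel.type", "memory");
            properties.put("channel.capacity", "200");
            properties.put("sinks", "sink1");
            properties.put("sink1.type", "avro");
            properties.put("sink1.hostname", "collector.example.com");
            properties.put("sink1.port", "4141");
            // sink processor for the embedded agent; failover works with a single sink
            properties.put("processor.type", "failover");

            EmbeddedAgent agent = new EmbeddedAgent("myagent");
            agent.configure(properties);
            agent.start();
            try {
                // One event per line (or chunk) of the file you want to ship
                Event event = EventBuilder.withBody("hello flume", Charset.forName("UTF-8"));
                agent.put(event);
            } finally {
                agent.stop();
            }
        }
    }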
Check this: http://flume.apache.org/FlumeDeveloperGuide.html
You can write a client to send events or use the embedded agent.
As for the code organization, it is up to you.
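If you do write your own client, the Flume client SDK's RpcClient is the usual starting point. A minimal sketch (the class name, hostname and port are placeholders for wherever the server-side agent's Avro source listens):

    import java.nio.charset.Charset;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeAvroClient {
        public static void main(String[] args) throws EventDeliveryException {
            // Connect to the Avro source of the server-side agent
            RpcClient client = RpcClientFactory.getDefaultInstance("collector.example.com", 4141);
            try {
                // One Flume event per line (or chunk) of the file you want to send
                Event event = EventBuilder.withBody("hello flume", Charset.forName("UTF-8"));
                client.append(event);
            } finally {
                client.close();
            }
        }
    }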
I need to get data from CSV files (daily extractions from different business databases) into HDFS, then move it to HBase, and finally load aggregations of this data into a datamart (SQL Server).
I would like to know the best way to automate this process (using Java or Hadoop tools).
I'd echo the comment above re. Kafka Connect, which is part of Apache Kafka. With it you just use configuration files to stream from your sources; you can then use KSQL to create derived/enriched/aggregated streams and stream these on to HDFS, Elasticsearch, HBase, JDBC, and so on.
There's a list of Kafka Connect connectors here.
This blog series walks through the basics:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-1/
https://www.confluent.io/blog/blogthe-simplest-useful-kafka-connect-data-pipeline-in-the-world-or-thereabouts-part-2/
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/
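To give a flavour: assuming the Confluent kafka-connect-hdfs connector is installed, a standalone HDFS sink is just a small properties file (the connector name, topic and URL here are placeholders):

    name=csv-to-hdfs
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics=csv_extracts
    hdfs.url=hdfs://namenode:8020
    flush.size=1000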
Little to no coding required? In no particular order:
Talend Open Studio
StreamSets Data Collector
Apache NiFi
Assuming you can set up a Kafka cluster, you can try Kafka Connect
If you want to program something, probably Spark; otherwise, pick your favorite language. Schedule the job via Oozie.
If you don't need the raw data in HDFS, you can load directly into HBase.
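If you do take the Spark option, a rough sketch for landing the daily CSV extracts on HDFS as Parquet (which an Oozie coordinator could then schedule) might look like this; the paths, app name and class name are placeholders:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class CsvToHdfs {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("csv-to-hdfs").getOrCreate();

            // Read the day's CSV drops; header + schema inference keep the sketch short
            Dataset<Row> df = spark.read()
                    .option("header", "true")
                    .option("inferSchema", "true")
                    .csv("hdfs://namenode:8020/landing/csv/");

            // Write as Parquet so the later HBase load and aggregations are cheaper
            df.write().mode(SaveMode.Append)
                    .parquet("hdfs://namenode:8020/warehouse/extracts/");

            spark.stop();
        }
    }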
I am trying to do a POC in Hadoop for log aggregation. We have multiple IIS servers hosting at least 100 sites. I want to stream the logs continuously to HDFS, parse the data, and store it in Hive for further analytics.
1) Is Apache Kafka the correct choice, or Apache Flume?
2) After streaming, is it better to use Apache Storm to ingest the data into Hive?
Please help with any suggestions, and also any information about this kind of problem statement.
Thanks
You can use either Kafka or Flume, and you can also combine both to get data into HDFS, but you need to write code for this. There are open-source data flow management tools available with which you don't need to write code, e.g. NiFi and StreamSets.
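As an example of combining them, a Flume agent can read from Kafka and write to HDFS with configuration alone; a rough sketch (Flume 1.7+ Kafka source property names; the broker, topic and paths are placeholders):

    a1.sources = kafkaSrc
    a1.channels = c1
    a1.sinks = hdfsSink

    a1.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
    a1.sources.kafkaSrc.kafka.bootstrap.servers = broker1:9092
    a1.sources.kafkaSrc.kafka.topics = iis-logs
    a1.sources.kafkaSrc.channels = c1

    a1.channels.c1.type = memory

    a1.sinks.hdfsSink.type = hdfs
    a1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/logs/iis
    a1.sinks.hdfsSink.hdfs.fileType = DataStream
    a1.sinks.hdfsSink.channel = c1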
You don't need to use any separate ingestion tool; you can directly use those data flow tools to put data into a Hive table. Once the table is created in Hive, you can do your analytics by running queries.
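For example, an external table over the HDFS directory your flow writes to could look like the sketch below; the columns, delimiter and location are hypothetical, so adjust them to your parsed IIS log format:

    CREATE EXTERNAL TABLE iis_logs (
      log_date    STRING,
      log_time    STRING,
      site_name   STRING,
      client_ip   STRING,
      uri_stem    STRING,
      status_code INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION 'hdfs://namenode:8020/data/iis_logs/';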
Let me know if you need anything else on this.
I'm downloading a source file and creating a stream to process it line by line, finally sinking it into HDFS.
For that purpose I'm using Spring Cloud Data Flow + Kafka.
Question: is there any way to know when the complete file has been sunk into HDFS to trigger an event?
is there any way to know when the complete file has been sunk into HDFS to trigger an event?
This type of use case typically falls under task/batch rather than a streaming pipeline. If you build a file-to-HDFS task (batch-job) application, you could then have a stream listening to the various task events in order to make further downstream decisions or do further data processing.
Please refer to the "Subscribing to Task/Batch Events" section of the reference guide for more details.
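If I recall the guide correctly, task launches publish to a task-events destination by default, so a listener stream in the Data Flow shell would be something along these lines (the stream name is arbitrary, and you would replace the log sink with whatever sink or processor should trigger your event):

    dataflow:>stream create task-event-listener --definition ":task-events > log" --deploy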
Different applications are writing their logs to different directory structures. I want to read those logs and put them into a sink (which can be Hadoop or a physical file).
How does Flume support multiple sources for a single agent? Is it possible to have multiple sources for a single agent?
Can anyone guide me on this?
Thanks and regards
Chhaya
Configure your Flume agent with multiple sources - one for each log directory. They should probably be of the spooling directory source type. Note that once the source picks up a file, that file must not change - you need to set things up to make sure of this.
Those sources can then go to a single channel which can have a single sink.
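A rough sketch of such an agent config (the agent, source and path names are placeholders):

    a1.sources = app1 app2
    a1.channels = c1
    a1.sinks = k1

    a1.sources.app1.type = spooldir
    a1.sources.app1.spoolDir = /var/log/app1
    a1.sources.app1.channels = c1

    a1.sources.app2.type = spooldir
    a1.sources.app2.spoolDir = /var/log/app2
    a1.sources.app2.channels = c1

    a1.channels.c1.type = file

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs
    a1.sinks.k1.hdfs.fileType = DataStream
    a1.sinks.k1.channel = c1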
Flume has several third-party plugins to support an AMQP source.
Why would we want to send messages to RabbitMQ or Qpid and then to Flume, rather than directly to Flume?
Am I missing something?
Also, in what cases should I use messaging queues like Qpid or RabbitMQ, and when something like Flume?
I read that Qpid and RabbitMQ guarantee ordered delivery, which is not important in my case.
Any other differences ?
Can we add channels and sinks dynamically to a running Flume agent? Adding a new channel to a source with a file roll sink does not require any code change, just a conf file change and a restart. Is there a way to do it dynamically, i.e. without restarting the Flume agent?
It depends on your use case, basically. As you have mentioned, ordered delivery is not important in your case, so Flume may well fit. Flume is actually faster because of this, and it has a cheaper fault-tolerance setup. Check this link for more details.
In addition, Flume fits well when dealing with a Hadoop environment (HDFS as a sink), as it actually evolved from that. For the same reason, you also see use cases where RabbitMQ (as a source) messages are pushed through Flume.