Apache Flume rolling over HDFS files on an hourly basis - hadoop

I'm new to Flume and I was exploring options to roll over my HDFS files on an hourly basis using Flume.
In my project, Apache Flume reads messages from RabbitMQ and writes them to HDFS.
hdfs.rollInterval - This closes the file based on the time interval measured from when it was opened.
A new file is created only when Flume reads a message after the previous file was closed, so this option does not solve our problem.
hdfs.path = /%y/%m/%d/%H - This option works fine and creates a folder on an hourly basis. But the problem is that a new folder is created only when a new message arrives.
For example: messages keep coming until 11:59, so the file stays open. Then no messages arrive until 12:30, and the file is still open. When a new message comes in after 12:30, the hdfs.path configuration causes the previous file to be closed and a new file to be created in a new folder.
The previous file cannot be used for computation until it is closed.
We need a way to close the open files exactly on the hour. I'm wondering if there are any options in Flume for doing that.

hdfs.rollInterval is described as
Number of seconds to wait before rolling current file
So this setting should cause each file to stay open for an hour at a time before rolling
hdfs.rollInterval = 3600
And I would additionally ignore file size and event count, so add these as well
hdfs.rollSize = 0
hdfs.rollCount = 0

hdfs.idleTimeout is described as
Timeout after which inactive files get closed (0 = disable automatic closing of idle files)
For example, you can set this property to 180. A file that stays open but receives no events for 180 seconds will then be closed automatically, which covers the case where messages stop arriving just before the hour.
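Putting those settings together, a minimal sketch of the sink section might look like this (the agent and sink names are placeholders, not taken from the question):
# Roll purely on time: start a new file every hour, never on size or event count
agent.sinks.hdfs_sink.hdfs.rollInterval = 3600
agent.sinks.hdfs_sink.hdfs.rollSize = 0
agent.sinks.hdfs_sink.hdfs.rollCount = 0
# Close a file that has received no events for 3 minutes
agent.sinks.hdfs_sink.hdfs.idleTimeout = 180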

Related

TailFile Processor - Apache NiFi

I'm using the TailFile processor to fetch logs from a cluster (3 nodes), scheduled to run every minute. The log file name changes every hour.
I was confused about which Tailing mode I should use. If I use Single File, it does not fetch the new file generated after 1 hour. If I use Multiple Files, it fetches the file only after the 3rd minute following the file name change, which increases the size of the file. What should the rolling filename be for my file, and which mode should I use?
Could you please let me know. Thank you.
My filename:
retrieve-11.log (generated at 11:00) - this is removed, but Single File mode still checks for this file
after 1 hour: retrieve-12.log (generated at 12:00)
My Processor Configuration:
Tailing mode: Multiple Files
File(s) to Tail: retrieve-${now():format("HH")}.log
Rolling Filename Pattern: ${filename}.*.log
Base Directory: /ext/logs
Initial Start Position: Beginning of File
State Location: Local
Recursive lookup: false
Lookup Frequency: 10 minutes
Maximum age: 24 hours
Sounds like you aren't really doing normal log file rolling. That would be, for example, where you write to logfile.log and then, after 1 day, move logfile.log to logfile.log.1 and write new logs to a new, empty logfile.log.
Instead, it sounds like you are just writing logs to a different file based on the hour. I assume this means you overwrite each file every 24h?
So something like this might work?
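A minimal sketch of that idea, assuming the retrieve-*.log naming from the question (the wildcard pattern is an assumption; see also the EDIT below):
Tailing mode: Multiple Files
Base Directory: /ext/logs
File(s) to Tail: retrieve-*.log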
EDIT:
So given that you are doing the following:
At 10:00, `retrieve-10.log` is created. Logs are written here.
At 11:00, `retrieve-11.log` is created. Logs are now written here.
At 11:10, `retrieve-10.log` is moved.
TailFile is only run every 10 minutes.
Then targeting a file based on the hour won't work. At 10:00, your TailFile only reads retrieve-10.log. At 11:00, your TailFile only reads retrieve-11.log. So in the worst case, you miss 10 minutes of logs between 10:50 and 11:00.
Given that another process is cleaning up the old files, there isn't going to be a backlog of old files to worry about. So it sounds like there's no need to set the hour specifically.
tailing mode: multiple files
files to tail: /path/retrieve-*.log
With this, at 10:00, tailFile tails retrieve-9.log and retrieve-10.log. At 10:10, retrieve-9.log is removed and it tails retrieve-10.log. At 11:00 it tails retrieve-10.log and retrieve-11.log. At 11:10, retrieve-10.log is removed and it tails retrieve-11.log. Etc.

How to write data in real time to HDFS using Flume?

I am using Flume to store sensor data in HDFS. The data is received through MQTT, and the subscriber posts it in JSON format to the Flume HTTP listener. It is currently working fine, but the problem is that Flume does not write to the HDFS file until I stop it (or the size of the file reaches 128 MB). I am using Hive to apply a schema on read. Unfortunately, the resulting Hive table contains only 1 entry. This is expected, because Flume did not write the newly arriving data to the file loaded by Hive.
Is there any way to force Flume to write newly arriving data to HDFS in near real time, so that I don't need to restart it or resort to small files?
Here is my Flume configuration:
# Name the components on this agent
emsFlumeAgent.sources = http_emsFlumeAgent
emsFlumeAgent.sinks = hdfs_sink
emsFlumeAgent.channels = channel_hdfs
# Describe/configure the source
emsFlumeAgent.sources.http_emsFlumeAgent.type = http
emsFlumeAgent.sources.http_emsFlumeAgent.bind = localhost
emsFlumeAgent.sources.http_emsFlumeAgent.port = 41414
# Describe the sink
emsFlumeAgent.sinks.hdfs_sink.type = hdfs
emsFlumeAgent.sinks.hdfs_sink.hdfs.path = hdfs://localhost:9000/EMS/%{sensor}
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollInterval = 0
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollSize = 134217728
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollCount=0
#emsFlumeAgent.sinks.hdfs_sink.hdfs.idleTimeout=20
# Use a channel which buffers events in memory
emsFlumeAgent.channels.channel_hdfs.type = memory
emsFlumeAgent.channels.channel_hdfs.capacity = 10000
emsFlumeAgent.channels.channel_hdfs.transactionCapacity = 100
# Bind the source and sinks to the channel
emsFlumeAgent.sources.http_emsFlumeAgent.channels = channel_hdfs
emsFlumeAgent.sinks.hdfs_sink.channel = channel_hdfs
I think the tricky bit here is that you would like to write data to HDFS in near real time but don't want small files either (for obvious reasons), and this can be a difficult thing to achieve.
You'll need to find an optimal balance between the following two parameters:
hdfs.rollSize (Default = 1024) - File size to trigger roll, in bytes (0: never roll based on file size)
and
hdfs.batchSize (Default = 100) - Number of events written to file before it is flushed to HDFS
If your data is not likely to reach 128 MB in the preferred time duration, then you may need to reduce the rollSize, but only to an extent that you don't run into the small-files problem.
Since you have not set any batch size in your HDFS sink, you should see the results of an HDFS flush after every 100 records; but once the size of the flushed records jointly reaches 128 MB, the contents would be rolled up into a 128 MB file. Is this also not happening? Could you please confirm?
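As a rough sketch of that tuning (the numbers below are illustrative assumptions, not values from the question):
# Roll at roughly 10 MB instead of the 128 MB currently configured
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollSize = 10485760
# Flush to HDFS after every 100 events (this is also the default)
emsFlumeAgent.sinks.hdfs_sink.hdfs.batchSize = 100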
Hope this helps!

Write event with flume to S3 via HDFS Sink ensure transaction

We are using flume and S3 to store our events.
I noticed that events are only transferred to S3 when the HDFS sink rolls to the next file or Flume is shut down gracefully.
This can, in my mind, lead to potential data loss. The Flume documentation writes:
...Flume uses a transactional approach to guarantee the reliable
delivery of the Events...
Here is my configuration:
agent.sinks.defaultSink.type = HDFSEventSink
agent.sinks.defaultSink.hdfs.fileType = DataStream
agent.sinks.defaultSink.channel = fileChannel
agent.sinks.defaultSink.serializer = avro_event
agent.sinks.defaultSink.serializer.compressionCodec = snappy
agent.sinks.defaultSink.hdfs.path = s3n://testS3Bucket/%Y/%m/%d
agent.sinks.defaultSink.hdfs.filePrefix = events
agent.sinks.defaultSink.hdfs.rollInterval = 3600
agent.sinks.defaultSink.hdfs.rollCount = 0
agent.sinks.defaultSink.hdfs.rollSize = 262144000
agent.sinks.defaultSink.hdfs.batchSize = 10000
agent.sinks.defaultSink.hdfs.useLocalTimeStamp = true
#### CHANNELS ####
agent.channels.fileChannel.type = file
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000
I assume that I am just doing something wrong, any ideas?
After some investigation I found one of the main problems with using S3 together with Flume and the HDFS sink.
One of the main differences between plain HDFS and the S3 implementation is that S3 does not directly support rename. When a file is renamed in S3, the file is copied to the new name and the old file is deleted. (see: How to rename files and folder in Amazon S3?)
Flume by default appends .tmp to files while they are not yet full. After the rotation the file is renamed to the final filename. In HDFS this is no problem, but with S3 this can cause problems according to this issue:
https://issues.apache.org/jira/browse/FLUME-2445
Because S3 with the HDFS sink seems not to be 100% trustworthy, I prefer the safer way of saving all files locally and syncing/deleting the finished files with the aws tool s3 sync (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html).
In the worst case the files are not synced or the local disk is full, but both problems can easily be detected by a monitoring system that should be in place anyway.
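A minimal sketch of that local-staging approach (the sink name and directory are made-up examples, and a real setup would still have to make sure only closed files get synced):
# Write rolled files to the local filesystem instead of s3n:// directly
agent.sinks.localSink.type = file_roll
agent.sinks.localSink.channel = fileChannel
agent.sinks.localSink.sink.directory = /var/flume/staging
agent.sinks.localSink.sink.rollInterval = 3600
# Then push the finished files to S3 periodically, e.g. from cron:
# aws s3 sync /var/flume/staging s3://testS3Bucket/events/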

S3 Flume HDFS SINK Compression

I am trying to write Flume events to Amazon S3. The events written to S3 are in compressed format. My Flume configuration is given below. I am facing data loss. Based on the configuration given below, if I publish 20,000 events, I receive only 1,000 events and all the other data is lost. But when I disable the rollCount, rollSize and rollInterval configurations, all the events are received, but 2,000 small files are created. Is there anything wrong with my configuration settings? Should I add any other configuration?
injector.sinks.s3_3store.type = hdfs
injector.sinks.s3_3store.channel = disk_backed4
injector.sinks.s3_3store.hdfs.fileType = CompressedStream
injector.sinks.s3_3store.hdfs.codeC = gzip
injector.sinks.s3_3store.hdfs.serializer = TEXT
injector.sinks.s3_3store.hdfs.path = s3n://CID:SecretKey#bucketName/dth=%Y-%m-%d-%H
injector.sinks.s3_1store.hdfs.filePrefix = events-%{receiver}
# Roll when files reach 256M or after 10m, whichever comes first
injector.sinks.s3_3store.hdfs.rollCount = 0
injector.sinks.s3_3store.hdfs.idleTimeout = 600
injector.sinks.s3_3store.hdfs.rollSize = 268435456
#injector.sinks.s3_3store.hdfs.rollInterval = 3600
# Flush data to buckets every 1k events
injector.sinks.s3_3store.hdfs.batchSize = 10000
For starters: if you disable your settings for rollCount, rollSize and so on, Flume will revert to the defaults, hence the many small files you receive; those come from the default values.
The relevant aspect is this:
injector.sinks.s3_3store.hdfs.batchSize = 10000
It basically tells your sink to collect 10,000 events before flushing. If you reduce that amount, you'll get smaller files too, because S3, in contrast to regular HDFS, doesn't support file appends. Once you flush, the file will be closed and a new file will be created.
Try to determine how many events your sink receives within a short time frame of a couple of minutes or so, and set that value as your batch size.
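For example, a hedged sketch assuming roughly 1,000 events arrive in that window (the value is an illustrative assumption):
# Flush (and, on S3, effectively close) a file after about 1,000 events
injector.sinks.s3_3store.hdfs.batchSize = 1000
# Keep a size-based roll as an upper bound
injector.sinks.s3_3store.hdfs.rollSize = 268435456
injector.sinks.s3_3store.hdfs.rollCount = 0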

Flume to HDFS split a file to lots of files

I'm trying to transfer a 700 MB log file from flume to HDFS.
I have configured the flume agent as follows:
...
tier1.channels.memory-channel.type = memory
...
tier1.sinks.hdfs-sink.channel = memory-channel
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.path = hdfs://***
tier1.sinks.hdfs-sink.fileType = DataStream
tier1.sinks.hdfs-sink.rollSize = 0
The source is a spooldir, channel is memory and sink is hdfs.
I have also tried to send a 1 MB file, and Flume split it into 1,000 files, each 1 KB in size.
Another thing I have noticed is that the transfer was very slow; 1 MB took about 1 minute.
Am I doing something wrong?
You need to disable the roll timeout too; that's done with the following settings:
tier1.sinks.hdfs-sink.hdfs.rollCount = 0
tier1.sinks.hdfs-sink.hdfs.rollInterval = 300
rollCount = 0 prevents rollovers based on event count; rollInterval here is set to 300 seconds, and setting it to 0 would disable time-based rolling. You will have to choose which mechanism you want for rollovers, otherwise Flume will only close the files upon shutdown.
The default values are the following:
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)
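A minimal sketch of a sink section that rolls purely on time (note that every roll property takes the hdfs. prefix; the 300-second interval is just the example value used above):
tier1.sinks.hdfs-sink.hdfs.rollSize = 0
tier1.sinks.hdfs-sink.hdfs.rollCount = 0
tier1.sinks.hdfs-sink.hdfs.rollInterval = 300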
