Flume posts data into HDFS but with character issues - hadoop

Below is my Flume configuration.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.handler.nickname = random props
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://10.0.40.18:9160/flume-test
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
There is no error in the Flume log file, but I get garbled characters when reading the file with the hadoop command:
hadoop fs -cat hdfs://10.0.40.18:9160/flume-test/even1393415633931
The Flume log message says the HDFS file created is "hdfs://10.0.40.18:9160/flume-test/even1393415633931"
Any help is appreciated.

First, try replacing the HDFS sink with a logger sink to see whether your input is arriving correctly.
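A minimal sketch of that check, reusing the a1, k1 and c1 names from the question. Only the sink definition changes; the source and channel stay as they are. Note that the logger sink typically prints only the first bytes of each event body, so it confirms arrival rather than full content.

```properties
# Temporarily swap the HDFS sink for a logger sink to inspect incoming events.
a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```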
Once that is confirmed, I would recommend adjusting the flush settings for the sink. The HDFS sink batches events before flushing to HDFS according to hdfs.batchSize, which defaults to 100. This is probably the issue, as you will need to send 100 JSON posts before your output flushes for the first time.
Lastly, you may also want to try changing hdfs.writeFormat, which defaults to Writable rather than Text.
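As a rough sketch of those two settings on the question's k1 sink (the batch size of 1 is an illustrative value for testing only, not a recommendation for production throughput):

```properties
# Flush to HDFS after every event instead of waiting for the default batch of 100.
a1.sinks.k1.hdfs.batchSize = 1
# Write records in Text format rather than the default Writable.
a1.sinks.k1.hdfs.writeFormat = Text
```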

It sounds like you want a text file so you should use DataStream like this:
a1.sinks.k1.hdfs.fileType = DataStream

Related

Data loss (skipping) using Flume with Kafka source and HDFS sink

I am experiencing data loss (chunks of time skipped in the data) when pulling data off a Kafka topic as a source and putting it into an HDFS file (DataStream) as a sink. The skipped data seems to come in 10, 20 or 30 minute blocks. I have verified that the skipped data is in the topic .log file generated by Kafka. (The original data comes from syslog, goes through a different Flume agent and is put into the Kafka topic; the data loss isn't happening there.)
I find it interesting and unusual that the blocks of skipped data are always 10, 20 or 30 mins and happen at least once an hour in my data.
Here is a copy of my configuration file:
a1.sources = kafka-source
a1.channels = memory-channel
a1.sinks = hdfs-sink
a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.xx.x.xx:xxxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = memory-channel
a1.channels.memory-channel.type = memory
a1.channels.memory-channel.capacity = 100000
a1.channels.memory-channel.transactionCapacity = 1000
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.channel = memory-channel
Any insight would be helpful. I have been searching online for answers for a while.
Thanks.

Flume file_roll sink gets stuck after a few minutes

I am using the Flume file_roll sink to write a high volume of data (~10,000 events/second) received through a syslogtcp source. However, the process pushing data to the syslogTCP port (a Spark Streaming job) gets stuck after 15-20 minutes, having ingested around 1.5 million events. I also observed some file descriptor issues on the Linux box where the flume-ng agent is running.
Below is the Flume configuration I am using:
agent2.sources = r1
agent2.channels = c1
agent2.sinks = f1
agent2.sources.r1.type = syslogtcp
agent2.sources.r1.bind = i-170d29de.aws.amgen.com
agent2.sources.r1.port = 44442
agent2.channels.c1.type = memory
agent2.channels.c1.capacity = 1000000000
agent2.channels.c1.transactionCapacity = 40000
agent2.sinks.f1.type = file_roll
agent2.sinks.f1.sink.directory = /opt/app/svc-edl-ops-ngmp-dev/rdas/flume_output
agent2.sinks.f1.sink.rollInterval = 300
agent2.sinks.f1.sink.rollSize = 104857600
agent2.sinks.f1.sink.rollCount = 0
agent2.sources.r1.channels = c1
agent2.sinks.f1.channel = c1
I cannot use the HDFS sink type because of performance issues, mainly due to the high ingestion rate.
This was my bad. I was using console logging, and at some point the PuTTY terminal froze because of a connectivity issue, causing the entire Flume agent to choke.
Redirecting the Flume console output, or using a log4j.properties that writes to a file instead of the console, resolved the freezing issue.
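A sketch of a log4j.properties that sends the agent's log output to a rolling file, so the agent no longer depends on a live terminal. It assumes the log4j 1.x properties format used by Flume 1.x releases of that period; the file path, size limit and appender name are hypothetical.

```properties
# Route all logging to a rolling file appender instead of the console.
log4j.rootLogger=INFO, LOGFILE
log4j.appender.LOGFILE=org.apache.log4j.RollingFileAppender
# Hypothetical location; pick a directory the agent user can write to.
log4j.appender.LOGFILE.File=/var/log/flume/flume.log
log4j.appender.LOGFILE.MaxFileSize=100MB
log4j.appender.LOGFILE.MaxBackupIndex=10
log4j.appender.LOGFILE.layout=org.apache.log4j.PatternLayout
log4j.appender.LOGFILE.layout.ConversionPattern=%d{ISO8601} %-5p [%t] %c - %m%n
```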

How to put files in flume spooldir one by one?

I am using the Flume spooldir source to put files in HDFS, but I am getting a lot of small files in HDFS. I thought of using batch size and roll interval, but I don't want to depend on size and interval. So I decided to push files into the Flume spooldir one at a time. How can I do this?
According to https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source, if you set a1.sources.src-1.fileHeader = true, you can then reference any header (for example the file name header) in the HDFS sink path (see %{host} in the escape sequence description at https://flume.apache.org/FlumeUserGuide.html#hdfs-sink).
EDIT:
For an example config, you can try the following:
a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /flumespool
a1.sources.r1.basenameHeader = true
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flumeout/%{basename}
a1.sinks.k1.hdfs.fileType = DataStream

Flume sinks data in an inconsistent fashion

I have a problem. I am using Apache Flume to read logs from a text file and sink them to HDFS. Somehow, some records are being skipped while reading. I am using a file channel; please check the configuration below.
agent2.sources = file_server
agent2.sources.file_server.type=exec
agent2.sources.file_server.command = tail -F /home/datafile/error.log
agent2.sources.file_server.channels = fileChannel
agent2.channels = fileChannel
agent2.channels.fileChannel.type=file
agent2.channels.fileChannel.capacity = 12000
agent2.channels.fileChannel.transactionCapacity = 10000
agent2.channels.fileChannel.checkpointDir=/home/data/flume/checkpoint
agent2.channels.fileChannel.dataDirs=/home/data/flume/data
# Agent2 sinks
agent2.sinks = hadooper loged
agent2.sinks.hadooper.type = hdfs
agent2.sinks.loged.type=logger
agent2.sinks.hadooper.hdfs.path = hdfs://localhost:8020/flume/data/file
agent2.sinks.hadooper.hdfs.fileType = DataStream
agent2.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.rollInterval = 600
agent2.sinks.hadooper.hdfs.rollCount = 0
agent2.sinks.hadooper.hdfs.rollSize = 67108864
agent2.sinks.hadooper.hdfs.batchSize = 10
agent2.sinks.hadooper.hdfs.idleTimeout=0
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = fileChannel
agent2.sinks.hadooper.hdfs.threadsPoolSize = 20
Please help.
I think the problem is that you are using two sinks that both read from a single channel; in that case, a Flume event taken by one of the sinks is not seen by the other, and vice versa.
If you want both sinks to receive a copy of the same Flume event, you need to create a dedicated channel for each sink. Once these channels exist, the default channel selector, ReplicatingChannelSelector, will put a copy of each event into every channel.
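A sketch of that layout using the agent, source and sink names from the question. The channel names hdfsChannel and logChannel and their directories are hypothetical; each file channel needs its own checkpoint and data directories. Only the lines that differ from the question's configuration are shown.

```properties
# One dedicated channel per sink.
agent2.channels = hdfsChannel logChannel
agent2.channels.hdfsChannel.type = file
agent2.channels.hdfsChannel.checkpointDir = /home/data/flume/hdfs/checkpoint
agent2.channels.hdfsChannel.dataDirs = /home/data/flume/hdfs/data
agent2.channels.logChannel.type = file
agent2.channels.logChannel.checkpointDir = /home/data/flume/log/checkpoint
agent2.channels.logChannel.dataDirs = /home/data/flume/log/data
# The source writes to both channels; replicating is the default selector,
# stated explicitly here for clarity.
agent2.sources.file_server.channels = hdfsChannel logChannel
agent2.sources.file_server.selector.type = replicating
# Each sink drains its own channel, so both see every event.
agent2.sinks.hadooper.channel = hdfsChannel
agent2.sinks.loged.channel = logChannel
```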

Flume loses data when collecting online data to HDFS

I am using flume-ng 1.5 to collect logs.
There are two agents in the data flow, each on its own host.
The data is sent from agent1 to agent2.
The agents' components are as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to lose about 1 in 1000 events out of millions.
To solve the problem I tried these steps:
Checked the agents' logs: could not find any error or exception.
Checked the agents' monitoring metrics: the number of events put into and taken from the channel is always equal.
Counted the records with a Hive query and with a shell command on the HDFS files: the two counts are equal to each other but lower than the online data count.
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration:
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
There is a bug in the file line deserializer when it encounters certain Unicode characters whose code points lie between U+10000 and U+10FFFF; in UTF-16 these are represented by two 16-bit code units, called a surrogate pair.
