I am trying to put real-time Wireshark data into HBase with Flume, using the following agent configuration.
Source
A1.Source.k1.type= exec
A1.Source.k1.command = tail -f /usr/sbin/tshark
Sink
A1.Sinks.C1.Type = hbase
A1.Sinks.C1.columnFamily =
A1.Sinks.C1.table =
And I run tshark as root:
tshark -i eth0
Data seems to be stored, but it looks like this: x0/x0/x0/.
Any idea where I am going wrong?
I'm using Flume to collect tweets and store them on HDFS.
The collecting part is working fine, and I can find all my tweets in my file system.
Now I would like to extract all these tweets in one single file.
The problem is that the different tweets are stored as follows:
As we can see, the tweets are stored inside blocks of 128 MB but only use a few KB, which is normal behaviour for HDFS, correct me if I'm wrong.
However, how could I get all the different tweets into one single file?
Here is my conf file, which I run with the following command:
flume-ng agent -n TwitterAgent -f ./my-flume-files/twitter-stream-tvseries.conf
twitter-stream-tvseries.conf :
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey=hidden
TwitterAgent.sources.Twitter.consumerSecret=hidden
TwitterAgent.sources.Twitter.accessToken=hidden
TwitterAgent.sources.Twitter.accessTokenSecret=hidden
TwitterAgent.sources.Twitter.keywords=GoT, GameofThrones
TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://ip-addressl:8020/user/root/data/twitter/tvseries/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
You can configure the HDFS sink to roll files by time, event count, or size. So, if you want to keep accumulating tweets in one file until a 120 MB limit is reached, set:
hdfs.rollInterval = 0       # 0 disables rolling based on time
hdfs.rollSize = 125829120   # roll a new file once 120 MB is reached
hdfs.rollCount = 0          # 0 disables rolling based on event count (events = tweets in your case)
You can use the following command to concatenate the files into a single file:
find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
Or, if you want to store the data in Hive tables for later analysis, create an external table over the directory and consume it in Hive.
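If the part files already live under one HDFS directory, a shorter alternative to the find-and-cat approach is `hdfs dfs -getmerge`, which concatenates everything in a directory into a single local file. A sketch, assuming the output path from the sink config above:

```shell
# Merge every Flume part file under the tweets directory into one local file.
# The HDFS path comes from the sink config above; adjust it to your cluster.
hdfs dfs -getmerge /user/root/data/twitter/tvseries/tweets ./all_tweets.txt
```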
I am trying to feed some NetFlow data into Kafka. I have some netflow.pcap files, which I read like
tcpdump -r netflow.pcap and get output such as:
14:48:40.823468 IP abts-kk-static-242.4.166.122.airtelbroadband.in.35467 > abts-kk-static-126.96.166.122.airtelbroadband.in.9500: UDP, length 1416
14:48:40.824216 IP abts-kk-static-242.4.166.122.airtelbroadband.in.35467 > abts-kk-static-126.96.166.122.airtelbroadband.in.9500: UDP, length 1416
...
In the official docs they mention the traditional way: start a Kafka producer, start a Kafka consumer, and type some data into the producer's terminal, which is then shown in the consumer. Good. Working.
Here they show how to input a file to a Kafka producer. Mind you, just one single file, not multiple files.
Question is:
How can I feed the output of a shell script into a Kafka broker?
For example, the shell script is:
#!/bin/bash
FILES=/path/to/*
for f in $FILES
do
tcpdump -r "$f"
done
I can't find any documentation or article where they mention how to do this. Any idea? Thanks!
Well, based on the link you gave on how to use the console Kafka producer with an input file, you can do the same with your output: redirect the output to a file and then use the producer.
Note that I used >> in order to append to the file and not overwrite it.
For example:
#!/bin/bash
FILES=/path/to/*
for f in $FILES
do
tcpdump -r "$f" >> /tmp/tcpdump_output.txt
done
kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic --new-producer < /tmp/tcpdump_output.txt
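If you'd rather not stage a temporary file at all, the loop's output can also be piped straight into the console producer. This is a sketch under the same assumptions (hypothetical paths, a broker on localhost:9092, a topic named my_topic):

```shell
#!/bin/bash
# Decode each capture and stream the text straight into a Kafka topic,
# one line per message, without an intermediate file.
FILES=/path/to/*
for f in $FILES
do
  tcpdump -r "$f"
done | kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic
```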
I'm using webhdfs to ingest data from Local file system to HDFS. Now I want to ensure integrity of files ingested into HDFS.
How can I make sure the transferred files are not corrupted/altered etc.?
I used the below webhdfs command to get the checksum of a file:
curl -i -L --negotiate -u: -X GET "http://$hostname:$port/webhdfs/v1/user/path?op=GETFILECHECKSUM"
How should I use the above checksum to ensure the integrity of the ingested files? Please suggest.
Below are the steps I'm following:
$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b
$ hadoop fs -checksum locale_file
locale_file MD5-of-0MD5-of-512CRC32C 000002000000000000000000f4ec0c298cd6196ffdd8148ae536c9fe
The checksum of the file on the local system is different from that of the same file on HDFS. I need to compare the checksums; how can I do that?
One way to do that is to calculate the checksum locally and then match it against the Hadoop checksum after you ingest the file.
I wrote a library to calculate the checksum locally for this, in case anybody is interested:
https://github.com/srch07/HDFSChecksumForLocalfile
Try this
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILECHECKSUM"
Refer to the following link for full information:
https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/WebHDFS.html#Get_File_Checksum
It can be done from the console as below:
$ md5sum locale_file
740c461879b484f4f5960aa4f67a145b
$ hadoop fs -cat locale_file |md5sum -
740c461879b484f4f5960aa4f67a145b -
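The same console comparison can be wrapped in a small script that fails loudly on a mismatch. This is a minimal sketch, assuming the `hadoop` CLI is on the PATH and that the local and HDFS paths below (both hypothetical) point at the same logical file:

```shell
#!/bin/bash
# Compare the MD5 of a local file against the MD5 of its HDFS copy.
# `hadoop fs -cat` streams the raw bytes back, so a plain md5sum applies.
LOCAL=locale_file              # local path (hypothetical)
REMOTE=/user/root/locale_file  # HDFS path (hypothetical)

local_md5=$(md5sum "$LOCAL" | cut -d' ' -f1)
hdfs_md5=$(hadoop fs -cat "$REMOTE" | md5sum | cut -d' ' -f1)

if [ "$local_md5" = "$hdfs_md5" ]; then
  echo "OK: checksums match"
else
  echo "MISMATCH: $LOCAL differs from its HDFS copy" >&2
  exit 1
fi
```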
You can also verify the local file via code:
import java.io._
import org.apache.commons.codec.digest.DigestUtils
// md5Hex(String) would hash the literal string "locale_file", so read the file instead
val md5sum = DigestUtils.md5Hex(new FileInputStream("locale_file"))
and for the file on Hadoop:
import org.apache.hadoop.fs._
import org.apache.hadoop.io._
val md5sum = MD5Hash.digest(FileSystem.get(hadoopConfiguration).open(new Path("locale_file"))).toString
I want to read a log file in Flume from a different server, i.e. one that is up and running on another machine. How can I achieve this by changing my flume-conf.properties file? What should I write in the Flume configuration file to achieve this?
a1.sources = AspectJ
a1.channels = memoryChannel
a1.sinks = kafkaSink
a1.sources.AspectJ.type = com.flume.MySource
a1.sources.AspectJ.command = tail -F /tmp/data/Log.txt
To achieve this, what should I write in place of
a1.sources.AspectJ.command = tail -F /tmp/data/Log.txt
I believe what you want to ask is: if Flume is set up on host 'F' and your log files exist on host 'L', how will you configure Flume to read log files from host 'L', correct?
If so, then you need to set up Flume on host 'L' and not on 'F'. Set up Flume on the same host where the log files are, and point the sink at a Kafka topic.
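A minimal sketch of what the agent on host 'L' could look like, keeping your source and sink names but using the stock exec source in place of the custom one; the broker address and topic name are hypothetical placeholders:

```
# Runs on host 'L', where /tmp/data/Log.txt actually lives
a1.sources = AspectJ
a1.channels = memoryChannel
a1.sinks = kafkaSink

a1.sources.AspectJ.type = exec
a1.sources.AspectJ.command = tail -F /tmp/data/Log.txt
a1.sources.AspectJ.channels = memoryChannel

a1.channels.memoryChannel.type = memory

a1.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafkaSink.kafka.bootstrap.servers = kafka-host:9092
a1.sinks.kafkaSink.kafka.topic = logs
a1.sinks.kafkaSink.channel = memoryChannel
```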
I am new to Flume, so please tell me: how do I store log files from my local machine into HDFS using Flume?
I have issues in setting the classpath and the flume.conf file.
Thank you,
Ajay
agent.sources = weblog
agent.channels = memoryChannel
agent.sinks = mycluster
## Sources #########################################################
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F REPLACE-WITH-PATH2-your.log-FILE
agent.sources.weblog.batchSize = 1
agent.sources.weblog.channels = REPLACE-WITH-CHANNEL-NAME
## Channels ########################################################
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 100
## Sinks ###########################################################
agent.sinks.mycluster.type =REPLACE-WITH-CLUSTER-TYPE
agent.sinks.mycluster.hdfs.path=/user/root/flumedata
agent.sinks.mycluster.channel =REPLACE-WITH-CHANNEL-NAME
Save this file as logagent.conf and run it with the command below:
# flume-ng agent -n agent -f logagent.conf &
We do need more information to know why things are not working for you.
The short answer is that you need a Source to read your data from (maybe the spooling directory source), a Channel (memory channel if you don't need reliable storage) and the HDFS sink.
Update
The OP reports receiving the error message, "you must include conf file in flume class path".
You need to provide the conf file as an argument. You do so with the --conf-file parameter. For example, the command line I use in development is:
bin/flume-ng agent --conf-file /etc/flume-ng/conf/flume.conf --name castellan-indexer --conf /etc/flume-ng/conf
The error message reads that way because the bin/flume-ng script adds the contents of the --conf-file argument to the classpath before running Flume.
If you are appending data to your local file, you can use an exec source with the "tail -F" command. If the file is static, use the cat command to transfer the data to Hadoop.
The overall architecture would be:
Source: Exec source reading data from your file
Channel : Either memory channel or file channel
Sink: HDFS sink where the data is dumped.
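Put together, a minimal conf for that architecture might look like this sketch (the agent name, log path, and HDFS URL are placeholders to replace with your own):

```
# exec source -> memory channel -> HDFS sink
agent.sources = logsrc
agent.channels = memch
agent.sinks = hdfssink

agent.sources.logsrc.type = exec
agent.sources.logsrc.command = tail -F /path/to/your.log
agent.sources.logsrc.channels = memch

agent.channels.memch.type = memory
agent.channels.memch.capacity = 10000

agent.sinks.hdfssink.type = hdfs
agent.sinks.hdfssink.hdfs.path = hdfs://namenode:8020/user/flume/logs
agent.sinks.hdfssink.hdfs.fileType = DataStream
agent.sinks.hdfssink.channel = memch
```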
Use the user guide to create your conf file (https://flume.apache.org/FlumeUserGuide.html).
Once you have your conf file ready, you can run it like this:
bin/flume-ng agent -n $agent_name -c conf -f conf/your-flume-conf.conf