How to feed shell script output to kafka? - bash

I am trying to feed some netflow data into kafka. I have some netflow.pcap files which I read with tcpdump -r netflow.pcap, getting output like this:
14:48:40.823468 IP abts-kk-static-242.4.166.122.airtelbroadband.in.35467 > abts-kk-static-126.96.166.122.airtelbroadband.in.9500: UDP, length 1416
14:48:40.824216 IP abts-kk-static-242.4.166.122.airtelbroadband.in.35467 > abts-kk-static-126.96.166.122.airtelbroadband.in.9500: UDP, length 1416
.
.
.
.
In the official docs they show the traditional way of starting a kafka producer and a kafka consumer, then typing some data into the producer's terminal, which shows up in the consumer. Good. Working.
Here they show how to feed a file to the kafka producer. Mind you, just one single file, not multiple files.
Question is:
How can I feed the output of a shell script into a kafka broker?
For example, the shell script is:
#!/bin/bash
FILES=/path/to/*
for f in $FILES
do
    tcpdump -r "$f"
done
I can't find any documentation or article where they mention how to do this. Any idea? Thanks!

Well, based on the link you gave on how to use the shell kafka producer with an input file, you can do the same with your output: redirect it to a file and then point the producer at that file.
Note that I used >> below in order to append to the file rather than overwrite it.
For example:
#!/bin/bash
FILES=/path/to/*
for f in $FILES
do
    tcpdump -r "$f" >> /tmp/tcpdump_output.txt
done

kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic \
    --new-producer < /tmp/tcpdump_output.txt
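If you would rather skip the temporary file, the console producer reads from stdin, so you can pipe the script's output straight into it. A minimal sketch, assuming the same broker and topic as above:
#!/bin/bash
# Pipe the tcpdump output of every pcap file directly into the console producer;
# each line of output becomes one message on the topic.
for f in /path/to/*
do
    tcpdump -r "$f"
done | kafka-console-producer.sh --broker-list localhost:9092 --topic my_topic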

Related

Efficient way of sending the same data to multiple dynamic processes

I have a stream of line-buffered data and many readers in other processes.
The readers need to attach to the system dynamically; they are not known to the process writing the stream.
First I tried to read every line and simply send it to a lot of pipes:
#writer
command | while read -r line; do
    printf '%s\n' "$line" | tee listeners/*
done
#reader
mkfifo listeners/1
cat listeners/1
But that consumes a lot of CPU.
So I thought about writing to a file and truncating it repeatedly:
#writer
command >> file &
while true; do
    : > file
    sleep 1
done
#reader
tail -f -n0 file
But sometimes a line is not read by one or more readers before the truncation, which creates a race condition.
Is there a better way I could implement this?
Sounds like pub/sub to me - see Wikipedia.
Basically, new interested parties come along whenever they like and "subscribe" to your channel. The process receiving the data then "publishes" it, line by line, to that channel.
You can do it with MQTT using mosquitto, or with Redis. Both have command-line interfaces as well as bindings for Python, C/C++, Ruby, PHP etc. The client and server need not be on the same machine; some clients could be elsewhere on the network.
Mosquitto example here.
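For a rough idea of the Mosquitto route, here is a minimal sketch using the standard mosquitto_sub and mosquitto_pub command-line clients, assuming a broker on localhost and a topic called myStream:
# Each reader simply subscribes to the topic (run one of these per reader)
mosquitto_sub -h localhost -t myStream

# The writer publishes every line of its output as a separate message
# (-l reads stdin and sends one message per line)
command | mosquitto_pub -h localhost -t myStream -l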
I did a few tests on my Mac with Redis pub/sub. The client code in Terminal to subscribe to a channel called myStream looks like this:
redis-cli SUBSCRIBE myStream
I then ran a process to synthesise 10,000 lines like this:
time seq 10000 | while read a ; do redis-cli PUBLISH myStream "$a" >/dev/null 2>&1 ; done
And that takes 40s, so it does around 250 lines per second, but it has to start a whole new process for each line and create and tear down the connection to Redis... and we don't want to drive your CPU mad.
More appropriately for your situation then, here is how you can create a file with 100,000 lines, and read them one at a time, and send them to all your subscribers in Python:
# Make a "BigFile" with 100,000 lines
seq 100000 > BigFile
and read the lines and publish them with:
#!/usr/bin/env python3
import redis

if __name__ == '__main__':
    # Redis connection
    r = redis.Redis(host='localhost', port=6379, db=0)

    # Read file line by line...
    with open('BigFile', 'r') as infile:
        for line in infile:
            # Publish the current line to subscribers
            r.publish('myStream', line)
The entire 100,000 lines were sent and received in 4s, so around 25,000 lines per second, and the CPU was not unduly troubled by it: in a test run, two client windows each received all 100,000 lines while the server window ran the Python code above and sent everything in 4s.
Keywords: Redis, mosquitto, pub/sub, publish, subscribe.

Can you view historic logs for parse.com cloud code?

On the Parse.com cloud-code console, I can see logs, but they only go back maybe 100-200 lines. Is there a way to see or download older logs?
I've searched their website & googled, and don't see anything.
Using the parse command-line tool, you can retrieve an arbitrary number of log lines:
Usage:
parse logs [flags]
Aliases:
logs, log
Flags:
-f, --follow=false: Emulates tail -f and streams new messages from the server
-l, --level="INFO": The log level to restrict to. Can be 'INFO' or 'ERROR'.
-n, --num=10: The number of the messages to display
Not sure if there is a limit, but I've been able to fetch 5000 lines of log with this command:
parse logs prod -n 5000
To add on to Pascal Bourque's answer, you may also wish to filter the logs by a given range of dates. To achieve this, I used the following:
parse logs -n 5000 | sed -n '/2016-01-10/, /2016-01-15/p' > filteredLog.txt
This will get up to 5000 logs, use the sed command to keep all of the logs which are between 2016-01-10 and 2016-01-15, and store the results in filteredLog.txt.
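If you only care about errors in that window, the -l flag from the usage output above can be combined with the same filter; a sketch, assuming the same prod app and date range:
parse logs prod -n 5000 -l ERROR | sed -n '/2016-01-10/, /2016-01-15/p' > filteredErrors.txt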

Errors in UDP sending in a sub-script (bash)

Using a Raspi/Debian - I have a script that parses the results from an iwlist scan and sends them via UDP to a Pure Data patch. This runs fine in gui mode, but now I'm trying to automate the whole process in another script with the following:
pd-extended -nogui /home/pi/patch.pd & /home/pi/libOSC/scan.sh && fg
But when I run this new script, the UDP appears to only send the info to Pure Data once, and then the scanning continues but Pd does not receive the packet. Any help with this would be appreciated.
What happens when you run /home/pi/libOSC/scan.sh? It sends the results only once? Then maybe you need to do it differently, like calling that script from within pd using the 'shell' or 'popen' objects for instance. Or you implement a polling command via UDP that will return the values.
What does your scan.sh script look like?
You probably want to make it something like this:
pdhost=localhost
pdport=9999

do_scan() {
    ## some code here that does the scan and prints the result to stdout
    :  # placeholder
}

do_scan | while read -r line
do
    echo "${line};" | pdsend ${pdport} ${pdhost} udp
done
rather than the following:
do_scan | pdsend ${pdport} ${pdhost} udp
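For completeness, a hypothetical sketch of what do_scan could contain, assuming the wireless interface is wlan0 and that only the ESSID and signal-level lines of the iwlist output are of interest:
do_scan() {
    # Scan and keep only the network name and signal strength lines
    iwlist wlan0 scan 2>/dev/null | grep -E 'ESSID|Signal level'
}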

"cat" bit-stream with no EOF

I have a file opened both for reading and writing and associated this file with file descriptor 3, i.e. exec 3<>/dev/udp/10.10.10.1/161. When I redirect a crafted UDP packet to file descriptor 3 and receive a reply, how can I read it from file descriptor 3? Usual tools like cat or read do not work well, because the UDP packet (essentially just a bit stream) received as a reply does not end with a newline or an EOF, so cat, for example, does not know that there is no more data to expect. Here you can see how I had to SIGINT the cat:
$ cat <&3
0Gpublic�:�0,0+C1841.local^C
$
I would like to check whether any UDP data was received from 10.10.10.1; in other words, if file descriptor 3 contains any data (even a single bit), then a reply was received.
Your problem is that you cannot recognize the end of a packet properly. There is no EOF signifier (as you noticed), like a special character or a file-closed event. Instead, all you can do is either
read a fixed size of characters (in case your packets are fixed in size) or
read single tokens (maybe bytes) until your packet's syntax states that it is complete or
read until a timeout occurred.
The first two are your responsibility to implement, where that is possible at all.
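For the fixed-size case, a minimal sketch, assuming the reply is known to be exactly 64 bytes long:
# Read exactly 64 bytes from file descriptor 3 and store them for inspection
head -c 64 <&3 > /tmp/udp_reply.bin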
The last one can be achieved using a cat in a subshell which you kill after a certain amount of time:
cat <&3 & pid=$!
sleep 0.1
kill "$pid" 2>/dev/null
Put this in a function and each call will last 0.1s and output whatever could be read in that time.
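For example (a sketch, assuming file descriptor 3 is already open as in the question):
read_reply() {
    # Read from fd 3 in the background, give it 0.1s, then stop the reader
    cat <&3 & pid=$!
    sleep 0.1
    kill "$pid" 2>/dev/null
}

# Usage: capture whatever arrived on fd 3 within 0.1s
reply=$(read_reply)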

How to load data from local machine to hdfs using flume

I am new to Flume, so please tell me... how do I store log files from my local machine into HDFS using Flume?
I have issues setting the classpath and the flume.conf file.
Thank you,
ajay
agent.sources = weblog
agent.channels = memoryChannel
agent.sinks = mycluster
## Sources #########################################################
agent.sources.weblog.type = exec
agent.sources.weblog.command = tail -F REPLACE-WITH-PATH2-your.log-FILE
agent.sources.weblog.batchSize = 1
agent.sources.weblog.channels = REPLACE-WITH-CHANNEL-NAME
## Channels ########################################################
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100
agent.channels.memoryChannel.transactionCapacity = 100
## Sinks ###########################################################
agent.sinks.mycluster.type = REPLACE-WITH-CLUSTER-TYPE
agent.sinks.mycluster.hdfs.path = /user/root/flumedata
agent.sinks.mycluster.channel = REPLACE-WITH-CHANNEL-NAME
Save this file as logagent.conf and run it with the command below:
# flume-ng agent -n agent -f logagent.conf &
We do need more information to know why things are not working for you.
The short answer is that you need a Source to read your data from (maybe the spooling directory source), a Channel (memory channel if you don't need reliable storage) and the HDFS sink.
Update
The OP reports receiving the error message, "you must include conf file in flume class path".
You need to provide the conf file as an argument. You do so with the --conf-file parameter. For example, the command line I use in development is:
bin/flume-ng agent --conf-file /etc/flume-ng/conf/flume.conf --name castellan-indexer --conf /etc/flume-ng/conf
The error message reads that way because the bin/flume-ng script adds the contents of the --conf-file argument to the classpath before running Flume.
If you are appending data to your local file, you can use an exec source with the "tail -F" command. If the file is static, use the cat command to transfer the data to Hadoop.
The overall architecture would be:
Source: Exec source reading data from your file
Channel : Either memory channel or file channel
Sink: Hdfs sink where data is being dumped.
Use the user guide to create your conf file (https://flume.apache.org/FlumeUserGuide.html).
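For orientation, here is a minimal sketch of such a conf file, assuming an agent named agent1, a hypothetical log path /var/log/myapp/app.log and a hypothetical HDFS path; adjust all of these to your setup:
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Exec source tailing the local log file
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/myapp/app.log
agent1.sources.src1.channels = ch1

# In-memory channel (not durable across agent restarts)
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 1000
agent1.channels.ch1.transactionCapacity = 100

# HDFS sink writing plain text files
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = hdfs://namenode:8020/user/flume/logs
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1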
Once you have your conf file ready, you can run it like this:
bin/flume-ng agent -n $agent_name -c conf -f conf/your-flume-conf.conf
