Ruby: read only the lines newly added to a file - ruby

There is a script which writes its output to a file, and I am planning to set up monitoring of that process and send the latest output information to a central server.
For example, the monitoring script will check the output file every minute, and it has to read only the latest data added to the output file and send it to the central server. In other words, it should not re-read the old data it has already read. Is there any way to make this happen with Ruby version >= 1.8.7?
Here is my scenario:
The script test.sh is running and writes its output to a file called /tmp/script_output.txt.
I have another monitoring script, jobmonitor.rb, which is triggered by test.sh and is used to monitor the job status and send information to the central server. Right now it reads the entire contents of script_output.txt and sends the details, and it has to send the information every 2 minutes for live monitoring.
In some cases the output file script_output.txt will have more than 500 lines, so every 2 minutes the monitoring script sends the entire content, which includes content that is more than 2 minutes old.
So I am looking for a way to read, every 2 minutes, only the newly added content of the output file.
For example:
at present the script_output.txt file has the contents:
line1
line2
.
.
.
line10
and the monitoring script has already sent this information.
Now, after 2 minutes, the output file has the contents below:
line1
line2
.
.
.
line10
line11
.
.
.
line20
Now the monitoring script captures all 20 lines and sends the information; instead, I need to send only the output contents from line11 to line20.
Update:
I tried the following:
open(filepath, 'r') do |f|
  while true
    puts f.readline
    sleep 60
  end
end
It reads only one line at a time, but I want to read however many lines were added in the last 120 seconds, i.e. everything from the line after the last one read up to the current end of the file.
Note: I already raised this question at https://stackoverflow.com/questions/43446881/ruby-read-lines-from-file-only-latest-updates but I did not include detailed information initially, so the question was closed. Sorry for that.
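One approach that works on Ruby 1.8.7 and later is to remember the byte offset of the last read and seek past it on the next cycle, so only freshly appended data is read. The following is only a minimal sketch of that idea; send_to_central_server is a hypothetical placeholder for whatever sending code jobmonitor.rb already uses.

filepath = '/tmp/script_output.txt'
offset = 0                               # byte position up to which data was already sent

loop do
  if File.exist?(filepath)
    size = File.size(filepath)
    offset = 0 if size < offset          # file was truncated or rotated: start over
    if size > offset
      File.open(filepath, 'r') do |f|
        f.seek(offset)                   # skip everything that was already sent
        new_lines = f.read               # only the content appended since the last cycle
        offset = f.pos                   # remember where this read stopped
        send_to_central_server(new_lines)  # hypothetical helper; replace with the real sending code
      end
    end
  end
  sleep 120                              # poll every 2 minutes
end

Persisting offset to a small state file would let the monitor survive its own restarts, but an in-memory variable is enough for a single long-running loop.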

Related

Upload to HDFS stops with warning "Slow ReadProcessor read"

When I try to upload files that are about 20 GB into HDFS, they usually upload until about 12-14 GB and then stop, and I get a bunch of these warnings on the command line:
"INFO hdfs.DataStreamer: Slow ReadProcessor read fields for block BP-222805046-10.66.4.100-1587360338928:blk_1073743783_2960 took 62414ms (threshold=30000ms); ack: seqno: 226662 reply: SUCCESS downstreamAckTimeNanos: 0 flag: 0, targets:"
However, if I try to upload the files 5-6 times, they sometimes work after the 4th or 5th attempt. I believe that if I alter some datanode storage settings I can achieve consistent uploads without issues, but I don't know which parameters to modify in the Hadoop configuration. Thanks!
Edit: This happens when I put the file into HDFS through a Python program which uses a subprocess call to put the file in. However, even if I call it directly from the command line I still run into the same issue.

Efficient way of sending the same data to multiple dynamic processes

I have a stream of line-buffered data and many readers in other processes.
The readers need to attach to the system dynamically; they are not known to the process writing the stream.
First I tried reading every line and simply sending it to a lot of pipes:
#writer
command | while read -r line; do
  printf '%s\n' "$line" | tee listeners/*
done
#reader
mkfifo listeners/1
cat listeners/1
But that consumes a lot of CPU.
So I thought about writing to a file and truncating it repeatedly:
#writer
command >> file &
while true; do
  : > file
  sleep 1
done
#reader
tail -f -n0 file
But sometimes a line is not read by one or more readers before the truncation, which creates a race condition.
Is there a better way to implement this?
Sounds like pub/sub to me - see Wikipedia.
Basically, new interested parties come along whenever they like and "subscribe" to your channel. The process receiving the data then "publishes" it, line by line, to that channel.
You can do it with MQTT using mosquitto, or with Redis. Both have command-line interfaces/bindings, as well as Python, C/C++, Ruby, PHP, etc. The client and server need not be on the same machine; some clients could be elsewhere on the network.
Mosquitto example here.
I did a few tests on my Mac with Redis pub/sub. The client code in Terminal to subscribe to a channel called myStream looks like this:
redis-cli SUBSCRIBE myStream
I then ran a process to synthesise 10,000 lines like this:
time seq 10000 | while read a ; do redis-cli PUBLISH myStream "$a" >/dev/null 2>&1 ; done
And that takes 40s, so it does around 250 lines per second, but it has to start a whole new process for each line and create and tear down the connection to Redis... and we don't want to send your CPU mad.
More appropriately for your situation, then, here is how you can create a file with 100,000 lines, read the lines one at a time, and send them to all your subscribers in Python.
# Make a "BigFile" with 100,000 lines
seq 100000 > BigFile
and read the lines and publish them with:
#!/usr/bin/env python3
import redis

if __name__ == '__main__':
    # Redis connection
    r = redis.Redis(host='localhost', port=6379, db=0)

    # Read the file line by line...
    with open('BigFile', 'r') as infile:
        for line in infile:
            # Publish the current line to subscribers
            r.publish('myStream', line)
The entire 100,000 lines were sent and received in 4 s, so around 25,000 lines per second. A recording of it in action showed the following: at the top, the CPU is not unduly troubled by it; the second window from the top is a client receiving the 100,000 lines; the next window down is a second client; and the bottom window shows the server running the Python code above and sending all 100,000 lines in 4 s.
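Since the main question on this page concerns Ruby, a subscriber can be sketched in Ruby as well. This is only a rough equivalent of the redis-cli SUBSCRIBE command above, and it assumes the redis gem is installed and a Redis server is listening on localhost:

require 'redis'

redis = Redis.new(:host => 'localhost', :port => 6379)

redis.subscribe('myStream') do |on|
  on.message do |channel, message|   # invoked once per published line
    puts message
  end
end

The subscribe call blocks, so each reader simply runs this loop and prints lines as they arrive.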
Keywords: Redis, mosquitto, pub/sub, publish, subscribe.

How to read every line from a txt file and print starting from the line which starts with "Created_Date" in shell scripting [duplicate]

This question already has answers here: How to get the part of a file after the first line that matches a regular expression (12 answers). Closed 4 years ago.
5G_Fixed_Wireless_Dashboard_TestScedule||||||||||||||||^M
Report Run Date||08/07/2018|||||||||||||||||||||^M
Requesting User Company||NEW|||||||||||||||||||||^M
Report Criteria|||||||||||||||||||||||^M
" Service Job Updated from Date:
Service Job Updated to Date:
Service Job Created from Date: 08/06/2018
Service Job Created to Date:
Service Job Status:
Resolution Code:"|||||||||||||||||||||||^M
Created Date|Job Status|Schedule Date|Job
Number|Service Job Type|Verizon Customer Order
Number|Verizon Location Code|Service|Installation
Duration|Part Number
I want to print starting from the line that begins with Created Date. The result file should be something like below:
Created Date|Job Status|Schedule Date|Job
Number|Service Job Type|Verizon Customer Order
Number|Verizon Location Code|Service|Installation
Duration|Part Number
I have tried the following lines after you linked me to some other questions, but my requirement is to write the result to the same file.
FILELIST=$(find $MFROUTDIR -maxdepth 1 -name "XXXXXX_5G_Order_*.txt")
for nextFile in $FILELIST; do
  cat $nextFile | sed -n -e '/Created Date/,$p'
done
With the above lines of code the output is printed on the console. Could you please suggest a way to write it to the same file?
This can be easily done with a simple awk command:
awk '/^Created Date/{p=1} p' file
Created Date|Job Status|Schedule Date|Job
Number|Service Job Type|Verizon Customer Order
Number|Verizon Location Code|Service|Installation
Duration|Part Number
We set a flag p to 1 when we encounter a line that starts with Created Date; after that, awk's default action prints each line while p==1.
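The question also asks how to write the result back to the same file. In keeping with the Ruby theme of the main question above, here is a hedged sketch of one way to do that: filter the lines in Ruby and replace the original file via a temporary file (the file name used here is only a placeholder):

require 'fileutils'

path = 'XXXXXX_5G_Order_sample.txt'     # hypothetical file name; adjust to the real report
lines = File.readlines(path)
start = lines.index { |l| l.start_with?('Created Date') }

if start
  tmp = "#{path}.tmp"
  File.open(tmp, 'w') { |f| f.write(lines[start..-1].join) }
  FileUtils.mv(tmp, path)               # replace the original file with the filtered version
end

The same write-to-a-temporary-file-then-rename pattern also works with the awk one-liner above.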
References:
Effective AWK Programming
Awk Tutorial

lftp - restart position

When trying to mirror using lftp I receive the following output (-d debugging mode):
<--- 227 Entering Passive Mode {some numbers}
---- Connecting data socket to (more numbers and port)
---- Data connection established
---> REST 0
<--- 350 Restart position accepted (0).
---> RETR {some filename}
When I open this file, the file is corrupted: the content of the file is shifted down by several lines and then a normal copy of the file is written on top of it. For example, if the file had five lines (line breaks not shown for compactness): line1 line2 line3 line4 line5, then the corrupted file would read: line1 line2 line3 line3 line4 line5.
Given the other problems I am experiencing with this ftp/network combination, I understand that this is not lftp's fault. However, I wonder if disabling restart position changes would somehow fix those corrupted files (at least it works for the other files). By reading the manual I can see these two options:
hftp:use-range (boolean)
when true, lftp will use Range header for transfer restart.
http:use-range (boolean)
when true, lftp will use Range header for transfer restart.
I don't know if this is relevant to what I am trying to achieve (force lftp to always download the data in full, without restarting position), or whether what I want is achievable in principle. I would try these options by actually running them, but I cannot see any predictable pattern in when files get corrupted and re-downloading the same files always gives the correct version. So any help is appreciated! :)
I am not sure if this is the solution, but based on the logs I think the problem for me was caused by get -c commands, so I removed --continue from the mirror job.

How to resume reading a file?

I'm trying to find the best and most efficient way to resume reading a file from a given point.
The given file is being written to frequently (it is a log file).
This file is rotated on a daily basis.
In the log file I'm looking for the pattern 'slow transaction'. Such lines end with a number in parentheses, and I want the sum of those numbers.
Example of log line:
Jun 24 2015 10:00:00 slow transaction (5)
Jun 24 2015 10:00:06 slow transaction (1)
This is the easy part, which I could do with an awk command to get a total of 6 for the example above.
Now my challenge is that I want to get the values from this file on a regular basis. I have an external system that polls a custom OID using SNMP. When hitting this OID, the Linux host runs a couple of basic commands.
I want this SNMP polling event to get the number of events since the last polling only. I don't want the total every time, just the total of the newly added lines.
Just to mention: only bash can be used, or basic commands such as awk, sed, tail, etc. No Perl or other advanced programming language.
I hope my description is clear enough. Apologies if this is a duplicate; I did some research before posting but did not find anything that precisely corresponds to my need.
Thank you for any assistance.
In addition to the methods in the comment link, you can also simply use dd and stat: read the logfile size, save it, sleep 300, then check the logfile size again. If the file size has changed, skip over the old information with dd and read only the new information.
Note: you can add a test to handle the case where the logfile is deleted and then recreated with size 0 (e.g. if ((newsize < size)), then read the whole file).
Here is a short example with 5-minute intervals:
#!/bin/bash

lfn=${1:-/path/to/logfile}

size=$(stat -c "%s" "$lfn")              ## save original log size

while :; do
    newsize=$(stat -c "%s" "$lfn")       ## get new log size
    if ((size != newsize)); then         ## if changed, use new info
        ## use dd to skip over the existing text to the new text
        newtext=$(dd if="$lfn" bs="$size" skip=1 2>/dev/null)
        ## process newtext however you need
        printf "\nnewtext:\n\n%s\n" "$newtext"
        size=$((newsize))                ## update size to newsize
    fi
    sleep 300
done
