How to split a real-time stdout stream into several files? - bash

I have a python script which is continuously writing a text stream to stdout.
Something like this (genstream.py):
import time
while 1:
    print(int(time.time()))
    time.sleep(1)
I want a bash script that launches the Python script and saves its output to a set of files, for example splitting the output every hour, to avoid creating one huge file that is difficult to manage.
The files created this way will then be processed (e.g. one at the end of each hour) by the same bash script, which inserts the values into a database and moves the files to an archive folder.
I searched Google and Stack Overflow (e.g. "split STDIN to multiple files (and compress them if possible)", "Bash reading STDOUT stream in real-time", or https://unix.stackexchange.com/questions/26175/ ) but I haven't found a solution so far.
I've also tried something simple like this (ignoring the time and splitting only on the number of lines):
python3 ./genstream.py | split -l5 -
but I have no output.
I've tried a combination of (named-)pipes and tee but nothing seems to work.

Try this (the -u flag stops Python from block-buffering its output when stdout is a pipe):
python3 -u ./genstream.py | while IFS= read -r line; do
    echo "$line" >> "split_$(date +%Y-%m-%d-%H)"
done
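The loop above can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: the name split_by_hour and the optional output-directory argument are assumptions made for illustration. The "no output" symptom in the question is most likely due to Python block-buffering its stdout when it is connected to a pipe, which is why the usage line passes -u.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append every line read from stdin to a file named
# after the hour in which the line arrived.
# $1 (assumed, optional) is the directory that receives the files.
split_by_hour() {
    local outdir=${1:-.} line
    while IFS= read -r line; do
        # The file name is recomputed per line, so output rolls over to a
        # new file as soon as the hour changes.
        printf '%s\n' "$line" >> "$outdir/split_$(date +%Y-%m-%d-%H)"
    done
}

# Usage (python3 -u makes Python write each line immediately):
#   python3 -u ./genstream.py | split_by_hour ./incoming
```

Once the hour has changed, no more lines are appended to the previous file, so it can be loaded into the database and moved to the archive folder.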

Related

Is there a way to save output from bash commands to a "file/variable" in bash without creating a file in your directory

I'm writing commands that do something like ./script > output.txt so that I can use the files in later scripts like ./script2 output.txt otherFile.txt > output2.txt. I remove them all at the end of the script, but when I'm testing certain things or debugging it's tricky to search through all my sub directories and files which have been created in the script.
Is the best option just to create a hidden file?
As always, there are numerous ways to do so. If you want to avoid files altogether, you can save the output (STDOUT) of a command in a variable and pass it to the next command as a file using the <() operator:
output=$(cat /usr/include/stdio.h)
cat <(echo "$output")
Alternatively, you can do so in a single command line:
cat <(cat /usr/include/stdio.h)
This assumes that the next command strictly requires a file for input.
I tend to avoid temporary files whenever possible, unless large amounts of data have to be processed, because that eliminates the need for a cleanup step that has to run in all cases.
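When intermediate files are unavoidable, one common pattern is to keep them in a private directory created with mktemp -d and to remove it from an EXIT trap, so nothing is left behind in the working directory. The sketch below assumes that pattern. The function name run_pipeline is hypothetical, and the printf and sort steps merely stand in for ./script and ./script2.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: the function body runs in a subshell "( ... )", so
# the EXIT trap fires when the function finishes, even on an error.
run_pipeline() (
    scratch=$(mktemp -d) || exit 1
    trap 'rm -rf "$scratch"' EXIT

    # Stand-in for: ./script > output.txt
    printf '3\n1\n2\n' > "$scratch/output.txt"
    # Stand-in for: ./script2 output.txt otherFile.txt > output2.txt
    sort "$scratch/output.txt" > "$scratch/output2.txt"

    cat "$scratch/output2.txt"
)
```

While debugging, commenting out the trap line leaves all intermediates in one place to inspect.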

Displaying stdout on screen and a file simultaneously

I'd like to log standard output from a script of mine to a file, but also have it displayed on screen for real-time monitoring. The script outputs something about 10 times every second.
I tried to redirect stdout to a file and then tail -f that file from another terminal, but for some reason tail is updating the screen significantly slower than the script is writing to the file.
What's causing this lag? Is there an alternate method of getting one standard output stream both on my terminal and into a file for later examination?
I can't say why tail lags, but you can use tee:
Redirect output to multiple files, copies standard input to standard output and also to any files given as arguments. This is useful when you want not only to send some data down a pipe, but also to save a copy.
Example: <command> | tee <outputFile>
How much of a lag do you see? A few hundred characters? A few seconds? Minutes? Hours?
What you are seeing is buffering. Almost all file reads and writes are buffered. This includes input and output, and there is also some buffering taking place within pipes. It's just more efficient to pass a packet of data around rather than a byte at a time. I believe file names on HFS+ file systems are stored in UTF-16 while Mac OS X normally uses UTF-8 as a default. (NTFS also stores file names using UTF-16 while Windows uses code pages for character data by default.)
So, if you run tail -f from another terminal, you may be seeing buffering from tail; when you use a pipe and then tee, you may have a buffer in the pipe and another in the tee command, which may be why you see the lag.
By the way, how do you know there's a lag? How do you know how quickly your program is writing to the disk? Do you print out something in your program to help track the writes to the file?
In that case, you might not be lagging as much as you think. File writes are also buffered. So, it is very possible that the lag isn't from the tail -f, but from your script writing to the file.
Use tee command:
tail -f /path/logFile | tee outfile
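A minimal sketch of the tee approach, assuming the script can be started through a wrapper. The function name run_logged is hypothetical. If the lag comes from the producer buffering its own output, as the answers above suggest, the wrapped command may also need line buffering, for example via stdbuf -oL from GNU coreutils (assumed to be installed) or python3 -u for a Python script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, show its stdout on screen and
# append the same stream to a log file.
# $1 is the log file; the remaining arguments are the command to run.
run_logged() {
    local logfile=$1
    shift
    "$@" | tee -a "$logfile"
}

# Usage sketches (the script names are placeholders):
#   run_logged run.log ./myscript
#   run_logged run.log stdbuf -oL ./myscript   # ask for line-buffered stdout
```

stdbuf only influences programs that rely on the C library's default buffering, so it is not guaranteed to help with every producer.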

How to read data and read user response to each line of data both from stdin

Using bash I want to read over a list of lines and ask the user if the script should process each line as it is read. Since both the lines and the user's response come from stdin how does one coordinate the file handles? After much searching and trial & error I came up with the example
exec 4<&0
seq 1 10 | while read number
do
read -u 4 -p "$number?" confirmation
echo "$number $confirmation"
done
Here we are using exec to reopen stdin on file handle 4, reading the sequence of numbers from the piped stdin, and getting the user's response on file handle 4. This seems like too much work. Is this the correct way of solving this problem? If not, what is the better way? Thanks.
You could just force read to take its input from the terminal, instead of the more abstract standard input:
while read number
do
< /dev/tty read -p "$number?" confirmation
echo "$number $confirmation"
done
The drawback is that you can't automate acceptance (by reading from a pipe connected to yes, for example).
Yes, using an additional file descriptor is a right way to solve this problem. Pipes can only connect one command's standard output (file descriptor 1) to another command's standard input (file descriptor 0). So when you're parsing the output of a command, if you need to obtain input from some other source, that other source has to be given by a file name or a file descriptor.
I would write this a little differently, making the redirection local to the loop, but it isn't a big deal:
seq 1 10 | while read number
do
read -u 4 -p "$number?" confirmation
echo "$number $confirmation"
done 4<&0
With a shell other than bash, in the absence of a -u option to read, you can use a redirection:
printf "%s? " "$number"; read confirmation <&4
You may be interested in other examples of using file descriptor reassignment.
Another method, as pointed out by chepner, is to read from a named file, namely /dev/tty, which is the terminal that the program is running in. This makes for a simpler script but has the drawback that you can't easily feed confirmation data to the script automatically.
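A variation on the extra-descriptor idea is to put the data on the spare descriptor and leave standard input for the answers. The sketch below assumes bash (for read -u and process substitution); the function name confirm_lines is hypothetical, and descriptor 3 is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run the command given as arguments, read its output
# on fd 3, and read one answer per item from ordinary stdin (fd 0).
confirm_lines() {
    local item answer
    while IFS= read -r -u 3 item; do
        # The prompt is only shown when stdin is a terminal.
        read -r -p "$item? " answer
        printf '%s %s\n' "$item" "$answer"
    done 3< <("$@")
}

# Interactive use:   confirm_lines seq 1 10
# Automated use:     yes | confirm_lines seq 1 10
```

Because the answers still come from stdin, this form keeps the option of piping in responses, which reading from /dev/tty gives up.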
For your application, killmatching, two passes is totally the right way to go.
In the first pass you can read all the matching processes into an array. The number will be small (dozens typically, tens of thousands at most) so there are no efficiency issues. The code will look something like
candidates=()
# "$pattern" is a placeholder for whatever you are matching
while IFS= read -r thing; do candidates+=("$thing"); done < <(ps | grep "$pattern")
(Syntactic details may be wrong; my bash is rusty.)
The second pass will loop through the candidates array and do the interaction.
Also, if it's available on your platform, you might want to look into pgrep. It's not ideal, but it may save you a few forks, which cost more than all the array lookups in the world.
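The two-pass idea can be sketched as follows, assuming bash 4 or later for mapfile. The function name confirm_candidates and the "answers starting with y mean yes" convention are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical two-pass helper: the command given as arguments produces
# the candidates; the user is then asked about each one.
confirm_candidates() {
    local -a candidates
    local item answer
    # Pass 1: collect every output line into an array.
    mapfile -t candidates < <("$@")
    # Pass 2: stdin is not tied up by a pipe, so a plain read gets the answer.
    for item in "${candidates[@]}"; do
        read -r -p "$item? " answer
        # Print only the items the user accepted.
        [[ $answer == y* ]] && printf '%s\n' "$item"
    done
    return 0
}

# Usage sketch (the pattern is a placeholder):
#   confirm_candidates pgrep -fl "$pattern"
```

Since the candidates are read in the current shell rather than at the end of a pipeline, the array survives into the second pass.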

why does redirect (<) not create a subshell

I wrote the following code
var=0
cat $file | while read line; do
    var=$line
done
echo $var
Now as I understand it, the pipe (|) will cause a subshell to be created, and therefore on the last line the variable var will still have the value it was given on line 1.
However this will solve it:
var=0
while read line; do
    var=$line
done < $file
echo $var
My question is why does the redirect not cause a subshell to be created, or if you like why does pipe cause one to be created?
Thanks
The cat command is an external command, which means it needs its own process and has its own STDIN and STDOUT. You're basically taking the STDOUT produced by the cat command and piping it into the process of the while loop.
When you use redirection, you're not using a separate process. Instead, you're merely redirecting the STDIN of the while loop from the console to the lines of the file.
Needless to say, the second way is more efficient. In the old Usenet days before all of you little whippersnappers got ahold of our Internet (Hey you kids! Get off of my Internet!) and destroyed it with your fancy graphics and all them web page, some people used to give out the Useless Use of Cat award to people who contributed to the comp.unix.shell group and had a spurious cat command, because the use of cat is almost never necessary and is usually less efficient.
If you're using a cat in your code, you probably don't need it. The cat command comes from concatenate and is supposed to be used only to concatenate files together. For example, when we used to use SneakerNet on 800K floppies, we would have to split up long files with the Unix split command and then use cat to merge them back together.
A pipe is there to hook the stdout of one program to the stdin of another one. Two processes, possibly two shells. When you do redirection (> and <), all you're doing is remapping stdin (or stdout) to a file. Reading or writing a file can be done without another process or shell.
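The difference is easy to observe. In this sketch the two function names are hypothetical; both read the file named by their first argument and report what var holds after the loop. It assumes bash's default settings. With shopt -s lastpipe in a script, bash 4.2 or later runs the last pipeline segment in the current shell, and the first function would then behave like the second.

```shell
#!/usr/bin/env bash
# Loop at the end of a pipeline: it runs in a subshell, so the assignment
# to var is lost when the subshell exits.
last_via_pipe() {
    local var=0 line
    cat "$1" | while read -r line; do var=$line; done
    echo "$var"
}

# Loop with redirected stdin: it runs in the current shell, so the
# assignment to var is still visible afterwards.
last_via_redirect() {
    local var=0 line
    while read -r line; do var=$line; done < "$1"
    echo "$var"
}

# For a file containing the lines 1, 2 and 3:
#   last_via_pipe file       # prints 0
#   last_via_redirect file   # prints 3
```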

Bash script to edit a bunch of files

To process a bunch of data and get it ready to be inserted into our database, we generate a bunch of shell scripts. Each of them has about 15 lines, one for each table that the data is going into. On a recent import batch, some of the import files failed going into one particular table. So, I have a bunch of shell scripts (about 600) where I need to comment out the first 7 lines, then rerun the file. There are about 6000 shell scripts in this folder, and nothing about a particular file can tell me if it needs the edit. I've got a list of the affected files that I pulled from the database output.
So how do I write a bash script (or anything else that would work better) to take this list of file names and for each of them, comment out the first 7 lines, and run the script?
EDIT:
#!/usr/bin/env sh
cmd1
cmd2
cmd3
cmd4
cmd5
cmd6
cmd7
cmd8
Not sure how readable that is. Basically, the first 7 lines (not counting the first line) need to have a # added to the beginning of them. Note: the files have been edited to make each line shorter and partially cut off copying out of VIM. But in the main part of each file, there is a line starting with echo, then a line starting with sqlldr
Using sed, you can specify a line number range in the file to be changed.
#!/bin/bash
while IFS= read -r line
do
    # comment out lines 3-9 and write the result to a new script
    sed '3,9 s/^/#/' "$line" > "$line.new"
    # run it with sh; exec would replace this shell and end the loop
    sh "$line.new" < /dev/null
done < "filelist.txt"
You may wish to test this before running it on all of those scripts...
EDIT: changed the line numbers to reflect comments.
Roughly speaking:
#!/bin/sh
for file in "$@"
do
    out=/tmp/$(basename "$file").$$
    sed '2,8s/^/#/' < "$file" > "$out"
    $SHELL "$out"
    rm -f "$out"
done
Assuming you don't care about checking for race conditions etc.
ex seems made for what you want to do.
For instance, for editing one file, with a here document:
#!/bin/sh
ex test.txt << END
1,12s/^/#/
wq
END
That'll comment out the first 12 lines in "test.txt". For your example you could try "$FILE" or similar (including quotes!).
Then run them the usual way, i.e. ./"$FILE"
edit: $SHELL "$FILE" is probably a better approach to run them (from one of the above commenters).
Ultimately you're going to want to use the linux command sed. Whatever logic you need to place in the script, you know. But your script will ultimately call sed. http://lowfatlinux.com/linux-sed.html
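Putting the pieces from the answers together, a list-driven version might look like the sketch below. The function name comment_and_run is hypothetical, the list file is assumed to hold one script path per line, and the range 2,8 assumes the seven commands directly follow a shebang on line 1, as in the question's example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: for each script named in the list file ($1), run a
# copy in which lines 2-8 are commented out. The originals are not touched.
comment_and_run() {
    local list=$1 script tmp
    # The list is read on fd 3 so the scripts being run cannot consume it.
    while IFS= read -r script <&3; do
        tmp=$(mktemp) || return 1
        # Prefix lines 2-8 with '#'; line 1 (the shebang) is left alone.
        sed '2,8s/^/#/' "$script" > "$tmp"
        sh "$tmp"
        rm -f "$tmp"
    done 3< "$list"
}

# Usage sketch:
#   comment_and_run filelist.txt
```

Running an edited copy rather than editing in place means a mistake in the line range leaves the generated scripts intact.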
