using cat in a bash script is very slow - bash

I have very big text files (~50,000) over which I have to do some text processing - basically run multiple grep commands.
When I run it manually it returns in an instant, but when I do the same in a bash script it takes a lot of time. What am I doing wrong in the bash script below? I pass the names of the files as command-line arguments to the script.
Example Input data :
BUSINESS^GFR^GNevil
PERSONAL^GUK^GSheila
Output that should go into a file - BUSINESS^GFR^GNevil
It starts printing the whole file to the terminal after quite a while. How do I suppress that?
#!/bin/bash
cat $2 | grep BUSINESS

Do NOT use cat with a program that can read the file itself.
It slows things down and you lose functionality:
grep BUSINESS test | grep -E '^GFR|^GDE'
Or you can do like this with awk
awk '/BUSINESS/ && /^GFR|^GDE/' test
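Putting that together for the original script, a minimal corrected sketch might look like this (writing the matches to a file is an assumption based on the question, and output.txt is a placeholder name):
#!/bin/bash
# Sketch: read the file named in $2 directly instead of piping it through cat,
# and redirect the matches to a file so nothing is printed on the terminal.
grep 'BUSINESS' "$2" > output.txt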

Related

How to make awk command run faster on large data files

I used this awk command below to create a new UUID column in a table in my existing .dat files.
$ awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' $filename > ${filename}.pk
The problem is that my .dat files are pretty big (like 50-60 GB) and this awk command takes hours even on small data files (like 15MB).
Is there any way to increase the speed of this awk command?
I wonder if you might save time by not having awk open and close uuidgen for every line.
$ function regen() { while true; do uuidgen; done; }
$ coproc regen
$ awk -v f="$filename" '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > "$filename".pk
This has awk reading your "real" filename from a variable, and the uuid from stdin, because the call to uuidgen is handled by a bash "coprocess". The funky bit around the getline is to tell awk to quit once it runs out of input from $filename. Also, note that awk is taking input from input redirection instead of reading the file directly. This is important; the file descriptor at /dev/fd/## is a bash thing, and awk can't open it.
This should theoretically save you time doing unnecessary system calls to open, run and close the uuidgen binary. On the other hand, the coprocess is doing almost the same thing anyway by running uuidgen in a loop. Perhaps you'll see some improvement in an SMP environment. I don't have a 50GB text file handy for benchmarking. I'd love to hear your results.
Note that coproc is a feature that was introduced with bash version 4. And use of /dev/fd/* requires that bash is compiled with file descriptor support. In my system, it also means I have to make sure fdescfs(5) is mounted.
I just noticed the following on my system (FreeBSD 11):
$ /bin/uuidgen -
usage: uuidgen [-1] [-n count] [-o filename]
If your uuidgen also has a -n option, then adding it to your regen() function with a reasonably large count might be a useful optimization, reducing the number of times the command needs to be restarted. For example:
$ function regen() { while true; do uuidgen -n 100; done; }
This would result in uuidgen being called only once every 100 lines of input, rather than for every line.
And if you're running Linux, depending on how you're set up, you may have an alternate source for UUIDs. Note:
$ awk -v f=/proc/sys/kernel/random/uuid '{getline u<f; close(f); print u,$0}' OFS="|" "$filename" > "$filename".pk
This doesn't require the bash coproc, it just has awk read a random uuid directly from a Linux kernel function that provides them. You're still closing the file handle for every line of input, but at least you don't have to exec the uuidgen binary.
YMMV. I don't know what OS you're running, so I don't know what's likely to work for you.
Your script is calling shell to call awk to call shell to call uuidgen. Awk is a tool for manipulating text; it's not a shell (an environment to call other tools from), so don't do that. Just call uuidgen from the shell:
$ cat file
foo .*
bar stuff
here
$ xargs -d $'\n' -n 1 printf '%s|%s\n' "$(uuidgen)" < file
5662f3bd-7818-4da8-9e3a-f5636b174e94|foo .*
5662f3bd-7818-4da8-9e3a-f5636b174e94|bar stuff
5662f3bd-7818-4da8-9e3a-f5636b174e94|here
I'm just guessing that the real problem here is that you're running a sub-process for each line. You could read your file explicitly line by line and read output from a batch-uuidgen line by line, and thus only have a single subprocess to handle at once. Unfortunately, uuidgen doesn't work that way.
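As a variation on that idea, here is a hedged sketch that pairs each input line with a UUID without starting any per-line process, assuming a Linux system where /proc/sys/kernel/random/uuid is available (this reuses the kernel UUID source mentioned in the earlier answer):
paste -d '|' \
    <(while IFS= read -r _; do
          read -r u < /proc/sys/kernel/random/uuid   # each open yields a fresh UUID
          printf '%s\n' "$u"
      done < "$filename") \
    "$filename" > "${filename}.pk"
paste joins the generated UUID column with the original file line by line, so the output keeps the uuid|line format from the question.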
Maybe another solution?
perl -MData::UUID -ple 'BEGIN{ $ug = Data::UUID->new } $_ = lc($ug->to_string($ug->create)) . " | " . $_' $filename > ${filename}.pk
Might this be faster?

execute bash command depending on keyword

I am trying to provide a file to my shell script as input; the script should test whether the file contains a specific word and decide which command to execute. I can't yet figure out where the mistake lies. Please find the shell script that I wrote:
#!/bin/(shell)
input_file="$1"
output_file="$2"
grep "val1" | awk -f ./path/to/script.awk $input_file > $output_file
grep "val2" | sh ./path/to/script.sh $input_file > $output_file
When I input the file that uses awk, everything gets executed as expected, but for the second command I don't even get an output file. Any help is much appreciated.
Cheers,
You haven't specified this in your question, but I'm guessing you have a file with the keyword, e.g. file cmdfile that contains x-g301. And then you run your script like:
./script "input_file" "output_file" < cmdfile
If so, the first grep command will consume the whole cmdfile on stdin while searching for the first pattern, and nothing will be left for the second grep. That's why the second grep, and then your second script, produces no output.
There are many ways to fix this, but choosing the right one depends on what exactly you are trying to do and what that cmdfile looks like. Assuming it's a larger file with more in it than just the command pattern, you could pass it as a third argument to your script, like this:
./script "input_file" "output_file" "cmdfile"
And have your script handle it like this:
#!/bin/bash
input_file="$1"
output_file="$2"
cmdfile="$3"
if grep -q "X-G303" "$cmdfile"; then
awk -f ./mno/script.awk "$input_file" > t1.json
fi
if grep -q "x-g301" "$cmdfile"; then
sh ./mno/tm.sh "$input_file" > t2.json
fi
Here I'm also assuming that your awk and sh scripts don't really need the output from grep, since you're giving them the name of the input file.
Note that the proper way to use grep for an existence check is via its exit code (with the output muted by -q). Instead of the if we could have used short-circuiting (grep ... && awk ...), but this way is probably more readable.
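For reference, the short-circuit form mentioned above would look like this (same paths and keywords as in the script):
grep -q "X-G303" "$cmdfile" && awk -f ./mno/script.awk "$input_file" > t1.json
grep -q "x-g301" "$cmdfile" && sh ./mno/tm.sh "$input_file" > t2.json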

how to send text to a process in a shell script?

So I have a Linux program that runs in a while(true) loop, waiting for user input, processing it and printing the result to stdout.
I want to write a shell script that opens this program, feeds it lines from a txt file one line at a time, and saves the program's output for each line to a file.
So I want to know if there is any command for:
- open a program
- send text to a process
- receive output from that program
Many thanks.
It sounds like you want something like this:
cat file | while read line; do
    answer=$(echo "$line" | prog)
done
This will run a new instance of prog for each line. The line will be the standard input of prog and the output will be put in the variable answer for your script to further process.
Some people object to the "cat file |" as this creates a process where you don't really need one. You can also use file redirection by putting it after the done:
while read line; do
    answer=$(echo "$line" | prog)
done < file
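If the goal is to keep a single long-running instance of prog (as the question describes) rather than starting one per line, a bash coproc can do that. This is only a sketch: prog, input.txt and output.txt are placeholder names, and it assumes prog replies with exactly one output line per input line and does not block-buffer its output when writing to a pipe (stdbuf -oL prog may help if it does):
#!/bin/bash
# Sketch: keep ONE instance of prog running and feed it lines one at a time.
coproc prog
while IFS= read -r line; do
    printf '%s\n' "$line" >&"${COPROC[1]}"   # send the line to prog's stdin
    IFS= read -r answer <&"${COPROC[0]}"     # read prog's one-line reply
    printf '%s\n' "$answer"
done < input.txt > output.txt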
Have you looked at pipes and redirections? You can use pipes to feed input from one program into another. You can use redirection to send the contents of files to programs, and/or write output to files.
I assume you want a script written in bash.
To open a file you just need to give its name.
To send text to a program you either pipe it in with | or use < (take input from a file).
To receive output you use > to redirect it to a file, or >> to do the same but append the results instead of truncating the file.
To achieve what you want in bash, you could write:
#!/bin/bash
cat input_file | xargs -I {} your_program {} >> output_file
This calls your_program once for each line of input_file and appends the results to output_file.

Best way to modify a file when using pipes?

I often have shell programming tasks where I run into this pattern:
cat file | some_script > file
This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).
Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:
cat file | bufferUntilEOF | script > file
Ideas?
Like many others, I like to use temporary files. I use the shell process-id as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file
This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:
some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$
BTW, I got rid of the poor use of cat and replaced it with input redirection.
You're looking for sponge.
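sponge comes from the moreutils package; it soaks up all of its standard input before opening and writing the named file, so the pattern from the question becomes:
some_script < file | sponge file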
Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell is free to delete and overwrite the output file before any command is executed (during pipeline setup).
Another option is just to read the file into a variable:
file_contents=$(cat file)
echo "$file_contents" | script1 | script2 > file
Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.
In response to the OP's question above about using sponge without external dependencies, and building on @D.Shawley's answer, you can have the effect of sponge with only a dependency on gawk, which is not uncommon on Unix or Unix-like systems:
cat foo | gawk -voutfn=foo '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
The first print uses > so that the output file is truncated before the buffered lines are written back; the NR>0 check avoids writing a spurious blank line when the input is empty.
To use this in a shell script, change -voutfn=foo to -voutfn="$1" or whatever syntax your shell uses for filename arguments. For example:
#!/bin/bash
cat "$1" | gawk -voutfn="$1" '{lines[NR]=$0;} END {if(NR>0){print lines[1]>outfn;} for(i=2;i<=NR;++i) print lines[i] >> outfn;}'
Note that, unlike real sponge, this may be limited to the size of RAM. sponge actually buffers in a temporary file if necessary.
Using a temporary file is IMO better than attempting to buffer the data in the pipeline.
It almost defeats the purpose of pipelines to buffer them.
I think you need to use mktemp. Something like this will work:
FILE=example-input.txt
TMP=`mktemp`
some_script <"$FILE" >"$TMP"
mv "$TMP" "$FILE"
I think that the best way is to use a temp file. However, if you want another approach, you can use something like awk to buffer up the input into memory before your application starts receiving input. The following script will buffer all of the input into the lines array before it starts to output it to the next consumer in the pipeline.
{ lines[NR] = $0; }
END {
    for (line_no=1; line_no<=NR; ++line_no) {
        print lines[line_no];
    }
}
You can collapse it into a one-liner if you want:
cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file
Note, though, that buffering inside awk does not by itself make writing back to the same file safe: as explained above, the shell truncates file when it sets up the > redirection, before cat ever reads it.
With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.

How to run the first process from a list in a file deleting the first line as if the file was a queue and I called "pop"?

How to run the first process from a list of processes stored in a file and immediately delete the first line as if the file was a queue and I called "pop"?
I'd like to call the first command listed in a simple text file with \n as the separator in a pop-like fashion:
Figure 1:
cmdqueue.lst :
proc_C1
proc_C2
proc_C3
.
.
Figure 2:
Pop the first command via popcmd:
proc_A | proc_B | popcmd cmdqueue.lst | proc_D
Figure 3:
cmdqueue.lst :
proc_C2
proc_C3
proc_C4
.
.
Ooh, that's an amusing one-liner.
Okay, here's the deal. What you want is a program that, when called, prints the first line of the file to stdout, then deletes that line from the file. Sounds like a job for sed(1).
Try
proc_A | proc_B | `(head -1 cmdstack.lst; sed -i -e '1d' cmdstack.lst)` | proc_D
I'm sure that someone who has already had their coffee could change the sed program to not need the head(1) call, but that works, and shows off using a subshell ("( foo )" runs in a sub-process).
pop-cmd.py:
#!/usr/bin/env python
import os, shlex, sys
from subprocess import call
filename = sys.argv[1]
lines = open(filename).readlines()
if lines:
    command = lines[0].rstrip()
    open(filename, "w").writelines(lines[1:])
    if command:
        sys.exit(call(shlex.split(command) + sys.argv[2:]))
Example:
proc_A | proc_B | python pop-cmd.py cmdstack.lst | proc_D
I assume that you are constantly appending to the file also, so rewriting the file puts you in danger of overwriting data. For this type of task I think you would be better using individual files for each queue entry, using date/time to determine order, and then as you process each file you could append the data to a log file and then delete the trigger file.
We really need more information in order to suggest a good solution. It's important to know how the file is being updated: is it a lot of separate processes, just one process, etc.?
I think you would need to rewrite the file - e.g. run a command to list all lines but the first, write that to a temporary file and rename it to the original. That could be done using tail or awk or perl depending on the commands you have available.
If you want to treat a file like a stack, then a better approach would be to have the top of the stack at the end of the file.
Thus you can easily cut off the file at the beginning of the last line (= pop), and simply append to the file as you push.
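A minimal sketch of that idea (assuming GNU sed for in-place editing, and using the cmdstack.lst name from the answers above):
push() { printf '%s\n' "$1" >> cmdstack.lst; }                 # push: append a line
pop()  { tail -n 1 cmdstack.lst; sed -i '$d' cmdstack.lst; }   # pop: print and remove the last line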
You can use a little bash script; name it "popcmd":
#!/bin/bash
cmd=`head -n 1 $1`
tail -n +2 $1 > ~tmp~
mv -f ~tmp~ $1
$cmd
edit: Using sed for the middle two lines, like Charlie Martin showed, is much more elegant, of course:
#!/bin/bash
cmd=`head -n 1 $1`
sed -i -e '1d' $1
$cmd
edit: You can use this exactly as in your example usage code:
proc_A | proc_B | popcmd cmdstack.lst | proc_D
You can't delete from the beginning of a file in place, so cutting out line 1 means rewriting the rest of the file (which isn't actually that much work for the programmer: it's what every other answer here has written for you :) ).
I'd recommend keeping the whole thing in memory and using a classic stack rather than a file.
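A rough sketch of keeping the whole thing in memory in a bash array instead of a file (the proc_C names are the placeholders from the question):
queue=(proc_C1 proc_C2 proc_C3)   # the whole list lives in memory
cmd=${queue[0]}                   # "pop" the first entry
queue=("${queue[@]:1}")
$cmd                              # run it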
