Is there any write buffer in bash programming?

Is there any write-to-file buffer in bash? And if there is, is it possible to change its size?
Here is the problem:
I have a bash script which reads a file line by line, manipulates the read data, and then writes the result into another file. Something like this:
while read line
    some grep, cut and sed
    echo and append to another file
The input data is really huge (nearly a 20 GB text file). Progress is slow, so the question arises: if the default behavior of bash is to write the result to the output file for each line read, then progress will be slow.
So I want to know: is there any mechanism to buffer some of the output and then write that chunk to the file? I searched the internet about this but didn't find any useful information.
Is it an OS-related question or a bash one? The OS is CentOS release 6.
The script is:
#!/bin/bash
BENCH=$1
grep "CPU 0" $BENCH > `pwd`/$BENCH.cpu0
grep -oP '(?<=<[vp]:0x)[0-9a-z]+' `pwd`/$BENCH.cpu0 | sed 'N;s/\n/ /' | tr '[:lower:]' '[:upper:]' > `pwd`/$BENCH.cpu0.data.VP
echo "grep done"
while read line ; do
    w1=`echo $line | cut -d ' ' -f1`
    w11=`echo "ibase=16; $w1" | bc`
    w2=`echo $line | cut -d ' ' -f2`
    w22=`echo "ibase=16; $w2" | bc`
    echo $w11 $w22 >> `pwd`/$BENCH.cpu0.data.VP.decimal
done <"`pwd`/$BENCH.cpu0.data.VP"
echo "conversion done"

Each echo-and-append in your loop opens and closes the output file, which may have a negative impact on performance.
A likely better approach (and you should profile) is simply:
grep 'foo' "$input_file" | sed 's/bar/baz/' | [any other stream operations] > "$output_file"
If you must keep the existing structure, then an alternative approach would be to create a named pipe:
mkfifo buffer
Then create two processes: one which writes into the pipe, and one which reads from the pipe.
#proc1: filter each line and write into the pipe
while IFS= read -r line; do
    printf '%s\n' "$line" | grep 'foo' | sed 's/bar/baz/'
done < "$input_file" > buffer
#proc2: drain the pipe, appending to the output file (opened only once)
while IFS= read -r line; do
    echo "$line"
done < buffer >> "$output_file"
In reality I would expect the bottleneck to be entirely file IO, but this does create an independence between the reading and writing, which may be desirable.
If you have 20GB of RAM lying around, it may improve performance to use a memory mapped temporary file instead of a named pipe.
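One hedged way to approximate that on CentOS is to stage the intermediate file on the RAM-backed tmpfs mounted at /dev/shm (the path and file names here are illustrative, not from the question):
# sketch: stage the intermediate data in RAM instead of a fifo
tmp=$(mktemp /dev/shm/buffer.XXXXXX)
# writer: filter the input into the RAM-backed file
grep 'foo' "$input_file" | sed 's/bar/baz/' > "$tmp"
# reader: consume it, append to the real output, then clean up
cat "$tmp" >> "$output_file"
rm -f "$tmp"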

Just to see what the differences were, I created a file containing a bunch of lines like:
    a somewhat long string followed by a number: 0000001
with 10,000 lines in total (about 50 MiB), and then ran it through a shell read loop:
while read line ; do
    echo $line | grep '00$' | cut -d " " -f9 | sed 's/^00*//'
done < data > data.out
Which took almost 6 minutes, compared with the equivalent:
grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast
which took 0.2 seconds. To remove the cost of the forking, I tested:
while read line ; do
    :
done < data > data.null
where : is a shell built-in which does nothing at all. As expected, data.null had no contents, and the loop still took 21 seconds to run through my small file. I wanted to test against a 20 GB input file, but I'm not that patient.
Conclusion: learn how to use awk or perl, because you will wait forever if you try to use the script you posted (which appeared while I was writing this).
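To make the awk suggestion concrete, here is a hedged sketch of the hex-to-decimal loop from the question as a single gawk pass (strtonum is gawk-specific, and awk's floating-point numbers can lose precision above 2^53, so treat this as a starting point rather than a drop-in replacement):
# one awk process instead of four subshells per line (gawk only)
awk '{ printf "%d %d\n", strtonum("0x" $1), strtonum("0x" $2) }' \
    "$BENCH.cpu0.data.VP" > "$BENCH.cpu0.data.VP.decimal"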

Related

Bash Iterative approach in place of process substitution not working as expected

Complete bash noob here. I had the following command (1.) and it worked as expected, but it seemed a bit naive for what I needed:
essentially generating a wordlist from a messy input file with tab delimiters.
cat users.txt | tee >(cut -f 1 >> cut_out.txt) >(cut -f 2 >> cut_out.txt) >(cut -f 3 >> cut_out.txt) >(cut -f 4 >> cut_out.txt)
Output:
W Humphrey
SummersW
FoxxR
noreply
DaibaN
PeanutbutterM
PetersJ
DaviesJ
BlaireJ
GongoH
MurphyF
JeffersD
HorsemanB
...
I thought I could cut down on the ridiculous command above with the following:
cat users.txt | for i in {1..4}; do cut -f $i >> cut_out.txt; done
Output:
HumphreyW
The command above only returned a single word from the list and some whitespace.
The solution: I knew that I could get it working logically by simply looping the entire command instead, and this did exactly what I wanted:
for i in {1..4}; do cat users.txt | cut -f $i >> cut_out.txt; done
But I just wanted to know why the command above (2.) returned an almost empty file. I have a solution; I mostly wanted an explanation, because I am still learning about I/O in bash. Cheers.
Just a remark:
awk -F '[\t]' '{for(i = 1; i <= 4; i++) print $i}' users.txt > cut_out.txt
is basically what your cat ... | tee >(cut ...) ... does.
If the order of the output is unimportant, and there are only four columns in the file, simply:
tr '\t' '\n' <users.txt >cut_out.txt
If you only want the first four columns in any order,
cut -f1-4 users.txt |
tr '\t' '\n' >cut_out.txt
(Thanks to @KamilCuk for raising this in a comment.)
Otherwise your third attempt is basically fine, though you want to avoid the useless cat and redirect only once:
for i in {1..4}; do
    cut -f "$i" users.txt
done > cut_out.txt
This is obviously less efficient than only reading the file once. If the file is small enough to fit into memory, you could write a simple Awk script to read it once and split it up into variables, and then write out these variables in the order you want.
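A hedged sketch of that idea: buffer each of the first four columns during a single read of the file, then emit them column by column (the field count of four is taken from the question):
# one pass: accumulate each column in memory, print grouped at the end
awk -F'\t' '
    { for (i = 1; i <= 4; i++) col[i] = col[i] $i "\n" }
    END { for (i = 1; i <= 4; i++) printf "%s", col[i] }
' users.txt > cut_out.txt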
The second attempt is wrong because cat supplies only a single instance of the data to the pipe, and the first cut in the loop consumes all of it, so the remaining iterations read from an already-exhausted stream.
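A minimal demonstration of that consumption effect (seq stands in for the real data): the first command in the loop drains the whole pipe, so later iterations see end-of-file:
seq 5 | for i in 1 2 3; do
    # head reads a full buffer, not just one line, so it empties the pipe
    echo "iteration $i: $(head -1)"
done
# iteration 1: 1
# iteration 2:
# iteration 3: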

shell script performance improvement

I have developed the script below to preprocess a file.
I'm trying to extract the timestamp from the header of the file and delete some characters at the end of every line, based on the length of the timestamp. Once they are deleted, the script appends the timestamp to every line in the file. This script takes nearly 30 minutes to process a 4 GB file.
Is there a way I can increase the performance? Could this script be written in a better way?
if [ -f INPUT.TXT ]; then
    echo "FILE exists."
    date=$(cut -c8-25 INPUT.TXT | head -1)
    date_format=$(echo $date | sed -e "s/\./\:/g")
    echo -e " header date value is : $date"
    echo -e "Header date value format is: $date_format"
    leng_t=${#date_format}
    len=`expr $leng_t + 1`
    sed -i "s/.\{${len}\}$//" INPUT.TXT
    sed -i s/$/$date_format/ INPUT.TXT
else
    echo "FILE does not exist."
fi
The main optimization comes from combining the two consecutive seds into one.
Instead of:
sed -i "s/.\{${len}\}$//" INPUT.TXT
sed -i s/$/$date_format/ INPUT.TXT
Use:
sed -i "s/.\{$len\}$/$date_format/" INPUT.TXT
This should cut the execution time roughly in half.
This result is the baseline for measuring the gain from all subsequent optimizations.
All subsequent optimizations require additional disk space to store a copy of INPUT.TXT (i.e. an additional 4 GB):
Try putting the result in a separate file instead of editing it in place:
sed "s/.\{$len\}$/$date_format/" INPUT.TXT >INPUT.tmp.TXT
mv -f INPUT.tmp.TXT INPUT.TXT
This saves ~10% relative to the baseline.
On a multi-core machine, this should run faster:
rev INPUT.TXT | sed "s/^.\{$len\}//" | rev | sed "s/\$/$date_format/" >INPUT.tmp.TXT
mv -f INPUT.tmp.TXT INPUT.TXT
This saves ~35% relative to the baseline.
On a multi-core machine, and if there are no multibyte characters in the input (because cut still can't handle them):
let cut_len=$len+1
rev INPUT.TXT | cut -c $cut_len- | rev | sed "s/\$/$date_format/" >INPUT.tmp.TXT
mv -f INPUT.tmp.TXT INPUT.TXT
This saves ~50% relative to the baseline.
Thus, with the best optimization, the script can run about four times faster than the original.
Note: all tests were done with a 400 MB file.
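For comparison, a hedged single-pass awk alternative with the same semantics (drop the last len characters of each line and append the timestamp); this is a sketch and was not part of the measured tests:
awk -v n="$len" -v d="$date_format" \
    '{ print substr($0, 1, length($0) - n) d }' INPUT.TXT > INPUT.tmp.TXT
mv -f INPUT.tmp.TXT INPUT.TXT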

Bash script working with second column from txt but keep first column in result as relevant

I am trying to write a bash script to ease a process of IP information gathering.
Right now I have a script which runs through one column of IP addresses in multiple files, looks up geo and host information, and stores it in a new file.
What would also be nice is a script that generates results from files with three columns: date, time, IP address. The separator is a space.
I tried this and that, but no luck. I am a total newbie :)
This is my original script:
#!/usr/bin/env bash
find *.txt -print0 | while read -d $'\0' file; do
    for i in $( cat "$file" ); do
        echo -e "$i,"$( geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" $i | cut -d' ' -f6,8-9)" "$(nslookup $i | grep name | awk '{print $4}')"" >> "res/res-"$file".txt"
    done
done
Input file example
2014-03-06 12:13:27 213.102.145.172
2014-03-06 12:18:24 83.177.253.118
2014-03-25 15:42:01 213.102.155.173
2014-03-25 15:55:47 213.101.185.223
2014-03-26 15:21:43 90.130.182.2
Can you please help me with this?
It's not entirely clear what the current code is attempting to do, but here is a hopefully useful refactoring which could at least be a starting point.
#!/usr/bin/env bash
find *.txt -print0 | while read -r -d '' file; do
    while read -r date time ip; do
        geo=$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" "$ip" |
            cut -d' ' -f6,8-9)
        addr=$(nslookup "$ip" | awk '/name/ {print $4}')
        #addr=$(dig +short -x "$ip")
        echo "$date $time $ip $geo $addr"
    done <"$file" >"res/res-$file.txt"
done
My copy of nslookup does not output four fields, but I assume that part of your script is correct. The output from dig +short is better suited to machine processing, so maybe switch to that instead. Perhaps geoiplookup also offers an option to output machine-readable results, or maybe there is an alternative interface which does.
I assume it was a mistake that your script produced partially comma-separated, partially whitespace-separated results, so I changed that, too. Maybe you should use CSV or JSON instead if you intend for other tools to be able to read this output.
Trying to generate a file named res/res-$file.txt will only work if file is not in any subdirectory, so I'm guessing you will want to fix that with basename; or perhaps the find loop should be replaced with a simple for file in *.txt instead.
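A hedged sketch combining those two suggestions (basename for subdirectory-safe output names, and dig +short for machine-friendly reverse lookups):
#!/usr/bin/env bash
find . -name '*.txt' -print0 | while read -r -d '' file; do
    base=$(basename "$file" .txt)
    while read -r date time ip; do
        geo=$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" "$ip" |
            cut -d' ' -f6,8-9)
        addr=$(dig +short -x "$ip")   # machine-friendly reverse lookup
        echo "$date $time $ip $geo $addr"
    done <"$file" >"res/res-$base.txt"
done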

Reading a file line by line in ksh

We use a package called Autosys, and there are some commands specific to this package. I have a list of variables which I would like to pass, one by one, to one of the Autosys commands.
For example, one such variable is var1. Using var1, I would like to launch a command something like this:
autosys_showJobHistory.sh var1
Now when I launch the command written below, it gives me the desired output:
echo "var1" | while read line; do autosys_showJobHistory.sh $line | grep 1[1..6]:[0..9][0..9] | grep 24.12.2012 | tail -1 ; done
But if I put var1 in a file, say Test.txt, and launch the same command using cat, it gives me nothing. I have the impression that the command autosys_showJobHistory.sh does not work in that case.
cat Test.txt | while read line; do autosys_showJobHistory.sh $line | grep 1[1..6]:[0..9][0..9] | grep 24.12.2012 | tail -1 ; done
What am I doing wrong in the second command?
I wrote all of the below, and then noticed your grep statement.
Recall that ksh doesn't support .. as an indicator for 'expand this range of values'. (I assume that's your intent.) It's also made ambiguous by your unquoted arguments to grep: if you were using syntax that the shell would convert, then you wouldn't really know what regexp is being sent to grep. Always quote arguments, unless you know for sure that you need the unquoted values. Try rewriting as:
grep '1[1-6]:[0-9][0-9]' | grep '24.12.2012'
Also, are you deliberately using the 'match any char' operator '.', or do you want to match only a period character? If you want to match only a period, then you need to escape it, like \.
Finally, if any of the files you're processing were created on a Windows machine and then transferred to Unix/Linux, it is very likely that the line endings (Ctrl-M Ctrl-J, i.e. \r\n) are causing you problems. Clean up your PC-based files (or anything that was sent via ftp) with dos2unix file [file2 ...].
If the above doesn't help, you'll have to "divide and conquer" to debug your problem.
When I did the following tests, I got the expected output
$ echo "var1" | while read line ; do print "line=${line}" ; done
line=var1
$ vi Test.txt
$ cat Test.txt
var1
$ cat Test.txt | while read line ; do print "line=${line}" ; done
line=var1
Unrelated to your question, but certain to cause comment, is your use of the cat command in this context, which will bring you the UUOC (Useless Use Of Cat) award. That can be rewritten as:
while read line ; do print "line=${line}" ; done < Test.txt
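Applying that, plus the quoting advice from above, the whole command could be written as (a sketch; the corrected grep patterns are the ones suggested earlier):
while read line ; do
    autosys_showJobHistory.sh "$line" |
        grep '1[1-6]:[0-9][0-9]' |
        grep '24\.12\.2012' |
        tail -1
done < Test.txt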
But to solve your problem, now turn on the shell debugging/trace options, either by changing the top line of the script (the shebang line) like
#!/bin/ksh -vx
Or by using a matched pair of set commands to trace just these lines, i.e.
set -vx
while read line; do
    print -u2 -- "#dbg: Line=${line}XX"
    autosys_showJobHistory.sh $line \
        | grep '1[1-6]:[0-9][0-9]' \
        | grep '24\.12\.2012' \
        | tail -1
done < Test.txt
set +vx
I've added an extra debug step, the print -u2 -- .... line (-u2 writes to stderr; -- ends option processing for print).
Now you can make sure no extra space or tab chars are creeping in, by looking at that output.
They shouldn't matter, as you have left your $line unquoted. As part of your testing, I'd recommend quoting it like "${line}".
Then I'd comment out the tail and the grep lines. You want to see which step is causing this to break, right? So does the autosys script by itself still produce the intermediate output you're expecting? Then does autosys plus one grep produce output as expected? Plus two greps? Plus tail? You should be able to see easily where you're losing your output.
IHTH

How to handle variables that contain ";"?

I have a configuration file that contains lines like hallo;welt; and I want to run grep on this file.
Whenever I try something like grep "$1;$2" my.config or echo "$1;$2", or even line="$1;$2", my script fails with something like:
: command not found95: line 155: =hallo...
How can I tell bash to ignore ; while evaluating "..." blocks?
EDIT: an example of my code.
# find entry
$line=$(grep "$1;$2;" $PERMISSIONSFILE)
# splitt line
reads=$(echo $line | cut -d';' -f3)
writes=$(echo $line | cut -d';' -f4)
admins=$(echo $line | cut -d';' -f5)
# do some stuff on the permissions
# replace old line with new line
nline="$1;$2;$reads;$writes;$admins"
sed -i "s/$line/$nline/g" $TEMPPERM
My script should be called like this: sh script "table" "a.b.*.>"
EDIT: another, simpler example
$test=$(grep "$1;$2;" temp.authorization.config)
the temp file:
table;pattern;read;write;stuff
the call sh test.sh table pattern results in: : command not foundtable;pattern;read;write;stuff
Don't use $ on the left side of an assignment in bash. If you do, the current value of the variable is substituted and the resulting word is executed as a command, rather than performing an assignment. That is, use:
test=$(grep "$1;$2;" temp.authorization.config)
instead of:
$test=$(grep "$1;$2;" temp.authorization.config)
Edit: also, variable expansions should be in double-quotes unless there's a good reason otherwise. For example, use:
reads=$(echo "$line" | cut -d';' -f3)
instead of:
reads=$(echo $line | cut -d';' -f3)
This doesn't matter for semicolons, but does matter for spaces, wildcards, and a few other things.
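A small demonstration of the difference (the value is illustrative):
line='hallo;welt;  two  spaces'
echo $line     # unquoted: word splitting collapses the double spaces
echo "$line"   # quoted: the value is passed through exactly as stored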
A ; inside quotes has no meaning at all to bash. However, if $1 contains a double quote itself, then you'll end up with
grep "something";$2"
which'll be parsed by bash as two separate commands:
grep "something" ; other"
^---command 1----^ ^----command 2---^
Please show exactly what your script is doing around the spot where the error occurs, and what data you're feeding into it.
Counter-example:
$ cat file.txt
hello;welt;
hello;world;
hell;welt;
$ cat xx.sh
grep "$1;$2" file.txt
$ bash -x xx.sh hello welt
+ grep 'hello;welt' file.txt
hello;welt;
$
You have not yet classified your problem accurately.
If you try to assign the result of grep to a variable (like I do) your example breaks.
Please show what you mean. Using the same data file as before and doing an assignment, this is the output I get:
$ cat xx.sh
grep "$1;$2" file.txt
output=$(grep "$1;$2" file.txt)
echo "$output"
$ bash -x xx.sh hello welt
+ grep 'hello;welt' file.txt
hello;welt;
++ grep 'hello;welt' file.txt
+ output='hello;welt;'
+ echo 'hello;welt;'
hello;welt;
$
Seems to work for me. It also demonstrates why the question needs an explicit, complete, executable, minimal example, so that people answering can see what the questioner is doing differently from what they think is happening.
I see you've provided some sample code:
# find entry
$line=$(grep "$1;$2;" $PERMISSIONSFILE)
# splitt line
reads=$(echo $line | cut -d';' -f3)
writes=$(echo $line | cut -d';' -f4)
admins=$(echo $line | cut -d';' -f5)
The line $line=$(grep ...) is wrong. You should omit the $ before line. Although bash accepts the syntax, it does not perform an assignment: $line is expanded first, and the resulting word (=hallo... once the grep output is substituted) is then executed as a command name. That is exactly where your ': command not found' errors come from.
For safety if nothing else, I would also enclose the $line values in double quotes in the echo lines. It may not strictly be necessary, but it is simple protective programming.
The changes lead to:
# find entry
line=$(grep "$1;$2;" $PERMISSIONSFILE)
# split line
reads=$( echo "$line" | cut -d';' -f3)
writes=$(echo "$line" | cut -d';' -f4)
admins=$(echo "$line" | cut -d';' -f5)
The rest of your script was fine.
It seems like you are trying to read a semicolon-delimited file, identify a line starting with table;pattern; (where table is a string you specify and pattern is a regular expression grep will understand), and, once the line is identified, replace the 3rd, 4th and 5th fields with different data and write the updated line back to the file.
Does this sound correct?
If so, try this code
#!/bin/bash
in_table="$1"
in_pattern="$2"
file="$3"

# grep -n prefixes each matching line with its line number, so each
# record looks like: NUM:table;pattern;reads;writes;admins
while IFS=';' read -r tuple pattern reads writes admins ; do
    line=$(cut -d: -f1 <<<"$tuple")    # line number from grep -n
    table=$(cut -d: -f2 <<<"$tuple")   # the real first field
    # do some stuff with the variables
    # e.g., update the values
    reads=1
    writes=2
    admins=12345
    # replace the old line with the new one: insert the new text
    # before line $line, then delete the old line
    sed -i'' $line'{i\
'"$table;$pattern;$reads;$writes;$admins"'
;d;}' "$file"
done < <(grep -n '^'"${in_table}"';'"${in_pattern}"';' "${file}")
I chose to update by line number here to avoid problems with unknown characters on the left-hand side of the substitution.
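A hypothetical invocation, following the call pattern from the question (the script name is made up):
# table = literal first field, 'a.b.*.>' = grep pattern for the second
sh update_permissions.sh table 'a.b.*.>' temp.authorization.config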
