I used the awk command below to add a new UUID column to a table in my existing .dat files.
$ awk '("uuidgen" | getline uuid) > 0 {print uuid "|" $0} {close("uuidgen")}' $filename > ${filename}.pk
The problem is that my .dat files are pretty big (50-60 GB), and this awk command takes hours to run even on small data files (like 15 MB).
Is there any way to increase the speed of this awk command?
I wonder if you might save time by not having awk open and close uuidgen every line.
$ function regen() { while true; do uuidgen; done; }
$ coproc regen
$ awk -v f="$filename" '!(getline line < f){exit} {print $0,line}' OFS="|" < /dev/fd/${COPROC[0]} > "$filename".pk
This has awk reading your "real" filename from a variable, and the uuid from stdin, because the call to uuidgen is handled by a bash "coprocess". The funky bit around the getline is to tell awk to quit once it runs out of input from $filename. Also, note that awk is taking input from input redirection instead of reading the file directly. This is important; the file descriptor at /dev/fd/## is a bash thing, and awk can't open it.
This should theoretically save you time doing unnecessary system calls to open, run and close the uuidgen binary. On the other hand, the coprocess is doing almost the same thing anyway by running uuidgen in a loop. Perhaps you'll see some improvement in an SMP environment. I don't have a 50GB text file handy for benchmarking. I'd love to hear your results.
Note that coproc is a feature that was introduced with bash version 4. And use of /dev/fd/* requires that bash is compiled with file descriptor support. In my system, it also means I have to make sure fdescfs(5) is mounted.
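A quick way to sanity-check those two prerequisites (just a sketch; output varies by system):
$ echo "$BASH_VERSION"                     # coproc needs 4.0 or newer
$ [ -e /dev/fd/0 ] && echo "/dev/fd is available"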
I just noticed the following on my system (FreeBSD 11):
$ /bin/uuidgen -
usage: uuidgen [-1] [-n count] [-o filename]
If your uuidgen also has a -n option, then adding it to your regen() function with a suitably large count might be a useful optimization, reducing the number of times the command needs to be relaunched. For example:
$ function regen() { while true; do uuidgen -n 100; done; }
This would result in uuidgen being called only once every 100 lines of input, rather than for every line.
And if you're running Linux, depending on how you're set up, you may have an alternate source for UUIDs. Note:
$ awk -v f=/proc/sys/kernel/random/uuid '{getline u<f; close(f); print u,$0}' OFS="|" "$filename" > "$filename".pk
This doesn't require the bash coproc; it just has awk read a random UUID directly from a Linux kernel interface that provides them. You're still closing the file handle for every line of input, but at least you don't have to exec the uuidgen binary.
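If you'd rather stay in plain bash on Linux, here is a similar sketch (it still opens /proc once per line, since each read of that file yields a fresh UUID, and a bash read loop is itself slow on multi-gigabyte files, so treat this as illustration rather than a fix):
$ while IFS= read -r line; do
    printf '%s|%s\n' "$(</proc/sys/kernel/random/uuid)" "$line"
  done < "$filename" > "$filename".pk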
YMMV. I don't know what OS you're running, so I don't know what's likely to work for you.
Your script is calling a shell to call awk to call a shell to call uuidgen. Awk is a tool for manipulating text; it's not a shell (an environment to call other tools from), so don't do that. Just call uuidgen from the shell:
$ cat file
foo .*
bar stuff
here
$ xargs -d $'\n' -n 1 printf '%s|%s\n' "$(uuidgen)" < file
5662f3bd-7818-4da8-9e3a-f5636b174e94|foo .*
5662f3bd-7818-4da8-9e3a-f5636b174e94|bar stuff
5662f3bd-7818-4da8-9e3a-f5636b174e94|here
I'm just guessing that the real problem here is that you're running a sub-process for each line. You could read your file explicitly line by line and read output from a batch-uuidgen line by line, and thus only have a single subprocess to handle at once. Unfortunately, uuidgen doesn't work that way.
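That said, if your uuidgen is one that does support batch generation (the FreeBSD usage shown above lists -n count; beware that util-linux's uuidgen uses -n for a namespace instead, so check yours first), a single-subprocess sketch could be:
$ n=$(wc -l < "$filename")
$ paste -d'|' <(uuidgen -n "$n") "$filename" > "$filename".pk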
Maybe another solution?
perl -MData::UUID -ple 'BEGIN{ $ug = Data::UUID->new } $_ = lc($ug->to_string($ug->create)) . " | " . $_' $filename > ${filename}.pk
Might this be faster?
Related
The code I have goes through a file and multiplies all the numbers in the first column by a number. The code works, but I think it's somewhat slow. It takes 26.676s (wall time) to go through a file with 2302 lines in it. I'm using a 2.7 GHz Intel Core i5 processor. Here is the code.
#!/bin/bash
i=2
sed -n 1p data.txt > data_diff.txt #outputs the header (x y)
while [ $i -lt 2303 ]; do
    NUM=`sed -n "$i"p data.txt | awk '{print $1}'`
    SEC=`sed -n "$i"p data.txt | awk '{print $2}'`
    NNUM=$(bc <<< "$NUM*0.000123981")
    echo $NNUM $SEC >> data_diff.txt
    let i=$i+1
done
Honestly, the biggest speedup you can get will come from using a single language that can do the whole task itself. This is mostly because your script invokes 5 extra processes for each line, and invoking extra processes is slow, but also text processing in bash is really not that well optimized.
I'd recommend awk, given that you have it available:
awk '{ print $1*0.000123981, $2 }'
I'm sure you can improve this to skip the header line and print it without modification.
You can also do this sort of thing with Perl, Python, C, Fortran, and many other languages, though it's unlikely to make much difference for such a simple calculation.
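For instance, a variant of the one-liner above that passes the header line through unchanged (the same idea appears in the fuller script below):
awk 'NR==1 { print; next } { print $1*0.000123981, $2 }' data.txt > data_diff.txt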
Your script runs 4603 separate sed processes, 4602 separate awk processes, and 2301 separate bc processes. If echo were not a built-in then it would also run 2301 echo processes. Starting a process has relatively large overhead. Not so large that you would ordinarily notice it, but you are running over 11000 short processes. The wall time consumption doesn't seem unreasonable for that.
Moreover, each sed that you run processes the whole input file anew, selecting just one line from it. This is horribly inefficient.
The solution is to reduce the number of processes you are running, and especially to perform only a single run through the whole input file. A fairly easy way to do that would be to convert to an awk script, possibly with a bash wrapper. That might look something like this:
#!/bin/bash
awk '
NR==1 { print; next }
NR>=2303 { exit }
{ print $1 * 0.000123981, $2 }
' data.txt > data_diff.txt
Note that the line beginning with NR>=2303 artificially stops processing the input file when it reaches the 2303rd line, as your original script does; you could omit that line of the script altogether to let it simply process all the lines, however many there are.
Note, too, that that uses awk's built-in FP arithmetic instead of running bc. If you actually need the arbitrary-precision arithmetic of bc then I'm sure you can figure out how to modify the script to get that.
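If bc really is required, one (untested) sketch runs it a single time over all the rows and pastes its results back against the second column, assuming the two-column layout from the question:
sed -n 1p data.txt > data_diff.txt
paste -d' ' <(awk 'NR>1 { print $1 " * 0.000123981" }' data.txt | bc) \
            <(awk 'NR>1 { print $2 }' data.txt) >> data_diff.txt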
As an example of how to speed up the bash script (without implying that this is the right solution):
#!/bin/bash
{
    IFS= read -r header
    echo "$header"
    # You can drop the third name "rest" if your input file
    # only has two columns.
    while read -r num sec rest; do
        nnum=$( bc <<< "$num * 0.000123981" )
        echo "$nnum $sec"
    done
} < data.txt > data_diff.txt
Now you only have one extra call to bc per data line, necessitated by the fact that bash doesn't do floating-point arithmetic. The right answer is to use a single call to a program that can do floating-point arithmetic, as pointed out by David Z.
input file:
wtf.txt|/Users/jaro/documents/inc/face/|
lol.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
lol.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/twitter/|
wtf.txt|/Users/jaro/documents/inc/linked/|
lol.txt|/Users/jaro/documents/inc/face/|
omg.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/face/|
wtf.txt|/Users/jaro/documents/inc/twitter/|
omg.txt|/Users/jaro/documents/inc/linked/|
omg.txt|/Users/jaro/documents/inc/linked/|
The input file is a list of opened files (one line per file-open event). I want to get the last opened file in a given directory.
E.g.: get the last opened file in dir /Users/jaro/documents/inc/face/
output:
wtf.txt
This fetches the last line in the file whose second field is the desired folder name, and prints the first field.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { f=$1 }
END { print f }' file
To test whether the most recent file is also an existing file, I would reverse the order with tac and perform the logic in the shell: skip the files in the wrong path and the ones which don't exist, then print the first success and quit.
tac file |
while IFS='|' read -r basename path _; do
    case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
    test -e "$path$basename" || continue
    echo "$basename"
    break
done |
grep .
The final grep . is to produce an exit code which reflects whether or not the command was successful -- if it printed a file, it's okay; if none of the extracted files existed, return error.
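For example, a caller could branch on that exit status (find_last here is just a hypothetical wrapper around the pipeline above):
find_last() {
    tac file |
    while IFS='|' read -r basename path _; do
        case $path in "/Users/jaro/documents/inc/face/") ;; *) continue;; esac
        test -e "$path$basename" || continue
        echo "$basename"
        break
    done |
    grep .
}

if last=$(find_last); then
    printf 'last opened: %s\n' "$last"
else
    echo 'no matching file exists' >&2
fi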
Below is my original answer, based on a plausible but apparently incorrect interpretation of your question.
Here is a quick attempt at finding the file with the newest modification time from the list. I avoid parsing ls, preferring instead to use properly machine-parseable output from stat. Since your input file is line-oriented, I assume no file names contain newlines, which simplifies things quite a bit.
awk -F '\|' '$2 == "/Users/jaro/documents/inc/face/" { print $2 $1 }' file |
sort -u |
xargs stat -f '%m %N' |
sort -rn |
awk -F '/' '{ print $NF; exit(0) }'
The first sort is to remove any duplicates, to avoid running stat more times than necessary (premature optimization, perhaps). The stat prefixes each line with the file's modification time expressed as seconds since the epoch, which facilitates easy numerical sorting by age. The final Awk script neatly combines head -n 1 | rev | cut -d / -f1 | rev, i.e. it extracts just the basename from the first line of output, then quits.
If there is any way to use a less wacky input format, that would be an improvement (probably of your life in general as well).
The output format from stat is not properly standardized, but your question is tagged linux and osx, so I assume BSD stat (as on macOS; GNU coreutils stat wants -c instead of -f). If portability is desired, maybe look at find (which however may be overkill and/or not much better standardized across diverse platforms) or write a small Perl or Python script instead. (Well, Ruby too, I suppose, but personally, I'd go with Perl.)
perl -F'\|' -lane '{ $t{$F[0]} = (stat($F[1].$F[0]))[10]
if !defined $t{$F[0]} and $F[1] eq "/Users/jaro/documents/inc/face/" }
END { print ((sort { $t{$a} <=> $t{$b} } keys %t)[-1]) }' file
atime – The atime (access time) is the time when the data of a file was last accessed. Displaying the contents of a file or executing a shell script will update a file’s atime, for example. You can view the atime with the ls -lu command.
http://www.techtrunch.com/linux/ctime-mtime-atime-linux-timestamps
So in your case, this will do the trick:
ls -lu /Users/jaro/documents/inc/face/
I need help writing a Unix script loop to process the following data:
200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||
I have data in the above format in a text file. What I need to do is remove the newline character from every line whose next line has | as its first character. The output I need is:
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
I need some help to achieve this. These shell commands are giving me nightmares!
The 'sed' approach:
sed ':a;N;$!ba;s/\n|/|/g' input.txt
Though, awk would be faster and easier to understand/maintain. I just had that example handy (a common solution for removing newlines with sed).
EDIT:
To clarify the difference between this answer (option #1) and the alternative solution by @potong (which I actually prefer: sed ':a;N;s/\n|/|/;ta;P;D' file), which I'll call option #2:
Note that these are two of many possible options with sed. I actually prefer non-sed solutions since they generally run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 all in-memory, and option #2 as a stream. (Note: below, when I say "buffer", technically I mean "pattern space".)
option #1 reads the whole file into memory:
:a is just a label; N says append the next line to the buffer; if end-of-file ($) is not (!) reached, then branch (b) back to label :a ...
then after the whole file is read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (newline followed by "|") with just a "|", on the entire (g) buffer
option #2 just processes a couple of lines at a time:
reads / appends the next line (N) into the buffer, processes it (s/\n|/|/); branches (t) back to label :a only if the substitution was successful; otherwise prints (P) and clears/deletes (D) the current buffer up to the first embedded newline ... and the stream continues.
option #1 takes a lot more memory to run. In general, as large as your file. Option #2 requires minimal memory; so small I didn't bother to see what it correlates to (I'm guessing the length of a line.)
option #1 runs faster. In general, twice as fast as option #2; but obviously it depends on the file and what is being done.
On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s):
$ du -h /tmp/foobar.txt
544M /tmp/foobar.txt
$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real 0m1.564s
user 0m1.390s
sys 0m0.171s
$ time sed ':a;N;s/\n|/|/;ta;P;D' /tmp/foobar.txt > /dev/null
real 0m3.418s
user 0m3.239s
sys 0m0.163s
At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4197 11001 99 172427 558888 1 19:22 pts/10 00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 558188 (545M)
And option #2:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4401 11001 99 3468 864 3 19:22 pts/10 00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 236 (0M)
In summary (with commentary):
if you have files of unknown size, streaming without buffering is a better decision.
if every second matters, then buffering the entire file and processing it at once may be fine -- but ymmv.
my personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable) or even bash may be preferable to using sed.
yet, sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
Here is an awk solution:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "\n%s",$0} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
Explanation:
Awk implicitly loops through every line in the file.
substr($0,1,1)=="|"{printf "%s",$0;next}
If this line begins with a vertical bar, then print it (without a final newline) and then skip to the next line. We are using printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them; the explicit "%s" format also keeps any % characters in the data from being misinterpreted as format specifiers.
{printf "\n%s",$0}
If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).
END{print""}
At the end of the file, print a newline.
Refinement
The above prints out an extra newline at the beginning of the file. If that is a problem, then it can be eliminated with just a minor change:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "%s%s",new,$0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
This might work for you (GNU sed):
sed ':a;N;s/\n|/|/;ta;P;D' file
This processes the file a line at a time, as an alternative to @michael_n's solution, which slurps the file content into memory before processing.
You could do this simply through perl. The -0777 switch slurps the whole file into $_, and the substitution deletes every newline that is immediately followed by a | (the lookahead keeps the | itself):
$ perl -0777pe 's/\n(?=\|)//g' file
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
awk -f test.awk input.txt
test.awk
{
    if ($0 ~ /^\|/)
    {
        array[i++] = $0
    }
    else
    {
        for (j = 0; j < i; j++)
        {
            line = line array[j]
        }
        i = 0
        if (NR > 1) print line   # nothing has accumulated before the first record
        line = $0
    }
}
END {
    # flush the last record, including any queued continuation lines
    for (j = 0; j < i; j++) line = line array[j]
    print line
}
awk -f inp.awk input | sed '/^$/d'
inp.awk
{
    if ($0 !~ /^\|/)
    {
        print line   # emits one empty line up front; the sed above removes it
        line = $0
    }
    else
    {
        line = line $0
    }
}
END { print line }   # flush the last record
I have an awk script which splits a big file into several files by some condition. Then I'm running another script over each file in parallel.
awk -f script.awk -v DEST_FOLDER=tmp input.file
find tmp/ -name "*.part" | xargs -P $ALLOWED_CPUS --replace --verbose /bin/bash -c "./process.sh {}"
The question is: is there any way to run ./process.sh:
before the first script is done, given that process.sh processes its file line by line (a single line is too long to be passed to xargs directly);
knowing that each new file has a header (added in script.awk) that must be processed before the rest of the file;
while limiting the number of parallel processes;
without GNU parallel or inotifywait (they are not an option);
assuming the destination folder starts empty and the file names are unknown.
The purpose of the optimization is to avoid waiting until awk is done while some of the files are already ready to be processed.
Once you have created a file, you can pass the filename to a process's or script's input:
awk '{print name_of_created_file | "./process.sh &"}'
The & sends each process.sh to the background, so that they can run in parallel. However, this is a gawk extension and not POSIX; check the manual.
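A minimal sketch of how that might sit inside script.awk, assuming the input is grouped by a key in $1 (the split condition and file naming here are hypothetical placeholders for whatever script.awk already does, and note this variant does not by itself limit the number of parallel processes):
{
    newfile = DEST_FOLDER "/" $1 ".part"   # hypothetical split condition
    if (newfile != outfile) {
        if (outfile != "") {
            close(outfile)                          # flush the finished part
            system("./process.sh " outfile " &")    # fire and forget
        }
        outfile = newfile
    }
    print > outfile
}
END {
    if (outfile != "") {
        close(outfile)
        system("./process.sh " outfile " &")        # last part
    }
}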
You basically give the answer yourself: GNU Parallel + inotifywait will work.
Since you are not allowed to use inotifywait, you can make your own substitute for inotifywait. If you are allowed to write your own script, you are also allowed to run GNU Parallel (as that is just a script).
So something like this:
awk -f script.awk -v DEST_FOLDER=tmp input.file &
sleep 1
record file sizes of files in tmp
while tmp is not empty do
for files in tmp:
if file size is unchanged: print file
record new file size
sleep 1
done | parallel 'process {}; rm {}'
It is assumed that awk will produce some output within one second. If it takes longer, adjust the sleeps accordingly.
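A hedged bash rendering of that pseudocode (untested; stat -c is GNU stat, associative arrays need bash 4+, and the one-second poll interval is as arbitrary as in the pseudocode):
#!/bin/bash
awk -f script.awk -v DEST_FOLDER=tmp input.file &
awk_pid=$!

declare -A size handed_off   # filename -> last seen size / already queued
{
    while :; do
        awk_running=0
        kill -0 "$awk_pid" 2>/dev/null && awk_running=1
        for f in tmp/*.part; do
            [ -e "$f" ] || continue               # glob matched nothing
            [ "${handed_off[$f]}" ] && continue   # already queued
            sz=$(stat -c %s "$f") || continue
            if [ "$awk_running" -eq 0 ] || [ "${size[$f]}" = "$sz" ]; then
                printf '%s\n' "$f"                # size stable (or awk done): ready
                handed_off[$f]=1
            else
                size[$f]=$sz                      # still growing; check again later
            fi
        done
        [ "$awk_running" -eq 0 ] && break
        sleep 1
    done
} | parallel -j "$ALLOWED_CPUS" './process.sh {}; rm {}'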
My friend recently asked how to compare two folders in Linux and then run meld against any text files that are different. I'm slowly catching on to the Linux philosophy of piping many granular utilities together, and I put together the following solution. My question is: how could I improve this script? There seems to be quite a bit of redundancy, and I'd appreciate learning better ways to script Unix.
#!/bin/bash
dir1=$1
dir2=$2
# show files that are different only
cmd="diff -rq $dir1 $dir2"
eval $cmd # print this out to the user too
filenames_str=`$cmd`
# remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different
tmp1=`echo "$filenames_str" | sed -n '/ differ$/p'`
# grab just the first filename for the lines of output
tmp2=`echo "$tmp1" | awk '{ print $2 }'`
# convert newlines sep to space
fs=$(echo "$tmp2")
# convert string to array
fa=($fs)
for file in "${fa[@]}"
do
    # drop first directory in path to get relative filename
    rel=`echo $file | sed "s#${dir1}/##"`
    # determine the type of file
    file_type=`file -i $file | awk '{print $2}' | awk -F"/" '{print $1}'`
    # if it's a text file send it to meld
    if [ $file_type == "text" ]
    then
        # throw out error messages with &> /dev/null
        meld $dir1/$rel $dir2/$rel &> /dev/null
    fi
done
Please preserve/promote readability in your answers. An answer that is shorter but harder to understand won't qualify as an answer.
It's an old question, but let's work on it a bit just for fun, without thinking about the final goal (maybe SCM) or about tools that already do this in a better way. Let's just focus on the script itself.
In the OP's script, there is a lot of string processing inside bash, using tools like sed and awk, sometimes more than once in the same command line or inside a loop executed n times (once per file).
That's ok, but it's necessary to remember that:
Each time the script calls any of those programs, a new process is created in the OS, and that is expensive in time and resources. So the fewer programs are called, the better the script performs. In the original script:
diff 2 times (1 just to print to user)
sed 1 time processing diff result and 1 time for each file
awk 1 time processing sed result and 2 times for each file (processing file result)
file 1 time for each file
That doesn't apply to echo, read, test and others that are builtin commands of bash, so no external program is executed.
meld is the final command that will display the files to the user, so it doesn't count.
Even with builtin commands, pipelines (|) have a cost too, because the shell has to create pipes, duplicate handles, and maybe even fork copies of itself (each fork being a process in its own right). So again: less is better (see the timing sketch after this list).
The messages of the diff command are locale dependent, so if the system is not in English, the whole script won't work.
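A rough way to feel that process-startup overhead (absolute numbers vary widely by system; the ratio is the point):
time for i in {1..1000}; do /bin/true; done    # 1000 external processes
time for i in {1..1000}; do :; done            # builtin no-op: no forks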
With that in mind, let's clean up the original script a bit, maintaining the OP's logic:
#!/bin/bash
dir1=$1
dir2=$2
# Set english as current language
LANG=en_US.UTF-8
# (1) show files that are different only
diff -rq $dir1 $dir2 |
# (2) remove lines that represent only one file, keep lines that have
# files in both dirs, but are just different, delete all but left filename
sed '/ differ$/!d; s/^Files //; s/ and .*//' |
# (3) determine the type of file
file -i -f - |
# (4) for each file
while IFS=":" read file file_type
do
    # (5) drop first directory in path to get relative filename
    rel=${file#$dir1}
    # (6) if it's a text file send it to meld
    if [[ "$file_type" =~ "text/" ]]
    then
        # throw out error messages with &> /dev/null
        meld ${dir1}${rel} ${dir2}${rel} &> /dev/null
    fi
done
A little explaining:
A single chain of commands cmd1 | cmd2 | ..., where the output (stdout) of the previous one is the input (stdin) of the next one.
Execute sed just once to perform 3 operations (separated with ;) on the diff output:
Delete all lines except those ending with " differ"
Delete "Files " at the beginning of remaining lines
Delete from " and " to the end of remaining lines
Execute the file command once to process the whole file list from stdin (option -f -)
Use a bash while loop to read two values separated by : from each line of stdin.
Use bash parameter expansion to extract the relative filename from a variable.
Use the bash [[ ... =~ ... ]] test to compare the file type against a pattern (both illustrated just below).
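Tiny illustrations of points 5 and 6, with hypothetical values:
file=/old/dir/sub/a.txt
dir1=/old/dir
rel=${file#$dir1}                           # rel is now /sub/a.txt
file_type=" text/plain; charset=us-ascii"   # as produced by file -i and read
[[ "$file_type" =~ "text/" ]] && echo "is a text file"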
For clarity, I didn't consider that file and directory names may contain spaces. In such cases, both scripts will fail. To avoid that, it is necessary to enclose every reference to a file/dir name variable in double quotes.
I didn't use awk, because it is powerful enough that it could replace almost the entire script ;-)