Splitting large files efficiently (currently using awk) - bash

I have a 4 GB file that I need to do some operations on. I have a Bash script to do this, but Bash seems ill-suited to reading large data files into an array. So I decided to break up my file with awk.
My current script is:
for((i=0; i<100; i++)); do awk -v i=$i 'BEGIN{binsize=60000}{if(binsize*i < NR && NR <= binsize*(i+1)){print}}END{}' my_large_file.txt &> my_large_file_split$i.fastq; done
However, the problem with this script is that it will read in and loop through this large file 100 times (which presumably will lead to about 400 GB of IO).
QUESTION: Is there a better strategy that reads the large file only once? Perhaps doing the writing to files within awk instead of redirecting its output?

Assuming binsize is the number of lines you want per chunk, you could just maintain and reset a line counter as you step through the file, and set alternate output files within awk instead of using the shell to redirect.
awk -v binsize=60000 '
BEGIN {
    filenum = 1
    outfile = "output_chunk_1.txt"
}
count >= binsize {
    close(outfile)                           # done with this chunk, free its file handle
    filenum++
    outfile = "output_chunk_" filenum ".txt"
    count = 0
}
{
    count++
    print > outfile
}
' my_large_file.txt
I haven't actually tested this code, so if it doesn't work verbatim, at least it should give you an idea of a strategy to use. :-)
The idea is that we'll step through the file, updating the output filename in a variable whenever our line count for a chunk reaches binsize. The close(outfile) isn't strictly required, since awk will close any files still open when it exits, but without it every chunk's file stays open until the end, and with very many output files you can run into per-process limits on open file handles (some awk implementations simply fail with "too many open files").
That said, you could do almost exactly the same thing in bash alone:
#!/usr/bin/env bash
binsize=60000
filenum=1; count=0
while IFS= read -r line; do
    if [ "$count" -ge "$binsize" ]; then
        ((filenum++))
        count=0
    fi
    ((count++))
    outfile="output_chunk_${filenum}.txt"
    printf '%s\n' "$line" >> "$outfile"
done < my_large_file.txt
(Also untested.)
And while I'd expect the awk solution to be faster than bash, it might not hurt to do your own benchmarks. :)
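As an aside not really explored above: if you don't need full control over the chunk names, coreutils split already does this kind of single-pass, fixed-line-count chunking. A minimal sketch (the -d and --additional-suffix options assume GNU split):
# writes my_large_file_split00.fastq, my_large_file_split01.fastq, ... in a single pass
split -l 60000 -d --additional-suffix=.fastq my_large_file.txt my_large_file_split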

Related

efficient renaming of strings based on comma delimited table

I want to rename multiple individual entries in a long file based on a comma-delimited table. I figured out a way to do it, but I feel it's highly inefficient and I'm wondering if there's a better way to do it.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
#script to revert dammit name changes
while read line; do
IFS="," read -r -a contig <<< "$line"
sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed scans the entire file once for every entry in the renaming table.
I could imagine there is a way to only access each line once and iterate over the name list that way, but I'm not sure how to tackle this. I chose bash because this is the language that I'm most fluent in. But I'm not averse to using perl or python if they offer an easier solution.
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are good with bash you can do more with it, no problem.
Divide and conquer.
I have done this many times, as you can bring the total run time down closer to the time it takes to process a single part.
Take this sketch: I call a function that cuts the 30K-entry file into, say, X parts, then I call the rename function on each part in a loop with & so the parts are processed as parallel child processes.
declare -a file_part_names
# cut the big file into parts and record their names
function cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    lines_per_part=$(( ( $(wc -l < "$orig_file") / number_parts ) + 1 ))
    mkdir -p /tmp/rename_parts
    split -l "$lines_per_part" "$orig_file" /tmp/rename_parts/part.
    file_part_names=(/tmp/rename_parts/part.*)
}
# handle renaming inside one part (same sed loop as before, applied to the part in place)
function rename_fields_in_file() {
    file_part="$1"
    while IFS="," read -r -a contig; do
        sed -i "s|${contig[1]}|${contig[0]}|g" "$file_part"
    done < Fmerg_final.fasta.dammit.namemap.csv
}
# main
cut_file_into_parts "Fmerg_final.fasta.transdecoder_test.pep" 500
for each in "${file_part_names[@]}"; do
    # keep no more than 100 child processes running at a time
    while [ "$(jobs -rp | wc -l)" -ge 100 ]; do
        sleep 10
    done
    rename_fields_in_file "$each" &
done
wait
# Now that you have a pile of processed parts, combine them all.
cat "${file_part_names[@]}" > final_result.txt
In summary, cut the big file into say 500 tmp parts under /tmp. Then go through them one at a time, but launch each as a child process, with up to say 100 running at the same time; keep the pipeline full by checking how many jobs are running: if 100 or more, do nothing for a bit (sleep 10), if fewer, add more. When done, one more step to combine the processed parts into the final file, which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script on each part instead of managing parallel child processes.
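For completeness, the "touch each line only once" idea from the question can also be sketched directly in awk: load the rename table into an array first, then stream the big file through once, applying every reverse rename to each line. This is only an untested sketch, using the file names from the question and an arbitrary output name:
awk -F',' '
NR == FNR { map[$2] = $1; next }    # first file: remember renamed -> original pairs
{
    for (r in map) gsub(r, map[r])  # data file: apply every reverse rename to this line
    print
}
' Fmerg_final.fasta.dammit.namemap.csv Fmerg_final.fasta.transdecoder_test.pep > reverted.pep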

bash loop taking extremely long time

I have a list of times in the format HH:MM:SS that I am looping through to find the nearest time that is not already past. The code that I have is:
for i in ${times[@]}; do
hours=$(echo $i | sed 's/\([0-9]*\):.*/\1/g')
minutes=$(echo $i | sed 's/.*:\([0-9]*\):.*/\1/g')
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
if [[ hours -ge currentHours ]]; then
if [[ minutes -ge currentMinutes ]]; then
break
fi
fi
done
The variable times is an array of all the times that I am sorting through (it's about 20-40 lines). I'd expect this to take less than 1 second; however, it is taking upwards of 5 seconds. Any suggestions for decreasing the time spent on the regular expressions would be appreciated.
times=($(cat file.txt))
Here is a list of the times that are stored in a text file and are imported into the times variable using the above line of code.
6:05:00
6:35:00
7:05:00
7:36:00
8:08:00
8:40:00
9:10:00
9:40:00
10:11:00
10:41:00
11:11:00
11:41:00
12:11:00
12:41:00
13:11:00
13:41:00
14:11:00
14:41:00
15:11:00
15:41:00
15:56:00
16:11:00
16:26:00
16:41:00
16:58:00
17:11:00
17:26:00
17:41:00
18:11:00
18:41:00
19:10:00
19:40:00
20:10:00
20:40:00
21:15:00
21:45:00
One of the key things to understand in looking at bash scripts from a performance perspective is that while the bash interpreter is somewhat slow, the act of spawning an external process is extremely slow. Thus, while it can often speed up your scripts to use a single invocation of awk or sed to process a large stream of input, starting those invocations inside a tight loop will greatly outweigh the performance of those tools once they're running.
Any command substitution -- $() -- causes a second copy of the interpreter to be fork()ed off as a subshell. Invoking any command not built into bash -- date, sed, etc. -- then causes a subprocess to be fork()ed off for that process, and the executable associated with that process to be exec()'d -- something that involves a great deal of OS-level overhead (the binary needs to be linked, loaded, etc).
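If you want to see that overhead on your own machine, a rough micro-benchmark (numbers will vary) is to compare a thousand date fork()/exec() calls against a thousand uses of bash's built-in time formatting (bash 4.2+):
time for i in {1..1000}; do d=$(date +%H:%M); done            # subshell + fork/exec per iteration
time for i in {1..1000}; do printf -v d '%(%H:%M)T' -1; done  # builtin only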
This loop would be better written as:
IFS=: read -r currentHours currentMinutes < <(date +"%H:%M")
while IFS=: read -r hours minutes _; do
# the 10# prefix keeps zero-padded values such as 08 from being parsed as invalid octal
if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
break
fi
done <file.txt
In this form only one external command is run, date +"%H:%M", outside the loop. If you were only targeting bash 4.2 and newer (with built-in time formatting support), even this would be unnecessary:
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
...will directly place the current hour and minute into the variables currentHours and currentMinutes using only functionality built into modern bash releases.
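Putting the built-in time formatting together with the read loop above gives a version with no external commands at all (a sketch for bash 4.2+, using the same 10# guard against octal parsing of zero-padded values):
printf -v currentHours '%(%H)T' -1
printf -v currentMinutes '%(%M)T' -1
while IFS=: read -r hours minutes _; do
    if (( 10#$hours >= 10#$currentHours )) && (( 10#$minutes >= 10#$currentMinutes )); then
        break
    fi
done < file.txt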
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
BashFAQ #100 - How can I do native string manipulations in bash? (Subsection: "Splitting a string into fields")
To be honest I'm not sure why it's taking an extremely long time but there are certainly some things which could be made more efficient.
currentHours=$(date +"%H")
currentMinutes=$(date +"%M")
for time in "${times[#]}"; do
IFS=: read -r hours minutes seconds <<<"$time"
if [[ hours -ge currentHours && minutes -ge currentMinutes ]]; then
break
fi
done
This uses read, a built-in command, to split the text into variables, rather than calling external commands and creating subshells.
I've moved the date calls outside the loop on the assumption that the script will now run quickly enough that the current time won't change meaningfully while the loop is running.
Note that you can also just use awk to do the whole thing:
awk -F: -v currentHours="$(date +"%H")" -v currentMinutes="$(date +"%M")" '
$1 >= currentHours && $2 >= currentMinutes { print; exit }' file.txt
Just to make the program produce some output, I added a print, so that the matching line is printed.
awk to the rescue!
awk -v time="12:12:00" '
function pad(x) {split(x,ax,":"); return (ax[1]<10)?"0"x:x}
BEGIN {time=pad(time)}
time>pad($0) {next}
{print; exit}' times
12:41:00
With the hour zero-padded, you can do a string-only comparison.

More elegant method of sorting a file after the first X lines?

I've been doing a lot of searching on Stack Overflow today for a solution for this and have found many questions about sorting after skipping X lines, but no really solid generic answers, so I threw together my own slipshod method of doing so:
head -n 15 FILE.EXT > temp.txt
tail -n+16 FILE.EXT | sort >> temp.txt
mv temp.txt FILE.EXT
This will sort the file (take your pick of the options for sort), while preserving the order of the first 15 lines of it. This is obviously fairly inelegant, with three file references and two different values to enter. Ideally I'd like to come up with a command that is less cumbersome, if possible, because this seems like a pretty common desire with not much support.
Does anyone have a simpler solution than mine?
Is there anything wrong with what I've done? Potential issues?
This problem lends itself more strongly to using a script, but my command is still probably slightly quicker than creating and executing a script for a one-off.
I'm not even close to a bash expert, so I'm hoping there is some bash-fu that can make this a quick one-liner. Is there a way to create and reference variables in a single command so that a user only needs to put in something like the name and line number?
This 'one-liner' generates the desired output on standard output:
awk 'NR <= 15 { print; next } { print | "sort" }' FILE.EXT
Overwriting the original file cleanly is harder, and generally involves something that writes to a temporary file and renames it when that's complete.
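For example (a sketch; the temporary file name is arbitrary, and the fflush() call makes sure the unsorted header lines are written out before sort's output arrives):
awk 'NR <= 15 { print; fflush(); next } { print | "sort" }' FILE.EXT > FILE.EXT.tmp &&
mv FILE.EXT.tmp FILE.EXT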
As sputnick points out, if you have GNU awk, you can use its -i inplace option to overwrite the file in place:
gawk -i inplace 'NR <= 15 { print; next } { print | "sort" }' FILE.EXT
(And gawk is often also installed as awk.)
If you don't have GNU awk, then I have a script ow, derived from the overwrite script in Kernighan & Pike's The UNIX Programming Environment, that does just that.
Usage:
ow FILE.EXT awk 'NR <= 15 { print; next } { print | "sort" }' FILE.EXT
Code:
: "#(#)$Id: ow.sh,v 1.6 2005/06/30 18:14:08 jleffler Exp $"
#
# Overwrite file
# From: The UNIX Programming Environment by Kernighan and Pike
# Amended: remove PATH setting; handle file names with blanks.
case $# in
0|1)
echo "Usage: $0 file command [arguments]" 1>&2
exit 1;;
esac
file="$1"
shift
new=${TMPDIR:-/tmp}/ovrwr.$$.1
old=${TMPDIR:-/tmp}/ovrwr.$$.2
trap "rm -f '$new' '$old' ; exit 1" 0 1 2 15
if "$#" >"$new"
then
cp "$file" "$old"
trap "" 1 2 15
cp "$new" "$file"
rm -f "$new" "$old"
trap 0
exit 0
else
echo "$0: $1 failed - $file unchanged" 1>&2
rm -f "$new" "$old"
trap 0
exit 1
fi
It's old code; I haven't modified it for almost a decade now, but I have used it quite a lot. As noted by Charles Duffy, it could do with some modernization if you're likely to face file names starting with dashes (because those could be mistaken for command-line options to cp or mv), and it should have a shebang line, etc.
It also shows trapping signals (though nowadays, I usually trap '0 1 2 3 13 15', equivalent to 'EXIT HUP INT QUIT PIPE TERM') and naming temporary files for preventing casual interference (using $$ rather than mktemp — like I said, it is old code).
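If you were modernizing it, those two changes might look something like this (just a sketch; the template name is arbitrary):
new=$(mktemp "${TMPDIR:-/tmp}/ovrwr.XXXXXX")
old=$(mktemp "${TMPDIR:-/tmp}/ovrwr.XXXXXX")
trap 'rm -f "$new" "$old"; exit 1' EXIT HUP INT QUIT PIPE TERM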
you can do a sort that skips some lines at the start of the file like this:
{ head -n 15 && sort; } < file > tempfile
It works because head consumes only the first 15 lines and, with a regular file on standard input, leaves the file offset just past them, so sort sees the rest of the file.
So, to solve the full original problem:
{ head -n 15 && sort; } < file > tempfile && mv tempfile file
What about:
{ head -n 15 file; tail -n+16 file | sort ; }

Do I need to stay away from bash scripts for big files?

I have big log files (1-2 GB and more). I'm new to programming, and bash is useful and easy for me; when I need something, I can usually get it done (sometimes with help from people here). Simple scripts work fine, but when I need complex operations the scripts become very slow, and I don't know whether that's because bash is slow or because my programming skills are bad.
So do I need C for complex processing of my server log files, or do I just need to optimize my scripts?
If optimization is all I need, how can I check which parts of my code are slow and which are fine?
For example, I have this while loop:
while read -r date month size;
do
...
...
done < file.tmp
How can I use awk to make this run faster?
That depends on how you use bash. To illustrate, consider how you'd sum a possibly large number of integers.
This function does what Bash was meant for: being control logic for calling other utilities.
sumlines_fast() {
awk '{n += $1} END {print n}'
}
It runs in 0.5 seconds on a million line file. That's the kind of bash code you can very effectively use for larger files.
Meanwhile, this function does what Bash is not intended for: being a general purpose programming language:
sumlines_slow() {
local i=0
while IFS= read -r line
do
(( i += $line ))
done
echo "$i"
}
This function is slow, and takes 30 seconds to sum the same million line file. You should not be doing this for larger files.
Finally, here's a function that could have been written by someone who has no understanding of bash at all:
sumlines_garbage() {
i=0
for f in `cat`
do
i=`echo $f + $i | bc`
done
echo $i
}
It treats forks as being free and therefore runs ridiculously slowly. It would take something like five hours to sum the file. You should not be using this at all.
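If you want to reproduce that kind of comparison yourself, something along these lines works (a hypothetical test file; the exact numbers will vary a lot by machine):
seq 1000000 > /tmp/numbers.txt             # a million integers, one per line
time sumlines_fast    < /tmp/numbers.txt
time sumlines_slow    < /tmp/numbers.txt
time sumlines_garbage < /tmp/numbers.txt   # expect to wait a very, very long time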

How can I use a named pipe as temporary storage in a loop?

I have a loop where I'm making incremental modifications to a large file. Rather than write to disk each time, I thought I'd use named pipes. However, this means that I'll need a unique name for each iteration of the loop, since I can't seem to redirect output back into the same named pipe.
$ mkfifo fifotemp
$ echo qwerty > fifotemp &
$ grep qwe <fifotemp >fifotemp &
$ cat <fifotemp
[hangs]
I could create a new named pipe for each iteration, but this seemed inelegant. Is there a better way?
Potentially you could use plain pipes and recursive functions. You would need to pass everything into the recursive function to determine when to quit and what processing is needed at each recursion level. This example just adds the recursion level at the front of each line for each level, and quits at level 4:
#!/bin/bash
doEdits() {
while read -r line
do
echo "$1 $line"
done
}
doRecursion() {
level=$(($1 + 1))
#echo "doRecursion $level" 1>&2
if [ $level -lt 4 ]
then
doEdits $level | doRecursion $level
else
# Just output all the input
cat
fi
}
doRecursion 0 < myInputFile > myOutputFile
I assume the number of recursion levels is fairly limited, otherwise you could run into system limitations on the number of open processes and pipes.
One advantage here is that each pipe should only need a small buffer. This could also be fast if your machine has multiple processors.
