how to reuse stdout output without saving it to physical disk file - ksh

I have a for-loop, like the following:
for inf in $filelist; do
for ((i=0; i<imax; ++i)); do
temp=`<command_1> $inf | <command_2>`
eval set -A array -- $temp
...
done
...
done
The problem is, command_1 is a bit time-consuming and its output is fairly large (900MB at the highest, depending on how big the input file is). So, I modified the script to:
outf="./temp"
for inf in $filelist; do
<command_1> $inf -o $outf
for ((i=0; i<imax; ++i)); do
temp=`cat $outf | <command_2>`
eval set -A array -- $temp
...
done
...
done
There is a small performance improvement, but not as much as I want, probably because disk I/O is a bottleneck as well.
I'm just curious whether there is a way to keep the stdout output of command_1 so that I can reuse it without writing it to a physical disk file.

don't use pipelines inside nested loops
Based on new comments and another look at the original question, I would strongly recommend against using a pipeline that processes large amounts of data inside a nested loop. Shell pipelines are far from efficient and incur a lot of process overhead.
Looking at the original problem, this involves examining the contributions of command_1 and command_2 to see whether you could solve it in another way.
That said: here's the original answer:
In the shell there are two ways of storing data: either in a shell variable or in a file. You might try storing that file in a memory-based filesystem, like /dev/shm on Linux or tmpfs on Solaris.
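For instance, a minimal sketch of the memory-backed-file idea, assuming a Linux system with /dev/shm, that the output fits in RAM, and keeping the <command_1>/<command_2> placeholders from the question:
outf=/dev/shm/cmd1_out.$$                 # this file lives in RAM, not on a physical disk
for inf in $filelist; do
    <command_1> "$inf" -o "$outf"         # run the expensive command once per input file
    for ((i=0; i<imax; ++i)); do
        temp=$(cat "$outf" | <command_2>) # reuse the cached output on every iteration
        eval set -A array -- $temp
    done
done
rm -f "$outf"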
You might also analyse command_1 and command_2 for optimisations. Is there anything in the output of command_1 that's not needed by command_2? Try to put a filter between the two.
Example:
command_1 | awk '{ print $2 }' | command_2
(Assuming command_2 only needs column 2 of command_1's output.)

Related

how to untar certain files from an archive and grep in parallel in bash

We've got an extensive number of tarballs, and in each tarball I need to search for a particular pattern in only some files, whose names are known beforehand.
As disk access is slow and there are quite a few cores and plenty of memory available on this system, we aim to minimise disk writes and go through memory as much as possible.
echo "a.txt" > file_subset_in_tar.txt
echo "b.txt" >> file_subset_in_tar.txt
echo "c.txt" >> file_subset_in_tar.txt
tarball_name="tarball.tgz";
pattern="mypattern"
echo "pattern: $pattern"
(parallel -j-2 tar xf $tarball_name -O ::: `cat file_subset_in_tar.txt` | grep -ac "$pattern")
This works just fine in a bash terminal directly. However, when I paste it into a script with a bash shebang at the top, it just prints zero.
If I change $pattern to a hard-coded string, it runs fine. It feels like there is something wrong with the pipe sequencing or something similar. So, ideally an update to the attempt above, or another solution that satisfies the mentioned disk/memory requirements, would be much appreciated.
I believe your parallel command is constructed incorrectly. You can run the pipeline of commands like the following:
parallel -j -2 "tar xf $tarball_name -O {} | grep -ac $pattern" :::: file_subset_in_tar.txt
Also note that the backticks and the use of cat are unnecessary; arguments can be fed to parallel from a file using ::::.
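For reference, a self-contained sketch of the whole script with that fix applied (the file names and pattern are the ones from the question):
#!/bin/bash
tarball_name="tarball.tgz"
pattern="mypattern"
printf '%s\n' a.txt b.txt c.txt > file_subset_in_tar.txt
# quoting the whole pipeline makes each parallel job run its own tar | grep,
# rather than having a single grep attached to parallel's combined output
parallel -j -2 "tar xf $tarball_name -O {} | grep -ac $pattern" :::: file_subset_in_tar.txt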

efficient renaming of strings based on comma delimited table

I want to rename multiple individual entries in a long file based on a comma-delimited table. I figured out a way to do it, but I feel it's highly inefficient and I'm wondering whether there's a better way.
My file contains >30k entries like this:
>Gene.1::Fmerg_contig0.1::g.1::m.1 Gene.1::Fmerg_contig0.1::g.1
TPAPHKMQEPTTPFTPGGTPKPVFTKTLKGDVVEPGDGVTFVCEVAHPAAYFITWLKDSK
>Gene.17::Fmerg_Transcript_1::g.17::m.17 Gene.17::Fmerg_Transcript_1::g.17
PLDDKLADRVQQTDAGAKHALKMTDEGCKHTLQVLNCRVEDSGIYTAKATDENGVWSTCS
>Gene.15::Fmerg_Transcript_1::g.15::m.15 Gene.15::Fmerg_Transcript_1::g.15
AQLLVQELTEEERARRIAEKSPFFMVRMKPTQVIENTNLSYTIHVKGDPMPNVTFFKDDK
And the table with the renaming information looks like this:
original,renamed
Fmerg_contig0.1,Fmerg_Transcript_0
Fmerg_contig1.1,Fmerg_Transcript_1
Fmerg_contig2.1,Fmerg_Transcript_2
The inefficient solution I came up with looks like this:
#!/bin/bash
#script to revert dammit name changes
while read line; do
IFS="," read -r -a contig <<< "$line"
sed -i "s|${contig[1]}|${contig[0]}|g" Fmerg_final.fasta.transdecoder_test.pep
done < Fmerg_final.fasta.dammit.namemap.csv
However, this means that sed scans the entire file once per entry to be renamed.
I imagine there is a way to read each line only once and iterate over the name list for it instead, but I'm not sure how to tackle this. I chose bash because it's the language I'm most fluent in, but I'm not averse to using Perl or Python if they offer an easier solution.
This is an O(n) problem and you solved it with an O(n) solution, so I wouldn't consider it inefficient. However, if you are good with bash, you can do more with it, no problem.
Divide and conquer.
I have done this many times, as you can reduce the total work time to something closer to the time it takes to process a single part.
Take this sketch: I call a function that cuts the big file into, say, X parts, then call the renaming function in a loop with & so each part is handled by its own background job.
declare -a file_part_names
# cut the big fasta file into parts without splitting lines (GNU split)
function cut_file_into_parts() {
    orig_file="$1"
    number_parts="$2"
    split -n l/"$number_parts" "$orig_file" file_part_
    file_part_names=( file_part_* )
}
# apply every rename from the table to one part of the file
function rename_fields_in_file() {
    file_part="$1"
    name_map="$2"
    while IFS="," read -r -a contig; do
        sed -i "s|${contig[1]}|${contig[0]}|g" "$file_part"
    done < "$name_map"
}
# main
cut_file_into_parts "Fmerg_final.fasta.transdecoder_test.pep" 500
for file_part in "${file_part_names[@]}"; do
    # keep at most 100 child processes running at the same time
    while [ "$(jobs -r | wc -l)" -ge 100 ]; do
        sleep 10
    done
    rename_fields_in_file "$file_part" "Fmerg_final.fasta.dammit.namemap.csv" &
done
wait
# Now that you have a pile of processed parts, combine them all.
cat file_part_* > final_result.txt && rm file_part_*
In summary: cut the big file into, say, 500 temp files labelled file1, file2, etc. in, say, /tmp/folder. Then go through them one at a time, launching each as a child process with up to, say, 100 running at the same time; keep the pipeline full by checking the job count, doing nothing (sleep 10) while it is over 100 and adding more when it drops below. When done, one more pass combines file1_finish.txt, file2_finish.txt, etc., which is super quick.
NOTE: if this is too much, you can always just break the file up and call the same script once per part instead of managing background jobs.

Bash split stdin by null and pipe to pipeline

I have a stream that is null delimited, with an unknown number of sections. For each delimited section I want to pipe it into another pipeline until the last section has been read, and then terminate.
In practice, each section is very large (~1GB), so I would like to do this without reading each section into memory.
For example, imagine I have the stream created by:
for I in {3..5}; do seq $I; echo -ne '\0';
done
I'll get a stream that looks like:
1
2
3
^@1
2
3
4
^@1
2
3
4
5
^@
When piped through cat -v.
I would like to pipe each section through paste -sd+ | bc, so I get a stream that looks like:
6
10
15
This is simply an example. In actuality the stream is much larger and the pipeline is more complicated, so solutions that don't rely on streams are not feasible.
I've tried something like:
set -eo pipefail
while head -zn1 | head -c-1 | ifne -n false | paste -sd+ | bc; do :; done
but I only get
6
10
If I leave off bc I get
1+2+3
1+2+3+4
1+2+3+4+5
which is basically correct. This leads me to believe that the issue is potentially related to buffering and the way each process is actually interacting with the pipes between them.
Is there some way to fix the way that these commands exchange streams so that I can get the desired output? Or, alternatively, is there a way to accomplish this with other means?
In principle this is related to this question, and I could certainly write a program that reads stdin into a buffer, looks for the null character, and pipes the output to a spawned subprocess, as the accepted answer does for that question. Given the general support of streams and null delimiters in bash, I'm hoping to do something that's a little more "native". In particular, if I want to go this route, I'll have to escape the pipeline (paste -sd+ | bc) in a string instead of just letting the same shell interpret it. There's nothing too inherently bad about this, but it's a little ugly and will require a bunch of somewhat error prone escaping.
Edit
As was pointed out in an answer, head makes no guarantees about how much it buffers. Unless it buffers only a single byte at a time, which would be impractical, this will never work. Thus, it seems the only solutions are to read each section into memory or to write a dedicated program.
The issue with your original code is that head doesn't guarantee that it won't read more than it outputs. Thus, it can consume more than one (NUL-delimited) chunk of input, even if it's emitting only one chunk of output.
read, by contrast, guarantees that it won't consume more than you ask it for.
set -o pipefail
while IFS= read -r -d '' line; do
bc <<<"${line//$'\n'/+}"
done < <(build_a_stream)
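For example, a usage sketch feeding the loop straight from the question's generator instead of build_a_stream (seq leaves a trailing newline in each chunk, so it is stripped before the newlines become plus signs):
while IFS= read -r -d '' chunk; do
    chunk=${chunk%$'\n'}         # drop the trailing newline left by seq
    bc <<<"${chunk//$'\n'/+}"    # evaluates 1+2+3, 1+2+3+4, 1+2+3+4+5
done < <(for I in {3..5}; do seq "$I"; echo -ne '\0'; done)
# prints 6, 10 and 15, one per line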
If you want native logic, there's nothing more native than just writing the whole thing in shell.
Calling external tools -- including bc, cut, paste, or others -- involves a fork() penalty. If you're only processing small amounts of data per invocation, the efficiency of the tools is overwhelmed by the cost of starting them.
while read -r -d '' -a numbers; do # read up to the next NUL into an array
sum=0 # initialize an accumulator
for number in "${numbers[@]}"; do # iterate over that array
(( sum += number )) # ...using an arithmetic context for our math
done
printf '%s\n' "$sum"
done < <(build_a_stream)
For all of the above, I tested with the following build_a_stream implementation:
build_a_stream() {
local i j IFS=$'\n'
local -a numbers
for ((i=3; i<=5; i++)); do
numbers=( )
for ((j=0; j<=i; j++)); do
numbers+=( "$j" )
done
printf '%s\0' "${numbers[*]}"
done
}
As discussed, the only real solution seemed to be writing a program to do this specifically. I wrote one in Rust called xstream-util. After installing it with cargo install xstream-util, you can pipe the input into
xstream -0 -- bash -c 'paste -sd+ | bc'
to get the desired output
6
10
15
It doesn't avoid having to run the program in bash, so it still needs escaping if the pipeline is complicated. Also, it currently only supports single byte delimiters.
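For example, combined with the generator from the question, the full invocation would look roughly like this (flags as shown above):
for I in {3..5}; do seq $I; echo -ne '\0'; done |
    xstream -0 -- bash -c 'paste -sd+ | bc'
# 6
# 10
# 15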

Disk space required for unix sort

I am currently doing a UNIX sort (via GitBash on a Windows machine) of a 500GB text file. Due to running out of space on the main disk, I have used the -T option to direct the temp files to a disk where I have enough space to accommodate the entire file. The thing is, I've been watching the disk space and apparently the temp files are already in excess of what the original file was. I don't know how much further this is going to go, but I'm wondering if there is a rule by which I can predict how much space I will need for temp files.
I'd batch it manually as described in this unix.SE answer.
Find some very basic queries that will divide your content into chunks that are small enough to be sorted. For example, if it's a file of words, you could create queries like grep ^a …, grep ^b …, and so on. Some items may need more granularity than others.
You can script that like:
#!/bin/bash
for char1 in other {0..9} {a..z}; do
out="/tmp/sort.$char1.xz"
echo "Extracting lines starting with '$char1'"
if [ "$char1" = "other" ]; then char1='[^a-z0-9]'; fi
grep -i "^$char1" *.txt |xz -c0 > "$out"
unxz -c "$out" |sort -u >> output.txt || exit 1
rm "$out"
done
echo "It worked"
I'm using xz -0 because it's almost as fast as gzip's default level (gzip -6) yet vastly better at conserving space. I omitted compression from the final output in order to preserve the exit value of sort -u, but you could instead use a size check (iirc, sort fails with zero output) and then use sort -u |xz -c0 >> output.txt.xz, since the xz (and gzip) container formats let you concatenate archives (I've written about that before too).
This works because the chunks are processed in sorted order (0 comes before 1, which comes before a, etc.), so once each chunk has been through sort -u, the final assembly doesn't need another sort pass (note: the "other" chunk will be slightly off, since some non-alphanumeric characters sort before the numbers, others between numbers and letters, and still others after the letters. You can also remove grep's -i flag and additionally iterate through {A..Z} to be case sensitive). Each individual chunk obviously still needs to be sorted, but hopefully the chunks are small enough to be manageable.
If the script exits before completing all iterations and printing "It worked", you can edit it with a finer-grained batch for the last iteration it tried. Remove all prior iterations, since their results are already saved in output.txt.
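To make the compressed-final-output variant mentioned above concrete, here is a sketch of the loop body (it assumes bash with pipefail and adds a temp name, ${out}.sorted, so a failed sort never touches output.txt.xz):
set -o pipefail
grep -i "^$char1" *.txt | xz -c0 > "$out"
if unxz -c "$out" | sort -u | xz -c0 > "${out}.sorted"; then
    cat "${out}.sorted" >> output.txt.xz   # xz archives can be concatenated
    rm "$out" "${out}.sorted"
else
    exit 1                                 # sort failed, e.g. out of temp space
fi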

Why is the output from these parallel processes not messed up?

Everything is executing perfectly.
words.dict file contains one word per line:
$ cat words.dict
cat
car
house
train
sun
today
station
kilometer
house
away
chapter.txt file contains plain text:
$ cat chapter.txt
The cars are very noisy today.
The train station is one kilometer away from his house.
The script below appends to the result.txt file every word from words.dict that is not found by grep in chapter.txt, using up to 10 parallel greps:
$ cat psearch.sh
#!/bin/bash --
> result.txt
max_parallel_p=10
while read line ; do
while [ $(jobs | wc -l) -gt "$max_parallel_p" ]; do sleep 1; done
fgrep -q "$line" chapter.txt || printf "%s\n" "$line" >> result.txt &
done < words.dict
wait
A test:
$ ./psearch.sh
$ cat result.txt
cat
sun
I thought the tests would generate mixed-up (garbled) words in result.txt, e.g.:
csat
un
But it really seems to work.
Please take a look and explain to me why.
Background jobs are not threads. With a multi-threaded process you can get that effect: each process has just one standard output stream (stdout), and in a multi-threaded program all threads share it, so an unprotected write to stdout can lead to garbled output as you describe. But you do not have a multi-threaded program.
When you use the & operator, bash creates a new child process with its own stdout stream. Generally (depending on implementation details) output is flushed on a newline, so even though the file is shared, the granularity is by line.
There is a slim chance that two processes could flush to the file at exactly the same time, but your code, with subprocesses and a sleep, makes it highly unlikely.
You could try taking out the newline from the printf, but given the inefficiency of the rest of the code and the small dataset, garbling is still unlikely. It is quite possible that each process completes before the next one starts.
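To actually see the kind of mixing the question anticipated, each job's output has to be split across many separate writes. A deliberately pathological sketch (not the question's script; mixed.txt is just an assumed scratch file):
rm -f mixed.txt
for digit in {0..9}; do
    ( for ((n=0; n<2000; n++)); do printf '%s' "$digit" >> mixed.txt; done
      echo >> mixed.txt ) &            # each job appends one byte at a time
done
wait
head -c 200 mixed.txt; echo            # typically shows digits from several jobs interleaved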
