I have nearly 1000 space-separated files and I need to run join between every pair of them, using only the second column as the key. The important thing is that no comparison is repeated, which is why I want each unordered pair exactly once.
For instance, a small example with 3 files A.txt B.txt and C.txt
The general idea is to get A B comparison, A C and B C. Neither B A nor C A nor C B
The basic ("101") code for the three-file case would be
join -1 2 -2 2 A.txt B.txt | cut -d ' ' -f1 > AB.txt
join -1 2 -2 2 A.txt C.txt | cut -d ' ' -f1 > AC.txt
join -1 2 -2 2 B.txt C.txt | cut -d ' ' -f1 > BC.txt
Is there a way to accomplish this for a thousand files? I tried using a for loop, but toasted my brains out, and now I'm trying with a while loop. But I'd better get some orientation first.
As the number of iterations is quite large performance becomes an issue. Here is an optimized version of Matty's answer, using an array, to divide the number of iterations by 2 (half a million instead of a million) and to avoid a test:
declare -a files=( *.txt )
declare -i len=${#files[@]}
declare -i lenm1=$(( len - 1 ))
for (( i = 0; i < lenm1; i++ )); do
    a="${files[i]}"
    ab="${a%.txt}"                          # filename without the .txt suffix
    for (( j = i + 1; j < len; j++ )); do
        b="${files[j]}"
        join -1 2 -2 2 "$a" "$b" | cut -d ' ' -f1 > "$ab$b"
    done
done
But consider that bash was not designed for such intensive tasks with half a million iterations. There might be a better (more efficient) way to accomplish what you want.
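One option worth sketching (my own assumption, not part of the original answer): keep the pair generation in bash but hand the actual joins to xargs -P so several run in parallel. This assumes GNU or BSD xargs with -P support and filenames without whitespace; the -P4 value is arbitrary.
declare -a files=( *.txt )
declare -i len=${#files[@]}

# Emit each unordered pair once (two lines per pair), then let xargs
# run up to 4 join|cut pipelines at a time.
for (( i = 0; i < len - 1; i++ )); do
    for (( j = i + 1; j < len; j++ )); do
        printf '%s\n%s\n' "${files[i]}" "${files[j]}"
    done
done | xargs -n2 -P4 sh -c \
    'join -1 2 -2 2 "$1" "$2" | cut -d " " -f1 > "${1%.txt}$2"' sh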
It looks like what you are after can be accomplished with two nested for loops and a lexicographic comparison, so that each pair is visited only once, in alphabetical order:
# prints pairs of filenames
for f in dir/*; do
    for g in dir/*; do
        if [[ "$f" < "$g" ]]; then   # ensure alphabetical order, so each pair appears once
            echo "$f" "$g"
        fi
    done
done
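If the goal is the actual join output rather than just the pair names, the inner branch could call the pipeline from the question directly (a sketch; output names follow the AB.txt convention used above):
for f in dir/*.txt; do
    for g in dir/*.txt; do
        if [[ "$f" < "$g" ]]; then   # each unordered pair handled exactly once
            out="$(basename "$f" .txt)$(basename "$g" .txt).txt"
            join -1 2 -2 2 "$f" "$g" | cut -d ' ' -f1 > "$out"
        fi
    done
done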
Here's why you don't want to use bash for this:
First create 1000 files
seq 1000 | xargs touch
Now, distinct pairs with bash
time {
    files=(*)
    len=${#files[@]}
    for ((i=0; i<len-1; i++)); do
        a=${files[i]}
        for ((j=i+1; j<len; j++)); do
            b=${files[j]}
            echo "$a $b"
        done
    done >/dev/null
}
real 0m5.091s
user 0m4.818s
sys 0m0.262s
Versus, for example, the same in perl:
time {
    perl -e '
        opendir my $dh, ".";
        my @files = sort grep {$_ ne "." && $_ ne ".."} readdir $dh;
        closedir $dh;
        for (my $i = 0; $i < @files - 1; $i++) {
            my $a = $files[$i];
            for (my $j = $i + 1; $j < @files; $j++) {
                my $b = $files[$j];
                print "$a $b\n";
            }
        }
    ' > /dev/null
}
real 0m0.131s
user 0m0.120s
sys 0m0.006s
I am trying to sample 10000 random rows from a large dataset with ~3 billion rows (with headers). I've considered using shuf -n 1000 input.file > output.file but this seems quite slow (>2 hour run time with my current available resources).
I've also used awk 'BEGIN{srand();} {a[NR]=$0} END{for(i=1; i<=10; i++){x=int(rand()*NR) + 1; print a[x];}}' input.file > output.file from this answer for a percentage of lines from smaller files, though I am new to awk and don't know how to include headers.
I wanted to know if there was a more efficient solution to sampling a subset (e.g. 10000 rows) of data from the 200GB dataset.
I don't think any program written in a scripting language can beat shuf in the context of this question. Anyway, here is my attempt in bash. Run it with ./scriptname input.file > output.file
#!/bin/bash
samplecount=10000
datafile=$1

[[ -f $datafile && -r $datafile ]] || {
    echo "Data file does not exist or is not readable" >&2
    exit 1
}

linecount=$(wc -l "$datafile")
linecount=${linecount%% *}

pickedlinnum=(-1)
mapfile -t -O1 pickedlinnum < <(
    for ((i = 0; i < samplecount;)); do
        # build a ~60-bit random number from four 15-bit $RANDOM values
        rand60=$((RANDOM + 32768*(RANDOM + 32768*(RANDOM + 32768*RANDOM))))
        linenum=$((rand60 % linecount))
        if [[ -z ${slot[linenum]} ]]; then   # no collision
            slot[linenum]=1
            echo "$linenum"
            ((++i))
        fi
    done | sort -n)

# read the file once, skipping the gap between consecutive picked line numbers
for ((i = 1; i <= samplecount; ++i)); do
    mapfile -n1 -s$((pickedlinnum[i] - pickedlinnum[i-1] - 1))
    echo -n "${MAPFILE[0]}"
done < "$datafile"
Something in awk. Supply it with a random seed ($RANDOM in bash) and the number n of wanted records. It counts the lines with wc -l, uses that count to randomly select n line numbers between 1 and lines[1], and outputs those lines. Can't really say anything about speed, I don't even have 200 GBs of disk. (:
$ awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd="wc -l " ARGV[1]                           # use wc for line counting
    if(ARGV[1]==""||n==""||(cmd | getline t)<=0)   # require all parameters
        exit 1                                     # else exit
    split(t,lines)                                 # wc -l returns "lines filename"
    srand(seed)                                    # use the seed
    while(c<n) {                                   # keep looping n times
        v=int((lines[1]) * rand())+1               # get a random line number
        if(!(v in a)) {                            # if it is not used yet
            a[v]                                   # use it
            ++c
        }
    }
}
(NR in a)' file                                    # print if NR is in the selected set
Testing with a dataset from seq 1 100000000: shuf -n 10000 file took about 6 seconds, where the awk above took about 18 s.
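The question also mentioned not knowing how to keep the header. A possible tweak of the awk above (an untested sketch of mine) is to always print line 1 and draw the random line numbers from 2 onwards:
awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd = "wc -l " ARGV[1]
    if (ARGV[1] == "" || n == "" || (cmd | getline t) <= 0)
        exit 1
    split(t, lines)
    srand(seed)
    while (c < n) {
        v = int((lines[1] - 1) * rand()) + 2   # random line number in 2..lines[1]
        if (!(v in a)) { a[v]; ++c }           # skip duplicates
    }
}
NR == 1 || (NR in a)' file                     # header plus the sampled lines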
I have a tab separated file:
c1 1000000
c2 2000000
c3 1000000
I would like to loop through each line of that file, save the second column in a variable, and then loop through increments of that number to generate a specific new file.
out=""
while read i; do
length=$(echo $i | cut -d$'\t' -f2) #How to use $i here?
c=$(echo $i | cut -d$'\t' -f1)
start=1
end=10000
for (( i = 0; i < $(expr $length / 500); i++ )); do
start=$(expr $start + $i \* 500)
end=$(expr $end + $i \* 500)
echo $c $start $end >> out
done
done <file
Of course, I am always happy to learn about how inefficient my code may be and how I can improve it.
Thanks for your input!
The problem isn't specific to loops -- it's specific to unquoted echo commands. As described in BashPitfalls #14, echo $i string-splits and glob-expands the contents of $i before passing them to echo.
Part of string-splitting is that the content is split into words and the words are passed as separate parameters -- so what actually runs is echo "c1" "1000000", which doesn't put a tab between the two values, so your cut command can't find a tab to cut on.
The Right Way to fix this is to not use cut at all:
while IFS=$'\t' read -r c length; do
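Picking that up, the rest of the loop might look like the sketch below; it keeps the increment logic from the question but lets read do the splitting and uses shell arithmetic instead of expr (the out file name is taken from the question):
while IFS=$'\t' read -r c length; do
    start=1
    end=10000
    for (( i = 0; i < length / 500; i++ )); do
        start=$(( start + i * 500 ))
        end=$(( end + i * 500 ))
        echo "$c $start $end"
    done
done < file > out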
I have an input file, over 1,000,000 lines long which looks something like this:
G A 0|0:2,0:2:3:0,3,32
G A 0|1:2,0:2:3:0,3,32
G C 1|1:0,1:1:3:32,3,0
C G 1|1:0,1:1:3:32,3,0
A G 1|0:0,1:1:3:39,3,0
For my purposes, everything after the first : in the third field is irrelevant (but I left it in as it'll affect the code).
The first field defines the values coded as 0 in the third field, and the second field defines the values coded as 1.
So, for example:
G A 0|0 = G|G
G A 1|0 = A|G
G A 1|1 = A|A
etc.
I first need to decode the third field, and then convert it from a vertical list to a horizontal list of values, with the values before the | on one line, and the values after on a second line.
So the example at the top would look like this:
HAP0 GGCGG
HAP1 GACGA
I've been working in bash, but any other suggestions are welcome. I have a script which does the job - but it's incredibly slow and long-winded and I'm sure there's a better way.
echo "HAP0 " > output.txt
echo "HAP1 " >> output.txt
while IFS=$'\t' read -a array; do
ref=${array[0]}
alt=${array[1]}
data=${array[2]}
IFS=$':' read -a code <<< $data
IFS=$'|' read -a hap <<< ${code[0]}
if [[ "${hap[0]}" -eq 0 ]]; then
sed -i "1s/$/${ref}/" output.txt
elif [[ "${hap[0]}" -eq 1 ]]; then
sed -i "1s/$/${alt}/" output.txt
fi
if [[ "${hap[1]}" -eq 0 ]]; then
sed -i "2s/$/${ref}/" output.txt
elif [[ "${hap[1]}" -eq 1 ]]; then
sed -i "2s/$/${alt}/" output.txt
fi
done < input.txt
Suggestions?
Instead of running sed in a subshell, use parameter expansion.
#!/bin/bash
printf '%s ' HAP0 > tmp0
printf '%s ' HAP1 > tmp1

while read -a cols ; do
    indexes=${cols[2]}
    indexes=${indexes%%:*}   # keep only the part before the first ':', e.g. "0|1"
    idx0=${indexes%|*}       # digit before the '|'
    idx1=${indexes#*|}       # digit after the '|'
    printf '%s' "${cols[idx0]}" >> tmp0
    printf '%s' "${cols[idx1]}" >> tmp1
done < "$1"

cat tmp0
printf '\n'
cat tmp1
printf '\n'

rm tmp0 tmp1
The script creates two temporary files: one collects the first output line, the other the second.
Or use Perl for an even faster solution:
#!/usr/bin/perl
use warnings;
use strict;

my @haps;
while (<>) {
    my @cols = split /[\s|:]+/, $_, 5;
    $haps[$_] .= $cols[ $cols[ $_ + 2 ] ] for 0, 1;
}
print "HAP$_ $haps[$_]\n" for 0, 1;
How can I do an incremental for loop for the -n in the head command (head -n)?
Does this work?
for (( i = 1 ; i <= $NUMBER ; i++ ))
head -$(NUMBER) filename.txt
NUMBER=$((NUMBER+1))
done
The code is supposed to display different parts of filename.txt using the -n option.
The following should work:
for (( i = 1 ; i <= `wc -l filename.txt | cut -f 1 -d ' '` ; i++ )); do
    head -$i filename.txt | tail -1
done
The wc -l filename.txt gets the number of lines in filename.txt, and cut -f 1 -d ' ' takes the first field from the wc output, which is that line count. This is used as the upper bound for the loop.
head -$i takes the first $i lines and tail -1 takes the last line of that. This gives you one line blocks.
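If the real goal is to process the file line by line rather than to print growing prefixes, a single pass avoids re-reading the file with head on every iteration (a sketch of mine, not from the original answer):
i=1
while IFS= read -r line; do
    echo "$i: $line"     # do something with line number $i and its text
    (( i++ ))
done < filename.txt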
I have a list of files where two subsequent ones always belong together. I would like a for loop to extract two files out of this list per iteration and then work on those two files at a time (for example, let's say I want to just concatenate, i.e. cat, the two files).
In a simple case, my list of files is this:
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
I could hack around it and say
FILES="file1 file2"
for file in $FILES
do
    actual_mateA=${file}_mateA.txt
    actual_mateB=${file}_mateB.txt
    cat $actual_mateA $actual_mateB
done
But I would like to be able to handle lists where mate A and mate B have arbitrary names, e.g.:
FILES="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
Is there a way to extract two values out of $FILES per iteration?
Use an array for the list:
files=(fileA1 fileA2 fileB1 fileB2)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
    echo "${files[i]}" "${files[i+1]}"
done
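If the list comes from a glob rather than a hand-written assignment, the same loop works unchanged, since a glob expands in sorted order and keeps mates adjacent (a sketch; the _mate naming is taken from the question):
files=( *_mate*.txt )    # sorted expansion keeps mate A next to mate B
for (( i=0; i<${#files[@]}; i+=2 )); do
    cat "${files[i]}" "${files[i+1]}"
done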
You could read the values from a while loop and use xargs to restrict each read operation to two tokens.
files="filaA1 fileA2 fileB1 fileB2"
while read -r a b; do
echo $a $b
done < <(echo $files | xargs -n2)
You could use xargs(1), e.g.
ls -1 *.txt | xargs -n2 COMMAND
The switch -n2 lets xargs take 2 consecutive filenames from the pipe output, which are handed down to the COMMAND.
To concatenate the 10 files file01.txt ... file10.txt pairwise
one can use
ls *.txt | xargs -n2 sh -c 'cat $@ > $1.$2.joined' dummy
to get the 5 result files
file01.txt.file02.txt.joined
file03.txt.file04.txt.joined
file05.txt.file06.txt.joined
file07.txt.file08.txt.joined
file09.txt.file10.txt.joined
Please see 'info xargs' for an explanation.
How about this:
park=''
for file in $files   # wherever you get them from, maybe $(ls) or whatever
do
    if [ "$park" = '' ]
    then
        park=$file
    else
        process "$park" "$file"
        park=''
    fi
done
In each odd iteration it just stores the value (in park) and in each even iteration it then uses the stored and the current value.
Seems like one of those things awk is suited for
$ awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES"
file1_mateA.txt file1_mateB.txt
file2_mateA.txt file2_mateB.txt
You could then loop over it by setting IFS=$'\n'
e.g.
#!/bin/bash
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt file3_mat
input=$(awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }'
IFS=$'\n'
for set in $input; do
cat "$set" # or something
done
Which will try to do
$ cat file1_mateA.txt file1_mateB.txt
$ cat file2_mateA.txt file2_mateB.txt
And ignore the odd case without the match.
You can transform your string into an array and read that array two elements at a time:
#!/bin/bash
string="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
array=(${string})
size=${#array[*]}
idx=0
while [ "$idx" -lt "$size" ]
do
echo ${array[$idx]}
echo ${array[$(($idx+1))]}
let "idx=$idx+2"
done
If the delimiter in your string is something other than a space (e.g. ;), you can use the following transformation to an array:
array=(${string//;/ })
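For example, with a hypothetical semicolon-separated list:
string="file1_mateA.txt;file1_mateB.txt;file2_mateA.txt;file2_mateB.txt"
array=(${string//;/ })   # replace every ';' with a space, then word-split into an array
echo "${array[2]}"       # prints: file2_mateA.txt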
You could try something like this:
echo file1 file2 file3 file4 | while read -d ' ' a; do read -d ' ' b; echo $a $b; done
file1 file2
file3 file4
Or this somewhat more cumbersome technique:
echo file1 file2 file3 file4 |tr " " "\n" | while :;do read a || break; read b || break; echo $a $b; done
file1 file2
file3 file4