Take two at a time in a bash "for file in $list" construct - bash

I have a list of files where two subsequent ones always belong together. I would like a for loop to extract two files from this list per iteration and then work on those two files at a time (for example, let's say I just want to concatenate, i.e. cat, the two files).
In a simple case, my list of files is this:
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt"
I could hack around it and say
FILES="file1 file2"
for file in $FILES
do
actual_mateA=${file}_mateA.txt
actual_mateB=${file}_mateB.txt
cat $actual_mateA $actual_mateB
done
But I would like to be able to handle lists where mate A and mate B have arbitrary names, e.g.:
FILES="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
Is there a way to extract two values out of $FILES per iteration?

Use an array for the list:
files=(fileA1 fileA2 fileB1 fileB2)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
echo "${files[i]}" "${files[i+1]}"
done
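Applied to the mate files from the question, the same index-stepping loop might look like this (a minimal sketch, reusing the filenames given above):
files=(first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt)
for (( i=0; i<${#files[@]} ; i+=2 )) ; do
    cat "${files[i]}" "${files[i+1]}"   # work on one pair per iteration
done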

You could read the values in a while loop and use xargs to restrict each read to two tokens.
files="fileA1 fileA2 fileB1 fileB2"
while read -r a b; do
echo $a $b
done < <(echo $files | xargs -n2)

You could use xargs(1), e.g.
ls -1 *.txt | xargs -n2 COMMAND
The -n2 switch lets xargs take 2 consecutive filenames from the pipe output, which are then handed down to COMMAND.
To concatenate the 10 files file01.txt ... file10.txt pairwise
one can use
ls *.txt | xargs -n2 sh -c 'cat $@ > $1.$2.joined' dummy
to get the 5 result files
file01.txt.file02.txt.joined
file03.txt.file04.txt.joined
file05.txt.file06.txt.joined
file07.txt.file08.txt.joined
file09.txt.file10.txt.joined
Please see 'info xargs' for an explanation.

How about this:
park=''
for file in $files # wherever you get them from, maybe $(ls) or whatever
do
if [ "$park" = '' ]
then
park=$file
else
process "$park" "$file"
park=''
fi
done
In each odd iteration it just stores the value (in park) and in each even iteration it then uses the stored and the current value.
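As a concrete sketch for the concatenation case from the question (process replaced by cat, everything else unchanged):
park=''
for file in $FILES
do
    if [ "$park" = '' ]
    then
        park=$file             # odd iteration: remember the first mate
    else
        cat "$park" "$file"    # even iteration: act on the stored and the current file
        park=''
    fi
done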

Seems like one of those things awk is suited for
$ awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }' <<< "$FILES"
file1_mateA.txt file1_mateB.txt
file2_mateA.txt file2_mateB.txt
You could then loop over it by setting IFS=$'\n'
e.g.
#!/bin/bash
FILES="file1_mateA.txt file1_mateB.txt file2_mateA.txt file2_mateB.txt file3_mat
input=$(awk '{for (i = 1; i <= NF; i+=2) if( i+1 <= NF ) print $i " " $(i+1) }'
IFS=$'\n'
for set in $input; do
cat "$set" # or something
done
Which will try to do
$ cat file1_mateA.txt file1_mateB.txt
$ cat file2_mateA.txt file2_mateB.txt
And ignore the odd case without the match.

You can transform your string into an array and read this new array element by element:
#!/bin/bash
string="first_file_first_mate.txt first_file_second_mate.txt file2_mate1.txt file2_mate2.txt"
array=(${string})
size=${#array[*]}
idx=0
while [ "$idx" -lt "$size" ]
do
echo ${array[$idx]}
echo ${array[$(($idx+1))]}
let "idx=$idx+2"
done
If your string has a delimiter other than a space (e.g. ;), you can use the following transformation into an array:
array=(${string//;/ })
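For example, with a hypothetical semicolon-separated list:
string="mateA.txt;mateB.txt;mateC.txt;mateD.txt"
array=(${string//;/ })
echo "${array[0]}"   # mateA.txt
echo "${array[1]}"   # mateB.txt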

You could try something like this:
echo file1 file2 file3 file4 | while read -d ' ' a; do read -d ' ' b; echo $a $b; done
file1 file2
file3 file4
Or use this somewhat more cumbersome technique:
echo file1 file2 file3 file4 |tr " " "\n" | while :;do read a || break; read b || break; echo $a $b; done
file1 file2
file3 file4

Related

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the file with ranges and then, for every line, look for all values that fall into the range in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
#read line as array
do
cat ~/blahblah/${line[0]}_file.val | while read number;
#get tag name and cat the appropriate file
do
if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
#check if current value fall into range
then
echo $number >> ${line[0]}.output
#toss the value that fall into interval to another file
elif [[ "$number" -gt "${line[2]}" ]]
then break
fi
done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
<${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version is slightly different in two ways: the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
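If the empty output files are unwanted, one way around it is to collect the matches first and only append when something was found (a sketch built on the same Perl one-liner; the early bail-out is still left out):
while read -r tag low hi; do
    matches=$(perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" "${tag}_file.val")
    [ -n "$matches" ] && printf '%s\n' "$matches" >> "${tag}.output"
done < ranges.val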
Another, not so efficient, implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
{for(k in t)
if(t[k]==FILENAME) {
inout = t[k] "." ((s[k]<=$1 && $1<=e[k])?"in":"out");
print > inout;
next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names, otherwise you have to add tag extraction from the filenames. The suffix ".in" marks values inside a range and ".out" marks the rest. This depends on the sorted order of the files. Right now it iterates over all ranges for every line; if you have thousands of tag files, adding another layer to filter the ranges per tag will speed it up, as sketched below.
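That extra layer could look roughly like this (a sketch, same renamed files assumed): the ranges are indexed per tag, so each value is only compared against the ranges of its own tag instead of all of them.
$ awk 'NR==FNR {n[$1]++; s[$1,n[$1]]=$2; e[$1,n[$1]]=$3; next}
       {tag=FILENAME
        for(k=1; k<=n[tag]; k++)
          if(s[tag,k]<=$1 && $1<=e[tag,k]) {print > (tag ".in"); next}
        print > (tag ".out")}' ranges tag_1 tag_2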
I'd write
while read -u3 -r tag start end; do
f="${tag}_file.val"
if [[ -r $f ]]; then
while read -u4 -r num; do
(( start <= num && num <= end )) && echo "$num"
done 4< "$f"
fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution.
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
#load "filefuncs"
function read_file(tag, start, end, file, number, statdata) {
file = tag "_file.val"
if (stat(file, statdata) != -1) {
while ((getline number < file) > 0) {
if (start <= number && number <= end) print number
}
close(file)
}
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
$file = $F[0] . "_file.val";
next unless -r $file;
open $fh, "<", $file;
while ($num = <$fh>) {
print $num if $F[1] <= $num and $num <= $F[2]
}
close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format, called .bed, is used to describe ranges on chromosomes, but it should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool which might help you is intersect.
With this installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
#reformatting your positions to .bed format:
#1 adding the tag to each line
#2 repeating the position to make it a range
#3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed
#making your range-file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed
#doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed
#splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer oneliners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

Bash pass the file names which are not in the ith element of loop

In a simple processing of files, where you want to do something on every file in a directory, you do something like this:
for i in file1 file2 file3 file5
do
echo "Processing $i"
done
What I want to do here is pass $i as well as the non-$i files as arguments to a command. Let's say my directory contains 4 files (file1, file2, file3, file5). For example, in the first iteration of the loop, when file1 is being processed, I want to pass the rest of the files (file2, file3, file5) to the -b argument of the command.
For example, first iteration of loop in bash should look something like this:
FILES=/path/to/directory
for i in $FILES
do
bedtools intersect -a $i -b file2 file3 file5
done
In the second iteration, as file2 is in $i, the rest of the files will be passed to the -b argument.
for i in $FILES
do
bedtools intersect -a $i -b file1 file3 file5
done
and so on for all the files in the directory. In short, pass the current file to the -a argument and the rest of the files to the -b argument.
It will be great if somebody can help me with this. Thank you.
You can just use a numeric loop and take slices out of the array:
shopt -s nullglob
files=( path/to/directory/* )
for (( i = 0; i < ${#files[@]}; ++i )); do
file=${files[i]}
others=( "${files[@]:0:i}" "${files[@]:i+1}" )
bedtools intersect -a "$file" -b "${others[@]}"
done
This loops through the indices of the array files and slices the parts before and after the current index i to get the others, as illustrated below.
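A quick way to see what the slicing produces, using the question's file names and echo in place of bedtools:
files=(file1 file2 file3 file5)
i=1
echo "current: ${files[i]}"
echo "others: ${files[@]:0:i} ${files[@]:i+1}"
# current: file2
# others: file1 file3 file5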
You can try it out like this as well:
op=$(find /path/to/directory ! -iname ".*")
temp=$op
for i in $op;
do
rfile=${temp//$i/}
rfile=$(echo $rfile | tr '\n' ' ')
bedtools intersect -a $i -b $rfile
done
count=0; files=(*)
for i in ${files[*]}; do
unset files[count]
echo "bedtools intersect -a $i -b ${files[*]}"
files+=($i)
((count++))
done

Adding numbers with a while loop using piped output

So I am running a randomfile that can receive several arguments ($1 and $2, not shown), and then does something with the arguments passed...
With the 3rd argument, I am supposed to search for $3 (or not $3) in file1 and add the number of instances of this to file2...
This works fine:
cat file1 | grep $3 | wc -l | while read line1; do echo $3 $line1 > file2; done
cat file1 | grep -v $3 | wc -l | while read line2; do echo not $3 $line2 >> file2; done
Now I am trying to read file2, which holds the instances of the search. I want to get the numbers in the file and compute their sum, to then append it to file2. So, for example, if $3 was "baby":
file2 would contain:
baby 30
not baby 20
and then I want to get the sum of 20 and 30 and append it to that same file2, so that it looks like:
baby 30
not baby 20
total 50
This is what I have at the moment:
cat file2 | grep -o '[0-9]*' | while read num ; do sum=$(($sum + $num));echo "total $sum" >> file2; done
My file2 ends up with two lines for totals, where one of them is what I need:
baby 30
not baby 20
total 30
total 50
What did I miss here?
This is happening because your echo is within your while loop.
The obvious solution would be to move it outside the loop, but if you try this you will find that $sum is not set; this is because each part of a pipeline runs in its own subshell, so variables set in the while loop are lost when it ends. You can solve this by using braces ({}) to group your commands:
cat file2 | grep -o '[0-9]*' | { while read num ; do sum=$(($sum + $num)); done; echo "total $sum" >> file2; }
Other answers do point out better ways of doing this, but this hopefully helps you understand what is happening.
cat file1 | grep $3 | wc -l | while read line1; do echo $3 $line1 > file2; done
If you want to count the instances of $3, you can use the option -c of grep, avoiding a pipe to wc(1). Moreover, it would be better to quote the $3. Finally, you don't need a loop to read the count (either from wc or grep): it is a single line! So, your code above could be written like this:
count=$(grep -c "$3" file1)
echo $count $3 >file2
The second grep would be just the same as before:
count=$(grep -vc "$3" file1)
echo $count $3 >>file2
Now you should have the intermediate result:
30 baby
20 not baby
Note that I reversed the two terms, count and pattern; this is because we know that the count is a single word, but the pattern could be more words. So by writing the count first, we have a well-defined format: "count, then all the rest".
The third loop can be written like this:
sum=0
while read num string; do
# string is filled with all the rest on the line
let "sum = $sum + $num"
done < file2
echo "$sum total" >> file2
There are other ways to sum up the total (one is sketched below); if needed, you could also reverse the terms of the final file again, as in your original - it could be done by using another file again.
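One such alternative is to let awk append the total in a single pass (a sketch assuming file2 has the "count, then all the rest" layout produced above):
awk '{ sum += $1; print } END { print sum, "total" }' file2 > file2.tmp && mv file2.tmp file2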

Linux: Search for coincidences in four different files

Scenario: four files with 300 lines each. I want to know which lines appear in all four files, using bash only (no perl/python/ruby please).
Quick sample
$cat bad_domains.urlvoid
a
b
c
d
e
$cat bad_domains.alienvault
f
g
a
c
h
$cat bad_domains.hphosts
i
j
k
a
h
$cat bad_domains.malwaredomain
l
b
m
f
a
j
I only want to match the "a" i tried with stuff like this but it's slow as hell:
for void in $(cat bad_domains.urlvoid)
do
for vault in $(cat bad_domains.alienvault)
do
for hphosts in $(cat bad_domains.hphosts)
do
for malwaredomain in $(cat bad_domains.malwaredomain)
do
if [ $void == $vault -a $void == $hphosts -a $void == $malwaredomain -a $vault == $hphosts -a $vault == $malwaredomain -a $hphosts == $malwaredomain ]
then
echo $void
fi
done
done
done
done
Any good tips for optimizing my code? I read something about dichotomic search that maybe could work.
Using comm:
comm -12 <(awk 'FNR==NR{a[$0];next} $0 in a' f1 f2) <(awk 'FNR==NR{a[$0];next} $0 in a' f3 f4)
a
Which works using these 3 steps:
Get common strings from file1 and file2
Get common strings from file3 and file4
Get common strings from the above 2 steps, thus getting the intersection of the 4 sets
EDIT: Pure awk solution:
awk 'FNR==NR{a[$0];next} $0 in a' <(awk 'FNR==NR{a[$0];next} $0 in a' f1 f2) <(awk 'FNR==NR{a[$0];next} $0 in a' f3 f4)
If the lines are unique within each file:
cat file1 file2 file3 file4 | sort | uniq -c | grep '^ *4 '
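If lines may repeat within a file, de-duplicate each file first; a sketch with the same four files:
cat <(sort -u file1) <(sort -u file2) <(sort -u file3) <(sort -u file4) | sort | uniq -c | grep '^ *4 '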
For bash 4.x (and ksh93)
Create an associative array indexed by the lines of one of the files (master).
For each of the remaining files, create a second array (work) indexed by the file's lines, then iterate over the master
array removing any entry with a key which does not also appear in the work array.
Any keys left in master[] after processing must have been in all files.
list=( bad_domains.* )
typeset -A master
while IFS= read -r key ; do master[$key]=1 ; done < "${list[0]}"
unset list[0]
for file in "${list[#]}" ; do
typeset -A work
while IFS= read -r key ; do work[$key]=1 ; done < "$file"
for key in "${!master[#]}" ; do [[ ${work[$key]+set} = set ]] || unset master[$key] ; done
unset work
done
for key in "${!master[#]}" ; do printf '%s\n' "$key" ; done

In bash, how can I print the first n elements of a list?

In bash, how can I print the first n elements of a list?
For example, the first 10 files in this list:
FILES=$(ls)
UPDATE: I forgot to say that I want to print the elements on one line, just like when you print the whole list with echo $FILES.
FILES=(*)
echo "${FILES[#]:0:10}"
Should work correctly even if there are spaces in filenames.
FILES=$(ls) creates a string variable. FILES=(*) creates an array. See this page for more examples on using arrays in bash. (thanks lhunath)
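A quick way to see the difference, with two hypothetical file names (one containing a space), in an empty test directory:
touch "a file.txt" b.txt
FILES=$(ls)
for f in $FILES; do echo "[$f]"; done          # "a file.txt" is split into two words
FILES=(*)
for f in "${FILES[@]}"; do echo "[$f]"; done   # each filename stays intact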
Why not just this to print the first 50 files:
ls -1 | head -50
FILE="$(ls | head -1)"
Handled spaces in filenames correctly too when I tried it.
My way would be:
ls | head -10 | tr "\n" " "
This will print the first 10 lines returned by ls, and then tr replaces all line breaks with spaces. Output will be on a single line.
echo $FILES | awk '{for (i = 1; i <= 10; i++) {print $i}}'
Edit: AAh, missed your comment that you needed them on one line...
echo $FILES | awk '{for (i = 1; i <= 10; i++) {printf "%s ", $i}}'
That one does that.
to do it interactively:
set $FILES && eval eval echo \\\${1..10}
to run it as a script, create foo.sh with contents
N=$1; shift; eval eval echo \\\${1..$N}
and run it as
bash foo.sh 10 $FILES
An addition to the answers of "Ayman Hourieh" and "Shawn Chin", in case it is needed for something other than the contents of a directory.
In newer versions of bash you can use mapfile to store the directory listing in an array. See help mapfile.
mapfile -t files_in_dir < <( ls )
If you want it completely in bash use printf "%s\n" * instead of ls, or just replace ls with any other command you need.
Now you can access the array as usual and get the data you need.
First element:
${files_in_dir[0]}
Last element (do not forget the space after ":"):
${files_in_dir[@]: -1}
A range, e.g. 20 elements starting at index 10:
${files_in_dir[@]:10:20}
Attention: for large directories, this is much more memory-consuming than the other solutions.
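To tie this back to the question, the first 10 entries on a single line would then be (a sketch reusing the same array name):
mapfile -t files_in_dir < <(printf '%s\n' *)
echo "${files_in_dir[@]:0:10}"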
FILES=$(ls)
echo $FILES | fmt -1 | head -10
