bash for loop only returns the last value x times (x = length of array) - bash

I have a file with IDs such as below:
A
D
E
And I have a second file with the same IDs and extra info that I need:
A 50 G25T1 7.24 298
B 20 G234T2 8.3 80
C 5 G1I1 5.2 909
D 500 G458T3 0.4 79
E 321 G46I2 45.8 901
I want to output the third column of the second file for the rows whose first column matches an ID from the first file:
G25T1
G458T3
G46I2
The issue I have is that while the for loop runs, the output is as follows:
G46I2
G46I2
G46I2
Here is my code:
a=0; IFS=$'\r\n' command eval 'ids=($(awk '{print$1}' shared_single_copies.txt | sed -e 's/[[:space:]]//g'))'; for id in "${ids[@]}"; do a=$(($a+1)); echo $a' '"$id"; awk '{$1=="${id}"} END {print $3}' run_Busco_A1/A1_single_copy_ids.txt >> A1_genes_sc_Buscos.txt; done

Your code is way too complicated. Try one of these solutions, where "file1" contains the IDs and "file2" contains the extra info:
$ join -o 2.3 file1 file2
G25T1
G458T3
G46I2
$ awk 'NR==FNR {id[$1]; next} $1 in id {print $3}' file1 file2
G25T1
G458T3
G46I2
For more help about join, check the man page.
For more help about awk, start with the awk info page.
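One caveat worth noting: join expects both inputs to be sorted on the join field. The sample files above happen to be sorted already; if yours are not, a sketch using process substitution:
join -o 2.3 <(sort file1) <(sort file2)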

@glenn jackman's answer was by far the most succinct and elegant imo. If you want to use loops, though, then this can work:
#!/bin/bash
# if output file already exists, clear it so we don't
# inadvertently duplicate data:
> A1_genes_sc_Buscos.txt
while read -r selector
do
    while read -r c1 c2 c3 garbage
    do
        [[ "$c1" = "$selector" ]] && echo "$c3" >> A1_genes_sc_Buscos.txt
    done < run_Busco_A1/A1_single_copy_ids.txt
done < shared_single_copies.txt
That should work for your use case, provided your real files are formatted the same way as the samples you gave.
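If the info file is large, re-reading it once per ID gets slow. Here is a single-pass sketch using a bash associative array; it assumes bash 4+ and the same file names and column layout as above:
#!/bin/bash
declare -A third_col
# read the info file once, remembering column 3 keyed by column 1
while read -r c1 c2 c3 rest; do
    third_col["$c1"]=$c3
done < run_Busco_A1/A1_single_copy_ids.txt

# then look up each ID from the list
> A1_genes_sc_Buscos.txt
while read -r selector; do
    [[ ${third_col[$selector]+set} ]] && echo "${third_col[$selector]}" >> A1_genes_sc_Buscos.txt
done < shared_single_copies.txt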

Related

cat multiple files into one using same amount of rows as file B from A B C

This is a strange question; I have been looking around and I wasn't able to find anything that matches what I want to do.
What I'm trying to do is:
File A, File B, File C
5 Lines, 3 Lines, 2 Lines.
Join all files in one file matching the same amount of the file B
The output should be
File A, File B, File C
3 Lines, 3 Lines, 3 Lines.
So in file A I have to remove two lines, and in file C I have to duplicate 1 line, so that all files match the line count of file B.
I was thinking to do a count to see how many lines each file has first
count1=`wc -l FileA| awk '{print $1}'`
count2=`wc -l FileB| awk '{print $1}'`
count3=`wc -l FileC| awk '{print $1}'`
Then, if a file's count is greater than file B's, remove lines; otherwise, add lines.
But I have got lost, as I'm not sure how to continue with this; I have never seen anyone try to do this.
Can anyone point me to an idea?
The output should be as per the attached picture (not reproduced here).
thanks.
Could you please try the following. I have used # as the separator; you could change it as per your need.
paste -d'#' file1 file2 file3 |
awk -v file2_lines="$(wc -l < file2)" '
BEGIN{
    FS=OFS="#"
}
FNR<=file2_lines{
    $1=$1?$1:prev_first
    $3=$3?$3:prev_third
    print
    prev_first=$1
    prev_third=$3
}'
Example of running the above code:
Let's say the following are the input files:
cat file1
File1_line1
File1_line2
File1_line3
File1_line4
File1_line5
cat file2
File2_line1
File2_line2
File2_line3
cat file3
File3_line1
File3_line2
When I run the above code as a script, the following is the output:
./script.ksh
File1_line1#File2_line1#File3_line1
File1_line2#File2_line2#File3_line2
File1_line3#File2_line3#File3_line2
You can get the first n lines of a file with the head command (or with sed), and you can generate new lines with echo. Writing everything to stdout means you don't have to deal with temporary files:
#!/bin/bash
fix_numlines() {
    local filename=$1
    local wantlines=$2
    local havelines=$(grep -c . "${filename}")
    head -${wantlines} "${filename}"
    if [ $havelines -lt $wantlines ]; then
        for i in $(seq $((wantlines-havelines))); do echo; done
    fi
}
lines=$(grep -c . fileB)
fix_numlines fileA ${lines}
fix_numlines fileB ${lines}
fix_numlines fileC ${lines}
If you want columnar output, it's even simpler:
paste fileA fileB fileC | head -$(grep -c . fileB)
Another for GNU awk that outputs in columns:
$ gawk -v seed=$RANDOM -v n=2 '   # n parameter is the file index number
BEGIN {                           # ... which defines the record count
    srand(seed)                   # random record is printed when not enough records
}
{
    a[ARGIND][c[ARGIND]=FNR]=$0   # hash all data to a first
}
END {
    for(r=1;r<=c[n];r++)          # loop records
        for(f=1;f<=ARGIND;f++)    # and fields for below output
            printf "%s%s",((r in a[f])?a[f][r]:a[f][int(rand()*c[f])+1]),(f==ARGIND?ORS:OFS)
}' a b c                          # -v n=2 means the second file ie. b
Output:
a1 b1 c1
a2 b2 c2
a3 b3 c1
If you don't like the random pick of a record, replace int(rand()*c[f])+1 with c[f].
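In other words, the printf line in the END block would become (a sketch of that substitution, repeating the file's last record instead of a random one):
printf "%s%s",((r in a[f])?a[f][r]:a[f][c[f]]),(f==ARGIND?ORS:OFS)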
$ gawk '                      # remember GNU awk only
NR==FNR {                     # count given files records
    bnr=FNR
    next
}
{
    print                     # output records of a b c
    if(FNR==bnr)              # ... up to bnr records
        nextfile              # and skip to next file
}
ENDFILE {                     # if you get to the end of the file
    if(bnr>FNR)               # but bnr not big enough
        for(i=FNR;i<bnr;i++)  # loop some
            print             # and duplicate the last record of the file
}' b a b c                    # first the file to count then all the files to print
To make a file have n lines you can use the following function (usage: toLength n file). This omits lines at the end if the file is too long and repeats the last line if the file is too short.
toLength() {
{ head -n"$1" "$2"; yes "$(tail -n1 "$2")"; } | head -n"$1"
}
To set all files to the length of FileB and show them side by side use
n="$(wc -l < FileB)"
paste <(toLength "$n" FileA) FileB <(toLength "$n" FileC) | column -ts$'\t'
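A quick sanity check of toLength with hypothetical file contents (the question does not list FileC's actual lines): a two-line file stretched to three lines repeats its last line.
$ printf 'c1\nc2\n' > FileC
$ toLength 3 FileC
c1
c2
c2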
As observed by user umläute, the side-by-side output makes things even easier. However, they used empty lines to pad out short files. The following solution repeats the last line to make short files longer.
stretch() {
    cat "$1"
    yes "$(tail -n1 "$1")"
}
paste <(stretch FileA) FileB <(stretch FileC) | column -ts$'\t' |
head -n"$(wc -l < FileB)"
This is a clean way using awk where we read each file only a single time:
awk -v n=2 '
BEGIN{
    while(1) {
        for(i=1;i<ARGC;++i) {
            if (b[i]=(getline tmp < ARGV[i])) a[i] = tmp
        }
        if (b[n]) for(i=1;i<ARGC;++i) print a[i] > ARGV[i]".new"
        else {break}
    }
}' f1 f2 f3 f4 f5 f6
This works in the following way:
the lead file is defined by the index n. Here we choose the lead file to be f2.
We do not process the files with awk's standard sequential record-and-field reading; instead we use the BEGIN block, where we read the files in parallel.
We do an infinite loop while(1) where we will break out if the lead-file has no more input.
Per cycle, we read a new line of each file using getline. If file i has a new line, store it in a[i] and record getline's return value in b[i]. If file i has reached its end, a[i] simply keeps the last line that was read.
Check the outcome of the lead file with b[n]. If we still read a line, print all the lines to the files f1.new, f2.new, ..., otherwise, break out of the infinite loop.
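A simple way to verify the result after running the script above (a sketch; assumes the lead file is f2): every generated .new file should have exactly as many lines as f2.
wc -l < f2
wc -l f1.new f2.new f3.new f4.new f5.new f6.new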

write a value to specific row and column using shell

I am very new to shell scripting. The tasks at hand are divided into 2 shell scripts.
I want to pipeline the two shell scripts (which should run from a directory other than the one where the scripts are stored), and this is presently working well.
First shell script contains:
Combining around 90 .lvm files stored inside a folder.
Crops each .lvm file, removes header and crops till the end of data.
Now I need to print a value in the 18th column once each file has been iterated, to mark the end of the file (here I am trying to write 500).
#!/bin/sh
clear
for file in "$1/"*.lvm; do
    a=$(awk '/X_Value/{ print NR; exit }' "$file")
    b=$(awk 'END {print NR}' "$file")
    awk '{OFS= "\t"} {NR==$b $18,"500"}' "$file"
    # specified row is $b and column number is 18
    sed "s|\$a|${b}|" "$file"
done
Second shell script contains:
Reading specific columns from a first shell script.
Which is:
#!/bin/sh
clear
while read line; do
    sleep 1
    awk -v OFS='\t' '{print $1, $2, $3, $8, $18}'
done
Output now:
file1 196287,265000 3,902977 -39,226354 0,873427
file1 196287,266000 3,890747 -51,032699 0,519405
file1 196287,267000 3,900080 -51,472975 -0,446108
....
....
....
file2 196287,268000 3,904586 -50,627182 -0,092086
file2 196287,269000 3,870793 -30,687314 1,195265
file2 196287,270000 3,897505 -30,073244 0,744692
....
....
Desired Output:
file1 196287,265000 3,902977 -39,226354 0,873427 0
file1 196287,266000 3,890747 -51,032699 0,519405 0
file1 196287,267000 3,900080 -51,472975 -0,446108 500
...
...
...
file2 196287,268000 3,904586 -50,627182 -0,092086 0
file2 196287,269000 3,870793 -30,687314 1,195265 0
file2 196287,270000 3,897505 -30,073244 0,744692 500
...
...
I do not like awk. I do not use awk. Mock me if you must.
If bash can't handle something, I rewrite it in perl. =o)
That said - here's a clumsy all-bash version.
Many improvements to be made, but work keeps interrupting my fun, lol
nl="
"
for f in "$1/"*.lvm
do typeset out=''
while read a b c d e
do out="$out$( printf "%-25s%14s%14s%14s%14s%14s" $f $a $b $c $d $e )$nl"
done <<< "$(cut -f 1,2,3,8,18 -d ' ' $f )"
stack="$stack$nl$(printf "${out% *} 500$nl")"
done
echo "$stack"|grep -v '^$'

How to combine two text files using bash

I have two text files that I wish to combine in bash so that every line in one file is combined with every line in the other file.
file1.txt
abc123
def346
ghj098
file2.txt
PSYC1001
PSYC1002
PSYC1003
I want to combine them so that line 1 of file1 is added to every line of file2, with a pipe delimiter | in between them.
e.g.
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
Then the same for the other lines in file1 so I would end up with
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
I've been doing similar, simpler text manipulations in bash by copying examples from this site, but I've not found an example that can do this. I would love to hear your suggestions. I know it must be simple, but I've not worked it out yet.
The shortest one: the join command:
join -j2 -t'|' -o2.1,1.1 file1 file2
-t'|' - input/output field separator
-o FORMAT - FORMAT is one or more comma or blank separated specifications, each being FILENUM.FIELD or 0
-j2 joins on field 2, which neither file has, so every line gets an empty join key and each line of file1 is paired with each line of file2, giving the full cross product.
The output:
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
This awk one-liner should help you:
awk -v OFS="|" 'NR==FNR{a[NR]=$0;c=NR;next}{for(i=1;i<=c;i++){print a[i],$0}}' file2 file1
Test with your data:
kent$ awk -v OFS="|" 'NR==FNR{a[NR]=$0;c=NR;next}{for(i=1;i<=c;i++){print a[i],$0}}' f2 f1
PSYC1001|abc123
PSYC1002|abc123
PSYC1003|abc123
PSYC1001|def346
PSYC1002|def346
PSYC1003|def346
PSYC1001|ghj098
PSYC1002|ghj098
PSYC1003|ghj098
Here are 2 ways to do it in plain bash:
while IFS= read -u3 -r elem1; do
    while IFS= read -u4 -r elem2; do
        echo "$elem2|$elem1"
    done 4<file2.txt
done 3<file1.txt

mapfile -t f1 < file1.txt
mapfile -t f2 < file2.txt
for elem1 in "${f1[@]}"; do
    for elem2 in "${f2[@]}"; do
        echo "$elem2|$elem1"
    done
done
bash only
a1=( $(<f1) )
a2=( $(<f2) )
for i in "${a2[@]}"
do
    for j in "${a1[@]}"
    do
        echo "${j}|${i}"
    done
done

Comparing values in two files

I am comparing two files, each having one column and n number of rows.
file 1
vincy
alex
robin
file 2
Allen
Alex
Aaron
ralph
robin
If the data of file 1 is present in file 2, it should return 1, or else 0, in a tab-separated file.
Something like this
vincy 0
alex 1
robin 1
What I am doing is
#!/bin/bash
for i in `cat file1`
do
    cat file2 | awk '{ if ($1=="'$i'") print 1 ; else print 0 }' >> binary
done
The above code is not giving me the output I am looking for.
Kindly have a look and suggest a correction.
Thank you
The simple awk solution:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 " " seen[$0] + 0}' file2 file1
A simple explanation: for the lines in file2, NR==FNR, so the first action is executed and we simply record that a line has been seen. In file1, the 2nd action is taken and the line is printed, followed by a space, followed by a "0" or a "1", depending on if the line was seen in file2.
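Since the question asks for tab-separated output, a tiny variation of the same one-liner (a sketch) prints a tab instead of a space:
awk 'NR==FNR{ seen[$0]=1 } NR!=FNR{ print $0 "\t" seen[$0] + 0 }' file2 file1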
AWK loves to do this kind of thing.
awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
Swap the positions of file2 and file1 in the argument list to make file1 the dictionary instead of file2.
When FNR (the record number in the current file) and NR (the record number of all records so far) are equal, then the first file is the one being processed. Simply referencing an array element brings it into existence. This sets up the dictionary. The next instruction reads the next record.
Once FNR and NR aren't equal, subsequent file(s) are being processed and their data is looked up in the dictionary array.
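With the sample files from the question, a run would look something like this (the output is space-separated, as printed by print $1, f):
$ awk 'FNR == NR {a[tolower($1)]; next} {f = 0; if (tolower($1) in a) {f = 1}; print $1, f}' file2 file1
vincy 0
alex 1
robin 1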
The following code should do it.
Take a close look at the BEGIN and END sections.
#!/bin/bash
rm -f binary
for i in $(cat file1); do
    awk 'BEGIN {isthere=0;} { if ($1=="'$i'") isthere=1;} END { print "'$i'",isthere}' < file2 >> binary
done
There are several decent approaches. You can simply use line-by-line set math:
{
    grep -xF -f file1 file2 | sed $'s/$/\t1/'
    grep -vxF -f file1 file2 | sed $'s/$/\t0/'
} > somefile.txt
Another approach would be to simply combine the files and use uniq -c, then just swap the numeric column with something like awk:
sort file1 file2 | uniq -c | awk '{ print $2"\t"$1 }'
The comm command exists to do this kind of comparison for you.
The following approach does only one pass and scales well to very large input lists:
#!/bin/bash
while read; do
    if [[ $REPLY = $'\t'* ]] ; then
        printf "%s\t1\n" "${REPLY#?}"
    else
        printf "%s\t0\n" "$REPLY"
    fi
done < <(comm -2 <(tr '[A-Z]' '[a-z]' <file1 | sort) <(tr '[A-Z]' '[a-z]' <file2 | sort))
See also BashFAQ #36, which is directly on-point.
Another solution, if you have python installed.
If you're familiar with Python and are interested in the solution, you only need a bit of formatting.
#!/usr/bin/env python
f1 = open('file1').readlines()
f2 = open('file2').readlines()
f1_in_f2 = [int(x in f2) for x in f1]
for n, c in zip(f1, f1_in_f2):
    print n, c

Shell command to find lines common in two files

I'm sure I once found a shell command which could print the common lines from two or more files. What is its name?
It was much simpler than diff.
The command you are seeking is comm, e.g.:
comm -12 1.sorted.txt 2.sorted.txt
Here:
-1 : suppress column 1 (lines unique to 1.sorted.txt)
-2 : suppress column 2 (lines unique to 2.sorted.txt)
To easily apply the comm command to unsorted files, use Bash's process substitution:
$ bash --version
GNU bash, version 3.2.51(1)-release
Copyright (C) 2007 Free Software Foundation, Inc.
$ cat > abc
123
567
132
$ cat > def
132
777
321
So the files abc and def have one line in common, the one with "132".
Using comm on unsorted files:
$ comm abc def
123
132
567
132
777
321
$ comm -12 abc def # No output! The common line is not found
$
The last line produced no output, the common line was not discovered.
Now use comm on sorted files, sorting the files with process substitution:
$ comm <( sort abc ) <( sort def )
123
132
321
567
777
$ comm -12 <( sort abc ) <( sort def )
132
Now we got the 132 line!
To complement the Perl one-liner, here's its awk equivalent:
awk 'NR==FNR{arr[$0];next} $0 in arr' file1 file2
This will read all lines from file1 into the array arr[], and then check for each line in file2 if it already exists within the array (i.e. file1). The lines that are found will be printed in the order in which they appear in file2.
Note that the comparison in arr uses the entire line from file2 as index to the array, so it will only report exact matches on entire lines.
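If the files have several columns and you only want to match on, say, the first field rather than the whole line, a small variation (a sketch):
awk 'NR==FNR{arr[$1];next} $1 in arr' file1 file2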
Maybe you mean comm ?
Compare sorted files FILE1 and FILE2 line by line.
With no options, produce three-column output. Column one contains lines unique to FILE1, column two contains lines unique to FILE2, and column three contains lines common to both files.
The secret to finding this information is the info pages. For GNU programs, they are much more detailed than the man pages. Try info coreutils and it will list all the small useful utilities.
While
fgrep -v -f 1.txt 2.txt > 3.txt
gives you the differences of two files (what is in 2.txt and not in 1.txt), you could easily do a
fgrep -f 1.txt 2.txt > 3.txt
to collect all common lines, which should provide an easy solution to your problem. If you have sorted files, you should use comm nonetheless. Regards!
Note: You can use grep -F instead of fgrep.
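One caveat with the fgrep/grep -F approach: without -x it matches substrings, so a short line in 1.txt can match inside a longer line of 2.txt. If you only want whole-line matches, add -x (a sketch):
grep -Fxf 1.txt 2.txt > 3.txt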
If the two files are not sorted yet, you can use:
comm -12 <(sort a.txt) <(sort b.txt)
and it will work, avoiding the error message comm: file 2 is not in sorted order
when doing comm -12 a.txt b.txt.
perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
awk 'NR==FNR{a[$1]++;next} a[$1] ' file1 file2
On limited version of Linux (like a QNAP (NAS) I was working on):
comm did not exist
grep -f file1 file2 can cause some problems, as said by @ChristopherSchultz, and using grep -F -f file1 file2 was really slow (more than 5 minutes and never finished, versus 2-3 seconds with the method below, on files over 20 MB)
So here is what I did:
sort file1 > file1.sorted
sort file2 > file2.sorted
diff file1.sorted file2.sorted | grep "<" | sed 's/^< *//' > files.diff
diff file1.sorted files.diff | grep "<" | sed 's/^< *//' > files.same.sorted
If files.same.sorted should be in the same order as the original ones, then add this line for the same order as file1:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file1 > files.same
Or, for the same order as file2:
awk 'FNR==NR {a[$0]=$0; next}; $0 in a {print a[$0]}' files.same.sorted file2 > files.same
For how to do this for multiple files, see the linked answer to Finding matching lines across many files.
Combining these two answers (answer 1 and answer 2), I think you can get the result you need without sorting the files:
#!/bin/bash
ans="matching_lines"
for file1 in *
do
    for file2 in *
    do
        if [ "$file1" != "$ans" ] && [ "$file2" != "$ans" ] && [ "$file1" != "$file2" ] ; then
            echo "Comparing: $file1 $file2 ..." >> $ans
            perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' $file1 $file2 >> $ans
        fi
    done
done
Simply save it, give it execution rights (chmod +x compareFiles.sh) and run it. It will take all the files present in the current working directory, do an all-vs-all comparison, and leave the result in the "matching_lines" file.
Things to be improved:
Skip directories
Avoid comparing all the files two times (file1 vs file2 and file2 vs file1).
Maybe add the line number next to the matching string
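As a sketch for the first two points on that list (hypothetical additions, not part of the original script): skip anything that is not a regular file, and compare each unordered pair only once, reusing the $ans variable from the script above:
for file1 in *; do
    [ -f "$file1" ] && [ "$file1" != "$ans" ] || continue   # skip non-files and the output file
    for file2 in *; do
        [ -f "$file2" ] && [ "$file2" != "$ans" ] || continue
        [[ "$file1" < "$file2" ]] || continue                # compare each unordered pair once
        echo "Comparing: $file1 $file2 ..." >> "$ans"
        perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' "$file1" "$file2" >> "$ans"
    done
done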
Not exactly what you were asking, but something that may still be useful for a slightly different scenario.
If you just want to quickly check whether there is any repeated line among a bunch of files, you can use this quick solution:
cat a_bunch_of_files* | sort | uniq | wc
If the number of lines you get is less than the one you get from
cat a_bunch_of_files* | wc
then there is some repeated line.
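If you also want to see which lines repeat, rather than just whether any do, uniq -d prints only the duplicated lines (a sketch):
cat a_bunch_of_files* | sort | uniq -d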
rm -f file3.out
cat file1.out | while read line1
do
    cat file2.out | while read line2
    do
        if [[ $line1 == "$line2" ]]; then
            echo "$line1" >> file3.out
        fi
    done
done
This should do it.

Resources