bash: clean outer join of three files, preserving file-membership - bash

Consider the following three files with headers in the first row:
file1:
id name in1
1 jon 1
2 sue 1
file2:
id name in2
2 sue 1
3 bob 1
file3:
id name in3
2 sue 1
3 adam 1
I want to merge these files to get the following output, merged_files:
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
This request has several special features that I have not found implemented in a handy way in grep/sed/awk/join etc. Edit: You may assume, for simplicity, that the three files have already been sorted.

This is very similar to the problem solved in Bash script to find matching rows from multiple CSV files. It's not identical, but it is very similar. (So similar that I only had to remove three sort commands, change the three sed commands slightly, change the file names, change the 'missing' value from no to 0, and change the replacement in the final sed from comma to space.)
The join command with sed (usually sort too, but the data is already sufficiently sorted) are the primary tools needed. Assume that : does not appear in the original data. To record the presence of a row in a file, we want a 1 field in the file (it's almost there); we'll have join supply the 0 when there isn't a match. The 1 at the end of each non-heading line needs to become :1, and the last field in the heading also needs to be preceded by the :. Then, using bash's process substitution, we can write:
$ sed 's/[ ]\([^ ]*\)$/:\1/' file1 |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file2) |
> join -t: -a 1 -a 2 -e 0 -o 0,1.2,1.3,2.2 - <(sed 's/[ ]\([^ ]*\)$/:\1/' file3) |
> sed 's/:/ /g'
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 adam 0 0 1
3 bob 0 1 0
$
The sed command (three times) adds the : before the last field in each line of the files. The joins are very nearly symmetric. The -t: specifies that the field separator is the colon; the -a 1 and -a 2 mean that when there isn't a match in a file, the line will still be included in the output; the -e 0 means that if there isn't a match in a file, a 0 is generated in the output; and the -o option specifies the output columns. For the first join, -o 0,1.2,2.2 the output is the join column (0), then the second column (the 1) from the two files. The second join has 3 columns in the input, so it specifies -o 0,1.2,1.3,2.2. The argument - on its own means 'read standard input'. The <(...) notation is 'process substitution', where a file name (usually /dev/fd/NN) is provided to the join command, and it contains the output of the command inside the parentheses. The output is then filtered through sed once more to replace the colons with spaces, yielding the desired output.
The only difference from the desired output is the sequencing of 3 bob after 3 adam; it is not particularly clear on what basis you ordered them in reverse in your desired output. If it is crucial, a means can be devised for resolving the order differently (such as sort -k1,1 -k3,5, except that sorts the label line after the data; there are workarounds for that if necessary).

Code for GNU awk:
{
if ($1=="id") { v[i++]=$3; next }
b[$1,$2]=$1" "$2
c[i-1,$1" "$2]=$3
}
END {
printf ("id name")
for (x in v) printf (" %s", v[x]); printf ("\n")
for (y in b) {
printf ("%s", b[y])
for (z in v) if (c[z,b[y]]==0) {printf (" 0")} else printf (" %s", c[z,b[y]])
printf ("\n")
}
}
$cat file?
id name in1
1 jon 1
2 sue 1
id name in2
2 sue 1
3 bob 1
id name in3
2 sue 1
3 adam 1
$awk -f prog.awk file?
id name in1 in2 in3
3 bob 0 1 0
3 adam 0 0 1
1 jon 1 0 0
2 sue 1 1 1

This awk script will do what you want:
$1=="id"&&$2=="name"{
ins[$3]= 1;
lastin = $3;
}
$1!="id"||$2!="name" {
ids[$1] = 1;
names[$2] = 1;
a[$1,$2,lastin]= $3
used[$1,$2] = 1;
}
END {
printf "id name"
for (i in ins) {
printf " %s", i
}
printf "\n"
for (id in ids) {
for (name in names) {
if (used[id,name]) {
printf "%s %s", id, name
for (i in ins) {
printf " %d", a[id,name,i]
}
printf "\n"
}
}
}
}
Assuming your files are called list1, list2, etc., and the awk file is script.awk, you can run it like this
$ cat list* | awk -f script.awk
id name in1 in2 in3
1 jon 1 0 0
2 sue 1 1 1
3 bob 0 1 0
3 adam 0 0 1
I am sure that is a much shorter and simpler way to do it, but this is all I could come up with at 1:30 am :)

Related

use bash or awk to replace part of a string

I have the following example lines in a file:
sweet_25 2 0 4
guy_guy 2 4 6
ging_ging 0 0 3
moat_2 0 1 0
I want to process the file and have the following output:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
Notice that the required effect happened in lines 2 and 3 - that an underscore and text follwing a text is remove on lines where this pattern occurs.
I have not succeeded with the follwing:
sed -E 's/([a-zA-Z])_[a-zA-Z]/$1/g' file.txt >out.txt
Any bash or awk advice will be welcome.Thanks
If you want to replace the whole word after the underscore, you have to repeat the character class one or more times using [a-zA-Z]+ and use \1 in the replacement.
sed -E 's/([a-zA-Z])_[a-zA-Z]+/\1/g' file.txt >out.txt
If the words should be the same before and after the underscore, you can use a repeating capture group with a backreference.
If you only want to do this for the start of the string you can prepend ^ to the pattern and omit the /g at the end of the sed command.
sed -E 's/([a-zA-Z]+)(_\1)+/\1/g' file.txt >out.txt
The pattern matches:
([a-zA-Z]+) Capture group 1, match 1 or more occurrences of a char a-zA-Z
(_\1)+ Capture group 2, repeat matching _ and the same text captured by group 1
The file out.txt will contain:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
With your shown samples, please try following awk code.
awk 'split($1,arr,"_") && arr[1] == arr[2]{$1=arr[1]} 1' Input_file
Explanation: Simple explanation would be, using awk's split function that splits 1st field into an array named arr with delimiter _ AND then checking condition if 1st element of arr is EQAUL to 2nd element of arr then save only 1st element of arr to first field($1) and by mentioning 1 printing edited/non-edited lines.
You can do it more simply, like this:
sed -E 's/_[a-zA-Z]+//' file.txt >out.txt
This just replaces an underscore followed by any number of alphabetical characters with nothing.
$ awk 'NR~/^[23]$/{sub(/_[^ ]+/,"")} 1' file
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0
I would do:
awk '$1~/[[:alpha:]]_[[:alpha:]]/{sub(/_.*/,"",$1)} 1' file
Prints:
sweet_25 2 0 4
guy 2 4 6
ging 0 0 3
moat_2 0 1 0

Split a file based on number of groups in first column in bash and maximum line number

Consider the following (sorted) file test.txt where in the first column a occurs 3 times, b occurs once, c occurs 2 times and d occurs 4 times.
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
I would like to split this file to smaller files with maximum 4 lines. However, I need to retain the the groups in the smaller files, meaning that all lines that start with the same value in column $1 need to be in the same file. The size of the group is in this example never larger than the desired output length.
The expected output would be:
file1:
a 1
a 2
a 1
b 1
file2:
c 1
c 1
file3:
d 2
d 1
d 2
d 1
From the expected output, you can see that it if two or more groups together have less than the maximum line number (here 4), they should go into the same file.
Therefore: a + b have together 4 entries and they can go into the same file. However, c + d have together 6 entris. Therefore c has to go in its own file.
I am aware of this Awk oneliner:
awk '{print>$1".test"}' test.txt
But this results in a separate file for each group. This would not make much sense in the real-world problem that I am facing since it would lead to a lot of files being transferred to the HPC and back and making the overhead too intense.
A bash solution would be preferred. But it could also be Python.
Another awk. Had a busy day and this is only tested with your sample data so anything could happen. It creates files named filen.txt, where n>0:
$ awk -v n=4 '
BEGIN {
fc=1 # file numbering initialized
}
{
if($1==p||FNR==1) # when $1 remains same
b=b (++cc==1?"":ORS) $0 # keep buffering
else {
if(n-(cc+cp)>=0) { # if room in previous file
print b >> sprintf("file%d.txt",fc) # append to it
cp+=cc
} else { # if it just won t fit
close(sprintf("file%d.txt",fc))
print b > sprintf("file%d.txt",++fc) # creat new
cp=cc
}
b=$0
cc=1
}
p=$1
}
END { # same as the else above
if(n-(cc+cp)>=0)
print b >> sprintf("file%d.txt",fc)
else {
close(sprintf("file%d.txt",fc))
print b > sprintf("file%d.txt",++fc)
}
}' file
I hope I have understood your requirement correctly, could you please try following once written and tested with GNU awk.
awk -v count="1" '
FNR==NR{
max[$1]++
if(!a[$1]++){
first[++count2]=$1
}
next
}
FNR==1{
for(i in max){
maxtill=(max[i]>maxtill?max[i]:maxtill)
}
prev=$1
}
{
if(!b[$1]++){++count1};
c[$1]++
if(prev!=$1 && prev){
if((maxtill-currentFill)<max[$1]){count++}
else if(maxtill==max[$1]) {count++}
}
else if(prev==$1 && c[$1]==maxtill && count1<count2){
count++
}
else if(c[$1]==maxtill && prev==$1){
if(max[first[count1+1]]>(maxtill-c[$1])){ count++ }
}
prev=$1
outputFile="outfile"count
print > (outputFile)
currentFill=currentFill==maxtill?1:++currentFill
}
' Input_file Input_file
Testing of above solution with OP's sample Input_file:
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
It will create 3 output files named outputfile1, outputfile2 and outputfile3 as follows.
cat outfile1
a 1
a 2
a 1
b 1
cat outfile2
c 1
c 1
cat outfile3
d 2
d 1
d 2
d 1
2nd time testing(with my custom samples): With my own sample Input_file, lets say following is Input_file.
cat Input_file
a 1
a 2
a 1
b 1
c 1
c 1
d 2
d 1
d 2
d 1
d 4
d 5
When I run above solution then 2 outputfiles will be created with name outputfile1 and outputfile2 as follows.
cat outputfile1
a 1
a 2
a 1
b 1
c 1
c 1
cat outfile2
d 2
d 1
d 2
d 1
d 4
d 5
This might work for you (GNU sed, bash and csplit):
f(){
local g=$1
shift
while (( $#>1))
do
(($#==2)) && echo $2 && break
(($2-$1==$g)) && echo $2 && shift && continue
(($3-$1==$g)) && echo $3 && shift 2 && continue
(($2-$1<$g)) && (($3-$1>$g)) && echo $2 && shift && continue
set -- $1 ${#:3}
done
}
csplit file $(f 4 $(sed -nE '1=;N;/^(\S+\s).*\n\1/!=;D' file))
This will split file into separate files named xxnn where nn is 00,01,02,...
The sed command produces a list of line numbers that splits the file on change of key.
The function f then rewrites these numbers grouping them in to lengths of 4 or less.
~

Retrive entire column to a new file if it matches from list of another file

I have a huge file and I need to retrieve specific columns from File1 which is ~ 200000 rows and ~ 1000 Columns if it matches with the list of file2. (Prefer Bash over R )
for example my dummy data files are as follows,
file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
and file2
sample
s4
s3
s7
s8
My desired output is
gene s3 s4
a 1 2
b 2 3
c 1 1
d 2 2
likewise, i have 3 different file2 and i have to pick different samples from the same file1 into a new file.
I would be very greatful if you guys can provide me with your valuable suggestions
P.S: I am a Biologist, i have very little coding experience
Regards
Ateeq
$ cat file1
gene s1 s2 s3 s4 s5
a 1 2 1 2 1
b 2 3 2 3 3
c 1 1 1 1 1
d 1 1 2 2 2
$ cat file2
gene
s4
s3
s8
s7
$ cat a
awk '
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
' $*
$ ./a file2 file1 | column -t
gene s4 s3 s8 s7
a 2 1 N/A N/A
b 3 2 N/A N/A
c 1 1 N/A N/A
d 2 2 N/A N/A
The above should get you on your way. It's an extremely optimistic program and no negative testing was performed.
Awk is a tool that applies a set of commands to every line of every file that matches an expression. In general, the awk script has the form:
<pattern> <command>
There are three such pairs above. Each needs a little explanation:
NR == FNR {
columns[ NR ] = $0
printf "%s\t", $0
next
}
NR == FNR is a awk'ism. NR is the record number and FNR is the record number in the current file. NR is always increasing but FNR resets to 1 when awk parses the next file. NR==FNR is an idiom that is only true when parsing the first file.
I've designed the awk program to read the columns file first (you are calling this file2). File2 has a list of columns to output. As you can see, we are storing each line in the first file (file2) into an array called columns. We are also printing the columns out as we read them. In order to avoid newlines after each column name (since we want all the column headers to be on the same line), we use printf which doesn't output a newline (as opposed to print which does).
The 'next' at the end of the stanza tells awk to read the next line in the file without processing any of the other stanzas. After all, we just want to read the first file.
In summary, the first stanza remembers the column names (and order) and prints them out on a single line (without a newline).
The second "stanza":
FNR == 1 {
print ""
split( $0, headers )
for (x = 1 ; x <= length(headers) ; x++ )
{
aheaders[ headers[x]] = x
}
next
}
FNR==1 will match on the first line of any file. Due to the next in the previous stanza, we'll only hit this stanza when we are on the first line of the second file (file1). The first print "" statement adds the newline that was missing from the first stanza. Now the line with the column headers is complete.
The split command takes the first parameter, $0, the current line and splits it according to whitespace. We know the current line is the first line and has the column headers in it. The split command writes to an array named in the second parameter , headers. Now headers[1] = "gene" and headers[2] = "s4" , headers[3] = "s3", etc.
We're going to need to map the column names to the column numbers. The next bit of code takes each header value and creates an aheaders entry. aheders is an associative array that maps column header names to the column number.
aheaders["gene"] = 1
aheaders["s1"] = 2
aheaders["s2"] = 3
aheaders["s3"] = 4
aheaders["s4"] = 5
aheaders["s5"] = 6
When we're done making the aheaders array, the next command tells awk to skip to the next line of the input. From this point on, only the third stanza is going to have a true condition.
{
for ( x = 1 ; x <= length( columns ) ; x++ )
{
if (length( aheaders[ columns[x] ] ) == 0 )
printf "N/A\t"
else
printf "%s\t" , $aheaders[ columns[x] ]
}
print ""
}
The third stanza has no explicit . Awk will process this as always true. So this last is executed for every line of the second file.
At this point, we want to print the columns that are specified in columns array. We walk through each element of the array in order. The first time through the loop, columns[1] = "gene_symbol". This gives us:
printf "%s\t" , $aheaders[ "gene" ]
And since aheaders["gene"] = 1 this gives us:
printf "%s\t" , $1
And awk understands $1 to be the first field (or column) in the input line. Thus the first column is passed to printf which outputs the value with a tab (\t) appended.
The loop then executes another time with x=2 and columns[2]="s4". This results in the following print executing:
printf "%s\t" , $5
This prints the fifth column followed by a tab. The next iteration:
columns[3] = "s3"
aheaders["s3"] = 4
Which results in:
printf "%s\t" , $4
That is, the fourth field is output.
The next iteration we hit a failure situation:
columns[4] = "s8"
aheaders["s8"] = ''
In this case, the length( aheaders[ columns[x] ] ) == 0 is true so we just print out a placeholder - something to tell the operator their input may be invalid:
printf "N/A\t"
The same is output when we process the last columns[x] value "s7".
Now, since there are no more entries in columns, the loop exists and we hit the final print:
print ""
The empty string is provided to print because print by itself defaults to print $0 - the entire line.
At this point, awk reads the next line out of file1 hits the third block again (only). Thus awk continues until the second file is completely read.

reset row number count in awk

I have a file like this
file.txt
0 1 a
1 1 b
2 1 d
3 1 d
4 2 g
5 2 a
6 3 b
7 3 d
8 4 d
9 5 g
10 5 g
.
.
.
I want reset row number count to 0 in first column $1 whenever value of field in second column $2 changes, using awk or bash script.
result
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
.
.
.
As long as you don't mind a bit of excess memory usage, and the second column is sorted, I think this is the most fun:
awk '{$1=a[$2]+++0;print}' input.txt
This awk one-liner seems to work for me:
[ghoti#pc ~]$ awk 'prev!=$2{first=0;prev=$2} {$1=first;first++} 1' input.txt
0 1 a
1 1 b
2 1 d
3 1 d
0 2 g
1 2 a
0 3 b
1 3 d
0 4 d
0 5 g
1 5 g
Let's break apart the script and see what it does.
prev!=$2 {first=0;prev=$2} -- This is what resets your counter. Since the initial state of prev is empty, we reset on the first line of input, which is fine.
{$1=first;first++} -- For every line, set the first field, then increment variable we're using to set the first field.
1 -- this is awk short-hand for "print the line". It's really a condition that always evaluates to "true", and when a condition/statement pair is missing a statement, the statement defaults to "print".
Pretty basic, really.
The one catch of course is that when you change the value of any field in awk, it rewrites the line using whatever field separators are set, which by default is just a space. If you want to adjust this, you can set your OFS variable:
[ghoti#pc ~]$ awk -vOFS=" " 'p!=$2{f=0;p=$2}{$1=f;f++}1' input.txt | head -2
0 1 a
1 1 b
Salt to taste.
A pure bash solution :
file="/PATH/TO/YOUR/OWN/INPUT/FILE"
count=0
old_trigger=0
while read a b c; do
if ((b == old_trigger)); then
echo "$((count++)) $b $c"
else
count=0
echo "$((count++)) $b $c"
old_trigger=$b
fi
done < "$file"
This solution (IMHO) have the advantage of using a readable algorithm. I like what's other guys gives as answers, but that's not that comprehensive for beginners.
NOTE:
((...)) is an arithmetic command, which returns an exit status of 0 if the expression is nonzero, or 1 if the expression is zero. Also used as a synonym for let, if side effects (assignments) are needed. See http://mywiki.wooledge.org/ArithmeticExpression
Perl solution:
perl -naE '
$dec = $F[0] if defined $old and $F[1] != $old;
$F[0] -= $dec;
$old = $F[1];
say join "\t", #F[0,1,2];'
$dec is subtracted from the first column each time. When the second column changes (its previous value is stored in $old), $dec increases to set the first column to zero again. The defined condition is needed for the first line to work.

different lines in two files when ignoring last column - in bash

I have two files, smaller and bigger and bigger contains all lines of smaller. Those lines are almost same, just last column differs.
file_smaller
A NM 0
B GT 4
file_bigger
A NM 5 <-same as in file_smaller according to my rules
C TY 2
D OP 6
B GT 3 <-same as in file_smaller according to my rules
I would like to write lines, where the two files differ, that means:
wished_output
C TY 2
D OP 6
Could you please help me to do so? Thanks a lot.
you can do the following:
cat file_bigger file_smaller |sed 's=\(.*\).$=\1='|sort| uniq -u > temp_pat
grep -f temp_pat file_bigger ; rm temp_pat
which will (in the same order)
merge the files
remove the last column
sort the result
print only unique lines in temp_pat
find the original lines in file_bigger
all in all, the expected result.
awk 'FILENAME==file_bigger {arr[$1 $2]=$0}
FILENAME==file_smaller { tmp=$1 $2; if( tmp in arr) {next} else {print $0}}
' file_bigger file_smaller
See if that meets you needs
grep -vf <(cut -d " " -f 1-2 file_smaller| sed 's/^/^/') file_bigger
The process substitution results in this:
^A NM
^B GT
Then, grep -v removes those patterns from "file_bigger"
Bash 4 using associative arrays:
#!/usr/bin/env bash
f() {
if (( $# != 2 )); then
echo "usage: ${FUNCNAME} <smaller> <bigger>" >&2
return 1
fi
local -A smaller
local -a x
while read -ra x; do
smaller["${x[#]::2}"]=0
done <"$1"
while read -ra x; do
((${smaller["${x[#]::2}"]:-1})) && echo "${x[*]}"
done <"$2"
}
f /dev/fd/3 /dev/fd/0 <<"SMALLER" 3<&0 <<"BIGGER"
A NM 0
B GT 4
SMALLER
A NM 5
C TY 2
D OP 6
B GT 3
BIGGER

Resources