Merging text files in csh while preserving place rows - shell

I have about 100 text files with two columns that I would like to merge into a single file in a C shell script, using column "A" as the key.
For example, I have file A that looks like this
A B1
1 100
2 200
3 300
4 400
and File B looks like this
A B2
1 100
2 200
3 300
4 400
5 300
6 400
I want the final file C to look like this:
A B1 B2
1 100 100
2 200 200
3 300 300
4 400 400
5 300
6 400
The cat command only stacks the files on top of one another into file C. I would like to put the data next to each other instead. Is this possible?

To meet your exact spec, this will work. If the spec changes, you'll need to play with it some:
paste -d' ' factorA factorB \
| awk 'NF==4{print $1, $2, $4} NF==2{print $1, $2}' \
> factorC
# note, no spaces or tabs after each of the continuation chars `\` at end of lines!
Output:
$ cat factorC
A B1 B2
1 100 100
2 200 200
3 300 300
4 400 400
5 300
6 400
Not sure how you get bold headers to "transmit" through unix pipes. ;->
Recall that awk programs all have a basic underlying structure, i.e.
awk 'pattern{action}' file
So pattern can be a range of lines, a regexp, an expression (like NF==4), missing, or a few other things.
The action is what happens when the pattern is matched. This is more traditional looking code.
If no pattern is specified, the action applies to all lines read. If no action is specified but the pattern matches, then the line is printed (without further ado).
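As a quick illustration of those two defaults (the file name and data here are made up):

```shell
# A tiny sample file to demonstrate awk's default pattern/action behavior
printf '1 100\n2 200\n3 300\n' > demo.txt

# No pattern: the action runs on every line
awk '{print $2}' demo.txt

# No action: lines matching the pattern are printed as-is
awk '$2 > 150' demo.txt
```

The first command prints the second field of every line; the second prints only the lines whose second field exceeds 150, untouched.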
NF means NumberOfFields in the current line, so NF==2 will only process lines with 2 fields (the trailing records in factorB).
The || is awk's logical OR operator; a pattern like NF==3||NF==4 would process only records with 3 OR 4 fields. Hopefully, the print statements are self-explanatory.
The , separating $1, $2, $3 (for example) is the syntax that inserts awk's internal variable OFS, the OutputFieldSeparator. It can be assigned, e.g. OFS="\t" to get tab-separated output; here we don't assign it, so we get the default value, a single space char (" ") (no quotes!).
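A quick sketch of the OFS behavior just described (sample line made up):

```shell
# Same print statement, two different OFS values
printf '1 100 100\n' | awk '{print $1, $2, $3}'                  # default OFS: space
printf '1 100 100\n' | awk 'BEGIN{OFS="\t"} {print $1, $2, $3}'  # tab-separated output
```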
IHTH

find and replace substrings in a file which match strings in another file

I have two txt files: File1 is a tsv with 9 columns. Following is its first row (SRR6691737.359236/0_14228//11999_12313 is the first column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
File2 is a tsv with 9 columns. Following is its first row (after Read is the 9th column):
CM011822.1 reefer discordance 63738705 63738727 . + . Read SRR6691737.359236 11999 12313; Dup 277
File1 contains information of read name (SRR6691737.359236), read length (0_14228) and coordinates (11999_12313) while file two contains only read name and coordinate. All read names and coordinates in file1 are present in file2, but file2 may also contain the same read names with different coordinates. Also file2 contains read names which are not present in file1.
I want to write a script which finds read names and coordinates in file2 that match those in file1 and adds the read length from file1 to file2. i.e. changes the last column of file2:
Read SRR6691737.359236 11999 12313; Dup 277
to:
Read SRR6691737.359236/0_14228//11999_12313; Dup 277
any help?
It's unclear what your input files look like.
You write:
I have two txt files: File1 is a tsv with 9 columns. Following is
its first row (SRR6691737.359236/0_14228//11999_12313 is the first
column and after Repeat is the 9th column):
SRR6691737.359236/0_14228//11999_12313 Censor repeat 5 264 1169 + . Repeat BOVA2 SINE 1 260 9
If I try to check the columns (and put them in a 'Column,Value' pair):
Column,Value
1,SRR6691737.359236/0_14228//11999_12313
2,Censor
3,repeat
4,5
5,264
6,1169
7,+
8,.
9,Repeat
10,BOVA2
11,SINE
12,1
13,260
14,9
That seems to have 14 columns, but you specify 9 columns...
Can you edit your question, and be clear about this?
i.e. specify as csv
SRR6691737.359236/0_14228//11999_12313,Censor,repeat,5,.....
Added info, after feedback:
file1 contains the following fields (tab-separated):
SRR6691737.359236/0_14228//11999_12313
Censor
repeat
5
264
1169
+
.
Repeat BOVA2 SINE 1 260 9
You want to convert this (using a script) to a tab-separated file:
CM011822.1
reefer
discordance
63738705
63738727
+
.
Read SRR6691737.359236 11999 12313
Dup 277
More info is needed to solve this!
field 1: How/Where is the info for 'CM011822.1' coming from?
field 2 and 3: 'reefer'/'discordance'. Is this fixed text? Should these fields always contain these values, or are there exceptions?
field 4 and 5: Where are these values (63738705 ; 63738727) coming from?
OK, it's clear that there are more questions to be asked than can be answered here …
second change...:
Create a file, name it 'mani.awk':
BEGIN { FS = OFS = "\t" }          # the question says both files are tab-separated
FILENAME=="file1" {
    split($1, a, "/")              # a[1]=read name, a[2]=read length, a[4]=coordinates
    x = a[1] " " a[4]
    y = x; gsub(/_/, " ", y)       # "SRR6691737.359236 11999 12313"
    r[y] = $1                      # remember the complete first field
    c = 1; for (i in r) print c++, i, "....", r[i]    # debug: dump the lookup table
}
FILENAME=="file2" {
    print "<--", $0, "-->"         # debug: record before editing
    for (i in r)
        if ($9 ~ i) {
            print "B:" r[i]
            sub(/Read [^;]*/, "Read " r[i], $9)       # keep the "; Dup N" tail
            print "OK"
        }
    print "<--", $0, "-->"         # debug: record after editing
}
After this, gawk -f mani.awk file1 file2 should produce the correct result.
If not, then I suggest you learn AWK 😉 and change the script as needed.
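If the files really are tab-separated as described, a more compact sketch without the debug output might look like this. The file names, the 9-column field layout, and the "Read NAME C1 C2; Dup N" format of the last column are all assumptions taken from the question:

```shell
# Sample rows quoted from the question, written tab-separated (adjust to your real files)
printf 'SRR6691737.359236/0_14228//11999_12313\tCensor\trepeat\t5\t264\t1169\t+\t.\tRepeat BOVA2 SINE 1 260 9\n' > file1
printf 'CM011822.1\treefer\tdiscordance\t63738705\t63738727\t.\t+\t.\tRead SRR6691737.359236 11999 12313; Dup 277\n' > file2

awk -F'\t' -v OFS='\t' '
NR==FNR {                      # file1: index the full first field by "name coord1 coord2"
    split($1, a, "/")          # a[1]=read name, a[2]=read length, a[4]=coordinates
    key = a[1] " " a[4]
    gsub(/_/, " ", key)        # "SRR6691737.359236 11999 12313"
    len[key] = $1
    next
}
{                              # file2: $9 looks like "Read NAME C1 C2; Dup N"
    split($9, p, "; *")        # p[1]="Read NAME C1 C2", p[2]="Dup N"
    sub(/^Read /, "", p[1])
    if (p[1] in len) $9 = "Read " len[p[1]] "; " p[2]
    print
}' file1 file2
```

Unmatched file2 lines pass through unchanged, which matches the stated requirement that file2 may contain read names absent from file1.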

Awk code explanation: changing order of fields

I have a file a .txt file that has 14 columns. The head of it would look like this:
name A1 A2 Freq MAF Quality Rsq n Mean Beta sBeta CHi rsid
SNP1 A T 0.05 1 5 56 7 8 9 11 12 rs1
SNP2 T A 0.05 1 6 55 7 8 9 11 12 rs2
I want to put the last column in the first position. I wasn't sure what was the most efficient way of doing this, but I came across this, inspiring myself from other posts:
awk '{$0=$NF FS$0; $14=""}1' file.txt | head
I obtained this, which I think works:
rsid name A1 A2 Freq MAF Quality Rsq n Mean Beta sBeta CHi
rs1 SNP1 A T 0.05 1 5 56 7 8 9 11 12
rs2 SNP2 T A 0.05 1 6 55 7 8 9 11 12
I am struggling though to understand what exactly the code does.
I know that NF is the field count of the line being processed
I know that FS is the field separator
So how exactly does my code work? I just don't really understand how setting $0 (the whole line) to $NF followed by FS$0 (not sure what this means) ends up with the last field now being first. I do realise that if $14="" is not written, you end up with 2 rsid columns, one at the start and one at the end.
I'm quite new to using awk so if there is an easier way to achieve this, I would happily go for it.
Thanks
Might be easier with sed:
sed -E 's/(.*)\s(\S+)$/\2 \1/' file
Match the last field and the rest of the line, then print them in reverse order.
\s is shorthand for whitespace character, equivalent to [ \t\r\n\f].
\S is the negation of \s, for non-whitespace. The POSIX equivalent of \s is the bracket expression [[:space:]] (and [^[:space:]] for \S). If your sed doesn't support the shorthand notation, or you want full portability, use the POSIX forms.
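For example, the same rearrangement written with the POSIX classes (the sample line here is made up, abridged from the question's data):

```shell
# Swap the last field to the front using POSIX character classes
printf 'SNP1 A T 0.05 rs1\n' | sed -E 's/(.*)[[:space:]]([^[:space:]]+)$/\2 \1/'
# rs1 SNP1 A T 0.05
```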
Please go through the following and let me know if it helps.
awk '{
$0=$NF FS $0;  ##Re-create the current line as $NF (last field value), then FS (field separator, default space), then the original line.
$14=""         ##In the edited line the old last field is now field 14; nullify it. You could use $NF here too, which also works if the field count is not exactly 14.
}
1              ##1 is a condition that is always TRUE, with no action mentioned, so the default print action happens.
' file.txt     ##Mentioning Input_file name here.
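If an explicit loop reads more naturally to you, an equivalent (if more verbose) rewrite prints $NF first and then fields 1 through NF-1; this is a sketch on an abridged, made-up sample, not the asker's real file:

```shell
# Loop version of the reordering one-liner: last field first, then the rest
printf 'name A1 A2 rsid\nSNP1 A T rs1\n' |
awk '{
    printf "%s", $NF                       # emit the last field first
    for (i = 1; i < NF; i++)
        printf " %s", $i                   # then fields 1..NF-1, space-separated
    print ""                               # terminate the line
}'
# rsid name A1 A2
# rs1 SNP1 A T
```

A side benefit: this version never leaves the trailing separator that nullifying $14 does.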

Awk: how to compare two strings in one line

I have a dataset with 20 000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe1 column matches the last nucleotide in the Probe2 column. So far I tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
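A quick check on two of the sample rows (one matching, one not):

```shell
# Row 4736: both probes end in A (kept); row 4738: C vs T (dropped)
printf '4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA\n4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT\n' |
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
# 4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
```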

shell script to read each line and run the condition

I have a log file which has the below data
120
140
200
110
200
200
120
90
100
I want to read this file and compare each number with 200; if 5 consecutive numbers exceed 200, the script should send an alert, otherwise it should end.
Please help
Thanks,
Are you saying that you want to detect when 5 consecutive rows contain a value greater than 200? If so:
awk '{a = $1 > lim ? a + 1 : 0}
a >= seq {print "alert on line " NR}' lim=200 seq=5 input
It's not clear what you actually want, and perhaps you want to use >= rather than > for the limit comparison as well.
This simply reads through the file named input and checks whether each number is greater than 200 (the value given to lim). If it is, it increments a counter; otherwise it resets the counter to zero. When the counter reaches seq (5 consecutive values), it prints a message.
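A demo run on made-up data where five consecutive values exceed the limit (using >= so the alert fires on the 5th, and -v assignments since the input comes from a pipe here rather than a named file):

```shell
# Values 210..250 are 5 consecutive readings over 200; the alert fires on line 6
printf '120\n210\n220\n230\n240\n250\n90\n' |
awk -v lim=200 -v seq=5 '{a = $1 > lim ? a + 1 : 0} a >= seq {print "alert on line " NR}'
# alert on line 6
```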

Bash - fill empty cell with following value in the column

I have a long tab-delimited file and I am trying to fill a cell with a value that appears later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads like this awk solution, but the problem is that the value I want is not before the empty cell, but after, kind of a .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in second column may not have value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
This allows you to process the file in a single pass as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.
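A demo run of the single-pass version on the sample data from the question:

```shell
# Forward fill without tac: the fill value anticipates the next label by toggling A/B
printf '0\n1\n1.345 B\n2\n2.86 A\n3\n4\n' |
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1'
```

This produces the same seven lines as the tac pipeline: B for the leading rows, A between 1.345 B and 2.86 A, then B again.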
