Awk code explanation: changing order of fields

I have a .txt file that has 14 columns. The head of it looks like this:
name A1 A2 Freq MAF Quality Rsq n Mean Beta sBeta CHi rsid
SNP1 A T 0.05 1 5 56 7 8 9 11 12 rs1
SNP2 T A 0.05 1 6 55 7 8 9 11 12 rs2
I want to put the last column in the first position. I wasn't sure what the most efficient way of doing this was, but I came up with this, drawing on other posts:
awk '{$0=$NF FS$0; $14=""}1' file.txt | head
I obtained this, which I think works:
rsid name A1 A2 Freq MAF Quality Rsq n Mean Beta sBeta CHi
rs1 SNP1 A T 0.05 1 5 56 7 8 9 11 12
rs2 SNP2 T A 0.05 1 6 55 7 8 9 11 12
I am struggling though to understand what exactly the code does.
I know that NF is the field count of the line being processed.
I know that FS is the field separator.
So how does my code work exactly? I don't really understand how setting $0 (the whole line) to $NF followed by FS$0 (not sure what this means) ends up with the last field now being first. I do realise that if $14="" is not written, you end up with 2 rsid columns, one at the start and one at the end.
I'm quite new to using awk so if there is an easier way to achieve this, I would happily go for it.
Thanks

It might be easier with sed:
sed -E 's/(.*)\s(\S+)$/\2 \1/' file
This matches the last field and the rest of the line, then prints them in reverse order.
\s is shorthand for a whitespace character, equivalent to [ \t\r\n\f].
\S is the negation of \s, matching a non-whitespace character. The POSIX equivalent of \s is [[:space:]]. If your sed doesn't support the shorthand notation, or you want full portability, you may need to use one of the equivalent forms.
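For the portable form mentioned above, a minimal sketch using only POSIX character classes could look like this. The function name is made up for illustration, and it assumes a sed that accepts -E (GNU and BSD sed both do) and lines with no trailing whitespace.

```shell
# Hypothetical helper: reads stdin, or the files given as arguments.
last_field_first_sed() {
    # (.*)             everything before the last run of whitespace
    # [[:space:]]+     the separator in front of the last field
    # ([^[:space:]]+)  the last field itself
    sed -E 's/^(.*)[[:space:]]+([^[:space:]]+)$/\2 \1/' "$@"
}

printf '%s\n' 'name A1 A2 rsid' 'SNP1 A T rs1' | last_field_first_sed
# rsid name A1 A2
# rs1 SNP1 A T
```

Lines with a single field do not match the pattern and pass through unchanged.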

Please go through the following and let me know if it helps.
awk '{
$0=$NF FS $0; ## Rebuild the current line as $NF (the last field), FS (the field separator, a space by default), then the original line. Assigning to $0 makes awk re-split the line into fields.
$14="" ## The last field now appears at both ends of the line, so blank the trailing copy ($14 here). $NF="" would also work and avoids hardcoding the field number.
}
1 ## 1 is an always-true condition with no action, so awk performs the default action and prints the line.
' file.txt ## The input file name.
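To watch the two steps happen on a single line, and as one possible alternative that neither depends on the column count nor leaves a trailing separator, here is a small sketch. The name last_field_first is hypothetical and the sample line is shortened to four fields.

```shell
# Step-by-step view of what the one-liner does to one line.
printf '%s\n' 'SNP1 A T rs1' | awk '{
    $0 = $NF FS $0          # line becomes "rs1 SNP1 A T rs1" and is re-split into 5 fields
    print NF ": " $0
    $NF = ""                # blank the last field; awk rebuilds $0 with OFS, so a trailing space remains
    print NF ": [" $0 "]"
}'
# 5: rs1 SNP1 A T rs1
# 5: [rs1 SNP1 A T ]

# Alternative: build the output explicitly, last field first.
last_field_first() {
    awk '{
        out = $NF
        for (i = 1; i < NF; i++)
            out = out OFS $i
        print out
    }' "$@"
}

printf '%s\n' 'SNP1 A T rs1' | last_field_first
# rs1 SNP1 A T
```

The loop is a little longer than the original, but it works for any number of columns.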

Related

SED to spit out nth and (n+1)th lines

EDITS: For reference, "stuff" is a general variable, as is "KEEP".
KEEP could be "Hi, my name is Dave" on line 2 and "I love pie" on line 7. The numbers I've put here are for illustration only and DO NOT show up in the data.
I had a file that needed to be parsed, keeping every 4th line, starting at the 3rd line. In other words, it looked like this:
1 stuff
2 stuff
3 KEEP
4
5 stuff
6 stuff
7 KEEP
8 stuff etc...
Great, sed solved that easily with:
sed -n -e 3~4p myfile
giving me
3 KEEP
7 KEEP
11 KEEP
Now I have a different file format and a different take on the pattern:
1 stuff
2 KEEP
3 KEEP
4
5 stuff
6 KEEP
7 KEEP etc...
and I still want the output of
2 KEEP
3 KEEP
6 KEEP
7 KEEP
10 KEEP
11 KEEP
Here's the problem - this is a multi-pattern "pattern" for sed. It's "every 4th line, spit out 2 lines, but start at line 2".
Do I need to have some sort of DO/FOR loop in my sed, or do I need a different command like awk or grep? Thus far, I have tried formats like:
sed -n -e '3~4p;4~4p' myfile
and
awk 'NR % 3 == 0 || NR % 4 ==0' myfile
and
sed -n -e '3~1p;4~4p' myfile
and
awk 'NR % 1 == 0 || NR % 4 ==0' myfile
source: https://superuser.com/questions/396536/how-to-keep-only-every-nth-line-of-a-file

If your intent is to print lines 2,3 then every fourth line after those two, you can do:
$ seq 20 | awk 'BEGIN{e[2];e[3]} (NR%4) in e'
2
3
6
7
10
11
14
15
18
19

You were pretty close with your sed:
$ printf '%s\n' {1..12} | sed -n '2~4p;3~4p'
2
3
6
7
10
11

This is the idiomatic way to write it in awk:
$ awk 'NR%4==2 || NR%4==3' file
However, this special case can be shortened to:
$ awk 'NR%4>1' file

This might work for you (GNU sed):
sed '2~4,+1p;d' file
Use a range: the first parameter is the starting line and modulus (in this case from line 2, modulus 4). The second parameter is how many lines follow the start of the range (in this case plus one). Print these lines and delete all others.

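In the generic case, you want to keep lines p to p+q, p+n to p+q+n, p+2n to p+q+2n, and so on. So you can write (the NR >= p guard is needed because the remainder is negative for lines before p, which would otherwise pass the test):
awk -v p=2 -v n=4 -v q=1 'NR >= p && (NR - p) % n <= q' file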
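As a runnable sketch of the generic form, the wrapper below passes p, n and q in as awk variables. The function name keep_runs is hypothetical. It also adds an NR >= p test, on the assumption that lines before p should be dropped: in awk the remainder of a negative number is negative, so without the guard those lines would satisfy `<= q`.

```shell
# keep_runs P N Q [file...]
# Keep lines P..P+Q, then P+N..P+N+Q, then P+2N..P+2N+Q, and so on.
keep_runs() {
    p=$1 n=$2 q=$3
    shift 3
    awk -v p="$p" -v n="$n" -v q="$q" 'NR >= p && (NR - p) % n <= q' "$@"
}

# Lines 2-3, 6-7, ... of an 8-line input.
printf '%s\n' 1 2 3 4 5 6 7 8 | keep_runs 2 4 1
# 2
# 3
# 6
# 7
```

With p=3, n=4, q=0 the same helper reproduces the earlier "every 4th line, starting at the 3rd" case.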

Awk: how to compare two strings in one line

I have a dataset with 20 000 probes. They are in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried the awk substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor last character in columns 2 and 4 (awk '$2~/[A-Z]$/), but I can't find a way to match the probes in two columns using regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
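Here is the filter run against a few lines of the sample data, as a quick check. One thing this sketch adds on its own initiative is an NF >= 4 guard: on a blank or short line both substr calls return an empty string, the comparison succeeds, and the line would be printed. The function name is hypothetical.

```shell
same_last_char() {
    # Compare the last character of column 2 with the last character of column 4.
    # NF >= 4 skips blank or short lines, where both sides would be empty and compare equal.
    awk 'NF >= 4 && substr($2, length($2)) == substr($4, length($4))' "$@"
}

printf '%s\n' \
    '4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA' \
    '4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT' \
    '' \
    '4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC' | same_last_char
# 4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
# 4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
```

Omitting the third argument of substr returns everything from that position to the end of the string, which here is the last character.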

How to sequence lines in files if some lines are strings

I ran into a problem with bash, which I started using recently.
I realize that a lot of magic can be done with just one line, as my previous question was solved that way.
This time the question is simple:
I have a file which has this format:
2 2 10
custom
8 10
3 5 18
custom
1 5
Some of the lines are equal to the string custom (it can be any line!) and the other lines have 2 or 3 numbers in them.
I want a file which expands the lines with numbers into sequences but keeps the lines with custom (the order must also stay the same), so the desired output is:
2 4 6 8 10
custom
8 9 10
3 8 13 18
custom
1 2 3 4 5
I would also like to overwrite the input file with this one.
I know that I can do the sequencing with seq, but I would like an elegant way to do it on a file.

You can use awk like this:
awk '/^([[:blank:]]*[[:digit:]]+){2,3}[[:blank:]]*$/ {
    j = (NF==3) ? $2 : 1
    s = ""
    for (i=$1; i<=$NF; i+=j)
        s = sprintf("%s%s%s", s, (i==$1)?"":OFS, i)
    $0 = s
} 1' file
2 4 6 8 10
custom
8 9 10
3 8 13 18
custom
1 2 3 4 5
Explanation:
/^([[:blank:]]*[[:digit:]]+){2,3}[[:blank:]]*$/ - match only lines with 2 or 3 numbers.
j = (NF==3) ? $2 : 1 - set variable j to $2 if there are 3 columns otherwise set j to 1
for(i=$1; i<=$NF; i+=j) run a loop from 1st col to last col, increment by j
sprintf is used for formatting the generated sequence
1 is an always-true pattern with no action, so awk applies its default action and prints each line (rewritten or not)
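The question also asks for the input file to be overwritten, which the command above does not do by itself. One way to sketch that is to write to a temporary file and copy it back. Everything here is an assumption for illustration: the name expand_in_place, the simplified line test (two or three fields, digits and blanks only, instead of the interval regex), and the guard against a zero step.

```shell
# expand_in_place FILE: rewrite numeric lines as sequences, in place.
expand_in_place() {
    file=$1
    tmp=$(mktemp) || return 1
    awk '
    # Only touch lines made of two or three non-negative integers.
    (NF == 2 || NF == 3) && $0 !~ /[^0-9 \t]/ {
        step = (NF == 3) ? $2 : 1
        if (step > 0) {             # a zero step would loop forever, so leave such lines alone
            out = ""
            for (i = $1; i <= $NF; i += step)
                out = out (out == "" ? "" : OFS) i
            $0 = out
        }
    }
    1' "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Usage (file name is a placeholder):
#   expand_in_place file
```

Copying the temporary file back with cat, rather than renaming it, keeps the original file's permissions. GNU awk users could reach for `-i inplace` instead, but that option is gawk-specific.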

This might work for you (GNU sed, seq and paste):
sed '/^[0-9]/s/.*/seq & | paste -sd\\ /e' file
If a line begins with a digit, use the line's values as parameters for the seq command, which is then piped to the paste command. The RHS of the substitute command is evaluated using the e flag (GNU sed specific).

How to replace all matches with an incrementing number in BASH?

I have a text file like this:
AAAAAA this is some content.
This is AAAAAA some more content AAAAAA. AAAAAA
This is yet AAAAAA some more [AAAAAA] content.
I need to replace all occurrence of AAAAAA with an incremented number, e.g., the output would look like this:
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
How can I replace all of the matches with an incrementing number?

Here is one way of doing it:
$ awk '{for(x=1;x<=NF;x++)if($x~/AAAAAA/){sub(/AAAAAA/,++i)}}1' file
1 this is some content.
This is 2 some more content 3. 4
This is yet 5 some more [6] content.
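The loop above tests each field once, so it assumes at most one match per whitespace-separated field. A variant that keeps substituting until no match is left could be sketched as follows. The function name is hypothetical, and it assumes the replacement text (a number) can never itself contain the pattern, otherwise the loop would not terminate.

```shell
number_matches() {
    awk '{
        # Replace one occurrence at a time so each match gets its own number.
        while ($0 ~ /AAAAAA/)
            sub(/AAAAAA/, ++n)
        print
    }' "$@"
}

printf '%s\n' 'AAAAAA this is some content.' 'x AAAAAA,AAAAAA y' | number_matches
# 1 this is some content.
# x 2,3 y
```

The counter n is never reset, so numbering continues across lines.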

A perl solution:
perl -pe 'BEGIN{$A=1;} s/AAAAAA/$A++/ge' test.dat

This might work for you (GNU sed):
sed -r ':a;/AAAAAA/{x;:b;s/9(_*)$/_\1/;tb;s/^(_*)$/0\1/;s/$/:0123456789/;s/([^_])(_*):.*\1(.).*/\3\2/;s/_/0/g;x;G;s/AAAAAA(.*)\n(.*)/\2\1/;ta}' file
This is a toy example, perl or awk would be a better fit for a solution.
The solution only acts on lines which contain the required string (AAAAAA).
The hold buffer is used as a place to keep the incremented integer.
In overview: when a required string is encountered, the integer in the hold space is incremented, appended to the current line, swapped for the required string and the process is then repeated until all occurrences of the string are accounted for.
Incrementing an integer simply swaps the last digit (other than trailing 9's) for the next integer in sequence i.e. 0 to 1, 1 to 2 ... 8 to 9. Where trailing 9's occur, each trailing 9 is replaced by a non-integer character e.g '_'. If the number being incremented consists entirely of trailing 9's a 0 is added to the front of the number so that it can be incremented to 1. Following the increment operation, the trailing 9's (now _'s) are replaced by '0's.
As an example say the integer 9 is to be incremented:
9 is replaced by _, a 0 is prepended (0_), the 0 is swapped for 1 (1_), the _ is replaced by 0. resulting in the number 10.
See comments directed at @jaypal for further notes.

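Maybe something like this
#!/bin/bash
n=1
while IFS= read -r line
do
    # Replace one match at a time so every occurrence gets its own number.
    while [[ $line == *AAAAAA* ]]
    do
        line=${line/AAAAAA/$n}
        n=$((n + 1))
    done
    printf '%s\n' "$line"
done < filename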

Perl did the job for me
perl -pi -e 's/\b'DROP'\b/$&.'_'.++$A /ge' /folder/subfolder/subsubfolder/*
Input:
DROP
drop
$drop
$DROP
$DROP="DROP"
$DROP='DROP'
$DROP=$DROP
$DROP="DROP";
$DROP='DROP';
$DROP=$DROP;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP'
"DROP"
/DROP/
Output:
DROP_1
drop
$drop
$DROP_2
$DROP_3="DROP_4"
$DROP_5='DROP_6'
$DROP_7=$DROP_8
$DROP_9="DROP_10";
$DROP_11='DROP_12';
$DROP_13=$DROP_14;
$var="DROP_ACTION"
drops
DROPS
CODROP
'DROP_15'
"DROP_16"
/DROP_17/

How to match numbers on one list (e.g. 2 and 3) with the approximate sum on another list (e.g. 5)?

I am trying to match some audio files to some written passages of text.
I started with a single audio file of someone reading the typed passage. Then I split the audio file at every period of silence, with sox, and similarly split the typed text such that each unique sentence is on a unique line.
The splits did not occur perfectly at every period however, but whenever the speaker paused. I need to create a list of which audio files correspond to which typed sentences, e.g.:
0001.wav This is a sentence.
0002.wav This is another sentence.
Note that sometimes 2 or more audio files correspond to a single sentence, e.g.:
0001.wav ("this is a") + 0002.wav ("sentence") = "This is a sentence."
To help with matching the texts, I've used software to count the syllables in the audio and count the syllables in the typed text.
I have two files with this data. The first, "sentences.txt", is a list of all of the sentences from the text, presented one per line, with their syllable count, e.g.:
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.
I can remove the sentence data with awk -F" " '{ print $1 }' sentences.txt to have this syllables_in_text.txt:
5
7
8
9
The second file, syllables_in_audio.txt, has a list of audio files, in the same order, with approximate syllable counts. These are sometimes a little lower than the actual number in the text, because the syllable-counting software is not perfect:
0001.wav 3
0002.wav 2
0003.wav 4
0004.wav 5
0005.wav 7
0006.wav 3
0007.wav 2
0008.wav 3
How can I print a list ("output.txt") of audio files such that the audio file names appear on the same line as the corresponding sentences in "sentences.txt", e.g.:
0001.wav 0002.wav
0003.wav 0004.wav
0005.wav
0006.wav 0007.wav 0008.wav
Below is a table showing how the two files line up when placed side by side. Files "0001.wav" and "0002.wav" are both needed to make the sentence "This is a sentence." These file names are listed on line 1 of "output.txt", while the corresponding sentence is on line 1 of "sentences.txt":
Contents of "output.txt":   | Contents of "sentences.txt":
0001.wav 0002.wav           | 5 This is a sentence.
0003.wav 0004.wav           | 7 This is another sentence.
0005.wav                    | 8 This is yet another sentence.
0006.wav 0007.wav 0008.wav  | 9 This is still yet another sentence.

You can create an awk script as follows. Pseudocode:
BEGIN {
init counter=1
read your first file (syllables_in_text.txt) with getline till the end (while...)
store its value in firstfile[counter]
counter++
# when you had finished reading your first file
init another_counter=1
read your second file (syllables_in_audio.txt) with getline till the end (while...)
if $2 (second col from your file) <= firstfile[another_counter]
store $1 like o[another_counter]=" " $1
else
another_counter++
store $1 like o[another_counter]=" " $1
finally loop over the o array after sorting it
print its contents after removing the leading space
}
But there are other solutions as well...
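One possible way to turn that outline into something runnable is sketched below. It is not the pseudocode verbatim: instead of comparing each clip's count on its own, it accumulates clips for a sentence for as long as adding the next clip brings the running total closer to the sentence's syllable count. That greedy "closest sum" rule is an assumption, as are the function name match_audio and the choice to read sentences.txt directly (its first column is the count).

```shell
# match_audio SENTENCES AUDIO
# SENTENCES: "<syllables> <sentence text>" per line
# AUDIO:     "<file name> <approximate syllables>" per line
match_audio() {
    awk '
    function abs(x) { return x < 0 ? -x : x }
    NR == FNR { target[++ns] = $1; next }      # first file: syllables per sentence
    { name[++na] = $1; syl[na] = $2 }          # second file: clip names and counts
    END {
        k = 1
        for (s = 1; s <= ns; s++) {
            if (k > na) { print ""; continue } # ran out of clips
            line = name[k]; sum = syl[k]; k++  # every sentence gets at least one clip
            # Take the next clip only while it moves the total closer to the target.
            while (k <= na && abs(sum + syl[k] - target[s]) < abs(sum - target[s])) {
                line = line " " name[k]
                sum += syl[k]
                k++
            }
            print line
        }
    }' "$1" "$2"
}

# Usage (file names from the question):
#   match_audio sentences.txt syllables_in_audio.txt > output.txt
```

On the sample data from the question this yields the four lines shown there, ending with 0006.wav 0007.wav 0008.wav. Being greedy, a single badly miscounted clip can shift every later line, so the result is worth checking by hand.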

Can you explain the rule for matching numbers on one list (2 and 3) against a number on the other list (5)?
I made up a sample to get started; please correct me if it is wrong.
$ cat sentences.txt
5 This is a sentence.
7 This is another sentence.
8 This is yet another sentence.
9 This is still yet another sentence.
$ cat syllables_in_audio.txt
0001.wav 5
0002.wav 5
0003.wav 7
0004.wav 7
0005.wav 8
0006.wav 9
0007.wav 9
0008.wav 9
With that sample, you can run this awk command to get the output:
awk 'NR==FNR{a[$1]=$0;next}{b[$2]=b[$2]==""?$1:b[$2] FS $1}END{for (i in a) printf "%-40s|%s\n", b[i], a[i]}' sentences.txt syllables_in_audio.txt
Result (awk does not guarantee the order of for (i in a), so the lines may come out in a different order):
0001.wav 0002.wav |5 This is a sentence.
0003.wav 0004.wav |7 This is another sentence.
0005.wav |8 This is yet another sentence.
0006.wav 0007.wav 0008.wav |9 This is still yet another sentence.
