Join all lines that have the same first column into the same line - sorting

For example, given this file:
1234:abcd
1234:930
1234:999999
194:keee
194:284
194:222222
Result:
1234:abcd:930:999999
194:keee:284:222222
I have racked my brain and can't come up with a way. Sorry to bother you guys!

$ awk -F: '$1==last {printf ":%s",$2; next} NR>1 {print "";} {last=$1; printf "%s",$0;} END{print "";}' file
1234:abcd:930:999999
194:keee:284:222222
How it works
-F:
This tells awk to use a : as the field separator.
$1==last {printf ":%s",$2; next}
If the first field of this line is the same as the first field of the last line, print a colon followed by field 2. Then, skip the rest of the commands and start over with the next line.
NR>1 {print "";}
If we get here, it means this line has a new, not-seen-before value in the first field. If this is not the first line, we finish the previous line by printing a newline character.
{last=$1; printf "%s",$0;}
Update the variable last with the new value of field 1. Then, print this line.
END{print "";}
After we reach the end of the file, print one last newline character.
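For readability, the same logic can be kept in a standalone script file, here hypothetically named join.awk and run as awk -f join.awk file:
BEGIN { FS = ":" }                     # use : as the field separator
$1 == last { printf ":%s", $2; next } # same key as the previous line: append field 2
NR > 1 { print "" }                   # new key: terminate the previous output line
{ last = $1; printf "%s", $0 }        # remember the key and start a new output line
END { print "" }                      # terminate the final output line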
Combining non-consecutive lines
Consider this test file:
$ cat testfile2
3:abcd
4:abcd
10:123
3:999
4:999
10:123
Apply this awk script:
$ awk -F: '{a[$1]=a[$1]":"$2;} END{for (x in a) print x ":" substr(a[x],2);}' testfile2
3:abcd:999
4:abcd:999
10:123:123
In this approach, the lines will not necessarily come out in any particular order. If order is important, you may want to pipe this output to sort.
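For example, since the keys in testfile2 are numeric, a numeric sort on the first field restores a predictable order:
$ awk -F: '{a[$1]=a[$1]":"$2;} END{for (x in a) print x ":" substr(a[x],2);}' testfile2 | sort -t: -k1,1n
3:abcd:999
4:abcd:999
10:123:123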

Related

Transpose rows to column after nth column in bash

I have a file like below format:
$ cat file_in.csv
1308123;28/01/2019;28/01/2019;22/01/2019
1308456;20/11/2018;27/11/2018;09/11/2018;15/11/2018;10/11/2018;02/12/2018
1308789;06/12/2018;04/12/2018
1308012;unknown
How can I transpose it as below, starting from the second column:
1308123;28/01/2019
1308123;28/01/2019
1308123;22/01/2019
1308456;20/11/2018
1308456;27/11/2018
1308456;09/11/2018
1308456;15/11/2018
1308456;10/11/2018
1308456;02/12/2018
1308789;06/12/2018
1308789;04/12/2018
1308012;unknown
I'm testing my script, but I obtain the wrong result:
echo "123;23/05/2018;24/05/2018" | awk -F";" 'NR==3{a=$1";";next}{a=a$1";"}END{print a}'
Thanks in advance
1st Solution: The easiest solution is to loop through all the fields (with the field separator set to ;, of course) and print $1 along with each field on its own line. Note that the loop runs from i=2 up to NF, skipping the first field, since we need to print from the 2nd column onwards.
awk 'BEGIN{FS=OFS=";"} {for(i=2;i<=NF;i++){print $1,$i}}' Input_file
2nd Solution: Using awk's substitution (sub) and global substitution (gsub) functionality, once each. Here I first change the very first occurrence of ; to ### (this assumes your Input_file will NOT contain that sequence; if it does, choose any unique character(s) not present in your Input_file in place of ###), then globally substitute every remaining ; with ORS val ";" (where val is a variable holding the value of $1), which starts a new key-prefixed line at each value. Finally, the ### in the first field is turned back into ;. Why this approach: if we did NOT substitute the very first occurrence of ; first, the gsub would also match it and leave a dangling key on a line of its own, which we DO NOT want. (Also, as per Ed sir's comment, this solution was tested on a single Input_file and may have issues while reading multiple Input_files.)
awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1' Input_file
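As a sanity check, feeding the test record from the question through this solution (output verified only for this single-line input):
$ echo "123;23/05/2018;24/05/2018" | awk 'BEGIN{FS=OFS=";"} {val=$1;sub(";","###");gsub(";",ORS val ";");sub("###",";",$1)} 1'
123;23/05/2018
123;24/05/2018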
Another awk:
awk -F";" '{ OFS="\n" $1 ";"; $1=$1;$1=""; printf("%s",$0) } ' file
This one sets OFS to a newline followed by the key and a ;, forces awk to rebuild the record with $1=$1, and then blanks $1 so each remaining field ends up on its own key-prefixed line. (Note that, as written, it emits a leading newline before the first record and no newline after the last.)

AWK: search substring in first file against second

I have the following files:
data.txt
Estring|0006|this_is_some_random_text|more_text
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here
allids.txt (here the columns are separated by semicolon; the real input is tab-delimited)
Estring|0006;MAR0593
Fstring|0002;MAR0592
Fstring|0028;MAR1195
Please note: in data.txt, the important part is the first two "columns" (name|number).
Now I want to use awk to search the first part (name|number) of data.txt in allids.txt and output the second column (starting with MAR)
so my expected output would be (again tab-delimited):
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
I do not know how to search for that first conserved part within awk; the rest should then be something like:
awk 'BEGIN{FS=OFS="\t"} FNR == NR { a[$1] = $1; next } $1 in a { print a[$0], [$1] }' data.txt allids.txt
I would use a set of field delimiters, like this:
awk -F'[|\t;]' 'NR==FNR{a[$1"|"$2]=$0; next}
$1"|"$2 in a {print a[$1"|"$2]"\t"$NF}' data.txt allids.txt
In your real (tab-delimited) data you can remove the ; from the delimiter set; it is only here to reproduce the example in the question.
Here is another awk that uses a different field separator for each file:
awk -F ';' 'NR==FNR{a[$1]=FS $2; next} {k=$1 FS $2}
k in a{$0=$0 a[k]} 1' allids.txt FS='|' data.txt
Estring|0006|this_is_some_random_text|more_text;MAR0593
Fstring|0010|random_combination_of_characters
Fstring|0028|again_here;MAR1195
This command uses ; as FS for allids.txt and uses | as FS for data.txt.
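The mechanism, in case it is unfamiliar: a var=value argument between file names is evaluated when awk reaches it in the argument list, so it takes effect for the files that follow. A minimal illustration with two throwaway files f1 and f2 (names chosen just for this sketch):
$ printf 'a;b\n' > f1; printf 'c|d\n' > f2
$ awk '{print $2}' FS=';' f1 FS='|' f2
b
d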

awk - how to replace semicolon in string in csv file?

I need to manage smtp logfile handling in my company.
These logfiles need to be imported to MSSQL, so it is my job to provide this data.
I got a strange undelivery message with a ";" in the string, and I need to replace it with a comma.
So what I got:
Sender;Recipient;Operation;Answer;Error;Servername
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions;+try+later;M0641
Note the ";" in the Answer field after "restrictions"; I don't know why the mail server sends semicolons, maybe to annoy me :P
After a lot of research, I tried the following with awk:
awk 'BEGIN{FS=OFS=";"} {for (i=5;i<=NF;i++) gsub (";",",",$i)} 1' myfile.csv
This command actually runs, but it seems to do nothing to my file; the ";" in the error field remains. What am I missing here?
Replacing the fifth and later ; with ,
(Note: your original loop did nothing because FS=";" already consumed every ; during field splitting, so no field can contain one for gsub to find.)
$ awk -F\; '{for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))}' myfile.csv
Sender;Recipient;Operation;Answer;Error,Servername
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
How it works:
-F\;
This sets the field separator for input to ;.
for (i=1;i<=NF;i++) printf "%s%s",$i,(i==NF?ORS:(i<=4?";":","))
This loops over every field and prints the field followed by (a) ORS if we are on the last field, or (b) , if we are on field 5 or later, or (c) ; if we are on one of the first four fields.
Replacing all ; with ,
Try:
$ awk -F\; '{$1=$1} 1' OFS=, myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
How it works:
-F\;
This sets the field separator on input to a semicolon.
$1=$1
This causes awk to consider the line changed, so it rebuilds the output line using the new output field separator.
1
This tells awk to print the line.
OFS=,
This sets the field separator on output to a comma.
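To see why the $1=$1 step matters, here is the same idea with and without it on a one-line sample (a minimal sketch; the input comes from echo instead of a file, which awk handles the same way):
$ echo 'a;b;c' | awk -F\; '1' OFS=,
a;b;c
$ echo 'a;b;c' | awk -F\; '{$1=$1} 1' OFS=,
a,b,c
Without the assignment, $0 is never rebuilt, so the new output separator never shows up.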
Alternative #1
$ awk '{gsub(/;/, ",")} 1' myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
Alternative #2
$ sed 's/;/,/g' myfile.csv
Sender,Recipient,Operation,Answer,Error,Servername
bla#bla.com,rockit#sohard.com,RCPT TO,450,+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later,M0641
I think your problem is replacing the unquoted delimiters in your logical 4th field of a five-field-wide input. Although this script is repetitious, it should be easier to understand:
$ awk '{n=split($0,a,";");
for(i=1; i<4; i++) printf "%s;", a[i];
for(i=4; i<n-1; i++) printf "%s,", a[i];
printf "%s;%s\n", a[n-1], a[n]}' file
A better way to write the same, based on @Ed Morton's comments:
$ awk -F';' '{for(i=1; i<NF-1; i++) printf "%s"(i<4?FS:","), $i;
print $(NF-1) FS $NF}' file
For the input
1;2;3;4a;4b;4c;5
1;2;3;4;5
it generates
1;2;3;4a,4b,4c;5
1;2;3;4;5
If the offending semi-colons only appear in your 5th field then you can do this using GNU awk for the 3rd arg to match():
$ awk 'match($0,/(([^;]+;){4})(.*)(;[^;]+$)/,a){gsub(/;/,",",a[3]); print a[1] a[3] a[4]}' file
bla#bla.com;rockit#sohard.com;RCPT TO;450;+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later;M0641
If your fifth ; should be removed, append $6 to $5 and shift the later fields accordingly. This could be done with a for loop (there are examples on SO), but since the fault is so near the end, we'll just do it in a simpler way:
$ awk 'BEGIN {FS=OFS=";"} NR==1 {nf=NF} NF==(nf+1) {$5=$5 "," $6; $6=$7; NF=nf} 1' file
Explained:
BEGIN {FS=OFS=";"}   # set separators
NR==1 {nf=NF}        # get the field count from the first record (6)
NF==(nf+1) {         # if a record is one field longer:
  $5=$5 "," $6       # append $6 to $5, comma-separated
  $6=$7              # move $7 (the last field) down into $6
  NF=nf              # reset NF to drop the extra field
} 1                  # output
Testing: Running the program and sending the output to cut -d\; -f 5 outputs:
Error
+4.2.0+<rockit#sohard.com>:+Recipient+address+rejected:+Policy+restrictions,+try+later
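One caveat: shrinking NF to drop trailing fields works in GNU awk and mawk, but some older awks do not rebuild the record on an NF decrease, so it is worth a quick test with your awk:
$ echo 'a;b;c;d' | awk 'BEGIN{FS=OFS=";"} {NF=3} 1'
a;b;c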

Awk, Shell Scripting

I have a file which has the following form:
#id|firstName|lastName|gender|birthday|creationDate|locationIP|browserUsed
111|Arkas|Sarkas|male|1995-09-11|2010-03-17T13:32:10.447+0000|192.248.2.123|Midori
Every field is separated with "|". I am writing a shell script and my goal is to remove the "-" from the fifth field (birthday), in order to make comparisons as if they were numbers.
For example i want the fifth field to be like |19950911|
The only solution I have found so far uses sed and deletes all the "-" from every line, which is not what I want.
I would be extremely grateful if you could show me a solution to my problem using awk.
If this is homework, writing the complete script would be a disservice. Some hints: the function you should be using is awk's gsub. The fifth field is $5, and you can set the field separator with -F'|' or in a BEGIN block with FS="|".
Also, line numbers are in the NR variable; to skip the first line, for example, you can add the condition NR>1.
An awk one-liner:
awk 'BEGIN { FS="|" } { gsub("-","",$5); print }' infile.txt
To keep "|" as the output separator, it is better to also define OFS as "|":
... | awk 'BEGIN { FS="|"; OFS="|"} {gsub("-","",$5); print $0 }'
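As a quick check, running it on the sample record from the question (just the data line; skip the #id header with NR>1 as noted above if needed):
$ echo '111|Arkas|Sarkas|male|1995-09-11|2010-03-17T13:32:10.447+0000|192.248.2.123|Midori' | awk 'BEGIN { FS="|"; OFS="|"} {gsub("-","",$5); print $0 }'
111|Arkas|Sarkas|male|19950911|2010-03-17T13:32:10.447+0000|192.248.2.123|Midori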

retaining text after delimiter in fasta headers using awk

I have what should be a simple problem, but my lack of awk knowledge is holding me back.
I would like to clean up the headers of a fasta file that is in this format:
>HWGG454_Clocus2_Locus3443_allele1
ATTCTACTACTACTCT
>GHW757_clocus37_Locus555662_allele2
CTTCCCTACGATG
>TY45_clocus23_Locus800_allele0
TTCTACTTCATCT
I would like to clean up each header (line starting with ">") to retain only the informative part, which is the second "_Locus*" with or without the allele part.
I thought awk would be the easy way to do this, but I can't quite get it to work.
If I wanted to retain just the first column of text up to the "_" delimiter for the header, and the sequences below, I run this (assuming this toy example is in the file test.fasta):
cat test.fasta | awk -F '_' '{print $1}'
>HWGG454
ATTCTACTACTACTCT
>GHW757
CTTCCCTACGATG
>TY45
TTCTACTTCATCT
But what I want is to retain just the "Locus*" text, which is the third field; using this code I get this:
cat test.fasta | awk -F '_' '{print $3}'
Locus3443
Locus555662
Locus800
What am I doing wrong here?
Thanks.
I understand this to mean that you want to pick the Locus field from the header lines and leave the others unchanged. Then:
awk -F _ '/^>/ { print $3; next } 1' filename
is perhaps the easiest way. This works as follows:
/^>/ { # in lines that begin with >
print $3 # print the third field
next # and go to the next line.
}
1 # print other lines unchanged. Here 1 means true, and the
# default action (unchanged printing) is performed.
The thing to understand here is awk's control flow: awk code consists of conditions with associated actions, and the actions are performed if the condition evaluates to true.
/^>/ is a regex match over the whole record (line by default); it is true if the line begins with > (because ^ matches the beginning), so
/^>/ { print $3; next }
will make awk execute print $3; next in lines that begin with >. The less straightforward part is
1
which prints lines unchanged. We only get here if the first action was not executed (because of the next in it), and this 1 is to be read as a condition that is always true -- nonzero values being true in awk.
Now, if either the condition or the action in an awk statement is omitted, a default is used. The default action is printing the line unchanged, and this takes advantage of it. It would be equally possible to write
1 { print }
or
{ print }
In the latter case, the condition is omitted and the default condition "true" is used. 1 is the shortest variant of this and idiomatic because of it.
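A quick way to convince yourself that these are all the same program:
$ printf 'foo\nbar\n' | awk 1
foo
bar
awk 1 prints every input line unchanged, which is exactly what the trailing 1 does here for the lines the first rule's next did not consume.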
$ awk -F_ '{print (/^>/ ? $3 : $0)}' file
Locus3443
ATTCTACTACTACTCT
Locus555662
CTTCCCTACGATG
Locus800
TTCTACTTCATCT
You need a second pattern to match the sequence rows below the headers, e.g.:
cat test.fasta | awk -F _ '/^>/ { print $3"_"$4 } /^[A-Z]/ {print $1}'
Output:
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT
If you don't want the _allele1 bit remove "_"$4 from the awk script.
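That variant would look like this, on the same test.fasta:
$ cat test.fasta | awk -F _ '/^>/ { print $3 } /^[A-Z]/ {print $1}'
Locus3443
ATTCTACTACTACTCT
Locus555662
CTTCCCTACGATG
Locus800
TTCTACTTCATCT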
You can just do a regex on each line:
$ awk '{ sub(/^.*_L/,"L"); print $0}' /tmp/fasta.txt
Locus3443_allele1
ATTCTACTACTACTCT
Locus555662_allele2
CTTCCCTACGATG
Locus800_allele0
TTCTACTTCATCT
