Editing text in Bash - bash

I am trying to edit text in Bash, i got to point where i am no longer able to continue and i need help.
The text i need to edit:
Symbol Name Sector Market Cap, $K Last Links
AAPL
Apple Inc
Computers and Technology
2,006,722,560
118.03
AMGN
Amgen Inc
Medical
132,594,808
227.76
AXP
American Express Company
Finance
91,986,280
114.24
BA
Boeing Company
Aerospace
114,768,960
203.30
The text i need:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
I already tried :
sed 's/$/,/' BIPSukol.txt > BIPSukol1.txt | awk 'NR==1{print}' BIPSukol1.txt | awk '(NR-1)%5{printf "%s ", $0;next;}1' BIPSukol1.txt | sed 's/.$//'
But it doesnt quite do the job.
(BIPSukol1.txt is the name of the file i am editing)

The biggest problem you have is you do not have consistent delimiters between your fields. Some have commas, some don't and some are just a combination of 3-fields that happen to run together.
The tool you want is awk. It will allow you to treat the first line differently and then condition the output that follows with convenient counters you keep within the script. In awk you write rules (what comes between the outer {...} and then awk applies your rules in the order they are written. This allows you to "fix-up" your hap-hazard format and arrive at the desired output.
The first rule applied FNR==1 is applied to the 1st line. It loops over the fields and finds the problematic "Market Cap $K" field and considers it as one, skipping beyond it to output the remaining headings. It stores a counter count = NF - 3 as you only have 5 lines of data for each Symbol, and skips to the next record.
When count==n the next rule is triggered which just outputs the records stored in the a[] array, zeros count and deletes the a[] array for refilling.
The next rule is applied to every record (line) of input from the 2nd-on. It simply removes any whitespece from the fields by forcing awk to recalculate the fields with $1 = $1 and then stores the record in the array incrementing count.
The last rule, END is a special rule that runs after all records are processed (it lets you sum final tallies or output final lines of data) Here it is used to output the records that remain in a[] when the end of the file is reached.
Putting it altogether in another cut at awk:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
for (i=1;i<=n;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
delete a
count = 0
}
{
$1 = $1
a[++count] = $0
}
END {
for (i=1;i<=count;i++)
printf (i>1?",%s":"%s"), a[i]
print ""
}
' file
Example Use/Output
Note: you can simply select-copy the script above and then middle-mouse-paste it into an xterm with the directory set so it contains file (you will need to rename file to whatever your input filename is)
$ awk '
> FNR==1 {
> for (i=1;i<=NF;i++)
> if ($i == "Market") {
> printf ",Market Cap $K"
> i = i + 2
> }
> else
> printf (i>1?",%s":"%s"), $i
> print ""
> n = NF-3
> count = 0
> next
> }
> count==n {
> for (i=1;i<=n;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> delete a
> count = 0
> }
> {
> $1 = $1
> a[++count] = $0
> }
> END {
> for (i=1;i<=count;i++)
> printf (i>1?",%s":"%s"), a[i]
> print ""
> }
> ' file
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
(note: it is unclear why you want the "Links" heading included since there is no information for that field -- but that is how your desired output is specified)
More Efficient No Array
You always have afterthoughts that creep in after you post an answer, no different than remembering a better way to answer a question as you are walking out of an exam, or thinking about the one additional question you wished you would have asked after you excuse a witness or rest your case at trial. (there was some song that captured it -- a little bit ironic :)
The following does essentially the same thing, but without using arrays. Instead it simply outputs the information after formatting it rather than buffer it in an array for output all at once. It was one of those type afterthoughts:
awk '
FNR==1 {
for (i=1;i<=NF;i++)
if ($i == "Market") {
printf ",Market Cap $K"
i = i + 2
}
else
printf (i>1?",%s":"%s"), $i
print ""
n = NF-3
count = 0
next
}
count==n {
print ""
count = 0
}
{
$1 = $1
printf (++count>1?",%s":"%s"), $0
}
END { print "" }
' file
(same output)

With your shown samples, could you please try following(written and tested in GNU awk). Considering that(by seeing OP's attempts) after header of Input_file you want to make every 5 lines into a single line.
awk '
BEGIN{
OFS=","
}
FNR==1{
NF--
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
OR if your awk doesn't support NF-- then try following.
awk '
BEGIN{
OFS=","
}
FNR==1{
match($0,/Market.*\$K/)
matchedPart=substr($0,RSTART,RLENGTH)
firstPart=substr($0,1,RSTART-1)
lastPart=substr($0,RSTART+RLENGTH)
gsub(/,/,"",matchedPart)
gsub(/ +/,",",firstPart)
gsub(/ +Links( +)?$/,"",lastPart)
gsub(/ +/,",",lastPart)
print firstPart matchedPart lastPart
next
}
{
sub(/^ +/,"")
}
++count==5{
print val,$0
count=0
val=""
next
}
{
val=(val?val OFS:"")$0
}
' Input_file
NOTE: Looks like your header/first line needed special manipulation because we can't simply set , for all spaces, so taken care of it in this solution as per shown samples.

With GNU awk. If your first line is always the same.
echo 'Symbol,Name,Sector,Market Cap $K,Last,Links'
awk 'NR>1 && NF=5' RS='\n ' ORS='\n' FS='\n' OFS=',' file
Output:
Symbol,Name,Sector,Market Cap $K,Last,Links
AAPL,Apple Inc,Computers and Technology,2,006,722,560,118.03
AMGN,Amgen Inc,Medical,132,594,808,227.76
AXP,American Express Company,Finance,91,986,280,114.24
BA,Boeing Company,Aerospace,114,768,960,203.30
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

Related

Comparing columns and printing comments in a new column based on column values

I have a file with multiple columns. I want to check the following conditions :
file.csv
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY
AIT-A;Y9A;RAIT;VIR;67;217;X;X
if $4 contains UNKNOWN print in a new error column "XTRUC is UNKNOWN "
Example :
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2;error
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY;"XTRUC is UNKNOWN."
if for the same value in $3 we have different values in $4 print in a new column "multiple XTRUC value for the same FNAME" and if the previous error exist print the new error in a new line in the same cell.
Example :
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2;error
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY;"XTRUC is UNKNOWN.
multiple XTRUC value for the same FNAME."
AIT-A;Y9A;RAIT;VIR;67;217;X;X;"multiple XTRUC value for the same FNAME"
if $5 and $6 do not match or one of them or both contain something other tan numbers print the error in a new column "XIZE NOK" and/or "XIZE2 NOK" and/or "XIZE and XIZE2 don't match" in a new line if previous errors exist in the same cell.
Example :
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2;error
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY;"XTRUC is UNKNOWN.
multiple XTRUC value for the same FNAME.
XIZE NOK."
AIT-A;Y9A;RAIT;VIR;67;217;X;X;"multiple XTRUC value for the same FNAME.
XIZE and XIZE2 don't match."
if $7 and $8 do not match print the error in a new column "ORG and ORG2 don't match" in a new line if previous errors exist in the same cell.
Example and expected result:
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2;error
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;X;"XTRUC is UNKNOWN.
multiple XTRUC value for the same FNAME.
XIZE NOK."
AIT-A;Y9A;RAIT;VIR;67;217;X;X Y;"multiple XTRUC value for the same FNAME.
XIZE and XIZE2 don't match.
ORG and ORG2 don't match."
Visual result from CSV file :
I tried to use multiple awk commands like :
awk '{if($5!=$6) print "XIZE and XIZE2 do not match" ; elif($5!='^[0-9]+$' print "`XIZE` NOK" ; elif($6!="^-\?[0-9]+$" print "`XIZE` NOK"}' file.csv
It didn't work and with multiple conditions i wonder if there's a simpler way to do it.
I assume you want to add these messages to a new final column.
awk -F ';' 'BEGIN {OFS = FS}
{new_field = NF + 1}
$5 != $6 {$new_field = $new_field "XIZE and XIZE2 do not match\n"}
$5 !~ "^[0-9]+$" {$new_field = $new_field "`XIZE` NOK\n"}
$6 !~ "^-\\?[0-9]+$" {$new_field = $new_field "`XIZE` NOK\n"}
{print}' file.csv > new-file.csv
This may output more newlines than you want. If that's a problem, it's possible to fix that, perhaps using an array and a for loop or building a string and adding it at print time (see below) instead of simple concatenation.
This script
sets the field delimiter for input (-F) and output (OFS) to a semicolon
calculates the field number of a new error field at the end of the row, it does this for each row, so it may be different if the lengths of rows varies
for each true field test it concatenates a message to the error field
regex tests use the negated regex match operator !~
each field in each row is tested (tests are not mutually exclusive (no else), if you want them to be mutually exclusive you can change the form of the tests back to using if and else
prints the whole row whether an error field was added or not
redirects the output to a new file
I used the shorter messages from your AWK script rather than the longer ones in your examples. You can easily change them if needed.
Here is an array version that eliminates an excess newline and wraps the new field in quotes:
awk -F ';' 'BEGIN {OFS = FS}
NR == 1 {print; next}
{new_field = NF + 1; delete arr; i = 0; d = ""; msg = ""}
$5 != $6 {arr[i++] = "XIZE and XIZE2 do not match"}
$5 !~ "^[0-9]+$" {arr[i++] = "`XIZE` NOK"}
$6 !~ "^-\\?[0-9]+$" {arr[i++] = "`XIZE` NOK"}
{
if (i > 0) {
msg = "\"";
for (idx in arr) {
msg = d msg arr[idx];
d = "\n";
}
msg = msg "\"";
$new_field = msg;
};
print
}' file.csv > new-file.csv
I think this might be what you want:
$ cat tst.awk
BEGIN { FS=OFS=";" }
NR == 1 { print $0, "error"; next }
{ numErrs = 0 }
($4 == "UNKNOWN") { errs[++numErrs] = "XTRUC is UNKNOWN" }
($3 != $4) { errs[++numErrs] = "multiple XTRUC value for the same FNAME" }
($5 != $6) || ($5+0 != $5) || ($6+0 != $6) { errs[++numErrs] = "XIZE and XIZE2 don't match" }
($7 != $8) { errs[++numErrs] = "ORG and ORG2 don't match" }
{
printf "%s%s\"", $0, OFS
for ( errNr=1; errNr<=numErrs; errNr++ ) {
printf "%s%s", (errNr>1 ? "\n\t\t\t\t" : ""), errs[errNr]
}
print "\""
}
$ awk -f tst.awk file.csv
A.B.P;FATH;FNAME;XTRUC;XIZE;XIZE2;ORG;ORG2;error
AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY;"XTRUC is UNKNOWN
multiple XTRUC value for the same FNAME
XIZE and XIZE2 don't match
ORG and ORG2 don't match"
AIT-A;Y9A;RAIT;VIR;67;217;X;X;"multiple XTRUC value for the same FNAME
XIZE and XIZE2 don't match"
If you don't REALLY want a bunch of white space at the start of the lines in the quoted fields (I only added them to get output that looks ike you say you wanted in your question), then just get rid of \t\t\t\t from the printf but leave the \n, i.e. printf "%s%s", (errNr>1 ? "\n" : ""), errs[errNr]. I'd normally print ORS insead of \n but you may be doing this to create output for MS-Excel in which case you'd set ORS="\r\n" in the BEGIN section and leave that printf with a \n in it for consistency with Excels CSV format.
printf "AIT;Y9A;RAIT;UNKNOWN;UNKNOWN;80;X;XY" | tr ';' '\n' > stack
[[ $(sed -n '/UNKNOWN/p' stack) ]] && printf "\"XTRUC is UNKNOWN\"" >> stack
tr '\n' ';' < stack > s2
You can do the same thing with whatever other tests you like. Just replace the semi colons with newlines, and then use ed or sed to read the line number corresponding with the line you want. After that, replace the newlines with semicolons again.

combining numbers from multiple text files using bash

I'm strugling to combine some data from my txt files generated in my jenkins job.
on each of the files there is 1 line, this is how each file look:
testsuite name="mytest" cars="201" users="0" bus="0" bike="0" time="116.103016"
What I manage to do for now is to extract the numbers for each txt file:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.txt
Result are :
cars="193" users="2" bus="0" bike="0"
cars="23" users="2" bus="10" bike="7"
cars="124" users="2" bus="5" bike="0"
cars="124" users="2" bus="0" bike="123"
now I have a random number of files like this:
my-output1.txt
my-output2.txt
my-output7.txt
my-output*.txt
I would like to create single command just like the one I did above and to sum all of the files to have the following echo result:
cars=544 users=32 bus=12 bike=44
is there a way to do that? with a single line of command?
Using awk
$ cat script.awk
BEGIN {
FS="[= ]"
} {
gsub(/"/,"")
for (i=1;i<NF;i++)
if ($i=="cars") cars+=$(i+1)
else if($i=="users") users+=$(i+1);
else if($i=="bus") bus+=$(i+1);
else if ($i=="bike")bike+=$(i+1)
} END {
print "cars="cars,"users="users,"bus="bus,"bike="bike
}
To run the script, you can use;
$ awk -f script.awk my-output*.txt
Or, as a ugly one liner.
$ awk -F"[= ]" '{gsub(/"/,"");for (i=1;i<NF;i++) if ($i=="cars") cars+=$(i+1); else if($i=="users") users+=$(i+1); else if($i=="bus") bus+=$(i+1); else if ($i=="bike")bike+=$(i+1)}END{print"cars="cars,"users="users,"bus="bus,"bike="bike}' my-output*.txt
1st solution: With your shown samples please try following awk code, using match function in here. Since awk could read multiple files within a single program itself and your files have .txt format you can pass as .txt format to awk program itself.
Written and tested in GNU awk with its match function's capturing group capability to create/store values into an array to be used later on in program.
awk -v s1="\"" '
match($0,/[[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)"/,tempArr){
temp=""
for(i=2;i<=8;i+=2){
temp=tempArr[i-1]
values[i]+=tempArr[i]
indexes[i-1]=temp
}
}
END{
for(i in values){
val=(val?val OFS:"") (indexes[i-1]"=" s1 values[i] s1)
}
print val
}
' *.txt
Explanation:
In start of GNU awk program creating variable named s1 to be set to " to be used later in the program.
Using match function in main program of awk.
Mentioning regex [[:space:]]+(cars)="([^"]*)" (users)="([^"]*)" (bus)="([^"]*)" (bike)="([^"]*)"(explained at last of this post) which is creating 8 groups to be used later on.
Then once condition is matched running a for loop which runs only even numbers in it(to get required values only).
Creating array values with index of i and keep adding its own value + tempArr values to it, where tempArr is created by match function.
Similarly creating indexes array to store only key values in it.
Then in END block of this program traversing through values array and printing the values from indexes and values array as per requirement.
Explanation of regex:
[[:space:]]+ ##Matching spaces 1 or more occurrences here.
(cars)="([^"]*)" ##Matching cars=" till next occurrence of " here.
(users)="([^"]*)" ##Matching spaces followed by users=" till next occurrence of " here.
(bus)="([^"]*)" ##Matching spaces followed by bus=" till next occurrence of " here.
(bike)="([^"]*)" ##Matching spaces followed by bike=" till next occurrence of " here.
2nd solution: In GNU awk only with using RT and RS variables power here. This will make sure the sequence of the values also in output should be same in which order they have come in input.
awk -v s1="\"" -v RS='[[:space:]][^=]*="[^"]*"' '
RT{
gsub(/^ +|"/,"",RT)
num=split(RT,arr,"=")
if(arr[1]!="time" && arr[1]!="name"){
if(!(arr[1] in values)){
indexes[++count]=arr[1]
}
values[arr[1]]+=arr[2]
}
}
END{
for(i=1;i<=count;i++){
val=(val?val OFS:"") (indexes[i]"=" s1 values[indexes[i]] s1)
}
print val
}
' *.txt
You may use this awk solution:
awk '{
for (i=1; i<=NF; ++i)
if (split($i, a, /=/) == 2) {
gsub(/"/, "", a[2])
sums[a[1]] +=a[2]
}
}
END {
for (i in sums) print i "=" sums[i]
}' file*
bus=15
cars=464
users=8
bike=130
found a way to do so a bit long:
awk '/<testsuite name=/{print $3, $4, $5, $6}' my-output*.xml | sed -e 's/[^0-9]/ /g' -e 's/^ *//g' -e 's/ *$//g' | tr -s ' ' | awk '{bus+=$1;users+=$2;cars+=$3;bike+=$4 }END{print "bus=" bus " users="users " cars=" cars " bike=" bike}'
M. Nejat Aydin answer was good fit:
awk -F '[ "=]+' '/testsuite name=/{ cars+=$5; users+=$7; buses+=$9; bikes+=$11 } END{ print "cars="cars, "users="users, "buses="buses, "bikes="bikes }' my-output*.xml

How to assign awk result variable to an array and is it possible to use awk inside another awk in loop

I've started to learn bash and totally stuck with the task. I have a comma separated csv file with records like:
id,location_id,organization_id,service_id,name,title,email,department
1,1,,,Name surname,department1 department2 department3,,
2,1,,,name Surname,department1,,
3,2,,,Name Surname,"department1 department2, department3",, e.t.c.
I need to format it this way: name and surname must start with a capital letter
add an email record that consists of the first letter of the name and full surname in lowercase
create a new csv with records from the old csv with corrected fields.
I split csv on records using awk ( cause some fields contain fields with a comma between quotes "department1 department2, department3" ).
#!/bin/bash
input="$HOME/test.csv"
exec 0<$input
while read line; do
awk -v FPAT='"[^"]*"|[^,]*' '{
...
}' $input)
done
inside awk {...} (NF=8 for each record), I tried to use certain field values ($1 $2 $3 $4 $5 $6 $7 $8):
#it doesn't work
IFS=' ' read -a name_surname<<<$5 # Field 5 match to *name* in heading of csv
# Could I use inner awk with field values of outer awk ($5) to separate the field value of outer awk $5 ?
# as an example:
# $5="${awk '{${1^}${2^}}' $5}"
# where ${1^} and ${2^} fields of inner awk
name_surname[0]=${name_surname[0]^}
name_surname[1]=${name_surname[1]^}
$5="${name_surname[0]}' '${name_surname[1]}"
email_name=${name_surname[0]:0:1}
email_surname=${name_surname[1]}
domain='#domain'
$7="${email_name,}${email_surname,,}$domain" # match to field 7 *email* in heading of csv
how to add field values ($1 $2 $3 $4 $5 $6 $7 $8) to array and call function join for each for loop iteration to add record to new csv file?
function join { local IFS="$1"; shift; echo "$*"; }
result=$(join , ${arr[#]})
echo $result >> new.csv
This may be what you're trying to do (using gawk for FPAT as you already were doing) but without more representative sample input and the expected output it's a guess:
$ cat tst.sh
#!/usr/bin/env bash
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
NR > 1 {
n = split($5,name,/\s*/)
$7 = tolower(substr(name[1],1,1) name[n]) "#example.com"
print
}
' "${#:--}"
$ ./tst.sh test.csv
1,1,,,Name surname,department1 department2 department3,nsurname#example.com,
2,1,,,name Surname,department1,nsurname#example.com,
3,2,,,Name Surname,"department1 department2, department3",nsurname#example.com,
I put the awk script inside a shell script since that looks like what you want, obviously you don't need to do that you could just save the awk script in a file and invoke it with awk -f.
Completely working answer by Ed Morton.
If it may be will be helpful for someone, I added one more checking condition: if in CSV file more than one email address with the same name - index number is added to email local part and output is sent to file
#!/usr/bin/env bash
input="$HOME/test.csv"
exec 0<$input
awk '
BEGIN {
OFS = ","
FPAT = "[^"OFS"]*|\"[^\"]*\""
}
(NR == 1) {print} #header of csv
(NR > 1) {
if (length($0) > 1) { #exclude empty lines
count = 0
n = split($5,name,/\s*/)
email_local_part = tolower(substr(name[1],1,1) name[n])
#array stores emails from csv file
a[i++] = email_local_part
#find amount of occurrences of the same email address
for (el in a) {
ret=match(a[el], email_local_part)
if (ret == 1) { count++ }
}
#add number of occurrence to email address
if (count == 1) { $7 = email_local_part "#abc.com" }
else { --count; $7 = email_local_part count "#abc.com" }
print
}
}
' "${#:--}" > new.csv

count the max number of _ and add additional ; if missing

I have a file with several fields like below
deme_Fort_Email_am;04/02/2015;Deme_Fort_Postal
deme_faible_Email_am;18/02/2015;deme_Faible_Email_Relance_am
equi_Fort_Email_am;23/02/2015;trav_Fort_Email_am
trav_Faible_Email_pm;18/02/2015;trav_Faible_Email_Relance_pm
trav_Fort_Email_am;12/02/2015;Trav_Fort_Postal
voya_Faible_Email_am;29/01/2015;voya_Faible_Email_Relance_am
Aim is to have that
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
I'm counting the max of underscore for one of the line then change it to semi-colon and add additional semi-colon, if it is not the maximum number of semi-colon found in all the lines.
I thought about using awk for that but I will only change ,with the command line below , every thing after the first field. My aim is also to add additional semi-colon
awk 'BEGIN{FS=OFS=";"} {for (i=1;i<=NF;i++) gsub(/_/,";", $i) } 1' file
Note: As awk is dealing on a line by line basis, I'm not sure I can do that but I'm asking just in case. If it cannot be done, please let me know and I'll try to find another way.
Thanks.
Here's a two-pass solution. Note you need to put the data file twice on the command line when running awk:
$ cat mu.awk
BEGIN { FS="_"; OFS=";" }
NR == FNR { if (max < NF) max = NF; next }
{ $1=$1; i = max; j = NF; while (i-- > j) $0 = $0 OFS }1
$ awk -f mu.awk mu.txt mu.txt
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
The BEGIN block sets the input and output file separators.
The NF == FNR block makes the first pass through the file, setting the max number of fields.
The last block makes the second pass through the file. First it reconstitutes the line to use the output file separator and than adds an extra ; for however many fields the line is short of the max.
EDIT
This version answers the updated question to only affect fields after field 7:
$ cat mu2.awk
BEGIN { OFS=FS=";" }
# First pass, find the max number of "_"
NR == FNR { gsub("[^_]",""); if (max < length()) max = length(); next }
# Second pass:
{
# count number of "_" less than the max
line = $0
gsub("[^_]","", line)
n = max - length(line)
# replace "_" with ";" after field 7
for (i=8; i<=NF; ++i) gsub("_", ";", $i);
# add an extra ";" for each "_" less than max
while (n-- > 0) $0 = $0 ";"
}1
$ awk -f mu2.awk mu2.txt mu2.txt
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
xxx;x_x_x;xxx;xxx;x_x_x;xxx;xxx;voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
This should do:
awk -F_ '{for (i=1;i<=NF;i++) a[NR FS i]=$i;c=NF>c?NF:c} END {for (j=1;j<=NR;j++) {for (i=1;i<c;i++) printf "%s;",a[j FS i];print a[j FS c]}}' file
deme;Fort;Email;am;04/02/2015;Deme;Fort;Postal;;
deme;faible;Email;am;18/02/2015;deme;Faible;Email;Relance;am
equi;Fort;Email;am;23/02/2015;trav;Fort;Email;am;
trav;Faible;Email;pm;18/02/2015;trav;Faible;Email;Relance;pm
trav;Fort;Email;am;12/02/2015;Trav;Fort;Postal;;
voya;Faible;Email;am;29/01/2015;voya;Faible;Email;Relance;am
How it works:
awk -F_ ' # Set field separator to "_"
{for (i=1;i<=NF;i++) # Loop trough one by one field
a[NR FS i]=$i # Store the field in array "a" using both row(NR) and column position(i) as referense
c=NF>c?NF:c} # Find the largest number of fields and store it in "c"
END { # When file read is done, then do at end
for (j=1;j<=NR;j++) { # Loop trough all row
for (i=1;i<c;i++) # Loop trough all column
printf "%s;",a[j FS i] # Print one and one field for every row
print a[j FS c] # Print end field in each row
}
}
' file # read the file

Sequence length of FASTA file

I have the following FASTA file:
>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT
My desired output:
>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.
This is my code:
awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa
The output I get with this code is:
>header1
60
57
>header2
3
>header3
7
I need a small modification in order to deal with multiple sequence lines.
I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.
An awk / gawk solution can be composed by three stages:
Every time header is found these actions should be performed:
Print previous seqlen if exists.
Print tag.
Initialize seqlen.
For the sequence lines we just need to accumulate totals.
Finally at the END stage we print the remnant seqlen.
Commented code:
awk '/^>/ { # header pattern detected
if (seqlen){
# print previous seqlen if exists
print seqlen
}
# pring the tag
print
# initialize sequence
seqlen = 0
# skip further processing
next
}
# accumulate sequence length
{
seqlen += length($0)
}
# remnant seqlen if exists
END{if(seqlen){print seqlen}}' file.fa
A oneliner:
awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
For the totals:
awk '/^>/ { if (seqlen) {
print seqlen
}
print
seqtotal+=seqlen
seqlen=0
seq+=1
next
}
{
seqlen += length($0)
}
END{print seqlen
print seq" sequences, total length " seqtotal+seqlen
}' file.fa
A quick way with any awk, would be this:
awk '/^>/{if (l!="") print l; print; l=0; next}{l+=length($0)}END{print l}' file.fasta
You might be also interested in BioAwk, it is an adapted version of awk which is tuned to process FASTA files
bioawk -c fastx '{print ">" $name ORS length($seq)}' file.fasta
Note: BioAwk is based on Brian Kernighan's awk which is documented in "The AWK Programming Language",
by Al Aho, Brian Kernighan, and Peter Weinberger
(Addison-Wesley, 1988, ISBN 0-201-07981-X)
. I'm not sure if this version is compatible with POSIX.
I wanted to share some tweaks to klashxx's answer that might be useful. Its output differs in that it prints the sequence id and its length on one line, It's no longer a one-liner, so the downside is you'll have to save it as a script file.
It also parses out the sequence id from the header line, based on whitespace (chrM in >chrM gi|251831106|ref|NC_012920.1|). Then, you can select a specific sequence based on the id by setting the variable target like so: $ awk -f seqlen.awk -v target=chrM seq.fa.
BEGIN {
OFS = "\t"; # tab-delimited output
}
# Use substr instead of regex to match a starting ">"
substr($0, 1, 1) == ">" {
if (seqlen) {
# Only print info for this sequence if no target was given
# or its id matches the target.
if (! target || id == target) {
print id, seqlen;
}
}
# Get sequence id:
# 1. Split header on whitespace (fields[1] is now ">id")
split($0, fields);
# 2. Get portion of first field after the starting ">"
id = substr(fields[1], 2);
seqlen = 0;
next;
}
{
seqlen = seqlen + length($0);
}
END {
if (! target || id == target) {
print id, seqlen;
}
}
"seqkit" is a quick way:
seqkit fx2tab --length --name --header-line sequence.fa

Resources