How to combine the two awk files into one? - bash

Here is an original awk file; I want to format it.
Input content (the original awk file, named test.txt):
awk 'BEGIN {maxlength = 0}\
{\
if (length($0) > maxlength) {\
maxlength = length($0);\
longest = $0;\
}\
}\
END {print longest}' somefile
Expected output (the well-formatted awk file):
awk 'BEGIN {maxlength = 0}                          \
{                                                   \
if (length($0) > maxlength) {                       \
maxlength = length($0);                             \
longest = $0;                                       \
}                                                   \
}                                                   \
END {print longest}' somefile
Step 1: get the character count of the longest line
step1.awk
#!/usr/bin/awk -f
BEGIN { max = 0 }
{
if (length($0) > max) { max = length($0)}
}
END {print max}
awk -f step1.awk test.txt
Now the max length for all lines is 50.
Step 2: put the \ at position 50+2=52.
step2.awk
#!/usr/bin/awk -f
{
if($0 ~ /\\$/){
gsub(/\\$/,"",$0);
printf("%-*s\\\n",n,$0);
}
else{
printf("%s\n",$0);
}
}
awk -f step2.awk -v n=52 test.txt > well_formatted.txt
How can I combine step 1 and step 2 into a single step, and combine step1.awk and step2.awk into one awk file?

A better version: you can use sub() instead of gsub(), and avoid testing the same regexp twice by using sub(/\\$/,""){ ... } as the pattern-action:
awk 'FNR==NR{
if(length>max)max = length
next
}
sub(/\\$/,""){
printf "%-*s\\\n", max+2, $0
next
}1' test.txt test.txt
Explanation
awk 'FNR==NR{ # Here we read the file and find
# the max length of a line in the file
# FNR==NR is true when awk reads first file
if(length>max)max = length # find max length
next # stop processing go to next line
}
sub(/\\$/,""){ # Here we read the same file once again;
# if a substitution was made for the regex in the record then
printf "%-*s\\\n", max+2, $0 # printf with format string max+2
next # go to next line
}1 # 1 at the end does the default operation print $0,
# which is nothing but your else statement printf("%s\n",$0) in step2
' test.txt test.txt
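If the FNR==NR idiom is unfamiliar: NR counts records across all input files while FNR resets for each file, so FNR==NR is only true while the first file is being read. A minimal illustration (file names made up):
$ printf 'a\nb\n' > f1; printf 'c\n' > f2
$ awk '{ print FILENAME, NR, FNR, (FNR==NR ? "first file" : "later file") }' f1 f2
f1 1 1 first file
f1 2 2 first file
f2 3 1 later file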
You have not shown us what your input and expected output are, so, with some assumptions:
if your input looks like below
akshay@db-3325:/tmp$ cat f
123 \
\
12345
123456 \
1234567 \
123456789
12345
You get output as follows
akshay@db-3325:/tmp$ awk 'FNR==NR{ if(length>max)max = length; next}
sub(/\\$/,"",$0){ printf "%-*s\\\n",max+2,$0; next }1' f f
123        \
           \
12345
123456     \
1234567    \
123456789
12345

awk '
# first round
FNR == NR {
# take longest (compare and take longest line by line)
M = M < (l = length( $0) ) ? l : M
# go to next line
next
}
# for every line of the second round (due to the previous next) that finishes with \
/[\\]$/ {
# if a modification is needed
if ( ( l = length( $0) ) < M ) {
# add the missing spaces (using sprintf "%9s" for 9 spaces)
sub( /[\\]$/, sprintf( "%" (M - l) "s\\", ""))
}
}
# print every line [modified or not] (7 is a private joke; what matters is that it is <> 0)
7
' test.txt test.txt
Note:
passing the file twice at the end is mandatory so that the file is read twice
assumes there is nothing after the last \ (no trailing space); this could easily be adapted, but that is not the purpose here
assumes lines without a trailing \ are not modified but are still printed

Here is one for GNU awk. Two runs: the first one finds the max length and the second one outputs. FS is set to "" so that each char goes to its own field and the last char will be in $NF:
$ awk 'BEGIN{FS=OFS=""}NR==FNR{m=(m<NF?NF:m);next}$NF=="\\"{$NF=sprintf("% "m-NF+2"s",$NF)}1' file file
Output:
awk 'BEGIN {maxlength = 0}    \
{                             \
if (length($0) > maxlength) { \
maxlength = length($0);       \
longest = $0;                 \
}                             \
}                             \
END {print longest}' somefile
Explained:
BEGIN { FS=OFS="" } # each char on different field
NR==FNR { m=(m<NF?NF:m); next } # find max length
$NF=="\\" { $NF=sprintf("% " m-NF+2 "s",$NF) } # $NF gets space padded
1 # output
If you want the \s further away from the code, change that 2 in sprintf to suit your liking.
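If the FS="" behaviour is unfamiliar, a quick sketch of the per-character splitting (GNU awk only):
$ printf 'abc\\\n' | gawk 'BEGIN{FS=""}{print NF, $NF}'
4 \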

Maybe something like this?
wc -L test.txt | cut -f1 -d' ' | xargs -I{} sed -i -e :a -e 's/^.\{1,'{}'\}$/& /;ta' test.txt && sed -i -r 's/(\\)([ ]*)$/\2\1/g' test.txt
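The same pipeline, split across lines with comments on what each stage appears to do:
wc -L test.txt |                                       # longest line length (plus the file name)
  cut -f1 -d' ' |                                      # keep just the number
  xargs -I{} sed -i -e :a -e 's/^.\{1,'{}'\}$/& /;ta' test.txt &&
  sed -i -r 's/(\\)([ ]*)$/\2\1/g' test.txt
The first sed loops (:a ... ta), appending a space to every line that is not yet longer than the maximum; the second sed then swaps each trailing \ with the padding spaces so all the backslashes end up in the same column.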

Related

Use an array created using awk as a variable in another awk script

I am trying to use awk to extract data using a conditional statement containing an array created using another awk script.
The awk script I use for creating the array is as follows:
array=($(awk 'NR>1 { print $1 }' < file.tsv))
Then, to use this array in the other awk script
awk var="${array[@]}" 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1" && heading[i] in var){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
However, when I run this, the following error occurs.
awk: fatal: cannot open file 'foo' for reading (No such file or directory)
I've already looked at multiple posts on why this error occurs and on how to correctly implement a shell variable in awk, but none of these have worked so far. However, when removing the shell variable and running the script it does work.
awk 'FNR==1{ for(i=1;i<=NF;i++){ heading[i]=$i } next } { for(i=2;i<=NF;i++){ if($i=="1"){ close(outFile); outFile=heading[i]".txt"; print ">kmer"NR-1"\n"$1 >> (outFile) }}}' < input.txt
I really need that conditional statement but don't know what I am doing wrong with implementing the bash variable in awk and would appreciate some help.
Thx in advance.
That specific error message is because you forgot -v in front of var= (it should be awk -v var=, not just awk var=), but as others have pointed out, you can't set an array variable on the awk command line. Also note that array in your code is a shell array, not an awk array, and shell and awk are 2 completely different tools, each with their own syntax, semantics, scopes, etc.
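As a minimal sketch of the -v fix for a scalar variable (names and values made up):
$ awk -v greeting='hello' 'BEGIN { print greeting }'
hello
An array, however, cannot be passed this way; it has to be serialized into one string and rebuilt with split().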
Here's how to really do what you're trying to do:
array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
awk -v xyz="${array[*]}" '
BEGIN{ split(xyz,tmp,RS); for (i in tmp) var[tmp[i]] }
... now use `var` as you were trying to ...
'
For example:
$ cat file.tsv
col1 col2
a b c d e
f g h i j
$ cat -T file.tsv
col1^Icol2
a b^Ic d e
f g h^Ii j
$ awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv
a b
f g h
$ array=( "$(awk 'BEGIN{FS=OFS="\t"} NR>1 { print $1 }' < file.tsv)" )
$ awk -v xyz="${array[*]}" '
BEGIN {
split(xyz,tmp,RS)
for (i in tmp) {
var[tmp[i]]
}
for (idx in var) {
print "<" idx ">"
}
}
'
<f g h>
<a b>
It's easier and more efficient to process both files in a single awk:
edit: fixed issues in comment, thanks @EdMorton
awk '
FNR == NR {
if ( FNR > 1 )
var[$1]
next
}
FNR == 1 {
for (i = 1; i <= NF; i++)
heading[i] = $i
next
}
{
for (i = 2; i <= NF; i++)
if ( $i == "1" && heading[i] in var) {
outFile = heading[i] ".txt"
print ">kmer" (NR-1) "\n" $1 >> (outFile)
close(outFile)
}
}
' file.tsv input.txt
You might store the string in a variable, then use the split() function to turn that into an array. Consider the following simple example: let file1.txt content be
A B C
D E F
G H I
and file2.txt content be
1
3
2
then
var1=$(awk '{print $1}' file1.txt)
awk -v var1="$var1" 'BEGIN{split(var1,arr)}{print "First column value in line number",$1,"is",arr[$1]}' file2.txt
gives output
First column value in line number 1 is A
First column value in line number 3 is G
First column value in line number 2 is D
Explanation: I store the output of the 1st awk command, which is then used as the 1st argument to the split() function in the 2nd awk command. Disclaimer: this solution assumes all files involved have a delimiter compliant with default GNU AWK behavior, i.e. one or more whitespace characters is always the delimiter.
(tested in gawk 4.2.1)

convert table into comma separated in text file using bash

I have a text file like this:
+------------------+------------+----------+
| col_name         | data_type  | comment  |
+------------------+------------+----------+
| _id              | bigint     |          |
| starttime        | string     |          |
+------------------+------------+----------+
how can I get a result like this using bash
(_id bigint, starttime string )
so, just the column names and types
#remove first 3 lines
sed -e '1,3d' columnnames.txt > clean.txt
#remove first character from each line (in place; "< clean.txt > clean.txt" would truncate the file before sed could read it)
sed -i 's/^.//' clean.txt
#remove last character from each line
sed -i 's/.$//' clean.txt
# remove certain characters (the - goes last in the bracket so it is literal, not a range)
sed -i 's/[+|-]//g' clean.txt
# remove last line
sed -i '$ d' clean.txt
so this is what I have so far; if there is a better implementation let me know!
Something similar, using only awk:
awk -F ' *[|]' 'BEGIN {printf("(")} NR>3 && NF>1 {printf("%s%s%s", NR>4 ? "," : "", $2, $3)} END {printf(" )\n")}' columnnames.txt
# Set the field separator to vertical bar surrounded by any number of spaces.
# BEGIN and END blocks print the opening and closing parens
# The line between skips the header lines and any line starting with '+'
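With the sample table this should print:
( _id bigint, starttime string )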
$ awk -F"[[:space:]]*[|][[:space:]]*" '
BEGIN { printf "%s", "( "}
NR > 3 && $0 !~ /^[+]/ { printf("%s%s %s", c, $2, $3); c = ", " }
END { print " )" }' file
( _id bigint, starttime string )
$ awk -F'[| ]+' 'NR>3 && NF>1{v=v s $2" "$3; s=", "} END{print "("v")"}' file
(_id bigint, starttime string)
I would do this:
cat input.txt \
| tail -n +4 \
| awk -F'[^a-zA-Z_]+' '{ for(i=1;i<=NF;i++) { printf "%s ", $i } }'
It's a little bit shorter.
Another way to implement Diego Torres Milano's solution as a stand-alone awk program:
tableconvert
#!/usr/bin/env -S awk -f
BEGIN {
FS="[[:space:]]*[|][[:space:]]*"
printf "%s", "( "
}
{
if (FNR <= 3 || match($0, /^[+]/))
next
else {
printf("%s%s %s", c, $2, $3)
c = ", "
}
}
END {
print " )"
}
Make tableconvert an executable:
chmod +x tableconvert
Run tableconvert on intablefile.txt
./tableconvert intablefile.txt
( _id bigint, starttime string )
With the added bonus that using FNR instead of NR allows the awk program to process multiple input files as arguments:
./tableconvert infile1.txt infile2.txt infile3.txt ...
A variation on the other answers, using awk with the field separator being the '|' with optional spaces on either side (as GNU awk allows), taking fields 2 and 3 as the fields wanted in each record, and formatting the output as described in the question, with the closing " )" provided in the END rule:
$ awk -F' *\\| *' '
NR>3 && $1~/^[+]/{exit} # exit condition first line w/^+
NR==4{$1=$1; printf "(%s %s", $2,$3} # 1st data record is 4
NR>4{$1=$1; printf ", %s %s", $2,$3} # process all remainng records
END{print " )"} # output closing " )"
' table
(_id bigint, starttime string )
(note: if you don't want the two-spaces before the closing ")", just remove them from the print in the END rule)
Rather than using a BEGIN rule, the first record of interest (4) is used to provide the opening "(". Look things over and let me know if you have questions.

end result of bash command with a dot (.)

I have a bash script that greps and sorts information from /etc/passwd:
export FT_LINE1=13
export FT_LINE2=23
cat /etc/passwd | grep -v "#" | awk 'NR%2==1' | cut -f1 -d":" | rev | sort -r | awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'NR>=l1 && NR<=l2' | tr '\n' ',' | sed 's/, */, /g'
The result is this list
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_,
How can I replace the last comma with a dot (.)? I want it to look like this:
sstq_, sorebrek_brk_, soibten_, sirtsa_, sergtsop_, sec_, scodved_, rlaxcm_, rgmecived_, revreswodniw_, revressta_.
You can add:
| sed 's/,$/./'
(where $ means "end of line").
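A quick check of just that substitution:
$ echo 'a_, b_, c_,' | sed 's/,$/./'
a_, b_, c_.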
There are way too many pipes in your command; some of them can be removed.
As explained in the comments, cat <FILE> | grep is a bad habit!!! In general, cat <FILE> | cmd should be replaced by cmd <FILE> or cmd < FILE depending on what type of arguments your command accepts.
On a few GB size file to process, you will already feel the difference.
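For instance, the first stage of your pipeline can drop the cat with no other change:
grep -v "#" /etc/passwd | awk 'NR%2==1' | ...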
This being said, you can do the whole processing without using a single pipe by using awk for example:
awk -v l1="$FT_LINE1" -v l2="$FT_LINE2" 'function reverse(s){p=""; for(i=length(s); i>0; i--){p=p substr(s,i,1);}return p;}BEGIN{cmp=0; FS=":"; ORS=","}!/#/{cmp++;if(cmp%2==1) a[cmp]=reverse($1);}END{asort(a);for(i=length(a);i>0;i--){if((length(a)-i+1)>=l1 && (length(a)-i)<=l2){if(i==1){ORS=".";}print a[i];}}}' /etc/passwd
Explanations:
# BEGIN rule(s)
BEGIN {
cmp = 0 #to be used to count the lines since NR can not be used directly
FS = ":" #field separator :
ORS = "," #output record separator ,
}
# Rule(s)
! /#/ { #for lines that do not contain this char
cmp++
if (cmp % 2 == 1) {
a[cmp] = reverse($1) #add to an array the reverse of the first field
}
}
# END rule(s)
END {
asort(a) #sort the array and process it in reverse order
for (i = length(a); i > 0; i--) {
# apply your range conditions
if (length(a) - i + 1 >= l1 && length(a) - i <= l2) {
if (i == 1) { #when we reach the last character to print, instead of the comma use a dot
ORS = "."
}
print a[i] #print the array element
}
}
}
# Functions, listed alphabetically
#if the reverse operation is necessary then you can use the following function that will reverse your strings.
function reverse(s)
{
p = ""
for (i = length(s); i > 0; i--) {
p = p substr(s, i, 1)
}
return p
}
If you don't need the reverse part, you can just remove it from the awk script.
In the end, not a single pipe is used!!!

Translating a sed one-liner into awk

I am parsing files containing lines of "key=value" pairs. An example could be this:
Normal line
Another normal line
[PREFIX] 1=Something 5=SomethingElse 26=42
Normal line again
I'd like to leave all lines not containing key=value pairs as they are, while transforming all lines containing key=value pairs as follows:
Normal line
Another normal line
[PREFIX]
AAA=Something
EEE=SomethingElse
ZZZ=42
Normal line again
Assume I have a valid dictionary for the translation.
What I do at the moment is passing the input to sed, where I turn spaces into newlines for the lines that match '^\['.
The output is then piped into this awk script:
BEGIN {
dict[1] = "AAA"
dict[5] = "EEE"
dict[26] = "ZZZ"
FS="="
}
{
if (match($0, "[0-9]+=.+")) {
key = ""
if ($1 in dict) {
key = dict[$1]
}
printf("%7s = %s\n", key, $2)
}
else {
print
next
}
}
The overall command line then becomes:
cat input | sed '/^\(\[.*\)/s/ /\n/g' | awk -f script.awk
My question is: is there any way I can include the sed operation in the middle so to get rid of that additional step?
$ cat tst.awk
BEGIN {
split("1 AAA 5 EEE 26 ZZZ",tmp)
for (i=1; i in tmp; i+=2) {
dict[tmp[i]] = tmp[i+1]
}
FS="[ =]"
OFS="="
}
$1 == "[PREFIX]" {
print $1
for (i=2; i<NF; i+=2) {
print " " ($i in dict ? dict[$i] : $i), $(i+1)
}
next
}
{ print }
$ awk -f tst.awk file
Normal line
Another normal line
[PREFIX]
AAA=Something
EEE=SomethingElse
ZZZ=42
Normal line again
In fact I could not force awk to read the file twice (once for the sed command, once for your algo), so I had to modify your algo.
BEGIN {
dict[1] = "AAA"
dict[5] = "EEE"
dict[26] = "ZZZ"
# FS="="
}
$0 !~/[0-9]+=.+/ { print }
/[0-9]+=.+/ {
nb = split($0,arr1);
for (i=1; i<=nb; i++) {
nbb = split(arr1[i], keyVal, "=");
if ( (nbb==2) && (keyVal[1] in dict) ) {
printf("%7s = %s\n", dict[keyVal[1]], keyVal[2])
}
else
print arr1[i];
}
}
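With this version no sed pre-processing is needed; assuming the script is saved as script.awk, run it directly on the original input:
awk -f script.awk input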
When you have to convert a lot, you can first migrate your dict file into a sed script file. When your dict file has a fixed format, you can convert it on the fly.
Suppose your dict file looks like
1=AAA
5=EEE
26=ZZZ
And your input file is
Normal line
Another normal line
[PREFIX] 1=Something 5=SomethingElse 26=42
Normal line again
You want to do something like
cat input | sed '/^\[/ s/ /\n/g' | sed 's/^1=/ AAA=/'
# Or eliminating the extra step with cat
sed '/^\[/ s/ /\n/g' input | sed 's/^1=/ AAA=/'
So your next step is converting your dict file into sed commands:
sed 's#\([^=]*\)=\(.*\)#s/^\1=/ \2=/#' dictfile
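For the dict file above, that emits:
s/^1=/ AAA=/
s/^5=/ EEE=/
s/^26=/ ZZZ=/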
Now you can combine these with
sed '/^\[/ s/ /\n/g' input | sed -f <(
sed 's#\([^=]*\)=\(.*\)#s/^\1=/ \2=/#' dictfile
)
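With the sample input this should produce (note the leading space that the replacement deliberately inserts before each translated key):
Normal line
Another normal line
[PREFIX]
 AAA=Something
 EEE=SomethingElse
 ZZZ=42
Normal line again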

How to replace all but last matching in a file using bash?

Assuming bash, and a configuration file like:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurrence <-- Replace
param-c=cccccc
# param-foo=first commented foo <-- Commented: don't replace
param-d=dddddd
param-e=eeeeee
param-foo=second occurrence <-- Replace
param-foo=third occurrence <-- Last active: don't replace
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Commented: don't replace
param-x=xxxxxx2
In which you can find multiple commented or uncommented lines of param-foo,
how can you comment all the uncommented param-foos except the very last active one,
resulting in:
param-a=aaaaaa
param-b=bbbbbb
# param-foo=first occurrence <-- Replaced
param-c=cccccc
# param-foo=first commented foo <-- Left
param-d=dddddd
param-e=eeeeee
# param-foo=second occurrence <-- Replaced
param-foo=third occurrence <-- Left
param-x=xxxxxx1
param-f=ffffff
# param-foo=second commented foo <-- Left
param-x=xxxxxx2
Two parts of the question:
1. How to do it with only one known repeating param?
(only param-foo in the example above)
2. How to do it with all multiple active params at once?
(param-foo + param-x in the example above)
Attention: in this case I don't know the names of the repeating params in advance!
Thanks
If awk is acceptable, this will do it for param-foo and param-x:
awk -F= -v p='param-foo param-x' 'BEGIN {
ARGV[ARGC++] = ARGV[ARGC - 1]
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile
You may use a single parameter: p=param-x or add more parameters separated by spaces: p='param-1 param-2 ... param-n'.
Edit: I'm assuming the real input file looks like this:
param-a=aaaaaa
param-b=bbbbbb
param-foo=first occurrence
param-c=cccccc
# param-foo=commented foo
param-d=dddddd
param-e=eeeeee
param-foo=second occurrence
param-foo=third occurrence
param-x=xxxxxx1
param-f=ffffff
param-x=xxxxxx2
Let me know if it's different.
Second edit: providing a solution for mawk users:
awk -F= -v p='param-foo param-x' 'BEGIN {
n = split(p, t, OFS)
for (i = 0; ++i <= n;) _p[t[i]]
}
NR == FNR {
$1 in _p && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
$0 = "# " $0
}1' infile infile
Adding solution for the latest requirement:
awk -F= 'NR == FNR {
if (NF && !/^#/)
_p[$1]++ && nr[$1] = NR
next
}
$1 in nr && FNR != nr[$1] {
FNR != nr[$1] && $0 = "# " $0
}1' infile infile
I have not fully tested the script, but it worked on the first example:
#!/bin/bash
input_file=/path/to/your/input/file
last_occurence=`nl $input_file | grep 'param-foo' | grep -v '#' | tail -1 | awk -F" " '{print $1}'`
sed -i '/#/!s/param-foo/# param-foo/g' $input_file
sed -i "${last_occurence}s/# param-foo/param-foo/" $input_file
It's very straightforward logic. First we get the last occurrence of param-foo which is not commented.
The first sed goes and comments all param-foo lines that are not commented.
The second sed uses the line number of the last occurrence of param-foo and removes the # character. You can easily wrap that in a function and use it inside a loop, providing a list of parameters instead of only one, as sketched below.
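A sketch of that loop (untested; the function name and parameter list are made up):
#!/bin/bash
comment_all_but_last() {
    local param=$1 file=$2
    # line number of the last active (uncommented) occurrence
    local last
    last=$(nl "$file" | grep "$param" | grep -v '#' | tail -1 | awk '{print $1}')
    # comment every active occurrence, then uncomment the last one again
    sed -i '/#/!s/'"$param"'/# '"$param"'/g' "$file"
    [ -n "$last" ] && sed -i "${last}s/# $param/$param/" "$file"
}
for param in param-foo param-x; do
    comment_all_but_last "$param" /path/to/your/input/file
done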
A bit slow for long files, but should work for all the parameters:
grep -v ^# $file |
cut -f1 -d= |
sort -u |
sed 's/^/grep -n . '$file' |
tac |
grep -m1 :/;s/$/= /' |
bash |
sed -r 's%([0-9]+):(.*)=(.*)%\1!s/^\2=/# \2=/%' |
sed -f- $file
This might work:
param="param-foo"
tac input_file |sed '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}'|tac >output_file
For multiple params:
cp input_file{,.backup}
params=(param-{foo,bar,baz})
tac input_file >backwards_file
for param in "${params[@]}"; do
sed -i '/#/!{/'"$param"'/{x;/./{x;s/'"$param"'/# &/;t};x;h;}}' backwards_file
done
tac backwards_file >output_file
Turn input_file backwards, prepend all but the first occurrence of $param with a comment #, then reverse the file.
EDIT:
To extract the params from the file use this piece of code:
params=($(sed -rn '/^#/d;/^$/!s/^\s*([^=]*).*/\1/gp' input_file | sort | uniq))
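For the sample file above, that should yield:
params=(param-a param-b param-c param-d param-e param-f param-foo param-x)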
