I have data in the format below and I am trying to capture the data between two occurrences of a string and write each block to its own file.
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
create statement
CREATE VIEW `T3` AS SELECT
aa
bb
ff
..
...
..
I want output in the format below:
File T1 should contain:
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
File T2 should contain:
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
The output filename comes from the value surrounded by backticks.
I tried:
sed -n '/create statement/,/create statement/p'
Could you please try the following and let me know if it helps.
awk '/create statement/{create=$0;next} /CREATE VIEW/{val=$3;gsub("`","",val);filename=val;if(create){print create ORS $0 > filename};next} {print > filename}' Input_file
It will create output files named T1, T2, T3, and so on for every occurrence. If this is not what you are asking, please clarify your question and add more details.
Here is the same solution in non-one-liner form:
awk '
/create statement/{
  create=$0
  next
}
/CREATE VIEW/{
  val=$3
  gsub("`","",val)
  filename=val
  if(create){
    print create ORS $0 > filename
  }
  next
}
{
  print > filename
}
' Input_file
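To spot-check the result (assuming the sample input above is saved as Input_file), the first two lines of each output file should be the stored "create statement" line followed by the matching CREATE VIEW line:
$ head -2 T1
create statement
CREATE VIEW `T1` AS SELECT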
Awk solution (note it also writes the target filename as the first line of each output file, as seen in the results below):
awk '/^create statement/{ s = $0; n = NR + 1; next }
NR == n{ t = $3; gsub("`", "", t); print t ORS s > t }{ print > t }' file
Results:
$ head T[123]
==> T1 <==
T1
create statement
CREATE VIEW `T1` AS SELECT
aa
bb
cc
dd
==> T2 <==
T2
create statement
CREATE VIEW `T2` AS SELECT
aa
ff
ee
==> T3 <==
T3
create statement
CREATE VIEW `T3` AS SELECT
aa
bb
ff
With GNU awk:
awk -F'`' -v ORS= -v RS='create' 'NF{print RS $0 > $2}' ip.txt
-F'`' sets backtick as the input field separator
-v ORS= sets an empty ORS to avoid an extra newline
-v RS='create' uses create as the input record separator (a multi-character, regex RS, which is a GNU awk feature)
NF{print RS $0 > $2} for non-empty records, prints the value of RS followed by the record to a file named after the second field (which will be T1, T2, etc.)
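For readability, here is the same one-liner in expanded form (identical behavior, just spread out with comments):
awk -F'`' -v ORS= -v RS='create' '
NF{                  # skip the empty record before the first "create"
  print RS $0 > $2   # put the consumed separator back, write to the file named by field 2
}' ip.txt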
I have the following data (it also contains other lines, here is a meaningful extract):
group
bb 1
cc 1
dd 1
end
group
dd 2
bb 2
end
group
aa 3
end
I don't know the values (like "1", "2", etc.) and have to match by the names (generic "group", "aa", etc.).
I want to get the data filtered and sorted in the following order (with empty tabs when the string is absent):
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
I run:
awk 'BEGIN {ORS = "\t"}\
/^group/ {print "\n" $0}; \
/^aa/ {AA = $0}; \
/^bb/ {BB = $0}; \
/^cc/ {CC = $0}; \
/^dd/ {DD = $0}; \
/^end/ {print AA; print BB; print CC; print DD}' test.txt
and get
group bb 1 cc 1 dd 1
group bb 2 **cc 1** dd 2
group aa 3 **bb 2** **cc 1** **dd 2**
which is in the right order, but the data is wrong (marked with asterisks). What is the correct way to do this filtering?
Thanks!
Assumptions:
input lines do not start with any white space
each ^group has a matching ^end
the first line in the file is ^group
the last line in the file is ^end
there are no lines (to ignore) between ^end and the next ^group
The primary issue is that each time group is seen we need to clear/reset the variables; otherwise we carry over the values from the previous group.
Other (minor) issues:
ORS vs OFS
multiple print commands vs a single print command
no need for line continuation characters (\)
One idea for an updated awk script:
awk '
BEGIN { OFS="\t" }
/^group/ { AA=BB=CC=DD="" ; next }
/^aa/ { AA=$0 ; next }
/^bb/ { BB=$0 ; next }
/^cc/ { CC=$0 ; next }
/^dd/ { DD=$0 ; next }
/^end/ { print "group",AA,BB,CC,DD }
' test.txt
NOTE: the ; next clauses are optional; they are included as a visual reminder that we don't need to worry about the rest of the script for the current line.
This generates:
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
Here is a simpler awk solution to do the same:
awk '/^group$/{delete m; next} {m[$1]=$0} /^end$/{
printf "group\t%s\t%s\t%s\t%s\n", m["aa"], m["bb"], m["cc"], m["dd"]
}' file
group bb 1 cc 1 dd 1
group bb 2 dd 2
group aa 3
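One portability note on the script above: delete m on a whole array is a widespread extension (gawk, mawk, BWK awk). If you need to stay strictly POSIX, the classic idiom is to clear the array with split:
awk '/^group$/{split("", m); next} {m[$1]=$0} /^end$/{
printf "group\t%s\t%s\t%s\t%s\n", m["aa"], m["bb"], m["cc"], m["dd"]
}' file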
With GNU awk, try the following code (written and tested with the shown samples only). A simple explanation: set RS to end followed by an optional newline, then substitute the newlines inside each record with OFS and print the line.
awk -v RS='end\n?' 'RT{gsub(/\n/,OFS);print}' Input_file
Or, if you want the output tab-delimited, try the following:
awk -v RS='end\n?' -v OFS="\t" 'RT{gsub(/\n/,OFS);print}' Input_file
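Against the sample input this should produce output along these lines (note that, unlike the solutions above, the fields keep their input order, so dd 2 comes before bb 2):
group bb 1 cc 1 dd 1
group dd 2 bb 2
group aa 3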
We have a requirement to format data, given in list format, into a CSV file.
Example input:
ORG MANAGER
No ORG MANAGER found
BILLING MANAGER
No BILLING MANAGER found
ORG AUDITOR
xxx
yyy
zzz
aaa
bbb
ccc
Example output:
ORG MANAGER BILLING MANAGER ORG AUDITOR
No ORG MANAGER found No BILLING MANAGER found xxx
yyy
zzz
aaa
bbb
ccc
I split every paragraph into a separate file and tried printing them side by side using a paste -d command like below:
paste -d "\t\t\t" file1 file2 file3 > fin.csv
After this I tried formatting the output using the command below:
awk '{ $NF = "\t" $NF; print }' fin.csv | column -t -s $'\t'
But the output is not what I expected.
paste -d "\t\t\t" file1 file2 file3 > fin.csv --> to print the files side by side
awk '{ $NF = "\t" $NF; print }' fin.csv | column -t -s $'\t' --> to format
I expect to print every paragraph in a separate column so that I can open the result in Excel for formatting.
I am adding the expected input and output format in the attached snapshot for clarity.
It is easily done with awk:
awk 'BEGIN{ RS=""; FS="\n"; OFS=","; ORS="\n" }
{ for (i=1;i<=NF;++i) { c[FNR,i]=$i; sub(/^[[:blank:]]*/,"",c[FNR,i]) } }
{ nf_max = (NF > nf_max ? NF : nf_max) }
END{ for (j=1;j<=nf_max;++j) {
       for (i=1;i<=FNR;++i) { printf ("%s" (i==FNR?ORS:OFS)), c[i,j] }
     }
}' file
This will output a CSV of the following format:
ORG MANAGER,BILLING MANAGER,ORG AUDITOR
No ORG MANAGER found,No BILLING MANAGER found,xxx
,,yyy
,,zzz
,,aaa
,,bbb
,,ccc
How does this work?
By telling awk to set the record separator RS to an empty string, we define each record to be a block of text separated by an empty line.
Each field in that record is separated by a newline character.
We store each field in an array which is indexed by record number FNR and field number. This way we can fully reconstruct the CSV file.
Since you want a CSV file, we set the output field separator OFS to a <comma> character, and the output record separator ORS, since records are now lines, to a <newline> character.
We keep track of the maximum number of fields per record, which indicates the maximum number of rows in the CSV file.
If a record has fewer than the maximum number of fields, we can still request that field's content from our array, as awk, by default, treats unset values as empty strings.
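A quick way to see that default in action (just a demonstration, not part of the solution):
$ awk 'BEGIN{ print "[" c[1,99] "]" }'
[]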
Your question initially asked for a CSV file, but you also requested an aligned TSV file. We could expand the above awk command for this, but it is easier to just parse the full output with the column command:
$ awk ... file | column -s, -o $'\t' -t
ORG MANAGER BILLING MANAGER ORG AUDITOR
No ORG MANAGER found No BILLING MANAGER found xxx
yyy
zzz
aaa
bbb
ccc
You can use cat -vET to verify that the fields are padded with spaces to matching widths and that only a single tab is inserted between the fields:
$ awk ... file | column -s, -o $'\t' -t | cat -vET
ORG MANAGER ^IBILLING MANAGER ^IORG AUDITOR$
No ORG MANAGER found^INo BILLING MANAGER found^Ixxx$
^I ^Iyyy$
^I ^Izzz$
^I ^Iaaa$
^I ^Ibbb$
^I ^Iccc$
To get output that you can import into Excel as a row of cells:
$ awk -v RS= '{gsub(/\n +/,"\n"); printf "%s\"%s\"", s, $0; s=","} END{print ""}' file
"ORG MANAGER
No ORG MANAGER found","BILLING MANAGER
No BILLING MANAGER found","ORG AUDITOR
xxx
yyy
zzz
aaa
bbb
ccc"
Save the output in a file "foo.csv", double-click on it in Windows and it'll be displayed as you want in Excel.
To get the output you asked for visually:
$ cat tst.awk
BEGIN { numCols=1; OFS="\t" }
NF {
sub(/^[[:space:]]+/,"")
vals[++rowNr,numCols] = $0
wid[numCols] = (wid[numCols] > length() ? wid[numCols] : length())
numRows = (numRows > rowNr ? numRows : rowNr)
next
}
{ numCols++; rowNr=0 }
END {
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
printf "%-*s%s", wid[colNr], vals[rowNr,colNr], (colNr<numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
ORG MANAGER BILLING MANAGER ORG AUDITOR
No ORG MANAGER found No BILLING MANAGER found xxx
yyy
zzz
aaa
bbb
ccc
Here is another awk script.
/^[[:space:]]*$/{ # column separator
maxRow = (rowCount > maxRow) ? rowCount : maxRow; # find maxRows
rowCount = 0; # reset rows count
columnCount++; # increment columns count
next; # skip inclusion in cells
}
{ cells[(columnCount + 1)","++rowCount] = $0; } # read each input row as cell
END {
maxRow = (rowCount > maxRow) ? rowCount : maxRow; # find maxRow (including the last column)
columnCount++; # count the last column read (assuming no trailing blank line)
for (row = 1; row <= maxRow; row++) { # print out each row
printf("%s", cells[1","row]); # print out the first element in row
for (col = 2; col <= columnCount; col++) {
printf("\t%s", cells[col","row]); # print , delimiter for each element in row
}
printf("\n"); # terminate each row with newline
}
}
The output is tab delimited:
ORG MANAGER BILLING MANAGER ORG AUDITOR
No ORG MANAGER found No BILLING MANAGER found xxx
yyy
zzz
aaa
bbb
ccc
You can add as many columns as required.
The execution command is:
awk -f script.awk input.txt > output.csv
To edit with Microsoft Excel or LibreOffice Calc, open a new spreadsheet and import the data from output.csv using the data-import tools. The output.csv data is tab delimited.
good luck.
I have a large dataset that looks like this:
ID224912 A A A B B A B A B A B
and I want to make it look like:
ID224912 AA AB BA BA BA BA
I have tried modifying this code that I found somewhere else, but with no success:
AWK=''' { printf (""%s %s %s %s"", $1, $2, $3, $4); }
{ for (f = 5; f <= NF; f += 2) printf (""%s %s"", $(f), $(f + 1)); }
{ printf (""\n""); } '''
awk ""${AWK}"" InFile > OutFile
Any suggestions?
This might work for you (GNU sed):
sed -E 's/((\S+\s\S+\s)*\S+).*/\1/g;s/(\S+\s\S+)\s/\1/g' file
The solution is in two parts. First, truncate the line to the ID plus an even number of letters, discarding a trailing unpaired letter if there is one. Then delete every second space, which joins the letters in pairs.
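To see the two stages separately (a quick check, assuming GNU sed and the sample line from the question):
$ echo 'ID224912 A A A B B A B A B A B' | sed -E 's/((\S+\s\S+\s)*\S+).*/\1/g'
ID224912 A A A B B A B A B A
$ echo 'ID224912 A A A B B A B A B A B' | sed -E 's/((\S+\s\S+\s)*\S+).*/\1/g;s/(\S+\s\S+)\s/\1/g'
ID224912 AA AB BA BA BA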
$ awk '{r=$1; for (i=2; i<NF; i+=2) r=r OFS $i $(i+1); print r}' file
ID224912 AA AB BA BA BA
You do not have to assign the AWK script to a variable. Just invoke it inline, which is simpler and safer.
It looks strange that you are grouping the first four fields. As far as I can see from your desired output, it would be enough just to treat the first (ID) field separately.
Try something like:
awk '{printf("%s", $1); for (i=2; i<=NF; i+=2) printf(" %s%s", $i, $(i+1)); print ""}' InFile > OutFile
Hope this helps.
For funsies here is a sed solution:
cat input | sed 's/\([ A-Z ]\) \([ A-Z ]\)/\1\2/g' > output
Just for clarification I tested on BSD sed.
With InFile as your input file, you can use sed this way:
cat InFile |sed -e 's/\([a-zA-Z]\)[ \t]\([a-zA-Z]\)/\1\2/g'
N.B.: with the specified InFile in your initial question (with an odd count of letters), the result is:
ID224912 AA AB BA BA BA B
The following awk line
awk '{printf $1}{for(i=2;i<=NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA B
As you can see, we have an extra column B at the end, due to the even number of columns in the original input. As the OP does not want this, we can fix it with a simple update to the for-loop condition:
awk '{printf $1}{for(i=2;i<NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA
Hi, I have a text file with values:
A VAL|1|2|3|
C VAL|2|2|3|
D VAL|1|2|3|
[No space between lines]
I want to replace the values in the above as per the first column, i.e. A VAL, C VAL, D VAL,
so I want to
1. replace 3 from A VAL row
2. replace 2 value from C VAL row.
3. replace 1 value from D VAL row.
Basically I want to modify the above values by using AWK, as AWK helps in treating CSV and pipe-delimited files.
So I tried using the following AWK command:
awk 'BEGIN {OFS=FS="|"} {if ($1="A") sub($4,"A1") ;elseif ($1="C") sub($2,"B1"); print }' myval.txt
But I am getting wrong results:
C|B1|2|A1|B1C
C|B1|2|A1|B1C
C|B1|2|3|B1C
The first column itself is getting replaced, and the substitution is at the wrong position.
Expected output is:
A VAL|1|2|A1|
C VAL|2|2|B1|
D VAL|1|2|3|
You can try this awk:
awk 'BEGIN{OFS=FS="|"} $1 ~ /^A/{$(NF-1)="A1"} $1 ~ /^C/{$(NF-1)="B1"} 1' file.csv
A VAL|1|2|A1|
C VAL|2|2|B1|
D VAL|1|2|3|
awk 'BEGIN{OFS=FS="|"} {if (substr($1,1,1)=="A") $4="A1"; else if (substr($1,1,1)=="C") $4="B1"; print}' inputtext.txt > outtext.txt
This works fine as well (note that awk strings are 1-indexed, so the first character is substr($1,1,1)).
I have a file with 500 columns and I would need to split each column into a new file while printing $1 as common in all the files. Below is a sample file, and I managed to do this using the below bash/awk solution :
ID F1 F2 F4 F4
aa 1 2 3 4
bb 1 2 3 4
cc 1 2 3 4
dd 1 2 3 4
num=('1' '2' '3' '4')
for i in "${num[@]}"; do awk -F "\t" -v col="$i" '{print $1,$col}' OFS="\t" Input.txt > ${i}.txt; done
which gives the required output as:
1.txt
ID ID
aa aa
bb bb
cc cc
dd dd
2.txt
ID F1
aa 1
bb 1
cc 1
dd 1
....
However, I could not track which file corresponds to which column, as the output file name is the field number rather than the field name. Would it be possible to write the header of the field as the output file name?
ID.txt
ID ID
aa aa
bb bb
cc cc
dd dd
F1.txt
ID F1
aa 1
bb 1
cc 1
dd 1
You can do it all in one awk script. When processing the first line, put all the column headings in an array. Then, as you process each line, write to the file names from that array in a loop. Closing each file after every write also keeps you from hitting the open-file limit with 500 columns.
awk -F'\t' 'NR == 1 { split($0, filenames) }
{for (col = 1; col <= NF; col++) {
file= filenames[col] ".txt";
print $1, $col >> file;
close(file) } }' Input.txt
If I understand your requirement correctly, it seems like you're very close. Try
num=('1' '2' '3' '4')
for i in "${num[@]}"; do
echo "i=$i"
awk -F "\t" -v col="$i" -v OFS="\t" '
NR==1{fName=$(col+1)".out";next}
{print $1,$(col+1) > fName}' data.txt
done
$ cat F1.out
aa 1
bb 1
cc 1
dd 1
. . . .
$ cat F4.out
aa 4
bb 4
cc 4
dd 4
Edit
If you need to keep the headers as shown in your example output, just remove the ;next.
Edit 2
If you have multiple columns with the same name, you can append the data to the same file by using >> fName instead. One word of warning with this technique: when you use > fName, awk "restarts" the file each time you rerun your script, but when using >>, you will be appending to each file every time you run the script. That can cause problems for down-stream processes ;-) ... so you'd need to add code that cleans up the output of your previous run, as sketched below.
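One simple way to get that cleanup from inside the script (a sketch along the lines of the awk above; the truncating printf on the header line is my addition): truncate each output file once, when the header is read, and append from then on:
awk -F "\t" -v col="$i" -v OFS="\t" '
NR==1{ fName=$(col+1)".out"; printf "" > fName; next }  # truncate any previous run output
{ print $1, $(col+1) >> fName }' data.txt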
Here we're relying on the fact that awk can also write output to a file using > fName, where fName is built from the header found in field col+1 (the +1 skips over the first, common column).
And, if you were going to do this thousands of times a day, it would be worth further optimizing, per the comments above, to have awk read the file once and create all the outputs from internal loops. But if you only need to do this a couple of times, then your 'use the tools of unix/linux to decompose the task into manageable parts' approach is perfectly appropriate.
IHTH