(sed/awk) extract values text file and write to csv (no pattern) - bash

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.
The file let's call it my_file_1.txt has a structure that looks something like this
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct as the N will gobble up some "start value" lines leading to wrong result. It feels like it could be done easier with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified my version of awk, as gensub is a GNU awk extension and is not available in all versions.
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
I accepted this answer because it is the most understandable.
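For awk versions without gensub, the same extraction can probably be done with the POSIX functions match() and substr(). The following is only a sketch under assumptions: the sample input is made up to mirror the structure in the question, the helper name num_after is hypothetical, and the file column is left out because the temporary file name would differ on every run.

```shell
# Hypothetical sample input mirroring the structure described in the question.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
lines I don't need
start value 123
epoch 1
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
epoch 1
epoch 72
stop value 456
EOF

result=$(awk '
# num_after(key): digits that follow "key " anywhere in the current line,
# found with POSIX match()/substr() instead of the GNU-only gensub().
function num_after(key) {
    if (!match($0, key " [0-9]+")) return ""
    return substr($0, RSTART + length(key) + 1, RLENGTH - length(key) - 1)
}
BEGIN { OFS = ","; print "start", "stop", "epoch", "run" }
/start value [0-9]/ { start = num_after("start value") }
/epoch [0-9]/       { epoch = num_after("epoch") }   # the last epoch before "stop" wins
/stop value [0-9]/  {
    print start, num_after("stop value"), epoch, ++run
    start = epoch = ""
}
' "$tmp")
rm -f "$tmp"
printf '%s\n' "$result"
```

Adding FILENAME as the first print argument restores the file column when real files are passed on the command line.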
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'

It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
print FILENAME, f["start"], f["stop"], f["epoch"], ++run
delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2

awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section is executed before any input file is read, and it can be used to initialize variables (uninitialized variables default to the empty string, which acts as 0 in a numeric context).
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = $NF # when a line contains "start value" store the last field as startValue
}
/epoch/ {
epoch = $NF
}
/stop value/ {
stopValue = $NF
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
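As a quick illustration of the field variables described above, here is a minimal sketch run on a single made-up line. It also shows why $NF only works while the number is the last word on the line.

```shell
# One hypothetical input line; awk splits it on whitespace into 4 fields.
result=$(printf 'words start value 345\n' |
    awk '{ print NF, $1, $NF, $(NF-1) }')
printf '%s\n' "$result"   # field count, first field, last field, second-to-last field
```

If more words follow the number (as in "start value 345 more words"), $NF would be "words" rather than 345, so the value has to be located by pattern instead.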

This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed, the first to format all but the file name and second to embed the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.

Related

Reverse complement SOME sequences in fasta file

I've been reading lots of helpful posts about reverse complementing sequences, but I've got what seems to be an unusual request. I'm working in bash and I have DNA sequences in fasta format in my stdout that I'd like to pass on down the pipe. The seemingly unusual bit is that I'm trying to reverse complement SOME of those sequences, so that the output has all the sequences in the same direction (for multiple sequence alignment later).
My fasta headers end in either "C" or "+". I'd like to reverse complement the ones that end in "C". Here's a little subset:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
I know there are lots of ways to reverse complement out there, like:
echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev
and
seqtk seq -r in.fa > out.fa
But I'm not sure how to do this for only those sequences that have a C at the end of the header. I think awk or sed is probably the ticket, but I'm at a loss as to how to actually code it. I can get the sequence headers with awk, like:
awk '/^>/ { print $0 }'
>chr1:84518073-84524089C
>chr1:86214203-86220231+
But if someone could help me figure out how to turn that awk statement into one that asks "if the last character in the header has a C, do this!" that would be great!
Edited to add:
I was so tired when I made this post, I apologize for not including my desired output. Here is what I'd like to output to look like, using my little example:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
You can see the sequence that ends in + is unchanged, but the sequence with a header that ends in C is reverse complemented.
Thanks!
An earlier answer (by Ed Morton) uses a self-contained awk procedure to selectively reverse-complement sequences following a comment line ending with "C". Although I think that to be the best approach, I will offer an alternative approach that might have wider applicability.
The procedure here uses awk's system() function to send data extracted from the fasta file in awk to the shell where the sequence can be processed by any of the many shell applications existing for sequence manipulation.
I have defined an awk user function to pass the isolated sequence from awk to the shell. It can be called from any part of the awk procedure:
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
The argument of the system function is a string containing the command you would type into terminal to achieve the desired outcome (in this case I've used one of the example reverse-complement routines mentioned in the question). The parts to note are the correct escaping of quote marks that are to appear in the shell command, and the variable s that will be substituted for the sequence string assigned to it when the function is called. The value of s is concatenated with the strings quoted before and after it in the argument to system() shown above.
isolating the required sequences
The rest of the procedure addresses how to achieve:
"if the last character in the header has a C, do this"
Before making use of shell applications, awk needs to isolate the part(s) of the file to process. In general terms, awk employs one or more pattern/action blocks where only records (lines by default) that match a given pattern are processed by the subsequent action commands. For example, the following illustrative procedure performs the action of printing the whole line (print $0) if the pattern /^>/ && /C$/ is true for that line, where /^>/ looks for ">" at the start of a line and /C$/ looks for "C" at the end of the same line:
/^>/ && /C$/{ print $0 }
For the current needs, the sequence begins on the next record (line) after any record beginning with > and ending with C. One way of referencing that next line is to set a variable (named line in my example) when the C line is encountered and establishing a later pattern for the record with numerical value one more than line variable.
Because fasta sequences may extend over several lines, we have to accumulate several successive lines following a C title line. I have achieved this by concatenating each line following the C title line until a record beginning with > is encountered again (or until the end of the file is reached, using the END block).
In order that sequence lines following a non-C title line are ignored, I have used a variable named flag with values of either "do" or "ignore" set when a title record is encountered.
The call to the custom function processSeq(), which employs the system() command, is made at the beginning of a C title action block if the variable seq holds an accumulated sequence (and in the END block for relevant sequences that occur at the end of the file, where there will be no following title line).
Test file and procedure
A modified version of your example fasta was used to test the procedure. It contains an extra relevant C record with three and-a-bit lines instead of two, and an extra irrelevant + record.
seq.fasta:
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
caccttagagataatgaagtatattcagaatgtagaacattctataagac
aactgacccaatatcttttaaaaagtcaatgccatgttaaaaataaaaag
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chranotherC
aatgaagtatattcagaatgtagaacattaactgacccgccatgttaatc
aatatctataagaccttttaaaaagcaccttagagattcaataaagtcag
gaagtatattcagaatgtagaacattaactgactaagaccttttaacatg
gcattgact
procedure
awk '
/^>/ && /C$/{
if (length(seq)>0) {processSeq(seq); seq="";}
line=NR; print $0; flag="do"; next;
}
/^>/ {line=NR; flag="ignore"}
NR>1 && NR==(line+1) && (flag=="do"){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0) processSeq(seq);}
' seq.fasta
output
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagttgtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
>chranotherC
agtcaatgccatgttaaaaggtcttagtcagttaatgttctacattctgaatatacttcctgactttattgaatctctaaggtgctttttaaaaggtcttatagatattgattaacatggcgggtcagttaatgttctacattctgaatatacttcatt
Tested using GNU Awk 5.1.0 on a Raspberry Pi 400.
performance note
Because calling system() creates a subshell, this process will be slower than a self-contained awk procedure. It might be useful where existing shell routines are available or tricky to reproduce with custom awk routines.
Edit: modification to include unaltered + records
This version has some repetition of earlier blocks, with minor changes, to handle printing of the lines that are not to be reverse-complemented (the changes should be self-explanatory if the main explanations were understood)
awk '
/^>/ && /C$/{
if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq="";line=NR; print $0; flag="do"; next;
}
/^>/ {if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq} seq=""; print $0; line=NR; flag="ignore"}
NR>1 && NR==(line+1){seq=seq $0; line=NR; next}
function processSeq(s)
{system("echo \"" s "\" | tr ACGTacgt TGCAtgca | rev ");}
END { if (length(seq)>0 && flag=="do") {processSeq(seq)} else {print seq}}
' seq.fasta
Using any awk:
$ cat tst.awk
/^>/ {
if ( NR > 1 ) {
prt()
}
head = $0
tail = ""
next
}
{ tail = ( tail == "" ? "" : tail ORS ) $0 }
END { prt() }
function prt( type) {
type = substr(head,length(head),1)
tail = ( type == "C" ? rev( tr( tail, "ACGTacgt TGCAtgca" ) ) : tail )
print head ORS tail
}
function tr(oldStr,trStr, i,lgth,char,newStr) {
if ( !_trSeen[trStr]++ ) {
lgth = (length(trStr) - 1) / 2
for ( i=1; i<=lgth; i++ ) {
_trMap[trStr,substr(trStr,i,1)] = substr(trStr,lgth+1+i,1)
}
}
lgth = length(oldStr)
for (i=1; i<=lgth; i++) {
char = substr(oldStr,i,1)
newStr = newStr ( (trStr,char) in _trMap ? _trMap[trStr,char] : char )
}
return newStr
}
function rev(oldStr, i,lgth,char,newStr) {
lgth = length(oldStr)
for ( i=1; i<=lgth; i++ ) {
char = substr(oldStr,i,1)
newStr = char newStr
}
return newStr
}
$ awk -f tst.awk file
>chr1:86214203-86220231+
CTGGTGGTACAGCTACATTGTACCATAAAACTTATTCATATTAAAACTTA
TTTATATGTACCTCAAAAGATTAAACTGGGAGATAAGGTGTGGCATTTTT
>chr1:84518073-84524089C
ctttttatttttaacatggcattgactttttaaaagatattgggtcagtt
gtcttatagaatgttctacattctgaatatacttcattatctctaaggtg
This might work for you (GNU sed):
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/ba;s/^/\n/;y/ACGTacgt/TGCAtgca/
:c;tc;/\n$/{s///p;bb};s/(.*)\n(.)/\2\1\n/;tc' file
Print the current line and then inspect it.
If the line does not begin with > and end with C, bail out and repeat.
Otherwise, fetch the next line and if it begins with >, repeat the above line.
Otherwise, insert a newline (to use as a pivot point when reversing the line), complement the code of the line using a translation command. Then set about reversing the line, character by character until the inserted newline makes its way to the end of the line.
Remove the newline, print the result and repeat the line above.
N.B. The n command will terminate the script when it is executed after the last line has been read.
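The complement-then-reverse step described above can be tried in isolation on a single sequence. This is a sketch that assumes GNU sed (for \n in the replacement text) and reuses the test sequence from the question's tr | rev example.

```shell
# Complement, insert a newline as a pivot, then move one character at a
# time from after the pivot to the front of the line until the pivot
# reaches the end; finally drop the pivot.
result=$(printf 'ACCTTGAAA\n' | sed -E '
y/ACGTacgt/TGCAtgca/
s/^/\n/
:c
s/(.*)\n(.)/\2\1\n/
tc
s/\n$//
')
printf '%s\n' "$result"
```

The result should match what echo ACCTTGAAA | tr ACGTacgt TGCAtgca | rev produces.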
Since the OP has amended the output, another solution is one where the whole of the sequence is complemented and then reversed. Here is another solution that I believe follows these criteria.
sed -nE ':a;p;/^>.*C$/!b
:b;n;/^>/!{H;$!bb};x;y/ACGTacgt\n/TGCAtgca%/;s/%/\n/
:c;tc;s/\n$//;td;s/(.*)\n(.)/\2\1\n/;tc
:d;y/%/\n/;p;z;x;$!ba' file

Using awk command to compare values on separate lines?

I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file, line-by-line and determine if:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
If this is true, I want to add the values from fields 1 and 3 from the line and then field 4 from the next line to a file. For context, after sorting, the input file looks like:
molecule gene start end strand
ERR2661861.3269 JN051170.1 11330 10778 minus
ERR2661861.3269 JN051170.1 11904 11348 minus
ERR2661861.3269 JN051170.1 12418 11916 minus
ERR2661861.3269 JN051170.1 13000 12469 minus
ERR2661861.3269 JN051170.1 13382 13932 plus
ERR2661861.3269 JN051170.1 13977 14480 plus
ERR2661861.3269 JN051170.1 14491 15054 plus
ERR2661861.3269 JN051170.1 15068 15624 plus
ERR2661861.3269 JN051170.1 15635 16181 plus
Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:
ERR2661861.3269 13000 13382
The script that I have thus far is:
# test input file
file=Eg2.1.txt.out
#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file
# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt
# for each line of the input file, do this
for i in $sorted_file
do
# check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
if [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
# if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
then
mol=awk '{if(NR==i) print $1}'
start=awk '{if(NR==i) print $3}'
end=awk '{if(NR==i+1) print $4}'
new_line=$mol$'\t'$start$'\t'$end
echo new_line >> Test_file.txt
fi
done
The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?
Many thanks in advance!
Explanation why your code does not work
For a better solution to your problem see karakfa's answer.
String comparison in bash needs spaces around [ and ]
Bash interprets your command ...
[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']
... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, therefore this should fail with an error message. Add a space after [ and before ].
awk wasn't executed
[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs use [ "$(awk)" = "$(awk)" ].
awk is missing the input file
awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file
awk '... NR==i ...' is not referencing the i from bash's for i in
awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case, see the next paragraph.
for i in $sorted_file is not iterating the file sorted_file
You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file content, not the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)). The redirection matters: wc -l sorted_file prints the file name after the count, which seq cannot parse.
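Putting the fixes above together, the original loop could look roughly like the sketch below. It is only meant to show the corrected syntax: the sample data and the helper name field are made up, and because it starts several awk processes per line it will be slow on large files compared with a single awk pass.

```shell
# Hypothetical sample data (tab-separated, already sorted); not the asker's file.
sorted_file=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    mol1 geneA 100 50 minus \
    mol1 geneA 200 150 minus \
    mol1 geneA 300 350 plus \
    mol1 geneA 400 450 plus > "$sorted_file"

# field N F: print field F of line N; -v passes the shell values into awk.
field() { awk -v n="$1" -v f="$2" 'NR == n { print $f }' "$sorted_file"; }

result=""
total=$(( $(wc -l < "$sorted_file") ))
for i in $(seq 1 $(( total - 1 ))); do
    j=$(( i + 1 ))
    # Spaces around [ and ], command substitution, and quoted comparisons.
    if [ "$(field "$i" 1)" = "$(field "$j" 1)" ] &&
       [ "$(field "$i" 5)" = "minus" ] &&
       [ "$(field "$j" 5)" = "plus" ]; then
        result="$(field "$i" 1) $(field "$i" 3) $(field "$j" 4)"
        printf '%s\n' "$result"
    fi
done
rm -f "$sorted_file"
```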
This will do the last step; it assumes the data is sorted on the key and that "minus" comes before "plus".
$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile
ERR2661861.3269 13000 13382
Note that awk has an implicit loop, so there is no need to force it to iterate externally.
The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of that line and then compare as soon as both lines have been read, like in this awk script.
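{
    molecule = $1
    strand = $5
    # previous line was "minus" and this one is "plus" for the same molecule
    if (molecule == last_molecule && last_strand == "minus" && strand == "plus")
        print $1, last_start, $4
    last_molecule = molecule
    last_strand = strand
    last_start = $3   # field 3 of this line, used when the next line is examined
}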
You essentially described a proto-program in your bullet points:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
You have everything needed to write a program in Perl, awk, ruby, etc.
Here is Perl version:
perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
$l0=$F[0]; $l4=$F[4];' sorted_file
The -lanE part enables auto split (like awk) and an implicit loop over the input lines, and runs the quoted text as the program;
The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points (Perl arrays are 0-based, so the first field is $F[0] and the fifth is $F[4]). The molecule names are strings, so they are compared with eq; a numeric == would treat every non-numeric name as 0 and so match any two molecules.
The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and Perl allow comparisons against variables that do not exist yet, which is why $l0 and $l4 can be used in a comparison on the first pass. In most other languages, such as Ruby, they need to be initialized first.)
Here is an awk version, same program essentially:
awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
{l1=$1;l5=$5}' sorted_file
Ruby version:
ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file
All three print:
ERR2661861.3269 JN051170.1 13382
My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then needed is the idiomatic details of each language.

Display column from empty column (fixed width and space delimited) in bash

I have a log file (in txt) with the following text
UNIT PHYS STATE LOCATION INFO
TCSM-1098 SE-NH -
ETPE-5-0 1403 SE-OU BCSU-1 ACTV FLTY
ETIP-6 1402 SE-NH -
They are delimited by spaces...
How can I get output like the below?
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
Thanks in advance
This is what I've tried so far
cat file.txt | awk 'BEGIN { FS = "[[:space:]][[:space:]]+" } {print $1,$2,$3,$4}' | sed 's/ /|/g'
It produces output like this
|UNIT|PHYS|STATE|LOCATION|INFO|
|TCSM-1098|SE-NH|-|
|ETPE-5-0|1403|SE-OU|BCSU-1|ACTV|FLTY
|ETIP-6|1402|SE-NH|-|
The columns aren't exactly what I hoped for
It seems it's not delimited but fixed-width format.
$ perl -ple '
$_ = join "|",
map {s/^\s+|\s+$//g;$_}
unpack ("a11 a5 a6 a22 a30",$_);
' <file.txt
how it works
-p switch : loop over input lines (default var: $_) and print it
-l switch : chomp line ending (\n) and add it to output
-e : inline command
unpack function : takes defined format and input line and returns an array
map function : apply block to each element of array: regex to remove leading and trailing spaces
join function : takes delimiter and array and gives string
$_ = : assigns the string to the default variable for output
Perl to the rescue!
perl -wE 'my @lengths;
$_ = <>;
push @lengths, length $1 while /(\S+\s*)/g;
$lengths[-1] = "*";
my $f;
say join "|",
map s/^\s+|\s+$//gr,
unpack "A" . join("A", @lengths), $_
while (!$f++ or $_ = <>);' -- infile
The format is not whitespace separated, it's a fixed-width.
The @lengths array will be populated by the widths of the columns taken from the first line of the input. The last column width is replaced with *, as its width can't be deduced from the header.
Then, an unpack template is created from the lengths that's used to parse the file.
$f is just a flag that makes it possible to apply the template to the header line itself.
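The same idea, deriving the column positions from the header line, can also be sketched in portable awk. Everything in this example is an assumption for illustration: the sample rows are regenerated with printf using made-up widths (11, 5, 6, 10) because the original spacing is not preserved here, and it only works if every header word starts exactly where its column starts.

```shell
# Build hypothetical fixed-width sample data (printf reuses the format per row).
result=$(printf '%-11s%-5s%-6s%-10s%s\n' \
    UNIT PHYS STATE LOCATION INFO \
    TCSM-1098 '' SE-NH '' - \
    ETPE-5-0 1403 SE-OU BCSU-1 'ACTV FLTY' |
awk '
function trim(s) { gsub(/^ +| +$/, "", s); return s }
NR == 1 {
    # Record the 1-based start column of every header word.
    rest = $0; pos = 1; n = 0
    while (match(rest, /[^ ]+ */)) {
        start[++n] = pos + RSTART - 1
        pos += RSTART + RLENGTH - 1
        rest = substr(rest, RSTART + RLENGTH)
    }
}
{
    line = ""
    for (i = 1; i <= n; i++) {
        # Every column ends where the next one starts; the last runs to end of line.
        field = (i < n) ? substr($0, start[i], start[i+1] - start[i]) : substr($0, start[i])
        line = line (i > 1 ? "|" : "") trim(field)
    }
    print line
}')
printf '%s\n' "$result"
```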
With GNU awk for FIELDWIDTHS to handle fixed-width fields:
awk -v FIELDWIDTHS='11 5 6 22 99' -v OFS='|' '{$1=$1; gsub(/ *\| */,"|"); sub(/ +$/,"")}1' file
UNIT|PHYS|STATE|LOCATION|INFO
TCSM-1098||SE-NH||-
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY
ETIP-6|1402|SE-NH||-
I think it's pretty clear and self-explanatory but let me know if you have any questions.
Manually, in awk:
$ awk 'BEGIN{n = split("11 5 6 23 99", cols); }
{s=0;
for (i = 1; i <= n; i++) {  # numeric loop: the order of "for (i in cols)" is unspecified
field = substr($0, s, cols[i]);
s += cols[i];
sub(/^ */, "", field);
sub(/ *$/, "", field);
printf "%s|", field;
};
printf "\n" } ' file
UNIT|PHYS|STATE|LOCATION|INFO|
TCSM-1098||SE-NH||-|
ETPE-5-0|1403|SE-OU|BCSU-1|ACTV FLTY|
ETIP-6|1402|SE-NH||-|
The widths of the columns are set in the BEGIN block, then for each line we take substrings of the line of the required length. s counts the starting position of the current column, the sub() calls remove leading and trailing spaces. The code as such prints a trailing | on each line, but that can be worked around by making the first or last column a special case.
Note that the last field is not like in your output, it's hard to tell where the split between ACTV and FLTY should be. Is that fixed width too, or is the space a separator there?

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to retrieve matching records. For example, match a number (7789) on a line starting with XYZ and, once matched, look for a second number (7345) in the lines below that start with 1, up to the line starting with 9. Then retrieve the entire matching line. How can I accomplish this using a shell script, awk, sed, or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disables automatic printing
/^XYZ.*7789/, # Match line starting with XYZ, and
# containing 7789
/^1.*7345/p # Print line starting with 1 and
# containing 7345, which is coming
# after the previous match
/^9$/ { } # Match line that is 9
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading lines between ^XYZ.*7789 and ^9$ into the pattern
space. And then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary) to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
an extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
set flag s when line starts with XYZ and contains 7789; reset when line is just 9, and print when flag is set and contains pattern 7345.
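The expected output in the question shows the XYZ header line as well as the matching record. A variation on the flag approach that also remembers the header might look like this sketch; the variable names rid, fid, head and shown are illustrative, and the input is a shortened copy of the sample data.

```shell
result=$(printf '%s\n' \
    XYZNA0000778800Z \
    16123000012300321000000008000000000000000 \
    17345000012300324000000004000000000000000 \
    9 \
    XYZNA0000778900Z \
    16123000012300321000000008000000000000000 \
    17345000012300324000000004000000000000000 \
    9 |
awk -v rid=7789 -v fid=7345 '
# Remember the header only when it contains rid; forget it at the "9" line.
/^XYZ/ { head = index($0, rid) ? $0 : ""; shown = 0; next }
/^9$/  { head = ""; next }
head != "" && /^1/ && index($0, fid) {
    if (!shown++) print head   # print the header once per block
    print
}')
printf '%s\n' "$result"
```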
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 will always follow the header,this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
if myfile(myline,2) ~= myfile(myline+1,2)
mynewfile(mynewline,:) = myfile(myline,:);
mynewline = mynewline+1;
myline = myline+1;
else
myline = myline+1;
end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for each line. BEGIN condition is executed before the first line. Each line is split to fields, which are accessible with $_number_. The full line is in $0.
Here I compare the second field to the previous value, if it does not match I print the whole previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, careful with the float comparisons - something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure if awk converts to number for equality testing, if not you're safe with the string comparisons.
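A tolerance-based comparison along those lines could be sketched as follows. The sample lines are shortened, made-up versions of the data in the question, and the tolerance eps=1e-9 is an arbitrary assumption to be chosen to suit the data.

```shell
result=$(printf '%s\n' \
    '9.975 1.49000000 0.295 0' \
    '9.975 1.4900 0.295 1' \
    '9.975 1.50000000 0.295 0' |
awk -v eps=1e-9 '
function abs(x) { return x < 0 ? -x : x }
# Print the previous line when column two changed by more than eps.
NR > 1 && abs($2 - prev) > eps { print line }
{ line = $0; prev = $2 }
')
printf '%s\n' "$result"
```

The NR > 1 guard avoids printing an empty line for the first record, and the subtraction forces a numeric comparison, so 1.49000000 and 1.4900 count as equal.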
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves previous line and second field, comparing in next loop with current line values. The && field check is useful to avoid a blank line at the beginning of file, when $2 != field would match because variable is empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
