extract each line followed by a line with a different value in column two - bash

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline + 1;
        myline = myline + 1;
    else
        myline = myline + 1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.

Awk should do.
<data awk '($2 != prev) {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition { code } blocks. It operates line by line. When no condition is given, the block is executed for every line; a BEGIN block is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on. The full line is in $0.
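For illustration, a throwaway example (with made-up input):
printf 'a 1\nb 2\n' | awk 'BEGIN { print "start" } $2 == 2 { print $1 }'
This prints start and then b: the BEGIN block runs once up front, and the { print $1 } action runs only for lines whose second field is 2.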
Here I compare the second field to the previous value; if they do not match, I print the previous line. In all cases I store the current line into line and the second field into prev.
And if you really want it right, be careful with the float comparisons: use something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). I'm actually not sure whether awk converts to numbers for equality testing; if not, you're safe with the string comparisons.
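A minimal sketch of that tolerance-based variant, with eps picked arbitrarily as 1e-9:
awk 'function abs(x) { return x < 0 ? -x : x }       # awk has no built-in abs
     NR > 1 && abs($2 - prev) > 1e-9 { print line }  # column two changed: print the previous line
     { line = $0; prev = $2 }' data
The NR > 1 guard also avoids the stray blank line that the original one-liner prints for the first record.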

This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
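The N;P;D trio keeps a sliding two-line window: N appends the next input line to the pattern space, P prints up to the first newline, and D deletes up to the first newline and restarts the cycle with what remains. As a standalone illustration, printf '1\n2\n3\n' | sed 'N;P;D' simply prints each line once as the window slides.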

Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them against the current line's values on the next cycle. The && field check is useful to avoid a blank line at the beginning of the output, when $2 != field would match because the variable is still empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1

Related

Print all lines between line containing a string and first blank line, starting with the line containing that string

I've tried awk:
awk -v RS="zuzu_mumu" '{print RS $0}' input_file > output_file
The output file is identical to input_file, except that the first line is now zuzu_mumu.
How can my command be corrected?
After solving this, I found the same string/pattern in another arrangement, so I also need to save all the records that match it to an output file, following this rule:
if the pattern matches on a line, look back through the previous lines and print the first line that follows an empty line, then print the matched line and an empty line.
record 1
record 2
This is record 3 first line
info 1
info 2
This is one matched zuzu_mumu line
info 3
info 4
info 5
record 4
record 5
...
This is record n-1 first line
info a
This is one matched zuzu_mumu line
info b
info c
record n
...
I should obtain:
This is record 3 first line
This is one matched zuzu_mumu line
This is record n-1 first line
This is one matched zuzu_mumu line
Print all lines between line containing a string and first blank line, starting with the line containing that string
I would use GNU AWK for this task. Let file.txt content be
Able
Baker
Charlie

Dog
Easy
Fox
then
awk 'index($0,"aker"){p=1}p{if(/^$/){exit};print}' file.txt
output
Baker
Charlie
Explanation: use the index string function, which gives either the position of aker inside the whole line ($0) or 0, and treat that as the condition, i.e. ask: is aker inside this line? Note that using index rather than a regular expression means we do not have to care about characters with special meaning, like for example the dot. If the condition holds, set p to 1. If p is set: if the line is empty (it matches start of line followed immediately by end of line), terminate processing (exit); otherwise print the whole line as is.
(tested in gawk 4.2.1)
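As an aside, the index-versus-regex point can be seen with a made-up contrast: the dot in a regex matches any character, while index looks for the literal string.
printf 'axb\na.b\n' | awk '/a.b/'             # prints both lines
printf 'axb\na.b\n' | awk 'index($0, "a.b")'  # prints only a.b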
If you don't want to match the same line again, you can record all lines in an array and print the valid lines in the END block.
awk '
f && /zuzu_mumu/ {                   # If already found and found again
    delete ary; entries = 0; next    # Clear the array, reset the counter and go to the next record
}
f || /zuzu_mumu/ {                   # If already found, or the line matches the word of interest
    if (/^[[:blank:]]*$/) { exit }   # If the line is blank (only spaces), stop
    f = 1                            # Mark as found
    ary[++entries] = $0              # Add the current line to the array and increment the counter
}
END {
    for (j = 1; j <= entries; j++)   # Loop and print the array values
        print ary[j]
}
' file

(sed/awk) extract values text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed, save the values, and then use a python script to combine the data from the different files into a single csv file. However, this is quite slow and I want to speed it up.
The file let's call it my_file_1.txt has a structure that looks something like this
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct, as the N will gobble up some "start value" lines, leading to wrong results. It feels like this could be done more easily with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. I should have specified the awk version for this, as gensub is a GNU awk extension and is not available in all versions.
BEGIN {
    counter = 1
    OFS = ","  # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"  # Print the header line
}
/start value/ {
    startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
    epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
    stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""  # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
I accepted this answer because it is the most understandable.
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
print FILENAME, f["start"], f["stop"], f["epoch"], ++run
delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
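For example, an illustrative one-liner:
echo 'alpha beta gamma' | awk '{ print NF, $1, $NF }'
prints 3 alpha gamma.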
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
    counter = 1
    OFS = ","  # This is the output field separator used by the print statement
    print "file", "start", "stop", "epoch", "run"  # Print the header line
}
/start value/ {
    startValue = $NF  # when a line contains "start value" store the last field as startValue
}
/epoch/ {
    epoch = $NF
}
/stop value/ {
    stopValue = $NF
    # we have everything to print our line
    print FILENAME, startValue, stopValue, epoch, counter
    counter = counter + 1
    startValue = ""  # clear variables so they aren't maintained through the next iteration
    epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed: the first formats everything but the file name, and the second embeds the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record: the first is the file name and the second is the remainder of the csv line.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.
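The run counter relies on GNU sed's e flag to the s command, which executes the pattern space as a shell command and replaces it with the command's output; as a standalone illustration:
echo 5 | sed 's/.*/expr & + 1/e'    # prints 6
(F, also GNU-specific, prints the current input file name.)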

Using awk command to compare values on separate lines?

I am trying to build a bash script that uses the awk command to go through a sorted tab-separated file, line-by-line and determine if:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
If this is true, I want to add the values from fields 1 and 3 from the line and then field 4 from the next line to a file. For context, after sorting, the input file looks like:
molecule gene start end strand
ERR2661861.3269 JN051170.1 11330 10778 minus
ERR2661861.3269 JN051170.1 11904 11348 minus
ERR2661861.3269 JN051170.1 12418 11916 minus
ERR2661861.3269 JN051170.1 13000 12469 minus
ERR2661861.3269 JN051170.1 13382 13932 plus
ERR2661861.3269 JN051170.1 13977 14480 plus
ERR2661861.3269 JN051170.1 14491 15054 plus
ERR2661861.3269 JN051170.1 15068 15624 plus
ERR2661861.3269 JN051170.1 15635 16181 plus
Thus, in this example, the script should find the statement true when comparing lines 4 and 5 and append the following line to a file:
ERR2661861.3269 13000 13382
The script that I have thus far is:
# test input file
file=Eg2.1.txt.out
#sort the file by 'molecule' field, then 'start' field
sort -k1,1 -k3n $file > sorted_file
# create output file and add 'molecule' 'start' and 'end' headers
echo molecule$'\t'start$'\t'end >> Test_file.txt
# for each line of the input file, do this
for i in $sorted_file
do
# check to see if field 1 on current line is the same as field 1 on next line AND if field 5 on current line is "minus" AND if field 5 on next line is "plus"
if [awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}'] && [awk '{if(NR==i) print $5}' == "minus"] && [awk '{if(NR==i+1) print $5}' == "plus"];
# if this is true, then get the 1st and 3rd fields from current line and 4th field from next line and add this to the output file
then
mol=awk '{if(NR==i) print $1}'
start=awk '{if(NR==i) print $3}'
end=awk '{if(NR==i+1) print $4}'
new_line=$mol$'\t'$start$'\t'$end
echo new_line >> Test_file.txt
fi
done
The first part of the bash script works as I want it but the for loop does not seem to find any hits in the sorted file. Does anyone have any insights or suggestions for why this might not be working as intended?
Many thanks in advance!
Explanation why your code does not work
For a better solution to your problem see karakfa's answer.
String comparison in bash needs spaces around [ and ]
Bash interprets your command ...
[awk '{if(NR==i) print $1}' == awk '{if(NR==i+1) print $1}']
... as the command [awk with the arguments {if(NR..., ==, awk, and {if(NR...]. On your average system there is no command named [awk, therefore this should fail with an error message. Add a space after [ and before ].
awk wasn't executed
[ awk = awk ] just compares the literal string awk. To execute the commands and compare their outputs use [ "$(awk)" = "$(awk)" ].
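For instance, a trivial demonstration of comparing command outputs:
[ "$(echo a)" = "$(echo a)" ] && echo same    # prints same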
awk is missing the input file
awk '{...}' tries to read input from stdin (the user, in your case). Since you want to read the file, add it as an argument: awk '{...}' sorted_file
awk '... NR==i ...' is not referencing the i from bash's for i in
awk does not know about your bash variable. When you write i in your awk script, that i will always have the default value 0. To pass a variable from bash to awk use awk -v i="$i" .... Also, it seems like you assumed for i in would iterate over the line numbers of your file. Right now, this is not the case, see the next paragraph.
for i in $sorted_file is not iterating the file sorted_file
You called your file sorted_file. But when you write $sorted_file you reference a variable that wasn't declared before. Undeclared variables expand to the empty string, therefore you iterate nothing.
You probably wanted to write for i in $(cat sorted_file), but that would iterate over the file content, not the line numbers. Also, the unquoted $() can cause unforeseen problems depending on the file content. To iterate over the line numbers, use for i in $(seq $(wc -l < sorted_file)) (note the <; wc -l sorted_file would print the file name after the count, confusing seq).
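Putting those fixes together, a literal repair of the loop might look like this sketch. It is still very slow, since it rereads the file several times per line; the pure-awk answers below avoid that entirely.
for i in $(seq "$(wc -l < sorted_file)"); do
    mol=$(awk -v i="$i" 'NR == i { print $1 }' sorted_file)
    strand=$(awk -v i="$i" 'NR == i { print $5 }' sorted_file)
    nmol=$(awk -v i="$i" 'NR == i + 1 { print $1 }' sorted_file)
    nstrand=$(awk -v i="$i" 'NR == i + 1 { print $5 }' sorted_file)
    if [ "$mol" = "$nmol" ] && [ "$strand" = "minus" ] && [ "$nstrand" = "plus" ]; then
        start=$(awk -v i="$i" 'NR == i { print $3 }' sorted_file)
        end=$(awk -v i="$i" 'NR == i + 1 { print $4 }' sorted_file)
        printf '%s\t%s\t%s\n' "$mol" "$start" "$end" >> Test_file.txt
    fi
done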
this will do the last step, assumes data is sorted in the key and "minus" comes before "plus".
$ awk 'NR==1{next} $1==p && f && $NF=="plus"{print p,v,$3} {p=$1; v=$3; f=$NF=="minus"}' sortedfile
ERR2661861.3269 13000 13382
Note that awk has an implicit loop; there is no need to force it to iterate externally.
The best thing to do when comparing adjacent lines in a stream using awk, or any other program for that matter, is to store the relevant data of each line and then compare as soon as both lines have been read, like in this awk script.
{
    molecule = $1
    strand = $5
    if (molecule == last_molecule)
        if (last_strand == "minus")
            if (strand == "plus")
                print $1, end, $4
    last_molecule = molecule
    last_strand = strand
    end = $3
}
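Saved as, say, compare.awk (the name is arbitrary), it runs as:
awk -f compare.awk sorted_file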
You essentially described a proto-program in your bullet points:
the field 1 (molecule) of the line is the same as in the next line,
field 5 (strand) of the line is the string "minus", and
field 5 of the next line is the string "plus".
You have everything needed to write a program in Perl, awk, ruby, etc.
Here is a Perl version:
perl -lanE 'if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") {say join("\t", @F[0..2])}
$l0=$F[0]; $l4=$F[4];' sorted_file
The -lanE switches enable auto-split (like awk) and an implicit input loop, and compile the text as a program;
The if ($l0 eq $F[0] && $l4 eq "minus" && $F[4] eq "plus") tests your three bullet points (Perl arrays are 0-based, so the first field is $F[0] and the fifth is $F[4]; eq is Perl's string comparison, appropriate here since field 1 is not numeric)
The $l0=$F[0]; $l4=$F[4]; saves the current values of fields 1 and 5 to compare on the next pass through the loop. (Both awk and Perl allow comparisons against variables that do not exist yet; hence $l0 and $l4 can be used in a comparison before they exist on the first pass through the loop. In most other languages, such as Ruby, they need to be initialized first...)
Here is an awk version, same program essentially:
awk '($1==l1 && l5=="minus" && $5=="plus"){print $1 "\t" $2 "\t" $3}
{l1=$1;l5=$5}' sorted_file
Ruby version:
ruby -lane 'BEGIN{l0=l4=""}
puts $F[0..2].join("\t") if (l0==$F[0] && l4=="minus" && $F[4]=="plus")
l0=$F[0]; l4=$F[4]
' sorted_file
All three print:
ERR2661861.3269 JN051170.1 13382
My point is that you very effectively understood and stated the problem you were trying to solve. That is 80% of solving it! All you then need are the idiomatic details of each language.

Fill lines with a certain character until it contains a given amount of them

I have various flat files containing roughly 30k records that I need to reformat with a script to ensure they all have the same number of ~'s.
For example, below are two of the sample records in the flat file; the first record contains 8 ~'s and the second contains 10 ~'s.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~
I need both records to contain 12 ~'s, so I need code that will loop through the file and pad out each line to contain the correct number of ~'s. The desired result would be as follows.
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
I have the following bit of code which will display the number of ~'s on each line, but I'm not sure how to proceed from here.
sed 's/[^~]//g' inputfile.text| awk '{ print length }'
Set the field separator to ~ and keep adding ~s until you have enough:
$ awk -F"~" -v cols=12 'NF<=cols{for (i=NF;i<=cols;i++) $0=$0 FS}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~
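(The trailing 1 is an always-true condition whose default action is { print }; for example, seq 3 | awk '1' just echoes its input.)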
Another way is to rebuild each line field by field; referencing a field past NF just yields an empty string, so this appends the missing separators:
awk -v FS='~' -v count=12 '{ line = ""; for (i = 1; i <= count; i++) line = line $i FS; print line }' tildes.txt
Here's a bash builtin-only solution:
a='~~~~~~~~~~~~'    # twelve tildes
while read -r line; do
    n=${line//[!~]}
    echo "$line${a/$n}"
done < file
Explanation
n=${line//[!~]} makes a string from all the ~ characters in $line.
${a/$n} takes the string of 12 ~ characters and deletes the first match of $n, so when we append it to $line, there will be exactly 12 tildes in the echoed line.
Warning: if a line in file contains more than 12 tildes, the corresponding output line will have more than 24 tildes in it, as there are no checks to make sure ${#n} is less than ${#a}.
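A guarded variant (just a sketch) that leaves such over-long lines untouched instead:
a='~~~~~~~~~~~~'
while read -r line; do
    n=${line//[!~]}
    if (( ${#n} < ${#a} )); then
        echo "$line${a/$n}"
    else
        echo "$line"
    fi
done < file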
A terser awk trick: assigning field c+1 to itself forces awk to rebuild the record, padding the line with empty fields (and hence separators) up to that position:
$ awk -F"~" -v c=12 'BEGIN{OFS=FS;c++}{$c=$c}1' file
736~company 1~cp1~1~19~~08/07/1878~09/12/2015~~~~~
658~company 2~cp2~1~19~65.12~27/06/1868~22/08/2015~address line 1~address line 2~~~

How to combine two lines that share the same keyword?

Let's say I have a file looking somewhat like this:
X NeedThis1 KEYWORD
.
.
NeedThis2 X KEYWORD
And I need to combine the two lines into one like this:
NeedThis2 NeedThis1 KEYWORD
This needs to be done for every pair of lines in the file that contain the same KEYWORD, but it must not combine two lines that have the X in the same (first or second) position, like these:
X NeedThis1 KEYWORD
X NeedThis2 KEYWORD
I consider myself a bash noob, so any advice on whether this can be done with something like awk or sed would be appreciated.
awk '
{if ($1 == "X") end[$3] = $2; else start[$3] = $1}    # index both halves by the keyword in field 3
END {for (kw in start) if (kw in end) print start[kw], end[kw], kw}    # print only keywords seen in both forms
' file
Try this:
awk '
$1=="X" {key = $NF; value = $2; next}
$2=="X" && $NF==key {print value, $1, key}' file
Explanation:
When a line where first field is X, store the last field as key and second field as value.
Look for the next line where the second field is X and the last field matches the key stored by the previous action.
When found, print the value from that stored line along with the first field of the current line and the key.
This will most definitely break if your data does not match the sample you have shown (if it has more spaces or fields in between), so feel free to adjust as per your needs.
I won't give you the full answer, but if you have some way to identify "KEYWORD" (not in your problem statement), then use a BASH associative array:
declare -A keys
while IFS= read -u3 -r line
do
    set -- $line
    eval keyword=\$$#
    keys[$keyword]+=${line%$keyword}
done 3< file
you'll certainly have to do some more fiddling, but your problem statement is incomplete and some of the work needs to be an exercise for the reader.
