Use sed or awk to remove lines not adjacent to pattern in bash - bash

I need to delete any line that is not surrounding a ">" symbol.
Here is some sample data:
sample1.fasta
>R00003
ATCATACTACTACG
sample2.fasta
sample3.fasta
sample4.fasta
>R00003
ATACTACGTA
sample7.fasta
>R00003
ATGCATCAT
sample8.fasta
>R00003
AATCATCGACCT
sample9.fasta
sample10.fasta
>R00003
AGCATCTCAGTC
I tried using awk to help reveal the issue:
awk '{/fasta/?f++:f=0} f==2' R3.fasta
This returns:
sample3.fasta
sample10.fasta
This is diagnostic in that it shows where duplicates are present. However, I want to remove the lines that are not flanking the ">" symbol on either side. This does not remove them and only displays the second.
The result I expect is:
sample1.fasta
>R00003
ATCATACTACTACG
sample4.fasta
>R00003
ATACTACGTA
sample7.fasta
>R00003
ATGCATCAT
sample8.fasta
>R00003
AATCATCGACCT
sample10.fasta
>R0003
AGCATCTCAGTC
Where the lines not flaking the ">" symbol have been removed

Seems like the plain grep will suffice for this:
grep '^>' -C1 file | grep -v ^--$
First print one line above and one line below each line starting with > (use context -C1), then just filter out the -- lines that grep inserts to separate each context.
But if you prefer awk:
awk '/^>/{print a ORS $0; getline; print} {a=$0}' file
Keep the previous line in a, and when a line starts with >, print the previous line (in a), the current line, and the next line (which we get with getline).

From the example, it appears that you want all non-fasta lines and, for the fasta lines, you want only the last fasta file before the next >. In that case, try:
$ awk 'f && !/fasta/{print f; f=""} /fasta/{f=$0; next} END{if(f)print f} 1' R3.fasta
sample1.fasta
>R00003
ATCATACTACTACG
sample4.fasta
>R00003
ATACTACGTA
sample7.fasta
>R00003
ATGCATCAT
sample8.fasta
>R00003
AATCATCGACCT
sample10.fasta
>R0003
AGCATCTCAGTC
How it works
f && !/fasta/{print f; f=""}
If the variable f is set and the current line does not contain fasta, then print f and erase its current value.
/fasta/{f=$0; next}
If the current line contains fasta, then save the current line to variable f, slip the rest of the commands, and jump to the next line.
END{if(f)print f}
If f is still set to something after we have reached the end of the file, then print it.
1
For all other lines, print them.

another awk
$ awk '{if (/fasta/) f=$0;
else {if(f) print f; f=""; print}}' file
doesn't depend on number of lines in between.

With sed
sed ':A;/fasta$/!d;N;/\n>/!{s/.*\n//;bA};N' infile
sed '
:A # label for jump
/fasta$/!d # if the line end with fasta not delete
N # add the next line in the pattern space
/\n>/!{ # if this new line don'\''t start with >
s/.*\n// # delete it
bA} # and jump to A
N # else get the next line and print
' infile

Related

Remove newline if preceded with specific character on non-consecutive lines

I have a text file with every other line ending with a % character. I want to find the pattern "% + newline" and replace it with "%". In other words, I want to delete the newline character right after the % and not the other newline characters.
For example, I want to change the following:
abcabcabcabc%
123456789123
abcabcabcabc%
123456789123
to
abcabcabcabc%123456789123
abcabcabcabc%123456789123
I've tried the following sed command, to no avail.
sed 's/%\n/%/g' < input.txt > output.txt
By default sed can't remove newlines because it reads one newline-separated line at a time.
With any awk in any shell on every UNIX box for any number of lines ending in %, consecutive or not:
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
abcabcabcabc%123456789123
abcabcabcabc%123456789123
and with consecutive % lines:
$ cat file
now is the%
winter of%
our%
discontent
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
now is the%winter of%our%discontent
Your data sample imply that there are no several consecutive lines ending with %.
In that case, you may use
sed '/%$/{N;s/\n//}' file.txt > output.txt
It works as follows:
/%$/ - finds all lines ending with %
{N;s/\n//} - a block:
N - adds a newline to the pattern space, then appends the next line of input to the pattern space
s/\n// - removes a newline in the current pattern space.
See the online sed demo.
In portable sed that supports any number of continued lines:
parse.sed
:a # A goto label named 'a'
/%$/ { # When the last line ends in '%'
N # Append the next line
s/\n// # Remove new-line
ta # If new-line was replaced goto label 'a'
}
Run it like this:
sed -f parse.sed infile
Output when infile contains your input and the input from Ed Morton's answer:
abcabcabcabc%123456789123
abcabcabcabc%123456789123
now is the%winter of%our%discontent

Delete range of lines when line number of known or not in unix using head and tail?

This is my sample file.
I want to do this.
I have fixed requirement to delete 2nd and 3rd line keeping the 1st line.
From the bottom, I want to delete above 2 lines excluding last line, as I wouldn't know what my last line number is as it depends on file.
Once I delete my 2nd and 3rd line 4th line should ideally come at 2nd and so on, same for a bottom after delete.
I want to use head/tail command and modify the existing file only. as Changes to write back to the same file.
Sample file text format.
Input File
> This is First Line
> Delete Delete Delete This Line
> Delete Delete Delete This Line
> ..
> ..
> ..
> ..
> Delete Delete Delete This Line
> Delete Delete Delete This Line
> This is Last Line, should not be deleted It could be come at any line
number (variable)
Output file (same file modified)
This is First Line
..
..
..
..
This is Last Line, should not be deleted It could be come at any line number (variable)
Edit - Because of compatibility issues on Unix (Using HP Unix on ksh shell) I want to implement this using head/tail/awk. not sed.
Adding solution as per OP's request to make it genuine solution.
Approach: In this solution OP could provide lines from starting point and from ending point of any Input_file and those lines will be skipped.
What code will do: I have written code in that way it will generate an awk code as per your given lines to be skipped then and will run it too.
cat print_lines.ksh
start_line="2,3"
end_line="2,3"
total_lines=$(wc -l<Input_file)
awk -v len="$total_lines" -v OFS="||" -v s1="'" -v start="$start_line" -v end="$end_line" -v lines=$(wc -l <Input_file) '
BEGIN{
num_start=split(start, a,",");
num_end=split(end, b,",");
for(i=1;i<=num_start;i++){
val=val?val OFS "FNR=="a[i]:"FNR=="a[i]};
for(j=1;j<=num_end;j++){
b[j]=b[j]>1?len-(b[j]-1):b[j];
val=val?val OFS "FNR=="b[j]:"FNR=="b[j]};
print "awk " s1 val "{next} 1" s1" Input_file"}
' | sh
Change Input_file name to your actual file name and let me know how it goes then.
Following awk may help you in same(Since I don't have Hp system so didn't test it).
awk -v lines=$(wc -l <Input_file) 'FNR==2 || FNR==3 || FNR==(lines-1) || FNR==(lines-2){next} 1' Input_file
EDIT: Adding non-one liner form of solution too now.
awk -v lines=$(wc -l <Input_file) '
FNR==2 || FNR==3 || FNR==(lines-1) || FNR==(lines-2){
next}
1
' Input_file
wc + sed solution:
len=$(wc -l inpfile | cut -d' ' -f1)
sed "$(echo "$((len-2)),$((len-1))")d; 2,3d" inpfile > tmp_f && mv tmp_f inpfile
$ cat inputfile
> This is First Line
> ..
> ..
> ..
> ..
> This is Last Line, should not be deleted It could be come at any line
Perl suggestion... read whole file into array #L, get index of last line. Delete 2nd last, 3rd last, 3rd and 2nd line. Print what's left.
perl -e '#L=<>; $p=$#L; delete $L[$p-1]; delete $L[$p-2]; delete $L[2]; delete $L[1]; print #L' file.txt
Or, maybe a little more succinctly with splice:
perl -e '#L=<>; splice #L,1,2; splice #L,$#L-2,2; print #L' file.txt
If you wish to have some flexibility a ksh script approach may work, though little expensive in terms of resources :
#!/bin/ksh
[ -f "$1" ] || echo "Input is not a file" || exit 1
total=$(wc -l "$1" | cut -d' ' -f1 )
echo "How many lines to delete at the end?"
read no
[ -z "$no" ] && echo "Not sure how many lines to delete, aborting" && exit 1
sed "2,3d;$((total-no)),$((total-1))d" "$1" >tempfile && mv tempfile "$1"
And feed the file as argument to the script.
Notes
This deletes second and third lines.
Plus no number of lines from last excluding last as read from user.
Note: My ksh version is 93u+ 2012-08-01
awk '{printf "%d\t%s\n", NR, $0}' < file | sed '2,3d;N;$!P;D' file
The awk here serves the purpose of providing line numbers and then passing the output to the sed which uses the line numbers to do the required operations.
%d : Used to print the numbers. You can also use '%i'
'\t' : used to place a tab between the number and string
%s : to print the string of charaters
'\n' : To create a new line
NR : to print lines numbers starting from 1
For sed
N: Read/append the next line of input into the pattern space.
$! : is for not deleting the last line
D : This is used when pattern space contains no new lines normal and start a new cycle as if the d command was issued. Otherwise, delete text in the pattern space up to the specified lines, and restart cycle with the resultant pattern space, without reading a new line of input.
P : Print up to the first embedded newline of the current pattern space.This
prints the lines after removing the subjected lines.
I enjoyed this task and wrote awk script for more scaleable case (huge files).
Reading/scanning the input file once (no need to know line count), not storing the whole file in memory.
script.awk
BEGIN { range = 3} # define sliding window range
{lines[NR] = $0} # capture each line in array
NR == 1 {print} # print 1st line
NR > range * 2{ # for lines in sliding window range bottom
print lines[NR - range]; # print sliding window top line
delete lines[NR - range]; # delete sliding window top line
}
END {print} # print last line
running:
awk -f script.awk input.txt
input.txt
line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8
line 9
line 10
output:
line 1
line 4
line 5
line 6
line 7
line 10

Append and replace using awk/sed

I have this file:
2016,05,P,0002 ,CJGLOPSD8
00,BBF,BBDFTP999,051000100,GBP, , -2705248.00
00,BBF,BBDFTP999,059999998,GBP, , -3479679.38
00,BBF,BBDFTP999,061505141,GBP, , -0.40
00,BBF,BBDFTP999,061505142,GBP, , 6207621.00
00,BBF,BBDFTP999,061505405,GBP, , -0.16
00,BBF,BBDFTP999,061552000,GBP, , -0.24
00,BBF,BBDFTP999,061559010,GBP, , -0.44
00,BBF,BBDFTP999,062108021,GBP, , -0.34
00,BBF,BBDFTP999,063502007,GBP, , -0.28
I want to programmatically (in unix, or informatica if possible) grab the first two fields in the top row, concatenate them, append them to the end of each line and remove that first row.
Like so:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This is my current attempt:
awk -vvar1=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f1` -vvar2=`cat OF\ OPSDOWN8.CSV | head -1 | cut -d',' -f2` 'BEGIN {FS=OFS=","} {print $0, var 1var2}' OF\ OPSDOWN8.CSV> OF_OPSDOWN8.csv
Any pointers? I've tried looking around the forum but can only find answers to part of my question.
Thanks for your help.
Use this awk:
awk 'BEGIN{FS=OFS=","} NR==1{val=$1$2;next} {gsub(/ */,"");print $0,val}' file
Explanation:
BEGIN{FS=OFS=","} - This block will set FS (Field Separator) and OFS (Output Field Separator) as ,.
NR==1 - Working with line number 1. Here, $1 and $2 denotes field number.
print $0,val - Printing $0 (whole line) and stored value from val.
I would use the following awk command:
awk 'NR==1{d=$1$2;next}{$(NF+1)=d;gsub(/[[:space:]]/,"")}1' FS=, OFS=, file
Explanation:
NR==1{d=$1$2;next} applies on line 1 and set's a variable d(ate) to the value of the first and the second field. The variable is being used when processing the remaining lines. next tells awk to go ahead with the next line right away without processing further instructions on this line.
{$(NF+1)=d;gsub(/[[:space:]]/,"")}1 appends a new field to the line (NF is the number of fields, assigning d to $(NF+1) effectively adds a field. gsub() is used to removing spaces. 1 at the end always evaluates to true and makes awk print the modified line.
FS=, is a command line argument. It set's the input field delimiter to ,.
OFS=, is a command line argument. It set's the output field delimiter to ,.
Output:
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
With sed :
sed '1{s/\([^,]*\),\([^,]*\),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
in ERE mode :
sed -r '1{s/([^,]*),([^,]*),.*/\1\2/;h;d};/.*/G;s/\n/,/;s/ //g' file
Output :
00,BBF,BBDFTP999,051000100,GBP,,-2705248.00,201605
00,BBF,BBDFTP999,059999998,GBP,,-3479679.38,201605
00,BBF,BBDFTP999,061505141,GBP,,-0.40,201605
00,BBF,BBDFTP999,061505142,GBP,,6207621.00,201605
00,BBF,BBDFTP999,061505405,GBP,,-0.16,201605
00,BBF,BBDFTP999,061552000,GBP,,-0.24,201605
00,BBF,BBDFTP999,061559010,GBP,,-0.44,201605
00,BBF,BBDFTP999,062108021,GBP,,-0.34,201605
00,BBF,BBDFTP999,063502007,GBP,,-0.28,201605
This might work for you (GNU sed):
sed '1s/,//;1s/,.*//;1h;1d;s/ //g;G;s/\n/,/' file
For the first line only: remove the first comma, remove from the next comma to the end of the line, store the amended line in the hold space (HS) and then delete the current line (the d abruptly ends processing). For subsequent lines: remove all spaces, append the HS and replace the newline (from the G command) with a comma.
Or if you prefer:
sed '1{s/,//;s/,.*//;h;d};s/ //g;G;s/\n/,/' file
If you want to use Informatica for this, use two Source Qualifiers. Read the file twice - just one line in one SQ (filter out the rest) and in the second SQ read the whole file except the first line (skip header). Join the two on dummy port and you're done.

How to replace the empty place with next line content in shell script

1,n1,abcd,1234
2,n2,abrt,5666
,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
,k1,yyyy,5234
4,22,yyyy,5234
the above given is my input file abc.txt , all I want the missing first column value should fill with next row first value.
example:
3,h2,yyyy,123x
3,h2,yyyy,123y
I want output like below,
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x// the missing first column value 3 should fill with second row first value
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
How to implement this with help of AWK or some other alternate in shell script,please help.
Using awk you can do:
awk -F, '$1 ~ /^ *$/ {
p=p RS $0
next
}
p!="" {
gsub(RS " +", RS $1, p)
sub("^" RS, "", p)
print p
p=""
} 1' file
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
I would reverse the file, and then replace the value from the previous line:
tac filename | awk -F, '$1 ~ /^[[:blank:]]*$/ {$1 = prev} {print; prev=$1}' | tac
This will also fill in missing values on multiple lines.
With GNU sed:
$ sed '/^ ,/{N;s/ \(.*\n\)\([^,]*\)\(.*\)/\2\1\2\3/}' infile
1,n1,abcd,1234
2,n2,abrt,5666
3,h2,yyyy,123x
3,h2,yyyy,123y
3,h2,yyyy,1234
4,k1,yyyy,5234
4,22,yyyy,5234
The sed command does the following:
/^ ,/ { # If the line starts with 'space comma'
N # Append the next line
# Extract the value before the comma, prepend to first line
s/ \(.*\n\)\([^,]*\)\(.*\)/\2\1\2\3/
}
BSD sed would require an extra semicolon before the closing brace.
This only works with non-contiguous lines with missing values.

How to print the line number where a string appears in a file?

I have a specific word, and I would like to find out what line number in my file that word appears on.
This is happening in a c shell script.
I've been trying to play around with awk to find the line number, but so far I haven't been able to. I want to assign that line number to a variable as well.
Using grep
To look for word in file and print the line number, use the -n option to grep:
grep -n 'word' file
This prints both the line number and the line on which it matches.
Using awk
This will print the number of line on which the word word appears in the file:
awk '/word/{print NR}' file
This will print both the line number and the line on which word appears:
awk '/word/{print NR, $0}' file
You can replace word with any regular expression that you like.
How it works:
/word/
This selects lines containing word.
{print NR}
For the selected lines, this prints the line number (NR means Number of the Record). You can change this to print any information that you are interested in. Thus, {print NR, $0} would print the line number followed by the line itself, $0.
Assigning the line number to a variable
Use command substitution:
n=$(awk '/word/{print NR}' file)
Using shell variables as the pattern
Suppose that the regex that we are looking for is in the shell variable url:
awk -v x="$url" '$0~x {print NR}' file
And:
n=$(awk -v x="$url" '$0~x {print NR}' file)
Sed
You can use the sed command
sed -n '/pattern/=' file
Explanation
The -n suppresses normal output so it doesn't print the actual lines. It first matches the /pattern/, and then the = operator means print the line number. Note that this will print all lines that contains the pattern.
Use the NR Variable
Given a file containing:
foo
bar
baz
use the built-in NR variable to find the line number. For example:
$ awk '/bar/ { print NR }' /tmp/foo
2
find the line number for which the first column match RRBS
awk 'i++ {if($1~/RRBS/) print i}' ../../bak/bak.db

Resources