remove only *some* fullstops from a csv file - bash

If I have lines like the following:
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
how can I replace all instances of ,., with ,?,
I want to preserve actual decimal places in the numbers so I can't just do
sed 's/./?/g' file
however when doing:
sed 's/,.,/,?,/g' file
this only appears to work in some cases. i.e. there are still instances of ,., hanging around.
anyone have any pointers?
Thanks

This should work :
sed ':a;s/,\.,/,?,/g;ta' file
With successive ,., strings, after a substitution succeeded, next character to be processed will be the following . that doesn't match the pattern, so with you need a second pass.
:a is a label for upcoming loop
,\., will match dot between commas. Note that the dot must be escaped because . is for matching any character (,a, would match with ,.,).
g is for general substitution
ta tests previous substitution and if it succeeded, loops to :a label for remaining substitutions.

Using sed it is possible by running a loop as shown in above answer however problem is easily solved using perl command line with lookarounds:
perl -pe 's/(?<=,)\.(?=,)/?/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This command doesn't need a loop because instead of matching surrounding commas we're just asserting their position using a lookbehind and lookahead.

All that's necessary is a single substitution
$ perl -pe 's/,\.(?=,)/,?/g' dots.csv
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252

You have an example using sed style regular expressions. I'll offer an alternative - parse the CSV, and then treat each thing as a 'field':
#!/usr/bin/perl
use strict;
use warnings;
#iterate input row by row
while ( <DATA> ) {
#remove linefeeds
chomp;
#split this row on ,
my #row = split /,/;
#iterate each field
foreach my $field ( #row ) {
#replace this field with "?" if it's "."
$field = "?" if $field eq ".";
}
#stick this row together again.
print join ",", #row,"\n";
}
__DATA__
1,987372,987372,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,.,1.293,12.23,0.989,0.973,D,.,.,.,.,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,.,.,.,.,.,.,.,.,1,D,.,.,.,.,.,.,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
This is more verbose than it needs to be, to illustrate the concept. This could be reduced down to:
perl -F, -lane 'print join ",", map { $_ eq "." ? "?" : $_ } #F'
If your CSV also has quoting, then you can break out the Text::CSV module, which handles that neatly.

You just need 2 passes since the trailing , found on a ,., match isn't available to match the leading , on the next ,.,:
$ sed 's/,\.,/,?,/g; s/,\.,/,?,/g' file
1,987372,987372,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,?,1.293,12.23,0.989,0.973,D,?,?,?,?,0.253,0,4.08,0.917,1.048,1.000,1.000,12.998
1,987393,987393,C,T,?,?,?,?,?,?,?,?,1,D,?,?,?,?,?,?,0.152,1.980,16.09,0.999,0.982,D,-0.493,T,0.335,T,0.696,0,5.06,0.871,0.935,0.998,0.997,16.252
The above will work in any sed on any OS.

Related

How to replace text in file between known start and stop positions with a command line utility like sed or awk?

I have been tinkering with this for a while but can't quite figure it out. A sample line within the file looks like this:
"...~236 characters of data...Y YYY. Y...many more characters of data"
How would I use sed or awk to replace spaces with a B character only between positions 236 and 246? In that example string it starts at character 29 and ends at character 39 within the string. I would want to preserve all the text preceding and following the target chunk of data within the line.
For clarification based on the comments, it should be applied to all lines in the file and expected output would be:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
With GNU awk:
$ awk -v FIELDWIDTHS='29 10 *' -v OFS= '{gsub(/ /, "B", $2)} 1' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
FIELDWIDTHS='29 10 *' means 29 characters for first field, next 10 characters for second field and the rest for third field. OFS is set to empty, otherwise you'll get space added between the fields.
With perl:
$ perl -pe 's/^.{29}\K.{10}/$&=~tr| |B|r/e' ip.txt
...~236 characters of data...YBBYYY.BBY...many more characters of data
^.{29}\K match and ignore first 29 characters
.{10} match 10 characters
e flag to allow Perl code instead of string in replacement section
$&=~tr| |B|r convert space to B for the matched portion
Use this Perl one-liner with substr and tr. Note that this uses the fact that you can assign to substr, which changes the original string:
perl -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file > out_file
To change the file in-place, use:
perl -i.bak -lpe 'BEGIN { $from = 29; $to = 39; } (substr $_, ( $from - 1 ), ( $to - $from + 1 ) ) =~ tr/ /B/;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
I would use GNU AWK following way, for simplicity sake say we have file.txt content
S o m e s t r i n g
and want to change spaces from 5 (inclusive) to 10 (inclusive) position then
awk 'BEGIN{FPAT=".";OFS=""}{for(i=5;i<=10;i+=1)$i=($i==" "?"B":$i);print}' file.txt
output is
S o mBeBsBt r i n g
Explanation: I set field pattern (FPAT) to any single character and output field seperator (OFS) to empty string, thus every field is populated by single characters and I do not get superfluous space when print-ing. I use for loop to access desired fields and for every one I check if it is space, if it is I assign B here otherwise I assign original value, finally I print whole changed line.
Using GNU awk:
awk -v strt=29 -v end=39 '{ ram=substr($0,strt,(end-strt));gsub(" ","B",ram);print substr($0,1,(strt-1)) ram substr($0,(end)) }' file
Explanation:
awk -v strt=29 -v end=39 '{ # Pass the start and end character positions as strt and end respectively
ram=substr($0,strt,(end-strt)); # Extract the 29th to the 39th characters of the line and read into variable ram
gsub(" ","B",ram); # Replace spaces with B in ram
print substr($0,1,(strt-1)) ram substr($0,(end)) # Rebuild the line incorporating raw and printing the result
}'file
This is certainly a suitable task for perl, and saddens me that my perl has become so rusty that this is the best I can come up with at the moment:
perl -e 'local $/=\1;while(<>) { s/ /B/ if $. >= 236 && $. <= 246; print }' input;
Another awk but using FS="":
$ awk 'BEGIN{FS=OFS=""}{for(i=29;i<=39;i++)sub(/ /,"B",$i)}1' file
Output:
"...~236 characters of data...YBBYYY.BBY...many more characters of data"
Explained:
$ awk ' # yes awk yes
BEGIN {
FS=OFS="" # set empty field delimiters
}
{
for(i=29;i<=39;i++) # between desired indexes
sub(/ /,"B",$i) # replace space with B
# if($i==" ") # couldve taken this route, too
# $i="B"
}1' file # implicit output
With sed :
sed '
H
s/\(.\{236\}\)\(.\{11\}\).*/\2/
s/ /B/g
H
g
s/\n//g
s/\(.\{236\}\)\(.\{11\}\)\(.*\)\(.\{11\}\)/\1\4\3/
x
s/.*//
x' infile
When you have an input string without \r, you can use:
sed -r 's/(.{236})(.{10})(.*)/\1\r\2\r\3/;:a;s/(\r.*) (.*\r)/\1B\2/;ta;s/\r//g' input
Explanation:
First put \r around the area that you want to change.
Next introduce a label to jump back to.
Next replace a space between 2 markers.
Repeat until all spaces are replaced.
Remove the markers.
In your case, where the length doesn't change, you can do without the markers.
Replace a space after 236..245 characters and try again when it succeeds.
sed -r ':a; s/^(.{236})([^ ]{0,9}) /\1\2B/;ta' input
This might work for you (GNU sed):
sed -E 's/./&\n/245;s//\n&/236/;h;y/ /B/;H;g;s/\n.*\n(.*)\n.*\n(.*)\n.*/\2\1/' file
Divide the problem into 2 lines, one with spaces and one with B's where there were spaces.
Then using pattern matching make a composite line from the two lines.
N.B. The newline can be used as a delimiter as it is guaranteed not to be in seds pattern space.

sed replace string with pipe and stars

I have the following string:
|**barak**.version|2001.0132012031539|
in file text.txt.
I would like to replace it with the following:
|**barak**.version|2001.01.2012031541|
So I run:
sed -i "s/\|\*\*$module\*\*.version\|2001.0132012031539/|**$module**.version|$version/" text.txt
but the result is a duplicate instead of replacing:
|**barak**.version|2001.01.2012031541|**barak**.version|2001.0132012031539|
What am I doing wrong?
Here is the value for module and version:
$ echo $module
barak
$ echo $version
2001.01.2012031541
Assumptions:
lines of interest start and end with a pipe (|) and have one more pipe somewhere in the middle of the data
search is based solely on the value of ${module} existing between the 1st/2nd pipes in the data
we don't know what else may be between the 1st/2nd pipes
the version number is the only thing between the 2nd/3rd pipes
we don't know the version number that we'll be replacing
Sample data:
$ module='barak'
$ version='2001.01.2012031541'
$ cat text.txt
**barak**.version|2001.0132012031539| <<<=== leave this one alone
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.0132012031539| <<<=== replace this one
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.0132012031539| <<<=== replace this one
One sed solution with -Extended regex support enabled and making use of a capture group:
$ sed -E "s/^(\|[^|]*${module}[^|]*).*/\1|${version}|/" text.txt
Where:
\| - first occurrence (escaped pipe) tells sed we're dealing with a literal pipe; follow-on pipes will be treated as literal strings
^(\|[^|]*${module}[^|]*) - first capture group that starts at the beginning of the line, starts with a pipe, then some number of non-pipe characters, then the search pattern (${module}), then more non-pipe characters (continues up to next pipe character)
.* - matches rest of the line (which we're going to discard)
\1|${version}| - replace line with our first capture group, then a pipe, then the new replacement value (${version}), then the final pipe
The above generates:
**barak**.version|2001.0132012031539|
|**apple**.version|2001.0132012031539|
|**barak**.version|2001.01.2012031541| <<<=== replaced
|**chuck**.version|2001.0132012031539|
|**barak**.peanuts|2001.01.2012031541| <<<=== replaced
An awk alternative using GNU awk:
awk -v mod="$module" -v vers="$version" -F \| '{ OFS=FS;split($2,map,".");inmod=substr(map[1],3,length(map[1])-4);if (inmod==mod) { $3=vers } }1' file
Pass two variables mod and vers to awk using $module and $version. Set the field delimiter to |. Split the second field into array map using the split function and using . as the delimiter. Then strip the leading and ending "**" from the first index of the array to expose the module name as inmod using the substr function. Compare this to the mod variable and if there is a match, change the 3rd delimited field to the variable vers. Print the lines with short hand 1
Pipe is only special when you're using extended regular expressions: sed -E
There's no reason why you need extended here, stick with basic regex:
sed "
# for lines matching module.version
/|\*\*$module\*\*.version|/ {
# replace the version
s/|2001.0132012031539|/|$version|/
}
" text.txt
or as an unreadable one-liner
sed "/|\*\*$module\*\*.version|/ s/|2001.0132012031539|/|$version|/" text.txt

bash: transform scaffold fasta

I have a fasta file with the following sequences:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
And I want to split the all sequences in the file that contain multiple N at the site where it occurs and make two sequences out of it.
Expected solution:
>NZ_OCNF01123018.1
TACAAATACAACAAATACAAGTACACCAAGTACAAATACAAGTATCCCAAGTACAAATACAAGTA
TCCCAAGTACAAATACAAGTATTCCAAGTACAAATACAAAACCTGTTGAGCAACCTAAACCTGTTGAAC
AGCCCAAACCTGTTGAACAGC
>contig1
AAACCTTTATCCGCACTTA
CGAGCAAATACACCAATACCGCTTTATCGGCACAGTCTGCCCAAATTGACGGATGCACCATGTTACCCAACAC
ATCAATCAACGTTTGTGGGATCACCTGAAAAAGGGCGCGGTTTGTGGTTGATG
>NZ_OCNF01123018.2
AATTGTCGTGTAAAGCCACACCAAACCCCATTATAGCCCCAAAAACACCAAAAAGGCTGCCTGAACCACATTTCAGACAG
my (inelegant) approach would be this:
perl -pe 's/[N]+/\*/g' $file | perl -pe 's/\*/\n>contig1\n/g'
of course that also replaces the N of the sequence header and creates headers without a sequence. As a plus, it would be nice to number the new 'contigs' from 1 to x in case there are multiple sequences with N.
What do you suggest?
I'd suggest to use split instead of trying to get a regex just right, and in a script instead of a brittle and crammed "one"-liner.
use warnings;
use strict;
use feature 'say';
my $file = shift #ARGV;
die "Usage: $0 filename\n" if !$file; # also check submitted $file
my $content = do { # or: my $content = Path::Tiny::path($file)->slurp;
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my #f = grep { /\S/ } split /(?<!>)NN+/, $content;
say shift #f;
my $cnt;
for (#f) {
say "\n>contig", (++$cnt), ":\n$_";
}
This slurps the file into $content since NN+ can span multiple lines; Path::Tiny module can make that cleaner. The first element of the obtained array needs no >contig so it is shifted away.
The negative lookbehind (?<!...) makes the regex in split's separator pattern match NN+ only when not preceded by >, thus protecting (excluding) header lines that may start with that. If headers may contain consecutive N which are not right after > then you need to refine this.
I expaned your perl one-liner a bit:
cat file.fasta | \
perl -pe 's/\n//g unless /^>/; s/>/\n>/g;' | \
perl -pe 's/N+(?{$n++})/\n>contig${n}\n/g unless /^>/'
the first part is to remove newlines between bases, the second part is to replace continuous 'N'.

Replace and increment letters and numbers with awk or sed

I have a string that contains
fastcgi_cache_path /var/run/nginx-cache15 levels=1:2 keys_zone=MYSITEP:100m inactive=60m;
One of the goals of this script is to increment nginx-cache two digits based on the value find on previous file. For doing that I used this code:
# Replace cache_path
PREV=$(ls -t /etc/nginx/sites-available | head -n1) #find the previous cache_path number
CACHE=$(grep fastcgi_cache_path $PREV | awk '{print $2}' |cut -d/ -f4) #take the string to change
SUB=$(echo $CACHE |sed "s/nginx-cache[0-9]*[0-9]/&#/g;:a {s/0#/1/g;s/1#/2/g;s/2#/3/g;s/3#/4/g;s/4#/5/g;s/5#/6/g;s/6#/7/g;s/7#/8/g;s/8#/9/g;s/9#/#0/g;t a};s/#/1/g") #increment number
sed -i "s/nginx-cache[0-9]*/$SUB/g" $SITENAME #replace number
Maybe not so elegant, but it works.
The other goal is to increment last letter of all occurrences of MYSITEx (MYSITEP, in that case, should become MYSITEQ, after MYSITEP, etc. etc and once MYSITEZ will be reached add another letter, like MYSITEAA, MYSITEAB, etc. etc.
I thought something like:
sed -i "s/MYSITEP[A-Z]*/MYSITEGG/g" $SITENAME
but it can't works cause MYSITEGG is a static value and can't be used.
How can I calculate the last letter, increment it to the next one and once the last Z letter will be reached, add another letter?
Thank you!
Perl's autoincrement will work on letters as well as digits, in exactly the manner you describe
We may as well tidy your nginx-cache increment as well while we're at it
I assume SITENAME holds the name of the file to be modified?
It would look like this. I have to assign the capture $1 to an ordinary variable $n to increment it, as $1 is read-only
perl -i -pe 's/nginx-cache\K(\d+)/ ++($n = $1) /e; s/MYSITE\K(\w+)/ ++($n = $1) /e;' $SITENAME
If you wish, this can be done in a single substitution, like this
perl -i -pe 's/(?:nginx-cache|MYSITE)\K(\w+)/ ++($n = $1) /ge' $SITENAME
Note: The solution below is needlessly complicated, because as Borodin's helpful answer demonstrates (and #stevesliva's comment on the question hinted at), Perl directly supports incrementing letters alphabetically in the manner described in the question, by applying the ++ operator to a variable containing a letter (sequence); e.g.:
$ perl -E '$letters = "ZZ"; say ++$letters'
AAA
The solution below may still be of interest as an annotated showcase of how Perl's power can be harnessed from the shell, showing techniques such as:
use of s///e to determine the replacement string with an expression.
splitting a string into a character array (split //, "....")
use of the ord and chr functions to get the codepoint of a char., and convert a(n incremented) codepoint back to a char.
string replication (x operator)
array indexing and slices:
getting an array's last element ($chars[-1])
getting all but the last element of an array (#chars[0..$#chars-1])
A perl solution (in effect a re-implementation of what ++ can do directly):
perl -pe 's/\bMYSITE\K([A-Z]+)/
#chars = split qr(), $1; $chars[-1] eq "Z" ?
"A" x (1 + scalar #chars)
:
join "", #chars[0..$#chars-1], chr (1 + ord $chars[-1])
/e' <<'EOF'
...=MYSITEP:...
...=MYSITEZP:...
...=MYSITEZZ:...
EOF
yields:
...=MYSITEQ:... # P -> Q
...=MYSITEZQ:... # ZP -> ZQ
...=MYSITEAAA:... # ZZ -> AAA
You can use perl's -i option to replace the input file with the result
(perl -i -pe '...' "$SITENAME").
As Borodin's answer demonstrates, it's not hard to solve all tasks in the question using perl alone.
The s function's /e option allows use of a Perl expression for determining the replacement string, which enables sophisticated replacements:
$1 references the current MYSITE suffix in the expression.
#chars = split qr(), $1 splits the suffix into a character array.
$chars[-1] eq "Z" tests if the last suffix char. is Z
If so: The suffix is replaced with all As, with an additional A appended
("A" x (1 + scalar #chars)).
Otherwise: The last suffix char. is replaced with the following letter in the alphabet
(join "", #chars[0..$#chars-1], chr (1 + ord $chars[-1]))

insert a string at specific position in a file by SED awk

I have a string which i need to insert at a specific position in a file :
The file contains multiple semicolons(;) i need to insert the string just before the last ";"
Is this possible with SED ?
Please do post the explanation with the command as I am new to shell scripting
before :
adad;sfs;sdfsf;fsdfs
string = jjjjj
after
adad;sfs;sdfsf jjjjj;fsdfs
Thanks in advance
This might work for you:
echo 'adad;sfs;sdfsf;fsdfs'| sed 's/\(.*\);/\1 jjjjj;/'
adad;sfs;sdfsf jjjjj;fsdfs
The \(.*\) is greedy and swallows the whole line, the ; makes the regexp backtrack to the last ;. The \(.*\) make s a back reference \1. Put all together in the RHS of the s command means insert jjjjj before the last ;.
sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/' filename
(substitute jjjjj with what you need to insert).
Example:
$ echo 'adad;sfs;sdfsf;fsdfs;' | sed 's/\([^;]*\)\(;[^;]*;$\)/\1jjjjj\2/'
adad;sfs;sdfsfjjjjj;fsdfs;
Explanation:
sed finds the following pattern: \([^;]*\)\(;[^;]*;$\). Escaped round brackets (\(, \)) form numbered groups so we can refer to them later as \1 and \2.
[^;]* is "everything but ;, repeated any number of times.
$ means end of the line.
Then it changes it to \1jjjjj\2.
\1 and \2 are groups matched in first and second round brackets.
For now, the shorter solution using sed : =)
sed -r 's#;([^;]+);$#; jjjjj;\1#' <<< 'adad;sfs;sdfsf;fsdfs;'
-r option stands for extented Regexp
# is the delimiter, the known / separator can be substituted to any other character
we match what's finishing by anything that's not a ; with the ; final one, $ mean end of the line
the last part from my explanation is captured with ()
finally, we substitute the matching part by adding "; jjjj" ans concatenate it with the captured part
Edit: POSIX version (more portable) :
echo 'adad;sfs;sdfsf;fsdfs;' | sed 's#;\([^;]\+\);$#; jjjjj;\1#'
echo 'adad;sfs;sdfsf;fsdfs;' | sed -r 's/(.*);(.*);/\1 jjjj;\2;/'
You don't need the negation of ; because sed is by default greedy, and will pick as much characters as it can.
sed -e 's/\(;[^;]*\)$/ jjjj\1/'
Inserts jjjj before the part where a semicolon is followed by any number of non-semicolons ([^;]*) at the end of the line $. \1 is called a backreference and contains the characters matched between \( and \).
UPDATE: Since the sample input has no longer a ";" at the end.
Something like this may work for you:
echo "adad;sfs;sdfsf;fsdfs"| awk 'BEGIN{FS=OFS=";"} {$(NF-1)=$(NF-1) " jjjjj"; print}'
OUTPUT:
adad;sfs;sdfsf jjjjj;fsdfs
Explanation: awk starts with setting FS (field separator) and OFS (output field separator) as semi colon ;. NF in awk stands for number of fields. $(NF-1) thus means last-1 field. In this awk command {$(NF-1)=$(NF-1) " jjjjj" I am just appending jjjjj to last-1 field.

Resources