sed find and replace a specific number [duplicate] - bash

I have a file like the following:
abc 259200000 2 3 864000000 3 5
def 86400000 2 62 864000000 3 62
efg 864000000 2 347 0 0 0
abcd 259200000 3 3 0 0 0
I need to replace any standalone 0 with the words Not Exist. I tried the following, and none of them work.
sed 's/[0]/Not Exist/g' data.txt > out.txt
sed 's/[^0]/Not Exist/g' data.txt > out.txt
sed 's/^[0]/Not Exist/g' data.txt > out.txt
I'd much appreciate any help.

Could you please try the following, if you are OK with awk.
awk '{for(i=1;i<=NF;i++){if($i==0){$i="Not Exist"}}}{$1=$1} 1' OFS="\t" Input_file
Adding a non-one-liner form of the solution too:
awk '
{
  for(i=1;i<=NF;i++){
    if($i==0){
      $i="Not Exist"
    }
  }
}
{
  $1=$1
}
1
' OFS="\t" Input_file
Explanation:
awk '
{
  for(i=1;i<=NF;i++){  ##Starting a for loop from i=1 to the value of NF (number of fields), incrementing i by 1 each time.
    if($i==0){         ##Checking whether the value of the current field is 0; if so, do the following.
      $i="Not Exist"   ##Re-setting the value of that field to the string Not Exist.
    }                  ##Closing the if block.
  }                    ##Closing the for loop block.
}
{
  $1=$1                ##Re-assigning the first field so the line is rebuilt with TAB as the output field separator.
}
1                      ##awk works on a pattern/action model; 1 is a condition that is always TRUE, and with no action given, the default action of printing the current line happens.
' OFS="\t" Input_file  ##Setting OFS to TAB and naming the input file.
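For instance, with the question's sample saved as data.txt, the one-liner should produce tab-separated output like this:
$ awk '{for(i=1;i<=NF;i++){if($i==0){$i="Not Exist"}}}{$1=$1} 1' OFS="\t" data.txt
abc	259200000	2	3	864000000	3	5
def	86400000	2	62	864000000	3	62
efg	864000000	2	347	Not Exist	Not Exist	Not Exist
abcd	259200000	3	3	Not Exist	Not Exist	Not Exist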

Here's why your three attempts so far don't work:
sed 's/[0]/Not Exist/g' data.txt > out.txt
This asks sed to replace any zero character with the replacement string, including those that are part of a larger number.
sed 's/[^0]/Not Exist/g' data.txt > out.txt
This asks sed to replace any character which is NOT zero with the replacement string. The ^ "negates" the regex bracket expression.
sed 's/^[0]/Not Exist/g' data.txt > out.txt
This asks sed to replace any zero that is at the beginning of the line, since in this context the ^ anchors the match to the start of the line.
What you're looking for might be expressed as follows:
sed 's/\([[:space:]]\)0\([[:space:]]\)/\1Not exist\2/g; s/\([[:space:]]\)0$/\1Not exist/' data.txt > out.txt
In this solution I'm using the space character class since I don't know whether your input file is tab or space separated. The class works with both, and retains whatever was there before.
Note that there are two sed commands here -- the first processes zeros that have text after them, and the second processes zeros that are at the end of the line. This does make the script a bit awkward, so if you're on a more modern operating system with a sed that includes a -E option, the following might be easier to read:
sed -E 's/([[:space:]])0([[:space:]]|$)/\1Not exist\2/g' data.txt > out.txt
This takes advantage of the fact that in ERE, an "atom" can have multiple "branches", separated by an or bar (|). For more on this, man re_format.
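Note one caveat with either form: in a run of consecutive zeros such as the 0 0 0 in the sample, the space between two zeros is consumed by the first match, so a single pass skips every other zero in the run. If that matters, one simple fix is to run the substitution twice; the second pass catches the zeros the first one skipped:
sed -E 's/([[:space:]])0([[:space:]]|$)/\1Not exist\2/g; s/([[:space:]])0([[:space:]]|$)/\1Not exist\2/g' data.txt > out.txt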
Note that sed is probably not the best tool for this. Processing fields is usually best done with awk. I can't improve on @RavinderSingh13's awk solution, so you should use that if awk is an option.
Of course, your formatting is going to be wonky with almost any option.

I assume the columns are separated by whitespace characters; in that case:
When using sed, you need to search for a lone zero, that is, a zero "enclosed" in spaces. So you need to check whether the character before and after the zero is a space. You also need to handle a zero at the start and at the end of the line separately.
sed '
# replace a 0 that is the first character on the line
s/^0\([[:space:]]\)/Not Exists\1/
# replace zeros separated by spaces
s/\([[:space:]]\)0\([[:space:]]\)/\1Not Exists\2/g
# replace a 0 that is the last character on the line
s/\([[:space:]]\)0$/\1Not Exists/ ' data.txt > out.txt

Using sed:
sed 's/\<0\>/NotExist/g' file | column -t
\<...\> matches a whole word (word-boundary anchors, a GNU sed extension).
column -t displays the output in neatly aligned columns.
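For example, on the sample data this should give output along these lines (exact column widths come from column -t):
$ sed 's/\<0\>/NotExist/g' data.txt | column -t
abc   259200000  2  3    864000000  3         5
def   86400000   2  62   864000000  3         62
efg   864000000  2  347  NotExist   NotExist  NotExist
abcd  259200000  3  3    NotExist   NotExist  NotExist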

Related

Replace every 4th occurence of char "_" with "#" in multiple files

I am trying to replace every 4th occurrence of "_" with "#" in multiple files with bash.
E.g.
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo..
would become
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo...
#perl -pe 's{_}{++$n % 4 ? $& : "#"}ge' *.txt
I have tried perl, but the problem is that it replaces every 4th _ carrying the count over from the previous file. So, for example, in some files the first _ gets replaced, because the count does not restart at 0 for each new file; it carries on from the one before.
I have tried:
#awk '{for(i=1; i<=NF; i++) if($i=="_") if(++count%4==0) $i="#"}1' *.txt
but this also does not work.
Using sed, I cannot find a way to keep replacing every 4th occurrence, as there is a different number of _s in each file. Some files have 20 _s, some have 200, so I can't specify a range.
I am really lost as to what to do. Can anybody help?
You just need to reset the counter in the perl one using eof to tell when it's done reading each file:
perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' *.txt
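For example, run against two copies of a file a.txt holding the sample line, the count now restarts per file:
$ perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo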
This MAY be what you want, using GNU awk for RT:
$ awk -v RS='_' '{ORS=(FNR%4 ? RT : "#")} 1' file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo..
It only reads each _-separated string into memory one at a time, so it should work no matter how large your input file is, assuming there are _s in it.
It assumes you want to replace every 4th _ across the whole file as opposed to within individual lines.
A simple sed would handle this:
s='foo_foo_foo_foo_foo_foo_foo_foo_foo_foo'
sed -E 's/(([^_]+_){3}[^_]+)_/\1#/g' <<< "$s"
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
Explanation:
(: Start capture group #1
([^_]+_){3}: Match 1+ non-_ characters followed by a _. Repeat this group 3 times to match 3 such words separated by _
[^_]+: Match 1+ of non-_ characters
): End capture group #1
_: Match a _
Replacement is \1# to replace 4th _ with a #
With GNU sed:
sed -nsE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
-n suppresses the automatic printing, -s processes each file separately, -E uses extended regular expressions.
The script is a loop between label a (:a) and the branch-to-label-a command (ba). Each iteration appends the next line of input to the pattern space (N). This way, after the last line has been read, the pattern space contains the whole file(*). During the last iteration, when the last line has been read ($), a substitute command (s) replaces every 4th _ in the pattern space by a # (s/(([^_]*_){3}[^_]*)_/\1#/g) and prints (p) the result.
Once you are satisfied with the result, you can change the options:
sed -i -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, or:
sed -i.bkp -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, but keep a *.txt.bkp backup of each file.
(*) Note that if you have very large files this could cause memory overflows.
With your shown samples, please try the following awk program. I have created an awk variable named fieldNum and assigned 4 to it, since the OP needs a # after every 4th _; you can adjust it to your needs.
awk -v fieldNum="4" '
BEGIN{ FS=OFS="_" }
{
  val=""
  for(i=1;i<=NF;i++){
    val=(val?val:"") $i (i%fieldNum==0?"#":(i<NF?OFS:""))
  }
  print val
}
' Input_file
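With the shown sample this should print:
$ echo 'foo_foo_foo_foo_foo_foo_foo_foo_foo_foo' > Input_file
$ awk -v fieldNum="4" 'BEGIN{FS=OFS="_"}{val="";for(i=1;i<=NF;i++){val=(val?val:"") $i (i%fieldNum==0?"#":(i<NF?OFS:""))};print val}' Input_file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo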
With GNU awk
$ cat ip.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
123_45678_90
_
$ awk -v RS='(_[^_]+){3}_' -v ORS= '{sub(/_$/, "#", RT); print $0 RT}' ip.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
123_45678_90
#
-v RS='(_[^_]+){3}_' sets the input record separator to cover a sequence of four _s (the text matched by this separator is available via RT)
-v ORS= sets an empty output record separator
sub(/_$/, "#", RT) changes the trailing _ of the matched separator to #
Use -i inplace for in-place editing.
If the count should reset for each line:
perl -pe's/(?:_[^_]*){3}\K_/\#/g'
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
If the count shouldn't reset for each line, but should reset for each file:
perl -0777pe's/(?:_[^_]*){3}\K_/\#/g'
The -0777 causes the whole file to be treated as one line, which makes the count work properly across lines.
But since a fresh match is performed for each file, the count is reset between files.
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -0777pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
To avoid reading the entire file at once, you could keep the original line-by-line approach, but with the following added:
$n = 0 if eof;
Note that eof is not the same thing as eof()! See perldoc -f eof.

How to delete certain characters after a pattern using sed or awk?

I have a text file containing a number of lines formatted like below:
001_A.wav;112.680;115.211;;;Ja. Hello; Hi:
My goal is to clean whatever is after ;;;, meaning to delete the characters ,;()~? there.
I know I can do something like sed 's/[,.;()~?,]//g'. However, if I do that, it gives me something like
001_Awav112.680115211Ja Hello Hi
However, I would like to delete those characters only after ;;;, so I would get
001_A.wav;112.680;115.211;;;Ja Hello Hi
How can I accomplish this task?
1st solution: Could you please try the following, written and tested with the shown samples in GNU awk (assuming ;;; occurs one time per line).
awk '
match($0,/.*;;;/){
  laterPart=substr($0,RSTART+RLENGTH)
  gsub(/[,.:;()~?]/,"",laterPart)
  print substr($0,RSTART,RLENGTH) laterPart
}' Input_file
Explanation:
awk '                                        ##Starting the awk program.
match($0,/.*;;;/){                           ##Using the match function to match everything up to and including ;;;.
  laterPart=substr($0,RSTART+RLENGTH)        ##Creating variable laterPart, which holds the rest of the line after the matched part.
  gsub(/[,.:;()~?]/,"",laterPart)            ##Globally substituting ,.:;()~? with NULL in the laterPart variable.
  print substr($0,RSTART,RLENGTH) laterPart  ##Printing the substring matched by the regex followed by laterPart.
}' Input_file                                ##Naming the input file.
2nd solution: In case you have multiple occurrences of ;;; on a line and you want to substitute characters in all fields after the 1st occurrence of ;;;, try the following.
awk 'BEGIN{FS=OFS=";;;"} {for(i=2;i<=NF;i++){gsub(/[,.:;()~?,]/,"",$i)}} 1' Input_file
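For reference, with the shown sample the 2nd solution prints:
$ awk 'BEGIN{FS=OFS=";;;"} {for(i=2;i<=NF;i++){gsub(/[,.:;()~?,]/,"",$i)}} 1' Input_file
001_A.wav;112.680;115.211;;;Ja Hello Hi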
You can use
sed ':a; s/\(;;;[^,.:;()~?,]*\)[,.:;()~?,]/\1/; ta' file > newfile
sed ':a; s/\(;;;[^[:punct:]]*\)[[:punct:]]/\1/; ta' file > newfile
Details
:a sets a label
\(;;;[^,.:;()~?,]*\)[,.:;()~?,] matches and captures into Group 1 a ;;; substring followed by zero or more chars other than those in the ,.:;()~?, set, and then matches one char from that set
[^[:punct:]]* matches any 0 or more chars other than punctuation chars
[[:punct:]] matches any punctuation char
\1 is the replacement, the contents of Group 1
ta branches back to a label on a successful replacement.
See the online sed demo:
s='001_A.wav;112.680;115.211;;;Ja. Hello; Hi:'
sed ':a; s/\(;;;[^,.:;()~?,]*\)[,.:;()~?,]/\1/; ta' <<< "$s"
# => 001_A.wav;112.680;115.211;;;Ja Hello Hi
sed ':a; s/\(;;;[^[:punct:]]*\)[[:punct:]]/\1/; ta' <<< "$s"
# => 001_A.wav;112.680;115.211;;;Ja Hello Hi
I misread your question at first, but I've corrected my answer now.
I suggest making use of perl instead, since it has lookaround assertions.
$ perl -pe 's/^((?:(?!;;;).)*;;;)|[:,.;\(\)~\?,]/\1/g' file.txt
^ is the beginning of the line.
((?:(?!;;;).)*;;;) is the multi-character analogue of [^;]*: it makes sure the first ;;; is found and captures everything up to and including it in \1.
|[:,\.;\(\)~\?,] alternatively matches one of the characters :,.;()~?, and drops it from the result (thus turning "Ja." into "Ja").
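With the sample line in file.txt, this should output:
$ perl -pe 's/^((?:(?!;;;).)*;;;)|[:,.;\(\)~\?,]/\1/g' file.txt
001_A.wav;112.680;115.211;;;Ja Hello Hi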
You can use a combination of several sed commands:
echo '001_A.wav;112.680;115.211;;;Ja. Hello; Hi:' |
sed 's/;;;/;;;\n\r/' |
sed '/^\r/ s/[,;():~?]//g' |
sed -z 's/;;;\n\r/;;;/g'
A different GNU awk solution:
echo "001_A.wav;112.680;115.211;;;Ja. Hello; Hi:" | awk 'BEGIN{FS=OFS=";;;"}{print $1,gensub(/[,;()~?]/,"","g",substr($0,length($1)+1))}'
output:
001_A.wav;112.680;115.211;;;Ja. Hello Hi:
This assumes your description takes precedence over your example (only ,;()~? will be removed). Explanation: I use ;;; as the input and output separator, print the 1st column (what is before ;;;), and get the rest by computing its start as the length of the 1st column plus 1; then I remove all the specified characters from that part and print it.
If the example takes precedence over the description, then you might use the [[:punct:]] character set instead, namely:
echo "001_A.wav;112.680;115.211;;;Ja. Hello; Hi:" | awk 'BEGIN{FS=OFS=";;;"}{print $1,gensub(/[[:punct:]]/,"","g",substr($0,length($1)+1))}'
will give
001_A.wav;112.680;115.211;;;Ja Hello Hi

Bash: Keep all lines with duplicate values in column X

I have a file with a few thousand lines and 20+ columns. I now want to keep only the lines that have the same e-mail address in column 3 as in other lines.
file: (First Name; Last Name; E-Mail; ...)
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Jennifer;Lopez;jennifer@lopez.com
Andre;Agassi;tom@boyden.com
Paul;Walker;paul@walker.com
I want to keep ALL lines that have a matching e-mail address. In this case the expected output would be
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Andre;Agassi;tom@boyden.com
If I use
awk -F';' 'seen[$3]++' file
I will lose the first instance of each e-mail address, in this case lines 1 and 2, and will keep ONLY the duplicates.
Is there a way to keep all lines?
This awk one-liner will help you:
awk -F';' 'NR==FNR{a[$3]++;next}a[$3]>1' file file
It passes the file twice: the first pass counts the occurrences, and the 2nd pass checks the counts and prints.
With the given input example, it prints:
Mike;Tyson;mike@tyson.com
Tom;Boyden;tom@boyden.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Andre;Agassi;tom@boyden.com
If the output order doesn't matter, here's a one-pass approach:
$ awk -F';' '$3 in first{print first[$3] $0; first[$3]=""; next} {first[$3]=$0 ORS}' file
Mike;Tyson;mike@tyson.com
Tom;Cruise;mike@tyson.com
Mike;Myers;mike@tyson.com
Tom;Boyden;tom@boyden.com
Andre;Agassi;tom@boyden.com
Could you please try the following, which reads the Input_file only once in a single awk.
awk '
BEGIN{
  FS=";"
}
{
  mail[$3]++
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0
}
END{
  for(i in mailVal){
    if(mail[i]>1){ print mailVal[i] }
  }
}' Input_file
Explanation:
awk '                 ##Starting the awk program.
BEGIN{                ##Starting the BEGIN section.
  FS=";"              ##Setting the field separator to ;.
}
{
  mail[$3]++          ##Incrementing the count in array mail, indexed by the 3rd field.
  mailVal[$3]=($3 in mailVal?mailVal[$3] ORS:"")$0  ##Creating mailVal, indexed by the 3rd field, appending the current line to any previously stored lines, newline-separated.
}
END{                  ##Starting the END block.
  for(i in mailVal){  ##Traversing through mailVal.
    if(mail[i]>1){ print mailVal[i] }  ##If the count for this address is greater than 1, printing its stored lines.
  }
}
' Input_file          ##Naming the input file.
I think @ceving just needs to go a little further.
ASSUMING the chosen column is NOT the first or last -
cut -f$col -d\; file | # slice out the right column
tr '[[:upper:]]' '[[:lower:]]' | # standardize case
sort | uniq -d | # sort and output only the dups
sed 's/^/;/; s/$/;/;' > dups # save the lowercased keys
grep -iFf dups file > subset.csv # pull matching records
This breaks if the chosen column is the first or last, but should otherwise preserve case and order from the original version.
If it might be the first or last, then pad the stream to that last grep and clean it afterwards -
sed 's/^/;/; s/$/;/;' file | # pad with leading/trailing delims
grep -iFf dups | # grab relevant records
sed 's/^;//; s/;$//;' > subset.csv # strip the padding
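To make this concrete, assuming the e-mail is column 3 of a wider file (as the question describes, so it is neither first nor last), the dups file for the sample data would contain the two repeated keys, wrapped in delimiters:
;mike@tyson.com;
;tom@boyden.com;
and the final grep keeps every record containing either key.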
Find the duplicate e-mail addresses:
sed -s 's/^.*;/;/;s/$/$/' < file.csv | sort | uniq -d > dups.txt
Report the duplicate csv rows:
grep -f dups.txt file.csv
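For the sample file, dups.txt would contain the two anchored patterns:
;mike@tyson.com$
;tom@boyden.com$
and the grep then prints the five matching rows.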
Update:
As "Ed Morton" pointed out the above commands will fail, when the e-mail addresses contain characters, which have a special meaning in a regular expression. This makes it necessary to escape the e-mail addresses.
One way to do so is to use Perl compatible regular expression. In a PCRE the escape sequences \Q and \E mark the beginning and the end of a string, which should not be treated as a regular expression. GNU grep supports PCREs with the option -P. But this can not be combined with the option -f. This makes it necessary to use something like xargs. But xargs interprets backslashes and ruins the regular expression. In order to prevent it, it is necessary to use the option -0.
Lessen learned: it is quite difficult to get it right without programming it in AWK.
sed -s 's/^.*;/;\\Q/;s/$/\\E$/' < file.csv | sort | uniq -d | tr '\n' '\0' > dups.txt
xargs -0 -i < dups.txt grep -P '{}' file.csv

Prepend text to specific line numbers with variables

I have spent hours trying to solve this. There are a bunch of answers on how to prepend to all lines or to specific lines, but not with variable text and a variable line number.
while [ $FirstVariable -lt $NextVariable ]; do
#sed -i "$FirstVariables/.*/$FirstVariableText/" "$PWD/Inprocess/$InprocessFile"
cat "$PWD/Inprocess/$InprocessFile" | awk 'NR==${FirstVariable}{print "$FirstVariableText"}1' > "$PWD/Inprocess/Temp$InprocessFile"
FirstVariable=$[$FirstVariable+1]
done
Essentially I am looking for a particular string delimiter, then figuring out where the next one is, and prepending the first result to the lines in between. Note that I have already figured out the logic; I am just having trouble prepending to the lines with the variables.
Example:
This >
Line1:
1
2
3
Line2:
1
2
3
Would turn into >
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
You can do all of that using the awk one-liner below.
Assuming your pattern starts with Line, the following script can be used:
> awk '{if ($1 ~ /Line/ ){var=$1;print $0;}else{ if ($1 !="")print var $1}}' $PWD/Inprocess/$InprocessFile
Line1:
Line1:1
Line1:2
Line1:3
Line2:
Line2:1
Line2:2
Line2:3
Here is how the above script works:
If the first field matches the word Line, it is saved in the awk variable var and the line is printed as-is. For the following records, if the record is not empty, var is prepended to the first field and the result is printed, producing the desired output.
If you need to pass the variables dynamically from the shell to awk, you can use the -v option, like below:
awk -v var1="$FirstVariable" -v var2="$FirstVariableText" 'NR==var1{print var2}1' "$PWD/Inprocess/$InprocessFile" > "$PWD/Inprocess/Temp$InprocessFile"
The way you addressed the problem, you parse everything twice, using both bash and awk to process the file: bash extracts a line, and awk then manipulates that one line. The whole thing can actually be done with a single awk script:
awk '/^Line/{str=$1; print; next}{print (NF ? str $0 : "")}' inputfile > outputfile
or
awk 'BEGIN{RS="";ORS="\n\n";FS=OFS="\n"}{gsub(FS,OFS $1)}1' inputfile > outputfile
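In case that second one-liner looks cryptic, here is the same program spelled out with comments; it assumes the groups are separated by blank lines, which is what awk's paragraph mode (RS="") splits on:
awk '
BEGIN{
  RS=""        # paragraph mode: records are blank-line-separated blocks
  ORS="\n\n"   # emit a blank line between output blocks
  FS=OFS="\n"  # each line of a block is one field
}
{
  gsub(FS,OFS $1)  # replace every newline with newline + the header field ($1, e.g. "Line1:")
}
1                  # always-true pattern: print the modified block
' inputfile > outputfile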

sed squeeze multiple occurrence of word

I have a text file with lines like below:
this is the code ;rfc1234;rfc1234
this is the code ;rfc1234;rfc1234;rfc1234;rfc1234
How can I squeeze the repeating words in the file down to a single word, like below:
this is the code ;rfc1234
this is the code ;rfc1234
I tried the tr command, but it is limited to squeezing characters only.
With sed, for arbitrary repeated strings prefixed with ;:
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
or, if you want to delete everything after the first token, without checking whether the rest matches the preceding one:
$ sed -E 's/(\S);.*/\1/' file
Explanation
(;[^;]+) captures a string starting with a semicolon
(\1)+ followed by the same captured string one or more times
/\1/g replace the whole chain with one instance, and repeat
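Applied to the sample file, the first command collapses each run to a single token:
$ sed -E 's/(;[^;]+)(\1)+/\1/g' file
this is the code ;rfc1234
this is the code ;rfc1234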
The following awk may help here. It looks at all the items in the last column of your Input_file and keeps only the unique values among them.
awk '{num=split($NF,array,";");for(i=1;i<=num;i++){if(!array1[array[i]]++){val=val?val ";" array[i]:array[i]}};NF--;print $0";"val;val="";delete array;delete array1}' Input_file
Adding a non-one-liner form of the solution too:
awk '
{
  num=split($NF,array,";")
  for(i=1;i<=num;i++){
    if(!array1[array[i]]++){
      val=val?val ";" array[i]:array[i]
    }
  }
  NF--
  print $0";"val
  val=""
  delete array
  delete array1
}' Input_file
Explanation:
awk '
{
  num=split($NF,array,";")    ##Creating a variable num: the number of elements of the array named array, made by splitting the last field of the line on the ; delimiter.
  for(i=1;i<=num;i++){        ##Starting a for loop from i=1 to the value of num, incrementing i by 1 each time.
    if(!array1[array[i]]++){  ##Checking whether the value array[i] is being seen for the first time (tracked in array1); if so, do the following.
      val=val?val ";" array[i]:array[i]  ##Appending array[i] to the variable val, ;-separated.
    }
  }
  NF--                        ##Reducing NF (number of fields) to drop the last field from the current line.
  print $0";"val              ##Printing the current line (without its last field), a ;, and then the value of val.
  val=""                      ##Nullifying the variable val.
  delete array                ##Deleting the array named array.
  delete array1               ##Deleting the array named array1.
}' Input_file                 ##Naming the input file.
I started playing around with s/(.+)\1/\1/g. It seemed to work with perl (it even found the repeated "is "), but didn't quite take me there:
$ perl -pe 's/(.+)\1+/\1/g' file
this the code ;rfc1234
this the code ;rfc1234;rfc1234
This sed, which simply keeps the first ;-delimited token and drops the rest, finishes the job:
sed 's/\(;[^;]*\).*/\1/' file
You can use the command below to achieve this:
echo "this is the code ;rfc1234;rfc1234" | sed 's/rfc1234//2g'
echo "this is the code ;rfc1234;rfc1234;rfc1234;rfc1234" | sed 's/rfc1234//2g'
or
sed 's/rfc1234//2g' yourfile.txt
This might work for you (GNU sed):
sed -r ':a;s/(\S+)\1+/\1/g;ta' file
The substitution is repeated, via the ta loop, until no repeated strings remain.
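On the shown input it converges to:
$ sed -r ':a;s/(\S+)\1+/\1/g;ta' file
this is the code ;rfc1234
this is the code ;rfc1234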
