What is the most compact or efficient way of doing several substitutions in a file in bash - bash

I have a file data.base which looks like:
1234 XXXX
4321 XXXX
9884 ZZZZ
5454 YYYY
4311 YYYY
9882 ZZZZ
9976 ZZZZ
( ... random occurrences like this till 10000 lines)
I would like to create a file called data.case which derives from data.base just with substitutions of XXXX, YYYY, ZZZZ for float numbers.
I wonder what would be the most compact/efficient/short way to do that on bash or friends.
What I usually do is something like:
sed -e "s/XXXX/1.34555/g" data.base > temp1
sed -e "s/YYYY/2.985/g" temp1 > temp2
sed -e "s/ZZZZ/-4.3435/g" temp2 > data.case
rm -fr temp1 temp2
But I do not think this is the most compact or efficient way when you have to deal with more than 3 substitutions.
Thanks

Separate the commands with semicolons to execute several of them in the same sed call:
sed "s/XXXX/1.34555/g; s/YYYY/2.985/g; s/ZZZZ/-4.3435/g" data.base > data.case

$ cat sedcommands
s/XXXX/1.34555/g
s/YYYY/2.985/g
s/ZZZZ/-4.3435/g
$ sed -f sedcommands data.base > data.case

you can make use of associative arrays in awk
awk 'BEGIN{
# add as needed
s["XXXX"]=1.3455
s["YYYY"]=2.985
s["ZZZZ"]=-4.3435
}
($2 in s) { print $1,s[$2] }' file
output
$ ./shell.sh
1234 1.3455
4321 1.3455
9884 -4.3435
5454 2.985
4311 2.985
9882 -4.3435
9976 -4.3435
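Note that the awk program above prints only lines whose second field has an entry in s. Here is a minimal sketch of a variant that passes every other line through unchanged, assuming the same two-column layout. The function name substitute_codes and the WWWW sample line are made up for illustration, and the values are stored as strings so they print exactly as written.

```bash
# Hypothetical helper: replace the second column when it has a mapping,
# and pass every other line through untouched.
substitute_codes() {
    awk 'BEGIN {
        # add as needed
        s["XXXX"] = "1.34555"
        s["YYYY"] = "2.985"
        s["ZZZZ"] = "-4.3435"
    }
    ($2 in s) { $2 = s[$2] }   # rewrite only known placeholders
    { print }                  # print every line, changed or not
    ' "$@"
}

printf '%s\n' '1234 XXXX' '5454 YYYY' '9884 ZZZZ' '7777 WWWW' | substitute_codes
```

Called with a file name (substitute_codes data.base > data.case) it reads that file instead of standard input.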

sed -e "s/XXXX/1.34555/g;s/YYYY/2.985/g;s/ZZZZ/-4.3435/g"
or put them in a cmd file
and list them out.

Whilst sed can do multiple substitutions in one pass, the general UNIX approach which is more widely applicable and can be combined with other commands is to use command piping:
cat data.base | \
sed -e "s/XXXX/1.34555/g" | \
sed -e "s/YYYY/2.985/g" | \
sed -e "s/ZZZZ/-4.3435/g" > data.case
Do not redirect the output back to data.base: the shell truncates that file when it sets up the pipeline, typically before cat has read it, so the data would be lost. To update data.base itself, write to a temporary file and rename it once the pipeline has succeeded; that also lets you intercept error conditions without losing the original.
(When using piping, it's useful to be familiar with the tee program, which saves the stream to a file whilst passing it on)
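The temporary-file approach mentioned above could look roughly like this. It is only a sketch: the function name substitute_in_place is invented for the example, and it assumes mktemp is available and that the file's directory is writable.

```bash
# Hypothetical wrapper: run all substitutions in one sed call, write to a
# temporary file, and replace the original only if sed succeeded.
substitute_in_place() {
    local file=$1 tmp
    tmp=$(mktemp "${file}.XXXXXX") || return 1   # temp file next to the original
    if sed -e 's/XXXX/1.34555/g' \
           -e 's/YYYY/2.985/g' \
           -e 's/ZZZZ/-4.3435/g' "$file" > "$tmp"
    then
        mv -- "$tmp" "$file"      # success: swap the new content in
    else
        rm -f -- "$tmp"           # failure: keep the original untouched
        return 1
    fi
}
```

Usage would be substitute_in_place data.base. One side effect to be aware of: the replaced file ends up with the permissions of the temporary file, not those of the original.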

Related

Speed up bash for loop which contains multiple sed commands

my bash for loop looks like:
for i in read_* ; do
cut -f1 $i | sponge $i
sed -i '1 s/^/>/g' $i
sed -i '3 s/^/>ref\n/g' $i
sed -i '4d' $i
sed -i '1h;2H;1,2d;4G' $i
mv $i $i.fasta
done
Are there any methods of speeding up this process, perhaps using GNU parallel?
EDIT: Added input and expected output.
Input:
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Hopeful output:
>ref
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
I used the sed -i '1h;2H;1,2d;4G' $i command to swap lines 2 and 4.
If I read it right, this should create the same result, though it would probably help a LOT if I could see what your input and expected output look like...
awk '{$0=$1}
FNR==1{hd=">"$0; next}
FNR==2{hd=hd"\n"$0;next}
FNR==3{print ">ref\n"$0 > FILENAME".fasta"}
FNR==4{next}
FNR==5{print hd"\n"$0 > FILENAME".fasta"}
' read_*
My input files:
$: cat read_x
foo x
bar x
baz x
last x
curiosity x
$: cat read_y
FOO y
BAR y
BAZ y
LAST y
CURIOSITY y
and the resulting output files:
$: cat read_x.fasta
>ref
baz
>foo
bar
curiosity
$: cat read_y.fasta
>ref
BAZ
>FOO
BAR
CURIOSITY
This runs in one pass with no loop aside from awk's usual internals, and leaves the originals in place so you can check it first. If all is good, all that's left is to remove the originals. For that, I would use extended globbing.
$: shopt -s extglob; rm read_!(*.fasta)
That will clean up the original inputs but not the new outputs.
Same results, three commands, no loops.
I am, of course, making some assumptions about what you are meaning to do that might not be accurate. To get this format in a single sed call -
$: sed -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_x
>ref
baz
>foo
bar
curiosity
but that's not the same commands you used, so maybe I'm misreading it.
To use this to in-place edit multiple files at a time (instead of calling it in a loop on each file), use -si so that the line numbers apply to each file rather than the stream of records they collectively produce.
DON'T use -is, though you could use -i -s.
$: sed -s -i -e 's/[[:space:]].*//' -e '1{s/^/>/;h;d}' -e '2{H;s/.*/>ref/}' -e '4x' read_*
This still leaves you with the issue of renaming each, but xargs makes that pretty easy in the given example.
printf "%s\n" read_* | xargs -I# mv # #.fasta
addendum
Using the file you gave in the OP, assuming every file is the same general structure and exactly 4 lines -
$: cat file_0 # I made files 0 through 7, but with same data
sampleid 97 stuff 2086 42 213M = 3322 1431
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
$: sed -Esi '1{s/^([^[:space:]]+).*/>\1/;h;s/.*/>ref/}; 3x;' file_?
$: cat file_0 # used a diff on each, worked on all at once
>ref
TATTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
>sampleid
TTTTTAGGGAAGATCTGGCCTTCCTACAAGGGAAGGCCAGGGAATTTTCTTCAGAGCAGA
Breakout:
-Esi Extended pattern matching, separate file linecounts, in-place edits
1{...}; Collectively do these commands, in order, only on every line 1
s/^([^[:space:]]+).*/>\1/ add leading > but strip everything after any whitespace
h store the resulting >\1 line in the hold buffer
s/.*/>ref/ then replace the whole line with a literal >ref
3x swap line 3 with the value in the hold buffer from line 1
file_? I used a glob to supply the appropriate list of files all at once.
Doing same with awk:
$: awk 'FNR==1{id=">"$1; print ">ref" >FILENAME".fasta"; next} FNR==3{print id > FILENAME".fasta"; next} {print $0 > FILENAME".fasta"}' file_?
Then you can do file management as above with the xargs/mv for the sed or the shopt/rm for the awk - or we could add a little organizational work in awk if you like. Consider this:
awk 'BEGIN { system(" mkdir -p done ") }
FNR==1 { id=">"$1; print ">ref" > FILENAME".fasta"; next } # skip printing original
FNR==3 { print id > FILENAME".fasta"; next } # skip printing original
{ print $0 > FILENAME".fasta" } # every line NOT skipped
FNR==4 { close(FILENAME); close(FILENAME".fasta");
system("mv " FILENAME " done/")
}' file_?
Then if there are any problems, it's easy to delete the fasta's, move the originals back, adjust the code, and try again. If everything is ok, it's fast and easy to rm -fr done, yes?
Note that I really only added the mkdir inside a system call in the awk to show that you can, and to keep from having to manually do it separately if you have to run a few iterations or move it all into a wrapper script, etc.
The code in the question runs multiple subprocesses (cut, sponge, sed four times, and mv) for each file that is processed. Running subprocesses is relatively slow, so you can speed up the code significantly by reducing the number of them.
This Shellcheck-clean code is one way to do it:
#! /bin/bash -p
old_files=()
for f in read_* ; do
readarray -t lines <"$f"
printf '>ref\n%s\n>%s\n%s\n' \
"${lines[3]}" "${lines[0]%%[[:space:]]*}" "${lines[1]}" >"$f.fasta"
old_files+=( "$f" )
done
rm -- "${old_files[@]}"
This runs no subprocesses when processing individual files. It just reads the lines of the old file into an array using the built-in readarray command and writes to the new file using the built-in printf.
See Removing part of a string (BashFAQ/100 (How do I do string manipulation in bash?)) for an explanation of the %% in ${lines[0]%%[[:space:]]*}.
To avoid running rm for each file, the code keeps a list of files to be deleted and removes all of them at the end. If you try the code, consider commenting the rm line until you are very confident that the rest of the code is doing what you want.

String manipulation via script

I am trying to get a substring between &DEST= and the next & or a line break.
For example :
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
In this I need to extract "SFO"
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
In this I need to extract "SANFRANSISCO"
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
In this I need to extract "SANJOSE"
I am reading a file line by line, and I need to update the text after &DEST= and put it back in the file. The modification of the text is to mask the dest value with X character.
So, SFO should be replaced with XXX.
SANJOSE should be replaced with XXXXXXX.
Output :
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
Please let me know how to achieve this in script (Preferably shell or bash script).
Thanks.
$ cat file
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=PORTORICA
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
$ sed -E 's/^.*&DEST=([^&]*)[&]*.*$/\1/' file
SFO
PORTORICA
SANFRANSISCO
SANJOSE
should do it
Replacing airports with an equal number of Xs
Let's consider this test file:
$ cat file
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=SANFRANSISCO&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
To replace the strings after &DEST= with an equal length of X and using GNU sed:
$ sed -E ':a; s/(&DEST=X*)[^X&]/\1X/; ta' file
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
To replace the file in-place:
sed -i -E ':a; s/(&DEST=X*)[^X&]/\1X/; ta' file
The above was tested with GNU sed. For BSD (OSX) sed, try:
sed -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta file
Or, to change in-place with BSD(OSX) sed, try:
sed -i '' -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta file
If there is some reason why it is important to use the shell to read the file line-by-line:
while IFS= read -r line
do
echo "$line" | sed -Ee :a -e 's/(&DEST=X*)[^X&]/\1X/' -e ta
done <file
How it works
Let's consider this code:
search_str="&DEST="
newfile=chart.txt
sed -E ':a; s/('"$search_str"'X*)[^X&]/\1X/; ta' "$newfile"
-E
This tells sed to use Extended Regular Expressions (ERE). This has the advantage of requiring fewer backslashes to escape things.
:a
This creates a label a.
s/('"$search_str"'X*)[^X&]/\1X/
This looks for $search_str followed by any number of X followed by any character that is not X or &. Because of the parens, everything except that last character is saved into group 1. This string is replaced by group 1, denoted \1 and an X.
ta
In sed, t is a test command. If the substitution was made (meaning that some character needed to be replaced by X), then the test evaluates to true and, in that case, ta tells sed to jump to label a.
This test-and-jump causes the substitution to be repeated as many times as necessary.
Replacing multiple tags with one sed command
$ name='DEST|ORIG'; sed -E ':a; s/(&('"$name"')=X*)[^X&]/\1X/; ta' file
MYREQUESTISTO8764GETTHIS&DEST=XXX&ORIG=XXXX
MYREQUESTISTO8764GETTHIS&DEST=XXXXXXXXXXXX&ORIG=XXXX
MYREQUESTISTO8764GETTHISWITH&DEST=XXXXXXX
Answer for original question
Using shell
$ s='MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546'
$ s=${s#*&DEST=}
$ echo ${s%%&*}
SFO
How it works:
${s#*&DEST=} is prefix removal. This removes all text up to and including the first occurrence of &DEST=.
${s%%&*} is suffix removal. It removes all text from the first & to the end of the string.
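The same two expansions can be combined with ${value//?/X} to do the masking itself without any external program. This is a rough sketch rather than a tested production solution; mask_dest is a made-up name, and it only handles the first &DEST= on a line.

```bash
# Hypothetical helper: mask the value after &DEST= using only bash expansions.
mask_dest() {
    local line=$1 head rest value tail
    case $line in
        *'&DEST='*) ;;                       # has a DEST field, carry on
        *) printf '%s\n' "$line"; return ;;  # nothing to mask
    esac
    head=${line%%&DEST=*}      # text before the first &DEST=
    rest=${line#*&DEST=}       # text after the first &DEST=
    value=${rest%%&*}          # DEST value, up to the next & or end of line
    tail=${rest#"$value"}      # whatever follows the value
    printf '%s&DEST=%s%s\n' "$head" "${value//?/X}" "$tail"
}

while IFS= read -r line; do
    mask_dest "$line"
done <<'EOF'
MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=SANJOSE
EOF
```

For a large file the sed loop is likely to be faster, since the shell loop handles one line at a time.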
Using awk
$ echo 'MYREQUESTISTO8764GETTHIS&DEST=SFO&ORIG=6546' | awk -F'[=\n]' '$1=="DEST"{print $2}' RS='&'
SFO
How it works:
-F'[=\n]'
This tells awk to treat either an equal sign or a newline as the field separator
$1=="DEST"{print $2}
If the first field is DEST, then print the second field.
RS='&'
This sets the record separator to &.
With GNU bash:
while IFS= read -r line; do
[[ $line =~ (.*&DEST=)([^&]*)((&.*|$)) ]] && echo "${BASH_REMATCH[1]}fooooo${BASH_REMATCH[3]}"
done < file
Output:
MYREQUESTISTO8764GETTHIS&DEST=fooooo&ORIG=6546
MYREQUESTISTO8764GETTHIS&DEST=fooooo&ORIG=6546
MYREQUESTISTO8764GETTHISWITH&DEST=fooooo
Replace the characters between &DEST and & (or EOL) with x's:
awk -F'&DEST=' '{
printf("%s&DEST=", $1);
xlen=index($2,"&");
if ( xlen == 0) xlen=length($2)+1;
for (i=1;i<xlen;i++) printf("%s", "X");
endstr=substr($2,xlen);
printf("%s\n", endstr);
}' file

Conditional replacement of string fragment with sed (one-liner!)

I am trying to process the result of diff operation with sed. This is my diff output, which I pipe into sed
3d2
< 12-03-22_JET_D_CL_UR_l4053_0061 True_Warning All 9 149261
62a62
> 13-01-29_VUE_EPM3_v37_CSAV2_0370 True_Warning All 13 22125
68c68
< 13-05-14_Regular_Front_0062 True_Warning All 13 123383
---
> 13-05-14_Regular_Front_0062 True_Warning All 21 123383
119c119
< CADS4_PMP363_20130202_DPH_069 True_Warning All 13 233405
---
> CADS4_PMP363_20130202_DPH_069 True_Warning All 9 233409
149c149
< CADS4_PMP363_20130315_Fujifilm_UK_186 True_Warning All 21 18611
---
> CADS4_PMP363_20130315_Fujifilm_UK_186 True_Warning All 17 18615
I need to sort out the difference string and prepend the 3rd word in the strings with either "Old" or "New" - depending on the first character. My best effort so far is
diff new_jumps/true.jump old_jumps/true.jump | sed -n "/^[<>]/ s:\(.\) \(\S\+\) \(.\+\):\2 \1,\3: p" | replace ">" Old | replace "<" New
Which give me this result (exactly what I wanted).
12-03-22_JET_D_CL_UR_l4053_0061 New,True_Warning All 9 149261
13-01-29_VUE_EPM3_v37_CSAV2_0370 Old,True_Warning All 13 22125
13-05-14_Regular_Front_0062 New,True_Warning All 13 123383
13-05-14_Regular_Front_0062 Old,True_Warning All 21 123383
CADS4_PMP363_20130202_DPH_069 New,True_Warning All 13 233405
CADS4_PMP363_20130202_DPH_069 Old,True_Warning All 9 233409
CADS4_PMP363_20130315_Fujifilm_UK_186 New,True_Warning All 21 18611
CADS4_PMP363_20130315_Fujifilm_UK_186 Old,True_Warning All 17 18615
My question is - how can I change conditional expression within sed one-liner that will eliminate the need to use replace afterwards? (I assume that it is possible)
Thanks in advance
EDIT:
I know, I missed the option to chain sed expressions, but what I had in mind - is it possible to do it within one substitute operation?
By adding more commands to sed using semicolon (;), like this:
diff new_jumps/true.jump old_jumps/true.jump | sed -n "/^[<>]/ s:\(.\) \(\S\+\) \(.\+\):\2 \1,\3:; s/</New/gp; s/>/Old/gp"
With awk I get a faster response. Try this:
diff new_jumps/true.jump old_jumps/true.jump | awk '{ if($1=="<" || $1==">"){($1=="<")?temp="New,":temp="Old,";print $2,temp$3,$4,$5}}'
Here's another solution suggested by Jidder:
awk '/^</{i="old,"}/^>/{i="new,"}i{$2=$2" "i;print;i=0}'
@volcano: here is a one-liner solution in sed, but it relies on interaction with the shell. IMHO if you want to have only one sed substitution command, you cannot avoid that behavior: you have to pass to the shell the information about which first character was seen on the line, the shell does the mapping to the "Old" or "New" string, and gives the result back to sed.
So the one-liner is not exactly a one-liner because we have to define things in the shell... ;)
replace() { if [ "$1" == ">" ] ; then echo -n "Old"; else echo -n "New" ; fi }
export -f replace
sed -n '/^[<>]/ s:\(.\) \(\S\+\) \(.\+\):echo "\2 $(replace \\\1),\3";:ep' yourfile
Please note that the e flag to the substitution command is a GNU sed extension, we use it here to avoid calling the shell explicitly. If you don't use GNU sed, you can simply replace the last line above by the following:
sed -n '/^[<>]/ s:\(.\) \(\S\+\) \(.\+\):echo "\2 $(replace \\\1),\3";:p' yourfile | bash
The solution I am giving here has been inspired by that other one.
Please also note that all these gymnastics are avoidable if you are willing to replace your three-letter tokens "Old" and "New" with their initials, because then we can neatly use the y command to first act in a tr fashion, like so:
sed -n '/^[<>]/ y/<>/NO/; s:\(.\) \(\S\+\) \(.\+\):\2 \1,\3:p' yourfile
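If two substitute commands in a single sed invocation are acceptable (this is not the single s command the edit asks about), the mapping to "New" and "Old" can be written directly, with no replace step and no shell round trip. A minimal sketch, assuming the diff layout shown in the question; label_diff is a made-up name, and [^ ]* is used instead of \S so it does not depend on GNU sed.

```bash
# Hypothetical helper: label diff lines in one sed call, no post-processing.
label_diff() {
    sed -n \
        -e 's/^< \([^ ]*\) /\1 New,/p' \
        -e 's/^> \([^ ]*\) /\1 Old,/p'
}

printf '%s\n' \
    '68c68' \
    '< 13-05-14_Regular_Front_0062 True_Warning All 13 123383' \
    '---' \
    '> 13-05-14_Regular_Front_0062 True_Warning All 21 123383' | label_diff
```

Lines such as 68c68 and --- match neither pattern, so -n drops them. In practice the input would come from diff new_jumps/true.jump old_jumps/true.jump | label_diff.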

generate a random number/string or an iterator in sed 's/'

I adapted Jan Goyvaerts's e-mail regex to a bash function to be used in pipes to anonymize e-mail addresses:
function remove_emails {
sed -r "s|\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b|email.address@removed.com|gI";
}
which I'm using in a bash pipe:
mysqldump \
-uuser \
-ppass \
db_name \
| remove_emails \
| gzip -c \
| cat \
> tmp.sql.gz
works fine but now, I'd like to have different random e-mails, I'd be satisfied with:
email.address1@removed.com
email.address2@removed.com
or
eiyyzhupzftrvjwehbqp@removed.com
kwmbrshzmxqlrqatqpff@removed.com
or anything that differs and is unique
I'm quite comfortable with bash but using counters, process substitution and so fails as sed is invoked only once, so
sed "s,sth,$(echo $RANDOM),g"
and similar won't work,
Is there anything to generate random stuff or counters in sed itself?
This might work for you (GNU sed):
<<<'Here is a random number.' sed 's/random number/& $RANDOM/;s/.*/echo "&"/e'
or if you prefer:
<<<'Here is a random number.' sed 's/random number/& $RANDOM/;s/.*/echo "&"/' | sh
I experimented with potong's correct answer and found a way to implement an iterator which answers the other part of my question:
remove_emails() {
sed -r 's|\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b|test$(( iterator++ ))@example.com|gI;s|.*|echo "&"|' | bash
}
iterator=0
test_data='some.e.mail.address.@domain.com\nsome.other@email.co.uk\nwhatever@man.biz\nsed@sed.com\n'
echo -e "before:\n${test_data}"
echo -e "after: \n${test_data}" | remove_emails
You could do it by repeatedly invoking sed in a while loop as shown below:
remove_emails() {
while IFS= read -r line
do
sed -r "s|\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b|email.address${RANDOM}@removed.com|gI" <<< "$line"
done
}
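Another option is to swap sed for awk, which has variables and so can keep a counter across the whole stream in a single process. The following is a sketch under a few assumptions: anonymize_emails is a made-up name, the pattern is a simplified version of the one in the question (no word boundaries, and two or more letters for the top-level domain instead of {2,4}), and numbering starts at 1.

```bash
# Hypothetical helper: give every e-mail address a unique numbered replacement.
anonymize_emails() {
    awk '{
        out = ""
        rest = $0
        # match() sets RSTART and RLENGTH for the leftmost address in rest
        while (match(rest, /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z][A-Za-z]+/)) {
            n++   # counter persists across lines
            out = out substr(rest, 1, RSTART - 1) "email.address" n "@removed.com"
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }'
}

printf '%s\n' 'whatever@man.biz and sed@sed.com' 'no address here' | anonymize_emails
```

It drops into the pipeline from the question in place of remove_emails, and unlike the $RANDOM variants it cannot produce the same replacement twice.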

Manipulating the output string in bash shell

I have written a line that finds and returns the full path to a desired file. The output is as follows:
/home/ke/Desktop/b/o/r/files.txt:am.torrent
/home/ke/Desktop/y/u/n/u/s/files.txt:asd.torrent
I have to modify the output like this:
bor
yunus
How do I do that?
Thanks in advance.
This should work for you:
your_script.sh | sed 's,.*Desktop,,' | sed 's,[^/]*$,,' | sed s,/,,g
or, even better:
your_script.sh | sed 's,.*Desktop,,;s,[^/]*$,,;s,/,,g'
With sed:
echo '/home/ke/Desktop/b/o/r/files.txt:am.torrent' | sed -e 's+/++g' -e 's/^.*Desktop//' -e 's/files.txt:.*$//'
This is a fairly trivial solution, and I'm sure there are better ones.
I'd resort to awk:
BEGIN { FS="/" }
{
for(i=1;i<NF;i++)
if (length($i) == 1)
a[NR]=a[NR]""$i
}
END {
for (i in a)
print a[i]
}
use it like this:
$ awk -f script.awk input
bor
yunus
or if you have your data in a variable:
$ awk -f script.awk <<< "$data"
It's not a nice/tidy solution, but bash parameter expansion is a powerful tool, so I could not resist providing an example.
[]l="/home/ke/Desktop/b/o/r/files.txt:am.torrent"
[]m=${l##*Desktop/}
[]n=${m%%/files.txt*}
[]k=${n//\//}
[]echo $m
b/o/r/files.txt:am.torrent
[]echo $n
b/o/r
[]echo $k
bor
You can see how nicely bash is replacing the variable step by step without using any external program (btw [] is PS1, prompt)
There can be many more ways to do it. I got another one while writing the first
[]l="/home/ke/Desktop/b/o/r/files.txt:am.torrent"
[]m=${l/*Desktop\//}
[]n=${m/\/files.txt*/}
[]k=${n//\//}
[]echo $m
b/o/r/files.txt:am.torrent
[]echo $n
b/o/r
[]echo $k
bor
Try some more,
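The expansion steps above can be wrapped in a small function and applied to each line of output. This is a sketch only: path_letters is a made-up name, and it assumes every line contains Desktop/ and /files.txt as in the question.

```bash
# Hypothetical helper: collapse the single-letter directories into one word.
path_letters() {
    local path=$1
    path=${path##*Desktop/}       # drop everything up to and including Desktop/
    path=${path%%/files.txt*}     # drop /files.txt and whatever follows
    printf '%s\n' "${path//\//}"  # delete the remaining slashes
}

while IFS= read -r line; do
    path_letters "$line"
done <<'EOF'
/home/ke/Desktop/b/o/r/files.txt:am.torrent
/home/ke/Desktop/y/u/n/u/s/files.txt:asd.torrent
EOF
```

To use it on real output, feed the loop from your command instead of the here-document, e.g. your_script.sh | while IFS= read -r line; do path_letters "$line"; done.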

Resources