Why does sed not replace overlapping patterns - shell

I have a database unload file with fields separated by the <TAB> character. I am running this file through sed to replace any occurrences of <TAB><TAB> with <TAB>\N<TAB>, so that when the file is loaded into MySQL the \N is interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work, or do I need to loop?
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.

Not dissimilar to the perl solution, this works for me using pure sed
With @Robin A. Meade's improvement
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
Explanation
:repeat is a label, used for branch commands, similar to batch
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means: if the "s" command made any replacements, go to the label repeat; otherwise sed moves on to the next line and starts over again.
So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As @thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
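The commands in this answer insert a newline (\n), whereas the question asked for a literal \N. For that goal, the same loop could be sketched as follows. This assumes GNU sed, which understands \t in patterns; the function name fill_nulls and the sample row are made up for illustration.

```shell
# Sketch: fill empty tab-separated fields with a literal \N (MySQL's NULL marker).
# Assumes GNU sed (\t escapes); fill_nulls and the sample row are hypothetical.
fill_nulls() {
  sed ':repeat
s/\t\t/\t\\N\t/g
t repeat'
}

row=$(printf 'a\t\t\tb\n' | fill_nulls)
printf '%s\n' "$row"
```

On this input the first pass leaves <TAB>\N<TAB><TAB>, the second pass fills the remaining empty field, and the third pass finds nothing to replace, so the loop ends.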
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
Explanation
:repeat is a label, used for branch commands, similar to batch
/\t\t/ means match the pattern 2 tabs. If the pattern is matched, the command following the second / is executed.
{} - In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch) the label repeat
Short version
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
MacOS
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
Tabs need to be literal in BSD sed
Newlines need to be both literal and escaped at the same time, hence the single backslash (that's \\ before it is processed by the $'...' quoting, making it a single literal backslash) plus the \n, which becomes an actual newline
Both label names (:r) and branch commands (b r when not the end of the expression) must end in a newline. Special characters like semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.
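One way to sidestep these dialect differences is to pass each command in its own -e and to supply the tab as a literal character. The sketch below does that and inserts a literal \N, as the question asked, rather than a newline. The variable names are made up, and I would expect it to behave the same in GNU and BSD sed because each -e fragment is treated as its own line, but it is worth checking on the target system.

```shell
# Sketch: one -e per command, so the label and branch each end at a line boundary.
# tab and sample are hypothetical names; the tab is passed as a literal character.
tab=$(printf '\t')
sample=$(printf 'x\t\t\ty')
portable=$(printf '%s\n' "$sample" |
  sed -e ':r' -e "s/${tab}${tab}/${tab}\\\\N${tab}/g" -e 't r')
printf '%s\n' "$portable"
```

Inside the double quotes the shell reduces the four backslashes to two, which sed then reads as one literal backslash in front of the N.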

I know you want sed, but sed doesn't like this at all, it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>

As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
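A quick way to check the two-step workaround on a sample row, assuming GNU sed (for \t and the backslashed grouping parentheses); mark_nulls and the sample data are made-up names:

```shell
# Sketch of the two-step workaround, assuming GNU sed.
# Step 1 puts \N after every tab; step 2 removes each \N that is followed
# by something other than a tab, i.e. by real field data.
mark_nulls() {
  sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
}

marked=$(printf 'a\tb\t\t\tc\n' | mark_nulls)
printf '%s\n' "$marked"
```

One thing to keep in mind: a tab at the very end of a line keeps its \N, because nothing follows it for the second expression to match. Whether that is wanted depends on the data.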

Right, even with /g, sed will not match the text it replaced again. Thus, it's read <TAB><TAB> and output <TAB>\N<TAB> and then reads the next thing in from the input stream. See http://www.grymoire.com/Unix/Sed.html#uh-7
In a regex language that supports lookaheads, you can get around this with a lookahead.

Well, sed simply works as designed. The input line is scanned once, not multiple times. It may help to look at the consequences if sed rescanned the input line by default to deal with overlapping patterns: even simple substitutions would then work quite differently (some might say counter-intuitively), e.g.
s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise
There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.
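The single-scan behaviour is easy to observe with two small substitutions (the inputs are made up for illustration):

```shell
# Matches are found in the original line; inserted text is never rescanned.
halved=$(printf 'aaaa\n' | sed 's/aa/a/g')   # two non-overlapping matches
indented=$(printf 'x\n' | sed 's/^/ /')      # one space inserted, then sed moves on
printf '[%s] [%s]\n' "$halved" "$indented"
```

The first result still contains aa, which a rescanning sed would have reduced further, and the second shows that s/^/ / terminates after inserting one space.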

Related

sed command for inserting text inside single quote

Suppose there's a text file with the following line:
export MYSQL_ADMIN=''
I want to insert text inside that single quote using the sed command, so that it changes to something like this for example:
export MYSQL_ADMIN='abc1'
What is the appropriate sed command for that in Linux?
I tried
sed -i -e ''/MYSQL_ADMIN/s/''/'abc1'/g"
but it didn't work.
Something like sed -i "s;export MYSQL_ADMIN=.*;export MYSQL_ADMIN='abc1';" /path/to/file.ext
-i modify file in place
s means substitute,
First block is what you are matching as a regular expression - the .* matches everything to the end of the line, which ensures you don't keep any text on that line after the substitution - and second block is what you are replacing that match with.
Always check the file after each run of sed if there is no error and check what changed.
To get the single quotes to print you may have to do ""'"" like ""'""abc1""'""
It is important to understand that although
I want to insert text inside that single quote using the sed command
is a perfectly good characterization of the effect you want to achieve, it does not map directly onto operations from sed's repertoire. With sed, the appropriate tool for most line modifications is the s command, which substitutes specified text for one or more matches to a specified regular expression. That would be the most natural thing to use for your case.
Additionally, it is important with sed to understand how and when to bind commands to specific lines. If you don't do that for a given command then it is applied to all lines. Sometimes that's fine, but other times it will produce unwanted results.
I tried
sed -i -e ''/MYSQL_ADMIN/s/''/'abc1'/g"
but it didn't work.
The two leading single quotes in that sed expression match each other, leaving the trailing double quote unmatched. Also, you do not specify the name of the file to modify. This variation would at least be valid shell syntax, and it would have the desired effect on the specified line appearing in file my_script:
sed -i -e "/MYSQL_ADMIN/s/''/'abc1'/g" my_script
That might also make other, unwanted changes, however.
You need to make some assumptions about the content of the file in order to do such a thing at all. The above depends on the text MYSQL_ADMIN and '' to appear on the same line only in the line(s) you want to modify. That may turn out to hold, but it seems unnecessarily risky. An assumption more likely to hold in general would be that there will be only one assignment to variable MYSQL_ADMIN, or that it is acceptable to modify all such assignments that assign a single-quote-delimited empty value.
Going with the latter, one might end up with this:
sed -i -e "s/\<MYSQL_ADMIN=''\(\s\|$\)/MYSQL_ADMIN='abc1'\1/g" my_script
The pattern \<MYSQL_ADMIN=''\(\s\|$\) improves on your plain MYSQL_ADMIN in these significant ways:
the \< causes it to match only immediately after a word boundary -- start of line, whitespace, or punctuation. This prevents substitutions for other variables whose names happen to end with MYSQL_ADMIN. If you prefer, it would be even stronger to instead anchor the match to the beginning of the line with ^.
including the ='' in the pattern distinguishes between MYSQL_ADMIN and variables whose names contain that as an initial substring. It also ensures that the '' that gets replaced, if any, goes with the variable and does not merely appear somewhere else on the line.
the \(\s\|$\) both matches and captures either a whitespace character or the empty string at the end of a line. This distinguishes between assignments of an empty value and assignments of values that are merely prefixed by '' (which is valid if the file is a shell script). Having included it in the match, the capture allows the matched text, if any, to be preserved in the output (via the \1 in the replacement).
Because that matches the whole assignment, a complete assignment must appear in the replacement, too. On the other hand, this means that (probably) you can apply the command to every line, as shown, with no particular loss of efficiency relative to the previous command.
Even that might produce changes you didn't want, however, such as in comment lines or quoted text.
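A cautious way to try such a substitution is to run it without -i on a scratch copy first. The sketch below uses the stricter ^ anchoring mentioned above; the file contents and the value abc1 are examples only.

```shell
# Sketch: preview the change on a scratch file before touching the real one.
# The pattern is anchored at both ends, so only an exact empty assignment matches.
scratch=$(mktemp)
printf '%s\n' "export MYSQL_ADMIN=''" "export OLD_MYSQL_ADMIN=''" \
  "# MYSQL_ADMIN='' by default" > "$scratch"

preview=$(sed "s/^export MYSQL_ADMIN=''\$/export MYSQL_ADMIN='abc1'/" "$scratch")
printf '%s\n' "$preview"
rm -f "$scratch"
```

Only the first line changes here; the similarly named variable and the comment are left alone. Once the preview looks right, the same expression can be run with -i on the real file.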

What is the meaning of this BASH SED command?

Example of tnum ... HYH19986_T_DRIVER_BAG_PRESSURE__78ms_546ms
tnum=`echo $1 | sed -e 's/_.*$//'`
The end result is that tnum will eventually become HYH19986. I have absolutely no experience of BASH but a quick search found that SED is the stream editor and essentially a find and replace tool.
Please could someone explain to me what everything means from the -e onwards? Thank you.
Sed is the "stream editor". It is a non-interactive text editor that takes commands to edit text. Its most commonly used command is "s", short for "substitute". This takes two expressions and optionally some options, and replaces the first expression with the second one.
The character after the "s" is the delimiter - it separates the expressions. Typically this is "/", but if you are working e.g. with paths it might be nicer to use something different like : or _ so you don't need to escape every /.
The _.*$ is a regular expression. Sed matches this, and replaces it with the second expression, the bit between the second and third slash, i.e. nothing in this case.
_ is a literal underline, .* is "any number of characters" and $ is the end of the line.
After that third slash you could also give options, like "g" (I remember it as "global"), which would cause this to be run multiple times per line. That's missing, but in this case the expression matches to the end of the line anyway, so nothing would change.
So this substitutes anything after an underline with nothing, which results in trimming it.
s/pattern/repl/ replaces the first occurrence of the pattern with the string repl. _.*$ matches a literal _ followed by the longest string of zero or more of any character (.*) up to the end of the line ($). So this just deletes everything from and including the first underscore to the end of the line.
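To see the effect in isolation, the substitution can be run on the example value directly. The sketch also shows a shell parameter expansion that should give the same result without sed; the variable names are chosen for illustration.

```shell
# Sketch: strip everything from the first underscore onwards.
name='HYH19986_T_DRIVER_BAG_PRESSURE__78ms_546ms'

tnum=$(printf '%s\n' "$name" | sed -e 's/_.*$//')   # the sed approach from the question
tnum_expansion=${name%%_*}                           # shell-only: drop the longest suffix matching _*

printf '%s %s\n' "$tnum" "$tnum_expansion"
```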

extract data between similar patterns

I am trying to use sed to print the contents between two patterns including the first one. I was using this answer as a source.
My file looks like this:
>item_1
abcabcabacabcabcabcabcabacabcabcabcabcabacabcabc
>item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
>item_3
cdecde
>item_4
defdefdefdefdefdefdef
I want it to start searching from item_2 (and include it) and finish at the next occurring > (not included). So my code is sed -n '/item_2/,/>/{/>/!p;}'.
The result wanted is:
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
but I get it without item_2.
Any ideas?
Using awk, split input by >s and print part(s) matching item_2.
$ awk 'BEGIN{RS=">";ORS=""} /item_2/' file
item_2
bcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdbbcdbcdbcdbcdb
I would go for the awk method suggested by oguz for its simplicity. Now if you are interested in a sed way, out of curiosity, you could fix what you have already tried with a minor change :
sed -n '/^>item_2/ s/.// ; //,/>/ { />/! p }' input_file
The empty regex // recalls the previous regex, which is handy here to avoid duplicating /item_2/. But keep in mind that // is actually dynamic, it recalls the latest regex evaluated at runtime, which is not necessarily the closest regex on its left (although it's often the case). Depending on the program flow (branching, address range), the content of the same // can change and... actually here we have an interesting example ! (and I'm not saying that because it's my baby ^^)
On a line where /^>item_2/ matches, the s/.// command is executed and the latest regex before // becomes /./, so the following address range is equivalent to /./,/>/.
On a line where /^>item_2/ does not match, the latest regex before // is /^>item_2/ so the range is equivalent to /^>item_2/,/>/.
To avoid confusion here as the effect of // changes during execution, it's important to note that an address range evaluates only its left side when not triggered and only its right side when triggered.
This might work for you (GNU sed):
sed -n ':a;/^>item_2/{s/.//;:b;p;n;/^>/!bb;ba}' file
Turn off implicit printing -n.
If a line begins >item_2, remove the first character, print the line and fetch the next line
If that line does not begin with a >, repeat the last two instructions.
Otherwise, repeat the whole set of instructions.
If there will always be only one line following >item_2, then:
sed '/^>item_2/!d;s/.//;n' file
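For experimenting, the sketch below builds a shortened version of the sample file and compares the awk approach with a sed variant that handles the header line explicitly instead of relying on the empty regex. The sed command is my own variation, not one of the commands above, and was written with GNU sed in mind; the file contents are abbreviated.

```shell
# Sketch: extract the item_2 record, header included but without its leading >.
records=$(mktemp)
printf '%s\n' '>item_1' 'abcabc' '>item_2' 'bcdbcd' '>item_3' 'cdecde' > "$records"

# awk: split records on > and print the one mentioning item_2.
from_awk=$(awk 'BEGIN{RS=">";ORS=""} /item_2/' "$records")

# sed: inside the range, print the header with > stripped, then every
# line that does not start with > (which excludes the closing header).
from_sed=$(sed -n '/^>item_2/,/^>/{ /^>item_2/{ s/^>//; p; d; }; /^>/!p; }' "$records")

printf '%s\n' "$from_sed"
rm -f "$records"
```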

How to decrement (subtract) number in file with sed

I've got some source code like the following where I call a function in C:
void myFunction (
&((int) table[1, 0]),
&((int) table[2, 0]),
&((int) table[3, 0])
);
...the only problem is that the function has >300 parameters (it's an auto-generated wrapper for initialising and calling a whole module; it was given to me and I cannot change it). And as you can see: I began accessing the array with a 1 instead of a 0... Great times, modifying all the 300 parameters, i.e. decreasing the x-coordinate of the array 300 times, by hand.
The solution I am looking for is how I could force sed to do the work for me ;)
EDIT: Please note that the syntax above for accessing a two-dimensional array in C is wrong anyway! Of course it should be [1][0]... (so don't just copy-and-paste ;))
Basically, the command I came up with, was the following:
sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/ge' inputfile.c > outputfile.c
Well, this does not look very intuitive at first sight - and I was missing good explanations for nearly every example I found.
So I will try to give a detailed explanation on this:
sed
--> basic command
-r
--> most examples you find are using -e; however, the -r parameter (only works with GNU sed) enables extended regular expressions and brings support for the + in a regex. It basically means "one or more matches".
's/input/output/ge'
--> this is the basic replacement syntax. It basically means "replace 'input' by 'output'". The /g is a "global" flag, i.e. sed will replace all occurrences and not only the first one. You can add an additional e to execute the result in the bash. This is what we want to do here to handle the calculation.
(.*)
--> this matches "everything" up to the next part of the pattern (here: from the start of the line up to table[)
(table\[)
--> the \ is to escape the bracket. This part of the expression will match Strings like table[
([0-9]+)
--> this one matches numbers with at least one digit, however, it can also match higher numbers with more than only one digit.
(,)
--> this simply matches the comma ,
(.*)
--> and again: the rest of the line
And now the interesting part:
echo "\1\2$((\3-1))\4\5"
the echo is a bash command
the \n (you can use every value from \1 up to \9) is some kind of "variable" for the inputs: \1 will contain the first match, \2 the second match, ... --> this helps you to preserve parts of the input string
the $((1+1)) is a simple bash syntax to calculate the value of the term inside the double brackets (in the complete sed command above, the \3 will of course be automatically replaced by the 3rd match, i.e. the 1st part inside the brackets to access the table's cells)
please note that we use quotation marks around the echo content to also be able to process lines with characters like & which would otherwise not work
The already mentioned e of /ge at the end will trigger the execution of the result in the shell. Note that only lines on which the substitution actually matched are executed; lines without table[, such as void myFunction (, pass through unchanged. E.g. the second line of the example source code in the question would produce the following statement:
echo " &((int) table[$((1-1)), 0]),"
which is executed, so that the first two lines of the output are:
void myFunction (
&((int) table[0, 0]),
...which is exactly what I wanted :)
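To try the command on a single line before running it on a whole file, something like the following should do. It assumes GNU sed, since the e flag is a GNU extension; the sample line is taken from the question and the variable names are made up.

```shell
# Sketch, assuming GNU sed: build an echo command from the line and let
# the e flag run it, so the shell evaluates $((3-1)).
line=' &((int) table[3, 0]),'
decremented=$(printf '%s\n' "$line" |
  sed -r 's/(.*)(table\[)([0-9]+)(,)(.*)/echo "\1\2$((\3-1))\4\5"/e')
printf '%s\n' "$decremented"
```

Because the generated text is run by a shell, source lines that contain characters such as ", $ or backticks would need extra care.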
BTW:
text > output.c
is simple bash syntax to output text (or in this case the sed-processed source code) to a file called output.c.
Good links about this topic are:
sed basics
regular expressions basics
Ahh and one more thing: You can also use sed in the git-Bash on Windows - if you are "forced" to use Windows at work like me ;)
PS: In the meantime I could have easily done this by hand but using sed was a lot more fun ;)
Here's another way you could do it, using Perl:
perl -pe 's/(table\[)(\d+)(,)/$1.($2-1).$3/e' file.c
This uses the e modifier to execute an expression in the replacement. The capture groups are concatenated together but the middle group has 1 subtracted from its value.
This will output to standard output so you can check that it does what you want. When you're happy, you can add the -i switch to overwrite the original file.

Understanding 'sed' command

I am currently trying to install GCC-4.1.2 on my machine: Fedora 20.
In the instruction, the first three commands involve using 'sed' commands, for Makefile modification. However, I am having difficulty in using those commands properly for my case. The website link for GCC-4.1.2.
The commands are:
sed -i 's/install_to_$(INSTALL_DEST) //' libiberty/Makefile.in &&
sed -i 's#\./fixinc\.sh#-c true#' gcc/Makefile.in &&
sed -i 's/#have_mktemp_command#/yes/' gcc/gccbug.in &&
I am trying to understand them by reading the 'sed' man page, but it is not so easy to do so. Any help/tip would be appreciated!
First, the shell part: &&. That just chains the commands together, so each subsequent line will only be run if the prior one is run successfully.
sed -i means "run these commands inline on the file", that is, modify the file directly instead of printing the changed contents to STDOUT. Each sed command here (the string) is a substitute command, which we can tell because the command starts with s.
Substitute looks for a piece of text in the file, and then replaces it. So the order is always s/needle/replacement/. See how the first and last lines have those same forward-slashes? That's the traditional delimiter between the command (substitute), the needle to find in the haystack (install_to_$(INSTALL_DEST) followed by a space), and the text to replace it with (nothing, in this case).
So, the first one looks for the string and deletes it (the empty replacement). The last one looks for #have_mktemp_command# and replaces it with yes.
The middle one is a bit weird. See how it starts with s# instead of s/? Well, sed will let you use any delimiter you like to separate the needle from the replacement. Since this needle had a / in it (\./fixinc\.sh), it made sense to use a different delimiter than /. It will replace the text ./fixinc.sh with -c true.
Last note: Why does the second needle have \. instead of .? Well, the needle is a regular expression, and in a regular expression some characters are magical and do magical fairy dust operations. One of those magic characters is the period (.). To avoid the magic, we put a \ in front of it, escaping away from the magic. (The magic is "match any character", and we want a literal period. That's why.)
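The difference the backslash makes can be seen on a made-up input that merely resembles the file name:

```shell
# Sketch: # as delimiter, and escaped vs. unescaped dots. Inputs are invented.
escaped=$(printf '%s\n' 'run a/fixincXsh' | sed 's#\./fixinc\.sh#-c true#')    # no literal dots: unchanged
unescaped=$(printf '%s\n' 'run a/fixincXsh' | sed 's#./fixinc.sh#-c true#')    # . matches a and X
literal=$(printf '%s\n' 'run ./fixinc.sh' | sed 's#\./fixinc\.sh#-c true#')    # the intended match

printf '%s\n' "$escaped" "$unescaped" "$literal"
```

With the dots escaped, only the real ./fixinc.sh is replaced; without the escapes, the look-alike text is replaced as well.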

Resources