What is the meaning of this BASH SED command? - bash

Example of tnum ... HYH19986_T_DRIVER_BAG_PRESSURE__78ms_546ms
tnum=`echo $1 | sed -e 's/_.*$//'`
The end result is that tnum will eventually become HYH19986. I have absolutely no experience of BASH but a quick search found that SED is the stream editor and essentially a find an replace too.
Please could someone explain to me what everything means from the -e onwards? Thank you.

Sed is the "stream editor". It is a non-interactive text editor, that takes commands to edit text. It's most commonly used command is "s", short for "substitute". This takes two expressions and optionally some options, and replaces the first expression with the second one.
The character after the "s" is the delimiter - it separates the expressions. Typically this is "/", but if you are working e.g. with paths it might be nicer to use something different like : or _ so you don't need to escape every /.
The _.*$ is a regular expression. Sed matches this, and replaces it with the second expression, the bit between the second and third slash, i.e. nothing in this case.
_ is a literal underline, .* is "any number of characters" and $ is the end of the line.
After that third slash you could also give options, like "g" (I remember it as "global"), which would cause this to be run multiple times per line. That's missing, but in this case the expression matches to the end of the line anyway, so nothing would change.
So this substitutes anything after an underline with nothing, which results in trimming it.

s/pattern/repl/ replaces the first occurrence of the pattern with the string repl. _.*$ matches a literal _ followed by the longest string of zero or more of any character (.*) up to the end of the line ($). So this just deletes everything from and including the first underscore to the end of the line.

Related

sed command for inserting text inside single quote

Suppose there's a text file with the following line:
export MYSQL_ADMIN=''
I want to insert text inside that single quote using the sed command, so that it changes to something like this for example:
export MYSQL_ADMIN='abc1'
What is the appropriate sed command for that in Linux?
I tried
sed -i -e ''/MYSQL_ADMIN/s/''/'abc1'/g"
but it didn't work.
Something like sed -i "s;export MYSQL_ADMIN=.*;export MYSQL_ADMIN='abc1';" /path/to/file.ext
-i modify file in place
s means substitute,
First block is what you are matching as an regular expression - the .* matches everything to the end of the line, this ensures you don't keep any text on that line after the substitue - and second block is what you are replacing with that match.
Always check the file after each run of sed if there is no error and check what changed.
To get the single quotes to print you may have to do ""'"" like ""'""abc1""'""
It is important to understand that although
I want to insert text inside that single quote using the sed command
is a perfectly good characterization of the effect you want to achieve, it does not map directly onto operations from sed's repertoire. With sed, the appropriate tool for most line modifications is the s command, which substitutes specified text for one or more matches to a specified regular expression. That would be the most natural thing to use for your case.
Additionally, it is important with sed to understand how and when to bind commands to specific lines. If you don't do that for a given command then it is applied to all lines. Sometimes that's fine, but other times it will produce unwanted results.
I tried
sed -i -e ''/MYSQL_ADMIN/s/''/'abc1'/g"
but it didn't work.
The two leading single quotes in that sed expression match each other, leaving the trailing double quote unmatched. Also, you do not specify the name of the file to modify. This variation would at least be valid shell syntax, and it would have the desired effect on the specified line appearing in file my_script:
sed -i -e "/MYSQL_ADMIN/s/''/'abc1'/g" my_script
That might also make other, unwanted changes, however.
You need to make some assumptions about the content of the file in order to do such a thing at all. The above depends on the text MYSQL_ADMIN and '' to appear on the same line only in the line(s) you want to modify. That may turn out to hold, but it seems unnecessarily risky. An assumption more likely to hold in general would be that there will be only one assignment to variable MYSQL_ADMIN, or that it is acceptable to modify all such assignments that assign a single-quote-delimited empty value.
Going with the latter, one might end up with this:
sed -i -e "s/\<MYSQL_ADMIN=''\(\s\|$\)/MYSQL_ADMIN='abc1'\1/g" my_script
The pattern \<MYSQL_ADMIN=''\(\s\|$\) improves on your plain MYSQL_ADMIN in these significant ways:
the \< causes it to match only immediately after a word boundary -- start of line, whitesepace, or punctuation. This prevents substitutions for other variables whose names happen to end with MYSQL_ADMIN. If you prefer, it would be even stronger to instead anchor the match to the beginning of the line with ^.
including the ='' in the pattern distinguishes between MYSQL_ADMIN and variables whose names contain that as an initial substring. It also ensures that the '' that gets replaced, if any, goes with the variable and does not merely appear somewhere else on the line.
the \(\s\|$\) both matches and captures either a whitespace character or the empty string at the end of a line. This distinguishes between assignments of an empty value and assignments of values that are merely prefixed by '' (which is valid if the file is a shell script). Having included it in the match, the capture allows the matched text, if any, to be preserved in the output (via the \1 in the replacement).
Because that matches the whole assignment, a complete assignment must appear in the replacement, too. On the other hand, this means that (probably) you can apply the command to every line, as shown, with no particular loss of efficiency relative to the previous command.
Even that might produce changes you didn't want, however, such as in comment lines or quoted text.

Delete a column in a log, that changes position in the line, Powershell

I have a log file, and I want to get rid of the third column that start with "external", this column is not always in the third place so I need to find the word "external" and then delete it with the string that follows the colon.
I was thinking in using -replace for that, but does "-replace" accept some regex to delete the rest of the string (after the semicolons) that is always changing?
or maybe there is a better way to do this?
02/02/2020 name:VAL_NATURE external:af2045b2-5992-432e-b790-c1ad4743038 status:good
cat mylog.log | %{$_ -replace "external???",""}
With any delimited file, the first thought I have is to break it at the delimiters (in your case, the white space) and treat it like an object. Deleting a column is trivial if you do that, and it lets you have easy access to the data for other purposes.
If, however, your only task is to remove that column with 'external' + colon + all text up to the next bit of white space, that is an easy thing to do with a regex replace.
$line = '02/02/2020 name:VAL_NATURE external:af2045b2-5992-432e-b790-c1ad4743038 status:good'
$line -replace 'external:.*\s',''
EDIT: Tested the code above, and got this output:
02/02/2020 name:VAL_NATURE status:good
The . is any character, and .* says "any character zero or more times" it continues matching until it gets to whitespace, which is represented by the \s. So this regex matches the word 'external' followed by a ':' followed by zero or more other characters followed by whitespace (space/tab/etc).

Bash script " & " symbol creating an issues

I have a tag
<string name="currencysym">$</string>
in string.xml And what to change the $ symbol dynamically. Use the below command but didn't work:
currencysym=₹
sed -i '' 's|<string name="currencysym">\(.*\)<\/string>|<string name="currencysym">'"<!\[CDATA\[${currencysym}\]\]>"'<\/string>|g'
Getting OUTPUT:
<string name="currencysym"><![CDATA[<string name="currencysym">$</string>#x20B9;]]></string>
" & " Has Removed...
But I need:
<string name="currencysym"><![CDATA[₹]]></string>
using xml-parser/tool to handle xml is first choice
instead of <..>\(.*\)<..> better use <..>\([^<]*\)<..> in case you have many tags in one line
& in replacement has special meaning, it indicates the whole match (\0) of the pattern. That's why you see <...>..</..> came to your output. If you want it to be literal, you should escape it -> \&
First problem is the line
currencysym=₹
This actually reads as "assign empty to currencysym and start no process in the background":
In bash you can set an environment variable (or variables) just or one run of a process by doing VAR=value command. This is how currencysym= is being interpreted.
The & symbol means start process in the background, except there is no command specified, so nothing happens.
Everything after # is interpreted as a comment, so #x20B9; is just whitespace from Bash's point of view.
Also, ; is a command separator, like &, which means "run in foreground". It is not used here because it is commented out by #.
You have to either escape &, # and ;, or just put your string into single quotes: currencysym=\&\#x20B9\; or currencysym='₹'.
Now on top of that, & has a special meaning in sed, so you will need to escape it before using it in the sed command. You can do this directly in the definition like currencysym=\\\&\#x20B9\; or currencysym='\₹', or you can do it in your call to sed using builtin bash functionality. Instead of accessing ${currencysym}, reference ${currencysym/&/\&}.
You should use double-quotes around variables in your sed command to ensure that your environment variables are expanded, but you should not double-quote exclamation marks without escaping them.
Finally, you do not need to capture the original currency symbol since you are going to replace it. You should make your pattern more specific though since the * quantifier is greedy and will go to the last closing tag on the line if there is more than one:
sed 's|<string name="currencysym">[^<]*</string>|<string name="currencysym"><![CDATA['"${currencysym/&/\&}"']]></string>|' test.xml
Yields
<string name="currencysym"><![CDATA[₹]]></string>
EDIT
As #fedorqui points out, you can use this example to show off correct use of capture groups. You could capture the parts that you want to repeat exactly (the tags), and place them back into the output as-is:
sed 's|\(<string name="currencysym">\)[^<]*\(</string>\)|\1<![CDATA['"${currencysym/&/\&}"']]>\2|' test.xml
sed -i '' 's|\(<string name="currencysym">\)[^<]*<|\1<![CDATA[\₹]]><|g' YourFile
the group you keep in buffer is the wrong one in your code, i keep first part not the &
grouping a .* is not the same as all untl first < you need. especially with the goption meaning several occurence could occur and in this case everithing between first string name and last is the middle part (your group).
carrefull with & alone (not escaped) that mean 'whole search pattern find' in replacement part

Remove everything but brackets with sed, then indent

I have a huge file, a really huge file (some 600+MB of text). In fact they are jsons. Each json is on a new line and only comes in a few flavours.
They look like:
{"text":{"some nested words":"Some more","something else":"Yeah more stuff","some list":["itemA","ItemB","itemEtc"]},"One last object":{"a thing":"and it's value"}}
And what I want is it go through with sed, suck out the text, and for each nexted pair put in some indent, so we get:
{
-{
--[]
-}
--{}
-}
}
(I'm not 100% sure I got the nesting right on the output, I think it's right)
Is this possible? I saw this, which was the closest I could imagine it being, but that gets rid of the brackets two.
I've noticed the answer there uses braching, so I think I need that, and I'll need to do some kind of s/pattern/newline+tab/space/g type command but I can't figure out how or what to make that...
Could someone help please? It needn't be pure sed but that is prefered.
This will not be pretty... =) Here is my solution as a sed script. Notice that it requires that the first line notifies the shell how to invoke sed to execute our script. As you can see, the "-n" flag is used so we force sed only to print what we explicitly command it to through the "p" or "P" commands. The "-f" option tells sed to read the commands from a file, with the name following the option. As the file name of the script is concatenated by the shell into the final command, it will properly read commands from the script (ie. if you run "./myscript.sed" the shell will execute "/bin/sed -nf myscript.sed").
#!/bin/sed -nf
s/[^][{}]//g
t loop
: loop
t dummy
: dummy
s/^\s*[[{]/&/
t open
s/^\s*[]}]/&\
/
t close
d
: open
s/^\(\s*\)[[]\s*[]]/\1[]\
/
s/^\(\s*\)[{]\s*[}]/\1{}\
/
t will_loop
b only_open
: will_loop
P
s/.*\n//
b loop
: only_open
s/^\s*[[{]/&\
/
P
s/.*\n//
s/[][{}]/ &/g
b loop
: close
s/ \([][{}]\)/\1/g
P
s/.*\n//
b loop
Before we start, we must first strip everything into brackets and square brackets. That's the responsibility of the first "s" command. It tells sed to replace every character that isn't a bracket or a square bracket with nothing, ie. remove it. Notice that the square brackets in the match represent a group of characters to match, but when the first character inside them is a "^", it will actually match any character except the ones specified after the "^". Because we want to match the closing square bracket and we need to close with a square bracket the group of characters to ignore, we tell that a closing square bracket should be included in the group by making it the first character following the "^". We can then specify the rest of the characters: opening square bracket, open bracket and close bracket (group of ignored characters: "][{}"), and then close the group with the closing square bracket. I tried to detail more here because this can be confusing.
Now for the actual logic. The algorithm is pretty simple:
while line isn't empty
if line starts with optional spaces followed by [ or {
if after the [ or { there are optional spaces followed by a respective ] or }
print the respective pair, with only the indentation spaces, followed by a newline
else
print the opening square or normal bracket, followed by a newline
remove what was printed from the pattern space (a.k.a. the buffer)
add a space before every open or close bracket (normal or square)
end-if
else
remove a space before every open or close bracket (normal or square)
print the closing square or normal bracket, followed by a newline
remove what was printed from the pattern space
end-if
end-while
But there are a couple of quirks. First of all, sed doesn't support a "while" loop or an "if" statement directly. The closest we can get to is the "b" and "t" commands. The "b" command branches (jumps) to a predefined label, similar to a C goto statement. The "t" also branches to a predefined label, but only if a substitution has happened since the start of the script running on the current line or since the last "t" command. Labels are written with the ":" command.
Because it is very likely that the first command actually performs at least one substitution, the first "t" command that follows it will cause a branch. Because we need to test for some other substitutions, we need to make sure that the next "t" command won't automatically succeed because of that first command. That is why we start with a "t" command to a line just above it (ie. if it branches or not, it will still continue at the same point), so we can "reset" the internal flag used by "t" commands.
Because the "loop" label will be branched to from at least one "b" command, it is possible that the same flag will be set when the "b" is executed, because only "t" commands can clear it. Therefore, we need to do the same workaround to reset the flag, this time by using a "dummy" label.
We now start the algorithm by checking for the presence of an open square bracket or an open close bracket. Because we only want to test for their presence, we must replace the match with itself, which is what "&" represents, and sed will automatically set the internal flag for the "t" command if the match succeeds. If the match succeeds, we use the "t" command to branch into into "open" label.
If it doesn't succeed, we need to see if we match a close square or normal the bracket. The command is nearly identical, but now we append a newline after the closing bracket. We do this by adding an escaped newline (ie. a backslash followed by an actual newline) after where we place the match (ie. after the "&"). Similarly to above, we use the "t" command to branch to the "close" label if the match succeeds. If it doesn't succeed, we will consider the line as invalid, and promptly empty the pattern space (buffer) and restart the script on the next line, all with the single "d" command.
Entering the "open" label, we will first handle the case of a pair of matching open and close brackets. If we do match them, we will print them with the indentation spaces preceding them, without any spaces between them, and ending with a newline. There is one specific command for each type of bracket pair (square or normal), but they are analogous. Because we have to keep track of how many indentation spaces there are we must store them in a special "variable". We do this by using the group capture, which will store the part of the match that starts after the "(" and ends before the ")". Therefore, we use it to capture the spaces after the start of the line and before the open bracket. We then proceed to match the open bracket followed by spaces and the respective close bracket. When we write the replacement, we make sure to reinsert the spaces by using the special variable "\1", which contains the data stored by the first group capture in the match. We then write the respective pair of open and close brackets and the escaped newline.
If we managed to do any of the replacements, we must print what we have just written, remove it from the pattern space and restart the loop with the remaining characters of the line. Because of this, we first branch with the "t" command to the "will_loop" label. Otherwise, we branch to the "only_open" label, which will handle the case of only an open bracket, without the consecutive respective close bracket.
Inside the "will_loop" label, we just print everything in the pattern space up to the first newline (which we manually added) with the "P" command. We then manually remove everything up to that first newline, so we can proceed with the rest of the line. This is similar to what the "D" command does, but without restarting the execution of the script. Finally we branch to the start of the loop again.
Inside the "only_open" label, we match an open bracket in a similar fashion as previously, but now we rewrite it appended with a newline. We then print that line and remove it from the pattern space. Now we replace all brackets (open or close, square or normal) with itself preceded by a single space character. This is so we can increment the indentation. Finally we branch to the beginning of the loop again.
The final label "close" will handle a closing bracket. We first remove every single space before a bracket, effectively decrementing the indentation. To do this, we need to use captures, because although we want to match the space and the bracket that follows, we only want to write back the bracket. Finally, print everything up to the newline that we manually added before entering the "close" label, remove what we printed from pattern space and restart the loop.
Some observations:
This doesn't check for the syntactic correctness of the code (ie. {{[}] would be accepted)
It will add and remove indentation as brackets are encountered, regardless of their type. This means that when it adds an indentation, it will remove it even if the encountered close bracket is not of the same type.
Hope this helps, and sorry for the long post =)
This might work for you (GNU sed):
sed 's/[^][{}]//g;s/./&\n/g;s/.$//' file |
sed -r '/[[{]/{G;s/(.)\n(.*)/\2\1/;x;s/^/\t/;x;b};x;s/.//;x;G;s/(.)\n(.*)/\2\1/' |
sed -r '$!N;s/((\{).*(\}))|((\[).*(\]))/\2\5\3\6/;P;D'
Explanation:
The first sed command produces a stream of curly/square brackets each on its own line
The second sed command indents each bracket
The third sed command reduces those paired brackets to a single line
If your happy with correctly indented brackets, the third command can be omitted.
I think you're expected output should look like:
{
-{
--[]
-}
-{
-}
}
Here's one way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
FS=""
flag = 0
}
{
for (i=1; i<=NF; i++) {
if ($i == "{" || $i == "[") {
flag = flag + 1
build_tree(flag, $i)
printf (flag <=2) ? "\n" : ""
}
if ($i == "}" || $i == "]") {
flag = flag - 1
printf (flag >= 2) ? $i : \
build_tree(flag + 1, $i); \
printf "\n"
}
}
}
function build_tree (num, brace) {
for (j=1; j<=num - 1; j++) {
printf "-"
}
printf brace
}
I know this is an ancient thread and nobody is looking anyway, but there's a simpler way now.
cat file.txt | jq '.' | sed 's/ /-/g' | tr -dc '[[]{}()]\n-' | sed '/^-*$/d'
There are 2 spaces in the first sed.

Why does sed not replace overlapping patterns

I have a database unload file with field separated with the <TAB> character. I am running this file through sed to replace any occurences of <TAB><TAB> with <TAB>\N<TAB>. This is so that when the file is loaded into MySQL the \N in interpreted as NULL.
The sed command 's/\t\t/\t\N\t/g;' almost works except that it only replaces the first instance e.g. "...<TAB><TAB><TAB>..." becomes "...<TAB>\N<TAB><TAB>...".
If I use 's/\t\t/\t\N\t/g;s/\t\t/\t\N\t/g;' it replaces more instances.
I have a notion that despite the /g modifier this is something to do with the end of one match being the start of another.
Could anyone explain what is happening and suggest a sed command that would work or do I need to loop.
I know I could probably switch to awk, perl, python but I want to know what is happening in sed.
Not dissimilar to the perl solution, this works for me using pure sed
With #Robin A. Meade improvement
sed ':repeat;
s|\t\t|\t\n\t|g;
t repeat'
Explanation
:repeat is a label, used for branch commands, similar to batch
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global flag because if you have, say, 15 tabs, you will only need to loop twice, rather than 14 times.
t repeat means if the "s" command did any replaces, then goto the label repeat, else it goes onto the next line and starts over again.
So it goes like this. Keep repeating (goto repeat) as long as there is a match for the pattern of 2 tabs.
While the argument can be made that you could just do two identical global replaces and call it good, this same technique could work in more complicated scenarios.
As #thorn-blake points out, sed just doesn't support advanced features like lookahead, so you need to do a loop like this.
Original Answer
sed ':repeat;
/\t\t/{
s|\t\t|\t\n\t|g;
b repeat
}'
Explanation
:repeat is a label, used for branch commands, similar to batch
/\t\t/ means match the pattern 2 tabs. If the pattern it matched, the command following the second / is executed.
{} - In this case the command following the match command is a group. So all of the commands in the group are executed if the match pattern is met.
s|\t\t|\t\n\t|g; - Standard replace 2 tabs with tab-newline-tab. I still use the global because if you have say 15 tabs, you will only need to loop twice, rather than 14 times.
b repeat means always goto (branch) the label repeat
Short version
Which can be shortened to
sed ':r;s|\t\t|\t\n\t|g; t r'
# Original answer
# sed ':r;/\t\t/{s|\t\t|\t\n\t|g; b r}'
MacOS
And the Mac (yet still Linux/Windows compatible) version:
sed $':r\ns|\t\t|\t\\\n\t|g; t r'
# Original answer
# sed $':r\n/\t\t/{ s|\t\t|\t\\\n\t|g; b r\n}'
Tabs need to be literal in BSD sed
Newlines need to be both literal and escaped at the same time, hence the single slash (that's \ before it is processed by the $, making it a single literal slash ) plus the \n which becomes an actual newline
Both label names (:r) and branch commands (b r when not the end of the expression) must end in a newline. Special characters like semicolons and spaces are consumed by the label name/branch command in BSD, which makes it all very confusing.
I know you want sed, but sed doesn't like this at all, it seems that it specifically (see here) won't do what you want. However, perl will do it (AFAIK):
perl -pe 'while (s#\t\t#\t\n\t#) {}' <filename>
As a workaround, replace every tab with tab + \N; then remove all occurrences of \N which are not immediately followed by a tab.
sed -e 's/\t/\t\\N/g' -e 's/\\N\([^\t]\)/\1/g'
... provided your sed uses backslash before grouping parentheses (there are sed dialects which don't want the backslashes; try without them if this doesn't work for you.)
Right, even with /g, sed will not match the text it replaced again. Thus, it's read <TAB><TAB> and output <TAB>\N<TAB> and then reads the next thing in from the input stream. See http://www.grymoire.com/Unix/Sed.html#uh-7
In a regex language that supports lookaheads, you can get around this with a lookahead.
Well, sed simply works as designed. The input line is scanned once, not multiple times. Maybe it helps to look at the consequences if sed used rescanning the input line to deal with overlapping patterns by default: in this case even simple substitutions would work quite differently--some might say counter-intuitively--, e.g.
s/^/ / inserting a space at the beginning of a line would never terminate
s/$/foo/ appending foo to each line - likewise
s/[A-Z][A-Z]*/CENSORED/ replacing uppercase words with CENSORED - likewise
There are probably many other situations. Of course these could all be remedied with, say, a substitution modifier, but at the time sed was designed, the current behavior was chosen.

Resources