I am writing a script for formatting a Fortran source code.
Simple formatting, like having all keywords in capitals or in small letters, etc.
Here is the main command
sed -i -e "/^\!/! s/$small\s/$cap /gI" $filein
It replaces every keyword $small (followed by a space) with $cap. The replacement happens only if the line does not start with "!".
It does what it should. Question:
How can I avoid replacing anything that follows a "!" encountered in the middle of a line?
Or more generally, how to replace patterns everywhere, but not after a specific symbol, which can be either in the beginning of the line or somewhere else.
Example:
Program test ! It should not change the next program to caps
! Hi, is anything changing here? like program ?
This line does not have any key words
This line has Program and not exclamation mark.
"program" is a keyword. After running the script the result is:
PROGRAM test ! It should not change the next PROGRAM to caps
! Hi, is anything changed here? like program ?
This line does not have any key words
This line has PROGRAM and not exclamation mark.
I want:
PROGRAM test ! It should not change the next program to caps
! Hi, is anything changed here? like program ?
This line does not have any key words
This line has PROGRAM and not exclamation mark.
So far, I've failed to find a nice solution, which does the trick, hopefully with the sed command.
The typical way in sed is to:
split the string into two parts and save one part in hold space,
do the operations on pattern space,
get the hold space back and reassemble the line for output.
Something along these lines (GNU sed, for the I flag):
sed '/!/!{s/program/PROGRAM/gI;b};h;s/[^!]*!//;x;s/!.*//;s/program/PROGRAM/gI;G;s/\n/!/'
/!/!{s/program/PROGRAM/gI;b} - if the line has no !, substitute on the whole line, print it and start over.
h;s/[^!]*!//;x;s/!.*// - put the part after the first ! in hold space and the part before it in pattern space
s/program/PROGRAM/gI; - do the substitution on part of the string
G;s/\n/!/ - grab the part from hold space and shuffle output - it's easy here.
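As a quick check, the split, substitute and rejoin steps can be run on a line from the question. This is a minimal sketch, assuming GNU sed (the I flag is a GNU extension). The function name upcase_before_bang is made up for illustration, and this variant splits at the first ! and also converts lines that contain no ! at all.

```shell
# Hypothetical helper: uppercase "program" only before the first "!".
upcase_before_bang() {
  sed '
    /!/!{
      # no "!" on this line: convert everywhere and start the next cycle
      s/program/PROGRAM/gI
      b
    }
    # copy the line to hold space, then keep only the text after the first "!"
    h
    s/[^!]*!//
    # swap: hold space = comment text, pattern space = original line
    x
    # keep only the code part and convert it
    s/!.*//
    s/program/PROGRAM/gI
    # append newline + comment text, then turn that newline back into "!"
    G
    s/\n/!/
  '
}

printf '%s\n' 'Program test ! the next program stays' | upcase_before_bang
```

The call at the end should print "PROGRAM test ! the next program stays"; a line that starts with ! passes through unchanged because its code part is empty.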
Assumptions:
OP needs to convert multiple keywords to uppercase
keywords to be capitalized do not include white space (eg, program name will need to be processed as two separate strings program and name)
input delimiter is white space
keywords with 'attached' non-alphanums will be ignored (eg, Program, will be ignored since , will be picked up as part of string) unless OP specifically includes the non-alphanum as part of the keyword definition (eg, keywords includes Program,)
all keywords to be converted to uppercase (ie, not going to worry about any flags to switch between lowercase, uppercase, camelcase, etc)
Sample input data:
$ cat source.txt
Program test ! It should not change the next program to caps # change first 'Program'
! Hi, is anything changing here? like program or MarK? # change nothing
This line does not have any key words ! except here - pRoGraM Mark # change nothing
This line has Program and not exclamation mARk plus MarKer. # change 'Program' and 'mARk' but not MarKer
Hi, hi, hI # change 'Hi,' and 'hi,' but not 'hI'
List of keywords provided in a separate file (whitespace delimited);
$ cat keywords.dat
program
mark hi, # 2 separate keywords: 'mark' and 'hi,' (comma included)
One awk idea:
awk -v comment="!" ' # define character after which conversions are to be ignored
FNR==NR { for ( i=1; i<=NF; i++) # first file contains keywords; process each field as a separate keyword
keywords[toupper($i)] # convert to uppercase and use as index in associative array keywords[]
next
}
{ for ( i=1; i<=NF; i++ ) # second file, process each field separately
{ if ( $i == comment ) # if field is our comment character then stop processing rest of line else ...
break
if ( toupper($i) in keywords ) # if current field is a keyword then convert to uppercase
$i=toupper($i)
}
print # print the current line
}
' keywords.dat source.txt
This generates:
PROGRAM test ! It should not change the next program to caps
! Hi, is anything changing here? like program or MarK?
This line does not have any key words ! except here - pRoGraM Mark
This line has PROGRAM and not exclamation MARK plus MarKer.
HI, HI, hI
NOTES:
while GNU awk can be told to overwrite the input file (eg, awk -i inplace == sed -i), this will require a different approach for processing the keywords.dat file (to keep from overwriting with nothing)
(quite a bit) of additional logic could be added to support uppercase vs lowercase vs camelcase vs whatever ... ignore or include non-alphanums in comparisons ... using multiple/different 'comment' characters ... standardizing other portions of (Fortran) code (eg, indentation) ... etc
This might work for you (GNU sed):
small='Program ' caps='PROGRAM '
sed -E ':a;s/^([^!]*)('"$small"')/\1\n/;ta;s/\n/'"$caps"'/g' file
Replace any occurrence of the variable $small before the symbol ! with a newline, then replace all newlines by the variable $caps.
N.B. The newline is chosen because it cannot normally exist in any line presented by sed, as it is the delimiter sed uses to present lines in the pattern space. Secondly, the words matching $small are iteratively replaced by a newline, then all newlines are globally replaced by $caps. This allows the replacement to be a superset of the match. If this were not the order of operations, the iterative process might become an endless loop.
If $small is to represent a case insensitive match, add the i flag to the first substitution.
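The placeholder loop can be tried with the case-insensitive flag added. This is only a sketch, assuming GNU sed; the keyword pair and the function name upcase_placeholder are illustrative values, not something from the question.

```shell
# Assumed keyword pair for the demo; the trailing space is part of the match.
small='program '
caps='PROGRAM '

# Turn each case-insensitive match before the first "!" into a newline,
# one per loop iteration, then turn every newline into the uppercase keyword.
upcase_placeholder() {
  sed -E ':a;s/^([^!]*)('"$small"')/\1\n/I;ta;s/\n/'"$caps"'/g'
}

printf '%s\n' 'Program test ! the next program stays' | upcase_placeholder
```

Because [^!]* cannot cross a !, nothing after the first ! on a line is ever turned into a placeholder.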
I've tried the suggested options, but none of them worked as expected for the whole file.
I have ended up with multiple sed commands; I am sure that it is not the best solution, but it works for me and does what I need.
My main problem was to avoid replacement after "!" if it appears somewhere in the middle of the line.
So I switched this problem to the one I could handle.
sed -i -e "/^\!/! s/!/!c7u!!c7u!/" $filein # 1. If a line does NOT start with !, search next "!" and replace it with "!c7u!!c7u!"
sed -i "s/!c7u!/\n/" $filein # 2. Move that comment to a new line
for ((i=0; i<$nwords; i++ )); do # Loop through all keywords
word=${words[$i]} # Take a keyword from the list
small=${word,,} # Write it in small letters
cap=${word^^} # Write it in capitals
sed -i -e "/^\!/! s/$small\b/$cap/gI" $filein # 3. Actual replacement in lines not starting with "!"
done
sed -i -e :a -e '$!N;s/\n!c7u//;ta' -e 'P;D' $filein # 4. Undo step 1-2, moving inline comments back
On the UNIX command line we can do simple record-oriented file work using simple field delimiters (or field separators). Common delimiters are space, tab, or vertical bar, but any character can be the delimiter. The commands sort, join, cut, etc. all take the field delimiter as an option -t or -d, and the shell (Bourne or Bourne-again) uses the IFS variable for the read -a command to parse a line into an array, or for the set -- command to parse a line into the positional parameters $1, $2, ....
The simple field separator approach is easy, and the only thing to take care of is that the separator character does not occur in the data itself. Ideally not at all. This can work for specific data sets, but it cannot work in general. This is why in the UNIX shell and the C language (and from there C++, Java) backslash escape sequences are sometimes used to mark such separators as part of the data (typically "\ " when you have a file name with spaces, for example). But that isn't in any way supported by the record- and field-oriented commands such as sort, cut, and join.
Now, often we get to download a "comma separated values" (CSV) file, which is a format apparently emanating from the Windows world. In it the comma is used as the separator (a bad choice normally, because a comma is very likely to be found in the actual data values), and instead of an escape sequence, double quotes are used around a data field if it might contain commas (or even spaces). Then inside such a quoted value, if the quote is part of the value, it is "escaped" by doubling it "".
Now I am looking for the easiest way to transform a CSV file to a simple delimited file. Any delimiter character can be chosen that doesn't occur in the data.
The difficulty is that the CSV quotation rules require a very simple stateful parser. You are either inside or outside a quoted value. If inside, you need to read the repeated quote "" as a quote.
I could not find the best answer here, and a general internet search turned up some things, but they were incorrect or used too many tools.
Let's turn this into a contest. The most simple and elegant one-liner that runs on a bourne shell or bash with sed alone (and possibly grep and tr) wins the accepted answer. AWK is permitted if the result is more elegant and if it does not depend on one special version of AWK. Perl is not permitted nor a C program.
I will try my own answer of course.
UPDATE: People who don't even bother with sed and move right to awk are having the advantage obviously. If someone can do it elegantly in sed they would be the winner. My own attempt in sed is not elegant.
I have discovered that CSV files may contain line breaks inside quoted fields. That needs to be considered. Since we are trying to create a simple record & field format for UNIX shell processing these embedded line breaks should be converted to \n.
PS: people have asked: why a "one-liner"? It doesn't have to be strictly a one-liner; the point is that you are able to create it on the command line. Why not Perl? Because most UNIX systems come with the shell and sed and awk, but Perl needs to be installed (and there are all these different versions), same or worse for Python. Before I'd go with Perl or Python I would just write it in C. And no, we don't want just any language; it should run on a bare-bones UNIX setup without installing a bunch of stuff.
Starting with the vanilla awk CSV tokenizer: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
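Small modification to strip the outer quotes and replace each doubled double quote inside a quoted string with a single double quote.
#! /usr/bin/awk -f
# Requires GNU awk: FPAT and gensub() are gawk extensions.
BEGIN {
# A field is a run of non-commas, or a quoted string that may contain "" pairs.
FPAT = "([^,]+)|(\"([^\"]|\"\")+\")"
OFS = "|"
Q = "\""
}
{
for (i = 1; i <= NF; i++) {
v = $i
# Quoted field: drop the outer quotes and collapse "" to ".
if ( v ~ "^" Q ) v = gensub(Q Q, Q, "g", substr(v, 2, length(v)-2))
printf "%s%s", v, (i<NF?OFS:ORS)
}
}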
I'm still working to compact this into one liner ... It's going to be a long line :-).
My (first?) approach is according to the following outline:
1. Determine the best field separator (delimiter) character;
2. replace the (few) occurrences of the chosen delimiter with some (sequence of) other character(s) A that doesn't exist anywhere in the data;
3. replace any nested line breaks inside the quotes with \n;
4. replace the repeated quote "" with some (sequence of) other character(s) B that doesn't exist anywhere in the data;
5. replace the comma nested inside a quoted field with some (sequence of) other character(s) C that doesn't exist anywhere in the data;
6. remove the quotes around the quoted fields (i.e., remove all remaining quotes, as there shouldn't be any left);
7. replace the remaining commas with the chosen delimiter;
8. replace the replacement (sequence of) character(s) B for the repeated double quote with the single double quote;
9. replace the replacement (sequence of) character(s) C for the comma inside a quoted value with the comma.
That is it. Steps 2, 4, and 5 depend on determining character sequences that do not appear anywhere in the file. That could be ~~, ^^, or $$ or anything. So this is determined with a series of tests. For example:
fgrep '|' data.csv
and finding only a small number of hits, I now replace | with $$ because I determine that $$ does not occur at all:
fgrep '$$' data.csv
In the same way I determine the replacement for the repeated double quote "", say with ^^ and the comma nested inside the quotes I would replace with ##.
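The series of manual tests can be wrapped in a small loop. This is a sketch only: the function name pick_unused and the candidate list are assumptions for illustration, and grep -F is used as the equivalent of fgrep.

```shell
# Hypothetical helper: print the first candidate sequence that does not
# occur anywhere in the given file; return non-zero if all of them occur.
pick_unused() {
  file=$1
  shift
  for cand in "$@"; do
    # -F: fixed string, -q: quiet, -e: treat the next word as the pattern
    if ! grep -qF -e "$cand" "$file"; then
      printf '%s\n' "$cand"
      return 0
    fi
  done
  return 1
}

# Example call on a throwaway file.
sample=$(mktemp)
printf '%s\n' 'a$$b,"c|d"' 'x~~y' > "$sample"
pick_unused "$sample" '$$' '~~' '^^' '##'
rm -f "$sample"
```

Calling it once per role (delimiter stand-in, doubled-quote stand-in, nested-comma stand-in) with the already chosen sequences removed from the candidate list keeps the three replacements distinct.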
Now I have the data that I need. And with that, the plan above is almost done with:
sed <data.csv \
-e 's/|/$$/g' \
-e ???????????????? \
-e 's/""/^^/g' \
-e 's/???????/???????/g' \
-e 's/"//g' \
-e 's/,/|/g' \
-e 's/\^\^/"/g' \
-e 's/##/,/g'
You can see each of the numbered steps 2 to 9 in one line each of this sed command, so it's all very clear. Except for steps 3 and 5 with the ????????, the hardest of them all: replacing line breaks and commas nested inside the quotes with the chosen replacements \n and ## respectively.
How would I do that? I need a regex (that sed can actually do), which replaces a comma inside a quoted string with something else, and without getting the quotes confused.
If all we wanted to do is completely remove the quoted strings we could say
-e 's/,"[^"]*",/,REMOVED,/g' \
Instead I do:
-e 's/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g'
this would replace it once. I can now repeat that same sed command step many times to catch cases with more than one nested comma:
-e 's/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g'
-e 's/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g'
-e 's/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g'
-e 's/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g'
...
The problem is I don't know how often I have to replace this. But we can use a more advanced feature of sed: define a label and then jump back to the label when a replacement was made:
:c
s/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g
tc
defines a label "c" and, when a replacement was made, jumps back to the label. Or in short on one line:
:c;s/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g;tc
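To see the loop at work, here is a small run on a field with three nested commas. It is just a sketch; the function name protect_commas is made up, and ## is the stand-in chosen earlier.

```shell
# Each pass replaces one nested comma per quoted field; "tc" jumps back to
# the label ":c" for as long as the previous substitution changed something.
protect_commas() {
  sed ':c
s/,"\([^,"]*\),\([^"]*\)",/,"\1##\2",/g
tc'
}

printf '%s\n' 'x,"a,b,c,d",y' | protect_commas
```

This should print x,"a##b##c##d",y: the first pass handles the comma after a, the next two passes handle the others, and the fourth pass finds nothing left to replace, so the loop ends.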
Finally the joining of the lines separated by newline inside quotes is done with a similar trick:
-e ':n;$!N;s/,"\([^"]*\)\n/,"\1\\n/g;tn'
the only additional trick here is the $!N which is $ last line, $! all but the last line, and N is append next line to pattern space so that the regex can search for the line break \n and replace it with the literal \n.
LANG=C sed <data.csv \
-e 's/|/$$/g' \
-e ':n;$!N;s/,"\([^"]*\)\n/,"\1\\n/g;tn' \
-e 's/""/^^/g' \
-e ':c;s/,"\([^,"]*\),\([^"]*\)"/,"\1##\2"/g;tc' \
-e 's/"//g' \
-e 's/,/|/g' \
-e 's/\^\^/"/g' \
-e 's/##/,/g'
So this is now quite a concise approach compared to what I had in the first revision of this answer (see the previous versions for how much better it is now).
PS There may still be errors. Especially I do not currently allow my quoted values to appear as the first field, right now the opening quote " is only recognized after a comma.
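A quick way to sanity-check the whole pipeline is to wrap it in a function and feed it two records. This is a sketch under the same assumptions as above ($$, ^^ and ## do not occur in the data, and quoted values never come first on a line). The function name csv_to_pipe is made up, and the :c step is written here so that its replacement does not add a comma.

```shell
# Hypothetical wrapper around the step-by-step sed pipeline.
csv_to_pipe() {
  LANG=C sed \
    -e 's/|/$$/g' \
    -e ':n;$!N;s/,"\([^"]*\)\n/,"\1\\n/g;tn' \
    -e 's/""/^^/g' \
    -e ':c;s/,"\([^,"]*\),\([^"]*\)"/,"\1##\2"/g;tc' \
    -e 's/"//g' \
    -e 's/,/|/g' \
    -e 's/\^\^/"/g' \
    -e 's/##/,/g'
}

printf '%s\n' '1,"a,b",2' '3,"say ""hi"", ok",4' | csv_to_pipe
```

The two records should come out as 1|a,b|2 and 3|say "hi", ok|4, with the nested commas and the doubled quotes restored after the separators were swapped.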
Alternative solution, doing character-by-character processing and keeping state (z is true inside a quoted string). Needless to say, it assumes the input follows the rules above.
Not sure if this will qualify as one-liner. ~200 characters.
#! /usr/bin/awk -f
BEGIN {
Q="\""
FS=","
OFS="|"
}
{
n=split($0,a,"")
r=""
for (i=1;i<=n;i++ ) {
c=a[i]
if (c==Q) {
if (a[i+1]==Q) i++
else { z=!z ; c="" }
}
if (!z&&c==FS) { c=OFS }
r = r c
}
print r
}
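Since splitting on an empty separator is not something POSIX guarantees, the same state machine can also be sketched with substr(). This is a variant under stated assumptions, not the script above: the function name csv_to_psv and the variable names are made up, and a doubled quote is only treated as an escape while inside a quoted field.

```shell
# Hypothetical variant: walk the line one character at a time with substr().
csv_to_psv() {
  awk '
  {
    out = ""
    n = length($0)
    for (i = 1; i <= n; i++) {
      c = substr($0, i, 1)
      if (c == "\"") {
        # inside quotes, "" stands for one literal quote; skip the second one
        if (inq && substr($0, i + 1, 1) == "\"") { out = out c; i++ }
        # otherwise the quote only opens or closes a field and is dropped
        else inq = !inq
      }
      else if (c == "," && !inq) out = out "|"
      else out = out c
    }
    print out
  }'
}

printf '%s\n' '1,"a,b",2' '"x ""y""",,z' | csv_to_psv
```

For the two sample records this should print 1|a,b|2 and x "y"||z. The inq flag is deliberately not reset per line, so a quoted field that spans lines keeps its state, although the line break itself is printed as is.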
Trying to clean up several dozen redundant nagios config files, but sed isn't working for me (yes I'm fairly new to bash), here's the string I want to replace:
use store-service
host_name myhost
service_description HTTP_JVM_SYM_DS
check_command check_http!'-p 8080 -N -u /SymmetricDS/app'
check_interval 1
with this:
use my-template-service
host_name myhost
just the host_name should stay unchanged since it'll be different for each file. Any help will be greatly appreciated. Tried escaping the ' and !, but get this error -bash: !'-p: event not found
Thanks
Disclaimer: This question is somewhat light on info and rings a bit like "write my code for me". In good faith I'm assuming that it's not that, so I am answering in hopes that this can be used to learn more about text processing/regex substitutions in general, and not just to be copy-pasted somewhere and forgotten.
I suggest using perl instead of sed. While sed is often the right tool for the job, in this case I think Perl's better, for the following reasons:
Perl lets you easily do multi-line matches on a regex. This is possible with sed, but difficult (see this question for more info).
With multiple lines and complex delimiters and quote characters, sed starts to display different behavior depending on what platform you're using it on. For example, trying to do this with sed in "sorta multiline" mode gave me different results on OSX versus Linux (really GNU sed vs BSD sed). When using semi-advanced functionality like that, I'd stick with a tool that behaves consistently across platforms, which Perl does in this case.
Perl lets you deal with ASCII values and other special characters without a ton of "toothpick tower" escaping or subshelling. Since it's convenient to use ASCII values to match the single quotes in your pattern (we could use mixed double and single quotes instead, but that makes it harder to copy/paste this command into, say, a subshell or an eval'd part of a script), it's better to use a tool that supports this without extra hassle. It's possible with sed, but tricky; see this article for more info.
In sed/BRE, doing something as simple as a "one or more" match usually requires escaping special characters, aka [[:space:]]\{1,\}, which gets tedious. Since it's convenient to use a lot of repetition/grouping characters in this pattern, I prefer Perl for conciseness in this case, since it improves clarity of the matching code.
Perl lets you write comments in regex statements in one-liner mode via the x modifier. For big, multiline patterns like this one, having the pattern broken up and commented for readability really helps if you ever need to go back and change it. sed has comments too, but using them in single-pasteable-command mode (as opposed to a file of sed script code) can be tricky, and can result in less readable commands.
Anyway, following is the matcher I came up with. It's commented inline as much as I can make it, but the non-commented parts are explained here:
The -0777 switch tells perl to consume input files whole before processing them, rather than operating line-by-line. See perlrun for more info on this and the other flags. Thanks to @glennjackman for pointing this out in the comments on the original question!
The -p switch tells Perl to read its input one record at a time (with -0777 the whole input is a single record), run the supplied program on that record, and then print the record. Since our "program" is just a string substitution statement that edits the record in place, what gets printed is the input with the substitution applied.
The -e switch tells perl to evaluate the next string argument for a program to run, rather than finding a script file or similar.
Input is piped from mytext.txt, which could be a file containing your pattern. You could also pipe input to Perl e.g. via cat mytext.txt | perl ... and it would work exactly the same way.
The regex modifiers work as follows: the multiline m modifier makes ^ and $ match at line boundaries (the pattern below uses explicit \n instead of anchors, so m is harmless here but not strictly required), and the extended x modifier lets us have comments and turns off matching of literal whitespace, for clarity. You could get rid of comments and literal whitespace and splat it all into one line if you wanted, but good luck making any changes after you've forgotten what it does. See perlre for more info on these modifiers.
This command will replace the literal string you supplied, in a file that contains it (it can have more than just that string before/after it; only that block of text will be manipulated). It is less than literal in one minor way: it allows any number (one or more) of space characters between the first and second words in each line. If I remember Nagios configs, the number of spaces doesn't particularly matter anyway.
This command will not change the contents of a file it is supplied. If a file does not match the pattern, its contents will be printed out unchanged by this command. If it contains that pattern, the replaced contents will be printed out. You can write those contents to a new file, or do anything you like with them.
perl -0777pe '
# Use the pipe "|" character as an expression delimiter, since
# the pattern contains slashes.
s|
# "use", one or more space-equivalent characters, and then "store-service",
# on one line.
use \s+ store-service \n
# Open a capturing group.
(
# Capture the host_name line in its entirety.
host_name \s+ \S+
# Close the group and end the line.
) \n
service_description \s+ HTTP_JVM_SYM_DS \n
# Look for check_command, spaces, and check_http!, but keep matching on the
# same line.
check_command \s+ check_http!
# Look for a single quote character by ASCII value, since shell
# escaping these can be ugly/tricky, and makes your code less copy-
# pasteable in/out of scripts/subcommands.
\047
# Look for the arguments to check_http, delimited by explicit \s
# spaces, since we are in "extended" mode in order to be able to write
# these comments and the expression on multiple lines.
-p \s 8080 \s -N \s -u \s /SymmetricDS/app
# Look for another single quote and the end of the line.
\047 \n
check_interval \s+ 1\n
# Replace all of the matched text with the "use my-template-service" line,
# followed by the contents of the first matching group (the host_name line).
# You could capture the "use" statement in another group, or use e.g.
# sprintf() to align fields here instead of a big literal space line, but
# this is the simplest, most obvious way to get the replacement done.
|use my-template-service\n$1\n|mx
' < mytext.txt
Assuming you can glob for the files of interest, I would first limit the files you want to change to those that contain exactly five lines.
You can do that with Bash and awk:
for fn in *; do # make that glob apply to your files...
[[ -e "$fn" && -f "$fn" && -s "$fn" ]] || continue
line_cnt=$(awk 'FNR==NR{next}
END {print NR}' "$fn")
(( line_cnt == 5 )) || continue
# at this point you only have files with 5 lines of text...
done
Once you have done that, you can add another awk to the loop to make the replacements:
for fn in *; do
[[ -e "$fn" && -f "$fn" && -s "$fn" ]] || continue
line_cnt=$(awk -v l=5 'FNR==NR{next}
END {print NR}' "$fn")
(( line_cnt == 5 )) || continue
awk 'BEGIN{tgt["use"]="my-template-service"
tgt["host_name"]=""}
$1 in tgt { if (tgt[$1]=="") s=$2
else s=tgt[$1]
printf "%-33s%s\n", $1, s
}
' "$fn"
done
This is a GNU sed solution. Back up your files before testing.
#!/bin/bash
# Escape with a backslash every special character in this string ($, ^, /, {, }, etc.)
# that must be interpreted literally rather than as regex syntax.
# Of the characters in that list, your original string contains only slashes.
# Instead of escaping them, I changed sed's s/pattern/replace/ command
# to s|pattern|replace|. You can pick any other delimiter that suits you better.
needle="use\s{1,}store-service\n\
host_name\s{1,}myhost\n\
service_description\s{1,}HTTP_JVM_SYM_DS\n\
check_command\s{1,}check_http!'-p 8080 -N -u /SymmetricDS/app'\n\
check_interval\s{1,}1"
replacement="use my-template-service\n\
host_name myhost"
# This echo command displays the generated substitute command,
# which will be used by sed
# uncomment it for viewing
# echo "s/$needle/$replacement/"
# for changing the file in place add the -i option.
sed -r "
/use\s{1,}store-service/ {
N;N;N;N;
s|$needle|$replacement|
}" input.txt
Input
one
two
use store-service
host_name myhost
service_description HTTP_JVM_SYM_DS
check_command check_http!'-p 8080 -N -u /SymmetricDS/app'
check_interval 1
three
four
Output
one
two
use my-template-service
host_name myhost
three
four
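Since the host name differs from file to file, one option is to capture that line instead of spelling it out. The following is a sketch of that idea, assuming GNU sed; the function name swap_block is made up, and the block is reduced to three lines to keep the example short.

```shell
# Hypothetical variant: join the block with N, keep whatever host_name
# line is there via a capture group, and drop the rest.
swap_block() {
  sed -E '/use[[:space:]]+store-service/ {
    N
    N
    s|use[[:space:]]+store-service\n(host_name[[:space:]]+[^[:space:]]+)\ncheck_interval[[:space:]]+1|use my-template-service\n\1|
  }'
}

printf '%s\n' one 'use store-service' 'host_name otherhost' 'check_interval 1' two |
  swap_block
```

For the full five-line block the same pattern would need two more N commands and the two extra lines in the regex, as in the script above.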
Please excuse if the question is too naive. I am new to shell scripting and am not able to find any good resource to understand the specifics. I am trying to make sense of a legacy script. Please can someone tell me what the following command does:
sed "s#s3AtlasExtractName#$i#g" load_xyz.sql >> load_abc.sql;
This command will replace all occurrences of s3AtlasExtractName with whatever $i is.
s - Substitute
# - Delimiter
s3AtlasExtractName - Word that needs substituting
# - Delimiter
$i - i variable that will be used to replace s3AtlasExtractName
# - Delimiter
g - Global Replace all instance of s3AtlasExtractName in a single line and not just the first occurrence of it
So this will parse through load_xyz.sql and change all occurrences of s3AtlasExtractName to the value of $i and append the whole of the contents of load_xyz.sql to a file called load_abc.sql with the sed substitutions.
sed is a command line stream editor. You can find information about it here:
http://www.computerhope.com/unix/used.htm
An easy example is shown below where sed is used to replace the word "test" with the word "example" in myfile.txt but output is sent to newfile.txt
sed 's/test/example/g' myfile.txt > newfile.txt
It seems that your script is performing a similar function by replacing content from the load_xyz.sql file and storing the result in a new file, load_abc.sql. Without more code I am just guessing, but it seems that the parameter $i could be used as a counter to insert similar but new values into the load_abc.sql file.
In short, this reads load_xyz.sql and replaces every occurrence of "s3AtlasExtractName" by whatever has been stored in the shell variable "i".
The long version is that sed accepts many subcommands with different formats. Any "simple" sed command will look like 'sed <subcommand> <file>'. The first letter of the subcommand tells you which operation sed is going to do with your files.
The "s" operation stands for "substitution" and is the most commonly used. It is followed by: separator, regexp to look for (POSIX basic regular expression syntax by default), separator, value to substitute, separator, flags. In your case, the separator is '#', which is pretty unusual but not forbidden, so the command substitutes the value of $i for every instance of 's3AtlasExtractName'. The 'g' flag tells sed to replace every occurrence of the pattern (the default is to replace only its first occurrence on each input line).
Finally, the use of "$i" inside a double-quote-delimited string tells the shell to actually expand the shell variable 'i' so you'll want to look for a shell statement setting that (possibly a 'for' statement).
Hope this helps.
edit: I focused on the 'sed' part and kinda missed the redirection part. The '>>' token tells the shell to take the output of the sed command (i.e. the contents of load_xyz.sql with all occurrences of s3AtlasExtractName replaced by the contents of $i) and append it to the file 'load_abc.sql'.
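To see the substitution and the append together, here is a self-contained sketch. The loop, the values of i and the one-line SQL template are guesses for illustration only; the legacy script may set $i differently.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)

# Assumed template: one statement that mentions the placeholder twice.
printf '%s\n' 'COPY s3AtlasExtractName FROM bucket/s3AtlasExtractName.csv;' \
  > "$workdir/load_xyz.sql"

# Hypothetical loop supplying $i; each pass appends one filled-in copy.
for i in orders customers; do
  sed "s#s3AtlasExtractName#$i#g" "$workdir/load_xyz.sql" >> "$workdir/load_abc.sql"
done

cat "$workdir/load_abc.sql"
```

Using # as the separator is handy when the replacement value might contain a /, such as a path; a value that contains # would break this command in the same way.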
Here is an example of nicely indented Python regex (taken from here):
charref = re.compile(r"""
&[#] # Start of a numeric entity reference
(
0[0-7]+ # Octal form
| [0-9]+ # Decimal form
| x[0-9a-fA-F]+ # Hexadecimal form
)
; # Trailing semicolon
""", re.VERBOSE)
Now, I would like to use the same technique for bash regexes (i.e. sed or grep), but can't find any reference to similar features so far. Is it even possible to indent (and comment) something like this?
echo "$MULTILINE" | sed -re 's/(expr1|expr2)|(expr3|expr4)/expr5/g'
You can use bash's line continuation, possibly:
echo "start of a line \
continues the previous line \
yet another continuation
oops. this is a brand new line"
Note the backslashes at the end of the first two lines. They essentially 'escape' the newline that would otherwise tell bash you're starting a new line, which would also implicitly terminate the statement being defined.
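Another way to get close to the Python layout is to build the pattern from shell variables, each with its own comment, and hand the assembled string to sed. This is only a sketch; the variable names, the function name strip_charrefs and the <ref> replacement text are made up for illustration, and -E assumes a sed that supports extended regular expressions.

```shell
# Pieces of the pattern, one per line, each with its own shell comment.
octal='0[0-7]+'        # Octal form
decimal='[0-9]+'       # Decimal form
hex='x[0-9a-fA-F]+'    # Hexadecimal form

# "&#", one of the three forms, then the trailing semicolon.
charref="&#(${octal}|${decimal}|${hex});"

# Hypothetical helper: replace every numeric entity reference with <ref>.
strip_charrefs() {
  sed -E "s/${charref}/<ref>/g"
}

printf '%s\n' 'a &#65; b &#x41; c &amp;' | strip_charrefs
```

The comments live in the shell rather than in the regex, so sed and grep never see them. The pieces must not contain the s command's delimiter (here /) unless it is escaped.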
How would you delete all comments using sed from a file(defined with #) with respect to '#' being in a string?
This helped out a lot except for the string portion.
If # always means comment, and can appear anywhere on a line (like after some code):
sed 's:#.*$::g' <file-name>
If you want to change it in place, add the -i switch:
sed -i 's:#.*$::g' <file-name>
This will delete from any # to the end of the line, ignoring any context. If you use # anywhere where it's not a comment (like in a string), it will delete that too.
If comments can only start at the beginning of a line, do something like this:
sed 's:^#.*$::g' <file-name>
If they may be preceded by whitespace, but nothing else, do:
sed 's:^\s*#.*$::g' <file-name>
These two will be a little safer because they likely won't delete valid usage of # in your code, such as in strings.
Edit:
There's not really a nice way of detecting whether something is in a string. I'd use the last two if that would satisfy the constraints of your language.
The problem with detecting whether you're in a string is that regular expressions can't do everything. There are a few problems:
Strings can likely span lines
A regular expression can't tell the difference between apostrophes and single quotes
A regular expression can't match nested quotes (these cases will confuse the regex):
# "hello there"
# hello there"
"# hello there"
If double quotes are the only way strings are defined, double quotes will never appear in a comment, and strings cannot span multiple lines, try something like this:
sed 's:#[^"]*$::g' <file-name>
That's a lot of pre-conditions, but if they all hold, you're in business. Otherwise, I'm afraid you're SOL, and you'd be better off writing it in something like Python, where you can do more advanced logic.
This might work for you (GNU sed):
sed '/#/!b;s/^/\n/;ta;:a;s/\n$//;t;s/\n\(\("[^"]*"\)\|\('\''[^'\'']*'\''\)\)/\1\n/;ta;s/\n\([^#]\)/\1\n/;ta;s/\n.*//' file
/#/!b if the line does not contain a # bail out
s/^/\n/ insert a unique marker (\n)
ta;:a jump to a loop label (resets the substitute true/false flag)
s/\n$//;t if marker at the end of the line, remove and bail out
s/\n\(\("[^"]*"\)\|\('\''[^'\'']*'\''\)\)/\1\n/;ta if the string following the marker is a quoted one, bump the marker forward of it and loop.
s/\n\([^#]\)/\1\n/;ta if the character following the marker is not a #, bump the marker forward of it and loop.
s/\n.*// the remainder of the line is comment, remove the marker and the rest of line.
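The marker walk can be tried on a couple of lines. This is just a wrapper around the command above, assuming GNU sed; the function name strip_unquoted_comment is made up.

```shell
# Hypothetical wrapper: the newline marker is bumped past quoted strings and
# ordinary characters until it reaches an unquoted "#" or the end of the line.
strip_unquoted_comment() {
  sed '/#/!b;s/^/\n/;ta;:a;s/\n$//;t;s/\n\(\("[^"]*"\)\|\('\''[^'\'']*'\''\)\)/\1\n/;ta;s/\n\([^#]\)/\1\n/;ta;s/\n.*//'
}

printf '%s\n' 'echo "# no" # yes' "x='a#b'" | strip_unquoted_comment
```

In the first line only the trailing "# yes" is removed (the space before it stays); the second line is left alone because its # sits inside single quotes.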
Since the asker provided no sample input, I will assume a couple of cases, and that the input file is a Bash script, because bash is one of the question's tags.
Case 1: entire line is the comment
The following should be sufficient in most cases:
sed '/^\s*#/d' file
It matches any line that has zero or more leading white-space characters (space, tab, or a few others, see man isspace) followed by a #, and deletes the line with the d command.
Any lines like:
# comment started from beginning.
# any number of white-space character before
# or 'quote' in "here"
They will be deleted.
But
a="foobar in #comment"
will not be deleted, which is the desired result.
Case 2: comment after actual code
For example:
if [[ $foo == "#bar" ]]; then # comment here
The comment part can be removed by
sed "s/\s*#[^\"']*$//" file
[^\"'] is used to prevent quoted-string confusion; however, it also means that comments containing a quotation mark ' or " will not be removed.
Final sed
sed "/^\s*#/d;s/\s*#[^\"']*$//" file
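The combined command can be checked against the cases discussed above. This is a small sketch, assuming GNU sed for \s; the function name strip_comments is made up.

```shell
# Hypothetical wrapper around the final command: delete whole-line comments,
# then strip a trailing comment that contains no quote characters.
strip_comments() {
  sed "/^\s*#/d;s/\s*#[^\"']*$//"
}

strip_comments <<'EOF'
# full-line comment
a="foobar in #comment"
if [[ $foo == "#bar" ]]; then # comment here
EOF
```

The first line is deleted, the second is untouched because a quote follows the #, and the third loses only " # comment here".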
To remove comment lines (lines whose first non-whitespace character is #) but not shebang lines (lines whose first characters are #!):
sed '/^[[:space:]]*#[^!]/d; /#$/d' file
The first argument to sed is a string containing a sed program consisting of two delete-line commands of the form /regex/d. Commands are separated by ;. The first command deletes comment lines but not shebang lines. The second command deletes any remaining empty comment lines. It does not handle trailing comments.
The last argument to sed is a file to use as input. In Bash, you can also operate on a string variable like this:
sed '/^[[:space:]]*#[^!]/d; /#$/d' <<< "${MYSTRING}"
Example:
# test.sh
S0=$(cat << HERE
#!/usr/bin/env bash
# comment
# indented comment
echo 'FOO' # trailing comment
# last line is an empty, indented comment
#
HERE
)
printf "\nBEFORE removal:\n\n${S0}\n\n"
S1=$(sed '/^[[:space:]]*#[^!]/d; /#$/d' <<< "${S0}")
printf "\nAFTER removal:\n\n${S1}\n\n"
Output:
$ bash test.sh
BEFORE removal:
#!/usr/bin/env bash
# comment
# indented comment
echo 'FOO' # trailing comment
# last line is an empty, indented comment
#
AFTER removal:
#!/usr/bin/env bash
echo 'FOO' # trailing comment
Supposing "being in a string" means "occurs between a pair of quotes, either single or double", the question can be rephrased as "remove everything after the first unquoted #". You can define the quoted strings, in turn, as anything between two quotes, excepting backslashed quotes. As a minor refinement, replace the entire line with everything up through just before the first unquoted #.
So we get something like [^\"'#] for the trivial case -- a piece of string which is neither a comment sign, nor a backslash, nor an opening quote. Then we can accept a backslash followed by anything: \\. -- that's not a literal dot, that's a literal backslash, followed by a dot metacharacter which matches any character.
Then we can allow zero or more repetitions of a quoted string. In order to accept either single or double quotes, allow zero or more of each. A quoted string shall be defined as an opening quote, followed by zero or more of either a backslashed arbitrary character, or any character except the closing quote: "\(\\.\|[^\"]\)*" or similarly for single-quoted strings '\(\\.\|[^\']\)*'.
Piecing all of this together, your sed script could look something like this:
s/^\(\([^\"'#]*\|\\.\|"\(\\.\|[^\"]\)*"\|'\(\\.\|[^\']\)*'\)*\)#.*/\1/
But because it needs to be quoted, and both single and double quotes are included in the string, we need one more complication. Recall that the shell allows you to glue strings together: "foo"'bar' gets replaced with foobar -- foo in double quotes, and bar in single quotes. Thus you can include single quotes by putting them in double quotes adjacent to your single-quoted string -- '"foo"'"'" is "foo" in single quotes next to ' in double quotes, thus "foo"'; and "' can be expressed as '"' adjacent to "'". And so a string containing both kinds of quotes, foo"'bar, can be quoted with 'foo"' adjacent to "'bar" or, perhaps more realistically for this case, 'foo"' adjacent to "'" adjacent to another single-quoted string 'bar', yielding 'foo"'"'"'bar'.
sed 's/^\(\(\\.\|[^\#"'"'"']*\|"\(\\.\|[^\"]\)*"\|'"'"'\(\\.\|[^\'"'"']\)*'"'"'\)*\)#.*/\1/' file
This was tested on Linux; on other platforms, the sed dialect may be slightly different. For example, you may need to omit the backslashes before the grouping and alternation operators.
Alas, if you may have multi-line quoted strings, this will not work; sed, by design, only examines one input line at a time. You could build a complex script which collects multiple lines into memory, but by then, switching to e.g. Perl starts to make a lot of sense.
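One way to sidestep the quote gluing is to keep the sed script in a file, where no shell quoting applies. This is a sketch, assuming GNU sed; the function name strip_after_unquoted_hash and the temporary script file are made up for illustration.

```shell
# Hypothetical helper: write the substitution to a temporary script file
# through a quoted here-document, then run it with sed -f.
strip_after_unquoted_hash() {
  script=$(mktemp)
  cat > "$script" <<'EOF'
s/^\(\(\\.\|[^\#"']*\|"\(\\.\|[^\"]\)*"\|'\(\\.\|[^\']\)*'\)*\)#.*/\1/
EOF
  sed -f "$script"
  rm -f "$script"
}

strip_after_unquoted_hash <<'EOF'
echo 'it # x' "y # z" # real comment
echo plain
EOF
```

Only "# real comment" is removed from the first line (the space before it stays), because both earlier # characters sit inside quoted strings; the second line has no # and is printed unchanged.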
As you have pointed out, sed won't work well if any parts of a script look like comments but actually aren't. For example, you could find a # inside a string, or the rather common $# and ${#param}.
I wrote a shell formatter called shfmt, which has a feature to minify code. That includes removing comments, among other things:
$ cat foo.sh
echo $# # inline comment
# lone comment
echo '# this is not a comment'
[mvdan@carbon:12] [0] [/home/mvdan]
$ shfmt -mn foo.sh
echo $#
echo '# this is not a comment'
The parser and printer are Go packages, so if you'd like a custom solution, it should be fairly easy to write a 20-line Go program to remove comments in the exact way that you want.
sed 's:^#\(.*\)$:\1:g' filename
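Supposing each comment line starts with a single #, the command above strips that leading # (it uncomments the line rather than deleting it). To delete such lines altogether, use sed '/^#/d' filename instead.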