skip over a pattern in sed - bash

I wrote a correctly working sed script which replaces multiple spaces with single space between tokens (it skips lines with # or //) :
#!/bin/sed -f
/.*#/ !{
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
i run it on ubuntu like this: ./spaces.sed < spa.txt
spa.txt:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
now i want to add the functionality that script should skip over lines between a pattern (pattern can be distributed in multiple lines). This is the pattern: /* and */
I tried many things but of no use:
#!/bin/sed -f
/.*#/ !{
/\/\*/,/\*\// {
/\/\*/n #it skips successfully the /* line
n #also skips next line
/\*\// !{
}
}
/\/\//n
# handle more than one space between tokens
s/\([^ ]\)\s\+/\1 /g
}
but script isn't working as expected.
Expected output:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
suggestions?
Thanks

I'd re-engineer the script a bit, to handle # and // comments on their own. With the /* … */ comments, you have to deal with single-line and multi-line variants separately. I'd also use the [[:space:]] notation to spot spaces or tabs. I prefer to avoid backslashes (an aversion caused by working with troff in the days of my youth — if you've never needed 16 backslashes in a row to get the desired effect, you've not suffered enough), so I use \%…% to choose the % character as the search marker instead of / (which means there's no need to escape the slashes in the pattern with a backslash), and I use [*] instead of \*. The { p; d; } notation prints the current line and then deletes it and moves onto the next line. (Using n appends the next line to the current line; it isn't what you need.). The second semicolon isn't required by GNU sed but is by BSD (macOS) sed. The spaces in those braces are optional but make it easier to read.
Putting this together, you might have spaces.sed like this:
#!/bin/sed -f
# Comments with a #
/#/ { p; d; }
# Comments with //
\%//% { p; d; }
# Single line /* ... */ comments
\%/[*].*[*]/% { p; d; }
# Multi-line /* ... */ comments
\%/[*]%,\%[*]/% { p; d; }
s/\([^[:space:]]\)[[:space:]]\{2,\}/\1 /g
On your sample data (thanks for including it!), this produces:
/** spa.txt text
date : some date
hih+jjhh jgjg
if ( hjh>=hjhjh )
y **/
# this is a comment
// this is a comment
lines begins here ;
/****** this line is comment ****/
some more lines
// again comment
more lines words
/** again multi line co
mmment it
comment line
follows till here**/
file ends
That looks like what you wanted.
Limitations
It doesn't remove multiple spaces at the start of a line.
the leading blanks are not removed.
If you have a line with multiple spaces and // or #, the multiple spaces remain:
these spaces // survive
so do # these
If you have multiple single line comments on a single line, you don't get spaces removed in between them:
/* these */ spaces are not /* removed */
If you have a single-line comment and the start of a multi-line comment on a single line, the multi-line comment is not spotted. Similarly, if you have a multi-line comment that ends on a line and has a single-line comment starting after it, then if there are any multiple spaces between the end of the one comment and the start of the next, they are not handled.
/* this */ is not /* handled
very well */ nor are these /* spaces */
This doesn't deal with the subtleties of backslash-newline in the middle of a start or end comment symbol, nor with backslash-newline at the end of a // comment. Only brain-dead programs (or programmers) produce such comments, so it shouldn't be a real problem. Fortunately, you're not writing a compiler; those have to deal with the nonsense. And don't get me started on trigraphs!
It doesn't handle comment-like sequences inside strings (or multi-character character constants):
"/* this is not a comment */"
'/*', ' ', '*/'
However, most of these issues are subtle enough that you're probably OK without dealing with them. If you must deal with them, then you need a program, not a sed script (assuming you value your sanity).

Related

How to print comments in lex?

So the title might be a little bit misleading, but I can't think of any better way to phrase it.
Basically, I'm writing a lexical-scanner using cygwin/lex. A part of the code reads a token /* . It the goes into a predefined state C_COMMENT, and ends when C_COMMENT"/*". Below is the actual code
"/*" {BEGIN(C_COMMENT); printf("%d: /*", linenum++);}
<C_COMMENT>"*/" { BEGIN(INITIAL); printf("*/\n"); }
<C_COMMENT>. {printf("%s",yytext);}
The code works when the comment is in a single line, such as
/* * Example of comment */
It will print the current line number, with the comment behind. But it doesn't work if the comment spans multiple lines. Rewriting the 3rd line into
<C_COMMENT>. {printf("%s",yytext);
printf("\n");}
doesn't work. It will result in \n printed for every letter in the comment. I'm guessing it has something to do with C having no strings or maybe I'm using the states wrong.
Hope someone will be able to help me out :)
Also if there's any other info you need, just ask, and I'll provide.
The easiest way to echo the token scanned by a pattern is to use the special action ECHO:
"/*" { printf("%d: ", linenum++); ECHO; BEGIN(C_COMMENT); }
<C_COMMENT>"*/" { ECHO; BEGIN(INITIAL); }
<C_COMMENT>. { ECHO; }
None of the above rules matches a newline inside a comment, because in (f)lex . doesn't match newlines:
<C_COMMENT>\n { linenum++; ECHO; }
A faster way of recognizing C comments is with a single regular expression, although it's a little hard to read:
[/][*][^*]*[*]+([^/*][^*][*]+)*[/]
In this case, you'll have to rescan the comment to count newlines, unless you get flex to do the line number counting.
flex scanners maintain a line number count in yylineno, if you request that feature (using %option yylineno). It's often more efficient and always more reliable than keeping the count yourself. However, in the action, the value of yylineno is the line number count at the end of the pattern, not at the beginning, which can be misleading for multiline patterns. A common workaround is to save the value of yylineno in another variable at the beginning of the token scan.

Sed function (Bash) - Changing way of commented lines

I have such task to do but I have no idea how to write it with sed function.
I have to change the way on commenting in a file from:
//something6
//something4
//something5
//something3
//something2
to
/*something6
* something4
* something5
* something3
* something2*/
from
//something6
//something4
//something5
//something3
//something2
to
/*something6
something4
something5
something3
something2*/
from
/*something6
* something4
* something5
* something3
* something2*/
to
//something6
//something4
//something5
//something3
//something2
from
/*something6
something4
something5
something3
something2*/
to
//something6
//something4
//something5
//something3
//something2
Those 4 patterns must be made by sed function (I guess but not sure about that).
Tried doing it but without luck. I can replace single words to other ones but how to change the way of commenting? No clue. Would be very gratefull for help and assisstance.
Given that the task is:
Please write a script that allows to change style of comments in source files for example : /* .... */ goes to // .... The style of comment is an argument of the script.
I have tried to use just typical:
sed -i 's/'"$lookingfor"'/'"$changing"'/g' $filename
In this context, either $lookingfor or $changing or both will contain slashes, so that simple formulation doesn't work, as you correctly observe.
The conversion of // comments to /* comments is easy as long as you know that you can choose an arbitrary character to separate the sections of the s/// command, such as %. So, for example, you could use:
sed -i.bak -e 's%// *\(.*\)%/*\1 */%'
This looks for a double-slash followed by zero or more spaces and anything and converts it to /* anything */.
The conversion of /* comments is much harder. There are two cases to be concerned about:
/* A single line comment */
/*
** A multiline comment
*/
That's before you get into:
/* OK */ "/* OK */" /* Really?! */
which is a single line containing two comments and a string containing text that outside a string would look like a comment. This I am studiously ignoring! Or, more accurately, I am studiously deciding that it will be OK when converted to:
// OK */ "/* OK */" /* Really?!
which isn't the same at all, but serves you right for writing convoluted C in the first place.
You can deal with the first case with something like:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }'
I have the grouping braces and the n command in there so that single line comments don't also match the second case:
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}'
The first line selects a range of lines between one matching /* and the next matching */. The \% tells sed to use the % instead of / as the search delimiter. There are three operations within the outer grouping { … }:
Convert /*anything into //anything and start on the next line.
Convert anything*/ into //anything and start on the next line.
Convert any other line so that it preserves leading blanks but puts // after them.
This is still ridiculously easy to subvert if the comments are maliciously formed. For example:
/* a comment */ int x = 0;
is mapped to:
// a comment int x = 0;
Fixing problems like that, and the example with a string, is something I'd not even start trying in sed. And that's before you get onto the legal but implausible C comments, like:
/\
\
* comment
*\
\
/
/\
/\
noisiness \
commentary \
continued
Which contains just two comments (but does contain two comments!). And before you decide to deal with trigraphs (??/ is a backslash). Etc.
So, a moderate approximation to a C to C++ comment conversion is:
sed -e '\%/\*\(.*\)\*/% { s%%//\1%; n; }' \
-e '\%/\*%,\%\*/% {
\%/\*% { s%/\*\(.*\)%//\1%; n; }
\%\*/% { s%\(.*\)\*/%//\1%; n; }
s%^\( *\)%\1//%
}' \
-i.bak "$#"
I'm assuming you aren't using a C shell; if you are, you need more backslashes at the ends of the lines in the script so that the multi-line single-quoted sed command is treated correctly.

What is a regular expression for finding lines with uncommented Java code?

I'm working on a simple Ruby program that should count of the lines of text in a Java file that contain actual Java code. The line gets counted even if it has comments in it, so basically only lines that are just comments won't get counted.
I was thinking of using a regular expression to approach this problem. My program will just iterate line by line and compare it to a "regexp", like:
while line = file.gets
if line =~ regex
count+=1
end
end
I'm not sure what regexp format to use for that, though. Any ideas?
Getting the count for "Lines of code" can be a little subjective. Should auto-generated stuff like imports and package name really count? A person usually didn't write it. Does a line with just a closing curly brace count? There's not really any executing logic on that line.
I typically use this regex for counting Java lines of code:
^(?![ \s]*\r?\n|import|package|[ \s]*}\r?\n|[ \s]*//|[ \s]*/\*|[ \s]*\*).*\r?\n
This will omit:
Blank lines
Imports
Lines with the package name
Lines with just a }
Lines with single line comments //
Opening multi-line comments ((whitespace)/* whatever)
Continuation of multi-line comments ((whitespace)* whatever)
It will also match against either \n or \r\n newlines (since your source code could contain either depending on your OS).
While not perfect, it seems to come pretty close to matching against all, what I would consider, "legitimate" lines of code.
count = 0
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{/\*} .. ln =~ %r{\*/})
count += 1
end
There's only a problem with lines that have a multilines comment but also code, for example:
someCall(); /* Start comment
this a comment
even this
*/ thisShouldBeCounted();
However:
imCounted(); // Comment
meToo(); /* comment */
/* comment */ yesImCounted();
// i'm not
/* Nor
we
are
*/
EDIT
The following version is a bit more cumbersome but correctly count all cases.
count = 0
comment_start = false
file.each_line do |ln|
# Manage multiline and single line comments.
# Exclude single line if and only if there isn't code on that line
next if ln =~ %r{^\s*(//|/\*[^*]*\*/$|$)} or (ln =~ %r{^\s*/\*} .. ln =~ %r{\*/}) or (comment_start and not ln.include? '*/')
count += 1 unless comment_start and ln =~ %r{\*/\s*$}
comment_start = ln.include? '/*'
end

Using regex to find all code identified with 4 spaces

Given a textarea, similar to StackOverflow, I'd like to wrap code (indented by 4 spaces) with a pre/code block. I'm trying to use the following regex to find the code:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE folowed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE folowed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
But this isn't working, here's the example in Rubular:
http://rubular.com/r/l5faSjR8ya
Any suggestions on how to have this Regex, match the code allow me to wrap a pre/code tags around the code? Thanks
I think there is an escape out of the code mode with any trailing newline not followed by tab or 4 spaces. Not sure but successive newlines would not be included in the code block.
I don't get Ruby's regex options too well, but this seems to work: http://rubular.com/r/BlbreoO3sn
((?:^(?:[ ]{4}|\t).*$(?:\r?\n|\z))+) Theorhetically, its in multi-line mode.
Just make the replacement <pre>\1</pre>
EDIT
#Rachela Meadows - After further examination, this is a fairly difficult regex.
I managed to exactly duplicate the functionality of the <pre><code> block features of the online editor here on SO.
After obtaining each block and before wrapping in a <pre><code>, all markup entities should be converted (ie; like < to <, etc). That being said, I didn't do that step in the Ruby code sample below. I do have the regex's to do that though.
A special note about trimming: The main regex below does not include residual trailing newlines. Nor does the SO functionality. So the code block is correct top to bottom.
However, the leading 4 spaces (or tab) that could be contained in the body can't be trimmed (and they should be) in the main regex. For that it needs a callback.
Playing around with the gsub block mode, its easy to trim those leading spaces/tab.
Let me know if you have any problems with this.
Links -
Rubular (for the regex): http://rubular.com/r/pp9oRLQ0xo
Ideone (for the working Ruby code): http://ideone.com/aA9it
Regex compressed -
(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)
Regex expanded -
(^\s*$\n|\A) # Capt grp 1, block is preceeded by a blank line or begin of string
( # Begin "Capture group 2", start of pre/code block
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # First line of code block (note - lines must contain at least 1 non-whitespace character)
(?: # Start "Optionally, get more lines of code"
(?: ^ \s* $ \n? )* # Many optional blank lines
^(?:[ ]{4}|\t) .* [^\s] .* $ \n? # Another line of code
)* # End "Optionally, get more lines of code", do 0 or more times
) # End "Capture group 2", end of pre/code block
Ruby code -
regex = /(^\s*$\n|\A)(^(?:[ ]{4}|\t).*[^\s].*$\n?(?:(?:^\s*$\n?)*^(?:[ ]{4}|\t).*[^\s].*$\n?)*)/;
data = '
Hello Worldsasdasdffasdfasdf asdf
thisdqweee
asdfasdfasdfasdf
sdfg
#YYYY {
height: 100%;
min-height: 800px;
margin-right: 20px;
position: relative;
}
#ZZZZZZ {
height: 100%;
overflow: hidden;
}';
# ---
result = data.gsub(regex) {
||
x=$2;
## Construct the return value '\1<pre&gt<code&gt\2</code&gt</pre&gt'.
## But, trim each line with 1 to 4 leading spaces (or a tab with regex on the bottom).
## They are not necessary now, they are replaced with a code block.
$1 + '<pre&gt<code&gt' + x.gsub(/^[ ]{1,4}/, '') + '</code&gt</pre&gt'
};
# Note - Tabs can be trimed too, use : x.gsub(/^(?:[ ]{1,4}|\t)/,'') in the above
print result;
If you're looking to match full lines, don't explicitly match for (?:\r?\n)+, rather use ^ and $. Try
(\r?\n)((?:(?:^[ ]{4}|\t).*$)+)(?=\r?\n)
Im think your pattern require two new lines in the beginning to match.
Maybe like this? ((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)?
$ is used to match if last line is indented and have not new line)
http://rubular.com/r/Vg9HnJpjbw
Ruby:
s = "before\n indent1\n indent2\nmiddle\n indent1\nafter"
p s.gsub(/((?:(?:[ ]{4}|\t).*(?:\r?\n|$))+)/x, '<pre>\1</pre>')
Output:
"before\n<pre> indent1\n indent2\n</pre>middle\n<pre> indent1\n</pre>after"
I think one of your newline captures is redundant. You can use ^ and $ with the s flag turned off to match EOL rather than EOL, this is a better pattern than trying to match newlines.
Try this pattern:
/(?:^(?:[ ]{4}|\t).*$[\n\r]*)+/

How do I fix this multiline regular expression in Ruby?

I have a regular expression in Ruby that isn't working properly in multiline mode.
I'm trying to convert Markdown text into the Textile-eque markup used in Redmine. The problem is in my regular expression for converting code blocks. It should find any lines leading with 4 spaces or a tab, then wrap them in pre tags.
markdownText = '# header
some text that precedes code
var foo = 9;
var fn = function() {}
fn();
some post text'
puts markdownText.gsub!(/(^(?:\s{4}|\t).*?$)+/m,"<pre>\n\\1\n</pre>")
Intended result:
# header
some text that precedes code
<pre>
var foo = 9;
var fn = function() {}
fn();
</pre>
some post text
The problem is that the closing pre tag is printed at the end of the document instead of after "fn();". I tried some variations of the following expression but it doesn't match:
gsub!(/(^(?:\s{4}|\t).*?$)+^(\S)/m, "<pre>\n\\1\n</pre>\\2")
How do I get the regular expression to match just the indented code block? You can test this regular expression on Rubular here.
First, note that 'm' multi-line mode in Ruby is equivalent to 's' single-line mode of other languages. In other words; 'm' mode in Ruby means: "dot matches all".
This regex will do a pretty good job of matching a markdown-like code section:
re = / # Match a MARKDOWN CODE section.
(\r?\n) # $1: CODE must be preceded by blank line
( # $2: CODE contents
(?: # Group for multiple lines of code.
(?:\r?\n)+ # Each line preceded by a newline,
(?:[ ]{4}|\t).* # and begins with four spaces or tab.
)+ # One or more CODE lines
\r?\n # CODE folowed by blank line.
) # End $2: CODE contents
(?=\r?\n) # CODE folowed by blank line.
/x
result = subject.gsub(re, '\1<pre>\2</pre>')
This requires a blank line before and after the code section and allows blank lines within the code section itself. It allows for either \r\n or \n line terminations. Note that this does not strip the leading 4 spaces (or tab) before each line. Doing that will require more code complexity. (I am not a ruby guy so can't help out with that.)
I would recommend looking at the markdown source itself to see how its really being done.
/^(\s{4}|\t)+.+\;\n$/m
works a little better, still picks up a newline that we don't want.
here it is on rubular.
This is working for me with your sample input.
markdownText.gsub(/\n?((\s{4}.+)+)/, "\n<pre>#{$1}\n</pre>")
Here's another one that captures all the indented lines in a single block
((?:^(?: {4}|\t)[^\n]*$\n?)+)

Resources