In Mathematica, a comment starts with (* and ends with *), and comments can be nested. My current approach to scanning a comment with JFlex contains the following code:
%xstate IN_COMMENT
"(*" { yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
<IN_COMMENT> {
"(*" {yypushstate(IN_COMMENT); return MathematicaElementTypes.COMMENT;}
[^\*\)\(]+ {return MathematicaElementTypes.COMMENT;}
"*)" {yypopstate(); return MathematicaElementTypes.COMMENT;}
[\*\)\(] {return MathematicaElementTypes.COMMENT;}
. {return MathematicaElementTypes.BAD_CHARACTER;}
}
where the methods yypushstate and yypopstate are defined as
private final LinkedList<Integer> states = new LinkedList<>();
private void yypushstate(int state) {
states.addFirst(yystate());
yybegin(state);
}
private void yypopstate() {
final int state = states.removeFirst();
yybegin(state);
}
to give me the opportunity to track how many nested levels of comment I'm dealing with.
Unfortunately, this results in several COMMENT tokens for one comment, because I have to match nested comment starts and comment ends.
Question: Is it possible with JFlex to use its API with methods like yypushback or advance() etc. to return exactly one token over the whole comment range, even if comments are nested?
It seems the bounty was uncalled for as the solution is so simple that I just did not consider it. Let me explain. When scanning a simple nested comment
(* (*..*) *)
I have to track how many opening comment tokens I have seen so that, on the last real closing comment, I can finally return the whole comment as one token.
What I did not realise was that JFlex does not need to be told to advance to the next portion of input when it matches something. After careful review I saw that this is explained here, but somewhat hidden in a section I hadn't paid attention to:
Because we do not yet return a value to the parser, our scanner proceeds immediately.
Therefore, a rule in the flex file like this
[^\(\*\)]+ { }
reads all characters except those that could possibly be part of a comment start/end, does nothing, and advances to the next token.
This means that I can simply do the following. In the YYINITIAL state, I have a rule that matches a beginning comment, but it does nothing other than switch the lexer to the IN_COMMENT state. In particular, it does not return any token:
{CommentStart} { yypushstate(IN_COMMENT);}
Now we are in the IN_COMMENT state, and there I do the same: I eat up all characters but never return a token. When I hit a new opening comment, I carefully push another state onto the stack but do nothing else. Only when I hit the last closing comment do I know I'm leaving the IN_COMMENT state, and that is the only point where I finally return the token. Let's look at the rules:
<IN_COMMENT> {
{CommentStart} { yypushstate(IN_COMMENT);}
[^\(\*\)]+ { }
{CommentEnd} { yypopstate();
if(yystate() != IN_COMMENT)
return MathematicaElementTypes.COMMENT_CONTENT;
}
[\*\)\(] { }
. { return MathematicaElementTypes.BAD_CHARACTER; }
}
That's it. Now, no matter how deep your comment is nested, you will always get one single token that contains the entire comment.
Now, I'm embarrassed and I'm sorry for such a simple question.
Final note
If you are doing something like this, you have to remember that you only return a token once you hit the correct closing "character". Therefore, you should definitely add a rule that catches the end of file. In IDEA the default behavior is to just return the comment token, so you need another rule (useful or not, I want to end gracefully). Note that yyclearstack() below is, like yypushstate and yypopstate, a small self-written helper; it simply clears the state stack:
<<EOF>> { yyclearstack(); yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;}
When I first wrote this answer, I had not even realized that one of the existing answers was from the questioner himself. On the other hand, I seldom see a bounty in the rather small SO lex community, so this seemed worth learning enough Java and JFlex to write a sample:
/* JFlex scanner: to recognize nested comments in Mathematica style
*/
%%
%{
/* counter for open (nested) comments */
int open = 0;
%}
%state IN_COMMENT
%%
/* any state */
"(*" { if (!open++) yybegin(IN_COMMENT); }
"*)" {
if (open) {
if (!--open) {
yybegin(YYINITIAL);
return MathematicaElementTypes.COMMENT;
}
} else {
/* or return MathematicaElementTypes.BAD_CHARACTER;
/* or: throw new Error("'*)' without '(*'!"); */
}
}
<IN_COMMENT> {
. |
\n { }
}
<<EOF>> {
if (open > 0) {
final int unclosed = open;
/* This reset is obsolete if the scanner is instantiated anew
* for each invocation.
*/
open = 0; yybegin(YYINITIAL);
/* Notify about syntax error, e.g. */
throw new Error("Premature end of file! ("
+ unclosed + " open comments not closed.)");
}
return MathematicaElementTypes.EOF; /* just a guess */
}
There might be typos and stupid errors although I tried to be careful and did my best.
As a "proof of concept" I leave my original implementation here which is done with flex and C/C++.
This scanner
handles comments (with printf())
echoes everything else.
My solution is essentially based on the fact that flex rules may end with break or return. Therefore, the token is simply not returned until the pattern closing the outermost comment is matched. The content of comments is simply "recorded" in a buffer – in my case a std::string.
(AFAIK, string is even a built-in type in Java. Therefore, I decided to mix C and C++, which I usually wouldn't.)
My source scan-nested-comments.l:
%{
#include <cstdio>
#include <string>
// counter for open (nested) comments
static int open = 0;
// buffer for collected comments
static std::string comment;
%}
/* make never interactive (prevent usage of certain C functions) */
%option never-interactive
/* force lexer to process 8 bit ASCIIs (unsigned characters) */
%option 8bit
/* prevent usage of yywrap */
%option noyywrap
%s IN_COMMENT
%%
"(*" {
if (!open++) BEGIN(IN_COMMENT);
comment += "(*";
}
"*)" {
if (open) {
comment += "*)";
if (!--open) {
BEGIN(INITIAL);
printf("EMIT TOKEN COMMENT(lexem: '%s')\n", comment.c_str());
comment.clear();
}
} else {
printf("ERROR: '*)' without '(*'!\n");
}
}
<IN_COMMENT>{
. |
"\n" { comment += *yytext; }
}
<<EOF>> {
if (open) {
printf("ERROR: Premature end of file!\n"
"(%d open comments not closed.)\n", open);
return 1;
}
return 0;
}
%%
int main(int argc, char **argv)
{
if (argc > 1) {
yyin = fopen(argv[1], "r");
if (!yyin) {
printf("Cannot open file '%s'!\n", argv[1]);
return 1;
}
} else yyin = stdin;
return yylex();
}
I compiled it with flex and g++ in cygwin on Windows 10 (64 bit):
$ flex -oscan-nested-comments.cc scan-nested-comments.l ; g++ -o scan-nested-comments scan-nested-comments.cc
scan-nested-comments.cc:398:0: warning: "yywrap" redefined
^
scan-nested-comments.cc:74:0: note: this is the location of the previous definition
^
$
The warning appears due to %option noyywrap. I guess it does not mean any harm and can be ignored.
Now, I made some tests:
$ cat >good-text.txt <<EOF
> Test for nested comments.
> (* a comment *)
> (* a (* nested *) comment *)
> No comment.
> (* a
> (* nested
> (* multiline *)
> *)
> comment *)
> End of file.
> EOF
$ cat good-text.txt | ./scan-nested-comments
Test for nested comments.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
EMIT TOKEN COMMENT(lexem: '(* a (* nested *) comment *)')
No comment.
EMIT TOKEN COMMENT(lexem: '(* a
(* nested
(* multiline *)
*)
comment *)')
End of file.
$ cat >bad-text-1.txt <<EOF
> Test for wrong comment.
> (* a comment *)
> with wrong nesting *)
> End of file.
> EOF
$ cat bad-text-1.txt | ./scan-nested-comments
Test for wrong comment.
EMIT TOKEN COMMENT(lexem: '(* a comment *)')
with wrong nesting ERROR: '*)' without '(*'!
End of file.
$ cat >bad-text-2.txt <<EOF
> Test for wrong comment.
> (* a comment
> which is not closed.
> End of file.
> EOF
$ cat bad-text-2.txt | ./scan-nested-comments
Test for wrong comment.
ERROR: Premature end of file!
(1 open comments not closed.)
$
The Java traditional comment is defined in the sample grammar with
TraditionalComment = "/*" [^*] ~"*/" | "/*" "*"+ "/"
I suppose this expression should work for Mathematica comments too – although, being a plain regular expression, it can only match non-nested comments; handling nesting still requires a counter or a state stack as shown above.
Related
I'm writing a little Domain Specific Language for my program, using JUCE::JavascriptEngine as the scripting engine. This takes a string as input and then parses it, but I need to do some pre-processing on the string to adapt it from my DSL to JavaScript. The pre-processing mainly consists of wrapping some terms inside functions, and placing object names in front of functions. So, for instance, I want to do something like this:
take some special string input "~/1/2"...
wrap it inside a function: "find("~/1/2")"...
and then attach an object to it: "someObject.find("~/1/2")" (the object name has to be a variable).
I've been using regex for this (now I have two problems...). The regexes are getting complicated and unreadable, and they miss a lot of special cases. Since what I'm doing is grammatical, I thought I'd upgrade from regex to a proper parser (now I have three problems...). After quite a lot of research, I chose Boost.Spirit. I've been going through the documentation, but it's not taking me in the right direction. Can someone suggest how I might use this library to manipulate strings in the way I am looking for? Given that I am only trying to manipulate a string and am not interested in storing the parsed data, do I need to use karma for the output, or can I output the string with qi or x3 during the parsing process?
If I'm headed down the wrong path here, please feel free to re-direct me.
This seems too broad to answer.
What you're doing is parsing input, and transforming it to something else. What you're not doing is find/replace (otherwise you'd be fine using regular expressions).
Of course you can do what regular expressions do, but I'm not sure it buys you anything:
template <typename It, typename Out>
Out preprocess(It f, It l, Out out) {
namespace qi = boost::spirit::qi;
using boost::spirit::repository::qi::seek;
using namespace std::string_literals; // required for the "..."s literals below
auto passthrough = [&out](boost::iterator_range<It> ignored, auto&&...) {
for (auto ch : ignored) *out++ = ch;
};
auto transform = [&out](std::string const& literal, auto&&...) {
for (auto ch : "someObject.find(\"~"s) *out++ = ch;
for (auto ch : literal) *out++ = ch;
for (auto ch : "\")"s) *out++ = ch;
};
auto pattern = qi::copy("\"~" >> (*~qi::char_('"')) >> '"');
qi::rule<It> ignore = qi::raw[+(!pattern >> qi::char_)] [passthrough];
qi::parse(f, l, -qi::as_string[pattern][transform] % ignore);
return out;
}
The nice thing about this way of writing it, is that it will work with any source iterator:
for (std::string const input : {
R"(function foo(a, b) { var path = "~/1/2"; })",
})
{
std::cout << "Input: " << input << "\n";
std::string result;
preprocess(begin(input), end(input), back_inserter(result));
std::cout << "Result: " << result << "\n";
}
std::cout << "\n -- Or directly transformed stdin to stdout:\n";
preprocess(
boost::spirit::istream_iterator(std::cin >> std::noskipws), {},
std::ostreambuf_iterator<char>(std::cout));
See it Live On Coliru, printing the output:
Input: function foo(a, b) { var path = "~/1/2"; }
Result: function foo(a, b) { var path = someObject.find("~/1/2"); }
-- Or directly transformed stdin to stdout:
function bar(c, d) { var path = someObject.find("~/1/42"); }
But this is very limited, since it will not even do the right thing if such strings appear inside comments or multiline strings, etc.
So instead you probably want a dedicated library that knows how to parse javascript and use it to do your transformation, such as (one of the first hits when googling tooling library preprocess javascript transform): https://clojurescript.org/reference/javascript-library-preprocessing
I have several text files (utf-8) that I want to process in shell script. They aren't exactly the same format, but if I could only break them up into edible chunks I can handle that.
This could be programmed in C or python, but I prefer not.
EDIT: I wrote a solution in C; see my own answer. I think this may be the simplest approach after all. If you think I'm wrong please test your solution against the more complicated example input from my answer below.
-- jcxz100
For clarity (and to be able to debug more easily) I want the chunks to be saved as separate text files in a sub-folder.
All types of input files consist of:
junk lines
lines with junk text followed by start brackets or parentheses - i.e. '[' '{' '<' or '(' - and possibly followed by payload
payload lines
lines with brackets or parentheses nested within the top-level pairs; treated as payload too
payload lines with end brackets or parentheses - i.e. ']' '}' '>' or ')' - possibly followed by something (junk text and/or start of a new payload)
I want to break up the input according to only the matching pairs of top-level brackets/parentheses.
Payload inside these pairs must not be altered (including newlines and whitespace).
Everything outside the toplevel pairs should be discarded as junk.
Any junk or payload inside double-quotes must be considered atomic (handled as raw text, thus any brackets or parentheses inside should also be treated as text).
Here is an example (using only {} pairs):
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
...sorry: Some of the input files really are as messy as that.
The first output file should be:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
... and the second output file:
{
payload that goes in second output file }
Note:
I haven't quite decided whether it's necessary to keep the pair of start/end characters in the output or if they themselves should be discarded as junk.
I think a solution that keeps them in is of more general use.
There can be a mix of types of top-level bracket/parenthesis pairs in the same input file.
Beware: There are * and $ characters in the input files, so please avoid confusing bash ;-)
I prefer readability over brevity; but not at an exponential cost of speed.
Nice-to-haves:
There are backslash-escaped double-quotes inside the text; preferably they should be handled
(I have a hack, but it's not pretty).
The script oughtn't break over mismatched pairs of brackets/parentheses in junk and/or payload (note: inside the atomics they must be allowed!)
More-far-out-nice-to-haves:
I haven't seen it yet, but one could speculate that some input might have single-quotes rather than double-quotes to denote atomic content... or even a mix of both.
It would be nice if the script could be easily modified to parse input of similar structure but with different start/end characters or strings.
I can see this is quite a mouthful, but I think it wouldn't give a robust solution if I broke it down into simpler questions.
The main problem is splitting up the input correctly - everything else can be ignored or "solved" with hacks, so
feel free to ignore the nice-to-haves and the more-far-out-nice-to-haves.
Given:
$ cat file
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
} trailing junk
intermittent junk
{
payload that goes in second output file }
end junk
This Perl script will extract the blocks you describe into files block_1, block_2, etc.:
#!/usr/bin/perl
use v5.10;
use warnings;
use strict;
use Text::Balanced qw(extract_multiple extract_bracketed);
my $txt;
while (<>){$txt.=$_;} # slurp the file
my @blocks = extract_multiple(
$txt,
[
# Extract {...}
sub { extract_bracketed($_[0], '{}') },
],
# Return all the fields
undef,
# Throw out anything which does not match
1
);
chdir "/tmp";
my $base="block_";
my $cnt=1;
for my $block (@blocks) { my $fn = "$base$cnt";
say "writing $fn";
open (my $fh, '>', $fn) or die "Could not open file '$fn' $!";
print $fh "$block\n";
close $fh;
$cnt++;}
Now the files:
$ cat block_1
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
$ cat block_2
{
payload that goes in second output file }
Using Text::Balanced is robust and likely the best solution.
You can do the blocks with a single Perl regex:
$ perl -0777 -nlE 'while (/(\{(?:(?1)|[^{}]*+)++\})|[^{}\s]++/g) {if ($1) {$cnt++; say "block $cnt:== start:\n$1\n== end";}}' file
block 1:== start:
{ here is the actual payload
more payload
"atomic payload"
nested start bracket { - all of this line is untouchable payload too
here is more payload
"yet more atomic payload; this one's got a smiley ;-)"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
}
== end
block 2:== start:
{
payload that goes in second output file }
== end
But that is a little more fragile than using a proper parser like Text::Balanced...
I have a solution in C. It would seem there's too much complexity for this to be easily achieved in shell script.
The program isn't overly complicated but nevertheless has more than 200 lines of code, which include error checking, some speed optimization, and other niceties.
Source file split-brackets-to-chunks.c:
#include <stdio.h>
/* Example code by jcxz100 - your problem if you use it! */
#define BUFF_IN_MAX 255
#define BUFF_IN_SIZE (BUFF_IN_MAX+1)
#define OUT_NAME_MAX 31
#define OUT_NAME_SIZE (OUT_NAME_MAX+1)
#define NO_CHAR '\0'
int main()
{
char pcBuff[BUFF_IN_SIZE];
size_t iReadActual;
FILE *pFileIn, *pFileOut;
int iNumberOfOutputFiles;
char pszOutName[OUT_NAME_SIZE];
char cLiteralChar, cAtomicChar, cChunkStartChar, cChunkEndChar;
int iChunkNesting;
char *pcOutputStart;
size_t iOutputLen;
pcBuff[BUFF_IN_MAX] = '\0'; /* ... just to be sure. */
iReadActual = 0;
pFileIn = pFileOut = NULL;
iNumberOfOutputFiles = 0;
pszOutName[OUT_NAME_MAX] = '\0'; /* ... just to be sure. */
cLiteralChar = cAtomicChar = cChunkStartChar = cChunkEndChar = NO_CHAR;
iChunkNesting = 0;
pcOutputStart = (char*)pcBuff;
iOutputLen = 0;
if ((pFileIn = fopen("input-utf-8.txt", "r")) == NULL)
{
printf("What? Where?\n");
return 1;
}
while ((iReadActual = fread(pcBuff, sizeof(char), BUFF_IN_MAX, pFileIn)) > 0)
{
char *pcPivot, *pcStop;
pcBuff[iReadActual] = '\0'; /* ... just to be sure. */
pcPivot = (char*)pcBuff;
pcStop = (char*)pcBuff + iReadActual;
while (pcPivot < pcStop)
{
if (cLiteralChar != NO_CHAR) /* Ignore this char? */
{
/* Yes, ignore this char. */
if (cChunkStartChar != NO_CHAR)
{
/* ... just write it out: */
fprintf(pFileOut, "%c", *pcPivot);
}
pcPivot++;
cLiteralChar = NO_CHAR;
/* End of "Yes, ignore this char." */
}
else if (cAtomicChar != NO_CHAR) /* Are we inside an atomic string? */
{
/* Yup; we are inside an atomic string. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == '\\') /* Treat next char as literal? */
{
cLiteralChar = '\\'; /* Yes. */
bBreakInnerWhile = 1;
}
else if (*pcPivot == cAtomicChar) /* End of atomic? */
{
cAtomicChar = NO_CHAR; /* Yes. */
bBreakInnerWhile = 1;
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
if (cChunkStartChar != NO_CHAR)
{
/* The atomic string is part of a chunk. */
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", iOutputLen, pcOutputStart);
}
/* End of "Yup; we are inside an atomic string." */
}
else if (cChunkStartChar == NO_CHAR) /* Are we inside a chunk? */
{
/* No, we are outside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
while (bBreakInnerWhile == 0)
{
/* Detect start of anything interesting: */
switch (*pcPivot)
{
/* Start of atomic? */
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
/* Start of chunk? */
case '{':
cChunkStartChar = *pcPivot;
cChunkEndChar = '}';
break;
case '[':
cChunkStartChar = *pcPivot;
cChunkEndChar = ']';
break;
case '(':
cChunkStartChar = *pcPivot;
cChunkEndChar = ')';
break;
case '<':
cChunkStartChar = *pcPivot;
cChunkEndChar = '>';
break;
}
if (cChunkStartChar != NO_CHAR)
{
iNumberOfOutputFiles++;
printf("Start '%c' '%c' chunk (file %04d.txt)\n", *pcPivot, cChunkEndChar, iNumberOfOutputFiles);
sprintf((char*)pszOutName, "output/%04d.txt", iNumberOfOutputFiles);
if ((pFileOut = fopen(pszOutName, "w")) == NULL)
{
printf("What? How?\n");
fclose(pFileIn);
return 2;
}
bBreakInnerWhile = 1;
}
else if (++pcPivot == pcStop)
{
bBreakInnerWhile = 1;
}
}
/* End of "No, we are outside a chunk." */
}
else
{
/* Yes, we are inside a chunk. */
int bBreakInnerWhile;
bBreakInnerWhile = 0;
pcOutputStart = pcPivot;
while (bBreakInnerWhile == 0)
{
if (*pcPivot == cChunkStartChar)
{
/* Increase level of brackets/parantheses: */
iChunkNesting++;
}
else if (*pcPivot == cChunkEndChar)
{
/* Decrease level of brackets/parantheses: */
iChunkNesting--;
if (iChunkNesting == 0)
{
/* We are now outside chunk. */
bBreakInnerWhile = 1;
}
}
else
{
/* Detect atomic start: */
switch (*pcPivot)
{
case '"':
case '\'':
cAtomicChar = *pcPivot;
bBreakInnerWhile = 1;
break;
}
}
if (++pcPivot == pcStop) bBreakInnerWhile = 1;
}
iOutputLen = (size_t)(pcPivot-pcOutputStart);
fprintf(pFileOut, "%.*s", iOutputLen, pcOutputStart);
if (iChunkNesting == 0)
{
printf("File done.\n");
cChunkStartChar = cChunkEndChar = NO_CHAR;
fclose(pFileOut);
pFileOut = NULL;
}
/* End of "Yes, we are inside a chunk." */
}
}
}
if (cChunkStartChar != NO_CHAR)
{
printf("Chunk exceeds end-of-file. Exiting gracefully.\n");
fclose(pFileOut);
pFileOut = NULL;
}
if (iNumberOfOutputFiles == 0) printf("Nothing to do...\n");
else printf("All done.\n");
fclose(pFileIn);
return 0;
}
I've solved the nice-to-haves and one of the more-far-out-nice-to-haves.
To show this the input is a little more complex than the example in the question:
junk text
"atomic junk"
some junk text followed by a start bracket { here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
} trailing junk
intermittent junk
<
payload that goes in second output file } mismatched end bracket should be ignored >
end junk
Resulting file output/0001.txt:
{ here is the actual payload
more payload
'atomic payload { with start bracket that should be ignored'
nested start bracket { - all of this line is untouchable payload too
here is more payload
"this atomic has a literal double-quote \" inside"
"yet more atomic payload; this one's got a smiley ;-) and a heart <3"
end of nested bracket pair } - all of this line is untouchable payload too
this is payload too
"here's a totally unprovoked $ sign and an * asterisk"
}
... and resulting file output/0002.txt:
<
payload that goes in second output file } mismatched end bracket should be ignored >
Thanks @dawg for your help :)
I would like to write my output to a file if a file name is available, or on the screen (stdout) otherwise. So I've read posts on this forum and found some code, which I wrapped into a method below:
std::shared_ptr<std::ostream> out_stream(const std::string & fname) {
std::streambuf * buf;
std::ofstream of;
if (fname.length() > 0) {
of.open(fname);
buf = of.rdbuf();
} else
buf = std::cout.rdbuf();
std::shared_ptr<std::ostream> p(new std::ostream(buf));
return p;
}
The code works perfectly when used in-place. Unfortunately it behaves oddly when wrapped into a separate method (as given above). Is it because the objects defined within the method (of, buf) are destroyed once the call is finished?
I am using this part of code in several places and it really should be extracted as a separate non-repeating fragment: a method or a class. How can I achieve this?
You're correct that the problems you're having come from the destruction of of. Wouldn't something like this (untested) work?
std::shared_ptr<std::ostream>
out_stream(const std::string &fname) {
    if (fname.length() > 0)
        return std::shared_ptr<std::ostream>(new std::ofstream(fname));
    return std::shared_ptr<std::ostream>(new std::ostream(std::cout.rdbuf()));
}
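An alternative sketch (untested, and assuming C++11) avoids constructing a new stream around std::cout's buffer altogether: hand out std::cout itself with a no-op deleter, so the shared_ptr never tries to delete the global stream:

#include <fstream>
#include <iostream>
#include <memory>
#include <string>

std::shared_ptr<std::ostream> out_stream(const std::string &fname) {
    if (!fname.empty())
        return std::make_shared<std::ofstream>(fname); // the file stream owns its buffer
    // std::cout is a global object; the no-op deleter keeps the
    // shared_ptr from ever trying to delete it.
    return std::shared_ptr<std::ostream>(&std::cout, [](std::ostream *) {});
}

int main() {
    auto out = out_stream(""); // empty name: writes to stdout
    *out << "hello\n";
}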
My group is having some discussion and strong feelings about for loop construction.
I have favored loops like:
size_t x;
for (x = 0; x < LIMIT; ++x) {
if (something) {
break;
}
...
}
// If we found what we're looking for, process it.
if (x < LIMIT) {
...
}
But others seem to prefer a Boolean flag like:
size_t x;
bool found = false;
for (x = 0; x < LIMIT && !found; ++x) {
if (something) {
found = true;
}
else {
...
}
}
// If we found what we're looking for, process it.
if (found) {
...
}
(And, where the language allows, using "for (int x = 0; ...".)
The first style has one less variable to keep track of and a simpler loop header, albeit at the cost of "overloading" the loop control variable and (some would complain) the use of break.
The second style has clearly defined roles for the variables but a more complex loop condition and loop body (either an else, or a continue after found is set, or an "if (!found)" around the balance of the loop).
I think that the first style wins on code complexity. I'm looking for opinions from a broader audience. Pointers to actual research on which is easier to read and maintain would be even better. "It doesn't matter, take it out of your standard" is a fine answer, too.
OTOH, this may be the wrong question. I'm beginning to think that the right rule is "if you have to break out of a for, it's really a while."
bool found = false;
x = 0;
while (!found && x < LIMIT) {
if (something) {
found = true;
...handle the thing...
}
else {
...
}
++x;
}
Does what the first two examples do but in fewer lines. It does divide the initialization, test, and increment of x across three lines, though.
I'd actually dare to suggest consideration of GOTO to break out of loops in such cases:
for (size_t x = 0; x < LIMIT; ++x) {
if (something)
goto found;
else {
...
}
}
// not found
...
return;
found:
...
return;
I consider this form to be both succinct and readable. It may do some good in many simple cases (say, when there is no common processing in this function in both the found/unfound cases).
And about the general frowning goto receives: I find it to be a common misinterpretation of Dijkstra's original claims. His arguments favoured structured loop clauses, such as for or while, over a primitive loop-via-goto, which still had a lot of presence circa 1968. Even the almighty Knuth eventually says:
The new morality that I propose may perhaps be stated thus: "Certain go to statements which arise in connection with well-understood transformations are acceptable, provided that the program documentation explains what the transformation was."
Others here occasionally think the same.
While I disagree that an extra else really makes the 2nd more complicated, I think it's primarily a matter of aesthetics and keeping to your standard.
Personally, I have a probably irrational dislike of breaks and continues, so I'm MUCH more likely to use the found variable.
Also, note that you CAN add the found variable to the 1st implementation and do
if(something)
{
found = true;
break;
}
if you want to avoid the variable overloading problem at the expense of the extra variable, but still want the simple loop terminator...
The former example duplicates the x < LIMIT condition, whereas the latter doesn't.
With the former, if you want to change that condition, you have to remember to do it in two places.
I would prefer a different one altogether:
for (int x = 0; x < LIMIT; ++x) {
if (something) {
// If we found what we're looking for, process it.
...
break;
}
...
}
It seems you don't have any of the troubles you mention about one or the other... ;-)
no duplication of condition, or readability problem
no additional variable
I don't have any references to hand (-1! -1!), but I seem to recall that having multiple exit points (from a function, from a loop) has been shown to cause issues with maintainability (I used to know someone who wrote code for the UK military and it was Verboten to do so). But more importantly, as RichieHindle points out, having a duplicate condition is a Bad Thing, it cries out for introducing bugs by changing one and not the other.
If you weren't using the condition later, I wouldn't be bothered either way. Since you are, the second is the way to go.
This sort of argument has been fought out here before (probably many times) such as in this question.
There are those that will argue that purity of code is all-important and they'll complain bitterly that your first option doesn't have identical post-conditions for all cases.
What I would answer is "Twaddle!". I'm a pragmatist, not a purist. I'm as against too much spaghetti code as much as the next engineer but some of the hideous terminating conditions I've seen in for loops are far worse than using a couple of breaks within your loop.
I will always go for readability of code over "purity" simply because I have to maintain it.
This looks like a place for a while loop. For loops are syntactic sugar on top of a while loop anyway. The general rule is that if you have to break out of a for loop, then use a while loop instead; a sketch of that equivalence follows.
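For illustration, a minimal sketch of the equivalence (assuming the body contains no continue, which would skip the increment in the rewritten form; LIMIT is made up for the example):

#include <cstddef>

int main() {
    const std::size_t LIMIT = 10;

    // A for loop...
    for (std::size_t x = 0; x < LIMIT; ++x) {
        /* body */
    }

    // ...is essentially this while loop:
    std::size_t x = 0;
    while (x < LIMIT) {
        /* body */
        ++x;
    }
}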
package com.company;
import java.io.*;
import java.util.Scanner;
public class Main {
// "line.separator" is a system property that is a platform independent and it is one way
// of getting a newline from your environment.
private static final String NEWLINE = System.getProperty("line.separator");
public static void main(String[] args) {
// write your code here
boolean itsdone = false;
String userInputFileName;
String FirstName = null;
String LastName = null;
String user_junk;
String userOutputFileName;
String outString;
int Age = -1;
int rint = 0;
int myMAX = 100;
int MyArr2[] = new int[myMAX];
int itemCount = 0;
double average = 0;
double total = 0;
boolean ageDone = false;
Scanner inScan = new Scanner(System.in);
System.out.println("Enter First Name");
FirstName = inScan.next();
System.out.println("Enter Last Name");
LastName = inScan.next();
ageDone = false;
while (!ageDone) {
System.out.println("Enter Your Age");
if (inScan.hasNextInt()) {
Age = inScan.nextInt();
System.out.println(FirstName + " " + LastName + " " + "is " + Age + " Years old");
ageDone = true;
} else {
System.out.println("Your Age Needs to Have an Integer Value... Enter an Integer Value");
user_junk = inScan.next();
ageDone = false;
}
}
try {
File outputFile = new File("firstOutFile.txt");
if (outputFile.createNewFile()){
System.out.println("firstOutFile.txt was created"); // if file was created
}
else {
System.out.println("firstOutFile.txt existed and is being overwritten."); // if file had already existed
}
// --------------------------------
// If the file creation or access permissions to write into it
// are incorrect the program throws an exception
//
if ((outputFile.isFile()|| outputFile.canWrite())){
BufferedWriter fileOut = new BufferedWriter(new FileWriter(outputFile));
fileOut.write("==================================================================");
fileOut.write(NEWLINE + NEWLINE +" You Information is..." + NEWLINE + NEWLINE);
fileOut.write(NEWLINE + FirstName + " " + LastName + " " + Age + NEWLINE);
fileOut.write("==================================================================");
fileOut.close();
}
else {
throw new IOException();
}
} // end of try
catch (IOException e) { // in case for some reason the output file could not be created
System.err.format("IOException: %s%n", e);
e.printStackTrace();
}
} // end main method
}
The most egregiously redundant code construct I often see involves using the code sequence
if (condition)
return true;
else
return false;
instead of simply writing
return (condition);
I've seen this beginner error in all sorts of languages: from Pascal and C to PHP and Java. What other such constructs would you flag in a code review?
if (foo == true)
{
do stuff
}
I keep telling the developer who does that that it should be
if ((foo == true) == true)
{
do stuff
}
but he hasn't gotten the hint yet.
if (condition == true)
{
...
}
instead of
if (condition)
{
...
}
Edit:
or even worse and turning around the conditional test:
if (condition == false)
{
...
}
which is easily read as
if (condition) then ...
Using comments instead of source control:
-Commenting out or renaming functions instead of deleting them and trusting that source control can get them back for you if needed.
-Adding comments like "RWF Change" instead of just making the change and letting source control assign the blame.
Somewhere I’ve spotted this thing, which I find to be the pinnacle of boolean redundancy:
return (test == 1)? ((test == 0) ? 0 : 1) : ((test == 0) ? 0 : 1);
:-)
Redundant code is not in itself an error. But if you're really trying to save every character
return (condition);
is redundant too. You can write:
return condition;
Declaring separately from assignment in languages other than C:
int foo;
foo = GetFoo();
Returning uselessly at the end:
// stuff
return;
}
I once had a guy who repeatedly did this:
bool a;
bool b;
...
if (a == true)
b = true;
else
b = false;
void myfunction() {
if(condition) {
// Do some stuff
if(othercond) {
// Do more stuff
}
}
}
instead of
void myfunction() {
if(!condition)
return;
// Do some stuff
if(!othercond)
return;
// Do more stuff
}
Using .ToString() on a string.
Putting an exit statement as the first statement in a function to disable the execution of that function, instead of one of the following options:
Completely removing the function
Commenting the function body
Keeping the function but deleting all the code
Using exit as the first statement makes it very hard to spot; you can easily read over it, as in the sketch below.
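A minimal sketch of the anti-pattern (the function name and contents are hypothetical):

void nightly_cleanup() {
    return; // <- quietly disables the entire function; easy to read straight past
    // ...imagine fifty lines of live-looking code here, all of it dead...
}

int main() { nightly_cleanup(); }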
Fear of null (this also can lead to serious problems):
if (name != null)
person.Name = name;
Redundant if's (not using else):
if (!IsPostback)
{
// do something
}
if (IsPostback)
{
// do something else
}
Redundant checks (Split never returns null):
string[] words = sentence.Split(' ');
if (words != null)
More on checks (the second check is redundant if you are going to loop)
if (myArray != null && myArray.Length > 0)
foreach (string s in myArray)
And my favorite for ASP.NET: Scattered DataBinds all over the code in order to make the page render.
Copy paste redundancy:
if (x > 0)
{
// a lot of code to calculate z
y = x + z;
}
else
{
// a lot of code to calculate z
y = x - z;
}
instead of
if (x > 0)
y = x + CalcZ(x);
else
y = x - CalcZ(x);
or even better (or more obfuscated)
y = x + (x > 0 ? 1 : -1) * CalcZ(x);
Allocating elements on the heap instead of the stack.
{
char *buff = malloc(1024);
/* ... */
free(buff);
}
instead of
{
char buff[1024];
/* ... */
}
or
{
struct foo *x = (struct foo *)malloc(sizeof(struct foo));
x->a = ...;
bar(x);
free(x);
}
instead of
{
struct foo x;
x.a = ...;
bar(&x);
}
The most common redundant code construct I see is code that is never called from anywhere in the program.
The other is design patterns used where there is no point in using them. For example, writing "new BobFactory().createBob()" everywhere, instead of just writing "new Bob()".
Deleting unused and unnecessary code can massively improve the quality of the system and the team's ability to maintain it. The benefits are often startling to teams who have never considered deleting unnecessary code from their system. I once performed a code review by sitting with a team and deleting over half the code in their project without changing the functionality of their system. I thought they'd be offended but they frequently asked me back for design advice and feedback after that.
I often run into the following:
function foo() {
if ( something ) {
return;
} else {
do_something();
}
}
But it doesn't help telling them that the else is useless here. It has to be either
function foo() {
if ( something ) {
return;
}
do_something();
}
or - depending on the length of checks that are done before do_something():
function foo() {
if ( !something ) {
do_something();
}
}
From nightmarish code reviews.....
char s[100];
followed by
memset(s,0,100);
followed by
s[strlen(s)] = 0;
with lots of nasty
if (strcmp(s, "1") == 0)
littered about the code.
Using an array when you want set behavior. You need to check everything to make sure it's not in the array before you insert it, which makes your code longer and slower.
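A short C++ sketch of the difference (the container contents are made up for illustration):

#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    const std::string item = "foo";

    // Array used as a set: an O(n) scan before every insert to keep it unique.
    std::vector<std::string> arr;
    if (std::find(arr.begin(), arr.end(), item) == arr.end())
        arr.push_back(item);

    // A real set: the membership test is built in and O(1) on average.
    std::unordered_set<std::string> words;
    words.insert(item); // duplicates are ignored automatically
}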
Redundant .ToString() invocations:
const int foo = 5;
Console.WriteLine("Number of Items: " + foo.ToString());
Unnecessary string formatting:
const int foo = 5;
Console.WriteLine("Number of Items: {0}", foo);