Huge number of calls to ANTLR38BitConsume library function - antlr3

I'm writing a lexer for ASN1 using the ANTLR3 C target, with version 3.4 of the library. The lexer is really slow, so I ran perf record on the execution, and I found that the bottleneck is the library function ANTLR38BitConsume, which is pretty simple:
static void
antlr38BitConsume(pANTLR3_INT_STREAM is)
{
    pANTLR3_INPUT_STREAM input;

    input = ((pANTLR3_INPUT_STREAM) (is->super));

    if ((pANTLR3_UINT8)(input->nextChar) < (((pANTLR3_UINT8)input->data) + input->sizeBuf))
    {
        /* Indicate one more character in this line */
        input->charPositionInLine++;

        if ((ANTLR3_UCHAR)(*((pANTLR3_UINT8)input->nextChar)) == input->newlineChar)
        {
            /* Reset for start of a new line of input */
            input->line++;
            input->charPositionInLine = 0;
            input->currentLine = (void *)(((pANTLR3_UINT8)input->nextChar) + 1);
        }

        /* Increment to next character position */
        input->nextChar = (void *)(((pANTLR3_UINT8)input->nextChar) + 1);
    }
}
After an investigation, I have figured out that on a file of size 1.4KB, this function, which I think should be called at most once per byte, is called 24.5M times. Do you know if this is a known issue, or is there another explanation for this weird behaviour?
EDITED
After some trials, I have figured out the rules which cause this issue. I have some rules in order to recognize specific object identifiers, which are very simple:
OID1 : {counter==6}?=> 2 3 4 5 840 {counter=0;} ;
and a rule which matches every byte in the content field of ASN1 encoding:
VALUE : ({counter>0}?=> '\u0000'..'\u00FF' {counter--;})+ ;
Since the lexer uses the longest-match rule, OID1 will never be matched, because VALUE can match an arbitrarily long value. Hence, in order to fool the lexer, I edited the OID1 rule in this way:
OID1 : {counter==6}?=> 2 3 4 5 840 {counter=0;} VALUE? ;
The VALUE token at the end of the rule will never be matched, since that rule is active only when counter is greater than 0, but this way the lexer will also consider OID1 in the matching, because it can be longer than VALUE.
However, this rule is what causes the huge number of calls to the consume function! As soon as I delete the VALUE? part, the number of calls is precisely equal to the number of bytes in the input file.
I think this is an implementation bug in ANTLR: it is reasonable for it to try matching the input following OID1 against VALUE, but it should stop immediately and discard that alternative, because counter is 0.


Static Analysis erroneously reports out of bounds access

While reviewing a codebase, I came upon a particular piece of code that triggered a warning regarding an "out of bounds access". After looking at the code, I could not see a way for the reported access to happen - and tried to minimize the code to create a reproducible example. I then checked this example with two commercial static analysers that I have access to - and also with the open-source Frama-C.
All 3 of them see the same "out of bounds" access.
I don't. Let's have a look:
 3  extern int checker(int id);
 4  extern int checker2(int id);
 5
 6  int compute(int *q)
 7  {
 8      int res = 0, status;
 9
10      status = checker2(12);
11      if (!status) {
12          status = 1;
13          *q = 2;
14          for(int i=0; i<2 && 0!=status; i++) {
15              if (checker(i)) {
16                  res = i;
17                  status=checker2(i);
18              }
19          }
20      }
21      if (!status)
22          *q = res;
23      return status;
24  }
25
26  int someFunc(int id)
27  {
28      int p;
29      extern int data[2];
30
31      int status = checker2(132);
32      status |= compute(&p);
33      if (status == 0) {
34          return data[p];
35      } else
36          return -1;
37  }
Please don't try to judge the quality of the code, or why it does things the way it does. This is a hacked, cropped and mutated version of the original, with the sole intent being to reach a small example that demonstrates the issue.
All analysers I have access to report the same thing: that the indexing in the caller at line 34, the return data[p], may read via the invalid index 2. Here's the output from Frama-C - but note that the two commercial static analysers provide exactly the same assessment:
$ frama-c -val -main someFunc -rte why.c |& grep warning
...
why.c:34:[value] warning: accessing out of bounds index. assert p < 2;
Let's step the code in reverse, to see how this out of bounds access at line 34 can happen:
To end up in line 34, the returned status from both calls to checker2 and compute should be 0.
For compute to return 0 (at line 32 in the caller, line 23 in the callee), it means that we have performed the assignment at line 22 - since it is guarded at line 21 with a check for status being 0. So we wrote in the passed-in pointer q, whatever was stored in variable res. This pointer points to the variable used to perform the indexing - the supposed out-of-bounds index.
So, to experience an out of bounds access into the data, which is dimensioned to contain exactly two elements, we must have written a value that is neither 0 nor 1 into res.
We write into res via the for loop at line 14, which conditionally assigns into res; if it does assign, the value it writes will be one of the two valid indexes, 0 or 1, because those are the only values the for loop lets through (it is bounded by i<2).
Due to the initialization of status at line 12, if we do reach line 12, we will for sure enter the loop at least once. And if we do write into res, we will write a nice valid index.
What if we don't write into it, though? The "default" setup at line 13 has written a "2" into our target - which is probably what scares the analysers. Can that "2" indeed escape out into the caller?
Well, it doesn't seem so... if the status check at either line 11 or line 21 fails, we return with a non-zero status; so whatever value we wrote (or didn't write, leaving it uninitialised) into the passed-in q is irrelevant; the caller will not read that value, due to the check at line 33.
So either I am missing something and there is indeed a scenario that leads to an out of bounds access with index 2 at line 34 (how?) or this is an example of the limits of mainstream formal verification.
Help?
When dealing with a case where you must distinguish between == 0 and != 0 inside a wide range such as [INT_MIN, INT_MAX], you need to tell Frama-C/Eva to split the cases.
By adding //# split annotations in the appropriate spots, you can tell Frama-C/Eva to maintain separate states, thus preventing merging them before status is evaluated.
Here's how your code would look in this case (courtesy of @Virgile):
extern int checker(int id);
extern int checker2(int id);

int compute(int *q)
{
    int res = 0, status;
    status = checker2(12);
    //# split status <= 0;
    //# split status == 0;
    if (!status) {
        status = 1;
        *q = 2;
        for(int i=0; i<2 && 0!=status; i++) {
            if (checker(i)) {
                res = i;
                status=checker2(i);
            }
        }
    }
    //# split status <= 0;
    //# split status == 0;
    if (!status)
        *q = res;
    return status;
}

int someFunc(int id)
{
    int p;
    extern int data[2];

    int status = checker2(132);
    //# split status <= 0;
    //# split status == 0;
    status |= compute(&p);
    if (status == 0) {
        return data[p];
    } else
        return -1;
}
In each case, the first split annotation tells Eva to consider the cases status <= 0 and status > 0 separately; this allows "breaking" the interval [INT_MIN, INT_MAX] into [INT_MIN, 0] and [1, INT_MAX]; the second annotation allows separating [INT_MIN, 0] into [INT_MIN, -1] and [0, 0]. When these 3 states are propagated separately, Eva is able to precisely distinguish between the different situations in the code and avoid the spurious alarm.
You also need to allow Frama-C/Eva some margin for keeping the states separated (by default, Eva will optimize for efficiency, merging states somewhat aggressively); this is done by adding -eva-precision 1 (higher values may be required for your original scenario).
Related options: -eva-domains sign (previously -eva-sign-domain) and -eva-partition-history N
Frama-C/Eva also has other options which are related to splitting states; one of them is the signs domain, which computes information about sign of variables, and is useful to distinguish between 0 and non-zero values. In some cases (such as a slightly simplified version of your code, where status |= compute(&p); is replaced with status = compute(&p);), the sign domain may help splitting without the need for annotations. Enable it using -eva-domains sign (-eva-sign-domain for Frama-C <= 20).
Another related option is -eva-partition-history N, which tells Frama-C to keep the states partitioned for longer.
Note that keeping states separated is a bit costly in terms of analysis time, so it may not scale when applied to the "real" code, if it contains several more branches. Increasing the values given to -eva-precision and -eva-partition-history may help, as well as adding //# split annotations.
I'd like to add some remarks which will hopefully be useful in the future:
Using Frama-C/Eva effectively
Frama-C contains several plug-ins and analyses. Here in particular, you are using the Eva plug-in. It performs an analysis based on abstract interpretation that reports all possible runtime errors (undefined behaviors, as the C standard puts it) in a program. Using -rte is thus unnecessary, and adds noise to the result. If Eva cannot be certain about the absence of some alarm, it will report it.
Replace the -val option with -eva. It's the same thing, but the former is deprecated.
If you want to improve precision (to remove false alarms), add -eva-precision N, where 0 <= N <= 11. In your example program, it doesn't change much, but in complex programs with multiple callstacks, extra precision will take longer but minimize the number of false alarms.
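Putting the previous two remarks together, the invocation from the question would become something like this (the precision value is just an example):
$ frama-c -eva -main someFunc -eva-precision 1 why.c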
Also, consider providing a minimal specification for the external functions, to avoid warnings; here they contain no pointers, but if they did, you'd need to provide an assigns clause to explicitly tell Frama-C whether the functions modify such pointers (or any global variables, for instance).
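For the two externals here, a minimal contract could look like the following sketch (ACSL syntax; assigns \nothing states that the function modifies no memory):
/*@ assigns \nothing; */
extern int checker(int id);

/*@ assigns \nothing; */
extern int checker2(int id);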
Using the GUI and Studia
With the Frama-C graphical interface and the Studia plug-in (accessible by right-clicking an expression of interest and choosing the popup menu Studia -> Writes), and using the Values panel in the GUI, you can easily track what the analysis inferred and better understand where the alarms and values come from. The only downside is that it does not report exactly where merges happen. For the most precise results, you may need to add calls to an Eva built-in, Frama_C_show_each(exp), and put it inside a loop to get Eva to display, at each iteration of its analysis, the values contained in exp.
See section 9.3 (Displaying intermediate results) of the Eva user manual for more details, including similar built-ins (such as Frama_C_domain_show_each and Frama_C_dump_each, which show information about abstract domains). You may need to #include "__fc_builtin.h" in your program. You can use #ifdef __FRAMAC__ to allow the original code to compile when including this Frama-C-specific file.
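As a sketch of what that instrumentation could look like on the loop in compute (the _status suffix is my own arbitrary choice; Eva recognizes any function whose name starts with Frama_C_show_each):

/* Declaring a variant with our own suffix and argument types is enough;
   Eva treats any Frama_C_show_each* function as a display built-in. */
void Frama_C_show_each_status(int i, int status);

/* ... inside compute() ... */
for (int i = 0; i < 2 && 0 != status; i++) {
    Frama_C_show_each_status(i, status); /* printed at each analysis iteration */
    if (checker(i)) {
        res = i;
        status = checker2(i);
    }
}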
Being nitpicky about the term erroneous reports
Frama-C is a semantics-based tool whose main analyses are exhaustive but may contain false positives: Frama-C may report alarms that cannot actually happen, but it should never miss any possible alarm. It's a trade-off: you can't have an exact tool in all cases (though, in this example, with sufficient -eva-precision, Frama-C is exact, reporting only issues which may actually happen).
In this sense, erroneous would mean that Frama-C "forgot" to indicate some issue, and we'd be really concerned about it. Indicating an alarm where it may not happen is still problematic for the user (and we work to improve it, so such situations should happen less often), but not a bug in Frama-C, and so we prefer using the term imprecisely, e.g. "Frama-C/Eva imprecisely reports an out of bounds access".

Style preference for binary operators with long lines

Most C++ style guides, such as Google's, recommend a maximum line length of 80 characters, along with guidelines for wrapping function calls so that the call is formatted correctly.
For example, here is what Google's style guide has to say on the matter:
Function calls have the following format:
bool retval = DoSomething(argument1, argument2, argument3);
If the arguments do not all fit on one line, they should be broken up
onto multiple lines, with each subsequent line aligned with the first
argument. Do not add spaces after the open paren or before the close
paren:
bool retval = DoSomething(averyveryveryverylongargument1,
                          argument2, argument3);
If the function has many arguments, consider having one per line if
this makes the code more readable:
bool retval = DoSomething(argument1,
                          argument2,
                          argument3,
                          argument4);
Arguments may optionally all be placed on subsequent lines, with one
line per argument:
if (...) {
  ...
  ...
  if (...) {
    DoSomething(
        argument1,  // 4 space indent
        argument2,
        argument3,
        argument4);
  }
In particular, this should be done if the function signature is so
long that it cannot fit within the maximum line length.
What isn't discussed is what is preferable for binary operators. This is a matter of style, so there's not a particularly correct answer, but I was hoping to get some opinions on what people prefer in the case of binary operators with long variable names or expressions.
For example, consider the simple case
result = reallyReallyReallyReallyLongVariableName * otherReallyReallyLongVariableName;
There are three formats that jump out at me:
// #1: The assignment operator is treated as a call for the sake of formatting.
result =
    reallyReallyReallyReallyLongVariableName * otherReallyReallyLongVariableName;

// #2: The binary operator is positioned under the equals sign, my preference.
result = reallyReallyReallyReallyLongVariableName
       * otherReallyReallyLongVariableName;

// #3: There is basic indentation of 4 spaces.
result = reallyReallyReallyReallyLongVariableName
    * otherReallyReallyLongVariableName;
The first two are a bit more intuitive in my opinion, and the Zend Framework style guide recommends the format of #2 for repeated string concatenation in PHP.
As for the third, PEAR's PHP style guide recommends starting new lines with the -> operator in the case of chained method calls, which is somewhat loosely analogous to a binary operator.
I personally prefer the second, but I was just wondering what others' opinions are.

Last byte in Huffman compression

I am wondering what the best way is to handle the last byte in Huffman compression. I have some nice code in C++ that can compress text files very well, but currently I must also write the number of coded chars into the coded file (it equals the input file size), because I have no idea how to handle the last byte better.
For example, the last char to compress is 'a', whose code is 011, and I am just starting a new byte to write, so the last byte will look like:
011 + 5 bits of trash (I fill them with zeros, for example).
And when I am decoding this coded file, it may happen that the code 00000 (or one with fewer zeros) is the code for some char, so I will get a trash char at the end of my decoded file.
As I wrote in the first paragraph, I avoid this by saving the number of chars of the input file in the coded file, and while decoding, I read the coded file only until I reach that count (not to EndOfFile, so as not to decode those example 5 zeros).
It's not really efficient: the size of the coded file grows by the bytes needed to store that number.
How can I handle this in better way?
Your approach (writing the number of encoded characters to the file) is a perfectly reasonable one. If you want to try a different avenue, you could consider inventing a new "pseudo-EOF" character that marks the end of the input (I'll denote it □). Whenever you want to compress a string s, you instead compress the string s□. This means that when you build up your encoding tree, you include one copy of the □ character, so that you have a unique encoding for □. Then, when you write the string out to the file, you write the bit patterns of the string's characters as normal, followed by the bit pattern for □. If there are leftover bits, you can just leave them set arbitrarily.
The advantage of this approach is that as you decode the file, if at any point you find the □ character, you can immediately stop decoding, because you know that you have hit the end of the file. This does not require you to store the number of bytes that were written anywhere - the encoding implicitly marks its own endpoint.
The disadvantage of this setup is that it might increase the length of the bit patterns used by certain characters, since you will need to assign a bit pattern to □ in addition to all the other characters.
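To make the decoding side concrete, here is a minimal C++ sketch of that stopping rule; the Node layout, the BitReader helper and the PSEUDO_EOF value are my own illustrative assumptions, not part of the approach itself:
#include <istream>
#include <ostream>

struct Node {
    int symbol;                     // 0..255 for a byte, PSEUDO_EOF, or -1 if internal
    Node *zero, *one;               // children for bits 0 and 1
};
const int PSEUDO_EOF = 256;         // one code point past the 256 real byte values

struct BitReader {                  // pulls bits MSB-first from a byte stream
    std::istream& in;
    int byte = 0, left = 0;
    int next() {
        if (left == 0) { byte = in.get(); left = 8; }
        return (byte >> --left) & 1;
    }
};

void decode(BitReader& bits, const Node* root, std::ostream& out) {
    for (;;) {
        const Node* n = root;
        while (n->symbol < 0)                 // walk down until a leaf is reached
            n = bits.next() ? n->one : n->zero;
        if (n->symbol == PSEUDO_EOF) return;  // done: trailing trash bits are ignored
        out.put(static_cast<char>(n->symbol));
    }
}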
I teach an introductory programming course and we use Huffman encoding as one of our assignments. We have students use the above approach, since it's a bit easier than having to write out the number of bits or bytes before the file contents. For more details, you could take a look at this handout or these lecture slides from the course.
Hope this helps!
I know this is an old question, but there's still an alternative, so it might help someone.
When you're writing your compressed file to output, you probably have some integer keeping track of where you are in the current byte (for bit shifting).
char c, p = '\0';
int curr = 7;                             // next bit position to fill in p (MSB first)
while (infile.get(c))
{
    std::string trav = GetTraversal(c);   // Huffman code of c as a string of '0'/'1'
    for (std::size_t i = 0; i < trav.size(); i++)
    {
        if (trav[i] == '1')
            p |= (1 << curr);             // set the current bit
        if (--curr < 0)                   // byte is full: flush it, start a new one
        {
            outfile.put(p);
            p = '\0';
            curr = 7;
        }
    }
}
if (curr < 7)                             // flush the final, partially filled byte
    outfile.put(p);
At the end of this block, (curr+1)%8 equals the number of trash bits in the last data byte. You can then store it at the end as a single extra byte, and just keep it in mind when you're decompressing.
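For the decompression side, a small sketch under my assumed layout (payload bytes first, then one final byte holding the trash-bit count):
#include <cstddef>
#include <fstream>
#include <iterator>
#include <vector>

// Read the whole compressed file, peel off the count byte, and report
// how many bits of the payload are meaningful.
std::size_t realBitCount(std::ifstream& in, std::vector<unsigned char>& payload)
{
    payload.assign(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>());
    unsigned trash = payload.back();      // last byte: number of trash bits (0..7)
    payload.pop_back();                   // everything before it is the bit stream
    return payload.size() * 8 - trash;    // number of meaningful bits to decode
}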

Efficiently compute permutations of a given set of "blocks" in a line

I am working on an application where I have a number of blocks which should be positioned on a line. That is, there is a varying number of blocks, each with a different length, which should be positioned on the line. There needs to be at least one empty element between blocks.
I would like to get all possible permutations of the blocks on the line efficiently.
For example I have a line of length 15 and would like to place blocks of 1, 6 and 1 size.
Order matters, i.e. in my example the 1-size blocks should always be to the left and right of the 6-size block.
Possible permutations are
X.XXXXXX.X.....
X..XXXXXX.X....
...
.....X.XXXXXX.X
How do I efficiently generate all possible permutations in a higher level language, e.g. Java?
One way to do this is to approach it recursively:
If the minimum total length required to store all the blocks with exactly one space in-between them exceeds the available space, there are no ways to place the blocks.
Otherwise, if you have no blocks to place, then the only way to place the blocks is to leave all squares unfilled.
Otherwise, there are two options. First, you could place the first block at the first position in the row, then recursively place the remaining blocks in the remaining space within the row after first leaving one extra blank space at the start of the row. Second, you could leave the first space in the row blank, then recursively place the same set of blocks in the remaining space in the row. Trying out both options and combining the results back together should give you the answer you're looking for.
Translating this recursive logic into actual Java should not be too difficult. The code below is designed for readability and can be optimized a bit:
public List<String> allBlockOrderings(int rowLength, List<Integer> blockSizes) {
    /* Case 1: Not enough space left. */
    if (spaceNeededFor(blockSizes) > rowLength) return Collections.emptyList();

    List<String> result = new ArrayList<String>();

    /* Case 2: Nothing to place; fill the rest of the row with dots. */
    if (blockSizes.isEmpty()) {
        result.add(stringOf('.', rowLength));
    } else {
        /* Case 3a: place the first block at the beginning of the row, followed
           by one mandatory blank (unless it is the last block), then recurse. */
        int first = blockSizes.get(0);
        List<Integer> remaining = blockSizes.subList(1, blockSizes.size());
        String separator = remaining.isEmpty() ? "" : ".";
        List<String> placeFirst =
            allBlockOrderings(rowLength - first - separator.length(), remaining);
        for (String rest : placeFirst) {
            result.add(stringOf('X', first) + separator + rest);
        }

        /* Case 3b: leave the very first spot open and recurse. */
        List<String> skipFirst = allBlockOrderings(rowLength - 1, blockSizes);
        for (String rest : skipFirst) {
            result.add("." + rest);
        }
    }
    return result;
}
You'll need to implement the spaceNeededFor method, which returns the length of the shortest row that could possibly hold a given list of blocks, and the stringOf method, which takes in a character and a number, then returns a string of that many copies of the given character.
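For completeness, here is one possible way to write those two helpers (my sketch; the answer above leaves them as an exercise):
private int spaceNeededFor(List<Integer> blockSizes) {
    if (blockSizes.isEmpty()) return 0;
    int total = 0;
    for (int size : blockSizes) total += size;   // the blocks themselves
    return total + blockSizes.size() - 1;        // plus one gap between each pair
}

private String stringOf(char ch, int count) {
    StringBuilder sb = new StringBuilder(count);
    for (int i = 0; i < count; i++) sb.append(ch);
    return sb.toString();
}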
Hope this helps!
To me it seems easier to think about the problem in another way:
We have fixed blocks in a fixed order, separated by dots. We can create all permutations by distributing the remaining dots over the allowed positions.
The length of this fixed part of the line is:
fixed_len = length_of_all_blocks + number_of_blocks - 1
The number of remaining dots is:
free_dots = length_of_line - fixed_len
The number of open positions is:
pos_count = number_of_blocks + 1
Now we have to find all ways to distribute free_dots dots over pos_count positions; by stars and bars, there are C(free_dots + pos_count - 1, pos_count - 1) of them.
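Here is a sketch of that idea in Java (identifiers are mine): pick how many of the free dots go into each open position, left to right, and emit a line once every position has been decided. It needs Java 11+ for String.repeat.
import java.util.ArrayList;
import java.util.List;

public class DotDistribution {
    public static List<String> lines(int lineLength, int[] blocks) {
        int fixedLen = Math.max(0, blocks.length - 1);    // mandatory separators
        for (int b : blocks) fixedLen += b;               // plus the blocks themselves
        List<String> out = new ArrayList<>();
        if (fixedLen <= lineLength)
            place(lineLength - fixedLen, blocks, 0, new StringBuilder(), out);
        return out;
    }

    // gap = index of the open position being filled (0 .. blocks.length)
    private static void place(int freeDots, int[] blocks, int gap,
                              StringBuilder prefix, List<String> out) {
        if (gap == blocks.length) {                       // last gap takes what is left
            out.add(prefix + ".".repeat(freeDots));
            return;
        }
        for (int d = 0; d <= freeDots; d++) {             // d extra dots in this gap
            int mark = prefix.length();
            prefix.append(".".repeat(d + (gap > 0 ? 1 : 0)));  // extra + mandatory dot
            prefix.append("X".repeat(blocks[gap]));
            place(freeDots - d, blocks, gap + 1, prefix, out);
            prefix.setLength(mark);                       // backtrack
        }
    }
}
For the example above, lines(15, new int[]{1, 6, 1}) yields C(8, 3) = 56 lines, the first of which is X.XXXXXX.X.....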
It's quite hard to pin down what an "efficient implementation" is here, since the output itself can be very large, so even a fast implementation is dominated by the time needed to emit it.
I'd use techniques of dynamic programming and recursion for such a task. The recursive function should take two parameters: the list of unused numbers and the remaining length of the row. Inside, it will be a simple loop. You should store the results you already know (memoization). I'm sure you can handle the details by yourself. Edit: Our friend has already done that for you :-).
By the way, what is the goal of this task? It reminds me of the puzzles where a grid has such numbers for every row and column and you need to decode the picture (nonograms). There are better ways to solve that kind of problem.

When are numbers NOT Magic?

I have a function like this:
float_as_thousands_str_with_precision(value, precision)
If I use it like this:
float_as_thousands_str_with_precision(volts, 1)
float_as_thousands_str_with_precision(amps, 2)
float_as_thousands_str_with_precision(watts, 2)
Are those 1/2s magic numbers?
Yes, they are magic numbers. It's obvious that the numbers 1 and 2 specify precision in the code sample, but not why. Why do you need amps and watts to be more precise than volts at that point?
Also, avoiding magic numbers allows you to centralize code changes, rather than having to scour the code for the literal number 2 when your precision needs change.
I would propose something like:
HIGH_PRECISION = 3;
MED_PRECISION = 2;
LOW_PRECISION = 1;
And your client code would look like:
float_as_thousands_str_with_precision(volts, LOW_PRECISION )
float_as_thousands_str_with_precision(amps, MED_PRECISION )
float_as_thousands_str_with_precision(watts, MED_PRECISION )
Then, if in the future you do something like this:
HIGH_PRECISION = 6;
MED_PRECISION = 4;
LOW_PRECISION = 2;
All you do is change the constants...
But to try and answer the question in the OP title:
IMO the only numbers that can truly be used and not be considered "magic" are -1, 0 and 1 when used in iteration, testing lengths and sizes and many mathematical operations. Some examples where using constants would actually obfuscate code:
for (int i=0; i<someCollection.Length; i++) {...}
if (someCollection.Length == 0) {...}
if (someCollection.Length < 1) {...}
int MyRidiculousSignReversalFunction(int i) {return i * -1;}
Those are all pretty obvious examples: start at the first element and increment by one, test whether a collection is empty, and sign reversal... ridiculous, but it works as an example. Now replace all of the -1, 0 and 1 values with 2:
for (int i=2; i<50; i+=2) {...}
if (someCollection.Length == 2) {...}
if (someCollection.Length < 2) {...}
int MyRidiculousDoublingFunction(int i) {return i * 2;}
Now you have to start asking yourself: why am I starting iteration on the 3rd element and checking every other one? And what's so special about the number 50? What's so special about a collection with two elements? The doubler example actually makes sense here, but you can see that the non -1, 0, 1 values of 2 and 50 immediately become magic, because there's obviously something special in what they're doing, and we have no idea why.
No, they aren't.
A magic number in that context would be a number that has an unexplained meaning. In your case, it specifies the precision, which is clearly visible.
A magic number would be something like:
int calculateFoo(int input)
{
return 0x3557 * input;
}
You should be aware that the phrase "magic number" has multiple meanings. In this case, it specifies a number in source code, that is unexplainable by the surroundings. There are other cases where the phrase is used, for example in a file header, identifying it as a file of a certain type.
A literal numeral IS NOT a magic number when:
it is used one time, in one place, with very clear purpose based on its context
it is used with such common frequency and within such a limited context as to be widely accepted as not magic (e.g. the +1 or -1 in loops that people so frequently accept as being not magic).
Some people accept the +1 of a zero offset as not magic. I do not. When I see variable + 1 I still want to know why, whereas ZERO_OFFSET cannot be mistaken.
As for the example scenario of:
float_as_thousands_str_with_precision(volts, 1)
And the proposed
float_as_thousands_str_with_precision(volts, HIGH_PRECISION)
The 1 is magic if that call for volts with 1 is going to be used repeatedly for the same purpose. Then sure, it's "magic", but not because the meaning is unclear; it's because you simply have multiple occurrences.
Paul's answer focused on the "unexplained meaning" part, thinking that HIGH_PRECISION = 3 explained the purpose. IMO, HIGH_PRECISION offers no more explanation or value than something like PRECISION_THREE or THREE or 3. Of course 3 is higher than 1, but it still doesn't explain WHY higher precision was needed, or why there's a difference in precision. The numerals offer every bit as much intent and clarity as the proposed labels.
Why is there a need for varying precision in the first place? As an engineering guy, I can assume there are three possible reasons: (a) a true engineering justification that the measurement itself is only valid to X precision, so the display should reflect that; (b) there's only enough display space for X precision; or (c) the viewer won't care about anything higher than X precision even if it's available.
Those are complex reasons that are difficult to capture in a constant's name, and are probably better served by a comment (to explain why something is being done).
IF the use of those functions were in one place, and one place only, I would not consider the numerals magic. The intent is clear.
For reference:
A literal numeral IS magic when
"Unique values with unexplained meaning or multiple occurrences which
could (preferably) be replaced with named constants." http://en.wikipedia.org/wiki/Magic_number_%28programming%29 (3rd bullet)
