Parsed boost::spirit::qi::rule in memory mapped file - boost

I have several large logical expressions (length up to 300K) of the form -
( ( ( 1 ) and ( 5933 or 561 or 1641 ) ) or ( ( 71 or 1 or 15 or 20 ) and ( 436 ) ) or ( ( 398 or 22 or 33 ) ) )
that are parsed using Boost Spirit (as shown in the example here - Boolean expression (grammar) parser in c++)
The parsing takes more than a minute for each expression.
I would like to do the parsing offline, which results in an expression represented by -
typedef boost::variant <var,
        boost::recursive_wrapper<unop <op_not> >,
        boost::recursive_wrapper<binop<op_and> >,
        boost::recursive_wrapper<binop<op_or> >
        > expr;
This expression needs to be propagated to multiple machines for real-time evaluation against inputs. These machines cannot be made to spend the necessary time for the initial parsing.
Is it possible to propagate the parsed expression in the boost::variant representation above via a Boost Interprocess managed_mapped_file? I have tried doing this via a unique_ptr to the expression object and was able to write to a memory mapped file, but evaluation against this object, on the other hand, resulted in a segmentation fault.
Note that I have also tried Boost Serialization, which fails for very large expressions.
Looking forward to any advice around this.
Thanks!

Related

Boost Spirit sequential key value parser

Is there a better way of doing this with Spirit? I'm parsing a sequential series of key value pairs, with some line endings and other cruft in between. The format is not so consistent that I can just pull key value pairs out with a single rule. So I've got an adapter and production rules like:
BOOST_FUSION_ADAPT_STRUCT(
    Record,
    ( std::string, messageHeader )
    ( double, field1 )
    ( std::string, field2 )
    ( Type3, field3 )
    // ...
    ( TypeN, fieldN )
)

template< typename Iterator, typename Skipper >
class MyGrammar : public qi::grammar< Iterator, Record(), Skipper >
{
public:
    MyGrammar() : MyGrammar::base_type{ record }
    {
        record =
            qi::string( "Message header" )
            >> field1 >> field2
            // ...
            >> fieldN;

        field1 = qi::lit( "field 1:" ) >> qi::double_;
        // ...
    }

    // field rule declarations...
};
This is a straightforward if tedious way of going about it, and I've already exceeded the compiler's rule complexity threshold once, which forced me to refactor the fields into separate rules. Also, if there's an error parsing a message, the parser always reports the error at the beginning of the string, as if the rule doesn't give it enough context to figure out where the problem actually is. I assume this comes from the way the >> operator works.
Edit:
In response to sehe's question, I've run into two problems with this approach and the MSVC 15 compiler. The first was a compiler error on my top-level production when it hit somewhere in the vicinity of 80 components separated by >>:
recursive type or function dependency context too complex
So I pushed everything I could down into subordinate rules to reduce the complexity. Unfortunately now, after adding still more rules, I'm running into:
fatal error C1060: compiler is out of heap space
So I find that I do need some way to further decompose the problem that's not just a long series of concatenated production rules...

Compact data structure for storing parsed log lines in Go (i.e. compact data structure for multiple enums in Go)

I'm working on a script that parses and graphs information from a database logfile. Some example loglines might be:
Tue Dec 2 03:21:09.543 [rsHealthPoll] DBClientCursor::init call() failed
Tue Dec 2 03:21:09.543 [rsHealthPoll] replset info example.com:27017 heartbeat failed, retrying
Thu Nov 20 00:05:13.189 [conn1264369] insert foobar.fs.chunks ninserted:1 keyUpdates:0 locks(micros) w:110298 110ms
Thu Nov 20 00:06:19.136 [conn1263135] update foobar.fs.chunks query: { files_id: ObjectId('54661657b23a225c1e4b00ac'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:675 137ms
Thu Nov 20 00:06:19.136 [conn1258266] update foobar.fs.chunks query: { files_id: ObjectId('54661657ae3a22741e0132df'), n: 0 } update: { $set: { data: BinData } } nscanned:1 nupdated:1 keyUpdates:0 locks(micros) w:687 186ms
Thu Nov 20 00:12:14.859 [conn1113639] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1416453003000|74 } } cursorid:7965836327322142721 ntoreturn:0 keyUpdates:0 numYields: 15 locks(micros) r:351042 nreturned:3311 reslen:56307 188ms
Not every logline contains all fields, but some of the fields we parse out include:
Datetime
Query Duration
Name of Thread
Connection Number (e.g. 1234, 532434, 53433)
Logging Level (e.g. Warning, Error, Info, Debug etc.)
Logging Component (e.g. Storage, Journal, Commands, Indexing etc.)
Type of operation (e.g. Query, Insert, Delete etc.)
Namespace
The total logfile can often be fairly large (several hundred MBs, up to a couple of GBs). Currently the script is in Python, and in addition to the parsed fields it also stores the original raw logline as well as a tokenised version - the resulting memory consumption, though, is actually several multiples of the original logfile size. Hence, memory consumption is one of the main things I'd like to improve.
For fun/learning, I thought I might try re-doing this in Go, and looking at whether we could use a more compact data structure.
Many of the fields are enumerations (enums) - for some of them the set of values is known in advance (e.g. logging level, logging component). For others (e.g. name of thread, connection number, namespace), we'll work out the set at runtime as we parse the logfile.
Planned Changes
Firstly, many of these enums are stored as strings. So I'm guessing one improvement will be to move to using something like a uint8 to store them, and then either using consts (for the ones we know in advance) or having some kind of mapping table back to the original string (for the ones we work out). Or are there any other reasons I'd prefer consts versus some kind of mapping structure?
Secondly, rather than storing the original logline as a string, we can probably store an offset back to the original file on disk.
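To make these two changes concrete, here is a minimal sketch (all type and field names here are hypothetical, not taken from the existing script) of a runtime interning table for the values we only discover while parsing, plus a logline record that stores small integer IDs and a file offset instead of the raw strings:
package main

import "fmt"

// Interner maps strings discovered at parse time (thread names, namespaces, ...)
// to small integer IDs, and keeps the reverse mapping so the original text can
// be recovered for display.
type Interner struct {
    ids   map[string]uint32
    names []string
}

func NewInterner() *Interner { return &Interner{ids: make(map[string]uint32)} }

func (in *Interner) Intern(s string) uint32 {
    if id, ok := in.ids[s]; ok {
        return id
    }
    id := uint32(len(in.names))
    in.ids[s] = id
    in.names = append(in.names, s)
    return id
}

func (in *Interner) Name(id uint32) string { return in.names[id] }

// LogLine keeps compact IDs and a byte offset into the original logfile
// rather than the raw logline text.
type LogLine struct {
    Offset    int64  // where the raw line starts on disk
    ThreadID  uint32 // index into the thread-name Interner
    Level     uint8  // known in advance, so a plain const value
    Component uint8
}

func main() {
    threads := NewInterner()
    l := LogLine{Offset: 0, ThreadID: threads.Intern("rsHealthPoll"), Level: 1, Component: 0}
    fmt.Println(l, threads.Name(l.ThreadID))
}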
Questions
Do you see any issues with either of the two planned changes above? Are these a good starting point?
Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?
I know for bitmaps, there's things like Roaring Bitmaps (http://roaringbitmap.org/), which are compressed bitmaps which you can still access/modify normally whilst compressed. Apparently the overall term for things like this is succinct data structures.
However, are there any equivalents to roaring bitmaps but for enumerations? Or any other clever way of storing this compactly?
I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, it can only be in one of those sets, so I don't know if that makes sense. Also, not sure how to handle the false positives.
Thoughts?
Do you see any issues with either of the two planned changes above? Are these a good starting point?
No problems with either. If the logs are definitely line-delimited you can just store the line number, but it may be more robust to store the byte-offset. The standard io.Reader interface returns the number of bytes read so you can use that to gain the offset.
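As a rough sketch (assuming a plain line-delimited file; the filename is just a placeholder), you can accumulate the offset as you read each line:
package main

import (
    "bufio"
    "fmt"
    "io"
    "os"
)

func main() {
    f, err := os.Open("db.log") // placeholder logfile name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    var offsets []int64 // byte offset of the start of each line
    var pos int64
    r := bufio.NewReader(f)
    for {
        line, err := r.ReadString('\n')
        if len(line) > 0 {
            offsets = append(offsets, pos)
            pos += int64(len(line)) // bytes consumed, including the newline
            // ... tokenize line and keep only the parsed fields ...
        }
        if err == io.EOF {
            break
        }
        if err != nil {
            panic(err)
        }
    }
    fmt.Println("lines indexed:", len(offsets))
}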
Do you have any other tips/suggestions for optimising the memory consumption of how we store the loglines?
It depends on what you want to use them for, but once they've been tokenized (and you've got the data you want from the line), why hold onto the line in memory? It's already in the file, and you've now got an offset to look it up again quickly.
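For example, a small sketch of pulling a raw line back on demand (the offset and length would be whatever you recorded during the initial parse; the filename is again a placeholder):
package main

import (
    "fmt"
    "os"
)

// readLineAt re-reads one raw logline given its starting byte offset and length.
func readLineAt(f *os.File, off int64, length int) (string, error) {
    buf := make([]byte, length)
    if _, err := f.ReadAt(buf, off); err != nil {
        return "", err
    }
    return string(buf), nil
}

func main() {
    f, err := os.Open("db.log") // placeholder logfile name
    if err != nil {
        panic(err)
    }
    defer f.Close()

    line, err := readLineAt(f, 0, 60) // e.g. the first 60 bytes of the file
    if err != nil {
        panic(err)
    }
    fmt.Println(line)
}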
are there any equivalents to roaring bitmaps but for enumerations? Or any other clever way of storing this compactly?
I'd tend to just define each enum type as an int, and use iota. Something like:
package main

import (
    "fmt"
    "time"
)

type LogLevel int
type LogComponent int
type Operation int

const (
    Info LogLevel = iota
    Warning
    Debug
    Error
)

const (
    Storage LogComponent = iota
    Journal
    Commands
    Indexing
)

const (
    Query Operation = iota
    Insert
    Delete
)

type LogLine struct {
    DateTime      time.Time
    QueryDuration time.Duration
    ThreadName    string
    ConNum        uint
    Level         LogLevel
    Comp          LogComponent
    Op            Operation
    Namespace     string
}

func main() {
    l := &LogLine{
        time.Now(),
        10 * time.Second,
        "query1",
        1000,
        Info,
        Journal,
        Delete,
        "ns1",
    }
    fmt.Printf("%v\n", l)
}
Produces &{2009-11-10 23:00:00 +0000 UTC 10s query1 1000 0 1 2 ns1}.
Playground
You could pack some of the struct fields, but then you need to define bit-ranges for each field and you lose some open-endedness. For example define LogLevel as the first 2 bits, Component as the next 2 bits etc.
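For illustration, a packing scheme along those lines might look like this (the bit widths and positions are arbitrary choices for this sketch, and plain uint8 values stand in for the enum types above):
package main

import "fmt"

// Packed layout used in this sketch: bits 0-1 log level, bits 2-3 component,
// bits 4-5 operation.
type Packed uint8

func Pack(level, comp, op uint8) Packed {
    return Packed(level&0x3 | comp&0x3<<2 | op&0x3<<4)
}

func (p Packed) Level() uint8     { return uint8(p) & 0x3 }
func (p Packed) Component() uint8 { return uint8(p) >> 2 & 0x3 }
func (p Packed) Operation() uint8 { return uint8(p) >> 4 & 0x3 }

func main() {
    p := Pack(2, 1, 2) // e.g. Debug, Journal, Delete from the example above
    fmt.Println(p.Level(), p.Component(), p.Operation()) // prints: 2 1 2
}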
I also thought of bloom filters, and maybe using those to store whether each logline was in a set (i.e. logging level warning, logging level error) - however, it can only be in one of those sets, so I don't know if that makes sense. Also, not sure how to handle the false positives.
For your current example, bloom filters may be overkill. It may be easier to have a []int for each enum, or some other master "index" that keeps track of line-number to (for example) log level relationships. As you said, each log line can only be in one set. In fact, depending on the number of enum fields, it may be easier to use the packed enums as an identifier for something like a map[int][]int.
Set := make(map[int][]int)
Set[int(Delete) << 4 + int(Journal) << 2 + int(Debug)] = []int{7, 45, 900} // Line numbers in this set.
See here for a complete, although hackish example.

MRTG CPU and Memory together

So, I have an Adtran router and I'd like to monitor both CPU and memory utilization in a single graph. Unfortunately Adtran doesn't offer a percentage gauge for memory utilization the way it does for CPU utilization. It does offer two OIDs: one that gives you the free memory in bytes and the other that gives you total memory in bytes.
I would like to create a cpu_memory target in my MRTG configuration that does the necessary math but I can't see a way to do it. Ideally it would work something like this:
# CPU Utilization OID: .1.3.6.1.4.1.664.5.53.1.4.1.0
# Total Memory OID: .1.3.6.1.4.1.664.5.53.1.4.7.0 (adGenAOSHeapSize)
# Free Memory OID: .1.3.6.1.4.1.664.5.53.1.4.8.0 (adGenAOSHeapFree)
Target[rtr-cpu_mem]: .1.3.6.1.4.1.664.5.53.1.4.1.0&( 100 - ( .1.3.6.1.4.1.664.5.53.1.4.8.0 / .1.3.6.1.4.1.664.5.53.1.4.7.0 ) ):public#router.local
# ... rest of config
Is this even possible? Or, will I have to have a separate graph for the memory?
This is not really possible to do in a single native Target, since calculations apply to both values. While you can use pseudoZero and pseudoOne to get around this in part, you can't manage it this way.
I would advise that you have one Target for CPU, and a separate Target for the Memory calculation, which makes it much simpler. You can then use the 'dorelpercent' option on the Memory Target and have it fetch the used and total into the separate values.
However, if you really, really have to have them in the same target, there is an awkward way to kludge it -- custom data conversion functions.
You can define a custom Perl function to multiply the second value (the memory fraction) by 100 if it is no more than 1, and store this in a file 'conversion.pl':
sub topercent {
    my $value = shift;
    return ($1 * 100) if( $value =~ /([01]\.\d*)/ and ($1<=1));
    return $value;
}
Then, define your Target like this (replace cpuoid, totalmemoid and freememoid appropriately):
ConversionCode: /path/to/conversion.pl
Target[cpumem]: ( cpuoid&totalmemoid:comm#rtr - pseudoZero&freememoid:comm#rtr ) / ( pseudoOne&totalmemoid:comm#rtr ) |topercent
This results in In=cpupercent, Out=memusedpercent
I wouldn't advise doing it this way, though; best to stick to separate Targets for Memory and CPU. You can always combine these two targets into a single graph for display if you're using MRTG/RRD with Routers2 anyway.
Another alternative is to write a custom collection script that does the retrieval and processing, and define it like this:
Target[cpumem]: `myscript.sh community router`
and make myscript.sh output four lines: CPU percent, memory percent, and two blank lines.
You can do separate computations on each value using PseudoOne and PseudoZero, e.g.:
( PseudoZero&PseudoOne:community#host * 100 - memUsed&cpuIdle:community#host )
* ( PseudoOne&PseudoZero:community#host * 99 + PseudoOne&PseudoOne:community#host )
* ( PseudoZero&PseudoOne:community#host - PseudoOne&PseudoZero:community#host )
/ memTotal&PseudoOne:community#host
This computes the following:
memPercent = 100 * memUsed / memTotal
cpuPercent = 100 - cpuIdle
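Working each channel through step by step (a PseudoZero contributes 0 and a PseudoOne contributes 1 in its position) shows how those two results fall out:
In  = ( 0*100 - memUsed ) * ( 1*99 + 1 ) * ( 0 - 1 ) / memTotal = 100 * memUsed / memTotal
Out = ( 1*100 - cpuIdle ) * ( 0*99 + 1 ) * ( 1 - 0 ) / 1 = 100 - cpuIdle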

Compare two datasets

Not sure this is the right place to ask, but I have one set of "strings" (compiled from chars converted to int) that I need to convert to a char array.
This is an example of the 5 integers:
13497
13621
13673
13547
13691
And this is the same set I need:
132
0
52
182
5
Is there an easy way to convert them?
Edit: For added difficulty, I can't find exactly how the first set was made.

Compile an anonymous word in a compiled word

I'm currently working on a small interpreter written in Forth. For a small optimization, I'm trying to have a word which creates compiled words. For example, something which behaves like this:
: creator ( -- a )
:noname ( u -- u )
10 + ;
;
10 creator execute .
>> 20 ok
I have tried several approaches so far and none worked (the naïve one above, switching to interpretive mode, trying to compile a string of Forth source). Is this actually possible?
When you write compiling words, you have to be very careful about which words execute at compile time and which execute at runtime. In this case, 10 + runs at compile time, and will not be compiled into your :noname definition.
I believe this is what you want:
: creator ( -- xt ) :noname ( n1 -- n2 )
10 postpone literal postpone + postpone ; ;
Also note that you may use CREATE DOES> in many cases. E.g. if you want your creator to accept a number which is used by the child word:
: creator ( n1 "name" -- ) create ,
  does> ( n2 -- n1+n2 ) @ + ;
