Flex: buffer overflow error when reading a large token

My parser hits the error below when it encounters a very large token:
"input buffer overflow, can't enlarge buffer because scanner uses REJECT"
1) The default YY_BUF_SIZE is 16k. The error goes away if I make this value bigger, but that is no guarantee for the next run, since the input can contain even bigger tokens. I also find that enlarging YY_BUF_SIZE hurts parser performance: it is slower than before.
2) According to the instructions here, yytext should be %pointer to overcome this error, but I tried it and it didn't work. It seems my yytext is already a pointer.
Can anyone help me fix this problem? I think flex should allow tokens of unlimited size, bounded only by available system memory. Any suggestion or idea is much appreciated!

The error message tells you what you need to know, I think. Normally, flex can resize the buffer up to the limit of available memory, but that's not the case if you use the REJECT action (because the scanner needs to maintain a state stack to implement REJECT and the state stack is not resizable).
In general, flex is not optimized for huge tokens, and in some cases huge tokens can slow flex down quite a lot. REJECT also slows flex down. So your best solution is to avoid matching huge tokens; for example, strings and comments can be tokenized one line at a time instead of being tokenized as one enormous token.
If you really need to be able to handle arbitrarily-long single tokens, you'll have to figure out how to avoid REJECT.

Related

Fortran 95: access large files fast using direct access

I am currently working on a problem which requires me to store a large amount of well structured information in a file.
It is more data than I can keep in memory, but I need to access different parts of it very often and would like to do so as quickly as possible (of course).
Unfortunately, the file would be large enough that actually reading through it would take quite some time as well.
From what I have gathered so far, it seems to me that ACCESS="DIRECT" would be a good way of handling this problem. Do I understand correctly that with direct access I am basically pointing at a specific chunk of memory and asking "What's in there?"? And do I correctly infer from that that reading time does not depend on the overall file size?
Thank you very much in advance!
You can think of an ACCESS='DIRECT' file as a file consisting of a number of fixed size records. You can do operations like read or write record #N in O(1) time. That is, in order to access record #N you don't need to scan through all the preceding #M (M<N) records in the file.
If this maps reasonably well to the problem you're trying to solve, then ACCESS='DIRECT' might be the correct solution in your case. If not, ACCESS='STREAM' offers a little more flexibility, in that the size of each record does not need to be fixed, though you need to be able to compute the correct file offset yourself. If you need even more flexibility, there are things like NetCDF or HDF5, as @HighPerformanceMark suggested, or even things like sqlite.
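Not Fortran, but the same fixed-size-record idea can be sketched in Java with RandomAccessFile (the file name and record size below are just assumptions): record #N lives at byte offset (N-1)*recordSize, so reading it is one seek plus one read, independent of the total file size.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class DirectAccessDemo {
        static final int RECORD_SIZE = 64;   // bytes per fixed-size record (assumed)

        // Read record #n (1-based, like Fortran's REC=) without scanning earlier records.
        static byte[] readRecord(RandomAccessFile file, long n) throws IOException {
            byte[] record = new byte[RECORD_SIZE];
            file.seek((n - 1) * RECORD_SIZE);   // O(1): jump straight to the record's offset
            file.readFully(record);
            return record;
        }

        public static void main(String[] args) throws IOException {
            // Assumes data.bin exists and holds at least 1000 fixed-size records.
            try (RandomAccessFile file = new RandomAccessFile("data.bin", "r")) {
                byte[] rec = readRecord(file, 1000);
                System.out.println("first byte of record 1000: " + rec[0]);
            }
        }
    }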

Performance-wise, is it worth it to rename every mongo key name for production? [duplicate]

As far as I know, every key name is stored "as-is" in the mongo database. It means that a field "name" will be stored using the 4 letters everywhere it is used.
Would it be wise, if I want my app to be ready to store a large amount of data, to rename every key in my mongo documents? For instance, "name" would become "n" and "description" would become "d".
I expect this to significantly reduce the space used by the database, as well as the amount of data sent to the client (not to mention that it kind of uglifies the content of the mongo documents). Am I right?
If I undertake the renaming of every key in my code (no need to rename the existing data, I can rebuild it from scratch), is there a good practice or any additional advice I should know about?
Note: this is mainly speculation; I don't have benchmarking results to back this up.
While "minifying" your keys technically would reduce your memory/disk-space footprint, I think the advantages of doing so are quite minimal, if not actually outweighed by the disadvantages.
The first thing to realize is that data stored in MongoDB is not actually stored in its raw JSON format; it's stored as pure binary using a standard known as BSON. This allows Mongo to do all sorts of internal optimizations, such as compression if you're using WiredTiger as your storage engine (thanks for pointing that out, @Jpaljasma).
Second, let's say you do minify your keys. Well, then you need to minify your keys. Every time. Forever. That's a lot of work on your application side. Plus you need to un-minify your keys when you read (because users won't know what "n" means). Every time. Forever. All of a sudden your minor memory optimization becomes a major runtime slowdown.
Third, that minifying/un-minifying process is kind of complicated. You need to maintain and test the mapping between the two, keep it up to date, and make sure it never has any overlap (if it does, that's pretty much the end of your data). I wouldn't ever want to work on that.
So overall, I think it's a pretty terrible idea to minify your keys to save a couple of characters. It's important to keep the big picture in mind: the VAST majority of your data will be in the values, not in the keys. If you want to optimize data size, look there.
The full name of every field is included in every document. So when your field-names are long and your values rather short, you can end up with documents where the majority of the used space is occupied by redundant field names.
This affects the total storage size and decreases the number of documents which can be cached in RAM, which can negatively affect performance. But using descriptive field-names does of course improve readability of the database content and queries, which makes the whole application easier to develop, debug and maintain.
Depending on how flexible your driver is, it might also require quite a lot of boilerplate code to convert between your application field-names and the database field-names.
Whether or not this is worth it depends on how complex your database is and how important performance is to you.
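To make the boilerplate concrete, here is a rough sketch of such a mapping layer using the Java driver's org.bson.Document; the field names and helper methods are hypothetical, not anything the driver provides.

    import org.bson.Document;

    // Hypothetical mapping layer between readable application field names
    // and shortened database field names.
    public class UserMapper {
        // The short names live in exactly one place; any overlap here would corrupt data.
        static Document toDb(String name, String description) {
            return new Document("n", name)        // "name" -> "n"
                    .append("d", description);    // "description" -> "d"
        }

        // Un-minify on every read so the rest of the application never sees the short keys.
        static String[] fromDb(Document doc) {
            return new String[] { doc.getString("n"), doc.getString("d") };
        }
    }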

What are the standard data structures that can be used to efficiently represent the world of minecraft? [closed]

I am thinking of something like a 3-dimensional array indexed by the x, y, z coordinates. But that would be a waste of memory, since a lot of block spaces are empty. Another solution would be a hashmap ((x,y,z) -> BlockObject), but that doesn't seem too efficient either.
When I say efficient, I do not mean optimal. I simply mean that it should run smoothly on a modern computer. Keep in mind that the worlds generated by Minecraft are quite huge, so efficiency is important regardless. There is also tons of metadata that needs to be stored.
As noted in my comment, I have no idea how MineCraft does this, but a common efficient way of representing this sort of data is an Octree; http://en.wikipedia.org/wiki/Octree. The general idea is that it's like a binary tree but in three-space. You recursively divide each block of space in each dimension to get eight smaller blocks, and each block contains the pointers to the smaller blocks and a pointer to its parent block.
This allows you to be efficient about storing large blocks of the same material (e.g., "empty space"), because you can terminate the recursion whenever you get to a block that is made up of all the same thing, even if you haven't recursed down to the level of individual "cube" units.
Also, this means that you can efficiently find all the cubes in a given region by taking your current block and going up the tree just far enough to get to a block that contains all you can see -- and that way, you can very easily ignore all the cubes that are somewhere else.
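To make the idea concrete, here is a minimal Java sketch (my own illustration, not Minecraft's actual storage): a node is either a uniform region of one block type, or it has eight children covering its octants, so large runs of identical material cost a single node.

    // Octree over a cubic region whose edge length is a power of two.
    public class OctreeNode {
        int size;              // edge length of this node's cubic region
        int uniformBlockType;  // meaningful only while children == null
        OctreeNode[] children; // null means the whole region is one material

        OctreeNode(int size, int blockType) {
            this.size = size;
            this.uniformBlockType = blockType;
        }

        // Block type at (x, y, z), coordinates local to this node.
        int get(int x, int y, int z) {
            if (children == null) return uniformBlockType;   // uniform: recursion stops here
            int half = size / 2;
            int index = (x < half ? 0 : 1) | (y < half ? 0 : 2) | (z < half ? 0 : 4);
            return children[index].get(x % half, y % half, z % half);
        }

        // Set a single block, splitting uniform nodes only when needed.
        void set(int x, int y, int z, int blockType) {
            if (size == 1) { uniformBlockType = blockType; return; }
            if (children == null) {
                if (uniformBlockType == blockType) return;   // nothing changes
                children = new OctreeNode[8];
                for (int i = 0; i < 8; i++) {
                    children[i] = new OctreeNode(size / 2, uniformBlockType);
                }
            }
            int half = size / 2;
            int index = (x < half ? 0 : 1) | (y < half ? 0 : 2) | (z < half ? 0 : 4);
            children[index].set(x % half, y % half, z % half, blockType);
            // A production version would also merge the children back into a
            // uniform node when they all end up holding the same block type.
        }
    }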
If you're interested in exploring alternative ways to represent Minecraft world (chunk) data, you can also look into the idea of bitstrings. Each 'chunk' comprises a 16*16*128 volume, and each 16*16 slice can be adequately represented by single-byte characters and consolidated into a binary string.
As this approach is highly specific to the goal of trading client-side computation for highly optimized storage and transfer size, it seems imprudent to try to explain all the details here, but I have created a specification for just this purpose, if you're interested.
Using this method, the storage cost works out very differently from the current 1 byte per block; it is instead a 'variable bit rate': ((1 bit per block, rounded up to a multiple of 8) * (number of unique layers in which a block type appears in the chunk) + 2 bytes).
This is then summed over the unique block types in that chunk.
Pretty much only in deliberately constructed edge cases can this be more expensive than a normally structured chunk; well over 99% of Minecraft chunks are naturally generated and would benefit from this variable-bit representation by a ratio of 8:1 or more in many of my tests.
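For what it's worth, a rough Java transcription of that cost formula under my reading of it (one bit per block in a 16*16 layer is 256 bits, i.e. 32 bytes, already a multiple of 8):

    import java.util.Map;

    public class ChunkCostEstimate {
        // layersPerBlockType: block type id -> number of distinct layers of the
        // chunk in which that block type appears.
        static long estimateBytes(Map<Integer, Integer> layersPerBlockType) {
            long total = 0;
            for (int layers : layersPerBlockType.values()) {
                total += 32L * layers + 2;   // 32 bytes of layer bitmask + 2 bytes overhead
            }
            return total;                    // summed over the unique block types
        }

        public static void main(String[] args) {
            // e.g. stone in 60 layers, dirt in 5, air in all 128
            System.out.println(estimateBytes(Map.of(1, 60, 3, 5, 0, 128)));
            // compare with a flat 1-byte-per-block chunk: 16 * 16 * 128 = 32768 bytes
        }
    }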
Your best bet is to decompile Minecraft and look at the source. Modifying Minecraft: The Source Code is a nice walkthrough on how to do that.
Minecraft is very far from efficient. It just stores "chunks" of data.
Check out the "Map formats" in the Development Resources at Minecraft Wiki. AFAIK, the internal representation is exactly the same.

Ways of Efficiently Seeking in Custom File Formats

I've been wondering how seeking is implemented across different file formats, and what would be a good way to construct a file containing a lot of data so that it can be seeked efficiently. Some approaches I've considered are equal-sized packets, which allow quick skipping since you know what each data chunk looks like, and pre-indexing the file whenever it is loaded.
This entirely depends on the kind of data, and what you're trying to seek to.
If you're trying to seek by record index, then sure: fixed size fields makes life easier, but wastes space. If you're trying to seek by anything else, keeping an index of key:location works well. If you want to be able to build the file up sequentially, you can put the index at the end but keep the first four bytes of the file (after the magic number or whatever) to represent the location of the index itself (assuming you can rewrite those first four bytes).
If you want to be able to perform a sort of binary chop on variable length blocks, then having a reasonably efficient way of detecting the start of a block helps - as does having next/previous pointers, as mentioned by Alexander.
Basically it's all about metadata, really - but the right kind of metadata will depend on the kind of data, and the use cases for seeking in the first place.
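As a concrete illustration of the "index at the end, offset at the front" layout described above, here is a rough Java sketch using RandomAccessFile; the field widths and record format are invented for illustration and assume a matching writer produced the file.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Assumed layout: [4-byte index offset][records...][index: count, then key/offset pairs].
    public class IndexedFileReader {
        static Map<String, Long> readIndex(RandomAccessFile f) throws IOException {
            f.seek(0);
            long indexOffset = f.readInt();        // the reserved first four bytes, written last
            f.seek(indexOffset);
            int entries = f.readInt();
            Map<String, Long> index = new LinkedHashMap<>();
            for (int i = 0; i < entries; i++) {
                index.put(f.readUTF(), f.readLong());   // key -> byte offset of its record
            }
            return index;
        }

        static byte[] readRecord(RandomAccessFile f, Map<String, Long> index, String key) throws IOException {
            f.seek(index.get(key));                // jump straight to the record, no scanning
            int length = f.readInt();              // each record is length-prefixed
            byte[] data = new byte[length];
            f.readFully(data);
            return data;
        }
    }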
Well, giving each chunk a size field (i.e., the offset to the next chunk) is common and allows fast skipping of unknown data. Another way would be an index chunk at the beginning of the file, storing a table of all chunks in the file along with their offsets. Programs would simply read the index chunk into memory.
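And a minimal sketch of that size-prefixed chunk idea (the tag/length layout is again invented for illustration): because every chunk states its own payload length, a reader can hop over chunks it doesn't care about without parsing them.

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Assumed chunk layout: [4-byte tag][4-byte payload length][payload].
    public class ChunkSkipper {
        // Returns the offset of the first chunk with the given tag, or -1 if absent.
        static long findChunk(RandomAccessFile f, int wantedTag) throws IOException {
            long pos = 0;
            while (pos + 8 <= f.length()) {        // need at least a full header
                f.seek(pos);
                int tag = f.readInt();
                int length = f.readInt();
                if (tag == wantedTag) return pos;
                pos += 8 + length;                 // skip header + payload of this chunk
            }
            return -1;
        }
    }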

Payload performance in Lucene

I know there are several topics on the web, as well as on SO, regarding indexing and query performance within Lucene, but I have yet to find one that discusses whether or not (and if so, how much?) creating payloads will affect query performance...
Here's the scenario ...
Let's say I want to index a collection of documents (anywhere from 100K - 10M), and each document has a subsection that I want to be able to search separately (or perhaps rank higher, depending on whether a match was found within that section).
I'm considering adding a payload (during indexing) to any term that appears within that subsection, so I can efficiently make that determination at query-time.
Does anyone know of any performance issues related to using payloads, or even better, could you point me to any online documentation about this topic?
Thanks!
EDIT: I appreciate the alternative solutions to my scenario, but in case I do need to use payloads in the future, does anyone have any comments regarding the original question about query performance?
The textbook solution to what you want to do is index each original document as two fields: one for the full document, and the other for the subsection. You can boost the subsection field separately either during indexing or during retrieval.
Having said that, you can read about Lucene payloads here: Getting Started with Payloads.
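For concreteness, here is a rough sketch of that two-field approach (not the payload route); the field names, boost factor, and the Lucene 8.x-style API used here are illustrative assumptions.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;

    public class TwoFieldIndexing {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            ByteBuffersDirectory dir = new ByteBuffersDirectory();

            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                String fullText = "the whole document text, including the special subsection";
                String subsection = "the special subsection";

                Document doc = new Document();
                doc.add(new TextField("body", fullText, Field.Store.NO));      // everything
                doc.add(new TextField("section", subsection, Field.Store.NO)); // just the subsection
                writer.addDocument(doc);
            }

            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                // A match in the subsection field counts three times as much as one in the body.
                Query query = new QueryParser("body", analyzer)
                        .parse("body:subsection OR section:subsection^3");
                TopDocs hits = searcher.search(query, 10);
                System.out.println("matching documents: " + hits.scoreDocs.length);
            }
        }
    }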
Your use case doesn't fit well with the purpose of payloads; it looks to me as if any payload information would be redundant.
Payloads are attached to individual occurrences of terms in the document, not to document/term pairs. In order to store and access payloads, you have to use the offset of the term occurrence within the document. In your case, if you know the offset, you should be able to calculate which section the term occurrence is in, without using payload data.
The broader question is the effect of payloads on performance. My experience is that when properly used, the payload implementation takes up less space and is faster than whatever workaround I was previously using. The biggest impact on disk space will be wherever you currently use Field.setOmitTermFreqAndPositions(true) to reduce index size. You will need to include positions to use payloads, which potentially makes the index much larger.

Resources