When should I use dictionary encoding in Parquet?

I see that parquet supports dictionary encoding on a per-column basis, and that dictionary encoding is described in the GitHub documentation:
Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The dictionary will be stored in a dictionary page per column chunk. The values are stored as integers using the RLE/Bit-Packing Hybrid encoding. If the dictionary grows too big, whether in size or number of distinct values, the encoding will fall back to the plain encoding. The dictionary page is written first, before the data pages of the column chunk.
Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.
Data page format: the bit width used to encode the entry ids stored as 1 byte (max bit width = 32), followed by the values encoded using RLE/Bit packed described above (with the given bit width).
Using the PLAIN_DICTIONARY enum value is deprecated in the Parquet 2.0 specification. Prefer using RLE_DICTIONARY in a data page and PLAIN in a dictionary page for Parquet 2.0+ files.
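If I'm reading that right, a column's values end up as one small dictionary page plus a stream of integer indices into it; a plain-Python sketch of the idea (just to check my understanding, not the actual Parquet layout):

values = ["red", "green", "red", "red", "blue", "red"]
# dictionary page: each distinct value stored once (plain-encoded in the real format)
dictionary = list(dict.fromkeys(values))         # ['red', 'green', 'blue']
# data page: one small integer index per value (RLE/bit-packed in the real format)
indices = [dictionary.index(v) for v in values]  # [0, 1, 0, 0, 2, 0]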
Okay... so how do I know when to use dictionary encoding and when not to?
Is there any rule of thumb to help? E.g. if 90% of the values in a column are expected to fall within some small set, should I use it?
I have a use case where I expect three different scenarios for different columns:
integer column where all values lie within a very small set → seems perfect for dictionary encoding
integer column where 99% of values lie within a very small set but 1% are unlikely to form any clustering → not sure
string column where no value is likely to be the same → seems like dictionary encoding is a bad idea
Is there any documentation explaining which strategy is appropriate under various conditions?

I'm not aware of any documentation (on the Arrow side at least) that recommends when to use dictionary encoding and when not to. It's a good question, and your instincts are reasonable. Maybe you can try writing those kinds of data both ways and comparing file size and read/write speed; I'd be interested to see what you find.
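Something along these lines would do it (a rough pyarrow sketch; the column names and value distributions are just my approximation of your three scenarios):

import os
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
rng = np.random.default_rng(0)

table = pa.table({
    # 1) integers drawn from a very small set
    "small_set": rng.integers(0, 10, n),
    # 2) 99% of values from a very small set, 1% scattered across the integer range
    "mostly_small_set": np.where(rng.random(n) < 0.99,
                                 rng.integers(0, 10, n),
                                 rng.integers(0, 2**31, n)),
    # 3) strings that are all distinct
    "unique_strings": [f"row-{i}" for i in range(n)],
})

for use_dict in (True, False):
    path = f"dictionary_{use_dict}.parquet"
    pq.write_table(table, path, use_dictionary=use_dict)  # accepts a bool or a list of column names
    print(path, os.path.getsize(path), "bytes")

You could also time pq.read_table on each file to compare read speed, and pass a list of column names to use_dictionary to enable it only for the columns where it pays off.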

Related

How to use INDEX MATCH in Google Sheets when the two values are formatted differently (text versus numbers)?

I'm trying to match a string of numbers (like 370488004) using the typical INDEX MATCH formula. Unfortunately, in one range the numbers are formatted as plain text, and in the other range they are formatted differently, usually as 'Automatic' or 'Number'. I want to avoid having to update the formatting of both ranges whenever the values get updated (usually via a paste from an outside source), especially since it's not always going to be me doing the updating.
Is there a way I can write my INDEX MATCH formula so that it ignores the formatting of the values it's attempting to match?
The value returned by the INDEX formula can be in any format. Plain text, number, doesn't matter. The problem is that the two values I'm matching are in different formats. I need a formula that ignores the difference in formatting.
Here's an example sheet:
https://docs.google.com/spreadsheets/d/1cwO7HGtwR4mRnAqcjxqr1qbhGwJHLjBKkp7-iwzkOqY/edit?usp=sharing
You can use VALUE or INT to force it into a number value, or, if you want to keep it as text, use TEXT. An example would be:
=INDEX(VALUE(D1:E4),MATCH(G1,E1:E4,FALSE),1)
The numbers in column D are in fact text, but utilizing VALUE first for the range puts them all in number format. It is finding the value associated with "Green" written in G1. Without seeing a working example sheet this is the best solution I can offer.
UPDATE:
You can use an array VLOOKUP with a static range of lookup values (an open-ended range gives an error), or QUERY to handle an open-ended range.
=ARRAYFORMULA(VLOOKUP(VALUE($G3:$G5),$B3:$C,2,FALSE))
=QUERY(FILTER($B3:$C,$B3:$B=VALUE($G3:$G)),"Select Col2")

Is there a metalanguage, similar to BNF that can concisely describe self-describing data?

Say, for instance, I had a data set that was self-describing. The first few well-structured records define data type IDs, which include the name and length of records; these are followed by content records, which start with the data type IDs and contain a variable amount of data, depending on the ID.
It would be easy enough to describe the definition records using BNF, EBNF, or ABNF... but how would one concisely describe the content records, whose length is defined in the definition records?
Here is an example of describing the classic NetCDF data format with a BNF-like notation, but not concisely, because the lengths of the data records are not specified as a function of data in the earlier dim and var definitions.
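For concreteness, here is a rough Python sketch of the kind of format I have in mind (the byte layout is invented purely for illustration); the point is that the length of each content record is determined by a definition record read earlier:

import io

def parse(stream):
    # definition record: 0x00, type id (1 byte), name length (1 byte), name, data length (1 byte)
    # content record:    type id (1 byte), then `data length` bytes as declared by its definition
    defs, records = {}, []
    while (tag := stream.read(1)):
        if tag == b"\x00":                       # definition record
            type_id, name_len = stream.read(1)[0], stream.read(1)[0]
            name = stream.read(name_len).decode("ascii")
            defs[type_id] = (name, stream.read(1)[0])
        else:                                    # content record
            name, data_len = defs[tag[0]]
            records.append((name, stream.read(data_len)))
    return records

# "temp" records are declared to be 2 bytes long, then two content records follow
data = b"\x00\x01\x04temp\x02" + b"\x01\x41\x42" + b"\x01\x43\x44"
print(parse(io.BytesIO(data)))                   # [('temp', b'AB'), ('temp', b'CD')]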
Are you asking how to define the content of the content records? You made it clear that they're already defined in terms of the amount of data. If each data type ID implies not only a data length but also a data structure, it's straightforward, even in BNF, with one set of productions for each data type ID. Is that what you mean? (It's even likely to be LR(1).)
I am the creator of an Expert System, named XTRAN, that manipulates over 30 computer languages, as well as data and text. I got tired of writing parsers, so I created a parsing engine that executes EBNF at parse time, and I feed it the EBNF via the Expert System's rules language. Since EBNF itself is meta, the schema I use to parse and store it for execution at parse time is meta-meta.
XTRAN's rules language also provides a database capability in which a database is in-memory, content-addressable, and stored as a sparse matrix. It's effectively an n-space, with each cell addressed via a list of subscripts, each of which is either elided, an integer, or a text string. So I can construct the scenario you describe quickly, by storing the data descriptions in the same database that contains the content records. It's loosely analogous to a relational database describing its schema via its own contents.
FWIW, we call XTRAN's rules language meta-code, because it's a language that can manipulate other languages (as well as itself).

How does RethinkDB generate auto ids?

I'm writing a script which is supposed to merge some data from an SQL-based DB. Each row has a long integer as a primary key (incremental). I was thinking about hashing these ids so that they'll somehow 'look' like the other ids already in my RethinkDB table. What I'm trying to achieve here is to avoid dups in case of an attempt to merge the same data again, but keeping the original integers as ids alongside the generated ids of the data saved directly to RethinkDB's table feels weird.
Can I do that?
How does RethinkDB generate auto ids anyways?
And am I approaching this correctly..?
RethinkDB uses a string encoding of 128-bit UUIDs (basically hashed integers).
The string format looks like this: "HHHHHHHH-HHHH-HHHH-HHHH-HHHHHHHHHHHH", where every 'H' is a hexadecimal digit of the 128-bit integer. The characters 0-9 and a-f (lower case) are used.
If you want to generate such UUIDs from an existing integer, I recommend hashing the integer first. This will give you an even distribution over the whole key space (this makes sharding easier and avoids hotspots).
As a second step, you have to format the hash value as a string in the format shown above. If you don't have enough digits, it's fine to leave some of the trailing 'H's as constant 0.
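For example, something like this (a Python sketch; md5 is used only because it conveniently produces 128 bits, and this is not RethinkDB's own id scheme):

import hashlib
import uuid

def id_for(sql_pk: int) -> str:
    # hash the integer key to spread ids evenly over the key space,
    # then format the 128-bit digest as a "HHHHHHHH-HHHH-..." string
    digest = hashlib.md5(str(sql_pk).encode("ascii")).digest()  # 16 bytes = 128 bits
    return str(uuid.UUID(bytes=digest))

print(id_for(42))  # the same integer always maps to the same id, so re-merging the same rows won't create duplicates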
If you really want to go into the details of UUID generation, here are two links for further reading:
RFC 4122 "A Universally Unique IDentifier (UUID) URN Namespace" https://www.rfc-editor.org/rfc/rfc4122
RethinkDB's implementation of UUID generation and formatting https://github.com/rethinkdb/rethinkdb/blob/next/src/containers/uuid.cc

How to increase the maximum size of a CSV field in Magento, and where is this located?

I have one field when importing that can contain large data. It seems that CSV has an unofficial limitation of about 65,000 (likely 65,535; see the note below) characters, as both LibreOffice Calc and Magento truncate the data for that particular field. I have investigated thoroughly and I'm certain it is not because of a special character or quotes; the data is pretty straightforward, and the lines are similar in format to each other.
Question: How can I increase that size? Or at least, where should I look to find it?
Note: I counted in LibreOffice Writer and it was about 65,040, but with carriage return characters it could probably reach 65,535.
I changed:
1) in the table catalog_category_entity_text, the type of the "value" field from TEXT to LONGTEXT
2) in the file app/code/core/Mage/ImportExport/Model/Import/Entity/Abstract.php,
const DB_MAX_TEXT_LENGTH = 65536;
to
const DB_MAX_TEXT_LENGTH = 16777215;
and everything works now.
You are right, there is a limitation in Magento: it creates text fields as TEXT in the MySQL database and, according to the MySQL docs, that column type supports a maximum of 65,535 characters.
http://dev.mysql.com/doc/refman/5.0/es/storage-requirements.html
So you could change the column type in your Magento database to MEDIUMTEXT, which allows up to 16,777,215 (2^24 - 1) characters. I guess the correct place is the catalog_product_entity_text table, where you should modify the 'value' field type to match your needs. But please keep in mind that this is dangerous: make a full backup before trying. And you may even need to play with core files... not recommended!
I'm having the same issue with 8 products from a list of more than 400, and I think I'm not going to mess with the Magento core and database; we can just shorten the description strings for those few products.
The CSV format itself couldn't care less. Because Microsoft Access allows Memo fields, which can contain quite a bit of data, I've exported 2-3k descriptions in CSV format to be imported into Magento quite successfully.
Your limitation comes either from a spreadsheet application that imposes a cell or export limit, or from the table column you are importing into having a maximum character limit.
You can determine the latter by using phpMyAdmin to see what the maximum character setting is for that field.

Query a Core Data store based on a transient calculated value

I'm fairly new to the more complex parts of Core Data.
My application has a core data store with 15K rows. There is a single entity.
I need to display a subset of those rows in a table view filtered on a calculated search criteria, and for each row displayed add a value that I calculate in real time but don't store in the entity.
The calculation needs to use a couple of values supplied by the user.
A hypothetical example:
Entity: contains fields "id", "first", and "second"
User inputs: 10 and 20
Search / Filter Criteria: only display records where the entity field "id" is a prime number between the two supplied numbers. (I need to build some sort of complex predicate method here I assume?)
Display: all fields of all records that meet the criteria, along with a derived field (not in the Core Data entity) that is the sum of the "id" field and a random number, so each row in the table view would contain 4 fields:
"id", "first", "second", -calculated value-
From my reading / Googling it seems that a transient property might be the way to go, but I can't work out how to do this given that the search criteria and the resultant property need to be calculated based on user input.
Could anyone give me any pointers that will help me implement this code? I'm pretty lost right now, and the examples I can find in books etc. don't match my particular needs well enough for me to adapt them as far as I can tell.
Thanks
Darren.
The first thing you need to do is to stop thinking in terms of fields, rows and columns, as none of those structures are actually part of Core Data. In this case, it is important because Core Data supports arbitrarily complex fetches but the SQLite store does not. So, if you use an SQLite store, your fetches are restricted to those supported by SQLite.
In this case, predicates aimed at SQLite can't perform complex operations such as calculating whether an attribute value is prime.
The best solution for your first case would be to add a boolean attribute isPrime and then modify the setter for your id attribute to calculate whether the newly set id value is prime and set isPrime accordingly. That will be stored in the SQLite store and can be fetched against, e.g. isPrime == YES && ((first <= %@) && (second >= %@))
The second case would simply use a transient property for which you would supply a custom getter to calculate its value when the managed object was in memory.
One often overlooked option is to not use an SQLite store but to use an XML store instead. If the amount of data is relatively small, e.g. a few thousand text attributes with a total memory footprint of a few dozen megabytes, then an XML store will be super fast and can handle more complex operations.
SQLite is sort of the stunted stepchild in Core Data. It's useful for large data sets and low memory, but with memory becoming ever more plentiful, it's losing its edge. I find myself using it less these days. You should consider whether you really need SQLite in this particular case.
