Parquet metadata "has_dictionary_page" is false but the column has "PLAIN_DICTIONARY" encoding - parquet

I used pyarrow's parquet module to read the metadata of a Parquet file with this code:
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")
for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(col)
and got has_dictionary_page: False in the output (for every row group),
but according to my checks all the column chunks in all of the row groups are PLAIN_DICTIONARY encoded. Furthermore, I checked the statistics of the dictionary and saw all the keys and values in it. Attaching part of it:
How is it possible that there is no dictionary page?

My best guess is that you are running into PARQUET-1547 which is described a bit more in this question.
In summary, some Parquet writers did not write the dictionary_page_offset field correctly. Most Parquet readers have workarounds in place to recognize the invalid value. However, parquet-cpp (which is used by pyarrow) does not have such a workaround in place.
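If you want to check this on your own file, here is a small pyarrow sketch (reusing the file name from the question) that prints the reported encodings next to the dictionary-page metadata for every column chunk, so you can see the mismatch directly:
from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")  # file name taken from the question
for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        # encodings can still list PLAIN_DICTIONARY while has_dictionary_page is False,
        # since the latter is based on the dictionary_page_offset field that the
        # affected writers populated incorrectly (see PARQUET-1547).
        print(col.path_in_schema, col.encodings,
              col.has_dictionary_page, col.dictionary_page_offset)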

Related

Trouble converting yaml file to toml file

I am trying to convert a yaml file into a toml file in python3.
My plan is to use toml.dumps() which expects a dictionary, then write to an output file.
I need to be able to match the input toml requirements for a tool that I need to plug into.
This tool expects inline tables at certain instances like this:
[states.Started]
outputs = { on = true, running = false }
[[states.Started.transitions]]
inputs = { go = true }
target = "Running"
I understand how to generate the tables [] and array of tables [[]] but I am having a hard time figuring out how to create the inline tables.
The TOML documentation says that inline tables are the same as tables. So, for example, with the array of tables above (states.Started.transitions) I figured inputs would be a table within the overall list; however, the TOML encoder breaks it into separate tables in the output.
Can anyone help me figure out how to configure my dictionary to output the inline table?
EDIT:
I am not sure I fully understand what I am doing wrong. Here is my code:
import toml

table = {'a': 5, 'b': 3}
inline_table = toml.TomlDecoder().get_empty_inline_table()
inline_table['Values'] = table
encoder = toml.TomlPreserveInlineDictEncoder()
toml_config = toml.dumps(inline_table, encoder=encoder)
However, this does not create an inline table, but a regular table in the output:
[Values]
a = 5
b = 3
"This tool expects inline tables at certain instances like this"
Then it is not a TOML tool. As you discovered yourself, inline tables are semantically equivalent to normal tables, so if the tool conforms to the TOML specification, it must not care whether the tables are inline or not. If it does, this is not a TOML question.
That being said, you can create the dicts that should become inline tables via
TomlDecoder().get_empty_inline_table()
The returned object is a dict subclass and can be used as such.
When you have finished creating your structure, use TomlPreserveInlineDictEncoder for dumping, and don't ask why this is suddenly called InlineDict instead of InlineTable.
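Regarding the edit: only dicts created via get_empty_inline_table() are rendered inline, so the inline-table object has to be the value you want inlined, nested inside an ordinary dict rather than the other way around. A minimal sketch of that usage, assuming the uiri/toml package:
import toml

inline = toml.TomlDecoder().get_empty_inline_table()
inline['a'] = 5
inline['b'] = 3

doc = {'Values': inline}  # the inline-table dict is the value, the plain dict is the document
print(toml.dumps(doc, encoder=toml.TomlPreserveInlineDictEncoder()))
# should render roughly as: Values = { a = 5, b = 3 }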

Partial Vertical Caching of DataFrame

I use Spark with Parquet.
I'd like to be able to cache the columns we use most often for filtering, while keeping the others on disk.
I'm running something like:
myDataFrame.select("field1").cache
myDataFrame.select("field1").count
myDataFrame.select("field1").where($"field1">5).count
myDataFrame.select("field1", "field2").where($"field1">5).count
The fourth line doesn't use the cache.
Any simple solutions that can help here?
The reason this will not cache is that whenever you do a transformation on a dataframe (e.g. select) you are actually creating a new one. What you basically did was cache a dataframe containing only field1 and a dataframe containing only field1 where it is larger than 5 (probably you meant field2 here, but it doesn't matter).
On the fourth line you are creating a third dataframe which has no lineage to the original two, just to the original dataframe.
If you generally do strong filtering (i.e. you get a very small number of elements) you can do something like this:
cachedDF = myDataFrame.select("field1", "field2", ... "fieldn").cache
cachedDF.count()
filteredDF = cachedDF.filter(some strong filter)
res = myDataFrame.join(broadcast(filteredDF), cond)
That is, cachedDF has all the fields you filter on; you filter it very strongly and then do an inner join back to the original (with cond being all the relevant selected fields or some id field), which gives you all the relevant data.
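A minimal PySpark sketch of that pattern (the path and column names are hypothetical):
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/data/events.parquet")  # hypothetical input

# Cache only the narrow projection used for filtering.
cached = df.select("id", "field1").cache()
cached.count()  # materialize the cache

# Apply the strong filter on the cached, narrow dataframe.
filtered = cached.filter(F.col("field1") > 5)

# Broadcast the small filtered result and join back to pick up the wide columns.
res = df.join(F.broadcast(filtered.select("id")), on="id")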
That said, in most cases, assuming you use a file format such as parquet, caching will not help you much.

PIG - HBASE - Casting values

I'm using PIG to process rows in an HBase table. The values in the HBase table are stored as bytearrays.
I can't figure out if I have to write a UDF that casts bytearrays to various types, or if Pig does that automatically.
I have the following script:
raw = LOAD 'hbase://TABLE' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('CF:I') AS (product_id:bytearray);
ids = FOREACH raw GENERATE (int)product_id;
dump ids;
I get a list of empty parentheses: '()'.
According to the docs, it should work. I checked the values in the HBase shell; they're all value=\x00\x00\x00\x02
How can I get this to work?
I needed to add the following option to get it to cast:
LOAD 'hbase://TABLE' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('CF:I','-caster HBaseBinaryConverter') AS (product_id:bytearray);
Thanks to this post.
If you have non-text values in the columns, you need to specify the -caster option with HBaseBinaryConverter (the default is Utf8StorageConverter) and map the columns to their respective types so that Pig will cast them properly before serializing them as text.
a = load 'hbase://TESTTABLE_1' using org.apache.pig.backend.hadoop.hbase.HBaseStorage('TESTCOLUMN_A TESTCOLUMN_B TESTCOLUMN_C ','-loadKey -caster HBaseBinaryConverter') as (rowKey:chararray,col_a:int, col_b:double, col_c:chararray);

Hive: How to have a derived column that stores the sentiment value from the sentiment analysis API

Here's the scenario:
Say you have a Hive Table that stores twitter data.
Say it has 5 columns. One column being the Text Data.
Now, how do you add a 6th column that stores the sentiment value from sentiment analysis of the tweet text? I plan to use a sentiment analysis API like Sentiment140 or viralheat.
I would appreciate any tips on how to implement the "derived" column in Hive.
Thanks.
Unfortunately, while the Hive API lets you add a new column to your table (using ALTER TABLE foo ADD COLUMNS (bar binary)), those new columns will be NULL and cannot be populated. The only way to add data to these columns is to clear the table's rows and load data from a new file, this new file having that new column's data.
To answer your question: You can't, in Hive. To do what you propose, you would have to have a file with 6 columns, the 6th already containing the sentiment analysis data. This could then be loaded into your HDFS, and queried using Hive.
EDIT: Just tried an example where I exported the table as a .csv after adding the new column (see above), and popped that into M$ Excel where I was able to perform functions on the table values. After adding functions, I just saved and uploaded the .csv, and rebuilt the table from it. Not sure if this is helpful to you specifically (since it's not likely that sentiment analysis can be done in Excel), but may be of use to anyone else just wanting to have computed columns in Hive.
References:
https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-DDLOperations
http://comments.gmane.org/gmane.comp.java.hadoop.hive.user/6665
You can do this in two steps without a separate table. Steps:
Alter the original table to add the required column
Do an "overwrite table select" of all columns + your computed column from the original table into the original table.
Caveat: This has not been tested on a clustered installation.
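A hedged sketch of those two steps, driven from Python via PyHive (the connection details, table name, and column list are hypothetical, and length(text) merely stands in for a real sentiment UDF, which you would have to provide yourself):
from pyhive import hive  # assumes a reachable HiveServer2 endpoint

conn = hive.connect(host="localhost", port=10000)  # hypothetical connection details
cur = conn.cursor()

# Step 1: add the new column to the existing table.
cur.execute("ALTER TABLE tweets ADD COLUMNS (sentiment STRING)")

# Step 2: overwrite the table with all original columns plus the computed column.
cur.execute(
    "INSERT OVERWRITE TABLE tweets "
    "SELECT user_id, created_at, lang, retweets, text, "
    "       CAST(length(text) AS STRING) "
    "FROM tweets"
)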

Reading XML-files with StAX / Kettle (Pentaho)

I'm doing an ETL process with Pentaho (Spoon/Kettle) where I'd like to read an XML file and store the element values to a DB.
This works just fine with the "Get data from XML" component, but the XML file is quite big, several gigabytes, and therefore reading the file takes too long.
Pentaho Wiki says:
The existing Get Data from XML step is easier to use but uses DOM parsers that need in-memory processing, and even the purging of parts of the file is not sufficient when these parts are very big.
The XML Input Stream (StAX) step uses a completely different approach to solve use cases with very big and complex data structures and the need for very fast data loads...
Therefore I'm now trying to do the same with StAX, but it just doesn't seem to work out as planned. I'm testing this with an XML file which only has one element group. The file is read and then mapped/inserted into a table, but now I get multiple rows in the table where all the values are "undefined" and some rows where I have the right values. In total I have 92 rows in the table, even though it should only have one row.
The flow goes like this:
1) read with StAX
2) Modified Java Script Value
3) Output to DB
At step 2) I'm doing the following:
var id;
if ( xml_data_type_description.equals("CHARACTERS") &&
     xml_path.equals("/labels/label/id") ) {
    id = xml_data_value;
}
...
I'm using positional-staz.zip from http://forums.pentaho.com/showthread.php?83480-XPath-in-Get-data-from-XML-tool&p=261230#post261230 as an example.
How do I use StAX to read an XML file and store the element values to the DB?
I've been trying to look for examples but haven't found much. The above example uses a "Filter Rows" component before inserting the rows. I don't quite understand why it's being used; can't I just map the values I need? It might be that this problem occurs because I don't use, or don't know how to use, the Filter Rows component.
Cheers!
I posted a possible StAX-based solution on the forum listed above, but I'll post the gist of it here since it is awaiting moderator approval.
Using the StAX parser, you can select just those elements that you care about, namely those with a data type of CHARACTERS. For the forum example, you basically need to denormalize the rows in sets of 4 (EXPR, EXCH, DATE, ASK). To do this you add the row number to the stream (using an Add Sequence step) then use a Calculator to determine a "bucket number" = INT((rownum-1)/4). This will give you a grouping field for a Row Denormaliser step.
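To illustrate the idea outside of Kettle, here is a plain-Python sketch of the same bucketing/denormalising logic (the sample rows are hypothetical; the field names mirror the forum example):
# Each CHARACTERS row carries an XML path and a value; every 4 consecutive
# values (EXPR, EXCH, DATE, ASK) belong to one output record.
rows = [
    ("/quotes/quote/EXPR", "EUR/USD"),
    ("/quotes/quote/EXCH", "FX"),
    ("/quotes/quote/DATE", "2013-01-01"),
    ("/quotes/quote/ASK", "1.3201"),
    ("/quotes/quote/EXPR", "GBP/USD"),
    ("/quotes/quote/EXCH", "FX"),
    ("/quotes/quote/DATE", "2013-01-01"),
    ("/quotes/quote/ASK", "1.6150"),
]

records = []
for rownum, (path, value) in enumerate(rows, start=1):
    bucket = (rownum - 1) // 4           # the Calculator step: INT((rownum-1)/4)
    field = path.rsplit("/", 1)[-1]      # EXPR, EXCH, DATE or ASK
    if bucket >= len(records):
        records.append({})               # the Row Denormaliser step: one record per bucket
    records[bucket][field] = value

print(records)
# [{'EXPR': 'EUR/USD', 'EXCH': 'FX', 'DATE': '2013-01-01', 'ASK': '1.3201'},
#  {'EXPR': 'GBP/USD', 'EXCH': 'FX', 'DATE': '2013-01-01', 'ASK': '1.6150'}]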
When the post is approved, you'll see a link to a transformation that uses StAX and the method I describe above.
Is this what you're looking for? If not please let me know where I misunderstood and maybe I can help.
