Using Parquet metatada to find specific key - parquet

I have a bunch of Parquet files containing data where each row has the form [key, data1, data2, data3,...]. I need to know in which file a certain key is located, without actually opening each file and searching. Is it possible to get this from the Parquet metadata?
The keys are formatted as strings.
I already tried accessing the metadata using PyArrow, but didn't get the data I wanted.

Short answer is no.
Longer answer: Parquet has two types of metadata that help in eliminating data, min/max statistics and optionally BloomFilters. With these two you can definitively determine if a file does not contain your key, but can't determine if 100% does (unless your key happens to be a min/max value). Pyarrow currently only really exposes row group statistics and doesn't support BloomFilter reading/writing at all.
Also, if the key is of low enough cardinality then dictionary encoding might be used to encode the column. If all data in a column is dictionary encoded, the it might be possible through some lower level APIs (likely not pyarrow) to retrieve the dictionaries and scan them instead of the entire file.
If you are in control of the writing process then sorting data based on key/limiting the number of keys per file would help make these methods even more efficient.

Related

Data storage format for unstructured data rows on HDFS

We are consuming very large data that needs to be written as fast as we receive and we are using HDFS, so we prefer using it. The data is almost unstructured, and we will be doing basic queries on them rarely. The data is flat with some fields, each row representing another data.
key1=str key2=30.3 key3=longtexthere
Another data row:
key1=3 key5=abc
SequenceFile seemed the most natural one but I could not find how to store multiple rows in a single SequenceFile.
Currently, in our temporary solution, we have multiple writers that writes to multiple text files. So when querying is needed, we read them in parallel. However, current text files contains 1000s of rows and I don't think creating a single SequenceFile for each row would be feasiable, it would incur much overhead for storing metadata and reading many too many files at once when querying.
I think the problem can be solved by using HBase or Cassandra, a columunar database but we are almost required to use HDFS. Am I missing something with SequenceFiles or we should really use a columunar database?
So sequence file format is like this:
<key, value>
<key, value>
<key, value>
...
where the key is a WritableComparable and the value is a Writable.
Now what a lot of people are doing - and you could do the same - is:
Only use the key OR the value 'column'
Implement a custom Writable which wraps a set of other Writables (call it record, row, ...)
That way you can model everything you want. That record writable could have a fixed schema, like it contains 'IntWritable, Text, IntWritable, IntWritable' (depending on you fields). Or in case you don't wanna support different types, you could use the existing ArrayWritable as your 'record'.
Knowing the schema of each file (e.g. put it into the metadata of the sequence file, will allow you to do reads on files with different/evolved schema's.
So its a lot of handcrafting, but build can very efficient and flexible structure. Never used it, but take a look at http://pangool.net/userguide/schemas.html, think they already modeled suche a flexible record/tuple schema on top of sequence files.
Bottom line, i think you can achieve what you want with sequence files.
However i would recommend to also have a look at columnar file formats like Parquet or ORC files. Those come with their own tradeoffs, but you will have a higher compression rate and selective reads (column projection, filter pushdown). Also you don't have to invent the schema/tuple structure.

Handling incremental data - Hadoop

We had 5 years of data in cluster and we are loading the data everyday. The data that gets added everyday might contain duplicate data , partially modified data etc ..
1 . How to handle duplicate data - should that be handled as part of highlevel programming interfaces pig, hive etc .. or any other alternatives.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
What is the best way to model the data, using which hadoop eco system components.
How to handle duplicate data
It's very hard to remove duplicates from HDFS raw data,
so I guess your approach is right: remove using pig or hive while loading those data.
Say if there is a usecase to find out what is changed between two records given the key to find out the row.
For this case, do you meaning that two records has the same key?
Then what kind of changes you want to capture?
When you say that, you need to remove duplicates and also the delta between two records when you know the key, you should have some criteria of which data to be removed in case of partial changed data.
In both scenarios, you can have a handle of the key and write logic to remove duplicates. Map reduce seems to be a good choice, given the parallelism, performance and ability to manage based on keys. Mostly your requirements could be handled in reducer
See if Sqoop-merge fits your use case.
From the doc:
The merge tool allows you to combine two datasets where entries in one dataset should overwrite entries of an older dataset. For example, an incremental import run in last-modified mode will generate multiple datasets in HDFS where successively newer data appears in each dataset. The merge tool will "flatten" two datasets into one, taking the newest available records for each primary key.

Sequence File of Objects into Hive

We started with a bunch of data stored in NetCDF files. From there, some Java code was written to create sequence files from the NetCDF files. We don't know much about the original intentions of the code, but we have been able to learn a little bit about the sequence files themselves. Ultimately, we are trying to create tables within Hive using these sequence files, but seem incapable of doing so at the moment.
We know that the keys and values within the sequence files are stored as objects that implements WritableComparable. We are also capable of creating Java code to iterate through all of the data in the sequence files.
So, what would be necessary to actually get Hive to read the data within the objects of these sequence files properly?
Thanks in advanced!
UPDATE: The reason it is so difficult to describe where I am having trouble exactly is because I am not necessarily getting any errors. Hive is simply just reading the sequence files incorrectly. When running the Hadoop -text command on my sequence file I get a list of objects as such:
NetCDFCompositeKey#263c7e3f , NetCDFRecordWritable#4d846db5
The data is within those objects themselves. So, currently from the help of #Tariq I believe what I have to do in order to actually read those objects is to create a custom InputFormat to read the keys and a custom SerDe to serialize and deserialize the objects?
I'm sorry, i'm not able to understand from your question where exactly you are facing the problem. If you wish to use SequenceFiles through Hive you just have to add STORED AS SEQUENCEFILE clause while issuing CREATE TABLE(most probably you already know this, nothing new). When you work on SequenceFiles Hive treats each key/value pair of the SequenceFiles similar to rows in normal files. Important thing here is that keys will be ignored. Apart from that nothing very special.
Having said that, if you wish to read both keys and values, you might have to write a custom InputFormat that can read both keys and values. See this project for example. It allows us to access data stored in a SequenceFile's key.
Also, if your keys and values are custom classes, you will require to write a SerDe as well to serialize and deserialize your data.
HTH
P.S. : I don't know if this is exactly what you were looking for. Do let me know if it is not and add some more detail to your question. I'll try addressing that.

Filter data on row key in Random Partitioner

I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
From the lots of data we have in cassandra we want to fetch data only for particular row keys but we are unable to do it due to RandomPartitioner - there is an assertion in the code.
Can anyone please guide me how should I filter data based on row key on the Cassandra level itself (I know data is distributed across regions using hash of the row key)?
Would using secondary indexes (still trying to understand how they works) solve my problem or is there some other way around it?
I want to use cassandra MR to calculate some KPI's on the data which is stored in cassandra continuously. So here fetching whole data from cassandra every time seems an overhead to me? The rowkey I'm using is like "(timestamp/60000)_otherid"; this CF contains reference of rowkeys of actual data stored in other CF. so to calculate KPI I will work for a particular minute and fetch data from other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
SlicePredicate predicate = new SlicePredicate().setSlice_range(new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
I am unfamiliar with the Cassandra Hadoop integration but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astynax, etc.) and ask how to query by row keys from that.
Querying by the row key is a very common operation in Cassandra.
Essentially if you want to still use a RandomPartitioner and want the ability to do range slices you will need to create a reverse index (a.k.a. inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your rowkeys programmatically allows you to emulate a range slice on rowkeys. To do this you must write your own InputFormat class and generate your splits manually.

Huge files in hadoop: how to store metadata?

I have a use case to upload some tera-bytes of text files as sequences files on HDFS.
These text files have several layouts ranging from 32 to 62 columns (metadata).
What would be a good way to upload these files along with their metadata:
creating a key, value class per text file layout and use it to create and upload as sequence files ?
create SequenceFile.Metadata header in each file being uploaded as sequence file individually ?
Any inputs are appreciated !
Thanks
I prefer storing meta data with the data and then designing your application to be meta data driven, as opposed to embedding meta data in the design or implementation of your application which then means updates to metadata require updates to your app. Ofcourse there are limits to how far you can take a metadata driven application.
You can embed the meta data with the data such as by using an encoding scheme like JSON or you could have the meta data along side the data such as having records in the SeqFile specifically for describing meta data perhaps using reserved tags for the keys so as to given metadata its own namespace separate from the namespace used by the keys for the actual data.
As for the recommendation of whether this should be packaged into separate Hadoop files, bare in mind that Hadoop can be instructed to split a file into Splits (input for map phase) via configuration settings. Thus even a single large SeqFile can be processed in parallel by several map tasks. The advantage of having a single hdfs file is that it more closely resembles the unit of containment of your original data.
As for the recommendation about key types (i.e. whether to use Text vs. binary), consider that the key will be compared against other values. The more compact the key, the faster the comparison. Thus if you can store a dense version of the key that would be preferable. Likewise, if you can structure the key layout so that the first bytes are typically NOT the same then it will also help performance. So, for instance, serializing a Java class as the key would not be recommended because the text stream begins with the package name of your class which is likely to be the same as every other class and thus key in the file.
If you want data and its metadata bundled together, then AVRO format is the appropriate one. It allows schema evolution also.
The simplest thing to do is to make the keys and values of the SequenceFiles Text. Pick a meaningful field from your data to make the Key, the data itself is the value as a Text. SequenceFiles are designed for storing key/value pairs, if that's not what your data is then don't use a SequenceFile. You could just upload unprocessed text files and input those to Hadoop.
For best performance, do not make each file terabytes in size. The Map stage of Hadoop runs one job per input file. You want to have more files than you have CPU cores in your Hadoop cluster. Otherwise you will have one CPU doing 1 TB of work and a lot of idle CPUs. A good file size is probably 64-128MB, but for best results you should measure this yourself.

Resources