How to store partitioned data in RC format using Pig? - hadoop

I was wondering if there is a UDF or something that can store my data in a partitioned fashion in RC format. I know there is org.apache.pig.piggybank.storage.MultiStorage, but it only supports a few compression formats. I want to store my data in RC format but using the same partitioned storage structure that MultiStorage provides.
Thanks,
imtiaz

There is no such solution available in Piggybank or any other alternative. I had faced a similar issue but dropped the implementation due to other requirements. The only option is to extend the MultiStorage UDF to provide an RC storage format.
Twitter has open-sourced its RCFile storage for Pig (part of Elephant Bird); you can use it as a starting point.
http://grepcode.com/file/repo1.maven.org/maven2/com.twitter.elephantbird/elephant-bird-rcfile/3.0.8/com/twitter/elephantbird/pig/store/RCFilePigStorage.java
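For reference, here is a hypothetical Java skeleton of what such a StoreFunc extension could look like. PartitionedRCFileStorage is a made-up class name, and the RCFile-specific serialization (the part RCFilePigStorage actually implements) is only indicated in comments; this is a sketch of the shape of the code, not a working writer.

// Hypothetical skeleton (not a working RCFile writer): the general shape of a
// custom Pig StoreFunc that partitions output the way MultiStorage does.
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.pig.StoreFunc;
import org.apache.pig.data.Tuple;

public class PartitionedRCFileStorage extends StoreFunc {

    // Index of the field whose value names the partition directory,
    // mirroring the constructor argument MultiStorage takes.
    private final int splitFieldIndex;
    private RecordWriter<Object, Object> writer;

    public PartitionedRCFileStorage(String splitFieldIndex) {
        this.splitFieldIndex = Integer.parseInt(splitFieldIndex);
    }

    @Override
    public OutputFormat getOutputFormat() throws IOException {
        // Placeholder: a real implementation would return an RCFile-capable
        // OutputFormat (e.g. the one elephant-bird's RCFilePigStorage uses).
        return new TextOutputFormat<Object, Object>();
    }

    @Override
    public void setStoreLocation(String location, Job job) throws IOException {
        FileOutputFormat.setOutputPath(job, new Path(location));
    }

    @Override
    @SuppressWarnings("unchecked")
    public void prepareToWrite(RecordWriter writer) throws IOException {
        this.writer = writer;
    }

    @Override
    public void putNext(Tuple tuple) throws IOException {
        try {
            // MultiStorage routes each tuple to a directory named after this
            // field's value; an RCFile variant would do the same before
            // serializing the remaining fields as RCFile columns.
            Object partitionKey = tuple.get(splitFieldIndex);
            writer.write(partitionKey, tuple);  // placeholder write
        } catch (InterruptedException e) {
            throw new IOException(e);
        }
    }
}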

Related

Query MinIO database without converting the files with Pandas

I would like to know if there is any option available for querying a MinIO store that holds DeltaTables in Parquet format.
Currently I am using pyarrow with pandas, but it is really slow when the data becomes larger.
I saw that PySpark can be used to query the DeltaTables but I would like to know if there are any other options.
Thanks
It depends on the scale of the data you are dealing with. For large enough datasets you could try Presto for SQL-syntax queries over the Parquet files in MinIO, using the Hive connector. Here is a how-to:
https://blog.min.io/interactive-sql-query-with-presto-on-minio-cloud-storage/
Also, when you hit a large dataset you can take advantage of the Hive partition folder naming convention (i.e. s3://bucketname/year=2019/) to reduce the amount of data that needs to be scanned; see the Hive connector docs regarding partitioning.
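If it helps, here is a rough Java sketch of running such a query through the Presto JDBC driver (requires presto-jdbc on the classpath). The coordinator host, catalog/schema, table name, and the year partition column are placeholders for whatever your Presto and Hive metastore setup defines over the MinIO bucket:

// Hedged sketch: querying Parquet data on MinIO through Presto's Hive connector over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoMinioQuery {
    public static void main(String[] args) throws Exception {
        // Presto coordinator, with a "hive" catalog pointed at the MinIO endpoint.
        String url = "jdbc:presto://presto-coordinator:8080/hive/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement()) {
            // Filtering on the partition column (the year=2019 folders in the bucket)
            // lets the Hive connector prune partitions instead of scanning everything.
            ResultSet rs = stmt.executeQuery(
                "SELECT count(*) FROM events WHERE year = 2019");
            while (rs.next()) {
                System.out.println("rows in 2019 partitions: " + rs.getLong(1));
            }
        }
    }
}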
Unrelated note: credit to this question for helping me remember the convention's name.

Do all three of Presto, Hive, and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support the Avro schema, and I am comfortable using Avro with Hive.
AvroSerDe
For example, I have found this issue against Presto:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle; Presto and Impala provide much shorter execution cycles.
So could anyone please clarify which would be better across different data formats?
Primarily, I am looking for Avro support in Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats? Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
There are different benchmarks on the internet about performance, but I would not like to link to a specific one as results heavily depend on the exact use case benchmarked.

Avro file type for images?

I am trying to figure out this case in Hadoop.
Which is the better file format, Avro or SequenceFile, for storing images in HDFS and then processing them with Python?
SequenceFiles are key-value oriented, so I think Avro files would work better?
I use SequenceFiles to store images in HDFS and it works well. Both Avro and SequenceFile are binary file formats, so they can store images efficiently. As keys in the SequenceFile I usually use the original image file names.
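A minimal sketch of that approach, assuming the standard Hadoop client API (the HDFS output path and local image directory are placeholders):

// Pack image files into a SequenceFile: key = original file name, value = raw bytes.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("hdfs:///data/images.seq");  // placeholder path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {

            File[] images = new File("/local/images").listFiles();  // placeholder dir
            if (images == null) {
                return;
            }
            for (File image : images) {
                byte[] bytes = Files.readAllBytes(image.toPath());
                // Key: original file name, value: raw image bytes.
                writer.append(new Text(image.getName()), new BytesWritable(bytes));
            }
        }
    }
}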
SequenceFiles are used in many image processing products, such as OpenIMAJ. There are existing tools for working with images in SequenceFiles, for example the OpenIMAJ SequenceFileTool.
In addition, you can take a look at HipiImageBundle. This is a special format provided by HIPI (Hadoop Image Processing Interface). In my experience, HipiImageBundle has better performance than a SequenceFile, but it can only be used by HIPI.
If you don't have a large number of files (less than 1M), you can try storing them without packaging them into one big file and use CombineFileInputFormat to speed up processing.
I have never used Avro to store images and I don't know of any project that uses it for that.

Is there a simple way to migrate from SequenceFiles to Avro?

I'm currently using Hadoop MapReduce jobs with SequenceFiles of Writables.
The same Writable types are also used for serialization in the non-Hadoop-related parts of the system.
This method is hard to maintain - mainly because of the lack of schema and the need for manual handling of version changes.
It appears that Apache Avro handles these issues.
The problem is that during the migration I will have data in both formats.
Is there a simple way to handle the migration?
I haven't tried it myself, but maybe using the AvroSequenceFile format would help. It's just a wrapper around SequenceFile, so in theory you should be able to write data in both your old SequenceFile format and your new Avro format, which should make the migration easier.
Here is more information about this format.
Generally, there is nothing stopping you from using Avro data and SequenceFiles interchangeably. Use whatever InputFormat is necessary for the type of data you have, and for output it of course makes sense to use Avro formats whenever practical. If your input comes in different formats, take a look at MultipleInputs. Essentially, you will still have to implement separate Mappers, but that is to be expected considering the map input key/value types differ.
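A rough sketch of that setup, with placeholder paths and stub mappers (the actual map logic depends on your Writables and your Avro schema):

// One job reading both the old SequenceFiles and the new Avro files via MultipleInputs.
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class MixedFormatJob {

    // Stub mapper for the legacy SequenceFiles of Writables; map() body omitted.
    public static class LegacySequenceFileMapper
            extends Mapper<Text, BytesWritable, Text, BytesWritable> {
    }

    // Stub mapper for the new Avro files; map() body omitted.
    public static class AvroRecordMapper
            extends Mapper<AvroKey<GenericRecord>, NullWritable, Text, BytesWritable> {
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "mixed SequenceFile/Avro input");
        job.setJarByClass(MixedFormatJob.class);

        // Old data, read with the SequenceFile input format and its own mapper.
        MultipleInputs.addInputPath(job, new Path("/data/legacy-seq"),
                SequenceFileInputFormat.class, LegacySequenceFileMapper.class);

        // New data, read with the Avro input format and an Avro-aware mapper.
        MultipleInputs.addInputPath(job, new Path("/data/new-avro"),
                AvroKeyInputFormat.class, AvroRecordMapper.class);

        // Both mappers must emit the same intermediate key/value types; the reducer,
        // output format and output path are configured as usual and omitted here.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}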
Moving to Avro is a wise move. If you have the capacity in time and hardware, it might even be worthwhile to explicitly convert your data from SequenceFile to Avro right away. You can use any language supported by Avro that also happens to support SequenceFiles to do this. Java certainly does, but Pig is also pretty handy for this.
The user-contributed PiggyBank project has functionality for reading a SequenceFile, and then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.
If only Pig supported loading Avro schemas from a file! If you use Pig, you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.

Is there a common place to store data schemas in Hadoop?

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files (unless using something like a SequenceFile). Each application that wants to work with those files has its own way of representing the schema of those files.
For example, I load a file into the HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:
EMP = LOAD 'myfile' using PigStorage() as (first_name: chararray, last_name: chararray, deptno: int);
Now, I know that when storing a file using PigStorage, the schema can optionally be written out alongside it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.
If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:
CREATE EXTERNAL TABLE EMP (
  first_name string,
  last_name  string,
  deptno     int)
LOCATION 'myfile';
It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.
So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?
It looks like you are looking for Apache Avro. With Avro your schema is embedded in your data, so you can read it without having to worry about schema issues and it makes schema evolution really easy.
The great thing about Avro is that it is completely integrated in Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.
For example with Pig you could do:
EMP = LOAD 'myfile.avro' using AvroStorage();
I would advise looking at the documentation for AvroStorage for more details.
You can also work with Avro in Hive as described here; I have not used that personally, but it should work the same way.
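To make the "schema travels with the data" point concrete, here is a small sketch using the Avro Java API; the EMP schema, file name, and values are just illustrative stand-ins for the example in the question:

// Write an Avro file whose schema is embedded in the file header.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEmpAvro {
    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"EMP\",\"fields\":["
          + "{\"name\":\"first_name\",\"type\":\"string\"},"
          + "{\"name\":\"last_name\",\"type\":\"string\"},"
          + "{\"name\":\"deptno\",\"type\":\"int\"}]}");

        GenericRecord emp = new GenericData.Record(schema);
        emp.put("first_name", "Ada");
        emp.put("last_name", "Lovelace");
        emp.put("deptno", 10);

        // The schema is written into the file header, so Pig's AvroStorage,
        // Hive's AvroSerDe, etc. can read it back without a separate definition.
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("myfile.avro"));
            writer.append(emp);
        }
    }
}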
What you need is HCatalog, which is:
"Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how their data is stored.
Providing interoperability across data processing tools such as Pig, MapReduce, and Hive."
You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about.
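As a rough illustration of that scenario, here is a hedged Java sketch of a MapReduce job reading an HCatalog-managed table by name. The database and table names and the output path are placeholders, and it assumes the hive-hcatalog-core dependency; note that no schema is declared anywhere in the job itself:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.mapreduce.HCatInputFormat;

public class HCatReadExample {

    // The mapper receives HCatRecords whose layout comes from the metastore's
    // schema, not from anything declared in this job.
    public static class FirstFieldMapper
            extends Mapper<WritableComparable, HCatRecord, Text, IntWritable> {
        @Override
        protected void map(WritableComparable key, HCatRecord value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(String.valueOf(value.get(0))), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(HCatReadExample.class);

        // Placeholder database/table names; HCatalog resolves the schema at run time.
        HCatInputFormat.setInput(job, "default", "emp");
        job.setInputFormatClass(HCatInputFormat.class);

        job.setMapperClass(FirstFieldMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hcat-read-out"));  // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}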
Apache Zebra seems to be a tool that could provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store, and an MR job can use its built-in TableStore to write to HDFS.
