How do I use Sqoop to save data in a parquet-avro file format? - hadoop

I need to move my data from a relational database to HDFS, but I would like to save the data in the parquet-avro file format. Looking at the Sqoop documentation, it seems my options are --as-parquetfile or --as-avrodatafile, but not a mix of both. From my understanding of this blog/picture below, parquet-avro is a Parquet file with the Avro schema embedded, plus a converter that saves an Avro object to a Parquet file and vice versa.
My initial assumption is that if I use the Sqoop option --as-parquetfile, the data saved to the Parquet file will be missing the Avro schema and the converter won't work. However, the Sqoop code that saves the data in Parquet format does seem to use an Avro-related utility, and I'm not sure what's going on. Could someone clarify? If I cannot do this with Sqoop, what other options do I have?

parquet-avro is mainly a convenience layer that lets you read and write data stored in Apache Parquet as Avro objects. When you read the Parquet file back with parquet-avro, the Avro schema is inferred from the Parquet schema (alternatively, you should be able to specify an explicit Avro schema). So you should be fine with --as-parquetfile.
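For reference, here is a minimal sketch of what such an import invocation might look like, written as a small Python helper that only assembles the command line. The JDBC URL, table name, target directory, and username are hypothetical placeholders, and the helper name build_sqoop_import is made up for illustration; the Sqoop flags themselves (--connect, --table, --target-dir, --username, -P, --as-parquetfile) are standard import arguments.

```python
import shlex


def build_sqoop_import(jdbc_url, table, target_dir, username=None):
    """Assemble the argv for a Sqoop import that writes Parquet files."""
    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--as-parquetfile",  # write Parquet instead of the default text files
    ]
    if username:
        # -P makes Sqoop prompt for the password on the console
        argv += ["--username", username, "-P"]
    return argv


if __name__ == "__main__":
    # All values below are placeholders, not taken from the question.
    cmd = build_sqoop_import(
        "jdbc:mysql://dbhost/sales", "orders", "/data/orders_parquet",
        username="etl",
    )
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell on a machine where Sqoop is installed; swapping --as-parquetfile for --as-avrodatafile would produce Avro container files instead.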

Related

Can ParquetWriter or AvroParquetWriter store the schema separately?

Can ParquetWriter or AvroParquetWriter store the schema separately, without the data?
Currently the schema is written into the Parquet file:
ParquetWriter<GenericRecord> writer = AvroParquetWriter.<GenericRecord>builder(new Path(file.getName()))
    .withSchema(payload.getSchema())
    .build();
Is it possible to write only the data, without the schema, into the Parquet file?
Thank you!
#ЭльфияВалиева. No. The schema is part of the Parquet metadata stored in the file footer, and Parquet readers need it in order to decode the data.
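To illustrate where that footer lives, here is a rough sketch that locates the raw metadata bytes in a Parquet file using only the Python standard library. It relies on the documented file layout (a PAR1 magic at both ends, with a 4-byte little-endian metadata length just before the trailing magic). The function name footer_metadata is made up for this example, and it does not decode the metadata, which is Thrift-encoded.

```python
import struct

MAGIC = b"PAR1"


def footer_metadata(data: bytes) -> bytes:
    """Return the raw, still Thrift-encoded file metadata of a Parquet file."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("not a Parquet file (missing PAR1 magic)")
    # The 4 bytes before the trailing magic hold the metadata length.
    (meta_len,) = struct.unpack("<I", data[-8:-4])
    start = len(data) - 8 - meta_len
    if start < 4:
        raise ValueError("footer length exceeds file size")
    return data[start:len(data) - 8]
```

Turning those bytes into a schema requires a Thrift compact-protocol reader, which is what the Parquet libraries do internally. The point of the sketch is only that the schema travels inside the same file as the data.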

How can I process Avro Container data with different versions of schema?

I have months' worth of data from a single domain stored in HDFS in Avro Container files. Each file has the schema for all the data in that file, of course. How do I process all the data using Hive or Pig? It seems both Hive and Pig need the avsc file or some form of table structure definition up front. That is, even if I use Avro tools to extract the avsc from each file, I will have to load each dataset using a different avsc file, and I cannot process all of them using one job or one DDL + query.
Isn't it possible for Hive and Pig to pull the avsc at runtime based on the Avro Container file being processed? Is this already implemented and I'm just not finding it, or is it too difficult to implement?
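As a rough illustration of what "pulling the avsc at runtime" involves, the sketch below reads the writer schema and codec out of an Avro container file header using only the Python standard library. It follows the container layout in the Avro specification (magic bytes Obj\x01, then a metadata map with the avro.schema and avro.codec keys, then a 16-byte sync marker). The function names are made up for this example, and it reads the header only, not the records.

```python
import io
import json

AVRO_MAGIC = b"Obj\x01"


def read_long(stream):
    """Decode one Avro long (variable-length, zigzag encoded)."""
    shift, result = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated Avro header")
        result |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then the payload."""
    return stream.read(read_long(stream))


def read_header(stream):
    """Return (schema, codec) from an Avro container file header."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:
            break  # a zero count terminates the metadata map
        if count < 0:
            read_long(stream)  # byte size of the block, not needed here
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    schema = json.loads(meta["avro.schema"].decode("utf-8"))
    codec = meta.get("avro.codec", b"null").decode("utf-8")
    return schema, codec
```

Calling read_header(open(path, "rb")) on each file would yield the per-file writer schemas, which could then be compared or written out as avsc files. Whether Hive or Pig can do the equivalent on their own is the open part of the question.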

Bigdata Live data streaming using flume

I am trying to analyze Twitter data using Flume. I got the files from Twitter using Flume in BigInsights, but the data I received is in a compressed Avro format, which is not readable.
Can anyone tell me how to convert that file to (readable) JSON so that I can do some analysis on it? Or is there a way to receive the data already in JSON format?
Thanks in advance.
This is the data I received:
The Avro format is not designed to be human readable; it is designed to be consumed by programs. But you have a few options to view this data or, even better, analyze it.
Create Hive Table: This option will allow you to analyze data using SQL queries, Spark SQL, Spark notebooks, visualization tools like Tableau and Excel too.
Your table creation script will look like this:
CREATE TABLE twitter_data
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{...
In the schema literal, you can define your own schema too.
Write Program: If you are a developer and would like to wrangle the data programmatically, you have many languages to choose from to read, parse, and convert Avro files to JSON.

Adding parquet-avro support to scalding

How can I create a Scalding Source that will handle conversions between Avro and Parquet?
The solution should:
1. Read from parquet format and convert to avro memory representation
2. Write avro objects into a parquet file
Note: I noticed Cascading has a module for leveraging thrift and parquet. It occurs to me that this would be a good place to start looking. I also opened a thread on google-groups/scalding-dev
Try our latest changes in this fork -
https://github.com/epishkin/scalding/tree/parquet_avro/scalding-parquet

Is there a way to access avro data stored in hbase using hive to do analysis

My HBase table has rows that contain both serialized Avro (put there using havrobase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS for data analysis, but I was wondering if anyone has tried to map Hive to HBase table(s) that contain Avro data. Basically, I need to be able to query both Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I need to be able to do this as a batch job as well. I don't want to write a Java MapReduce job for this because our configurations change constantly and we need a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase co-processor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that in Intel's panthera-dot.
