Do all three of Presto, Hive and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support Avro schemas, and I am comfortable using Avro with Hive:
AvroSerDe
That said, I have found this issue against Presto:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle, and Presto and Impala provide much shorter execution cycles.
So, could anyone please clarify which would be better across different data formats?
Primarily, I am looking for Avro support in Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats? Please suggest.

Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive Connector supports Avro as well. Thanks to David Phillips for pointing out this documentation page.
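As a minimal sketch (the table and column names below are made up, not from the question), all three engines can work against the same Avro-backed table once it is registered in the Hive metastore:
-- Hive (0.14+): declare an external table over Avro files on HDFS
CREATE EXTERNAL TABLE events (id BIGINT, payload STRING)
STORED AS AVRO
LOCATION '/data/events';

-- Hive and Impala can then query it directly...
SELECT count(*) FROM events;

-- ...and Presto should reach the same table through the Hive connector,
-- e.g. SELECT count(*) FROM hive.default.events;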
There are different benchmarks on the internet about performance, but I would not like to link to a specific one as results heavily depend on the exact use case benchmarked.

Related

Query Parquet data through Vertica (Vertica Hadoop Integration)

So I have a Hadoop cluster with three nodes, and Vertica is co-located on the cluster. There are Parquet files (partitioned by Hive) on HDFS. My goal is to query those files using Vertica.
Right now what I did is use the HDFS Connector: basically, I created an external table in Vertica and then linked it to HDFS:
CREATE EXTERNAL TABLE tableName (columns)
AS COPY FROM 'hdfs://hostname/...../data' PARQUET;
Since the data size is big, this method does not achieve good performance.
I have done some research: Vertica Hadoop Integration
I have tried HCatalog, but there's some configuration error on my Hadoop, so that's not working.
My use case is to keep the data format on HDFS unchanged (Parquet) while querying it with Vertica. Any ideas on how to do that?
EDIT: The only reason Vertica got slow performance is that it can't use Parquet's partitions. With a higher Vertica version (8+), it can utilize Hive's metadata now, so no HCatalog is needed.
Terminology note: you're not using the HDFS Connector. Which is good, as it's deprecated as of 8.0.1. You're using the direct interface described in Reading Hadoop Native File Formats, with libhdfs++ (the hdfs scheme) rather than WebHDFS (the webhdfs scheme). That's all good so far. (You can also use the HCatalog Connector, but you need to do some additional configuration and it will not be faster than an external table.)
Your Hadoop cluster has only 3 nodes and Vertica is co-located on them, so you should be getting the benefits of node locality automatically -- Vertica will use the nodes that have the data locally when planning queries.
You can improve query performance by partitioning and sorting the data so Vertica can use predicate pushdown, and also by compressing the Parquet files. You said you don't want to change the data so maybe these suggestions don't work for you; they're not specific to Vertica so they might be worth considering anyway. (If you're using other tools to interact with your Parquet data, they'll benefit from these changes too.) The documentation of these techniques was improved in 8.0.x (link is to 8.1 but this was in 8.0.x too).
Additional partitioning support was added in 8.0.1. It looks like you're using at least 8.0; I can't tell if you're using 8.0.1. If you are, you can create the external table to only pay attention to the partitions you care about with something like:
CREATE EXTERNAL TABLE t (id int, name varchar(50),
created date, region varchar(50))
AS COPY FROM 'hdfs:///path/*/*/*'
PARQUET(hive_partition_cols='created,region');
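If the table is defined like that, a query that filters on the partition columns should let Vertica skip the partitions it does not need. A hypothetical query against the table t sketched above (the literal values are made up):
-- Predicates on the hive_partition_cols (created, region) allow partition pruning
SELECT id, name
FROM t
WHERE created = '2016-11-01' AND region = 'us-east';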

Big data live data streaming using Flume

I am trying to analyze Twitter data using Flume.
I got the files from Twitter using Flume in BigInsights, but the data I received uses a compressed Avro schema, which is not readable.
Can anyone tell me a way to convert that file to JSON (readable) so I can do some analysis on it?
Or is there any way so that the data I receive is already in JSON (readable) format?
Thanks in advance.
The Avro format is not designed to be human-readable; it is designed to be consumed by programs. But you have a few options to view this data, or even better, analyze it.
Create Hive Table: This option will allow you to analyze the data using SQL queries, Spark SQL, Spark notebooks, and visualization tools like Tableau and Excel too.
Your table creation script will look like this:
CREATE TABLE twitter_data
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{...
In schema literal, you can define your own schema too.
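For illustration only, a complete statement with an inline schema literal might look like the sketch below; the record and field names are hypothetical, not the actual schema Flume writes for Twitter data.
-- Hypothetical schema: replace the fields with the schema Flume actually wrote
CREATE TABLE twitter_data_sample
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "Tweet",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "user_name", "type": "string"},
    {"name": "text", "type": "string"}
  ]
}');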
Write Program: If you are a developer and want to wrangle the data programmatically, you have many languages to choose from to read, parse, and convert the Avro file to JSON.

ExecuteSQL processor in NiFi returns data in Avro format

I just started working with Apache NiFi. I am trying to fetch data from Oracle and place it in HDFS, then build an external Hive table on top of it. The problem is that the ExecuteSQL processor returns data in Avro format. Is there any way I can get this data in a readable format?
Apache NiFi also has a 'ConvertAvroToJSON' processor. That might help you get it into a readable format. We also really need to just knock out the ability for our content viewer to nicely render Avro data, which would help as well.
Thanks
joe

How to get data from HDFS? Hive?

I am new to Hadoop. I ran a MapReduce job on my data and now I want to query the output so I can put it into my website. Is Apache Hive the best way to do that? I would greatly appreciate any help.
Keep in mind that Hive is a batch processing system, which under the hood converts the SQL statements to a bunch of MapReduce jobs with stage builds in between. Also, Hive is a high-latency system, i.e. depending on your dataset size you are looking at minutes to hours, or even days, to process a complicated query.
So, if you want to serve the results from your MapReduce job output on your website, it is highly recommended that you export the results back to an RDBMS using Sqoop and then take it from there.
Or, if the data itself is huge and cannot be exported back to an RDBMS, another option you could think of is using a NoSQL system like HBase.
Welcome to Hadoop!
I highly recommend you watch Cloudera Essentials for Apache Hadoop | Chapter 5: The Hadoop Ecosystem and familiarize yourself with the different ways to transfer data into and out of your HDFS cluster. The video is easy to watch and describes the advantages and disadvantages of each tool, but this outline should give you the basics of the Hadoop ecosystem:
Flume - Data integration and import of flat files into HDFS. Designed for asynchronous data streams (e.g., log files). Distributed, scalable, and extensible. Supports various endpoints. Allows preprocessing on data before loading to HDFS.
Sqoop - Bidirectional transfer of structured data (RDBMS) and HDFS. Permits incremental import to HDFS. RDBMS must support JDBC or ODBC.
Hive - SQL-like interface to Hadoop. Requires table structure. JDBC and/or ODBC is required.
HBase - Allows interactive access to HDFS. Sits on top of HDFS and applies structure to the data. Allows random reads and scales horizontally with the cluster. Not a full query language; only permits get/put/scan operations (can be used with Hive and/or Impala). Indexes only on the row key. Does not use the MapReduce paradigm.
Impala - Similar to Hive, a high-performance SQL engine for querying vast amounts of data stored in HDFS. Does not use MapReduce. Good alternative to Hive (a small query sketch follows below).
Pig - Data flow language for transforming large datasets. Permits schema optionally defined at runtime. PigServer (Java API) permits programmatic access.
Note: I assume the data you are trying to read already exists in HDFS. However, some of the products in the Hadoop ecosystem may be useful for your application or as a general reference, so I included them.
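As a small sketch for the Impala option (the table and column names here are hypothetical, not from the question): if your MapReduce output is already exposed as a Hive table, Impala can query it through the shared metastore without any extra code.
-- Run from impala-shell; "wordcount", "word" and "cnt" are made-up names
INVALIDATE METADATA wordcount;   -- let Impala pick up a table created outside Impala
SELECT word, cnt FROM wordcount ORDER BY cnt DESC LIMIT 10;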
If you're only looking to get data from HDFS then yes, you can do so via Hive.
However, you'll benefit from it most if your data is already organized (for instance, in columns).
Let's take an example: your MapReduce job produced a CSV file named wordcount.csv containing two columns, word and count. This CSV file is on HDFS.
Let's now suppose you want to know the occurrence of the word "gloubiboulga". You can simply achieve this via the following code:
CREATE TABLE data
(
  word STRING,
  `count` INT    -- backticks keep the column name unambiguous, since count is also a function name
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
LOAD DATA INPATH '/wordcount.csv'    -- no LOCAL keyword: the file is already on HDFS
OVERWRITE INTO TABLE data;
SELECT word, `count` FROM data WHERE word = "gloubiboulga";
Please note that while this language looks a lot like SQL, you'll still have to learn a few things about it.

Is there a simple way to migrate from SequenceFiles to Avro?

I'm currently using Hadoop MapReduce jobs with SequenceFiles of Writables.
The same Writable types are also used for serialization in the non-Hadoop-related parts of the system.
This method is hard to maintain, mainly because of the lack of a schema and the need for manual handling of version changes.
It appears that Apache Avro handles these issues.
The problem is that during the migration I will have data in both formats.
Is there a simple way to handle the migration?
I haven't tried it myself, but maybe using the AvroSequenceFile format would help. It's just a wrapper around SequenceFile, so in theory you should be able to write data in both your old SequenceFile format and your new Avro format, which should make the migration easier.
Here is more information about this format.
Generally, there is nothing stopping you from using Avro data and SequenceFiles interchangeably. Use whatever InputFormat is necessary for the type of data you need, and for output it of course makes sense to use Avro formats whenever practical. If your input comes in different formats, take a look at MultipleInputs. Essentially, you will still have to implement separate Mappers, but that's to be expected considering the Map input key/value types are different.
Moving to Avro is a wise move. If you have the capacity in time and hardware, it might even be worthwhile to explicitly convert your data from SequenceFile to Avro right away. You can use any language supported by Avro that also happens to support SequenceFiles to do this. Java certainly does, but Pig is also pretty handy for this.
The user-contributed PiggyBank project has functionality for reading a SequenceFile; then it is simply a matter of using AvroStorage from the same PiggyBank project with the appropriate Avro schema to get your Avro file.
If only Pig supported loading Avro schemas from a file! If you use Pig, you will unfortunately have to write scripts that explicitly contain the Avro schema, which can be a bit annoying.
