ORCfile storage implementation in Pig - hadoop

Does anybody know how to use ORC files for input/output in Pig?
I found some support for RCFiles in elephant-bird, but the ORC format does not seem to be supported.
Could you please provide a sample of using Pig to load and store ORC files?

Support for ORC storage through Pig is not yet committed and is under active development; refer to Apache JIRA PIG-3558. Once it lands, you will be able to access ORC files from your Pig script like this:
A = load 'foo.orc' using OrcStorage();
...
store A into 'bar.orc' using OrcStorage('-c SNAPPY');

Define an HCatalog table stored as ORC using the HCat CLI, then LOAD the relation in Pig using org.apache.hcatalog.pig.HCatLoader() or STORE it using org.apache.hcatalog.pig.HCatStorer(). A sketch of this approach follows.
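A minimal sketch, assuming a hypothetical ORC-backed table named clicks_orc and Pig started with the HCatalog jars on the classpath (pig -useHCatalog); table and column names are illustrative:
-- HCat/Hive CLI: define the target table, backed by ORC
CREATE TABLE clicks_orc (user_id STRING, url STRING) STORED AS ORC;
-- Pig script (pig -useHCatalog): read an existing HCatalog table and write into the ORC-backed one
A = LOAD 'clicks_text' USING org.apache.hcatalog.pig.HCatLoader();
STORE A INTO 'clicks_orc' USING org.apache.hcatalog.pig.HCatStorer();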

Related

Do all three of Presto, Hive and Impala support the Avro data format?

I am clear about the SerDe available in Hive to support the Avro data format (AvroSerDe) and am comfortable using Avro with Hive.
For Presto, however, I have found this issue:
https://github.com/prestodb/presto/issues/5009
I need to choose components for a fast execution cycle, and Presto and Impala provide a much shorter execution cycle.
So could anyone please clarify which would be better across different data formats? Primarily, I am looking for Avro support in Presto right now.
However, let's consider the following data formats stored on HDFS:
Avro format
Parquet format
ORC format
Which is the best to use for high performance across these data formats? Please suggest.
Impala can read Avro data but cannot write it. Please refer to this documentation page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive connector supports Avro as well; thanks to David Phillips for pointing out this documentation page.
There are various benchmarks on the internet about performance, but I would not like to link to a specific one, as the results depend heavily on the exact use case being benchmarked.
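For reference, a minimal Hive-side sketch of an Avro-backed table that Hive can write to and that Impala or Presto (via the Hive connector) can then read; the table and column names are illustrative:
-- Hive 0.14+ shorthand for the Avro SerDe and input/output formats
CREATE TABLE events_avro (event_id STRING, event_ts BIGINT)
STORED AS AVRO;
-- populate it from an existing (hypothetical) text-backed table
INSERT INTO TABLE events_avro SELECT event_id, event_ts FROM events_text;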

Read data from Hadoop HDFS with SparkSQL connector to visualize it in Superset?

On an Ubuntu server I set up Divolte Collector to gather clickstream data from websites. The data is stored in Hadoop HDFS as Avro files.
(http://divolte.io/)
I would then like to visualize the data with Airbnb Superset, which has several connectors to common databases (thanks to SQLAlchemy) but none to HDFS.
Superset does, in particular, have a connector to SparkSQL via Hive JDBC (http://airbnb.io/superset/installation.html#database-dependencies).
So is it possible to use it to retrieve the HDFS clickstream data? Thanks
In order to read HDFS data in SparkSQL there are two major ways, depending on your setup:
Read the table as it was defined in Hive, using a remote metastore (probably not your case).
Use SparkSQL directly: by default (if not configured otherwise) it creates an embedded Hive metastore, which allows you to issue DDL and DML statements using Hive syntax.
You need an external package for Avro to work: com.databricks:spark-avro.
CREATE TEMPORARY TABLE divolte_data
USING com.databricks.spark.avro
OPTIONS (path "path/to/divolte/avro");
Now the data should be available inside the table divolte_data.
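A hedged usage sketch, assuming the Spark SQL shell is started with the third-party Avro data source on the classpath (the package version below is illustrative):
-- start the shell with: spark-sql --packages com.databricks:spark-avro_2.11:4.0.0
-- then, after defining the temporary table above, query it:
SELECT COUNT(*) FROM divolte_data;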

Query local parquet using presto

Using Spark and Drill, I am able to query local Parquet files.
Does Presto provide the same capability?
In other words, is it possible to query local Parquet files using Presto, without going through HDFS or Hive?
I did not find a straightforward way to do this. It has been a long time now, and I am not sure whether other options are available at the moment.
What I did was create a custom Hive metastore that returned the schemas and tables with the paths of my Parquet files. I then configured Presto to use that metastore, and it worked pretty well.
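For illustration, the Presto side of that setup is just a Hive catalog file pointing at the (custom) metastore; the connector name and metastore host/port below are assumptions about a typical deployment:
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://localhost:9083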
From my understanding, Presto's localfile connector is only for HTTP request logs (which is why it has settings such as presto-logs.http-request-log.location). I wasn't able to query local Parquet data with Presto.
I was able to query the data using Apache Drill. Out of the box, you can swap the directory below for a path on your local file system and run regular SQL on it:
# Start with /bin/drill-embedded
0: jdbc:drill:zk=local> select * from dfs.`/somedir/withparquetfiles/`;

Is there a way to access avro data stored in hbase using hive to do analysis

My HBase table has rows that contain both serialized Avro (put there using havrobase) and string data. I know that a Hive table can be mapped to Avro data stored in HDFS for analysis, but I was wondering if anyone has tried to map Hive to HBase table(s) that contain Avro data. Basically I need to be able to query both the Avro and non-Avro data stored in HBase, do some analysis, and store the result in a different HBase table. I also need the capability to run this as a batch job. I don't want to write a Java MapReduce job for this because our configurations change constantly and we need a scripted approach. Any suggestions? Thanks in advance!
You can write an HBase coprocessor to expose the Avro record as regular HBase qualifiers. You can see an implementation of that in Intel's panthera-dot.

Mahout Hive Integration

I want to combine Hadoop-based Mahout recommenders with Apache Hive, so that my generated recommendations are stored directly in my Hive tables. Does anyone know of any tutorials for this?
Hadoop-based Mahout recommenders can store their results directly in HDFS.
Hive also allows you to create a table schema on top of any data using CREATE EXTERNAL TABLE recommend_table, which also specifies the location of the data (LOCATION '/home/admin/userdata';). A sketch follows below.
This way, whenever new data is written to that location (/home/admin/userdata), it is already available to Hive and can be queried through the existing table schema recommend_table.
I blogged about this some time back: external-tables-in-hive-are-handy. This approach works for the output of any map-reduce program that needs to be immediately available for Hive ad-hoc queries.
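A minimal sketch of such an external table, assuming the recommender writes tab-separated (user, item, score) records to /home/admin/userdata; the column schema and delimiter are assumptions:
CREATE EXTERNAL TABLE recommend_table (
  user_id BIGINT,
  item_id BIGINT,
  score   DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/home/admin/userdata';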
