Why ORC can support acid in hive? - hadoop

I have read the documentation that says
"Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC."
But my question is that why only ORC, why it can't be done for parquet for instance, does ORC have something special that makes it compatible for ACID transactions?

As per cwiki,
Only ORC file format is supported in this first release. The feature has been built such that transactions can be used by any storage format that can determine how updates or deletes apply to base records (basically, that has an explicit or implicit row id), but so far the integration work has only been done for ORC.
ORC doesnt have anything special except columnar storage, zipped file format - which makes it fast over text files. Theoretically doable.
This feature came in 2014 for version 0.13. After that, they have improved the functionality but not for other file formats. Which means, there may not be enough demand or complex for other file format or they have to rewrite everything for other file formats.
You can refer to this link for more details on the transactional feature.
https://cwiki.apache.org/confluence/display/hive/hive+transactions

Related

Does all of three: Presto, hive and impala support Avro data format?

I am clear about the Serde available in Hive to support Avro schema for data formats. Comfortable in using avro with hive.
AvroSerDe
for say, I have found this issue against presto.
https://github.com/prestodb/presto/issues/5009
I need to choose components for fast execution cycle. Presto and impala provide much smaller execution cycle.
So, Anyone please let me clarify that which would be better in different data formats.
Primarily, I am looking for avro support with Presto now.
However, lets consider following data formats stored on HDFS:
Avro format
Parquet format
Orc format
Which is the best to use with high performance on different data formats.
?? please suggest.
Impala can read Avro data but can not write it. Please refer to this documentaion page describing the file formats supported by Impala.
Hive supports both reading and writing Avro files.
Presto's Hive Connector supports Avro as well. Thanks to David Phillips for pointing out this documentaion page.
There are different benchmarks on the internet about performance, but I would not like to link to a specific one as results heavily depend on the exact use case benchmarked.

Apache Solr support for ORC file format

I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing came out.
Looks like you want SolR to read data from a specific Hive file format.
You might look at the problem the other way i.e. use Hive to write data to SolR -- and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, AVRO, whatever -- even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
I'll accept Samson's answer.
Anyway, I'm not fully satisfied about this solution. In fact, now I still need to create an external table manually declaring all fields in the original table. In terms of operations, it is not different from creating a new table (stored ad textfile) starting from the original one, indexing the new text files and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Being ORC a self-describing format, it would be great for Solr to read both field names and data directly from the compressed files.

appending to ORC file

I'm new to Big data and related technologies, so I'm unsure if we can append data to the existing ORC file. I'm writing the ORC file using Java API and when I close the Writer, I'm unable to open the file again to write new content to it, basically to append new data.
Is there a way I can append data to the existing ORC file, either using Java Api or Hive or any other means?
One more clarification, when saving Java util.Date object into ORC file, ORC type is stored as:
struct<timestamp:struct<fasttime:bigint,cdate:struct<cachedyear:int,cachedfixeddatejan1:bigint,cachedfixeddatenextjan1:bigint>>,
and for java BigDecimal it's:
<margin:struct<intval:struct<signum:int,mag:struct<>,bitcount:int,bitlength:int,lowestsetbit:int,firstnonzerointnum:int>
Are these correct and is there any info on this?
Update 2017
Yes now you can! Hive provides a new support for ACID, but you can append data to your table using Append Mode mode("append") with Spark
Below an example
Seq((10, 20)).toDF("a", "b").write.mode("overwrite").saveAsTable("tab1")
Seq((20, 30)).toDF("a", "b").write.mode("append").saveAsTable("tab1")
sql("select * from tab1").show
Or a more complete exmple with ORC here; below an extract:
val command = spark.read.format("jdbc").option("url" .... ).load()
command.write.mode("append").format("orc").option("orc.compression","gzip").save("command.orc")
No, you cannot append directly to an ORC file. Nor to a Parquet file. Nor to any columnar format with a complex internal structure with metadata interleaved with data.
Quoting the official "Apache Parquet" site...
Metadata is written after the data to allow for single pass writing.
Then quoting the official "Apache ORC" site...
Since HDFS does not support changing the data in a file after it is
written, ORC stores the top level index at the end of the file (...)
The file’s tail consists of 3 parts; the file metadata, file footer
and postscript.
Well, technically, nowadays you can append to an HDFS file; you can even truncate it. But these tricks are only useful for some edge cases (e.g. Flume feeding messages into an HDFS "log file", micro-batch-wise, with fflush from time to time).
For Hive transaction support they use a different trick: creating a new ORC file on each transaction (i.e. micro-batch) with periodic compaction jobs running in the background, à la HBase.
Yes this is possible through Hive in which you can basically 'concatenate' newer data. From hive official documentation https://cwiki.apache.org/confluence/display/Hive/Hive+Transactions#HiveTransactions-WhatisACIDandwhyshouldyouuseit?

How granular is access control on HDFS for unstructured data?

I am looking for any piece of technical paper explaining how access control is conducted on unstructured data ingested by HDFS.
Can the granularity level be smaller than POSIX-ish file permissions?
Similarly, how would products like RecordService (from Cloudera), which provide an abstraction layer for security on storage components, work on unstructured data?
For instance, if I have a very big emails archive file (more than a terabyte), would I be able to specify a more fine-grained ACL than one on the entire file itself? I am thinking about email headers, etc.
The granularity supported is to the row and column levels. See details.
Presently, for RecordService to work, your data must be organized as Hive Metastore tables. In the future, RecordService may infer structure/schema from the files themselves (but, not the case today).

Modeling Data in Hadoop

Currently I am bringing into Hadoop around 10 tables from an EDW (Enterprise Data Warehouse), these tables are closely related to a Star Schema model. I'm usig Sqoop to bring all these tables across, resulting in 10 directories containing csv files.
I'm looking at what are some better ways to store these files before striking off MR jobs. Should I follow some kind of model or build an aggregate before working on MR jobs? I'm basically looking at how might be some ways of storing related data together.
Most things I have found by searching are storing trivial csv files and reading them with opencsv. I'm looking for something a bit more involved and not just for csv files. If moving towards another format works better than csv, then that is no problem.
Boils down to: How best to store a bunch of related data in HDFS to have a good experience with MR.
I suggest spending some time with Apache Avro.
With Sqoop v1.3 and beyond you can import data from your relational data sources as Avro files using a schema of your own design. What's nice about Avro is that it provides a lot of features in addition to being a serialization format...
It gives you data+schema in the same file but is compact and efficient for fast serialization. It gives you versioning facilities which are useful when bringing in updated data with a different schema. Hive supports it in both reading and writing and Map Reduce can use it seamlessly.
It can be used as a generic interchange format between applications (not just for Hadoop) making it an interesting option for a standard, cross-platform format for data exchange in your broader architecture.
Storing these files in csv is fine. Since you will be able to process these files using text output format and could also read it through hive using specific delimiter. You could change the delimiter if you do not like comma to pipe("|") that's what I do most of the time. Also you generally need to have large files in hadoop but if its large enough that you can partition these files and each file partition is in the size of few 100 gigs then it would be a good to partition these files into separate directory based on your partition column.
Also it would be better idea to have most of the columns in single table than having many normalized small tables. But that varies depending on your data size. Also make sure whenever you copy , move or create data you do all the constraint check on your applications as it will be difficult to make small changes in the table later on, you will need to modify the complete file for even small change.
Hive Partitioning and Bucketing concepts can be used to effectively used to put similar data together (not in nodes, but in files and folders) based on a particular column. Here are some nice tutorials for Partitioning and Bucketing.

Resources