I am trying to load a dataset stored on HDFS (text file) into Hive for analysis.
I am using CREATE EXTERNAL TABLE as follows:
CREATE EXTERNAL TABLE myTable (field1 STRING, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/myusername/datasetlocation';
This works fine, but it requires write access to the HDFS location. Why is that?
In general, what is the right way to load text data to which I do not have write access? Is there a 'read-only' external table type?
Edit: I noticed this issue filed against Hive regarding the question. It does not seem to have been resolved.
Partially answering my own question:
Indeed, it seems not to have been resolved in Hive at this point. But here is an interesting fact: Hive does not require write access to the files themselves, only to the folder. For example, the folder could have permissions 777 while the files within it, which Hive reads, stay read-only, e.g. 644.
I don't have a solution to this, but as a workaround I've discovered that
CREATE TEMPORARY EXTERNAL TABLE
works without write permissions, the difference being that the table (but not the underlying data) will disappear at the end of your session.
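For reference, a minimal sketch of that workaround, mirroring the DDL from the question with a single placeholder column (whether TEMPORARY EXTERNAL is accepted depends on your Hive version):
CREATE TEMPORARY EXTERNAL TABLE myTable (field1 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/myusername/datasetlocation';
-- the table definition vanishes when the session ends; the files at LOCATION are left untouched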
If you require write access to HDFS files, run:
hadoop fs -chmod 777 /folder_name
This grants all access permissions on that particular folder to every user.
Hive has two kinds of tables, Managed and External; for the difference, you can check Managed vs. External Tables.
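To illustrate the difference with a hedged sketch (table and column names are placeholders; the key point is what DROP TABLE removes):
CREATE TABLE managed_t (field1 STRING);
-- managed: data is stored under the Hive warehouse directory, and DROP TABLE deletes both data and metadata

CREATE EXTERNAL TABLE external_t (field1 STRING)
LOCATION '/user/myusername/datasetlocation';
-- external: data stays where it already is, and DROP TABLE removes only the table metadata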
Currently, to move an external database from HDFS to Alluxio, I need to modify the external table's location to alluxio://.
The statement is something like: alter table catalog_page set location "alluxio://node1:19998/user/root/tpcds/1000/catalog_returns"
According to my understanding, it should be a simple metastore modification; however, for some tables the modification takes dozens of minutes. The database itself contains about 1 TB of data, by the way.
Is there any way for me to accelerate the table alter process? If not, why is it so slow? Any comment is welcome, thanks.
I found a suggested way, which is the metatool utility under $HIVE_HOME/bin.
metatool -updateLocation <new-loc> <old-loc>
Update the FS root location in the metastore to the new location. Both new-loc and old-loc should be valid URIs with valid host names and schemes. When run with the dryRun option, changes are displayed but are not persisted. When run with the serdepropKey/tablePropKey option, updateLocation looks for the serde-prop-key/table-prop-key that is specified and updates its value if found.
By using this tool, the location modification is very fast (maybe a few seconds).
Leaving this here for anyone who might run into the same situation.
I use Flink 1.6. I know I can use a custom sink with Hive JDBC to write to Hive, or use JDBCAppendTableSink, but that still goes through JDBC. The problem is that Hive JDBC does not support the batch execute method, so I think it will be very slow.
Then I looked for another way: I write a DataSet to HDFS with the writeAsText method, then create a Hive table from the HDFS files. But there is still a problem: how to append incremental data.
The API of FileSystem.WriteMode is:
NO_OVERWRITE - Creates the target file only if no file exists at that path already.
OVERWRITE - Creates a new target file regardless of any existing files or directories.
For example, in the first batch I write September's data to Hive; then I get October's data and want to append it.
But if I use OVERWRITE on the same HDFS file, September's data will no longer exist; if I use NO_OVERWRITE, I must write to a new HDFS file and then a new Hive table, yet we need them in the same Hive table. And I do not know how to combine two HDFS files into one Hive table.
So how can I write incremental data to Hive using Flink?
As you already wrote, there is no Hive sink. I guess the default pattern is to write (text, Avro, Parquet) files to HDFS and define an external Hive table on that directory. There it doesn't matter whether you have a single file or multiple files. But you most likely have to repair this table on a regular basis (msck repair table <db_name>.<table_name>;). This will update the metadata and the new files will be available.
For bigger amounts of data I would recommend partitioning the table and adding the partitions on demand (this blog post might give you a hint: https://resources.zaloni.com/blog/partitioning-in-hive).
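For example, a hedged HiveQL sketch of that pattern (database, table, column, and path names are placeholders; it assumes Flink writes each month's files into a month=YYYY-MM subdirectory under the table location):
CREATE EXTERNAL TABLE mydb.events (field1 STRING, field2 STRING)
PARTITIONED BY (month STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/user/myusername/flink-output';

-- after the October files land under .../month=2018-10/
MSCK REPAIR TABLE mydb.events;
-- or add the partition explicitly:
-- ALTER TABLE mydb.events ADD PARTITION (month='2018-10');
This way September and October end up as two partitions of the same table, and each new batch only needs a new subdirectory plus a repair or ADD PARTITION call.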
I am trying to create Hive tables as outputs of my Spark (version 1.5.1) job on a Hadoop cluster (BigInsight 4.1 distribution) and am facing permission issues. My guess is that Spark is using a default user (in this case 'yarn' and not the job submitter's username) to create the tables and therefore fails to do so.
I tried to customize the hive-site.xml file to set an authenticated user that has permissions to create Hive tables, but that didn't work.
I also tried to set the Hadoop user variable to an authenticated user, but it didn't work either.
I want to avoid saving text files and then creating Hive tables on top of them, in order to optimize performance and reduce the size of the outputs through ORC compression.
My questions are:
Is there any way to call the write function of the Spark DataFrame API with a specified user?
Is it possible to choose a username using Oozie's workflow file?
Does anyone have an alternative idea or has ever faced this problem?
Thanks.
Hatak!
Assuming df holds your data, you can write:
In Java:
df.write().saveAsTable("tableName");
You can use a different SaveMode, such as Overwrite or Append:
df.write().mode(SaveMode.Append).saveAsTable("tableName");
In Scala:
df.write.mode(SaveMode.Append).saveAsTable(tableName)
A lot of other options can be specified depending on the format you would like to save: text, ORC (with buckets), JSON, etc.
I am using Hadoop and facing the dreaded problem of large numbers of small files. I need to be able to create HAR archives out of existing Hive partitions and query them at the same time. However, Hive apparently supports archiving partitions only in managed tables, not external tables, which is pretty sad. I am trying to find a workaround by manually archiving the files inside a partition's directory using Hadoop's archive tool. I now need to configure Hive to be able to query the data stored in these archives, along with the unarchived data stored in other partition directories. Please note that we only have external tables in use.
The namespace for accessing the files in the created partition HAR corresponds to the HDFS path of the partition directory. For example, a file in HDFS:
hdfs:///user/user1/data/db1/tab1/ds=2016_01_01/f1.txt
can after archiving be accessed as:
har:///user/user1/data/db1/tab1/ds=2016_01_01.har/f1.txt
Would it be possible for Hive to query the HAR archives from the external table? If yes, please suggest a way.
Best Regards
In practice, the line between "managed" and "external" tables is very thin.
My suggestion (a HiveQL sketch follows below):
create a "managed" table
add partitions explicitly for some days in the future, but with ad hoc locations, i.e. the directories your external process expects to use
let the external process dump its files directly at the HDFS level; they are automagically exposed in Hive queries, "managed" or not (the Metastore does not track individual files and blocks, they are detected on each query; as a side note, you can run backup & restore operations at HDFS level if you wish, as long as you don't mess with the directory structure)
when a partition is "cold" and you are pretty sure there will never be another file dumped there, you can run a Hive command to archive the partition, i.e. move the small files into a single HAR and flag the partition as "archived" in the Metastore
Bonus: it's easy to unarchive your partition within Hive (whereas there is no hadoop unarchive command AFAIK).
Caveat: it's a "managed" table so remember not to DROP anything unless you have safely moved your data out of the Hive-managed directories.
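A hedged HiveQL sketch of those steps, reusing the example paths from the question (table, column, and partition names are placeholders; hive.archive.enabled must be allowed in your environment):
CREATE TABLE db1.tab1 (field1 STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- pre-create the partition, pointing it at the directory the external process will use
ALTER TABLE db1.tab1 ADD PARTITION (ds='2016_01_01')
LOCATION '/user/user1/data/db1/tab1/ds=2016_01_01';

-- once the partition is "cold", archive it
SET hive.archive.enabled=true;
ALTER TABLE db1.tab1 ARCHIVE PARTITION (ds='2016_01_01');

-- and it can be unarchived later from within Hive if needed
ALTER TABLE db1.tab1 UNARCHIVE PARTITION (ds='2016_01_01');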
I have copied the files associated with an HBase table into another cluster and stored them in the hbase folder. I can see the table when I do a list. When I do scan 'myTable', it can't find the table.
When I go through the HBase web UI, I see the table including its column family information; when I click on the table I get:
org.apache.hadoop.hbase.TableNotFoundException: hbTable
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1024)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:889)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:846)
How do I get the HBase region servers to manage the table?
P.S. For the purposes of this exercise I'm not interested in using the Export utility or the CopyTable utility.
Please refer to this link: Complete Bulk Load.
Assuming here that you copied the HFiles from one cluster to another.
Did you create the table in the other cluster? I think the error is due to missing information in the .META. table.