I need to query the hourly MapReduce batch results from Impala.
The output directory structure will be:
/data/access/web1/2015/Jan/day1/09/part-r-00000
/data/access/web1/2015/Jan/day1/09/part-r-00001
...
/data/access/web1/2015/Jan/day1/20/part-r-00000
/data/access/web1/2015/Jan/day1/20/part-r-00001
...
/data/access/web1/2015/Jan/day2/01/part-r-00000
...
/data/access/web1/2015/Jan/day30/18/part-r-00000
....
Is it possible to create an Impala table that reads the data from the /data/access/web1/* directory (including subdirectories)?
By default, Impala does not query data from subdirectories.
How can recursive reading be enabled in Impala?
A workaround is to create a partitioned table in Impala, but a partitioned table doesn't fit our requirement.
How can this issue be resolved?
As of now, recursive reading of files from subdirectories under the table LOCATION is not supported in Impala.
Example: if a table is created with location '/home/data/input/'
and the directory structure is as follows:
/home/data/input/a.txt
/home/data/input/b.txt
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
then Impala can query only the following files:
/home/data/input/a.txt
/home/data/input/b.txt
The following files are not queried:
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
As an alternative solution, you can read the data from Hive and insert it into a final Hive table.
Create an Impala view on top of this table for interactive or reporting queries.
You can enable this in Hive using the configuration settings below.
Hive supports subdirectory scans with the options
SET mapred.input.dir.recursive=true;
and
SET hive.mapred.supports.subdirectories=true;
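For example, a minimal sketch of that Hive-side flow for the directory layout above (the single STRING column and the table names are assumptions, since the question does not give the output schema):
-- in Hive: pick up files from subdirectories under the table location
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
-- staging table over the whole hourly output tree
CREATE EXTERNAL TABLE access_staging (line STRING)
LOCATION '/data/access/web1';
-- final table whose files sit directly under its own location,
-- so Impala can query it without recursing into subdirectories
CREATE TABLE access_final STORED AS PARQUET AS
SELECT * FROM access_staging;
-- in Impala, after the Hive job finishes
INVALIDATE METADATA access_final;
CREATE VIEW access_report AS SELECT * FROM access_final;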
Check out Hive external tables:
CREATE EXTERNAL TABLE my_external_table (c1 INT, c2 STRING, c3 TIMESTAMP)
LOCATION '/data/access/web1';
Impala will read data from the given HDFS directory recursively.
When you add new files to the HDFS directory, run refresh my_external_table; in Impala to notify it about the new data.
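For example, after a new hourly batch lands under the table's HDFS directory (table name taken from the example above):
-- pick up new data files in locations Impala already knows about
REFRESH my_external_table;
-- if the table was created or altered outside Impala (e.g. in Hive), use the heavier
INVALIDATE METADATA my_external_table;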
When loading data from HDFS into Hive using the
LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;
command, it looks like it moves the hdfs_file into the hive/warehouse directory.
Is it possible (and how?) to copy it instead of moving it, so that the file can still be used by another process?
From your question I assume that you already have your data in HDFS.
So you don't need LOAD DATA, which moves the files to the default Hive location /user/hive/warehouse. You can simply define the table using the EXTERNAL keyword, which leaves the files in place but creates the table definition in the Hive metastore. See here:
Create Table DDL
e.g.:
create external table table_name (
id int,
myfields string
)
location '/my/location/in/hdfs';
Please note that the format you use might differ from the default (as mentioned by JigneshRawal in the comments). You can use your own delimiter, for example when using Sqoop:
row format delimited fields terminated by ','
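Putting the two together, a sketch of the full DDL for comma-delimited files (column names are just placeholders):
create external table table_name (
id int,
myfields string
)
row format delimited fields terminated by ','
location '/my/location/in/hdfs';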
I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table but initially no data is present (assuming your data location is different from the Hive LOCATION).
When you use the LOAD DATA INPATH command, the data gets MOVED (instead of copied) from the data location to the location you specified while creating the Hive table.
If a location is not given when you create the Hive table, the internal Hive warehouse location is used, and the data is moved from your source data location to the internal Hive data warehouse location (i.e. /user/hive/warehouse/).
An alternative to LOAD DATA is available in which the data will not be moved from your existing source location to the Hive data warehouse location.
You can use the ALTER TABLE command with the LOCATION option. Here is the required command:
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07') LOCATION 'hdfs/path/to/location/'
The only condition is that the location must be a directory, not a file.
Hope this solves the problem.
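As a sketch, the full flow with a partitioned external table looks like this (table and column names are placeholders):
CREATE EXTERNAL TABLE table_name (
id INT,
myfields STRING
)
PARTITIONED BY (date_col STRING);
-- the data stays in place; only the partition metadata points at it
ALTER TABLE table_name ADD PARTITION (date_col='2017-02-07')
LOCATION 'hdfs/path/to/location/';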
I have a directory, such as /user/name/folder.
Inside this directory, I have more sub-directories named dt=2020-06-01, dt=2020-06-02, dt=2020-06-03, etc.
These directories contain parquet files. They all have the same schema.
Is it possible to create an Impala table using /user/name/folder?
Each time I do, I get a table with 0 records. Is there a way to tell Impala to pull the Parquet files from all of the subdirectories?
One way to do that is to load the data with static partitioning, in which you define the partitions manually: you create each partition with an ALTER TABLE … ADD PARTITION statement
and then load the data into the partition.
CREATE TABLE customers_by_date
(cust_id STRING, name STRING)
PARTITIONED BY (dt STRING)
STORED AS PARQUET;
ALTER TABLE customers_by_date
ADD PARTITION (dt='2020-06-01')
LOCATION '/user/name/folder/dt=2020-06-01';
If the location is not specified, the partition directory is created under the table's location:
ALTER TABLE customers_by_date
ADD PARTITION (dt='2020-06-01');
and you could also load the data with HDFS commands:
$ hdfs dfs -cp /user/name/folder/dt=2020-06-01 /user/directory_impala/table/partition
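Either way, a quick sanity check from Impala afterwards could look like this (a sketch; if the files were copied in with HDFS commands, the REFRESH is what makes them visible):
REFRESH customers_by_date;
SHOW PARTITIONS customers_by_date;
SELECT dt, COUNT(*) FROM customers_by_date GROUP BY dt;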
You could follow these links to the Cloudera documentation for further details:
Partitioning for Impala Tables
Impala Create table statement
Impala Alter table statement
Maybe this is an easy question, but I am having a difficult time resolving the issue. I have a pseudo-distributed HDFS that contains recordings encoded with protobuf 3.0.0. Using Elephant-Bird/Hive, I am able to put that data into Hive tables and query it. The problem I am having is partitioning the data.
This is the table creation statement that I am using:
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because the "LOAD DATA ..." statements move the files between directories in HDFS.
P.S. I have proven that I am able to run queries against the Hive table without partitioning.
Any thoughts?
I see that you have created an EXTERNAL TABLE, so you cannot add or drop partitions using Hive; you need to create the folder using HDFS, MR, or Spark. An EXTERNAL table can only be read by Hive, not managed by it. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the folder (partition) using "hadoop fs -mkdir '/test/20171117'", then try to query the table. Although it will return 0 rows, you can add data to that folder and read it from Hive.
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Also, your partition locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over an HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore.
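For the table in the question, assuming the partition directories follow the /test/dt=value layout described above, a sketch of that would be:
MSCK REPAIR TABLE test_messages;
SHOW PARTITIONS test_messages;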
I'm populating a partitioned Hive table in Parquet storage format using a query that uses a number of UNION ALL operators. The query is executed using Tez, which, with default settings, results in multiple concurrent Tez writers creating an HDFS structure where the Parquet files sit in subfolders (named with the Tez writer ID) under the partition folders.
E.g. /apps/hive/warehouse/scratch.db/test_table/part=p1/8/000000_0
Even after running invalidate metadata and collecting stats on the table, Impala returns zero rows when the table is queried.
The issue seems to be that Impala does not traverse into the partition subfolders to look for Parquet files.
If I set hive.merge.tezfiles to true (it's false by default), effectively forcing Tez to use an extra processing step to coalesce the multiple files into one, the resulting Parquet files are written directly under the partition folder, and after a refresh Impala can see the data in the new or updated partitions.
I wonder if there is a config option that instructs Impala to look in partition subfolders, or perhaps a patch for Impala that changes its behavior in that regard.
As of now, recursive reading of files from subdirectories under the table LOCATION is not supported in Impala.
Example: if a table is created with location '/home/data/input/'
and the directory structure is as follows:
/home/data/input/a.txt
/home/data/input/b.txt
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
then Impala can query only the following files:
/home/data/input/a.txt
/home/data/input/b.txt
The following files are not queried:
/home/data/input/subdir1/x.txt
/home/data/input/subdir2/y.txt
As an alternative solution, you can read the data from Hive and insert it into a final Hive table.
Create an Impala view on top of this table for interactive or reporting queries.
You can enable this in Hive using the configuration settings below.
Hive supports subdirectory scans with the options
SET mapred.input.dir.recursive=true;
and
SET hive.mapred.supports.subdirectories=true;
Recently I have wanted to load log files into Hive tables, and I want a tool that can read data from a certain directory and load it into Hive automatically.
This directory may include lots of subdirectories; for example, the directory is '/log' and the subdirectories are '/log/20130115', '/log/20130116', '/log/20130117'.
Are there any ETL tools that can do the following: once new data is stored in the directory, the tool detects it automatically and loads it into a Hive table?
Is there such a tool, or do I have to write a script myself?
You can easily do this using Hive external tables and partitioning your table by day. For example, create your table as such:
create external table mytable(...)
partitioned by (day string)
location '/user/hive/warehouse/mytable';
This will essentially create an empty table in the metastore and make it point to /user/hive/warehouse/mytable.
Then you can load your data into this directory using the key=value naming convention, where key is your partition column name (here "day") and value is the value of your partition. For example:
hadoop fs -put /log/20130115 /user/hive/warehouse/mytable/day=20130115
Once your data is loaded there, it is in the HDFS directory, but the Hive metastore doesn't know yet that it belongs to the table, so you can add it this way:
alter table mytable add partition(day='20130115');
And you should be good to go: the metastore will be updated with your new partition, and you can now query your table on this partition.
This should be trivial to script: you can create a cron job that runs once a day, executes these commands in order, and finds the partition to load with the date command, for example by running:
hadoop fs -test -e /log/`date +%Y%m%d`
Checking whether $? is equal to 0 will tell you if the directory is there, and if it is, you can transfer it and add the partition as described above.
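For example, a sketch of a query that reads only that partition:
select count(*) from mytable where day = '20130115';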
You can make use of the LOAD DATA command provided by Hive; it matches your use case exactly. Specify a directory in your local file system and build Hive tables from it.
Example usage:
LOAD DATA LOCAL INPATH '/home/user/some-directory'
OVERWRITE INTO TABLE table
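If the target table is partitioned by day, as in the earlier answer, the per-day load could be sketched as follows (the path and table name are assumptions):
LOAD DATA LOCAL INPATH '/home/user/some-directory/20130115'
OVERWRITE INTO TABLE mytable
PARTITION (day = '20130115');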