Issue creating a table from another table in Hive - hadoop

In Hive there is a test table. The table's data consists of many small files, so I want to create another table from that test table; the newly created table will have fewer partitions and queries will be fast. But when I create the new table, it gives me an error.
CREATE TABLE IF NOT EXISTS test_merge
STORED AS parquet
AS
SELECT * FROM test;
Error
ERROR : Status: Failed
ERROR : FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask
INFO : Completed executing command(queryId=hive_20180108060101_7bca2cc8-e19b-4e6d-aa00-362039526523); Time taken: 366.845 seconds
Error: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask (state=08S01,code=3)
It works fine with less data, for example:
CREATE TABLE IF NOT EXISTS test_merge
STORED AS parquet
AS
SELECT * FROM test limit 100000;
It may be a memory issue, I don't know. Please help.

When you write Parquet files, Spark buffers a batch of rows into a data block called a "Row Group" before flushing it to disk, so it usually requires more memory than row-oriented formats. Try increasing "spark.executor.memory" or decreasing "parquet.block.size"; this may help.
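As a rough sketch of how those knobs could be applied from the same Hive session before re-running the CTAS (the values are illustrative assumptions, not recommendations, and depending on your setup the executor memory may instead need to go into spark-defaults.conf or hive-site.xml):

SET spark.executor.memory=4g;      -- more heap per Spark executor (example value)
SET parquet.block.size=67108864;   -- 64 MB row groups instead of the 128 MB default

CREATE TABLE IF NOT EXISTS test_merge
STORED AS parquet
AS
SELECT * FROM test;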

Related

Oracle Data Integrator (ODI 12.2.1) - Load plan record count issue

I came across a scenario in my project. I am loading data from a file into a table using ODI, and I run my interfaces through a load plan. I have 1000 records in my source file, and I am also getting 1000 records in the target, but when I check the ODI load plan execution log it shows the number of inserts as 2000. Can anyone please help, or is this an ODI bug?
The number of inserts does not only count the inserts into the target table but also all the inserts happening in temporary tables. Depending on the knowledge modules (KMs) used in an interface, ODI might load data into a C$_ table (LKM) or an I$_ table (IKM/CKM). The rows loaded into these tables are counted as well.
You can look at the code generated in the Operator to check whether your KMs use these temporary tables. You can also simulate an execution to see the generated code.
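For illustration, the code ODI generates behaves roughly like the sketch below; the schema and table names (ODI_WORK, C$_0TARGET, APP.TARGET_TABLE, source_file_ext) are hypothetical placeholders, and the real ones depend on your topology and KMs:

-- LKM step: stage the 1000 source rows in a C$_ work table (logged as 1000 inserts)
INSERT INTO ODI_WORK.C$_0TARGET (col1, col2)
SELECT col1, col2 FROM source_file_ext;

-- IKM step: insert the same 1000 rows into the real target (logged again)
INSERT INTO APP.TARGET_TABLE (col1, col2)
SELECT col1, col2 FROM ODI_WORK.C$_0TARGET;

-- The log therefore reports 1000 + 1000 = 2000 inserts, while the target holds 1000 rows.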

Unable to partition Hive table backed by HDFS

Maybe this is an easy question, but I am having a difficult time resolving the issue. At this time, I have a pseudo-distributed HDFS that contains recordings encoded using protobuf 3.0.0. Then, using Elephant-Bird/Hive, I am able to put that data into Hive tables to query. The problem that I am having is partitioning the data.
This is the table create statement that I am using
CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE;
The table is created and I do not receive any runtime errors when I query the table.
When I attempt to load data as follows:
ALTER TABLE test_messages_20180116_20180116 ADD PARTITION (dt = '20171117') LOCATION '/test/20171117'
I receive an "OK" statement. However, when I query the table:
select * from test_messages limit 1;
I receive the following error:
Failed with exception java.io.IOException:java.lang.IllegalArgumentException: FieldDescriptor does not match message type.
I have been reading up on Hive tables and have seen that the partition columns do not need to be part of the data being loaded. The reason I am trying to partition by date is partly performance but, more so, because the "LOAD DATA ... " statements move the files between directories in HDFS.
P.S. I have verified that I am able to run queries against the Hive table without partitioning.
Any thoughts ?
I see that you have created an EXTERNAL TABLE, so Hive reads the data but does not manage the files for you: the ADD PARTITION statement registers the partition in the metastore, but it does not put any data in that location. You need to create the folder and its contents yourself using HDFS commands, MapReduce, or Spark. You can check the HDFS location '/test/dt=20171117' and you will see that the folder has not been created.
My suggestion is to create the partition folder with "hadoop fs -mkdir '/test/20171117'" and then query the table. It will return 0 rows at first, but you can add data to that folder and read it from Hive.
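In Hive terms, that suggestion amounts to something like the sketch below (the table name and path are taken from the question; the dfs command assumes the Hive CLI, otherwise run hadoop fs from a shell):

-- 1. Create the partition directory that the ADD PARTITION statement pointed at
dfs -mkdir -p /test/20171117;

-- 2. Query the table; while the directory is empty this returns 0 rows
SELECT * FROM test_messages WHERE dt = '20171117' LIMIT 1;

-- 3. Copy the protobuf-encoded sequence files into /test/20171117 and re-run the query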
You need to specify a LOCATION for an EXTERNAL TABLE
CREATE EXTERNAL TABLE
...
LOCATION '/test';
Then, is the data actually a sequence file? All you've said is that it's protobuf data. I'm not sure how the Elephant-Bird library works, but you'll want to double-check that.
Then, your table locations need to look like /test/dt=value in order for Hive to read them.
After you create an external table over HDFS location, you must run MSCK REPAIR TABLE table_name for the partitions to be added to the Hive metastore
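Putting those pieces together, a sketch of the full flow could look like this; it assumes the data files are laid out under /test in dt=<value> subdirectories such as /test/dt=20171117:

CREATE EXTERNAL TABLE IF NOT EXISTS test_messages
PARTITIONED BY (dt string)
ROW FORMAT SERDE
"com.twitter.elephantbird.hive.serde.ProtobufDeserializer"
WITH serdeproperties (
"serialization.class"="path.to.my.java.class.ProtoClass")
STORED AS SEQUENCEFILE
LOCATION '/test';

-- Scan /test for dt=... directories and register them as partitions in the metastore
MSCK REPAIR TABLE test_messages;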

Incremental load in Greenplum

I have external and internal tables in Greenplum. The external table points to a CSV file in HDFS. This CSV file in HDFS gets loaded with the full data of a table every hour.
What is the best way to load the data incrementally into the internal table in Greenplum?
Create a dimension (control) table in Greenplum that stores how far the previous load got, e.g. a last-loaded timestamp or any other watermark value.
Using that dimension table, you can write a UDF or load routine so that every hour, whenever a new file arrives, it is loaded into the stage/external table, and then, using the last-loaded value from the dimension table, only the relevant/new records are picked up for further processing.
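A minimal sketch of that pattern is shown below, with made-up names: ext_stage stands for the external table over the hourly CSV, target_table for the internal table, and updated_at for the column used as the watermark.

-- Control (dimension) table holding the high-water mark of the last successful load
CREATE TABLE etl_load_control (
    table_name   text,
    last_loaded  timestamp
);
INSERT INTO etl_load_control VALUES ('target_table', '1900-01-01');

-- Hourly load: pull only rows newer than the stored watermark
INSERT INTO target_table
SELECT s.*
FROM   ext_stage s
WHERE  s.updated_at > (SELECT last_loaded
                       FROM   etl_load_control
                       WHERE  table_name = 'target_table');

-- Move the watermark forward once the load succeeds
UPDATE etl_load_control
SET    last_loaded = (SELECT max(updated_at) FROM ext_stage)
WHERE  table_name = 'target_table';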
Thanks,
shobha

sqlldr style error logging in Oracle external tables

I'm currently trying to get the error messages we get from our Oracle external table loading process to match the level of detail we get when loading via sqlldr.
Currently, if I load a file with sqlldr and a record fails, I get this error message, which is pretty useful: it gives me the record number, the actual column name that failed, and the record in a bad file.
Record 4: Rejected - Error on table ERROR_TEST, column COL1.
ORA-01722: invalid number
I've got an external table along with an INSERT statement into a target table that logs errors to a DBMS_ERRLOG table, but this is the equivalent error message from that process:
ORA-01722: invalid number
Whilst this process has the benefit of recording the actual record in a table, so you can see the column name mappings, it doesn't actually list which column has the issue. This is a big problem when looking at tables with many columns...
It looks as though I can put the REJECT LIMIT on the external table itself, which will give the same error as above, but then I lose the ERR table logging. I'm guessing it's a case of one or the other?
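For reference, the error-logging insert being described looks roughly like the sketch below (error_test, error_test_ext and the ERR$_ table name are placeholders). The rejected row lands in the ERR$_ table with the ORA-01722 text, but, as noted, without the name of the offending column:

-- One-off: create the error log table (defaults to ERR$_ERROR_TEST)
BEGIN
  DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'ERROR_TEST');
END;
/

-- Load from the external table, diverting bad rows instead of failing the statement
INSERT INTO error_test
SELECT *
FROM   error_test_ext
LOG ERRORS INTO err$_error_test ('external load') REJECT LIMIT UNLIMITED;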

Moving partial data across hadoop instances

I have to copy a certain chunk of data from one Hadoop cluster to another. I wrote a Hive query which dumps the data into HDFS. After copying the file to the destination cluster, I tried to load the data using the command "load data inpath '/a.txt' into table data". I got the following error message:
Failed with exception Wrong file format. Please check the file's format.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.MoveTask
I had dumped the data as a sequence file. Can anybody let me know what I am missing here?
You should use STORED AS SEQUENCEFILE when creating the table if you want to store sequence files in it. Also, you have written that you dumped the data as a sequence file, yet your file name is a.txt; that doesn't add up.
If you want to load a text file into a table that expects sequence files as its data source, you can do the following: first create a normal (text) table and load the text file into it, then run:
insert into table seq_table select * from text_table;
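Spelled out, that two-step load could look like the sketch below; the column list is illustrative only, since the real schema isn't shown in the question:

-- Staging table matching the text file's layout (columns are assumed)
CREATE TABLE text_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/a.txt' INTO TABLE text_table;

-- Target table stored as a sequence file
CREATE TABLE seq_table (id INT, name STRING)
STORED AS SEQUENCEFILE;

-- Rewrite the rows into sequence-file format
INSERT INTO TABLE seq_table SELECT * FROM text_table;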
