Issue with snapshot using Delta on dbt-spark

I'm creating a dbt snapshot on Spark using Delta Lake. The initial dbt snapshot run succeeds, but from the second dbt snapshot command onwards I'm getting this error:
Snapshot target is not a snapshot table (missing "dbt_scd_id", "dbt_valid_from", "dbt_valid_to")
I'm using local Spark with the Thrift server, started by running the $SPARK_HOME/sbin/start-thriftserver.sh command, and I'm storing the tables and datasets on S3.
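For context, the snapshot itself follows the standard dbt pattern; a minimal sketch of the kind of snapshot involved, with file_format='delta' set so dbt-spark can merge into the target, might look like this (the snapshot, schema, column, and source names are placeholders, not taken from the question):

{% snapshot orders_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
        file_format='delta'
    )
}}

select * from {{ source('app_db', 'orders') }}

{% endsnapshot %}

The error on the second run means dbt found an existing target table without the dbt_scd_id / dbt_valid_from / dbt_valid_to columns, so it is worth checking what the first run actually wrote to S3.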

Related

Sqoop failing when importing as avro in AWS EMR

I'm trying to perform a Sqoop import on Amazon EMR (Hadoop 2.8.5, Sqoop 1.4.7). The import goes fine when no Avro option (--as-avrodatafile) is specified, but once it's set the job fails with:
19/10/29 21:31:35 INFO mapreduce.Job: Task Id : attempt_1572305702067_0017_m_000000_1, Status : FAILED
Error: org.apache.avro.reflect.ReflectData.addLogicalTypeConversion(Lorg/apache/avro/Conversion;)V
Using the option -D mapreduce.job.user.classpath.first=true doesn't work.
Running locally (on my machine) I found that copying the avro-1.8.1.jar shipped with Sqoop into the Hadoop lib folder works, but on the EMR cluster I only have access to the master node, so doing the same doesn't help because the master node isn't the one that runs the jobs.
Has anyone faced this problem?
The solution I found was to connect to every node in the cluster (I thought I only had access to the master node, but I was wrong; in EMR we have access to all nodes) and replace the Avro jar included with Hadoop with the Avro jar that comes with Sqoop. It's not an elegant solution, but it works.
[UPDATE]
It turned out that the option -D mapreduce.job.user.classpath.first=true wasn't working because I was using s3a as the target dir, while Amazon says we should use s3. As soon as I started using s3, Sqoop could perform the import correctly, so there is no need to replace any files on the nodes. Using s3a can lead to strange errors on EMR because of Amazon's own configuration, so don't use it. Even in terms of performance, s3 is better than s3a on EMR, since the s3 implementation is Amazon's own.
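For illustration, an import along these lines with an s3:// target dir (the connection string, credentials, table, and bucket are placeholders, not from the original post):

sqoop import \
    -D mapreduce.job.user.classpath.first=true \
    --connect jdbc:mysql://<DB_HOST>:3306/<DB_NAME> \
    --username <DB_USER> --password <DB_PASSWORD> \
    --table <TABLE_NAME> \
    --as-avrodatafile \
    --target-dir s3://<YOUR_BUCKET_HERE>/sqoop/<TABLE_NAME>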

How to add jar files for Hue in Cloudera?

I'm running an SQL query on a JSON serde table. It's working in the Hive CLI, but it's failing in Hue with the error:
Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
I guess it's due to a missing jar file; any idea how to add the jar file hive-hcatalog-core-1.2.1.jar for Hue?
Place your jar in HDFS and reference that path with ADD JAR hdfs:///user/hive/lib/hive-hcatalog-core-1.2.1.jar;
Run the ADD JAR statement in Hue before your query; the jar stays available for as long as your current session persists.
For the benefit of others who might face the same issue, either with this particular jar (hive-hcatalog-core-1.2.1.jar) or with any UDF jar:
In the Hue Query Editor, run the following command:
add jar hdfs:/hive-hcatalog-core-1.2.1.jar;
Please note that single quotes are not required, as is the case with the Hive CLI.
The exact command Cloudera gives is ADD JAR {{lib_dir}}/hive/lib/hive-contrib.jar;
1) I am unable to find the hive/lib directory on CDH 5.
On CDH-installed environments, {{lib_dir}} for Hive is either /usr/lib/hive/ or /opt/cloudera/parcels/CDH/lib/hive/ (depending on whether packages or parcels are in use).
This is the way to add a jar in Cloudera. For this you have to switch to the superuser by running:
sudo su
It will switch you to the superuser.
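Putting the answers above together, a hedged end-to-end sequence (the local jar location, HDFS path, and table name are placeholders; adjust them to your environment):

# as a user with HDFS write access (e.g. after sudo su)
hdfs dfs -mkdir -p /user/hive/lib
hdfs dfs -put /path/to/hive-hcatalog-core-1.2.1.jar /user/hive/lib/

Then, in the Hue Query Editor, before the query:

ADD JAR hdfs:///user/hive/lib/hive-hcatalog-core-1.2.1.jar;
SELECT * FROM my_json_table LIMIT 10;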

Hive select query failed on ORC table

Exception:
Failed with exception java.io.IOException:java.io.IOException: Somehow read -1 bytes trying to skip 6257 more bytes to seek to position 6708, size: 1290047
Does anyone have any idea how to fix this on Cloud Dataproc?
It looks like you're probably hitting this known issue, which is somewhat specific to reading ORC files. GCS connector version 1.5.4 has the fix and is rolling out in Dataproc this week (expected to be fully rolled out by this Friday, October 14th).
In the meantime, you can use a small initialization action to update the connector version on your Dataproc clusters automatically; create a file named update-gcs-1.5.4.sh:
#!/bin/bash
rm -f /usr/lib/hadoop/lib/gcs-connector*.jar
gsutil cp gs://hadoop-lib/gcs/gcs-connector-1.5.4-hadoop2.jar /usr/lib/hadoop/lib/
And then upload that file to GCS somewhere:
gsutil cp update-gcs-1.5.4.sh gs://<YOUR_BUCKET_HERE>/update-gcs-1.5.4.sh
Then create your Dataproc cluster:
gcloud dataproc clusters create <CLUSTER_NAME> \
    --initialization-actions gs://<YOUR_BUCKET_HERE>/update-gcs-1.5.4.sh
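To double-check that a new cluster actually picked up the connector, one option (not part of the original answer; <CLUSTER_NAME> is your cluster name, and Dataproc names the master node <CLUSTER_NAME>-m) is:

gcloud compute ssh <CLUSTER_NAME>-m \
    --command 'ls /usr/lib/hadoop/lib/ | grep gcs-connector'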

Issue with loading data into Hive

We have launched two EMR clusters in AWS and installed Hadoop with hive-0.11.0 on one and hive-0.13.1 on the other.
Everything seems to be working fine, but when trying to load data into a table it gives the error below, and it happens on both Hive servers.
ERROR MESSAGE:
An error occurred when executing the SQL command:
load data inpath 's3://buckername/export/employee_1/' into table employee_2
Query returned non-zero code: 10028, cause: FAILED: SemanticException [Error 10028]: Line 1:17 Path is not legal ''s3://buckername/export/employee_1/'': Move from: s3://buckername/export/employee_1 to: hdfs://XXX.XX.XXX.XX:X000/mnt/hive_0110/warehouse/employee_2 is not valid. Please check that values for params "default.fs.name" and "hive.metastore.warehouse.dir" do not conflict. [SQL State=42000, DB Errorcode=10028]
I searched for the reason and meaning of this message and found this link, but when I tried to execute the command suggested there, it also gave the error below.
Command:
--service metatool -updateLocation hdfs://XXX.XX.XXX.XX:X000 hdfs://XXX.XX.XXX.XX:X000
Initializing HiveMetaTool.. HiveMetaTool:Parsing failed. Reason:
Unrecognized option: -hiveconf
Any help in this will be really appreciated.
LOAD does not support S3. It is best practice to leave the data in S3 and just use it as a Hive external table instead of copying the data to HDFS. Some references: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hive-additional-features.html and "When you create an external table in Hive with an S3 location, is the data transferred?"
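As a sketch of that external-table approach (the column list and delimiter are hypothetical; the S3 path follows the one in the question):

CREATE EXTERNAL TABLE employee_ext (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://buckername/export/employee_1/';

Queries then read directly from S3, with no LOAD DATA step at all.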
If you have installed Hive on your Hadoop cluster, the default storage for Hive data is HDFS (hive.metastore.warehouse.dir=/user/hive/warehouse).
As a workaround you can copy the file from the S3 file system to HDFS and then load the file into Hive from HDFS.
Most probably you will need to modify the parameter "hive.exim.uri.scheme.whitelist=hdfs,pfile" to load data from the S3 file system.
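A sketch of that copy-then-load workaround (the HDFS staging path is arbitrary):

hadoop distcp s3://buckername/export/employee_1/ hdfs:///tmp/employee_1/

Then, in Hive:

LOAD DATA INPATH 'hdfs:///tmp/employee_1/' INTO TABLE employee_2;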

Cassandra nodetool snapshot creates snapshot directory but not in the data directory specified

I am using Cassandra nodetool to take a snapshot.
The way I use it is to cd to $CASSANDRA_HOME and then run bin/nodetool snapshot. It says a snapshot directory has been created, but I don't find it in the data directory.
What am I doing wrong?
Snapshots are in $DATA_DIR/$KEYSPACE/$COLUMN_FAMILY/snapshots/.
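For example (the tag and keyspace names are illustrative, and /var/lib/cassandra/data is the common default data directory; yours may differ):

bin/nodetool snapshot -t before_upgrade my_keyspace
ls /var/lib/cassandra/data/my_keyspace/*/snapshots/before_upgrade/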
