Reason for having TempStatsStore in Hadoop

Could anyone explain the purpose of the TempStatsStore and derby.log files in Hadoop, and when they are created?
While trying to execute a query in Hive, I'm getting an error: unable to create TempStatStore.

From http://osdir.com/ml/general/2011-05/msg06513.html:
TempStatsStore is a Derby database for stats gathering (intermediate stats). You can turn off stats gathering by setting hive.stats.autogather=false.
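A minimal sketch, assuming you only want to disable it for the current session (the property can also be set cluster-wide in hive-site.xml):
-- Disable intermediate stats gathering for this Hive session,
-- so the TempStatsStore Derby database is never created.
SET hive.stats.autogather=false;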

Related

Table data transfer from one Hadoop environment to another using Hive, scheduled with Oozie

I'm pretty new to the Hadoop environment. Can anyone help me with transferring table data from one Hadoop environment (prod) to another (dev) using a Hive query, and with scheduling that query using Oozie?
A code sample would be most appreciated. Thanks in advance.
When copying Hive tables from one cluster to another you need to do two things:
Copy the actual HDFS data.
Copy the Hive table metadata.
You can do both of these relatively easily if you leave out more complex use cases and considerations such as diffing or incremental copies.
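A rough sketch of those two steps in HiveQL; the database, table name, and paths below are placeholders, and the HDFS copy itself happens outside Hive (e.g. with hadoop distcp):
-- Metadata: on the source cluster, capture the DDL so it can be replayed on the target cluster.
SHOW CREATE TABLE mydb.my_table;
-- Data: copy the table's HDFS directory between clusters out-of-band
-- (e.g. with hadoop distcp), keeping the same path if possible.
-- On the target cluster: run the captured DDL, then, for a partitioned table,
-- re-register the partitions against the copied files.
MSCK REPAIR TABLE mydb.my_table;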
The best way to migrate will be:
1. Get all the files from the source HDFS.
2. Copy them to the new HDFS.
3. Run CREATE TABLE on the new cluster.
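If your Hive version supports it (0.8+), the three steps above can also be bundled using Hive's EXPORT/IMPORT statements; a rough sketch with placeholder database, table, and paths:
-- On the source (prod) cluster: write the data and the table metadata to an HDFS export directory.
USE mydb;
EXPORT TABLE my_table TO '/tmp/hive_export/my_table';
-- Copy /tmp/hive_export/my_table to the target cluster (e.g. with hadoop distcp).
-- On the target (dev) cluster: recreate the table from the copied export.
USE mydb;
IMPORT TABLE my_table FROM '/tmp/hive_export/my_table';
Either way, the resulting script can then be run on a schedule from an Oozie Hive action.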

Read data from Hadoop HDFS with SparkSQL connector to visualize it in Superset?

On an Ubuntu server I set up Divolte Collector (http://divolte.io/) to gather clickstream data from websites. The data is stored in Hadoop HDFS as Avro files.
I would like to visualize the data with Airbnb Superset, which has several connectors to common databases (thanks to SQLAlchemy) but none to HDFS.
Superset does, in particular, have a connector to SparkSQL via Hive JDBC (http://airbnb.io/superset/installation.html#database-dependencies).
So is it possible to use it to retrieve the HDFS clickstream data? Thanks.
In order to read HDFS data in SparkSQL there are two major ways, depending on your setup:
1. Read the table as it was defined in Hive (reading from a remote metastore); probably not your case.
2. Use SparkSQL's embedded metastore: by default (if not configured otherwise) SparkSQL creates an embedded Hive metastore, which allows you to issue DDL and DML statements using Hive syntax.
You need an external package for that to work: com.databricks:spark-avro.
CREATE TEMPORARY TABLE divolte_data
USING com.databricks.spark.avro
OPTIONS (path "path/to/divolte/avro");
Now the data should be available in the table divolte_data.
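As a quick sanity check you could then query it directly; the column name below (location) is an assumption based on a typical Divolte schema and may differ in yours:
-- Count page views per location from the Divolte clickstream.
SELECT location, COUNT(*) AS hits
FROM divolte_data
GROUP BY location
ORDER BY hits DESC
LIMIT 10;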

Sqoop-2 fails on large import to single node with custom query using sqoop shell

I am prototyping the migration of a large record set generated by a computationally expensive custom query. This query takes approximately 1-2 hours to return a result set in SQL Developer.
I am attempting to pass this query to a simple Sqoop job with links from JDBC to HDFS.
I have encountered the following errors in my logs:
2016-02-12 10:15:50,690 ERROR mr.SqoopOutputFormatLoadExecutor [org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:257)] Error while loading data out of MR job.
org.apache.sqoop.common.SqoopException: GENERIC_HDFS_CONNECTOR_0005:Error occurs during loader run
at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:110)
at org.apache.sqoop.connector.hdfs.HdfsLoader.load(HdfsLoader.java:41)
at org.apache.sqoop.job.mr.SqoopOutputFormatLoadExecutor$ConsumerThread.run(SqoopOutputFormatLoadExecutor.java:250)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/username/schema/recordset/.72dee005-3f75-4a95-bbc0-30c6b565d193/f5aeeecc-097e-49ab-99cc-b5032ae18a84.txt (inode 16415): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_-1820866823_31, pendingcreates: 1]
When I try to check the resulting .txt files in my HDFS, they are empty.
Has anyone encountered and solved this? Also, I am noticing additional bugginess with the Sqoop shell. For example, I am unable to check the job status as it always returns UNKNOWN.
I am using sqoop-1.99.6-bin-hadoop200 with Hadoop 2.7.2 (Homebrew install). I am querying a remote Oracle 11 database with the Generic JDBC Connector.
I have already conducted a smaller import job using the schema/table parameters in create job.
I am tempted to migrate the entire schema table by table, then just use Hive to generate and store the record set I want. Would this be a better/easier solution?
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException
This query takes approximately 1-2 hours to return a result set in SQL Developer
I would bet that Sqoop 1.99 creates an empty HDFS file (i.e. the NameNode gets the request, creates the file but does not materialize it for other clients yet, grants an exclusive write lease for Sqoop, and assigns responsibility for writing block#1 to a random DataNode) then waits for the JDBC ResultSet to produce some data... without doing any keep-alive in the meantime.
But alas, after 60 minutes the NameNode just sees that the lease has expired without any sign of the Sqoop client being alive, so it closes the file, or rather behaves as if it had never been created (no flush has ever occurred).
Any chance you can reduce the time lapse with a /*+ FIRST_ROWS */ hint on the Oracle side?
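For illustration, a sketch of what that hint looks like; the table, columns, and predicate are placeholders:
-- Ask Oracle's optimizer to start returning rows as early as possible,
-- so Sqoop begins writing to HDFS before the lease expires.
SELECT /*+ FIRST_ROWS */ id, created_at, payload
FROM my_schema.my_large_table
WHERE created_at >= DATE '2016-01-01';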

Errors from avro.serde.schema - "CannotDetermineSchemaSentinel"

When running jobs on Hadoop (CDH4.6 and Hive 0.10), these errors showed up:
avro.serde.schema
{"type":"record","name":"CannotDetermineSchemaSentinel","namespace":"org.apache.hadoop.hive","fields":
[{"name":"ERROR_ERROR_ERROR_ERROR_ERROR_ERROR_ERROR","type":"string"},{"name":"Cannot_determine_schema","type":"string"},{"name":"check","type":"string"},
{"name":"schema","type":"string"},{"name":"url","type":"string"},{"name":"and","type":"string"},{"name":"literal","type":"string"}]}
What's the root cause, and how do I resolve them?
Thanks!
This happens when Hive is unable to read or parse the Avro schema you have given it. Check the avro.schema.url or avro.schema.literal property in your table; it is likely set incorrectly.
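For reference, a minimal sketch of an Avro-backed table definition in Hive 0.10-era syntax; the table name and schema URL are placeholders, and the .avsc file must be valid, reachable Avro JSON:
-- Columns are derived from the Avro schema referenced in TBLPROPERTIES.
CREATE TABLE my_avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/my_avro_table.avsc');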

Mahout Hive Integration

I want to combine Hadoop-based Mahout recommenders with Apache Hive, so that my generated recommendations are stored directly in my Hive tables. Does anyone know of tutorials for this?
Hadoop-based Mahout recommenders can store their results directly in HDFS.
Hive also allows you to create a table schema on top of any data using CREATE EXTERNAL TABLE recommend_table, which also specifies the location of the data (LOCATION '/home/admin/userdata').
This way, whenever new data is written to that location (/home/admin/userdata), it is immediately available to Hive and can be queried through the existing table schema recommend_table.
I had blogged about it some time back: external-tables-in-hive-are-handy. This approach works for any kind of MapReduce program output that needs to be available immediately for ad-hoc Hive queries.
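A minimal sketch of such an external table; the column layout and delimiter are assumptions about how the recommender job writes its output, so adjust them to your actual file format:
-- External table over the recommender output directory; dropping the table
-- does not delete the files underneath it.
CREATE EXTERNAL TABLE recommend_table (
  user_id BIGINT,
  recommendations STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/home/admin/userdata';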
