I'm a newbie in HiveQL. While creating a table, I learned that we need to set some transaction-related properties to true. I then went through what those properties are:
hive> set hive.support.concurrency=true;
hive> set hive.enforce.bucketing=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=<a positive number>;  (on at least one instance of the Thrift metastore service)
What exactly are concurrency, bucketing, and dynamic.partition.mode = 'nonstrict'?
I have been trying to learn about these, but the material I find mixes in locking mechanisms, ZooKeeper, and in-memory concepts.
As I'm completely new to this area, I'm unable to get a proper understanding of these properties.
Can anyone shed some light on this?
From the Hive documentation:
hive.support.concurrency
Whether Hive supports concurrency or not. A ZooKeeper instance must be
up and running for the default Hive lock manager to support read-write
locks.
Set to true to support INSERT ... VALUES, UPDATE, and DELETE
transactions (Hive 0.14.0 and later). For a complete list of
parameters required for turning on Hive transactions, see
hive.txn.manager.
hive.enforce.bucketing
Whether bucketing is enforced. If true, while inserting into the
table, bucketing is enforced.
hive.exec.dynamic.partition.mode
In strict mode, the user must specify at least one static partition in
case the user accidentally overwrites all partitions. In nonstrict
mode all partitions are allowed to be dynamic.
hive.txn.manager
Set this to org.apache.hadoop.hive.ql.lockmgr.DbTxnManager as part of
turning on Hive transactions. The default DummyTxnManager replicates
pre-Hive-0.13 behavior and provides no transactions.
hive.compactor.initiator.on
Whether to run the initiator and cleaner threads on this metastore
instance. Set this to true on one instance of the Thrift metastore
service as part of turning on Hive transactions. For a complete list
of parameters required for turning on transactions, see
hive.txn.manager.
It's critical that this is enabled on exactly one metastore service
instance (not enforced yet).
hive.compactor.worker.threads
How many compactor worker threads to run on this metastore instance.
Set this to a positive number on one or more instances of the Thrift
metastore service as part of turning on Hive transactions. For a
complete list of parameters required for turning on transactions, see
hive.txn.manager.
Worker threads spawn MapReduce jobs to do compactions. They do not do
the compactions themselves. Increasing the number of worker threads
will decrease the time it takes tables or partitions to be compacted
once they are determined to need compaction. It will also increase the
background load on the Hadoop cluster as more MapReduce jobs will be
running in the background.
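Putting the settings together, here is a minimal sketch of how a transactional table might be created and used once the properties above are in place; the table, column, and partition names (and the staging_employees source table) are made up for illustration:
-- ACID tables must be bucketed and stored as ORC, with transactional=true.
CREATE TABLE employees_txn (
  id INT,
  name STRING
)
PARTITIONED BY (country STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- With hive.exec.dynamic.partition.mode=nonstrict, the partition value is taken
-- from the data instead of being hard-coded in the statement.
INSERT INTO TABLE employees_txn PARTITION (country)
SELECT id, name, country FROM staging_employees;

-- UPDATE and DELETE only work once the DbTxnManager and compactor settings are active.
UPDATE employees_txn SET name = 'Alice' WHERE id = 1;
DELETE FROM employees_txn WHERE id = 2;

-- Compaction activity triggered by the initiator/worker threads can be inspected with:
SHOW COMPACTIONS;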
Related
I'm trying to configure a cluster with both sharding and replication and have some doubts about how insert_quorum works with Distributed engine and internal replication.
insert_quorum controls synchronous insertion into multiple instances of Replicated* tables (if insert_quorum>=2, the client will return only after the data was successfully inserted into insert_quorum replicas).
insert_distributed_sync controls synchronous insertion into a Distributed table. If insert_distributed_sync=1, the client will return only after the data was successfully inserted into the target tables (one replica if internal_replication is true).
But how do insert_distributed_sync, insert_quorum and internal_replication work together?
Is my understanding correct that if I execute insert into Distributed table with insert_distributed_sync=1 and insert_quorum=2 the statement will return only after the data was inserted in at least two replicas?
Or is insert_quorum ignored for Distributed engine and works only when writing directly with Replicated* tables?
As I understand it:
internal_replication and insert_distributed_sync apply to the Distributed engine
insert_quorum applies to ReplicatedMergeTree
An INSERT query into a Distributed table created over multiple *ReplicatedMergeTree tables, with insert_distributed_sync=1, will invoke multiple inserts into the ReplicatedMergeTree tables inside the initial clickhouse-server process, using the authentication from the remote_servers part of the config.
There will be one INSERT for each shard, according to the sharding key you defined when creating the Distributed table.
If you define internal_replication=true, then only one *ReplicatedMergeTree replica should be written to. But when the Distributed engine inserts into a ReplicatedMergeTree table, the initial clickhouse-server acts as a client, so insert_quorum should apply on the destination clickhouse-server, and the initial server will get an answer only after the inserted parts have been replicated through ZooKeeper.
If you define internal_replication=false, then the Distributed engine should initiate inserts to all *ReplicatedMergeTree replicas, and insert_quorum will also apply, but replication conflicts should be resolved through the ZooKeeper queues on the ReplicatedMergeTree side, because the inserted parts will have the same checksums and names.
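To make the moving parts concrete, here is a rough sketch of the setup being discussed; the cluster name, database, table names, and columns are assumptions for illustration, and the shard/replica layout (including internal_replication) would come from the remote_servers section of the server config:
-- Per-shard replicated table, created on every replica of every shard.
CREATE TABLE events_local ON CLUSTER my_cluster
(
    event_date Date,
    id UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
ORDER BY id;

-- Distributed table that routes inserts to the shards by sharding key.
CREATE TABLE events_all ON CLUSTER my_cluster AS events_local
ENGINE = Distributed(my_cluster, currentDatabase(), events_local, rand());

-- Synchronous insert through the Distributed table; whether insert_quorum is honoured
-- on the destination servers is exactly the question discussed above.
INSERT INTO events_all
SETTINGS insert_distributed_sync = 1, insert_quorum = 2
VALUES ('2020-01-01', 1);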
Since HBase is built on top of HDFS, which has a replication policy for fault tolerance, does this mean HBase is inherently fault tolerant and data stored in HBase will always be accessible thanks to the underlying HDFS? Or does HBase implement a replication policy of its own (e.g. table replication over regions)?
Yes, you can create replicas of regions in HBase, as mentioned here. However, note that HBase high availability is read-only; it is not highly available for writes. If a region server goes down, then until its regions are assigned to a new region server, you will not be able to write.
To enable read replicas, you need to enable async WAL replication by setting hbase.region.replica.replication.enabled to true. You will also need to enable high availability for the table at creation time by specifying a REGION_REPLICATION value greater than 1, as in the docs:
create 't1', 'f1', {REGION_REPLICATION => 2}
More details can be found here.
The concept of replication in HBase is different from HDFS replication; they operate in different contexts. HDFS is the file system and replicates data blocks for fault tolerance and high availability. HBase replication, on the other hand, is mainly about fault tolerance, high availability, and data integrity from a database-system perspective.
Of course, HDFS's replication capability is used for file-level replication of HBase data. In addition, HBase maintains copies of its metadata on backup nodes (which are again replicated by HDFS by default).
HBase also has backup processes to monitor and recover from failures, such as primary and secondary region servers. But data loss on a region server is protected against by HDFS replication only.
Hence, HBase replication is mainly about failure recovery and maintaining data integrity as a database engine, just like any other robust database system such as Oracle.
If we use 6 mappers in Sqoop to import data from Oracle, how many connections will be established between Sqoop and the source?
Will it be a single connection, or 6 connections, one per mapper?
As per sqoop docs:
Likewise, do not increase the degree of parallelism higher than that which your database can reasonably support. Connecting 100 concurrent clients to your database may increase the load on the database server to a point where performance suffers as a result.
That means all the mappers will make concurrent connections.
Also keep in mind that if your table has only 2 records, then Sqoop will use only 2 mappers, not all 6.
Check my other answer to understand the concept of the number of mappers in a Sqoop command.
EDIT:
All the mappers will open connections as JDBC client programs. Active connections (the ones that actually fire SQL queries) will then be shared among multiple mappers.
Run the Sqoop import command in --verbose mode and you will see logs like:
DEBUG manager.OracleManager$ConnCache: Got cached connection for jdbc:oracle:thin:@192.xx.xx.xx:1521:orcl/dev
DEBUG manager.OracleManager$ConnCache: Caching released connection for jdbc:oracle:thin:@192.xx.xx.xx:1521:orcl/dev
Check getConnection and recycle methods for more details.
Each map task will get a DB connection, so in your case of 6 maps there will be 6 connections. Please visit github/sqoop to see how it was implemented.
-m specifies the number of mapper tasks that will run as part of the job.
So the more mappers there are, the more connections there will be.
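To show where -m fits in, a typical Oracle import might look like the following; the connection string, credentials, table, and split column are placeholders, not values from the question:
sqoop import \
  --connect jdbc:oracle:thin:@db.example.com:1521:orcl \
  --username scott \
  --password-file /user/scott/.password \
  --table EMPLOYEES \
  --split-by EMPLOYEE_ID \
  -m 6 \
  --verbose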
It probably depends on the manager, but I guess all of them are likely to create one connection per mapper. Take DirectPostgresSqlManager: it creates one connection per mapper through psql COPY TO STDOUT.
Please take a look at managers at
Sqoop Managers
I have been trying to implement UPDATE, INSERT, and DELETE operations on a Hive table as per the instructions. But whenever I include the configuration values required for INSERT, UPDATE, and DELETE:
hive.support.concurrency = true (default is false)
hive.enforce.bucketing = true (default is false)
hive.exec.dynamic.partition.mode = nonstrict (default is strict)
then show tables on the Hive shell takes 65.15 seconds, whereas it normally runs in 0.18 seconds without the above properties. Apart from show tables, the rest of the commands give no output, i.e. they keep running until I kill the process. Could you tell me the reason for this?
Hive is not an RDBMS. A query that ran for 2 minutes may run for 5 minutes under the same configuration; neither Hive nor Hadoop guarantees how long a query will take to execute. Also, please include whether you are running on a single-node or multi-node cluster, and the size of the data you are querying; the information you have provided is insufficient. Don't come to any conclusion based on query execution time alone, because many factors such as disk, CPU slots, network, and so on are involved in deciding a query's run time.
I am pretty new to Cassandra, so forgive me if I have some fundamental misunderstanding of the concept of keyspaces. What I am trying to do is set up a multi-datacenter ring across different regions, with data replication using NetworkTopologyStrategy and endpoint_snitch set to GossipingPropertyFileSnitch.
Hence, as explained in the docs, I need to set the replication strategy for a keyspace:
CREATE KEYSPACE "mykey"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 2, 'dc2' : 2};
I also read that in CQL I can do "USE mykey" to set the keyspace.
Would that then be persistently set in the Cassandra configuration? As far as I understand, each application client in a cluster uses its own keyspace, right? Hence I would need to set this in the application?
The examples only show how to create a keyspace and configure its replication strategy options. I think I managed to understand the basics behind it. What I am looking for are examples of how I would tell Cassandra to use a certain keyspace strategy (consistently and/or per application).
I dug some more into the Cassandra docs and think I got a better understanding of the use of keyspaces. Am I correct that, to tell Cassandra to use a certain keyspace, I can create the keyspace like so:
CREATE KEYSPACE "MyKey" WITH replication = {'class':
'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;
and then create tables in this keyspace like so:
CREATE TABLE "MyKey"."TableName" (
...
Would this make Cassandra always use the configured replication strategy of the "MyKey" keyspace for that table?
"As far as i understand each application client in a cluster uses its own keyspace right. Hence i would need to set this in the application??"
No. You can think of a keyspace as just a collection of tables, which all your users would access. You would really only create multiple keyspaces if you had dramatically different replication needs for some reason, or if you had a multi-tenant application that required it for security purposes.
"Would this make cassandra to always use the configured replication strategy in the "MyKey" keyspace for that table?"
Yes. The TableName table permanently lives in the MyKey keyspace and will inherit the properties of that keyspace.
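To make that concrete, either of the following reads from the same table (names taken from the question); the only difference is whether the session sets a default keyspace:
USE "MyKey";
SELECT * FROM "TableName";

-- or fully qualified, without USE:
SELECT * FROM "MyKey"."TableName";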
Once you set your replication factor, you don't typically change it. You can, but it requires a fairly I/O-intensive process in the background. The replication factor determines how many copies of a single piece of data live in a particular datacenter, and therefore how many nodes can fail before you have an outage. 3 is by far the most common setting here, but if you do not have 3 nodes in your datacenter, then a smaller number is fine.
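For illustration, changing the replication settings later would look something like the following (datacenter names reused from the question, and the new counts are arbitrary); the follow-up repair on each node is the I/O-intensive part mentioned above:
ALTER KEYSPACE "mykey"
WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3};
-- then run a full repair on each affected node (e.g. nodetool repair) so the
-- extra replicas are actually streamed into place.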