Adding nodes (leaves or aggregators) to a MemSQL cluster is straightforward: I edited memsql_cluster.json and reran memsql-cluster setup. The problem is adding partitions to an existing table. The goal is to scale up: I need to add more rows, but I have exhausted the available memory in the original cluster.
I tried, for example:
mysql> create partition DMP:32 on 'ec2-X-Y-Z.compute-1.amazonaws.com':3306;
ERROR 1773 (HY000): Partition ordinal 32 is out of bounds. It must be in [0, 32).
mysql>
Reading the MemSQL docs, I could not find any DDL option to change the number of partitions. I would prefer not to drop and recreate these tables. Any ideas on how to do it?
Thanks!
You cannot add more rows to an in-memory database when memory is already exhausted. That said, you can scale out (i.e. add more leaf nodes).
You can add more leaf nodes to your MemSQL cluster, then run REBALANCE PARTITIONS to distribute your existing partitions evenly across the larger cluster. Each partition can then consume more memory, letting you scale out.
If you just wanted to add more partitions, you can use mysqldump to export your MemSQL schema and data, recreate the database with more partitions, then load the schema and data back into your database that now has more partitions.
Learn more about rebalance_partitions here:
http://docs.memsql.com/docs/latest/ref/REBALANCE_PARTITIONS.html
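The scale-out sequence looks roughly like this, run on the master aggregator (a sketch; the hostname is a placeholder, and the exact ADD LEAF syntax may vary by MemSQL version):

```sql
-- Register a newly provisioned leaf node with the cluster (placeholder host)
ADD LEAF 'root'@'ec2-new-leaf.compute-1.amazonaws.com':3306;

-- Redistribute the database's existing partitions across all leaves, old and new
REBALANCE PARTITIONS ON DMP;
```

After the rebalance completes, each leaf holds fewer partitions, so every partition has more memory headroom on its host.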
In order to re-partition the data, you currently have to (1) export the schema and (2) the data from your database (partitions are set at the database level), then (3) recreate the database and (4) reload everything back in.
You can dump the database schema and tables to local files using mysqldump. It's best to run mysqldump twice: once to dump the schema and once to dump the data.
This command will create a file in the local directory containing the schemas for all tables in the DB:
mysqldump -h 127.0.0.1 -u root -B <db_name> --no-data > schema.sql
If you have a large database and not enough local disk to dump all the data at once, you can dump a few tables per command. Use a different filename for each run (e.g. data1.sql, data2.sql, etc.) and pass one or more table names as arguments. I would dump the smaller tables together in one statement and dump any giant tables separately. Use this command:
mysqldump -h 127.0.0.1 -u root -B <db_name> --no-create-info --tables <table1> <table2> <table3> > data1.sql
Once the schema and all tables have been dumped, you can recreate the DB with more partitions. We generally recommend #partitions = #leaves * #cores per leaf (e.g. 4 leaves with 8 cores each gives 32 partitions). Use this command:
CREATE DATABASE <db_name> PARTITIONS = <#_total_partitions>;
Confirm the number of partitions with "SHOW PARTITIONS ON <db_name>;".
Insert the schema and data to the new database with these commands from the shell prompt:
mysql -u root -h 127.0.0.1 < schema.sql
mysql -u root -h 127.0.0.1 < data1.sql
The schema will take a few minutes to load. The time for the data to load will depend on the size of the dataset. As each table completes, its rows are committed and queries can run against it.
Related
I have a directory in HDFS. Every day, one processed file with a date-time stamp in its name is placed in that directory. If I create an external table on top of that directory location, does the external table refresh itself when each day's file arrives in the directory?
Whether you add files into a table directory or a partition directory, and whether the table is external or managed, the data becomes accessible to queries immediately. You do not need any additional steps to make the data available; no refresh is necessary.
A Hive table/partition is metadata (DDL, location, statistics, access permissions, etc.) plus data files in that location. The data itself is stored in the table/partition location in HDFS.
Only if you create a new directory for a partition that does not exist yet do you need to execute ALTER TABLE ... ADD PARTITION ... LOCATION '<new location>' or the MSCK REPAIR TABLE command. The equivalent command in Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS.
If you add files into already created table/partition locations, no refresh is necessary.
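For the new-directory case, the commands look like this (a sketch with illustrative table and path names):

```sql
-- Register a partition directory that was created outside of Hive
ALTER TABLE site_events ADD PARTITION (dt='2017-01-01')
LOCATION '/data/site_events/dt=2017-01-01';

-- Or let Hive discover all unregistered partition directories at once
MSCK REPAIR TABLE site_events;
```

MSCK REPAIR is convenient when many partition directories appear at once, but ADD PARTITION is cheaper for a single known directory.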
The CBO can answer some queries from statistics alone, without reading the data files; this works only for simple queries such as count(*) and max().
If you are relying on CBO statistics for such queries, you may need to refresh them using ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS. See this answer for more details: https://stackoverflow.com/a/39914232/2700344
If you do not need statistics and want the table location to be scanned every time you query it, switch this off: set hive.compute.query.using.stats=false;
I want to apply an archive-and-purge mechanism to Hive tables, covering both internal and external tables, and both partitioned and non-partitioned ones.
I have a site_visitors table, partitioned by visit_date.
I want to archive the site_visitors data for users who have not visited my site in the last year. At the same time, I don't want to keep this archived data in the same table directory; it can live in some other specific location.
You can handle this at the partition level in the HDFS directory; below is one way to achieve it.
Your internal/main table sits on top of HDFS, and its directory will look something like this:
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-01
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-02
hdfs://namenode/user/hive/warehouse/schema.db/site_visitors/visit_date=2017-01-03
You can create an archive table on top of HDFS, or, if you just want to archive the data, you can move the partitions to another location in HDFS. Either way, your archive location will look something like this:
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-01
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-02
hdfs://namenode/hdfs_location/site_visitors/visit_date=2017-01-03
You can run a shell script, or a script in any other language used in your environment, to move the files from the main HDFS location to the archive HDFS location based on the partition dates.
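A minimal sketch of such a move script, assuming partition directories named visit_date=YYYY-MM-DD; the source/archive paths and the partition list are illustrative placeholders, and the script only prints the hdfs dfs -mv commands it would run rather than executing them:

```shell
#!/bin/bash
# Sketch: print the hdfs commands that would move partitions older than one
# year from the main table location to an archive location.
# SRC, DST and the hard-coded partition list are illustrative assumptions.
SRC=hdfs://namenode/user/hive/warehouse/schema.db/site_visitors
DST=hdfs://namenode/hdfs_location/site_visitors

cutoff=$(date --date="365 days ago" +%Y-%m-%d)
cutoff_num=$(echo "$cutoff" | tr -d -)   # YYYYMMDD for numeric comparison

moved=0
for part in visit_date=2017-01-01 visit_date=2017-01-02 visit_date=2017-01-03; do
    part_date=${part#visit_date=}
    part_num=$(echo "$part_date" | tr -d -)
    # Compare dates numerically as YYYYMMDD
    if [ "$part_num" -lt "$cutoff_num" ]; then
        echo "hdfs dfs -mv ${SRC}/${part} ${DST}/${part}"
        moved=$((moved + 1))
    fi
done
echo "partitions to move: $moved"
```

In a real run you would list the partition directories with hdfs dfs -ls instead of hard-coding them, and execute the mv commands instead of echoing them.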
Alternatively, you can use the approach below: load the data into an archive table, then drop the corresponding partitions from the original table.
#!/bin/bash
ARCHIVE=$1
now=$(date +%Y-%m-%d)
StartDate=$now
# archive_dt is derived from the ARCHIVE offset (in days) and is used for the load and the partition drops
archive_dt=$(date --date="${now} - ${ARCHIVE} day" +%Y-%m-%d)
EndDate=$archive_dt
# You can use hive, beeline or impala-shell to insert the data into the archive table; beeline is used in this example
beeline -u ${CONN_URL} -e "insert into table ${SCHEMA}.archive_table partition (visit_date) select * from ${SCHEMA}.${TABLE_NAME} where visit_date < '${archive_dt}'"
# After the data has been loaded into the archive table, drop the corresponding partitions from the original table
beeline -u ${CONN_URL} -e "ALTER TABLE ${SCHEMA}.main_table DROP PARTITION (visit_date < '${archive_dt}')"
# Repair the tables to sync the metadata after the alterations
beeline -u ${CONN_URL} -e "MSCK REPAIR TABLE ${SCHEMA}.main_table; MSCK REPAIR TABLE archiveSchema.archive_table"
I have a use case that requires around 200 Hive Parquet tables.
I need to load these Parquet tables from flat text files, but a Parquet table cannot be loaded directly from a flat text file.
So I am using the following approach:
Created a temporary managed text table.
Loaded the temp table with the text data.
Created an external Parquet table.
Loaded the Parquet table from the text table using a SELECT query.
Dropped the text files for the temporary text table (but kept the table in the metastore).
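In Hive DDL, the steps above look roughly like this for a single table (a sketch; the table names, columns and paths are placeholders, not the actual production schema):

```sql
-- 1. Temporary managed text table
CREATE TABLE tmp_txt_table1 (id INT, payload STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- 2. Load the flat text file into it
LOAD DATA INPATH '/landing/table1.txt' OVERWRITE INTO TABLE tmp_txt_table1;

-- 3. External Parquet table
CREATE EXTERNAL TABLE parquet_table1 (id INT, payload STRING)
STORED AS PARQUET
LOCATION '/warehouse/parquet_table1';

-- 4. Populate it from the text table
INSERT OVERWRITE TABLE parquet_table1 SELECT * FROM tmp_txt_table1;

-- 5. In the second approach, drop the temporary table entirely
DROP TABLE tmp_txt_table1;
```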
This approach keeps temporary metadata (for 200 tables) in the metastore. My second approach is to drop the temporary text tables too, along with the text files from HDFS, then re-create the temporary tables next time and delete them once the Parquet tables are created.
Now, since I need to follow the above steps for all 200 tables every 2 hours, will creating and deleting tables in the metastore impact anything in the cluster in production?
Which approach can impact production: keeping temporary metadata in the metastore, or repeatedly creating and deleting tables (metadata) in the Hive metastore?
Which approach can impact production, keeping temporary metadata in metastore, creating and deleting tables (metadata) from hive metastore?
There is no impact; the backend of the Hive metastore should easily handle 200 × n changes per hour. If you're unsure, start with 50 tables and monitor the backend DB's performance.
Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source is an MS SQL database of around 1.6 TB and around 25 tables.
The ultimate objective is to check if the existing queries can run faster on the HDP
There isn't the luxury of time and availability to import the data several times, so the import has to be done once and the Hive tables, queries, etc. experimented with afterwards. For example, first create a normal, partitioned table in ORC; if that doesn't suffice, try indexes, and so on. Possibly we will also evaluate the Parquet format.
As a solution, I decided to first import the tables onto HDFS in Avro format, for example:
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
Now I plan to create a Hive table but I have some questions mentioned here.
My question is: given all the points above, what is the safest approach (in terms of time, and of NOT messing up HDFS, etc.): first bring the data onto HDFS, create Hive tables, and experiment, or import directly into Hive? (If I later delete these tables and want to start afresh, do I have to re-import the data?)
For loading, you can try these options:
1) You can do a database export to CSV files stored on your Linux file system as a backup, then do a distcp to HDFS.
2) As mentioned, you can do a Sqoop import and load the data into a Hive table (parent_table).
To check the performance of different formats and of partitioned tables, you can use CTAS (Create Table As Select) queries to create new tables from the base table (parent_table). In CTAS you can specify the format, such as Parquet or Avro, and partitioning options are also available.
Even if you delete the new tables created by CTAS, the base table remains.
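A CTAS experiment might look like this (a sketch; parent_table is the Sqoop-loaded base table mentioned above, and the other table names are assumed):

```sql
-- ORC copy of the base table for format comparison
CREATE TABLE parent_orc STORED AS ORC
AS SELECT * FROM parent_table;

-- Parquet copy for the same comparison
CREATE TABLE parent_parquet STORED AS PARQUET
AS SELECT * FROM parent_table;
```

You can then run the same queries against each copy and compare timings, dropping the copies afterwards without touching parent_table.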
Based on my experience, Parquet plus partitioning gives the best performance, but it also depends on your data.
I see that the connection and settings are all correct, but I don't see --fetch-size in the query. By default --fetch-size is 1000, which would take forever in your case. If the number of columns is small, I would recommend increasing --fetch-size to 10000; I have gone up to 50000 when there are fewer than 50 columns, and maybe 20000 with around 100 columns. I would recommend checking the size of the data per row and then deciding: if any single column holds more than 1 MB of data, I would not recommend going above 1000.
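Applied to the sqoop command from the question, the flag would be added like this (a sketch reusing the question's placeholder connection details):

```
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --fetch-size 10000 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
```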
I am trying to compare the same functionality on my PostgreSQL data warehouse and a newly created Hive data warehouse on the same box, with the same data and the same table structure. I am trying to understand the benefits of Hive, but... although loading data into PostgreSQL runs 3 times slower, index creation/rebuild on PostgreSQL is 20 times faster, and the PostgreSQL index doesn't need to be rebuilt every time as it does in Hive.
My question is: what I am missing in Hive configuration?
My setup is:
CREATE TABLE mytable
(
aa int,
bb string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/spaces/hadoop/hadoopfs';
LOAD DATA LOCAL INPATH '/data/Informix94/spaces/postgres/myfile_big' OVERWRITE INTO TABLE mytable;
CREATE INDEX mytable_indx ON TABLE mytable(aa) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD LOCATION '/data/spaces/hadoop/hadoopfs';
set hive.optimize.autoindex=true;
set hive.optimize.index.filter=true;
alter index mytable_indx ON mytable rebuild;
My box is a VM with 3 GB of RAM, with PostgreSQL running on it and taking ~1 GB; it serves as the metadata store. I am using the most recent stable versions of CentOS, Hadoop and Hive, and didn't change Hive's default settings except for the metadata store location and disabling statistics.
The result:
the index rebuild takes 4798 seconds on 260,000,000 rows, or 80 seconds on 5,000,000 rows.
Hive only works well when your data no longer fits on a single machine, so the results you are seeing are expected. Once you've collected terabytes or petabytes of data, you'll be much happier with Hive. For the use case you describe, PostgreSQL is a much better match.