Situation:
I'm reading files from HDFS with Spark and creating around 1000 Hive tables in 30 minutes.
Requirement:
I need to have these 1000 tables in Oracle as well, as fast as possible.
My thoughts:
1. Load the same DataFrame to Hive and then to Oracle via JDBC within the same Spark application.
2. Load the data to Hive, then Sqoop the tables from Hive to Oracle (a sketch follows below).
Any other ideas? Basically, I need to replicate a whole Hive database with ~1000 tables to Oracle.
Any advice is greatly appreciated.
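For option 2, here is a minimal sketch of exporting a single Hive table to Oracle with sqoop export; the connection string, credentials, Oracle table name, warehouse path, and delimiter are placeholders/assumptions, and you would loop something like this over the ~1000 tables from a shell script:
sqoop export --connect 'jdbc:oracle:thin:@//oraclehost:1521/ORCL' --username someuser --password somepassword --table SOME_ORACLE_TABLE --export-dir /apps/hive/warehouse/somedb.db/some_table --input-fields-terminated-by '\001' --num-mappers 8
This assumes the Hive table is stored as delimited text with the default ^A field delimiter; the Oracle target tables must already exist before the export runs.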
I want to import a big table from an Oracle database to HDFS using Sqoop.
Since the table is huge and has a primary key, Sqoop can run multiple mappers in parallel.
I have a couple of questions:
1) If one mapper gets an exception due to a bad record in the Oracle database while the others are running fine, will the whole job fail, or will all mappers except the failed one still write their data to HDFS?
2) Is Sqoop intelligent enough to adjust the number of parallel mappers if we give the --m option?
If we give --m 4, will Sqoop increase the number of mappers based on the table size, or will it run with only 4?
Has anybody come across this kind of scenario?
Based on my knowledge:
If one mapper fails, the Sqoop process will try to kill the other mappers, but it won't delete the data already written to HDFS, so you may see partial data in your HDFS target location.
When you specify the number of mappers (using the -m x option), the job will use at most x mappers.
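For illustration, a sketch of pinning the mapper count explicitly (the connection details, table, and split column are placeholders): with --num-mappers 4 (equivalent to -m 4) Sqoop launches at most 4 mappers and splits the work on the primary key, or on the column given with --split-by.
sqoop import --connect 'jdbc:oracle:thin:@//oraclehost:1521/ORCL' --username someuser --password somepassword --table BIG_TABLE --split-by ID --num-mappers 4 --target-dir /data/big_table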
How can I import the last 3 days of incremental data from Oracle to HDFS using Sqoop?
Currently I have written a generic Sqoop command in a shell script to import data from multiple Oracle databases for multiple plants.
Can anyone help me write a Sqoop command that imports only the last 3 days of data?
In your Sqoop job you can issue SQL, so you can add a date condition to the WHERE clause of your SQL statement, assuming the table you are pulling from has a date column.
Example: SELECT column1, column2, ... FROM your_table WHERE your_date_column >= (CURRENT_DATE - 3);
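As a sketch (the connection string, table, and date column are placeholders), the same condition can be pushed to Oracle through Sqoop's --where option:
sqoop import --connect 'jdbc:oracle:thin:@//oraclehost:1521/ORCL' --username someuser --password somepassword --table PLANT_DATA --where "LAST_UPDATED >= CURRENT_DATE - 3" --target-dir /data/plant_data/last3days --num-mappers 4
If you use a free-form --query instead of --table, remember to include the $CONDITIONS token in the WHERE clause so Sqoop can split the work across mappers.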
Hi, I have a Hive table on HBase that has 200 GB of records.
I am running a simple Hive query to fetch 20 GB of records,
but it takes around 4 hours.
I cannot create partitions on the Hive table because it is backed by HBase.
Please suggest any ideas to improve performance.
This is my Hive query:
INSERT OVERWRITE LOCAL DIRECTORY '/hadoop/user/m6034690/FSDI/FundamentalAnalytic/FundamentalAnalytic_2014.txt'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
select * from hbase_table_FundamentalAnalytic where FilePartition='ThirdPartyPrivate' and FilePartitionDate='2014';
If you can use it, I think Apache Phoenix will speed things up.
https://phoenix.apache.org/faq.html
It is very simple and intuitive to use, and super fast.
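As a rough sketch (the underlying HBase table name, row key column, and column family/qualifier names are assumptions about your schema), you can map the existing HBase table into Phoenix as a view and query it with SQL, which lets Phoenix push the filters down to HBase:
CREATE VIEW "FundamentalAnalytic" (
  pk VARCHAR PRIMARY KEY,
  "cf"."FilePartition" VARCHAR,
  "cf"."FilePartitionDate" VARCHAR
);
SELECT * FROM "FundamentalAnalytic"
WHERE "FilePartition" = 'ThirdPartyPrivate' AND "FilePartitionDate" = '2014';
This works cleanly only if the HBase cell values are stored as strings (the default for Hive's HBase storage handler), since Phoenix has to be able to decode them.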
Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source is an MS SQL Server database of around 1.6 TB with around 25 tables.
The ultimate objective is to check whether the existing queries can run faster on HDP.
There isn't the luxury of time and availability to import the data several times, hence the import has to be done once and the Hive tables, queries, etc. experimented with afterwards: for example, first create a normal, partitioned table in ORC; if that doesn't suffice, try indexes and so on; possibly we will also evaluate the Parquet format.
As a solution to the constraint above, I decided to first import the tables onto HDFS in Avro format, for example:
sqoop import --connect 'jdbc:sqlserver://server;database=dbname' --username someuser --password somepassword --as-avrodatafile --num-mappers 8 --table tablename --warehouse-dir /dataload/tohdfs/ --verbose
Now I plan to create a Hive table, but I have some questions, mentioned here.
My question is: given all the points above, what is the safest approach (in terms of time and of NOT messing up HDFS, etc.)? Should I first bring the data onto HDFS, create Hive tables on top, and experiment, or import directly into Hive? (I don't know whether, if I later delete these tables and want to start afresh, I would have to re-import the data.)
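For what it's worth, a sketch of the first approach: create an external Hive table on top of the Avro files that are already on HDFS, so that dropping the table later only removes metadata and never deletes the data. The location and schema path below are assumptions; Sqoop generates an .avsc schema file during the Avro import, which you would copy to HDFS and reference here:
CREATE EXTERNAL TABLE tablename
STORED AS AVRO
LOCATION '/dataload/tohdfs/tablename'
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/tablename.avsc');
With an external table you can drop and recreate the Hive definitions and experiment as often as you like without re-importing from SQL Server.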
For loading, you can try these options:
1) You can do an export from the source database to CSV files stored on your Linux file system as a backup, then do a distcp to HDFS.
2) As mentioned, you can do a Sqoop import and load the data into a Hive table (parent_table).
For checking the performance of different formats and partitioned tables, you can use CTAS (Create Table As Select) queries, where you create new tables from the base table (parent_table). In CTAS you can specify the format, such as Parquet or Avro, and partitioning options are also available (see the sketch below).
Even if you delete the new tables created via CTAS, the base table will still be there.
Based on my experience, Parquet + partitioning gives the best performance, but it also depends on your data.
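A minimal sketch of the CTAS approach (table and column names are placeholders). Note that older Hive versions cannot create a partitioned table directly with CTAS, so the partitioned variant below uses a separate CREATE TABLE plus INSERT with dynamic partitioning:
-- format comparison via CTAS
CREATE TABLE parent_table_parquet STORED AS PARQUET AS SELECT * FROM parent_table;
CREATE TABLE parent_table_orc STORED AS ORC AS SELECT * FROM parent_table;
-- partitioned variant
CREATE TABLE parent_table_part (col1 STRING, col2 INT)
PARTITIONED BY (load_date STRING)
STORED AS PARQUET;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE parent_table_part PARTITION (load_date)
SELECT col1, col2, load_date FROM parent_table;
Dropping any of these derived tables leaves parent_table untouched.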
I see that the connection and settings are all correct, but I didn't see --fetch-size in the query. By default --fetch-size is 1000, which would take forever in your case. If the number of columns is small, I would recommend increasing it to --fetch-size 10000. I have gone up to 50000 when the number of columns is less than 50, and maybe 20000 if you have around 100 columns. I would recommend checking the size of the data per row and then deciding: if there is one column holding more than 1 MB of data, I would not recommend going above 1000.
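For example, a sketch with an increased fetch size (the connection details and table name are placeholders):
sqoop import --connect 'jdbc:oracle:thin:@//oraclehost:1521/ORCL' --username someuser --password somepassword --table WIDE_TABLE --fetch-size 10000 --num-mappers 8 --target-dir /data/wide_table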