Multiple inserts into a table using Apache Spark (Hadoop)

I am working on a project and I am stuck on the following scenario.
I have a table: superMerge(id, name, salary)
and I have 2 other tables: table1 and table2.
All the tables (table1, table2 and superMerge) have the same structure.
Now, my challenge is to insert/update the superMerge table from table1 and table2.
table1 is updated every 10 minutes and table2 every 20 minutes, so at time t=20 minutes I have 2 jobs trying to update the same table (superMerge in this case).
I want to understand how I can achieve this parallel insert/update/merge into the superMerge table using Spark or any other Hadoop application.

The problem here is that the two jobs can't communicate with each other; neither knows what the other is doing. A relatively easy solution would be to implement a basic file-based "locking" system:
Each job creates an (empty) file in a specific folder on HDFS indicating that the update/insert is in progress, and removes that file when the job is done.
Now, each job has to check whether such a file exists prior to starting the update/insert. If it exists, the job must wait until the file is gone.
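A rough sketch of that lock in Scala with the Hadoop FileSystem API (the path /locks/superMerge.lock is just a made-up example); createNewFile does the existence check and the creation in one atomic step, so both jobs can share this snippet:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val lock = new Path("/locks/superMerge.lock")   // hypothetical lock location shared by both jobs

// wait until no other job holds the lock, then create the lock file atomically
while (!fs.createNewFile(lock)) {               // returns false if the file already exists
  Thread.sleep(30 * 1000)                       // poll every 30 seconds
}
try {
  // ... do the insert/update of superMerge here ...
} finally {
  fs.delete(lock, false)                        // always remove the lock file, even on failure
}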

Can you control the code of job1 and job2? How do you schedule them?
In general you can convert those two jobs into one that runs every 10 minutes. Once every 20 minutes this unified job runs in a different mode (merging from 2 tables), while the default mode is to merge from one table only.
So when you have the same driver you don't need any synchronisation between the two jobs (e.g. locking). This solution assumes that the jobs finish in under 10 minutes.
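A minimal sketch of that unified driver, assuming it is scheduled every 10 minutes and keeps (or derives from the clock) a run counter; mergeIntoSuperMerge is a hypothetical wrapper around whatever merge logic you already have:

def runOnce(runCounter: Long, mergeIntoSuperMerge: Seq[String] => Unit): Unit = {
  val sources =
    if (runCounter % 2 == 0) Seq("table1", "table2")   // every 20 minutes: merge from both tables
    else Seq("table1")                                 // default mode: merge from table1 only
  mergeIntoSuperMerge(sources)
}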

How large is your dataset? Are you planning to do it in batch (Spark) or could you stream your inserts/updates (Spark Streaming)?
Let's assume you want to do it in batch:
Launch only one job every 10 minutes that can process the two tables: if you have table1 and table2, do a union and join the result with superMerge, as Igor Berman suggested.
Be careful: as your superMerge table gets bigger, the join will take longer.
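Something along these lines, assuming Spark 2.x, Hive tables named table1, table2 and super_merge, and that an incoming row should win over the existing one (all of this is guesswork, adjust to your setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("superMergeBatch").enableHiveSupport().getOrCreate()

// same structure per the question; if an id can appear in both tables, deduplicate first
val incoming = spark.table("table1").union(spark.table("table2"))

// a full outer join acts as the merge: take the incoming value when present, else keep the old one
val merged = spark.table("super_merge").as("sm")
  .join(incoming.as("inc"), Seq("id"), "full_outer")
  .selectExpr(
    "id",
    "coalesce(inc.name, sm.name) as name",
    "coalesce(inc.salary, sm.salary) as salary")

// write to a staging table first and swap it in afterwards, since super_merge is also read above
merged.write.mode("overwrite").saveAsTable("super_merge_staging")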

I faced this situation: write the table1 DataFrame df1 to location1 and the table2 DataFrame df2 to location2, and at the end just switch the paths into the super merge table. You can also do a table-to-table insert, but that consumes a lot of runtime, especially in Hive.
Overwriting the staging locations location1 and location2:
df1.write.mode("overwrite").partitionBy("partition").parquet(location1)
df2.write.mode("overwrite").partitionBy("partition").parquet(location2)
Switching the paths into the super merge table:
hiveContext.sql("ALTER TABLE super_merge_table ADD IF NOT EXISTS PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location1/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
hiveContext.sql("LOAD DATA INPATH 'location2/partition=x/' INTO TABLE super_merge_table PARTITION (partition=x)")
You can do the parallel merging without one overwriting the other.

Related

SQL query suddenly runs slowly until base table's stats are gathered

We have a scheduled job that loads table A from multiple tables twice a day. During that procedure, there is one separate select statement that gets two columns from table B. That is:
select column1, column2
into l_col1, l_col2
from Table B
where ....
The job starts and goes through several selects in the procedure, but when it reaches the select above it halts. In a session window we can see that the select above is holding the session, and when we gather stats for table B the problem is solved and the query runs in seconds.
However, this problem appears again when the job runs for the second time in the day. The same thing happens even though table B has not changed and no transactions have been made against it. We gather statistics again to speed up the query. We tried adding a job to gather stats for table B before the main job starts, but it did not help. Does anybody know why this happens?

Oozie workflow: how to keep most recent 30 days in the table

I am trying to build a Hive table and automate it through Oozie. Data in the table should not be older than the last 30 days.
The action in the workflow would run every day. It will first purge data older than 30 days and then insert data for today: a sliding window with a 30-day interval.
Can someone show an example of how to achieve this?
Hive stores data in HDFS files, and these files are immutable.
Well, actually, with recent Hadoop releases HDFS files can be appended to, or even truncated, but with a low-level API, and Hive has no generic way to modify data files for text/AVRO/Parquet/ORC/whatever format, so for practical purposes HDFS files are immutable for Hive.
One workaround is to use transactional ORC tables that create/rewrite an entire data file on each "transaction" -- requiring a background process for periodic compaction of the resulting mess (e.g. another step of rewriting small files into bigger files).
Another workaround would be an ad hoc batch rewrite of your table whenever you want to get rid of older data -- e.g. every 2 weeks, run a batch that removes data older than 30 days.
> simple design
- make sure that you will have no INSERT or SELECT running until the purge is over
- create a new partitioned table with the same structure plus a dummy partitioning column
- copy all the data-to-keep into that dummy partition
- run an EXCHANGE PARTITION command
- drop the partitioned table
- now the older data is gone, and you can resume INSERTs
> alternative design, allows INSERTs while purge is running
- rebuild your table with a dummy partitioning key, and make sure that all INSERTs always go into the "current" partition
- at purge time, rename the "current" partition as "to_be_purged" (and make sure that you run no SELECT until the purge is over, otherwise you may get duplicates)
- copy all the data-to-keep from "to_be_purged" to "current"
- drop partition "to_be_purged"
- now the older data is gone
But it would be so much simpler if your table was partitioned by month, in ISO format (i.e. YYYY-MM). In that case you could just get the list of partitions and drop all that have a key "older" than (current month -1), with a plain bash script. Believe me, it's simple and rock-solid.
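The answer above means a plain bash script over the Hive CLI; purely as an illustration, the same idea driven from Spark SQL could look like the sketch below (my_table and the month partition column are placeholders, and the partition values are assumed to be YYYY-MM strings):

import java.time.YearMonth
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
val cutoff = YearMonth.now().minusMonths(1).toString     // keep the current and the previous month

spark.sql("SHOW PARTITIONS my_table").collect()
  .map(_.getString(0).stripPrefix("month="))             // partition rows look like "month=2018-01"
  .filter(_ < cutoff)                                    // ISO YYYY-MM sorts lexicographically
  .foreach(m => spark.sql(s"ALTER TABLE my_table DROP IF EXISTS PARTITION (month='$m')"))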
As already answered in How to delete and update a record in Hive (by user ashtonium), Hive version 0.14 is available with ACID support. So, create an .hql script and use a simple DELETE with a WHERE condition (for the current date use unix_timestamp()) plus an INSERT. The INSERT should be used in a bulk fashion, not an OLTP one.

How does loading distributed data in Hive work?

My goal is to perform a SELECT query using Hive.
When I have small data on a single machine (the namenode), I start by:
1- Creating a table that contains this data: create table table1 (col1 int, col2 string);
2- Loading the data from a file path: load data local inpath 'path' into table table1;
3- Performing my SELECT query: select * from table1 where col1 > 0;
Now I have huge data, of 10 million rows, that doesn't fit on a single machine. Let's assume Hadoop divides my data across, for example, 10 datanodes and each datanode contains 1 million rows.
Retrieving the data to a single computer is impossible due to its huge size, or would take a lot of time if it were possible.
Will Hive create a table at each datanode and perform the SELECT query there,
or will Hive move all the data to one location (datanode) and create one table (which would be inefficient)?
Ok, so I will walk through what happens when you load data into Hive.
The 10 million line file will be cut into 64MB/128MB blocks.
Hadoop, not Hive, will distribute the blocks to the different slave nodes on the cluster.
These blocks will be replicated several times. Default is 3.
Each slave node will contain different blocks that make up the original file, but no machine will contain every block. However, since Hadoop replicates the blocks there must be at least enough empty space on the cluster to accommodate 3x the file size.
When the data is in the cluster Hive will project the table onto the data. The query will be run on the machines Hadoop chooses to work on the blocks that make up the file.
10 million rows isn't that large though. Unless the table has 100 columns you should be fine in any case. However, if you were to do a select * in your query just remember that all that data needs to be sent to the machine that ran the query. That could take a long time depending on file size.
I hope I covered your question. If not please let me know and I'll try to help further.
The query
select * from table1 where col1>0
is just a map-side job, so each data block is processed locally on its node. There is no need to collect data centrally.

Not able to apply dynamic partitioning for a huge data set in Hive

I have a table test_details with some 4 million records. Using the data in this table, I have to create a new partitioned table test_details_par with records partitioned on visit_date. Creating the table is not a challenge, but when I come to the part where I have to INSERT the data using dynamic partitions, Hive gives up when I try to insert data for a larger number of days. If I do it for 2 or 3 days the MapReduce job runs successfully, but for more days it fails with a Java heap space error or a GC error.
A simplified snapshot of my DDL is as follows:
CREATE TABLE test_details_par (visit_id INT, store_id SMALLINT) PARTITIONED BY (visit_date DATE);
INSERT INTO TABLE test_details_par PARTITION(visit_date) SELECT visit_id, store_id, visit_date FROM test_details DISTRIBUTE BY visit_date;
I have tried setting these parameters, so that Hive executes my job in a better way:
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set hive.exec.max.dynamic.partitions.pernode=10000;
Is there anything that I am missing to run the INSERT for a complete batch without specifying the dates explicitly?
Neels,
Hive 12 and below have well-known scalability issues with dynamic partitioning that will be addressed in Hive 13. The problem is that Hive attempts to hold a file handle open for each and every partition it writes out, which causes out-of-memory errors and crashes. Hive 13 will sort by partition key so that it only needs to hold one file open at a time.
You have 3 options as I see it:
Change your job to insert only a few partitions at a time (see the sketch after this list).
Wait for Hive 13 to be released and try that (2-3 months to wait).
If you know how, build Hive from trunk and use it to complete your data load.
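For option 1, a hedged sketch of inserting only a bounded range of visit_date values per run; it is driven from Spark SQL here purely for illustration (the same statements can be run from the Hive CLI), and the weekly date ranges are made up:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")

// assumed weekly chunks; pick ranges that keep the number of partitions written per job small
val chunks = Seq(("2014-01-01", "2014-01-07"), ("2014-01-08", "2014-01-14"))

chunks.foreach { case (from, to) =>
  spark.sql(
    s"INSERT INTO TABLE test_details_par PARTITION (visit_date) " +
    s"SELECT visit_id, store_id, visit_date FROM test_details " +
    s"WHERE visit_date BETWEEN '$from' AND '$to' DISTRIBUTE BY visit_date")
}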

Showing HBase data in JSP takes several minutes. Please advise

I am accessing HBase data from a JSP page using Hive queries. Since HBase can store huge amounts of data, on the order of terabytes, the Hive query (which is converted into MapReduce tasks) can take several minutes. Should the JSP page then wait, say, 10 minutes to display the data? What should the strategy be? Is this the correct approach, and if not, what is the best approach to show huge HBase data in a JSP page?
Hive, or any Hadoop map-reduce system for that matter, is designed for offline batch processing. Submitting Hive queries from JSP and waiting an arbitrary amount of time for the data to be ready and shown on the front end is a definite no-no. If the cluster is super busy, your jobs might not even be scheduled within the specified time.
What exactly do you want to show from HBase on the front end?
If it is a set of rows from a table and you know what the rows are (meaning you have the row key or your application can compute it at run time), just fetch those rows from HBase and display them.
If you have to do some SQL-like operations (joins/selects etc.), then, as I guess you realize, HBase is a NoSQL system and you are supposed to do these operations in the application and then fetch the appropriate rows using the row key.
For example: if you have 2 HBase tables, say Dept (dept id as the row key and a string column employees with a comma-separated list of empIds) and Employee (emp id as the row key and columns Name, Age, Salary), then to find the employee with the highest salary in a dept you have to:
a. Fetch the row from the Dept table (using the dept id).
b. Iterate over the list of empIds from the employees column.
c. In each iteration, fetch the row from the Employee table (by empId row key)
and find the max.
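A rough sketch of steps a-c against the HBase 1.x client API; the table names come from the example above, while the column family "cf", the qualifier names and storing the salary as a string are all assumptions:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
val dept = conn.getTable(TableName.valueOf("Dept"))
val emp  = conn.getTable(TableName.valueOf("Employee"))

// a. fetch the Dept row by its row key ("dept42" is a made-up dept id)
val deptRow = dept.get(new Get(Bytes.toBytes("dept42")))
val empIds  = Bytes.toString(deptRow.getValue(Bytes.toBytes("cf"), Bytes.toBytes("employees"))).split(",")

// b + c. iterate over the empIds, fetch each Employee row and keep the max salary
val (bestEmpId, bestSalary) = empIds.map(_.trim).map { id =>
  val row = emp.get(new Get(Bytes.toBytes(id)))
  id -> Bytes.toString(row.getValue(Bytes.toBytes("cf"), Bytes.toBytes("Salary"))).toDouble
}.maxBy(_._2)

conn.close()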
Yes, HBase can handle TBs of data, but you'll almost never have to show that much data on the front end using JSP. I'm guessing you'll most likely be interested in only a portion of the data, even though the backing HBase table is much bigger.
