How to handle volatile records efficiently - hadoop

I have this problem and I wanted to run it by others to see if I can handle it in a better way.
We have a 300-node cluster and we process transaction information/records on a daily basis. We could get ~10 million transactions each day, and each record is ~2 KB.
We currently use HDFS for data storage, and Pig and Hive for data processing. We mostly use external Hive tables, partitioned by transaction created date.
The business is such that we might get an update on a transaction that was created months or years before. For example, I might get an update to a transaction created 5 years back. We can't ignore this record, but that means reprocessing the whole corresponding partition again just for a single record.
On a daily basis we end up processing ~1000 partitions because of this. There are further ETL applications that use these transaction tables.
I understand that this is a limitation of the Hive/HDFS architecture.
I am sure that others have had this problem; it would be really helpful if you could share the options you have tried and how you overcame this.

You do not have to overwrite partitions: you can simply insert into them. Just do not include the OVERWRITE keyword in your insert statements.
Following is an example of a table partitioned by date, in which I did an insert (without OVERWRITE!) twice - and you can see the records are there, twice! That shows the partition got appended to, not overwritten/dropped.
insert into table insert_test select ..   -- do not put OVERWRITE here!
hive> select * from insert_test;
OK
name date
row1 2014-03-05
row2 2014-03-05
row1 2014-03-05
row2 2014-03-05
row3 2014-03-06
row4 2014-03-06
row3 2014-03-06
row4 2014-03-06
row5 2014-03-07
row5 2014-03-07
row6 2014-03-09
row7 2014-03-09
row6 2014-03-09
row7 2014-03-09
row8 2014-03-16
row8 2014-03-16
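For the original question, the same idea applies to the late-arriving transactions: append only the corrected record into its historical partition instead of reprocessing the whole partition. A minimal sketch, assuming a hypothetical transactions table partitioned by created_date and a staging table late_updates holding the corrected records (dynamic partitioning routes each row to the right partition):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- append late records into whichever partitions they belong to;
-- the partition column must come last in the select list
insert into table transactions partition (created_date)
select txn_id, txn_amount, txn_status, created_date
from late_updates;
Since a partition then contains both the old and the corrected row, downstream ETL has to pick the latest version of each transaction (for example, by a last-updated timestamp).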

Related

hive not taking values

I am trying to import a file into Hive.
The sample data is as follows:
1::Toy Story (1995)::Animation|Children's|Comedy
2::Jumanji (1995)::Adventure|Children's|Fantasy
3::Grumpier Old Men (1995)::Comedy|Romance
4::Waiting to Exhale (1995)::Comedy|Drama
My table declaration is as follows:
create table movies(id int,title string,genre string) row format delimited fields terminated by '::';
But after loading the data, my table shows data for the first two fields only.
Total MapReduce CPU Time Spent: 1 seconds 600 msec
OK
1 Toy Story (1995)
2 Jumanji (1995)
3 Grumpier Old Men (1995)
4 Waiting to Exhale (1995)
Time taken: 22.087 seconds
Can anyone help me with why this is happening, or with how to debug this?
Hive's default delimited row format only supports a single-character field delimiter; since your delimiter is two characters ('::'), please try creating the table with MultiDelimitSerDe.
Query:
CREATE TABLE movies (id int,title string,genre string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="::")
STORED AS TEXTFILE;
Output:
hive> select * from movies;
OK
1 Toy Story (1995) Animation|Children's|Comedy
2 Jumanji (1995) Adventure|Children's|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
Please refer to this similar post:
Load data into Hive with custom delimiter
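If MultiDelimitSerDe is not available in your Hive build, a RegexSerDe table is one possible alternative (a sketch; note that RegexSerDe requires every column to be declared as string, so id would need a cast downstream):
CREATE TABLE movies (id string, title string, genre string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ("input.regex" = "(.*?)::(.*?)::(.*)")
STORED AS TEXTFILE;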

How to speed up the creation of a table from table partitioned by date?

I have a table with a huge amount of data. It is partitioned by week. This table contains a column named group. Each group could have multiple records of weeks. For example:
gr week data
1 1 10
1 2 13
1 3 5
. . 6
2 2 14
2 3 55
. . .
I want to create a table based on one group. The creation currently takes ~23 minutes on Oracle 11g. That is a long time, since I have to repeat the process for each group and I have many groups. What is the fastest way to create these tables?
Create all tables then use INSERT ALL WHEN
http://docs.oracle.com/cd/B19306_01/server.102/b14200/statements_9014.htm#i2145081
The data will be read only once.
insert all
when gr=1 then
into tab1 values (gr, week, data)
when gr=2 then
into tab2 values (gr, week, data)
when gr=3 then
into tab3 values (gr, week, data)
select *
from big_table
The biggest speedup you could get is not copying the data out group by group at all, but processing it week by week directly; you don't say what you ultimately need to produce, so it is not possible to comment further (this approach may of course be difficult or impracticable, but you should at least consider it).
Therefore, below are some hints on how to extract the group data:
remove all indexes from the target tables, as they only take up space here - all you need is one large FULL TABLE SCAN of the source
check the available space and the size of each group; maybe you can process several groups in one pass
deploy parallel query
create table tmp as
select /*+ parallel(4) */ * from BIG_TABLE
where group_id in (..list of groupIds..);
Please note that parallel query must be enabled in the database; ask your DBA if you are unsure. The point is that the large FULL TABLE SCAN is performed by several sub-processes (here 4), which may (depending on your system) cut the elapsed time.
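If the extracted tables do not have to be recoverable through the redo log, a CREATE TABLE ... AS SELECT with NOLOGGING and a PARALLEL clause may shave off more time. A sketch, reusing the hypothetical names from above:
alter session enable parallel ddl;

create table tab1 nologging parallel 4 as
select /*+ parallel(big_table 4) */ *
from big_table
where gr = 1;
Keep in mind that NOLOGGING objects cannot be recovered from the redo log after a media failure, so take a backup afterwards if that matters.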

Get the recent N records from a very huge database in oracle

I am working on a very large Oracle database with around 90 million records, and I want to get the 100 most recent records for UI purposes. I was trying to achieve this using an ORDER BY on the date column in the schema; I am able to get the recent records, but the processing takes around 20-25 minutes.
E.g. Schema
message_id varchar2,
message_head varchar2,
message_body varchar2,
----------
----------
message_date date
I was using the message_date to sort for the recent messages.
Could anyone please help me out with returning the latest 300 messages in less time (say, less than 1 minute)? I am also wondering how big data-driven companies like Facebook and Twitter manage to serve the latest posts and tweets within seconds.
Thanks in advance
1) Create an index on message_date.
2) Add a sequence column (also indexed) and retrieve the last X records using that column.
Note that if you're working on a 'live' database, the last X records are changing all the time.
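A minimal sketch of the index plus a classic Oracle top-N query (the table name messages is an assumption based on the columns above):
create index messages_date_idx on messages (message_date);

-- the inner query sorts by date (which can use the index),
-- the outer query stops after the first 100 rows
select *
from (select message_id, message_head, message_date
      from messages
      order by message_date desc)
where rownum <= 100;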
Another approach, or maybe even complementary...
Use optimizer mode hint for FIRST N ROWS. That will instruct the optimizer to optimize for a fast retrieval of the first records.
SELECT /*+ FIRST_ROWS(10) */ employee_id, last_name, salary, job_id
FROM employees
WHERE department_id = 20;
This will optimize for retrieving the first 10 rows; in your case you would use FIRST_ROWS(100).

Insert delayed when network error and insert trigger inserting on dblink-table

Maybe a strange title, I'll try to explain.
I have two Oracle servers, serverA and serverB.
On serverA I have a table tabA into which a row is inserted every minute.
On serverB I created a table tabB that has the same structure as tabA.
On serverA I created a dblink to serverB and an insert trigger like this:
create trigger tabA_trig
after insert on tabA
for each row
begin
  insert into tabB@serverB(...) values(:new....,:new... etc);
exception
  when others then null;
end tabA_trig;
After the creation of this trigger a new row is inserted into tabB every time a new row is inserted into tabA, as expected.
HOWEVER: when the communication between serverA and serverB is broken, I don't get any errors (the exception handler above takes care of that), BUT no data is inserted into tabA either! Very strange, and what is even stranger is that after about 15 minutes, new data gets inserted into tabA again. After a while the missing data starts to fill in the holes, and eventually no data is missing and the insert functions as expected.
Example (tabA and tabB, first column time minutes and hours, second column value):
Network OK:
tabA 1000 22;1001 22;1002 22
tabB 1000 22;1001 22;1002 22
Network ERROR:
About 15 minutes of no new data.
After 15 minutes:
tabA 1000 22;1001 22;1002 22;1017 22;1018 22;....
tabB 1000 22;1001 22;1002 22
After 30 minutes:
tabA 1000 22;1001 22;1002 22;1003 22;1004 22;1005 22;1006 22;1017 22;1018 22;....
tabB 1000 22;1001 22;1002 22
After 1 hour:
tabA 1000 22;1001 22;1002 22;1003 22;1004 22;1005 22;1006 22;1006 22;...;1017 22;1018 22;....
tabB 1000 22;1001 22;1002 22
If I disable the trigger, the insert on tabA works immediately.
The reason I'm using triggers and not materialized views is that I want to keep all data that has been replicated to tabB even when data is deleted from tabA.
Does anyone know what to do about this?
First, replicating data from one database to another using custom triggers is almost certainly a bad idea. Oracle provides a variety of technologies that help you implement replication. Materialized views are probably the simplest and most likely what you'd want here though you could also look at Streams or Golden Gate or even something like Change Data Capture (CDC). Custom triggers have a substantial impact on the performance of the triggering insert and they introduce a variety of failure scenarios such as this that are hard to debug.
Since no data is being inserted into tabA while the network is down but those inserts are appearing later, I would tend to suspect that some sort of exception is being raised that causes the inserting application to either queue the insert until later or that the application is creating a distributed transaction that involves either database B or some other resource that is also affected by the network error that cannot be successfully committed until some later point in time. Since we don't know anything about your application's architecture, it's hard to speculate too deeply. You can check the DBA_2PC_PENDING table to see if Oracle is acting as the distributed transaction coordinator for any distributed transactions but there are numerous other software components that might be acting as the transaction coordinator.
Or Oracle Advanced Queues, if both databases are on similar versions. A little tricky to set up, but the asynchronous messaging model will handle instances where one database or the other is down.
As to whether it is too complex a solution: only the OP can tell what the entire problem really is (I find I tend to simplify the questions I post here). Does the problem seem more like a messaging problem or one of amalgamating data? Either solution will work much better than triggers.
You need to activate dead connection detection using the SQLNET.EXPIRE_TIME parameter.
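It is set in sqlnet.ora on the database server side; a sketch (the 10-minute value is an assumption):
# sqlnet.ora - probe idle connections every 10 minutes
SQLNET.EXPIRE_TIME = 10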
I wouldn't recommend replication the way you are doing it. A materialized view is the standard way to achieve this kind of replication (note that ON COMMIT refresh is not supported when the master table sits in a remote database, so you would use a scheduled fast refresh instead).
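A minimal sketch of the materialized-view approach (names from the question; the one-minute refresh interval is an assumption, tabA is assumed to have a primary key, and note the dblink now points from serverB back to serverA):
-- on serverA: keep a change log on tabA so fast refresh is possible
create materialized view log on tabA;

-- on serverB: pull the rows over a dblink to serverA, refreshing every minute
create materialized view tabB_mv
refresh fast start with sysdate next sysdate + 1/1440
as select * from tabA@serverA;
This by itself does not preserve rows after they are deleted from tabA (the reason the question avoided materialized views), so keeping deleted history would still need an extra step such as an archive table fed from the materialized view.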

Inserts are 4x slower if table has lots of record (400K) vs. if it's empty

(Database: Oracle 10G R2)
It takes 1 minute to insert 100,000 records into a table. But if the table already contains some records (400K), then it takes 4 minutes and 12 seconds; CPU wait also jumps up and "Free Buffer Waits" become really high (from dbconsole).
Do you know what's happening here? Is this because of frequent table extents? The extent size for these tables is 1,048,576 bytes. I have a feeling the DB is trying to extend the table storage.
I am really confused about this. So any help would be great!
This is the insert statement:
begin
  for i in 1 .. 100000 loop
    insert into customer
      (id, business_name, address1,
       address2, city,
       zip, state, country, fax,
       phone, email)
    values
      (customer_seq.nextval, dbms_random.string ('A', 20), dbms_random.string ('A', 20),
       dbms_random.string ('A', 20), dbms_random.string ('A', 20),
       trunc (dbms_random.value (10000, 99999)), 'CA', 'US', '798-779-7987',
       '798-779-7987', 'asdfasf#asfasf.com');
  end loop;
end;
Here is the dstat output (CPU, IO, memory, net) for:
Empty Table inserts: http://pastebin.com/f40f50dbb
Table with 400K records: http://pastebin.com/f48d8ebc7
Output from v$buffer_pool_statistics
ID: 3
NAME: DEFAULT
BLOCK_SIZE: 8192
SET_MSIZE: 4446
CNUM_REPL: 4446
CNUM_WRITE: 0
CNUM_SET: 4446
BUF_GOT: 1407656
SUM_WRITE: 1244533
SUM_SCAN: 0
FREE_BUFFER_WAIT: 93314
WRITE_COMPLETE_WAIT: 832
BUFFER_BUSY_WAIT: 788
FREE_BUFFER_INSPECTED: 2141883
DIRTY_BUFFERS_INSPECTED: 1030570
DB_BLOCK_CHANGE: 44445969
DB_BLOCK_GETS: 44866836
CONSISTENT_GETS: 8195371
PHYSICAL_READS: 930646
PHYSICAL_WRITES: 1244533
UPDATE
I dropped the indexes on this table and performance improved drastically, even when inserting 100K records into a 600K-record table (which took 47 seconds with no CPU wait - see dstat output http://pastebin.com/fbaccb10).
Not sure if this is the same in Oracle, but in SQL Server the first thing I'd check is how many indexes you have on the table. If there are a lot, the DB has to do a lot of work maintaining the indexes as records are inserted, and maintaining indexes over 500k rows is more work than over 100k.
The indices are some form of tree, which means the time to insert a record is going to be O(log n), where n is the size of the tree (≈ number of rows for the standard unique index).
The fastest way to insert them is going to be dropping/disabling the index during the insert and recreating it after, as you've already found.
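A sketch of the disable-and-rebuild pattern in Oracle (the index name is hypothetical; unique indexes cannot be skipped this way, since inserts against an unusable unique index fail):
-- mark the index unusable before the bulk load, skip it during inserts,
-- then rebuild it once at the end
alter index customer_bname_idx unusable;
alter session set skip_unusable_indexes = true;

-- ... run the bulk insert here ...

alter index customer_bname_idx rebuild;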
Even with indexes, 4 minutes to insert 100,000 records seems like a problem to me.
If this database has I/O problems, you haven't fixed them and they will appear again. I would recommend that you identify the root cause.
If you post the index DDL, I'll time it for a comparison.
I added indexes on id and business_name. Doing 10 iterations in a loop, the average time per 100,000 rows was 25 seconds. This was on my home PC/server all running on a single disk.
Another trick to improve performance is to turn on, or increase, the cache on your sequence (customer_seq). This lets Oracle hand out a batch of sequence values from memory instead of hitting the data dictionary for each insert.
Be careful with this one, though. In some situations it will cause your sequence to have gaps between values.
More information here:
Oracle/PLSQL: Sequences (Autonumber)
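A sketch of what raising the sequence cache looks like (the cache size of 1000 is an assumption; the default is 20):
-- keep 1000 sequence values cached in memory instead of the default 20
alter sequence customer_seq cache 1000;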
Sorted inserts always take longer the more entries there are in the table.
You don't say which columns are indexed. If you had indexes on fax, phone or email, you would have had a LOT of duplicates (ie every row).
Oracle 'pretends' to have non-unique indexes. In reality every index entry is unique with the rowid of the actual table row being the deciding factor. The rowid is made up of the file/block/record.
It is possible that, once you hit a certain number of records, the new ones were getting rowids that had to be fitted into the middle of the existing indexes, with a lot of index re-writing going on.
If you supply the full table and index creation statements, others will be able to reproduce the experience, which would allow for more evidence-based responses.
I think it has to do with extending the internal structure of the data files, as well as maintaining the database indexes for the added information - I believe the database arranges the data in a non-linear fashion that helps speed up data retrieval on selects.
