Incremental loading with pig? - hadoop

I have a few sources of data (let's say they are user events).
Each source has its own event definitions.
While loading into HCatalog I create one table, event_definitions (event_id, event_code, event_source, event_description), and one, events (event_id, date, etc...).
These tables are unions of all the data from the source tables.
event_definitions.event_id is a surrogate key, which is a foreign key for events.event_id.
This key is generated with Pig's RANK function.
Everything works fine during the initial load.
But how do I handle incremental loading? New rows in the event_definitions table must get a surrogate key value greater than the last one, and with RANK there is no way to start from a specific number; it always starts at 1.
How do you handle these situations?
Regards
Paweł
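One way to handle this is to read the current maximum surrogate key from the existing table and add it as an offset to the ranks of the new rows. This is only a minimal sketch, not the asker's actual script: the loader class, aliases, staging path, and schema are assumptions.
-- Minimal Pig sketch: offset new RANK values by the existing maximum key.
existing  = LOAD 'event_definitions' USING org.apache.hive.hcatalog.pig.HCatLoader();
max_key   = FOREACH (GROUP existing ALL) GENERATE MAX(existing.event_id) AS max_id;
new_defs  = LOAD '/staging/new_event_definitions'
            AS (event_code:chararray, event_source:chararray, event_description:chararray);
ranked    = RANK new_defs;
with_keys = FOREACH ranked
            GENERATE rank_new_defs + (long) max_key.max_id AS event_id,
                     event_code, event_source, event_description;
The RANK operator still starts at 1 for each run, but adding the stored maximum keeps the new keys strictly above the ones already loaded.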

Related

AWS Glue - disabling bookmarks for some of sources in the job

I've got a data warehouse with a star schema: a fact table and multiple dimension tables around it, connected by foreign keys.
I've got two AWS Glue jobs:
Populates the dimensions (run on-demand; they don't change often)
Populates the fact table (should run as often as every hour to get fresh fact data into the warehouse)
So the situation is: I've got dimension tables filled in by the first job. In the second job I need to get only fresh data for the fact table, find the foreign keys for each record in the dimension tables, and persist the new rows in the fact table.
The problem is that when using bookmarks, AWS Glue thinks that since the dimension tables didn't change since the last run, there is nothing in them, and it inserts null as the foreign keys.
I tried to disable bookmarking by removing transformation_ctx from the generated script, but it didn't work.
From this:
dimension_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="foobar-staging",
    table_name="dimension",
    transformation_ctx="dimension_node1647201451763",
)
To this:
foobaritem_node1647201451763 = glueContext.create_dynamic_frame.from_catalog(
    database="foobar-staging",
    table_name="foobar_item",
)
But those records were still not found.
The only solution I can imagine is disabling bookmarks completely and then adding "not exists" checks for every processed record, which would prevent duplicates.
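That "not exists" check could be expressed as an anti-join inside the Glue job's Spark code. This is only a minimal sketch: the frame names and the key column ("fact_id") are assumptions, not part of the generated script.
# Sketch of the "not exists" check inside the existing Glue job.
# new_fact_dyf / existing_fact_dyf are hypothetical DynamicFrames read earlier.
new_fact_df = new_fact_dyf.toDF()
existing_fact_df = existing_fact_dyf.toDF()

fresh_rows = new_fact_df.join(
    existing_fact_df.select("fact_id"),
    on="fact_id",
    how="left_anti",  # keep only rows whose key is not already in the fact table
)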

Cassandra update and sort on same column

I'm looking for some input on Cassandra data modelling for a timeline kind of feature. To store data for the timeline, I'm planning to use timeuuid in Cassandra and make it a clustering key. This will help in sorting the data. But the same data can be updated, and I need to store the updated timeuuid corresponding to the data so that it can be pushed up in the timeline. This involves fetching the previous data-timeuuid row, deleting it, and inserting the new one, which doesn't seem performant. How can I handle sorting and updating on the same column (in my case timeuuid) to implement the timeline feature?
I suggest this schema:
CREATE TABLE timeline_idx (
    timeline_key text,
    time timeuuid,
    content_key text,
    PRIMARY KEY ((timeline_key), time)
);
CREATE TABLE timeline_content (
    content_key text,
    content blob,
    PRIMARY KEY (content_key)
);
timeline_idx is used to give you the content keys ordered as a timeline. Then you can retrieve the content from the second table, timeline_content. It is not ordered and has no clustering key, so you can update your content without knowing its timeuuid. I chose the text type for timeline_key and content_key, but you can choose whatever you want as long as it identifies timelines and contents uniquely.
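A hedged usage sketch against the schema above (the key values and LIMIT are illustrative):
-- Read the newest entries of one timeline, newest first.
SELECT content_key
FROM timeline_idx
WHERE timeline_key = 'user123'
ORDER BY time DESC
LIMIT 20;

-- Content can be updated in place without knowing any timeuuid.
UPDATE timeline_content SET content = 0xCAFE WHERE content_key = 'post42';

-- To push an item back to the top of the timeline, insert a fresh index row
-- with a new timeuuid (the old index row can be removed or ignored on read).
INSERT INTO timeline_idx (timeline_key, time, content_key)
VALUES ('user123', now(), 'post42');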

HBase row key design for reads and updates

I'm trying to understand the best way to design the row key for my HBase table.
My use case:
Structure right now:
PersonID | BatchDate | PersonJSON
When something about a person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, updating the old record. And every 4 hours, a scan of all the people who were modified is pushed to Hadoop for further processing.
If my key is just PersonID, it is great for updating the data. But my performance suffers because I have to add a filter on the BatchDate column to scan all the rows greater than a batch date.
If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since the key is not unique, and I could no longer update a person.
Is a Bloom filter on row+col (PersonID+BatchDate) an option?
Any help is appreciated.
Thanks,
Abhishek
In addition to the table with PersonID as the rowkey, it sounds like you need a dual-write secondary index, with BatchDate as the rowkey.
Another option would be Apache Phoenix, which provides support for secondary indexes.
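If you went the Phoenix route, the secondary index could look roughly like this. The table and column names are assumptions, not taken from the question:
-- Hypothetical Phoenix DDL: a covered secondary index on the batch date.
CREATE TABLE person (
    person_id   VARCHAR NOT NULL PRIMARY KEY,
    batch_date  DATE,
    person_json VARCHAR
);
CREATE INDEX person_by_batch_date ON person (batch_date) INCLUDE (person_json);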
I usually do it in two steps:
Create table one, whose key is the combination BatchDate+PersonID; the value can be empty.
Create table two just as you normally would: the key is PersonID and the value is the whole data.
For a date-range query: query table one first to get the PersonIDs, then use the HBase batch get API to fetch the data in a batch (see the sketch below). It is very fast.
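A minimal sketch of that two-table pattern with the HBase 2.x Java client; the table names, key format, and date values are illustrative assumptions:
// Sketch: range-scan an index table keyed by BatchDate|PersonID, then batch-get
// the full rows from the main table keyed by PersonID. Names are illustrative.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ModifiedPeople {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table idx = conn.getTable(TableName.valueOf("person_by_batchdate"));
             Table person = conn.getTable(TableName.valueOf("person"))) {

            // Step 1: scan the index table for the batch-date range of interest.
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("20240101|"))
                    .withStopRow(Bytes.toBytes("20240102|"));
            List<Get> gets = new ArrayList<>();
            try (ResultScanner results = idx.getScanner(scan)) {
                for (Result r : results) {
                    String rowKey = Bytes.toString(r.getRow());
                    String personId = rowKey.substring(rowKey.indexOf('|') + 1);
                    gets.add(new Get(Bytes.toBytes(personId)));
                }
            }

            // Step 2: batch-get the full PersonJSON rows from the main table.
            Result[] people = person.get(gets);
            System.out.println("Fetched " + people.length + " modified people");
        }
    }
}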

Hive: How to have a derived column that stores the sentiment value from the sentiment analysis API

Here's the scenario:
Say you have a Hive table that stores Twitter data.
Say it has 5 columns, one of which is the text data.
Now how do you add a 6th column that stores the sentiment value from sentiment analysis of the Twitter text data? I plan to use a sentiment analysis API like Sentiment140 or Viralheat.
I would appreciate any tips on how to implement the "derived" column in Hive.
Thanks.
Unfortunately, while the Hive API lets you add a new column to your table (using ALTER TABLE foo ADD COLUMNS (bar binary)), those new columns will be NULL and cannot be populated. The only way to add data to these columns is to clear the table's rows and load data from a new file, this new file having that new column's data.
To answer your question: You can't, in Hive. To do what you propose, you would have to have a file with 6 columns, the 6th already containing the sentiment analysis data. This could then be loaded into your HDFS, and queried using Hive.
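For illustration only, the file-reload approach described above might look like this; the table name (tweets), file path, and new column name are made up, and the table is assumed to be a delimited-text table:
-- Add the 6th column, then replace the table contents with a file that
-- already carries the sentiment value as its last field.
ALTER TABLE tweets ADD COLUMNS (sentiment STRING);
LOAD DATA INPATH '/staging/tweets_with_sentiment.tsv' OVERWRITE INTO TABLE tweets;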
EDIT: Just tried an example where I exported the table as a .csv after adding the new column (see above), and popped that into M$ Excel where I was able to perform functions on the table values. After adding functions, I just saved and uploaded the .csv, and rebuilt the table from it. Not sure if this is helpful to you specifically (since it's not likely that sentiment analysis can be done in Excel), but may be of use to anyone else just wanting to have computed columns in Hive.
References:
https://cwiki.apache.org/Hive/gettingstarted.html#GettingStarted-DDLOperations
http://comments.gmane.org/gmane.comp.java.hadoop.hive.user/6665
You can do this in two steps without a separate table. Steps:
Alter the original table to add the required column.
Do an "insert overwrite table ... select" of all the columns plus your computed column from the original table back into the original table (sketched below).
Caveat: this has not been tested on a clustered installation.
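A minimal sketch of those two steps; the table and column names (tweets, tweet_text, sentiment) and the sentiment_udf() UDF wrapping the external API are illustrative assumptions:
-- Step 1: add the derived column.
ALTER TABLE tweets ADD COLUMNS (sentiment STRING);

-- Step 2: rewrite the table, computing the new column from the existing text column
-- via a hypothetical UDF that calls the sentiment API.
INSERT OVERWRITE TABLE tweets
SELECT tweet_id, screen_name, created_at, lang, tweet_text,
       sentiment_udf(tweet_text) AS sentiment
FROM tweets;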

Teradata: How to design table to be normalized with many foreign key columns?

I am designing a table in Teradata with about 30 columns. These columns need to store several time-interval-style values such as Daily, Monthly, Weekly, etc. It is bad design to store the actual string values in the table, since this would be an atrocious repetition of data. Instead, what I want to do is create a primitive lookup table. This table would hold Daily, Monthly, and Weekly, and would use Teradata's identity column to derive the primary key. This primary key would then be stored in the table I am creating as foreign keys.
This would work fine for my application, since all I need to know is the primitive key value as I populate my web form's dropdown lists. However, other applications we use will need to either run reports or receive this data through feeds. Therefore, a view will need to be created that joins this table out to the primitives table so that it can actually return Daily, Monthly, and Weekly.
My concern is performance. I've never created a table with such a large number of foreign key fields and am fairly new to Teradata. Before I go down the long road of figuring this all out the hard way, I'd like any advice I can get on the best way to achieve my goal.
Edit: I suppose I should add that this lookup table would be a mishmash of unrelated primitives. It would contain groups of values relating to time intervals, as already mentioned above, but also time frames such as 24x7 and 8x5. The table would be designed like this:
ID   Type        Value
---  ----------  ----------
1    Interval    Daily
2    Interval    Monthly
3    Interval    Weekly
4    TimeFrame   24x7
5    TimeFrame   8x5
Edit Part 2: Added a new tag to get more exposure to this question.
What you've done should be fine. Obviously, you'll need to run the actual queries and collect statistics where appropriate.
One thing I can recommend is to have an additional row in the lookup table like so:
ID   Type        Value
---  ----------  ----------
0    Unknown     Unknown
Then, in the main table, instead of leaving those fields null, you would give them a value of 0. This allows you to use inner joins instead of outer joins, which helps with performance.
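Put together, the reporting view can then use inner joins throughout. This is only a sketch; the main table and its column names (schedule, interval_id, timeframe_id) and the lookup column names (lookup_id, lookup_value, standing in for the ID/Value columns shown above) are assumptions:
-- Hypothetical reporting view: every foreign key resolves, because missing
-- values point at the ID 0 / Unknown row instead of being NULL.
CREATE VIEW schedule_report AS
SELECT s.schedule_id,
       iv.lookup_value AS interval_value,    -- 'Daily', 'Weekly', 'Monthly', or 'Unknown'
       tf.lookup_value AS timeframe_value    -- '24x7', '8x5', or 'Unknown'
FROM schedule s
JOIN primitive_lookup iv ON iv.lookup_id = s.interval_id
JOIN primitive_lookup tf ON tf.lookup_id = s.timeframe_id;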
