Sqoop Increment with string column - hadoop

I'm trying to use an incremental sqoop job across all tables in a database. Some of the databases only have string values in the columns. Is there a way to increment on a string value? There is a common string name across all tables.

After my initial comment I wondered whether the question you asked even made sense. It would if your database forced you to store either the record date or the incrementing number in a text column, but the odds of that are very slim.
If you have a date field you can actually use, you can just use 'lastmodified' mode instead of 'append' mode.
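To illustrate why a text check column is awkward for incremental imports: an incremental import picks up rows whose check column is greater than the last imported value, and text compares lexicographically. The following is a minimal sketch with made-up values (the variable names are hypothetical, and this is plain Python rather than Sqoop itself).

```python
# Hypothetical check-column values stored as text.
ids_as_text = ["8", "9", "10", "11"]
last_value = "9"

# Lexicographic comparison: "10" and "11" sort before "9", so they are missed.
new_rows_text = [v for v in ids_as_text if v > last_value]

# Numeric comparison finds the rows an incremental import should pick up.
new_rows_numeric = [v for v in ids_as_text if int(v) > int(last_value)]

# Fixed-width ISO 8601 style timestamps are the exception:
# their text order matches their chronological order.
stamps = ["2015-01-09 10:00:00", "2015-01-10 09:00:00"]
stamps_sorted_as_text = sorted(stamps)
```

So a text column only behaves well as an incremental marker if its values happen to sort the same way as the underlying number or date.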

Related

Saving float changes to a float or to a varchar2 column?

I need to save before and after value changes of certain fields of an items table to an items_log table. Changes are saved by an after change trigger on the items table.
Some of the items table columns are varchar2 type and some are number(*) type.
What is the better approach? Saving to two separate before and after number fields plus two before and after varchar2 fields? Or conserving space by saving everything to just two before and after varchar2 fields?
The purpose of this log table is to record which user changed a field and the before and after values.
Could saving a float value to a string field lead to an unexpected deviation from the original value?
Thanks in advance
"What is the better approach?"
There is no "better" approach. There is only an approach that's good enough for your application. If your table will have a few thousand rows in it, it doesn't really matter. If your table will have a few million rows, then space may be more of a concern.
If your goal is to display to a user what changes occurred to your item and it's not going to see a lot of activity, storing everything as a varchar may be good enough. You probably don't want to store rows for fields that did not change.
I use APC's approach often. The items_log table is the same as the item table, and includes a history id, timestamp, action (I, U, or D), and user along with all the columns of the item row. Everything is maintained by a trigger. There are also built-in Oracle auditing features to do auditing for you.
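The "store only the fields that changed, as text" idea can be sketched roughly as follows. This is plain Python standing in for the trigger logic, not Oracle code; `log_rows` and the row layout are hypothetical names chosen for the example. Note that Python floats are binary doubles, so the round-trip behaviour shown here does not automatically carry over to Oracle NUMBER columns.

```python
from datetime import datetime, timezone

def log_rows(before, after, user):
    """Return one log row per changed field, with values stored as text."""
    changed_at = datetime.now(timezone.utc)
    rows = []
    # Sort the field names so the output order is deterministic.
    for field in sorted(before.keys() | after.keys()):
        old, new = before.get(field), after.get(field)
        if old != new:  # skip fields that did not change
            rows.append({
                "field": field,
                "before": None if old is None else str(old),
                "after": None if new is None else str(new),
                "changed_by": user,
                "changed_at": changed_at,
            })
    return rows

rows = log_rows(
    {"name": "bolt", "price": 0.1},
    {"name": "bolt", "price": 0.1 + 0.2},
    "alice",
)
```

Here only `price` produces a log row, and converting the stored text back with `float(...)` yields the original value, because Python's `str` of a float is designed to round-trip.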

Oracle - build dimension from a file based data source

I'm trying to build a star schema in Oracle 12c. In my case the data source is not a relational database but a single Excel/CSV file populated via a Google Form, which means I don't have any reference from a source system such as auto-incrementing keys/IDs. What would be the best approach to building a star schema under this condition?
File row sample:
<submitted timestamp>,<submitted by user>,<region>,<country>,<branch>,<branch location>,<branch area>,<branch type>,<branch name>,<branch private? yes/no value>,<the following would be all "fact" values (measurements),...,...,...
If I wanted to build a "branch" dimension, how would I handle updates/inserts after the first load into the dimension table?
Thought solution so far:
I had thought of making a concatenated string "key" with the branch values, which would make it unique (underscore would be the "glue" to concatenate the values), eg:
<region>_<country>_<branch>_<branch location> as branch_key
I would insert all the distinct branches into a staging table, including the branch_key column for each of them. Then, when loading into the dimension, I could check which keys do not yet exist in my dimension table and insert those. As for updates, I'm a bit stuck on how to handle them. I had thought of having another file that maps which branches are active, with an expiration date column. Basically I'm trying to simulate what I could do if the data were in a database instead of CSV files.
This is all I can think of so far. Do you have any other recommendations/ideas on how to implement this? Take into consideration that the data source cannot be changed, as in I have to read these CSV files, since the data is not stored anywhere else.
Thank you.
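The concatenated-key idea described in the question could look roughly like this. It is a minimal sketch in Python rather than Oracle SQL; the function names and the dictionary row layout are assumptions made for illustration.

```python
# Columns glued together with "_" to form branch_key, as in the question.
KEY_COLUMNS = ("region", "country", "branch", "branch_location")

def branch_key(row):
    """Build the concatenated natural key for one CSV row."""
    return "_".join(row[col] for col in KEY_COLUMNS)

def new_branches(staging_rows, dimension_keys):
    """Return distinct staging rows whose key is not in the dimension yet."""
    seen = set(dimension_keys)
    fresh = []
    for row in staging_rows:
        key = branch_key(row)
        if key not in seen:
            seen.add(key)  # also removes duplicates within the same file
            fresh.append({**row, "branch_key": key})
    return fresh

staging = [
    {"region": "EMEA", "country": "ES", "branch": "B1", "branch_location": "Madrid"},
    {"region": "EMEA", "country": "ES", "branch": "B1", "branch_location": "Madrid"},
    {"region": "EMEA", "country": "PT", "branch": "B2", "branch_location": "Lisbon"},
]
to_insert = new_branches(staging, {"EMEA_ES_B1_Madrid"})
```

One caveat with this design: if any of the values can themselves contain an underscore, two different branches could produce the same key.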

HBase row key design for reads and updates

I'm trying to understand the best way to design the key for my HBase table.
My use case :
Structure right now
PersonID | BatchDate | PersonJSON
When something about a person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, updating the old record. Every 4 hours, a scan picks up all the people who were modified and pushes them to Hadoop for further processing.
If my key is just PersonID, it's great for updating the data. But performance is poor because I have to add a filter on the BatchDate column to scan all the rows greater than a given batch date.
If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since the key is no longer unique per person, and I could no longer update a person.
Is a bloom filter on row+col (personid+batchdate) an option?
Any help is appreciated.
Thanks,
Abhishek
In addition to the table with PersonID as the rowkey, it sounds like you need a dual-write secondary index, with BatchDate as the rowkey.
Another option would be Apache Phoenix, which provides support for secondary indexes.
I usually do two steps:
Create table one, whose key is just the combination of BatchDate+PersonId; the value can be empty.
Create table two as you normally would: the key is PersonId and the value is the whole record.
For a date range query, query table one first to get the PersonIds, then use the HBase batch get API to fetch the data in batches. It should be very fast.
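The two-table pattern from the answers above can be sketched as follows. This is an in-memory Python stand-in, not HBase client code: a dict plays the table keyed by PersonID, a sorted list plays the index table keyed by BatchDate|PersonID, and all names are hypothetical. Batch dates are assumed to be strings that sort chronologically.

```python
import bisect

people = {}       # stands in for the data table, keyed by PersonID
batch_index = []  # stands in for the index table: sorted "BatchDate|PersonID" keys

def put_person(person_id, batch_date, person_json):
    """Dual write: update the data row and add an index entry."""
    people[person_id] = (batch_date, person_json)
    bisect.insort(batch_index, f"{batch_date}|{person_id}")

def modified_between(start_date, end_date):
    """Range-scan the index (end exclusive), then fetch the current data rows."""
    lo = bisect.bisect_left(batch_index, start_date)
    hi = bisect.bisect_left(batch_index, end_date)
    ids = {key.split("|", 1)[1] for key in batch_index[lo:hi]}
    # Old index entries are never deleted in this sketch, so keep only
    # people whose current batch date really falls inside the range.
    return {
        pid: people[pid][1]
        for pid in ids
        if start_date <= people[pid][0] < end_date
    }

put_person("p1", "2015-01-01", "{a}")
put_person("p2", "2015-01-02", "{b}")
put_person("p1", "2015-01-03", "{c}")  # p1 is modified again later
```

The range scan on the index corresponds to using startrow and endrow, and the final lookup corresponds to a batch get on the PersonID table. Because updates leave stale index entries behind, the check against the current batch date (or a cleanup of old entries) is needed to avoid returning a person for a batch that has since been superseded.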

Parse.com : query with two datetime at different column

I want to build a sync method, both from my local database to parse.com and from parse.com to my local database. For the first case (from local to parse.com), Alhamdulillah, my script runs well. But for the second case, I need to query between the columns last_update and last_sync (I do not use updatedAt because I can't control it, so I use last_update).
Please explain how to get all the data where last_update is greaterThan last_sync.
From https://www.parse.com/questions/querying-between-two-dates I learned how to query between two dates, but that is on a single column, and the values in the query are literal values. In my case, the value to compare against key = last_update is another column, last_sync.
Thank you...
You can't use a query constraint that compares two columns. You need to change your logic. Why do you store both a last_update and last_sync in a record?
Syncing is a complex subject and you can easily mess up the logic. I don't understand how you can store a last_sync date on every record, as this has to be different for every user. You need to store the last_sync value for each user, and use that to compare against the last_update column on all records.
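The suggested logic, keeping one last_sync value per user and comparing it against last_update on every record, might be sketched like this. It is plain Python over in-memory data rather than Parse SDK code; `records_to_pull` and the sample data are hypothetical.

```python
from datetime import datetime

def records_to_pull(records, last_sync):
    """Return records updated after this user's last sync."""
    # last_sync is a concrete value, not a column, so this is the kind of
    # comparison a single greater-than constraint on last_update can express.
    return [r for r in records if r["last_update"] > last_sync]

# One last_sync per user, stored apart from the records themselves.
last_sync_by_user = {"user_a": datetime(2014, 5, 1, 12, 0)}

records = [
    {"id": 1, "last_update": datetime(2014, 5, 1, 11, 0)},
    {"id": 2, "last_update": datetime(2014, 5, 1, 13, 0)},
]

pending = records_to_pull(records, last_sync_by_user["user_a"])

# After a successful pull, advance this user's marker to the newest record seen.
if pending:
    last_sync_by_user["user_a"] = max(r["last_update"] for r in pending)
```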

How to get sorted rows out of Cassandra when using RandomPartitioner and Hector as client?

To improve my skills with Hector and Cassandra, I'm trying different methods to query data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in chronological order with the last posted message first.
In plain SQL it is possible to use 'order by'. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating an index on a secondary column with a timestamp as its value, but I can't figure out how to obtain the data. I'm sure that I have to use at least two queries.
My column Family looks like this:
create column family messages
with comparator = UTF8Type
and key_validation_class=LongType
and compression_options =
{sstable_compression:SnappyCompressor, chunk_length_kb:64}
and column_metadata = [
{column_name: message, validation_class: UTF8Type}
{column_name: index, validation_class: DateType, index_type: KEYS}
];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible, I'd like to know how it's done both with the CQL syntax and without.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
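The client-side part of this, working out which time-bucket rows to read and in what order, could be sketched as below. It assumes one row per day with a date string as the row key, which is a choice made for this example (the column family in the question uses LongType keys); the function names are hypothetical and no Hector or Cassandra calls are shown.

```python
from datetime import datetime, timedelta

def row_key(ts):
    """Row key of the day bucket a message timestamp belongs to."""
    return ts.strftime("%Y-%m-%d")

def row_keys_newest_first(start, end):
    """List the day-bucket row keys covering [start, end], latest day first."""
    day = end.date()
    keys = []
    while day >= start.date():
        keys.append(day.strftime("%Y-%m-%d"))
        day -= timedelta(days=1)
    return keys

keys = row_keys_newest_first(
    datetime(2012, 1, 30, 23, 0),
    datetime(2012, 2, 1, 1, 0),
)
```

The client would then read each of these rows in turn, asking for the columns in reversed order within a row, to get the messages with the last posted message first.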
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/
