Cassandra Hadoop MapReduce with wide rows ignores slice predicate

I have a wide-row column family which I'm trying to run a MapReduce job against. The CF is a time-ordered collection of events, where the column names are essentially timestamps. I need to run the MR job against a specific date range in the CF.
When I run the job with the widerow property set to false, the expected slice of columns is passed into the mapper class. But when I set widerow to true, the entire column family is processed, ignoring the slice predicate.
The problem is that I have to use widerow support, as the number of columns in the slice can grow very large and consume all the memory if loaded in one go.
I've found this JIRA task which outlines the issue, but it has been closed off as "cannot reproduce" - https://issues.apache.org/jira/browse/CASSANDRA-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel
I'm running Cassandra 1.2.6 and using cassandra-thrift 1.2.4 & hadoop-core 1.1.2 in my jar. The CF has been created using CQL3.
It's worth noting that this occurs regardless of whether I use a SliceRange or specify the columns using setColumn_names() - it still processes all of the columns.
Any help will be massively appreciated.

So it seems that this is by design. In the word_count example in github, the following comment exists:
// this will cause the predicate to be ignored in favor of scanning everything as a wide row
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
Urrrrgh. Fair enough then. It seems crazy that there is no way to limit the columns when using wide rows though.
UPDATE
Apparently the solution is to use the new org.apache.cassandra.hadoop.cql3 library. See the new example on github for reference: https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java

Sorry to add a comment as an answer, but we are trying to do the same thing. You mentioned that when you ran the job with the widerow property set to false, the expected slice of columns was passed into the mapper class; when we set the widerow property to false, we are still getting errors. How did you pass the timestamp range in the slice predicate?
The CF that we use is a timeline of events with uid as the partition key and event_timestamp as the composite column. The equivalent CQL is:
CREATE TABLE testcf (
    uid varchar,
    event_timestamp timestamp,
    event varchar,
    PRIMARY KEY (uid, event_timestamp)
);
MapReduce code - intended to send only the events within the start and end dates (note: we can query on the timestamp composite column from the Cassandra client and cqlsh and get the desired events):
// Setting widerow to false
config.setInputColumnFamily(Constants.KEYSPACE_TRACKING, Constants.CF_USER_EVENTS, false);

DateTime start = getStartDate(); // e.g., July 30th 2013
DateTime end = getEndDate();     // e.g., Aug 6th 2013

SliceRange range = new SliceRange(
        ByteBufferUtil.bytes(start.getMillis()),
        ByteBufferUtil.bytes(end.getMillis()),
        false, Integer.MAX_VALUE);
SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
config.setInputSlicePredicate(predicate);
But the above code doesn't work. We get the following error:
java.lang.RuntimeException: InvalidRequestException(why:Invalid bytes remaining after an end-of-component at component0)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384)
Wondering if we are sending incorrect data in the start and end parameters in the slice range.
Any hint or help is useful.

Related

How to calculate a value using previous rows' values in Talend

I have a dataset like below.
Dataset:
Now the business logic is to find out the last paid date for each of the loans. I tried using a tMap component; it calls a Java routine that has a static variable last_paid_dt which stores the transaction date whenever the daily deposit > 0. However, when the daily deposit is less than 0 the static variable does not get changed. This works fine when the amount paid is 0.
Issue - See the red highlighted values in the table below
When the amount paid is reversed a day or more later, the last paid date should come from the previous non-reversed positive amount. I was not able to get that done.
Also, when a new loan id starts processing, I need the static variable to be reset, which is not currently happening.
If my current methodology is wrong, please help me do this in a better and more efficient way. Thanks
Expected output:
First of all you need to use a Map, with the key being the loanId.
You don't want to overwrite the value, i.e. if the key already exists in your map, do not overwrite it with a new value.
You can use the globalMap if you want; in that case I'd do:
globalMap.get("loan_" + loanId) == null ?
    globalMap.put("loan_" + loanId, loanDate) : loanDate
then later:
globalMap.get("loan_" + loanId)
Not elegant, but it works. A more elegant way would be to define your own map that you put into globalMap and null it out after the process, so you free up the memory. But this all depends on the complexity of your job.

How to get sorted rows out of Cassandra when using RandomPartitioner and Hector as client?

To improve my skills with Hector and Cassandra I'm trying different methods to query data out of Cassandra.
Currently I'm trying to make a simple message system. I would like to get the posted messages in chronological order with the last posted message first.
In plain SQL it is possible to use 'order by'. I know it is possible if you use the OrderPreservingPartitioner, but this partitioner is deprecated and less efficient than the RandomPartitioner. I thought of creating an index on a secondary column with a timestamp as value, but I can't figure out how to obtain the data. I'm sure that I would have to use at least two queries.
My column family looks like this:
create column family messages
  with comparator = UTF8Type
  and key_validation_class = LongType
  and compression_options =
    {sstable_compression: SnappyCompressor, chunk_length_kb: 64}
  and column_metadata = [
    {column_name: message, validation_class: UTF8Type},
    {column_name: index, validation_class: DateType, index_type: KEYS}
  ];
I'm not sure if I should use DateType or long for the index column, but I think that's not important for this question.
So how can I get the data sorted? If possible, I'd like to know how it's done with the CQL syntax and without.
Thanks in advance.
I don't think there's a completely simple way to do this when using RandomPartitioner.
The columns within each row are stored in sorted order automatically, so you could store each message as a column, keyed on timestamp.
Pretty soon, of course, your row would grow large. So you would need to divide up the messages into rows (by day, hour or minute, etc) and your client would need to work out which rows (time periods) to access.
See also Cassandra time series data
and http://rubyscale.com/2011/basic-time-series-with-cassandra/
and https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
and http://pkghosh.wordpress.com/2011/03/02/cassandra-secondary-index-patterns/

SCD Type 2 in Informatica

Can any of you please elaborate on how to build the Informatica mapping for the inserts and updates from the source table to the target?
I'd appreciate it if you could explain with an example.
Type 2 only INSERTs (new rows as well as updated rows)
Version Data Mapping:
The Type 2 Dimension/Version Data mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by versioning the primary key and creating a version number for each dimension in the table. In the Type 2 Dimension/Version Data target, the current version of a dimension has the highest version number and the highest incremented primary key of the dimension.
Use the Type 2 Dimension/Version Data mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. Version numbers and versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
PM_VERSION_NUMBER. The Integration Service generates a version number for each row written to the target.
Creating a Type 2 Dimension/Effective Date Range Mapping
The Type 2 Dimension/Effective Date Range mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by maintaining an effective date range for each version of each dimension in the target. In the Type 2 Dimension/Effective Date Range target, the current version of a dimension has a begin date with no corresponding end date.
Use the Type 2 Dimension/Effective Date Range mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table. An effective date range tracks the chronological history of changes for each dimension.
When you use this option, the Designer creates the following additional fields in the target:
PM_BEGIN_DATE. For each new and changed dimension written to the target, the Integration Service uses the system date to indicate the start of the effective date range for the dimension.
PM_END_DATE. For each dimension being updated, the Integration Service uses the system date to indicate the end of the effective date range for the dimension.
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
The Type 2 Dimension/Flag Current mapping
The Type 2 Dimension/Flag Current mapping filters source rows based on user-defined comparisons and inserts both new and changed dimensions into the target. Changes are tracked in the target table by flagging the current version of each dimension and versioning the primary key. In the Type 2 Dimension/Flag Current target, the current version of a dimension has a current flag set to 1 and the highest incremented primary key.
Use the Type 2 Dimension/Flag Current mapping to update a slowly changing dimension table when you want to keep a full history of dimension data in the table, with the most current data flagged. Versioned primary keys track the order of changes to each dimension.
When you use this option, the Designer creates two additional fields in the target:
PM_CURRENT_FLAG. The Integration Service flags the current row “1” and all previous versions “0.”
PM_PRIMARYKEY. The Integration Service generates a primary key for each row written to the target.
You can start by looking at the Definition of SCD type-2 here.
http://en.wikipedia.org/wiki/Slowly_changing_dimension#Type_2
This implementation is so common in data warehouses that Informatica actually provides you with the template to do so. You can just "plug-in" your table names and the attributes.
If you have informatica installed, you can go to the following location in the help guide to see the detailed implementation logic.
Contents > Designer Guide > Using the mapping wizards > Creating a type 2 dimension.
Use a Router to define groups for UPDATE and INSERT. Pass the output of each group to an Update Strategy and then to the target. HTH.

Can you sort a GET on a Cassandra column family by the Timestamp value created for each column entry, rather than the column Keys?

Basically I have a 'thread line' where new threads are made and a TimeUUID is used as a key, which obviously makes sorting new threads quite easy, especially when making a query for, say, the latest 20 threads.
My problem is that when a new 'post' is made to a thread I want to be able to 'bump' that thread to the front of the 'thread line', which is where the problem comes in: how do I make this happen so that queries can still return threads in the right order without producing any duplicates?
The only way I can see this working is if, rather than the column family sorting by TimeUUID, it sorts by the insertion timestamp. Then I could use the unique thread IDs as column keys and retrieve them in the order they were inserted or reinserted, rather than by TimeUUID. Is this possible, or am I missing a simple trick that allows for this? As far as I know you have to set a particular comparator or it defaults to bytes.
Columns within a row are always sorted by name with the given comparator. You cannot sort by timestamp or value or anything else, or Cassandra would not be able to merge multiple updates to the same column correctly.
As to your use case, I can think of two options.
The most similar to what you are doing now would be to create a second columnfamily, ThreadMostRecentPosts, with timeuuid columns (you said "keys" but it sounds like you mean "columns"). When a new post arrives, delete the old most-recent column and add a new one.
This has two problems:
The unit of replication is the row, so having this grow indefinitely could be problematic. (Using expiring columns to age out no-longer-relevant thread information might help.)
You need a lock manager so that multiple posts to the same thread don't race and possibly leave multiple entries in this row.
I would suggest instead creating a row per day (for instance), whose columns are the thread IDs and whose values are the most recent post. Adding a new post just updates the value in that column; no delete/re-add is done, so the race is not a problem anymore. You don't get sorting for free anymore but that's okay because you're limiting it to a small enough set that you can do that sort in memory (say, yesterday's threads and today's).
(Finally, I would add that I can say from experience that having a cutoff past which old threads don't get bumped to the front by a new reply is a Good Thing.)

Enterprise Architect Oracle long field column properties

I have a little problem with Enterprise Architect by Sparx Systems.
I'm trying to model a database schema for Oracle. I created a table with a primary key with data type LONG. But when I try to modify the column properties (set AutoNum = true) I see empty properties. I read the EA documentation and saw that I need to set this property to generate the sequence syntax.
When I change the data type to NUMBER, or switch the database to MySQL (for example), everything is fine: the properties are there, so I'm able to modify the AutoNum value.
Have you had a similar problem and found a solution? Or maybe I'm doing something wrong.
Regards
It's because Oracle uses sequences instead of an autoincrement option. I've checked it, and I think you have to use the NUMBER column type and then set the AutoNum property (you also have to select Generate Sequences in the options to get the proper DDL code). Instead of the LONG data type you can set the PRECISION and SCALE options on the NUMBER type, i.e. NUMBER(8) means an 8-digit number, and it can be set up to 38 digits, so unless you want to store info about every star in the universe it will be enough for your scenario :)
