Spark Streaming - Filter dynamically

I have a Spark Streaming job and I want to apply a filter to my input RDD.
I want to fetch the filter criteria dynamically from HBase at every Spark Streaming batch.
How do I achieve this?
With mapPartitions I can create the connection object once per partition.
But how do I achieve the same thing inside a Spark filter?

I think the right approach is to write a filter function of your own (pseudocode):
DStream<Integer> intDStream = someIntegerInputDStream;
intDStream.foreachRDD(rdd -> rdd.foreachPartition(partition -> {
    // create the HBase connection here if one per batch/partition is enough
    while (partition.hasNext()) {   // here you have an iterator
        Integer current = partition.next();
        // or create the HBase connection here if you need one per element
        // here is your filter condition:
        if (current meets your condition) {
            // drop the element, otherwise keep processing it
        }
    }
}));
So what happens is that you are running on the executor, manually picking each element, applying your condition to it, and dropping it if it meets your criteria.
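Since foreachPartition is an output action and does not hand you back a filtered stream, a variation of the same idea that does is transform plus mapPartitions. Below is a minimal sketch, assuming the Spark 2.x Java API and the HBase 1.x client; the filter_criteria table, its cf:threshold column, and the ">= threshold" condition are made-up placeholders for whatever criteria you store in HBase:
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;

public class DynamicFilterSketch {

    // Reads the current criteria from a hypothetical filter_criteria table.
    static int fetchThreshold(Connection conn) throws Exception {
        try (Table table = conn.getTable(TableName.valueOf("filter_criteria"))) {
            return Bytes.toInt(table.get(new Get(Bytes.toBytes("current")))
                    .getValue(Bytes.toBytes("cf"), Bytes.toBytes("threshold")));
        }
    }

    static JavaDStream<Integer> applyDynamicFilter(JavaDStream<Integer> intDStream) {
        // transform() is evaluated for every micro-batch, so each batch sees fresh criteria.
        return intDStream.transform((JavaRDD<Integer> rdd) ->
            rdd.mapPartitions((Iterator<Integer> partition) -> {
                // One HBase connection per partition, created on the executor.
                try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
                    int threshold = fetchThreshold(conn);
                    List<Integer> kept = new ArrayList<>();
                    while (partition.hasNext()) {
                        Integer current = partition.next();
                        if (current >= threshold) {   // your filter condition
                            kept.add(current);
                        }
                    }
                    return kept.iterator();           // Spark 1.x would expect an Iterable instead
                }
            })
        );
    }
}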

Related

Cassandra aggregate to Map

I am new to Cassandra; I've mainly been using Hive for the past several months. Recently I started a project where I need to do some of the things I did in Hive with Cassandra instead.
Essentially, I am trying to find a way to aggregate multiple rows into a single map at query time.
In Hive, I simply do a group by with a "map" aggregate. Does a way exist in Cassandra to do something similar?
Here is an example of a working hive query that does the task I am looking to do:
select
  map(
    "quantity", count(caseid),
    "title", casesubcat,
    "id", casesubcatid,
    "category", named_struct("id", casecatid, 'title', casecat)
  ) as casedata
from caselist
group by named_struct("id", casecatid, 'title', casecat), casesubcat, casesubcatid
Mapping query results to a Map (or some other type/structure/class of your choice) is the responsibility of the client application and is usually a trivial task (but you didn't specify in what context this map is going to be used).
The actual question here is about GROUP BY in Cassandra. This is not supported out of the box. You can check Cassandra's standard aggregate functions or try creating a user-defined function, but the Cassandra way is knowing your queries in advance, designing your schema accordingly, doing the heavy lifting at write time and keeping the queries simple. Thus, grouping/aggregation can often be achieved with dedicated counter tables.
Another option is to do the data processing in an additional layer (Apache Spark, for example). Have you considered using Hive on top of Cassandra?
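To make the "client side" part concrete, here is a minimal sketch with the DataStax Java driver (3.x assumed) that rebuilds one piece of the Hive result, the per-sub-category count, as a plain Java Map; the contact point and keyspace name are placeholders:
import java.util.HashMap;
import java.util.Map;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CaseAggregationSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("mykeyspace")) {   // placeholder keyspace

            // The grouping Hive did with GROUP BY happens here, in the client.
            Map<String, Long> quantityBySubCat = new HashMap<>();
            for (Row row : session.execute("SELECT casesubcat FROM caselist")) {
                quantityBySubCat.merge(row.getString("casesubcat"), 1L, Long::sum);
            }
            System.out.println(quantityBySubCat);
        }
    }
}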

Perform actions before end of the micro-batch in Spark Streaming

Is there a way to perform some action at the end of each micro-batch inside the DStream in Spark Streaming? My aim is to compute the number of events processed by Spark. Spark Streaming gives me some numbers, but the average also seems to sum up zero values (as some micro-batches are empty).
For example, I collect some statistics and want to send them to my server, but the object that collects the data only exists during a certain batch and is initialized from scratch for the next batch. I would love to be able to call my "finish" method before the batch is done and the object is gone. Otherwise I lose the data that has not been sent to my server.
Maybe you can use a StreamingListener:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
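The listener's onBatchCompleted callback hands you per-batch info (including record counts). If all you really need is a hook that runs once per micro-batch for your own statistics object, a simpler pattern (an alternative, not what the linked docs describe) is foreachRDD, whose body executes on the driver at every batch interval. A minimal sketch, assuming the Spark 2.x Java API, with sendToServer standing in for your reporting code:
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PerBatchStatsSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("per-batch-stats");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // foreachRDD runs once per micro-batch on the driver, so everything in its body
        // happens at the end of that batch for this output operation.
        lines.foreachRDD((rdd, time) -> {
            long count = rdd.count();                    // triggers the work for this batch
            if (count > 0) {                             // skip empty batches so they don't skew averages
                sendToServer(time.milliseconds(), count);
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }

    // Hypothetical stand-in for "send the collected statistics to my server".
    private static void sendToServer(long batchTimeMs, long count) {
        System.out.println("batch " + batchTimeMs + " processed " + count + " events");
    }
}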

Retrieve data from HBase using a prefix string

I am using HBase 0.98.4.
I want to retrieve data from an HBase table using a scanner with the Java API, where my
startRow is: username_uniqueId
stopRow is: username_uniqueId* (so anything can be appended here)
I set these params on the Scan object, but it is not fetching any data from the HBase table.
Basically I want to fetch all records which start with some specific string.
For this I could use a prefix filter, but I came to know that it hurts HBase performance because it scans the whole table, so I am avoiding it.
Does anyone have a better solution apart from using a prefix filter?
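For what it's worth, a "*" wildcard in the stop row has no effect, because row keys are compared as raw bytes. The usual way to bound such a scan without a PrefixFilter is to use the prefix itself as the start row and the prefix with its last byte incremented as the stop row, so only rows in that range are touched. A minimal sketch against the 0.98-era client API (the table name is a placeholder, and this assumes the prefix does not end in a 0xFF byte):
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] prefix = Bytes.toBytes("username_uniqueId");

        // Stop row = prefix with its last byte incremented, so the scan covers exactly
        // the rows that start with the prefix and nothing beyond them.
        byte[] stopRow = Arrays.copyOf(prefix, prefix.length);
        stopRow[stopRow.length - 1]++;

        Scan scan = new Scan();
        scan.setStartRow(prefix);
        scan.setStopRow(stopRow);

        try (HTable table = new HTable(conf, "mytable");   // placeholder table name
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                System.out.println(Bytes.toString(result.getRow()));
            }
        }
    }
}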

Load HBase records from PIG conditionally

Is there a way to load records from HBase into a Pig relation based on the value of a particular column in HBase? Thank you.
If you look at the source code of the Pig HBase loader, you can see that it can filter on key range and timestamps, and it can get columns by prefix, but it cannot filter by column value.
You can write your own loader (even based on that code) and add the capability you need. Note that the performance of filtering on column values would not be great anyway, and filtering for that value in the mapper, while slower than filtering in an HBase filter, will not be that different (you'd basically be saving the interprocess communication from the region server to the Pig mapper).
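"Filtering in an HBase filter" here would mean attaching something like a SingleColumnValueFilter to the Scan that the loader builds, so the region server drops non-matching rows before they ever reach Pig. A minimal sketch of that piece (column family cf, qualifier status and value ACTIVE are made-up):
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ColumnValueScanSketch {
    public static Scan buildScan() {
        // Keep only rows whose cf:status column equals "ACTIVE".
        SingleColumnValueFilter filter = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("status"),
                CompareOp.EQUAL, Bytes.toBytes("ACTIVE"));
        // Rows that lack the column would otherwise pass the filter by default.
        filter.setFilterIfMissing(true);

        Scan scan = new Scan();
        scan.setFilter(filter);
        return scan;
    }
}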

Filter data on row key in Random Partitioner

I'm working on Cassandra Hadoop integration (MapReduce). We have used RandomPartitioner to insert data to gain faster write speed. Now we have to read that data from Cassandra in MapReduce and perform some calculations on it.
From all the data we have in Cassandra, we want to fetch data only for particular row keys, but we are unable to do so due to RandomPartitioner - there is an assertion in the code.
Can anyone please guide me on how to filter data based on the row key at the Cassandra level itself (I know the data is distributed across nodes using a hash of the row key)?
Would using secondary indexes (still trying to understand how they work) solve my problem, or is there some other way around it?
I want to use Cassandra MapReduce to calculate some KPIs on data that is continuously stored in Cassandra, so fetching the whole dataset from Cassandra every time seems like an overhead to me. The row key I'm using is like "(timestamp/60000)_otherid"; this CF contains references to the row keys of the actual data stored in another CF. So to calculate a KPI I will work on a particular minute, fetch the data from the other CF, and process it.
When using RandomPartitioner, keys are not sorted, so you cannot do a range query on your keys to limit the data. Secondary indexes work on columns, not keys, so they won't help you either. You have two options for filtering the data:
Choose a data model that allows you to specify a thrift SlicePredicate, which will give you a range of columns regardless of key, like this:
SlicePredicate predicate = new SlicePredicate().setSlice_range(
    new SliceRange(ByteBufferUtil.bytes(start), ByteBufferUtil.bytes(end), false, Integer.MAX_VALUE));
ConfigHelper.setInputSlicePredicate(conf, predicate);
Or use your map stage to do this by simply ignoring input keys that are outside your desired range.
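A minimal sketch of that second option, assuming the thrift-era ColumnFamilyInputFormat where keys arrive as ByteBuffers; the prefix constant and the output types are made-up placeholders:
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.SortedMap;

import org.apache.cassandra.db.IColumn;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RangeFilteringMapper
        extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, LongWritable> {

    // Hypothetical minute bucket, i.e. the "(timestamp/60000)_" part of the row key.
    private static final String MINUTE_PREFIX = "24309120_";

    @Override
    protected void map(ByteBuffer key, SortedMap<ByteBuffer, IColumn> columns, Context context)
            throws IOException, InterruptedException {
        String rowKey = ByteBufferUtil.string(key);
        if (!rowKey.startsWith(MINUTE_PREFIX)) {
            return;   // ignore keys outside the range we care about
        }
        context.write(new Text(rowKey), new LongWritable(columns.size()));
    }
}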
I am unfamiliar with the Cassandra Hadoop integration, but trying to understand how to use the hash system to query the data yourself is likely the wrong way to go.
I would look at the Cassandra client you are using (Hector, Astyanax, etc.) and ask how to query by row keys from that.
Querying by the row key is a very common operation in Cassandra.
Essentially, if you still want to use RandomPartitioner and want the ability to do range slices, you will need to create a reverse index (a.k.a. an inverted index). I have answered a similar question here that involved timestamps.
Having the ability to generate your row keys programmatically allows you to emulate a range slice on row keys. To do this you must write your own InputFormat class and generate your splits manually.
