Index Sort: serialization of data stream based on index - sorting

I have two streams from which I am receiving data. I would like to output a single stream sorted by the indexes of the two incoming streams.
For simplicity, here is some sample code:
while (1)
{
    read.stream1(buffer1, size);
    read.stream2(buffer2, size);
    int index1 = (int) buffer1[20];
    int index2 = (int) buffer2[20];

    ///// This is the part which I have to do:
    // sort based on the index: whichever index is smaller, send that buffer first
    outputbuffer = Sort(index1, buffer1, index2, buffer2);
    /////

    outputstream.send(outputbuffer);
}
We have no information about the incoming indexes in advance; it is like continuous sorting of a data stream, i.e. serialization.
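A minimal sketch of the usual two-way merge, assuming each stream is individually ordered by its own index and that the index is a single byte at offset 20; the Supplier/Consumer parameters below are hypothetical stand-ins for the real stream API, which is not shown in the question. Keep one pending record per stream, emit whichever has the smaller index, and refill only the stream you emitted from:
import java.util.function.Consumer;
import java.util.function.Supplier;

// Merges two record streams that are each already ordered by the index byte at
// offset 20, producing a single output stream ordered by that index.
public class IndexMerge {
    public static void merge(Supplier<byte[]> readStream1,  // returns null when exhausted
                             Supplier<byte[]> readStream2,
                             Consumer<byte[]> output) {
        byte[] pending1 = readStream1.get();                 // one buffered record per stream
        byte[] pending2 = readStream2.get();
        while (pending1 != null && pending2 != null) {
            int index1 = pending1[20] & 0xFF;                // assumes a one-byte index at offset 20
            int index2 = pending2[20] & 0xFF;
            if (index1 <= index2) {
                output.accept(pending1);                     // send the record with the smaller index
                pending1 = readStream1.get();                // refill only the stream just consumed
            } else {
                output.accept(pending2);
                pending2 = readStream2.get();
            }
        }
        // Flush whichever stream still has records left.
        for (; pending1 != null; pending1 = readStream1.get()) output.accept(pending1);
        for (; pending2 != null; pending2 = readStream2.get()) output.accept(pending2);
    }
}
If the two streams are not each ordered by their own index, a single comparison per iteration is not enough and you would have to buffer and reorder records over some window instead.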

How to return the N documents closest to a specific key from a couchdb view

I have a view on a couchdb database which exposes a certain document property as a key:
function (doc) {
  if (doc.docType && doc.docType === 'CARD') {
    if (doc.elo) {
      emit(doc.elo, doc._id);
    } else {
      emit(1000, doc._id);
    }
  }
}
I'm interested in querying this db for the (say) 25 documents with keys closest to a given input. The only thing I can think to do is to set a search range and make repeated queries until I have enough results:
// pouchdb's query fcn
async function getNresultsClosestToK(key: number, limit: number) {
  let range = 20;
  let cards;
  do {
    cards = await this.db.query('elo', {
      limit,
      startkey: (key - range).toString(),
      endkey: (key + range).toString()
    });
    range += 20;
  } while (cards.rows.length < limit);
  return cards;
}
But this may require several calls and is inefficient. Is there a way to pass a single key and a limit to couch and have it return the limit documents closest to the supplied key?
If I understand correctly, you want to query for a specific key, then return 12 results before the key, the key itself, and 12 results after the key, for a total of 25 results.
The most direct way to do this is with two queries against your view, with the proper combination of startkey, limit, and descending values.
For example, to get the key itself, and the 12 values following, query your view with these options:
startkey: <your key>
limit: 13
descending: false
Then to get the 12 entries before your key, perform a query with the following options:
startkey: <your key>
limit: 13
descending: true
This will give you two result sets, each with (a maximum of) 13 items. Note that your target key will be repeated (it's in each result set). You'll then need to combine the two result sets.
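Since the question uses PouchDB but the combine step itself is independent of the client library, here is a rough sketch of that step in Java (the Row type and field names are placeholders assumed here, not part of any CouchDB client API):
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Combines the descending ("before") and ascending ("after") result sets,
// dropping the pivot row that appears in both, then sorts by key.
class CombineResults {
    record Row(double key, String id) {}

    static List<Row> combine(List<Row> beforeDescending, List<Row> afterAscending) {
        List<Row> combined = new ArrayList<>();
        if (!beforeDescending.isEmpty()) {
            // The descending result set starts with the pivot key itself (when that key
            // exists in the view); skip it so the pivot is only taken once.
            combined.addAll(beforeDescending.subList(1, beforeDescending.size()));
        }
        combined.addAll(afterAscending);
        combined.sort(Comparator.comparingDouble(Row::key));  // final order by key
        return combined;
    }
}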
Note this does have a few limitations:
It returns a maximum of 26 rows across the two queries (25 unique results once the repeated target key is removed). If your data does not contain 12 values before or after your target key, you'll get fewer results.
If you have duplicate keys, you may get unexpected results. In particular:
If your target key is duplicated, you'll get 25 - N unique results (where N is the number of duplicates of your target key)
If your non-target keys are duplicated, you have no way of guaranteeing which of the duplicate keys will be returned, so performing the same query multiple times may result in different return values.

Spring Data Mongo - Get the sum of an array of objects

I have the following document:
{
  pmv: {
    budgets: [
      {
        amount: 10
      },
      {
        amount: 20
      }
    ]
  }
}
and I need to sum the amount field of every object in budgets. But it's also possible that the budgets array doesn't exist, so I need to check for that.
How could I do this? I've seen many questions using projections, but I just need an integer number, which in this case would be 30.
How can I do it?
Thanks.
EDIT 1 FOR PUNIT
This is the code I tried, but it's giving me an empty array:
AggregationOperation filter = match(Criteria.where("pmv.budgets").exists(true).not().size(0));
AggregationOperation unwind = unwind("pmv.budgets");
AggregationOperation sum = group().sum("budgets").as("amount");
Aggregation aggregation = newAggregation(filter, unwind, sum);
mongoTemplate.aggregate(aggregation,"Iniciativas",String.class);
AggregationResults<String> aggregationa = mongoTemplate.aggregate(aggregation,"Iniciativas",String.class);
List<String> results = aggregationa.getMappedResults();
You can do this with an aggregation pipeline:
db.COLLECTION_NAME.aggregate([
  {$match: {"pmv.budgets": {$exists: true, $not: {$size: 0}}}},
  {$unwind: "$pmv.budgets"},
  {$group: {_id: null, amount: {$sum: "$pmv.budgets.amount"}}}
]);
This pipeline contains three stages:
$match keeps only documents that have a non-null, non-empty budgets array
$unwind opens up the array and creates one document per array element; e.g. if a document has 3 elements in budgets, it produces 3 documents, each with the budgets property set to one of the array elements. You can read more about it in the MongoDB documentation.
$group sums the amount property of every unwound element using the $sum operator
You can read more about the aggregation pipeline in the MongoDB documentation.
EDIT: as per the comments, adding the Java code as well.
AggregationOperation filter = match(Criteria.where("pmv.budgets").exists(true).not().size(0));
AggregationOperation unwind = unwind("pmv.budgets");
AggregationOperation sum = group().sum("pmv.budgets.amount").as("amount");
Aggregation aggregation = newAggregation(filter, unwind, sum);
AggregationResults<Output> results = mongoTemplate.aggregate(aggregation, COLLECTION_NAME, Output.class);
You can write this in a more inline way as well, but I wrote it like this so that it is easy to understand.
I hope this answers your question.
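For completeness, here is a minimal sketch of the result class the mapped results could bind to (the name Output is simply what the snippet above assumes; the field has to match the "amount" alias from the group stage):
// Minimal result holder for the aggregation output; the field name must match
// the "amount" alias produced by the group()/$group stage above.
public class Output {
    private int amount;

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }
}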

Result number for Boolean queries with Apache Lucene

When benchmarking Apache Lucene v7.5 I noticed a strange behavior:
I indexed the English Wikipedia dump (5,677,776 docs) using Lucene with the SimpleAnalyzer (no stopwords, no stemming).
Then I searched the index with the following queries:
the        totalHits=5,382,873
who        totalHits=1,687,254
the who    totalHits=5,411,305
"the who"  totalHits=8,827
The result count for the Boolean query the who is larger than both the result count for the single term the and the result count for the single term who, when it should be smaller than both.
Is there an explanation for that?
Code snippet:
Analyzer analyzer = new SimpleAnalyzer();
MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[]{"title", "content","domain","url"},analyzer);
// Parse
Query q = parser.parse(querystr);
// top-10 results
int hitsPerPage = 10;
IndexReader indexReader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(indexReader);
// Ranker
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage);
// Search
searcher.search(q, collector);
// Retrieve the top-10 documents
TopDocs topDocs=collector.topDocs();
ScoreDoc[] hits = topDocs.scoreDocs;
long totalHits = topDocs.totalHits;
System.out.println("query: " + querystr + " " + hits.length + " " + String.format("%,d", totalHits));
The explanation is that the default operator is OR and not AND as you assume. Searching for the who returns documents that have either the or who or both.
the - 5,382,873
who - 1,687,254
the OR who - 5,411,305
I.e. most documents that contain who also contain the, except for 28,432 documents, which are added to the result set when you retrieve both.
You can change this behavior by changing the default operator:
parser.setDefaultOperator(QueryParserBase.AND_OPERATOR)
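If you only want conjunctive matching for a particular query rather than changing the parser's default, you can also make the requirement explicit in the query string itself; a small example using the parser from the snippet above:
// Both forms require every term to be present, regardless of the default operator.
Query q1 = parser.parse("+the +who");
Query q2 = parser.parse("the AND who");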

Using Kafka Streams to create a new KStream containing multiple aggregations

I am sending JSON messages containing details about a web service request and response to a Kafka topic. I want to process each message as it arrives in Kafka using Kafka Streams and send the results as a continuously updated summary (JSON message) to a websocket to which a client is connected.
The client will then parse the JSON and display the various counts/summaries on a web page.
Sample input messages are as below
{
  "reqrespid":"048df165-71c2-429c-9466-365ad057eacd",
  "reqDate":"30-Aug-2017",
  "dId":"B198693",
  "resp_UID":"N",
  "resp_errorcode":"T0001",
  "resp_errormsg":"Unable to retrieve id details. DB Procedure error",
  "timeTaken":11,
  "timeTakenStr":"[0 minutes], [0 seconds], [11 milli-seconds]",
  "invocation_result":"T"
}
{
  "reqrespid":"f449af2d-1f8e-46bd-bfda-1fe0feea7140",
  "reqDate":"30-Aug-2017",
  "dId":"G335887",
  "resp_UID":"Y",
  "resp_errorcode":"N/A",
  "resp_errormsg":"N/A",
  "timeTaken":23,
  "timeTakenStr":"[0 minutes], [0 seconds], [23 milli-seconds]",
  "invocation_result":"S"
}
{
  "reqrespid":"e71b802d-e78b-4dcd-b100-fb5f542ea2e2",
  "reqDate":"30-Aug-2017",
  "dId":"X205014",
  "resp_UID":"Y",
  "resp_errorcode":"N/A",
  "resp_errormsg":"N/A",
  "timeTaken":18,
  "timeTakenStr":"[0 minutes], [0 seconds], [18 milli-seconds]",
  "invocation_result":"S"
}
As the stream of messages comes into Kafka, I want to be able to compute on the fly
total number of requests, i.e. a count of all messages
total number of requests with invocation_result equal to 'S'
total number of requests with invocation_result not equal to 'S'
total number of requests with invocation_result equal to 'S' and resp_UID equal to 'Y'
total number of requests with invocation_result equal to 'S' and resp_UID equal to 'N'
minimum time taken, i.e. min(timeTaken)
maximum time taken, i.e. max(timeTaken)
average time taken, i.e. avg(timeTaken)
and write them out to a KStream with the new key set to the reqDate value and the new value a JSON message containing the computed values, as shown below for the 3 messages above:
{
"total_cnt":3, "num_succ":2, "num_fail":1, "num_succ_data":2,
"num_succ_nodata":0, "num_fail_biz":0, "num_fail_tech":1,
"min_timeTaken":11, "max_timeTaken":23, "avg_timeTaken":17.3
}
I am new to Kafka Streams. How do I do the multiple counts, over differing columns, all in one step or as a chain of different steps? Would Apache Flink or Calcite be more appropriate? My understanding of a KTable suggests that you can only have a key, e.g. 30-AUG-2017, and then a single column value, e.g. a count, say 3. I need a resulting table structure with one key and multiple count values.
All help is very much appreciated.
You can just do a complex aggregation step that computes all those at once. I am just sketching the idea:
class AggResult {
    long total_cnt = 0;
    long num_succ = 0;
    // and many more
}

stream.groupBy(...).aggregate(
    new Initializer<AggResult>() {
        public AggResult apply() {
            return new AggResult();
        }
    },
    new Aggregator<KeyType, JSON, AggResult>() {
        public AggResult apply(KeyType key, JSON value, AggResult aggregate) {
            ++aggregate.total_cnt;
            if (value.get("invocation_result").equals("S")) {
                ++aggregate.num_succ;
            }
            // add more conditions to get all the other aggregate results
            return aggregate;
        }
    },
    // other parameters omitted for brevity
)
.toStream()
.to("result-topic");

Sending Items to specific partitions

I'm looking for a way to send structures to pre-determined partitions so that they can be used by another RDD.
Let's say I have two RDDs of key-value pairs:
val a: RDD[(Int, Foo)]
val b: RDD[(Int, Foo)]
val aStructure = a.reduceByKey(/* reduce into large data structure */)
b.mapPartitions { iter =>
  val usefulItem = aStructure(samePartitionKey)
  iter.map(/* process iterator */)
}
How could I go about setting up the partitioning such that the specific data structure I need will be present for the mapPartitions call, but without the extra overhead of sending all the values to every node (which would happen if I were to use a broadcast variable)?
One thought I have been having is to store the objects in HDFS, but I'm not sure if that would be a suboptimal solution.
Another thought I am currently exploring is whether there is some way I can create a custom Partitioner that could hold the data structure (although that might get too complicated and become problematic).
Thank you for your help!
Edit:
Pangea makes a very good point that I should offer some more specifics. Essentially I'm given an RDD of SparseVectors and an RDD of inverted indexes. The inverted index objects are quite large.
My hope is to do a mapPartitions over the RDD of vectors where I can compare each vector to the inverted index. The issue is that I only NEED one inverted index object per partition, and doing a join would cause me to have a lot of copies of that index.
val vectors: RDD[(Int, SparseVector)]
val invertedIndexes: RDD[(Int, InvIndex)] = a.reduceByKey(generateInvertedIndex)
vectors.mapPartitions { iter =>
  val invIndex = invertedIndexes(samePartitionKey)
  iter.map(invIndex.calculateSimilarity(_))
}
A Partitioner is a function that, given a key, returns the partition that key belongs in. It also decides the number of partitions.
There's a form of reduceByKey that takes a partitioner as an argument.
If I am understanding your question correctly, you want the data to be partitioned while doing the reduce.
See the example:
// create example data
val a = sc.parallelize(List((1, 1), (1, 2), (2, 3), (2, 4)))

// create a simple sample partitioner - 2 partitions, one for odd
// and one for even key.hashCode. You should put your partitioning logic here
val p = new Partitioner {
  def numPartitions: Int = 2
  def getPartition(key: Any): Int = key.hashCode % 2
}

// your reduceByKey function. Sample: just add
val f = (a: Int, b: Int) => a + b

val rdd = a.reduceByKey(p, f)
// here your rdd will be partitioned the way you want, with the number
// of partitions you want
rdd.partitions.size
// res8: Int = 2

// go on with your processing, e.g. rdd.map(...)
