How to count unique occurrences of a key in a CouchDB view

I have the following key/value pairs emitted from a CouchDB view map function:
{"key":101,"value":"ABC"}
{"key":101,"value":"ABC"}
{"key":101,"value":"ABC"}
{"key":101,"value":"XYZ"}
{"key":101,"value":"XYZ"}
{"key":101,"value":"XYZ"}
{"key":102,"value":"XYZ"}
{"key":102,"value":"XYZ"}
I need the output to be the unique-value count for each key:
{"key":101,"value":2}
{"key":102,"value":1}
How should I write the reduce function for this?

If the count of unique values per key is finite and we can hold them in a set, do the following inside the reducer:
for each value of a given key:
    add the value to a set
emit(key, set.size())
Example:
Key 101 - values [ABC, ABC, ABC, XYZ, XYZ, XYZ]
For 101, create a set s with elements [ABC, XYZ]
emit 101, 2 where 2 is s.size()
Key 102 - values [XYZ, XYZ]
For 102, create a set s with elements [XYZ]
emit 102, 1 where 1 is s.size()
The map should emit the key/value pairs as-is. Using a combiner that emits the distinct values per key is also recommended.
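In CouchDB itself, a custom reduce along these lines could look like the sketch below (the unique count is then the length of the returned array; this only stays safe while the number of distinct values per key is small, as assumed above):
function (keys, values, rereduce) {
  // Use an object as a set of distinct values
  var distinct = {};
  values.forEach(function (v) {
    if (rereduce) {
      // On rereduce, each value is an array returned by a previous reduce
      v.forEach(function (x) { distinct[x] = true; });
    } else {
      distinct[v] = true;
    }
  });
  return Object.keys(distinct);
}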

You can try changing your function to something like the one below and using the built-in _sum reduce function.
Assuming that val holds ABC, XYZ and num holds 101, 102
function (doc) {
  if (doc.val) {
    // Composite key: _sum then counts occurrences of each [num, val] pair
    emit([doc.num, doc.val], 1);
  }
}
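With that composite key and _sum, querying the view with group_level=2 returns one row per unique [num, val] pair, so tallying rows per num on the client gives exactly the unique-value counts asked for. A minimal sketch, assuming a PouchDB-style query API and a view named 'myview':
// Inside an async function; db and 'myview' are placeholder names
const res = await db.query('myview', { group_level: 2 });
const uniqueCounts = {};
for (const row of res.rows) {
  const num = row.key[0];
  uniqueCounts[num] = (uniqueCounts[num] || 0) + 1;
}
// uniqueCounts is now { 101: 2, 102: 1 }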

Related

PowerQuery choose values based on a key column

I have very large files which PowerQuery seems to handle nicely. I need to do some mathematical operations using column d and the value from column a, b, or c, depending on the value of the key column. My first thought is to isolate the salient value by creating a column called Salient that selects the value I need, and then go from there. In Excel, this might be: =INDEX($A:$E, ROW(F2), MATCH(A2,$A$1:$D$1)).
In reality, I have between 50 and 100 columns as well as millions of rows, so extra points for computational efficiency.
You can define a custom column Salient with just this as the definition:
Record.Field(_, [Key])
The M code for the whole step looks like this:
= Table.AddColumn(#"Prev Step Name", "Salient", each Record.Field(_, [Key]), Int64.Type)
The _ represents the current row, which is a record data type that can be expressed as e.g.
[Key = "a", a = 17, b = 99, c = 21, d = 12]
and you use Record.Field to pick the field corresponding to the Key.
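For instance, applying it to the sample row above (an illustrative call with a literal record):
// [Key] is "a", so Record.Field picks field a and returns 17
Record.Field([Key = "a", a = 17, b = 99, c = 21, d = 12], "a")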

How to return the N documents closest to a specific key from a couchdb view

I have a view on a couchdb database which exposes a certain document property as a key:
function (doc) {
  if (doc.docType && doc.docType === 'CARD') {
    if (doc.elo) {
      emit(doc.elo, doc._id);
    } else {
      emit(1000, doc._id);
    }
  }
}
I'm interested in querying this db for the (say) 25 documents with keys closest to a given input. The only thing I can think to do is to set a search range and make repeated queries until I have enough results:
// pouchdb's query fcn
async function getNresultsClosestToK(key: number, limit: number) {
  let range = 20;
  let cards;
  do {
    // keys are emitted as numbers, so query with numbers, not strings
    cards = await this.db.query('elo', {
      limit,
      startkey: key - range,
      endkey: key + range
    });
    range += 20;
  } while (cards.rows.length < limit);
  return cards;
}
But this may require several calls and is inefficient. Is there a way to pass a single key and a limit to couch and have it return the limit documents closest to the supplied key?
If I understand correctly, you want to query for a specific key, then return 12 results before the key, the key itself, and 12 results after the key, for a total of 25 results.
The most direct way to do this is with two queries against your view, with the proper combination of startkey, limit, and descending values.
For example, to get the key itself, and the 12 values following, query your view with these options:
startkey: <your key>
limit: 13
descending: false
Then to get the 12 entries before your key, perform a query with the following options:
startkey: <your key>
limit: 13
descending: true
This will give you two result sets, each with (a maximum of) 13 items. Note that your target key will be repeated (it's in each result set). You'll then need to combine the two result sets.
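A sketch of that combination step (PouchDB-style, reusing the 'elo' view from the question; getClosest and n are illustrative names, and this assumes the target key exists in the index):
async function getClosest(db, key, n) {
  const half = Math.floor(n / 2);
  // The target key plus the `half` entries at or after it
  const after = await db.query('elo', {
    startkey: key,
    limit: half + 1,
    descending: false
  });
  // The target key again, plus the `half` entries before it
  const before = await db.query('elo', {
    startkey: key,
    limit: half + 1,
    descending: true
  });
  // Drop the duplicated target from one set and restore ascending order
  return before.rows.slice(1).reverse().concat(after.rows);
}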
Note this does have a few limitations:
It returns a maximum of 26 results. If your data does not contain 12 values before or after your target key, you'll get fewer than 26 results.
If you have duplicate keys, you may get unexpected results. In particular:
If your target key is duplicated, you'll get 25 - N unique results (where N is the number of duplicates of your target key)
If your non-target keys are duplicated, you have no way of guaranteeing which of the duplicate keys will be returned, so performing the same query multiple times may result in different return values.

Query a table with primary key and two conditions on sort key

I'm trying to query a DynamoDB table using the partition key and a sort key. The sort key is a Unix date, so I want to query a given partition key for items whose sort key falls between two dates. I can currently achieve this with a table scan, but I need to move to a query for the speed benefit. I've been unable to find any decent examples online of people using both a partition key and a sort key to query their table.
I have carefully read through this https://docs.aws.amazon.com/sdk-for-go/api/service/dynamodb/#DynamoDB.Query and understand that my params must go within the KeyConditionExpression.
I have read through https://github.com/aws/aws-sdk-go/blob/master/service/dynamodb/expression/examples_test.go and understand it on the whole, but I just can't find the syntax for KeyConditionExpression.
I'd have thought it was something like this:
keyCond := expression.Key("accountId").
Equal(expression.Value(accountId)).
And(expression.Key("sortKey").
Between(expression.Value(fromDateDec), expression.Value(toDateDec)))
But this throws:
ValidationException: Invalid KeyConditionExpression: Incorrect operand type for operator or function; operator or function: BETWEEN, operand type: NULL
First, you need KeyAnd to combine the hash key and sort key conditions.
// keyCondition represents the key condition where the partition key
// "TeamName" is equal to value "Wildcats" and sort key "Number" is equal
// to value 1
keyCondition := expression.KeyAnd(expression.Key("TeamName").Equal(expression.Value("Wildcats")), expression.Key("Number").Equal(expression.Value(1)))
Now, instead of the equal condition, you can substitute your between condition as follows:
// keyCondition represents the boolean key condition of whether the value
// of the key "foo" is between values 5 and 10
keyCondition := expression.KeyBetween(expression.Key("foo"), expression.Value(5), expression.Value(10))
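Putting the two together for the question's schema (a hedged sketch; the attribute names "accountId" and "sortKey" and the table name are assumptions carried over from the question):
import (
    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/service/dynamodb"
    "github.com/aws/aws-sdk-go/service/dynamodb/expression"
)

func buildQueryInput(accountId string, fromDateDec, toDateDec int64) (*dynamodb.QueryInput, error) {
    // Partition-key equality combined with a BETWEEN on the sort key
    keyCond := expression.KeyAnd(
        expression.Key("accountId").Equal(expression.Value(accountId)),
        expression.Key("sortKey").Between(expression.Value(fromDateDec), expression.Value(toDateDec)))
    expr, err := expression.NewBuilder().WithKeyCondition(keyCond).Build()
    if err != nil {
        return nil, err
    }
    return &dynamodb.QueryInput{
        TableName:                 aws.String("myTable"), // assumed table name
        KeyConditionExpression:    expr.KeyCondition(),
        ExpressionAttributeNames:  expr.Names(),
        ExpressionAttributeValues: expr.Values(),
    }, nil
}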

Performance-wise for Lua table selection

I'm a bit new to Lua. In my game I need to capture entities and insert them into a table. The maximum number of entities that can exist at the same time is 14, so I read that an array-based solution is good.
But I saw that the table keeps its size even if we delete some value. For example, starting from a table of 10 values and deleting the value at index 9, the hole is not automatically closed when I go to insert value number 11.
Example:
local Table = {"hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello", "hello"}
-- Current Table size = 10
-- Perform delete at index 9
Table[9] = nil
-- Have new Entity to insert
Table[#Table + 1] = "New Value"
-- The table keeps growing as the game goes on.
So for this kind of situation, will an array-based table with nil holes inside, growing as new values are inserted, perform better, or should I move to a table with keys?
Or should I just stick with the array-based table and perform a full cleanup when the table isn't in use?
If you set an element in a table to nil, then that just stays there as a "hole" in your array.
tab = {1, 2, 3, 4}
tab[2] = nil
-- tab == {1, nil, 3, 4}
-- #tab is actually undefined and could be both 1 or 4 (or something completely unexpected)!
What you need to do is set the field to nil, then shift all the following fields to fill that hole. Luckily, Lua has a function for that, which is table.remove(table, index).
tab = {1, 2, 3, 4}
table.remove(tab, 2)
-- tab == {1, 3, 4}
-- #tab == 3
Keep in mind that this can get very slow as there's lots of memory access involved, so don't go applying this solution when you have a few million elements some day :)
While table.remove(Table, 9) will do the job in your case (removing field from "array" table and shifting remaining fields to fill the hole), you should first consider using "set" table instead.
If you:
- often remove/add elements
- don't care about their order
- often check if table contains a certain element
then the "set" table is your choice. Use it like so
local tab = {
["John"] = true,
["Jane"] = true,
["Bob"] = true,
}
Your elements will be stored as indices in a table.
Remove an element with
tab["Jane"] = nil
Test if table contains an element with
if tab["John"] then
-- tab contains "John"
Advantages compared to array table:
- this will eliminate performance overhead when removing an element because other elements will remain intact and no shifting is required
- checking whether an element exists in this table (which I assume is the main purpose of this table) is also faster than with an array table, because it no longer requires iterating over all the elements to find a match; a hash lookup is used instead
Note however that this approach doesn't let you have duplicate values as your elements, because tables can't contain duplicate keys. In that case you can use numbers as values to store how many times each element occurs in your set, e.g.
local tab = {
["John"] = 1,
["Jane"] = 2,
["Bob"] = 35,
}
Now you have 1 John, 2 Janes and 35 Bobs
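A minimal sketch of that counted-set ("multiset") idea, with illustrative add/remove helpers:
local set = {}

local function add(element)
  set[element] = (set[element] or 0) + 1
end

local function remove(element)
  local n = set[element]
  if n == nil then return end
  -- Drop the key entirely once the last copy is removed
  set[element] = n > 1 and n - 1 or nil
end

add("John"); add("Jane"); add("Jane")
remove("Jane")
print(set["Jane"]) --> 1
print(set["Bob"])  --> nil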
https://www.lua.org/pil/11.5.html

What is the use of grouping comparator in hadoop map reduce

I would like to know why grouping comparator is used in secondary sort of mapreduce.
According to the definitive guide example of secondary sorting
We want the sort order for keys to be by year (ascending) and then by
temperature (descending):
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C
By setting a partitioner to partition by the year part of the key, we can guarantee that
records for the same year go to the same reducer. This still isn’t enough to achieve our
goal, however. A partitioner ensures only that one reducer receives all the records for
a year; it doesn’t change the fact that the reducer groups by key within the partition.
Since we would have already written our own partitioner, which takes care of sending map output keys to a particular reducer, why should we also group?
Thanks in advance
In support of the chosen answer, I'll add the following explanation.
Input:
symbol time price
a 1 10
a 2 20
b 3 30
Map output: create composite keys/values like so:
symbol-time time-price
a-1 1-10
a-2 2-20
b-3 3-30
The partitioner will route the a-1 and a-2 keys to the same reducer despite the keys being different. It will also route b-3 to a separate reducer.
GroupComparator: once the composite keys/values arrive at the reducer, the reducer would get
(a-1, {1-10})
(a-2, {2-20})
because the composed key values are unique. The group comparator instead ensures the reducer gets, in a single reduce method call:
(a-1, {1-10, 2-20})
The key of the grouped values will be the one that comes first in the group; this can be controlled by the key comparator.
Let me improve the statement "... take care of the map output keys going to a particular reducer".
Reducer instance vs. reduce method:
One JVM is created per reduce task, and each of these has a single instance of the Reducer class. This is the Reducer instance (I'll call it the Reducer from now on). Within each Reducer, the reduce method is called multiple times depending on the key grouping. Each time reduce is called, 'valuein' holds a list of map output values grouped by the key you define in the grouping comparator. By default, the grouping comparator uses the entire map output key.
In the example, the map output key is changed to 'year and temperature' to achieve sorting. Unless you define a grouping comparator that uses only the 'year' part of the map output key, you can't make all records of the same year go to the same reduce method call.
You need to introduce an intermediate key that is a composite of the year and temperature; partition on the natural key (the year) and introduce a comparator that will sort on the entire composite key. You're right that by partitioning on the year you'll get all the data for a year in the same reducer, so the comparator will effectively sort the data for each year by the temperature.
The default partitioner calculates the hash of the key, and keys with the same hash value are sent to the same reducer. If your mapper emits a composite (natural + augment) key and you want keys with the same natural key to reach the same reducer, you have to implement a custom partitioner.
public class SimplePartitioner extends Partitioner<Text, LongWritable> {
    @Override
    public int getPartition(Text compositeKey, LongWritable value, int numReduceTasks) {
        // Split the composite key and hash only its natural part,
        // keeping the result non-negative and within the task range
        String naturalKey = compositeKey.toString().split("separator")[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
And now, if you want all the relevant rows within a partition of data to be sent to a single reduce call, you must also implement a grouping comparator that considers only the natural key:
public class SimpleGroupingComparator extends WritableComparator {
    protected SimpleGroupingComparator() {
        super(Text.class, true); // create Text instances for comparison
    }
    @Override
    public int compare(WritableComparable key1, WritableComparable key2) {
        // Group on the natural part of the composite key only
        String natural1 = key1.toString().split("separator")[0];
        String natural2 = key2.toString().split("separator")[0];
        return natural1.compareTo(natural2);
    }
}
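Wiring both into the job could then look like this (a sketch assuming the new org.apache.hadoop.mapreduce API; the sort comparator class name is illustrative):
Job job = Job.getInstance(new Configuration(), "secondary sort");
job.setPartitionerClass(SimplePartitioner.class);
job.setGroupingComparatorClass(SimpleGroupingComparator.class);
// Optionally, a sort comparator orders the full composite key so that
// values arrive sorted within each group:
// job.setSortComparatorClass(CompositeKeyComparator.class);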
