Using Cassandra to count a big list of data - sorting

We are using Cassandra to count various analytics metrics, broken down by account and date, which seems to be working well:
SELECT COUNT(page_impressions) FROM analytics WHERE account='abc' and MINUTE > '2015-01-01 00:00:00';
We would like to further break down this data by domain, which causes a problem. The number of possible domains would run into the millions for some accounts over the span of a month or so, and we are most interested in the 'top' domains, meaning that we would like to sort by the page_impressions field.
Does anybody have pointers for me on how to count by domain and sort by total page impressions?
Thanks!

As noted by Stefan, I would definitely recommend Spark for analysis like this. Also, be sure not to run an actual sort if at all possible for Top N queries; these can usually be accomplished, without the shuffle a sort requires, by functions like the following (from the RDD API docs):
http://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.rdd.RDD
takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the first k (smallest) elements from this RDD as defined by the specified implicit Ordering[T] and maintains the ordering. This does the opposite of top. For example:
sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)
// returns Array(2)
sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)
// returns Array(2, 3)
num: k, the number of elements to return
ord: the implicit ordering for T
returns: an array of top elements
and
top(num: Int)(implicit ord: Ordering[T]): Array[T]
Returns the top k (largest) elements from this RDD as defined by the specified implicit Ordering[T].
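For context, here is a hedged sketch of the kind of Top N job this could become with the DataStax spark-cassandra-connector; the keyspace, table, and column names are assumptions for illustration, not taken from the question.
import com.datastax.spark.connector._
import org.apache.spark.SparkContext._

val topN = 10
val topDomains = sc
  .cassandraTable("analytics_ks", "analytics")                        // assumed keyspace/table
  .select("domain", "page_impressions")
  .where("account = ? AND minute > ?", "abc", "2015-01-01 00:00:00")
  .map(row => (row.getString("domain"), row.getLong("page_impressions")))
  .reduceByKey(_ + _)                                                 // total impressions per domain
  .top(topN)(Ordering.by[(String, Long), Long](_._2))                 // Top N without a global sort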

Cassandra supports counters, which could be useful for maintaining a top-domain list in a separate table.
You might also be interested in using an analytics engine such as Presto or Spark with Cassandra, because it is generally not very practical to adapt your data model to every different analytics use case.

Related

Simplistic way to represent tree-like data?

I have some JSON data that I want to match to a particular array of IDs. So, for example, the JSON {temperature: 80, weather: tornado} can map to an array of IDs [15, 1, 82]. This array of IDs is completely arbitrary and something I will define myself for that particular input; it's simply meant to give recommendations based on conditions.
So while a temperature >= 80 in tornado conditions always maps to [15, 1, 82], the same temperature in cloudy conditions might be [1, 16, 28], and so on.
The issue is that there are a LOT of potential "branches". My program has 7 times of day, each of those time-of-day nodes has 7 potential temperature ranges, and each of those temperature-range nodes has 15 possible weather events. So manually writing if statements for 735 combinations (if I did the math correctly) would be very unwieldy.
I have drawn a "decision tree" representing one path for demonstration purposes, above.
What are some recommended ways to represent this in code besides massively nested conditionals/case statements?
Thanks.
No need for massive branching. It's easy enough to create a lookup table with the 735 possible entries. You said that you'll add the values yourself.
Create enums for each of your times of day, temperature ranges, and weather events. So your times of day are mapped from 0 to 6, your temperature ranges are mapped from 0 to 6, and your weather events are mapped from 0 to 14. You basically have a 3-dimensional array. And each entry in the array is a list of ID lists.
In C# it would look something like this:
List<List<int>>[,,] lookupTable = new List<List<int>>[7, 7, 15];
To populate the lookup table, write a program that generates JSON that you can include in your program. In pseudocode:
for (i = 0 to 6) {        // loop for time of day
  for (j = 0 to 6) {      // loop for temperature ranges
    for (k = 0 to 14) {   // loop for weather events
      // here, output JSON for the record
      // You'll probably want a comment with each record
      // to say which combination it's for.
      // The JSON here is basically just the list of
      // ID lists that you want to assign.
    }
  }
}
Perhaps you want to use that program to generate the JSON skeleton (i.e. one record for each [time-of-day, temperature, weather-event] combination), and then manually add the list of ID lists.
It's a little bit of preparation, but in the end your lookup is dead simple: convert the time-of-day, temperature, and weather event to their corresponding integer values, and look it up in the array. Just a few lines of code.
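As an illustration only, a minimal Scala sketch of that lookup might look like the following; the enumeration names and the single populated entry are made-up placeholders, not values from the question.
object TimeOfDay extends Enumeration { val EarlyMorning, Morning, Midday, Afternoon, Evening, Night, LateNight = Value }
object TempRange extends Enumeration { val Freezing, Cold, Cool, Mild, Lukewarm, Warm, Hot = Value }
object Weather extends Enumeration { val Clear, Cloudy, Rain, Squall, Tornado = Value } // ... 15 values in total

// 7 x 7 x 15 table; each cell is a list of ID lists, normally loaded from the generated JSON.
val lookup: Array[Array[Array[List[List[Int]]]]] = Array.fill(7, 7, 15)(Nil)

// Populate one cell, then look it up in constant time:
lookup(TimeOfDay.EarlyMorning.id)(TempRange.Lukewarm.id)(Weather.Squall.id) = List(List(15, 1, 82))
val ids = lookup(TimeOfDay.EarlyMorning.id)(TempRange.Lukewarm.id)(Weather.Squall.id)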
You could do something similar with a map or dictionary. You'd generate the JSON as above, but rather than load it into a three-dimensional array, load it into your dictionary with the key being the concatenation of the three dimensions. For example, a key would be:
"early morning,lukewarm,squall"
There are probably other lookup table solutions, as well. Those are the first two that I came up with. The point is that you have a whole lot of static data that's very amenable to indexed lookup. Take advantage of it.

Retrieve the average count in count-min-sketch datastructure

I am in love with probabilistic data structures. For my current problem, it seems that the count-min-sketch structure is almost the right candidate. I want to use count-min-sketch to store events per ID.
Let's assume I do have the following
Map<String, Int> {
  [ID1, 10],
  [ID2, 12],
  [ID3, 15]
}
If I use a count-min-sketch, I can query the data structure by IDs and retrieve the ~counts.
Question
Actually I am interested in the average occurrence over all IDs, which in the example above would be 12.33. If I am using the count-min sketch, then it seems that I need to store the set of IDs, iterate over the set, query the count-min for each ID, and calculate the average. Is there an improved way that avoids storing all IDs? Ideally I just want to retrieve the average right away without remembering all IDs.
Hope that makes sense!?
You should be able to calculate the average count if you know the number of entries, and the number of distinct entries:
averageCount = totalNumberOfEntries / numberOfDistinctEntries
Right? And to calculate the number of distinct entries, you can use e.g. HyperLogLog. You already added the hyperloglog tag to your question, so maybe you already know this?
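As a rough illustration of that formula, the sketch below keeps an exact running total alongside a HyperLogLog for the distinct-ID estimate; the choice of stream-lib's HyperLogLog and its parameter are my assumptions, not something from the question.
import com.clearspring.analytics.stream.cardinality.HyperLogLog

class AverageTracker(log2m: Int = 12) {
  private val distinctIds = new HyperLogLog(log2m) // estimated number of distinct IDs
  private var totalEntries = 0L                    // exact running total of all counts

  def add(id: String, count: Long): Unit = {
    distinctIds.offer(id)
    totalEntries += count
  }

  // averageCount = totalNumberOfEntries / numberOfDistinctEntries
  def average: Double = totalEntries.toDouble / math.max(1L, distinctIds.cardinality())
}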

Spark sort by key and then group by to get ordered iterable?

I have a pair RDD (K, V) with the key containing a time and an ID. I would like to get a pair RDD of the form (K, Iterable<V>) where the keys are grouped by ID and the iterable is ordered by time.
I'm currently using sortByKey().groupByKey(), and my tests seem to show that it works; however, I'm reading that it may not always be the case, as discussed in this question with diverging answers ( Does groupByKey in Spark preserve the original order? ).
Is it correct or not?
Thanks!
The answer from Matei, who I consider authoritative on this topic, is quite clear:
The order is not guaranteed actually, only which keys end up in each
partition. Reducers may fetch data from map tasks in an arbitrary
order, depending on which ones are available first. If you’d like a
specific order, you should sort each partition. Here you might be
getting it because each partition only ends up having one element, and
collect() does return the partitions in order.
In that context, a better option would be to apply the sorting to the resulting collections per key:
rdd.groupByKey().mapValues(_.toSeq.sorted)
The Spark Programming Guide offers three alternatives if one desires predictably ordered data following shuffle:
mapPartitions to sort each partition using, for example, .sorted
repartitionAndSortWithinPartitions to efficiently sort partitions while simultaneously repartitioning
sortBy to make a globally ordered RDD
As written in the Spark API, repartitionAndSortWithinPartitions is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
The sorting, however, is computed by looking only at the keys K of tuples (K, V). The trick is to put all the relevant information in the first element of the tuple, like ((K, V), null), and to define a custom partitioner and a custom ordering. This article describes the technique pretty well.
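As a rough sketch of that trick, assuming the composite key is already (id, time) and the values are simple strings (all names and types below are illustrative):
import org.apache.spark.Partitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Partition only by the ID part of the composite key; the shuffle then sorts each
// partition by the full (id, time) key, so values arrive grouped by ID and ordered by time.
class IdPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (id: String, _) => ((id.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

def sortedWithinIds(rdd: RDD[((String, Long), String)], partitions: Int): RDD[((String, Long), String)] =
  // Tuples are ordered lexicographically, so (id, time) sorts by id first, then time.
  rdd.repartitionAndSortWithinPartitions(new IdPartitioner(partitions))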

Can elasticsearch create histograms by X occurences of a field?

I'm not seeing how this would be done, but is it possible to have a facet that uses an interval to give the stats on every X number of occurrences? As an example, if net was a sequence of numbers ordered by date like:
1,2,3,4,5,6,7
and I set the interval to 2, I would like to get back a histogram like:
{ count: 2, value: 3 },
{ count: 2, value: 7 },
{ count: 2, value: 11 },
...
Elasticsearch doesn't support such an operation out of the box. It's possible to write such a facet, but it's not very practical, since it would require writing a quite complex custom facet processor and, optionally, controlling the way records are split into shards (so-called routing).
In elasticsearch, any operation that relies on the global order of elements is somewhat problematic from an architectural perspective. Elasticsearch splits records into shards, and most operations, including searching and facet calculation, occur on shards; the results of these shard-level operations are then collected and merged into a global result. This is basically a map/reduce architecture, and it is the key to elasticsearch's horizontal scalability.
An optimal implementation of your facet would require changing routing in such a way that records are split into shards based on their order rather than the hash code of the id. Alternatively, it could be done by limiting the shard-level phase to just extraction of the field values and performing the actual calculation of the facet in the merge phase. The latter approach seems more practical, but at the same time it is not much different from simply extracting the field values for all records and doing the calculations on the client side, which is exactly what I would suggest doing here: just extract all values using the desired sort order and calculate all stats on the client. If the number of records in your index is large, you can use the Scroll API to retrieve all records using multiple requests.
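For completeness, here is a small, purely illustrative sketch of that client-side calculation once the values have been extracted in the desired order (the helper name is mine, not an elasticsearch API):
// Bucket a pre-sorted sequence of values into groups of `interval` occurrences
// and report each bucket's count and sum, matching the example in the question.
def bucketEvery(values: Seq[Double], interval: Int): Seq[(Int, Double)] =
  values.grouped(interval).map(group => (group.size, group.sum)).toSeq

// bucketEvery(Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0), 2)
// => Seq((2, 3.0), (2, 7.0), (2, 11.0), (1, 7.0))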

Do I need to implement a b-tree search for this?

I have an array of integers, which could run into the hundreds of thousands (or more), sorted numerically ascending since that's how they were originally stacked.
I need to be able to query the array to get the index of its first occurrence of a number >= some input, as efficiently as possible. The only way I would know how to do this without even thinking about it would be to iterate through the array testing the condition until it returns true, at which point I'd stop iterating. However, this is the most expensive solution to this problem and I'm looking for the best algorithm to solve it.
I'm coding in Objective-C, but I'll give an example in JavaScript to broaden the audience of people who are able to respond.
// Sample set
var numbers = [1, 7, 23, 23, 23, 89, 1002, 1003];
var indexAfter100 = getIndexOfValueGreaterThan(100);
var indexAfter7 = getIndexOfValueGreaterThan(7);
// (indexAfter100 == 6) == true
// (indexAfter7 == 2) == true
Putting this data into a DB in order to perform this search will only be a last-resort solution since I'm keen to see some sort of algorithm to tackle this quickly in memory.
I do have the ability to change the data structure, or to store an additional data structure as I'm building the array, since my program has already pushed each number one by one onto this stack, so I'd just modify the code that's adding them to the stack. Searching for the index as they're being added to the stack isn't possible since the search operation will be repeated frequently with different values after the fact.
Right now I'm thinking "B-Tree" but to be honest, I would have no idea how to implement one and before I go off and start figuring that out, I wonder if there's a nice algorithm that fits this single use-case better?
You should use binary search. Objective-C may even have a built-in method for that (many languages I know do). A B-tree probably won't help much, unless you want to store the data on disk.
I don't know about Objective-C, but C (plain ol' C) comes with a function called bsearch (besides, AFAIK, Obj-C can call C functions just fine):
http://www.cplusplus.com/reference/clibrary/cstdlib/bsearch/
That basically does a binary search which sounds like it's what you need.
A fast search algorithm should be able to handle an array of ints of that size without taking too long, I should think (and the array is sorted, so a binary search would probably be the way to go).
I think a btree is probably overkill...
Since they are sorted in ascending order and you only need the bigger ones, I would serialize that array, explode it by the int, keep the part of the serialized string that holds the bigger ints, then unserialize it, and voilà.
Linear search, also referred to as sequential search, looks at each element in sequence from the start to see if the desired element is present in the data structure. When the amount of data is small, this search is fast. It's easy, but the work needed is in proportion to the amount of data to be searched: doubling the number of elements will double the time to search if the desired element is not present.
Binary search is efficient for larger arrays. Here we check the middle element: if its value is bigger than what we are looking for, we look in the first half; otherwise, we look in the second half. Repeat this until the desired item is found. The table must be sorted for binary search. It eliminates half the data at each iteration, so it's logarithmic.
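As a concrete, language-agnostic sketch of the binary search these answers describe, specialized to the question's "first index holding a value greater than the input" query (written in Scala purely for illustration):
// Return the index of the first element strictly greater than `target`,
// or sorted.length if every element is <= target. The input must be sorted ascending.
def indexOfFirstGreaterThan(sorted: IndexedSeq[Int], target: Int): Int = {
  var lo = 0
  var hi = sorted.length                     // the answer always lies in [lo, hi]
  while (lo < hi) {
    val mid = lo + (hi - lo) / 2
    if (sorted(mid) <= target) lo = mid + 1  // everything up to mid is too small
    else hi = mid                            // sorted(mid) qualifies; keep it as a candidate
  }
  lo
}

// indexOfFirstGreaterThan(Vector(1, 7, 23, 23, 23, 89, 1002, 1003), 100) == 6
// indexOfFirstGreaterThan(Vector(1, 7, 23, 23, 23, 89, 1002, 1003), 7)   == 2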
