I use the batsd library (which is based on statsd) with the jeremy/statsd-ruby client for my Ruby web application (Rails), and I have to keep simple visit statistics. Great! I use the statsd.increment('users.visits') method from the gem above.
Then I noticed that this operation creates a new sorted set (zset) once and adds one element (it looks like "1338932870<X>1) every time.
Why does statsd use this approach? Wouldn't it be easier and faster to use the HINCRBY method with a simple hash (rather than ZADD to a zset)?
I know statsd is a good and well-known tool, but I wonder: is this the standard pattern for counters in Redis? I'm new to Redis and NoSQL in general. Thank you!
I'm not familiar with the package, but if you just use HINCRBY, you will only maintain the current value of the metric in Redis. I guess a statistics package needs to store the evolution of the metric (in order to plot a graph over time or something similar).
Using a zset is a way to store the events ordered by timestamp (i.e. a time series), and therefore to keep a history of the evolution of this metric. It is slower and consumes much more memory than just keeping the last value, but you have the history. See Noah's comment below for the full story.
Using HINCRBY or INCRBY to aggregate counters in real time, and using a zset to store a time series, are two common Redis patterns.
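The two patterns can be sketched side by side with a small in-memory model (no Redis server involved; in real Redis these would be HINCRBY and ZADD/ZRANGEBYSCORE calls through a client library, and the metric name is made up):

```python
import time
from collections import defaultdict

class MetricStore:
    def __init__(self):
        self.counters = defaultdict(int)     # hash: metric -> running total (HINCRBY)
        self.timeseries = defaultdict(list)  # zset: metric -> [(timestamp, value)]

    def record(self, metric, value=1, ts=None):
        ts = ts if ts is not None else time.time()
        self.counters[metric] += value               # cheap: one number per metric
        self.timeseries[metric].append((ts, value))  # costly: one entry per event

    def total(self, metric):
        return self.counters[metric]

    def in_window(self, metric, start, end):
        # ZRANGEBYSCORE equivalent: events whose timestamp falls in [start, end]
        return [(t, v) for t, v in self.timeseries[metric] if start <= t <= end]

store = MetricStore()
store.record("users.visits", ts=100)
store.record("users.visits", ts=160)
store.record("users.visits", ts=400)

print(store.total("users.visits"))                   # 3 (all-time total)
print(len(store.in_window("users.visits", 0, 200)))  # 2 (events in the window)
```

The counter answers "how many, ever" in O(1) space; the time series answers "how many, between t1 and t2" at the cost of one stored entry per event, which is exactly the trade-off statsd is making.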
Serializing boost::uuids has a cost, and using them to index into a vector / unordered_map requires additional hashing. What are the use cases where boost::uuids are the ideal data structure to use?
UUIDs are valuable if you want IDs that are stable in time and across storage systems.
Imagine having two databases, each with auto-generated IDs.
Merging them would be a headache if the IDs are generated by incrementing integral values from 0.
Merging them would be a breeze if all relevant IDs are UUID.
Likewise, handing out a lot of data to an external party, who records operations offline, and subsequently applying those operations back onto the original data is much easier with UUIDs - even if relations between elements have been changed, new elements created, etc.
UUID is also handy for universal "identification" (not authentication/authorization!) - like in driver versions, plugin ids etc. Think about detecting that an MSI is an update to a particular installed software package.
In general, I wouldn't rate UUIDs a characteristic of any data structure. I'd rate it a tool in designing your infrastructure. It plays on the level of persistence, exchange, not so much on the level of algorithms and in-memory manipulation.
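A toy illustration of the merge scenario (the names and records are made up; Python's uuid module stands in for boost::uuids, which behaves the same way for this purpose):

```python
import uuid

# Two databases keyed by auto-incremented integers: merging collides.
db_a_int = {1: "alice", 2: "bob"}
db_b_int = {1: "carol"}                    # id 1 already exists in db_a_int
collisions = set(db_a_int) & set(db_b_int)
print(collisions)  # {1} -- someone has to renumber before merging

# The same two databases keyed by UUIDs: merging is a plain union.
db_a = {uuid.uuid4(): "alice", uuid.uuid4(): "bob"}
db_b = {uuid.uuid4(): "carol"}
merged = {**db_a, **db_b}
print(len(merged))  # 3 -- no collisions, no renumbering
```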
I have a data pipeline system where all events are stored in Apache Kafka. There is an event processing layer, which consumes and transforms that data (time series) and then stores the resulting data set into Apache Cassandra.
Now I want to use Apache Spark in order to train some machine learning models for anomaly detection. The idea is to run the k-means algorithm on the past data, for example for every single hour in a day.
For example, I can select all events from 4pm-5pm and build a model for that interval. If I apply this approach, I will get exactly 24 models (centroids for every single hour).
If the algorithm performs well, I can reduce the size of my interval to be for example 5 minutes.
Is it a good approach to do anomaly detection on time series data?
I have to say that this strategy is good for finding outliers, but you need to take care of a few things. First, using all the events of every 5-minute window to create a new centroid per window might not be a good idea, because with too many centroids you make it really hard to find the outliers, and that is what you don't want.
So let's see a good strategy:
Find a good value of k for your k-means.
This is really important: if you have too many or too few clusters, you get a bad representation of reality. So select a good k.
Take a good training set.
You don't need to use all the data to create a model every time, every day. You should take a sample of what is normal for you. You don't need to include what is not normal, because that is what you want to find. Use this to create your model and then find the clusters.
Test it!
You need to test whether it is working well or not. Do you have any examples of data that you consider strange? And a set that you know is not strange? Use these to check whether the model works. Cross-validation can help with this.
So, is your idea good? Yes! It works, but make sure not to overfit the clusters. And of course you can take each day's data to train your model further, but run the process of finding the centroids only once a day, and let the Euclidean distance method decide what is or is not in your groups.
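The per-bucket idea above can be sketched with a tiny 1-D k-means; the data, k, and the distance threshold below are all invented for illustration (a real job would run this per hourly interval on Spark):

```python
import random

def kmeans_1d(points, k=2, iters=20):
    """Plain Lloyd's algorithm on a list of numbers."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as its cluster mean (keep old one if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def is_anomaly(p, centroids, threshold):
    # A point is anomalous if it is far from every centroid.
    return min(abs(p - c) for c in centroids) > threshold

random.seed(0)
hour_16_values = [9, 10, 11, 49, 50, 51]   # pretend: metrics seen 4pm-5pm
centroids = kmeans_1d(hour_16_values)      # converges to ~10 and ~50

print(is_anomaly(200, centroids, threshold=15))  # True  -- far from both clusters
print(is_anomaly(11, centroids, threshold=15))   # False -- close to a centroid
```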
I hope that I helped you!
I have a series of events flowing through a system (e.g. a pizza-ordering system) and I want to count certain properties of each event through time. For example, I might want to see how many unique people ordered pepperoni pizza in the last 5 minutes, or how many pizzas John Doe ordered in the past week.
It is a LOT of events, so we're using something like Cassandra or HBase because even the counts can't be stored in memory. Also, since we need to keep track of set membership (in order to count unique people ordering a particular kind of pizza, for example), it gets bigger.
We could store a list of orders and then query to count, but this is slow. And we mostly don't care who ordered pepperoni pizza, just how many unique orders were made, and in a given time window.
What's the best way to store this information, for example in Cassandra, such that the information can be retrieved in some time intervals?
I tried at first to use Redis + Bloom filters, but storing a Bloom filter bit vector would require transactions to avoid race conditions, so then I used Redis sets.
Then I realized the whole thing was too big to fit in memory, so I decided to switch to a disk-backed store. However, there are no native sets there like in Redis.
I looked at sketches / streaming algorithms like HyperLogLog, but the conclusion was that to save the HyperLogLog object, I need to store its bit array (or pickle the object, or whatever)... is that kosher, and what are the best practices for this, if this is indeed the solution?
I was tempted to save each event individually with a timestamp, then query and count on demand, but this is slow. I'm looking for something better, if it exists.
Example Requests:
How many unique people had a pepperoni pizza order in the past 10 minutes?
How many unique pepperoni pizzas were ordered by some person John Doe in the past 30 minutes?
There are a few ways to approach this problem from what I have learned.
Use locking plus a set-membership / counting data structure, e.g. a HyperLogLog or Bloom filter. As long as there's not much contention over any particular lock, things should be okay.
Use a database that has built-in sets/collections support. They pretty much implement #1 internally.
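One way to sketch the windowed unique counting is with per-minute buckets, using exact sets here for clarity; in production each bucket could instead be a HyperLogLog serialized into the disk-backed store, with the same bucketing logic (the events below are invented):

```python
from collections import defaultdict

class WindowedUniques:
    """Time-bucketed unique counting: one set per fixed-size bucket."""

    def __init__(self, bucket_secs=60):
        self.bucket_secs = bucket_secs
        self.buckets = defaultdict(set)  # bucket_start -> set of member ids

    def add(self, member, ts):
        bucket = int(ts) - int(ts) % self.bucket_secs
        self.buckets[bucket].add(member)

    def count_last(self, now, window_secs):
        # Union every bucket that overlaps [now - window_secs, now].
        seen = set()
        for start, members in self.buckets.items():
            if start + self.bucket_secs > now - window_secs and start <= now:
                seen |= members
        return len(seen)

w = WindowedUniques()
w.add("alice", ts=0)
w.add("bob", ts=30)
w.add("alice", ts=600)   # alice orders again, 10 minutes later

print(w.count_last(now=600, window_secs=600))  # 2 unique people in 10 minutes
print(w.count_last(now=600, window_secs=60))   # 1 unique person in the last minute
```

Swapping the exact set for a HyperLogLog changes only the per-bucket storage (fixed size, approximate count); the bucket union becomes an HLL merge.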
My guesses:
Cassandra supports counters. I think I saw an incr operation that should work concurrently. By keeping a free-running counter per event, you just need to set something up that samples all counters at specified intervals (5 min?); then you can give estimates between two samples.
(http://wiki.apache.org/cassandra/Counters)
Cassandra can also expire a column. I never really used that feature, but it might be worth a try.
I have a variety of data that I've got cached in a standard Redis hashmap, and I've run into a situation where I need to respond to client requests for ordering and filtering. Order rankings for name, average rating, and number of reviews can change regularly (multiple times a minute, possibly). Can anyone advise me on a proper strategy for attacking this problem? Consider the following example to help understand what I'm looking for:
Client makes an API request to /api/v1/cookbooks?orderBy=name&limit=20&offset=0
I should respond with the first 20 entries, ordered by name
Strategies I've considered thus far:
for each type of hashmap store (cookbooks, recipes, etc), creating a sorted set for each ordering scheme (alphabetical, average rating, etc) from a Postgres ORDER BY; then pulling out ZRANGE slices based on limit and offset
storing ordering data directly into the JSON string data for each key.
hitting Postgres with a SELECT id FROM table ORDER BY _, and using the IDs to pull directly from the hashmap store
Any additional thoughts or advice on how to best address this issue? Thanks in advance.
So, as mentioned in a comment below, sorted sets are a great way to implement sorting and filtering functionality in a cache. Take the following example as an idea of how one might solve the issue of needing to order objects in a hash:
Given a hash called "movies" with the scheme bucket:objectId -> object, where the object is a JSON string representation (read about "bucketing" your hashes for performance here).
Create a sorted set called "movieRatings", where each member is an objectId from your "movies" hash, and its score is an average of all rating values (computed by the database). Just use a numerical representation of whatever you're trying to sort, and Redis gives you a lot of flexibility on how you can extract the slices you need.
This simple scheme has a lot of flexibility in what can be achieved - you simply ask your sorted set for a set of keys that fit your requirements, and look up those keys with HMGET from your "movies" hash. Two swift Redis calls, problem solved.
Rinse and repeat for whatever type of ordering you need, such as "number of reviews", "alphabetically", "actor count", etc. Filtering can also be done in this manner, but normal sets are probably quite sufficient for that purpose.
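An in-memory sketch of the pattern (in Redis the two steps would be ZREVRANGE on the sorted set followed by HMGET on the hash; the movie data here is invented):

```python
import json

movies = {  # hash: objectId -> JSON string representation
    "m1": json.dumps({"title": "Alien", "rating": 8.5}),
    "m2": json.dumps({"title": "Brazil", "rating": 7.9}),
    "m3": json.dumps({"title": "Clue", "rating": 7.2}),
}
movie_ratings = {"m1": 8.5, "m2": 7.9, "m3": 7.2}  # sorted set: id -> score

def top_by_rating(limit, offset=0):
    # ZREVRANGE equivalent: ids ordered by score, highest first.
    ids = sorted(movie_ratings, key=movie_ratings.get, reverse=True)
    page = ids[offset:offset + limit]
    # HMGET equivalent: fetch the stored objects for just those ids.
    return [json.loads(movies[i])["title"] for i in page]

print(top_by_rating(limit=2))            # ['Alien', 'Brazil']
print(top_by_rating(limit=2, offset=1))  # ['Brazil', 'Clue']
```

`limit` and `offset` map directly onto the start/stop arguments of the range call, which is what makes the API's pagination parameters cheap to serve.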
This depends on your needs. Each of your strategies could work.
Your first approach of storing an auxiliary sorted set for each way you want to order is the best way to do this if you have a very big hash and/or you run your order queries frequently. This approach will require a lot of RAM if your hash is big, but it will also scale well in terms of time complexity as your hash gets bigger and you start running order queries more frequently. On the other hand, it introduces complexity in your data structures, and feels like you're trying to use Redis for something a typical DB like Postgres, MySQL, or Mongo would be better at.
Storing ordering data directly into your keys means you need to pull your entire hash every time you do an order query. Maybe that's not so bad if your hash is very small, or you don't do ordered queries very often, but this won't scale at all.
If you're already hitting Postgres to get keys, why not just store the values in Postgres as well? That would be much cheaper than hitting Postgres and then hitting Redis, and would have your code depend on fewer things. IMO, this is probably your best option and would work most naturally. Do this, unless you have some really good reason not to store values in Postgres, or some really big speed concerns, in which case go with your first strategy.
I recently spoke to someone who works for Amazon, and he asked me: how would I go about sorting terabytes of data using a programming language?
I'm a C++ guy and, of course, we spoke about merge sort; one of the possible techniques is to split the data into smaller chunks, sort each of them, and finally merge them.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know they store tons of information, but do they sort it?
In a nutshell, my question is: why wouldn't they keep the data sorted in the first place, instead of sorting terabytes of it?
But in reality, do companies like Amazon/eBay sort terabytes of data? I know they store tons of info, but sorting them???
Yes. Last time I checked Google processed over 20 petabytes of data daily.
Why wouldn't they keep them sorted in the first place, instead of sorting terabytes of data? That is my question in a nutshell.
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sorted data that way. You don't have to sort the entire dataset.
Consider log data from servers: Amazon must have a huge amount of it. Log data is generally stored as it is received, that is, sorted by time. Thus, if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: though not a terabyte, I recently sorted around 24 GB of Twitter follower-network data using merge sort. The implementation I used was by Prof. Dan Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted by userid, and each line contained a userid followed by the userid of a person following them. However, in my case I wanted data about who follows whom, so I had to sort it again by the second userid on each line.
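The external merge sort technique mentioned above can be sketched in a few lines: sort chunks that fit in memory, spill each sorted run to a temporary file, then k-way merge the runs. The chunk size below is tiny purely for illustration; a real job would use whatever fits in RAM:

```python
import heapq
import os
import tempfile

def external_sort(lines, chunk_size=3):
    # Phase 1: sort each chunk in memory and spill it to a temp file ("run").
    run_files = []
    for i in range(0, len(lines), chunk_size):
        chunk = sorted(lines[i:i + chunk_size])
        f = tempfile.NamedTemporaryFile("w+", delete=False)
        f.writelines(l + "\n" for l in chunk)
        f.seek(0)
        run_files.append(f)
    # Phase 2: k-way merge the sorted runs; heapq.merge streams lazily,
    # so only one line per run needs to be in memory at a time.
    runs = [(l.rstrip("\n") for l in f) for f in run_files]
    merged = list(heapq.merge(*runs))
    for f in run_files:
        f.close()
        os.unlink(f.name)
    return merged

# Toy follower data: "<userid> <follower-userid>" per line.
data = ["u5 u1", "u2 u9", "u7 u3", "u1 u4", "u3 u2"]
print(external_sort(data))
# ['u1 u4', 'u2 u9', 'u3 u2', 'u5 u1', 'u7 u3']
```

Sorting by the second userid instead is just a matter of giving both the chunk sort and the merge a key that splits each line and compares the second field.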
However, for sorting 1 TB I would use map-reduce with Hadoop.
Sort is the default step after the map function. Thus I would choose the identity function as the mapper, set NONE as the reduce function, and set up streaming jobs.
Hadoop uses HDFS, which stores data in huge blocks of 64 MB (this value can be changed). By default it runs a single map per block. After the map function runs, its output is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data, then I would make that element the key in XXX and the line the value, as the output of the map.
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes. Some companies do. Or maybe even individuals. Take high-frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years, which is every change in the price offering, real deal prices (trades, AKA prints), etc. For highly volatile instruments, such as stocks, futures, and options, there are gigabytes of data every day, and they have to do scientific research on data for thousands of instruments over the last couple of years. Not to mention news that they correlate with the market, weather conditions, and even moon phases. So, yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort terabytes and petabytes of data regularly. I've worked for more than one of them. As Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently, so the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but also key extraction, enriching, etc.) at massive scale. Despite all that, there might be situations where you need to implement your own sorting. For example, I recently worked on a data project that involved processing log files with events coming from mobile apps.
For security/privacy reasons, certain fields in the log files needed to be encrypted before the data could be moved on for further processing, which meant applying a custom encryption algorithm to each row. However, since field values were heavily repeated (the same value can appear hundreds of times in a file), it was more efficient to sort the file first, encrypt each distinct value once, and reuse the cached result for every repeated occurrence.
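The sort-then-cache trick can be sketched as follows; the "encryption" here is just a hypothetical stand-in (a salted hash), since the real algorithm was custom. After sorting, duplicates are adjacent, so even a cache of size 1 gets a hit for every repeat:

```python
import hashlib
from functools import lru_cache

CALLS = {"count": 0}  # track how often the expensive step actually runs

@lru_cache(maxsize=1)  # tiny cache: adjacency after sorting is what makes it work
def encrypt(value):
    CALLS["count"] += 1
    # Stand-in for the real custom encryption: a salted SHA-256 digest.
    return hashlib.sha256(b"salt" + value.encode()).hexdigest()

rows = ["user_a", "user_b", "user_a", "user_c", "user_a", "user_b"]
encrypted = [encrypt(v) for v in sorted(rows)]  # a,a,a,b,b,c after sorting

print(CALLS["count"])  # 3 -- one real encryption per distinct value, not 6
```

Processing the rows unsorted with the same size-1 cache would evict on every value switch and run the expensive step once per row, which is the cost the sorting avoids.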