On Twitter every tweet has an int64 id assigned to it, but is it generated randomly? And if so, is it possible to have two tweets with the exact same id?
I know 2^64 is a large number, but I want to know whether Twitter has a way of preventing two tweets from getting the same id, or whether it relies on 2^64 being so large that the chance of two tweets sharing an id is extremely small.
Twitter uses a combination of timestamp, worker number, and sequence number to compose a unique 64-bit id for tweets. For details, see their engineering blog post Announcing Snowflake. You can also browse the code on their Snowflake GitHub repository.
These IDs are unique 64-bit unsigned integers, which are based on time, instead of being sequential. The full ID is composed of a timestamp, a worker number, and a sequence number.
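As a rough illustration, here's a minimal Python sketch that pulls those three components back out of an ID using the published bit layout (a 41-bit millisecond timestamp counted from Twitter's custom epoch of 1288834974657, a 10-bit worker number, and a 12-bit sequence):

    # Decode a Snowflake-style tweet ID into its published components.
    TWITTER_EPOCH_MS = 1288834974657  # custom epoch from the Snowflake design

    def decode_snowflake(tweet_id: int) -> dict:
        sequence = tweet_id & 0xFFF                          # low 12 bits
        worker = (tweet_id >> 12) & 0x3FF                    # next 10 bits
        timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS   # high 41 bits
        return {"timestamp_ms": timestamp_ms, "worker": worker, "sequence": sequence}

    print(decode_snowflake(1541815603606036480))  # any real tweet ID works

Because the timestamp occupies the high bits, sorting IDs numerically also sorts them (roughly) by creation time.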
I was reading through a system design for Twitter.
Sharding based on UserID: We can try storing all of a user's data on one server. While storing, we pass the UserID to our hash function, which maps the user to a database server where we store all of the user's tweets, favorites, follows, etc. While querying for a user's tweets/follows/favorites, we ask our hash function where the user's data lives and read it from there (see the sketch after this list). This approach has a couple of issues:
What if a user becomes hot? There could be a lot of queries on the server holding the user. This high load will affect the performance of our service.
Over time some users can end up storing a lot of tweets or having a lot of follows compared to others. Maintaining a uniform distribution of growing user data is quite difficult.
To recover from these situations we either have to repartition/redistribute our data or use consistent hashing.
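Here's a minimal sketch of the basic hash-based placement described above, assuming a fixed shard count and MD5 as an arbitrary stable hash:

    import hashlib

    NUM_SHARDS = 16  # hypothetical shard count

    def shard_for_user(user_id: int) -> int:
        # Any stable hash works; MD5 is used here only for illustration.
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return int(digest, 16) % NUM_SHARDS

    print(shard_for_user(42))  # all of user 42's reads/writes hit this shard

Note that with a plain modulo, changing NUM_SHARDS remaps almost every user, which is exactly why repartitioning or consistent hashing comes up.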
Sharding based on TweetID: Our hash function will map each TweetID to a random server where we will store that tweet. To search for tweets, we have to query all servers, and each server will return a set of tweets. A centralized server will aggregate these results and return them to the user. Let's look at timeline generation as an example; here are the steps our system has to perform to generate a user's timeline:
Our application (app) server will find all the people the user follows.
App server will send the query to all database servers to find tweets from these people.
Each database server will find the tweets for each user, sort them by recency and return the top tweets.
App server will merge all the results and sort them again to return the top results to the user.
This approach solves the problem of hot users, but, in contrast to sharding by UserID, we have to query all database partitions to find tweets of a user, which can result in higher latencies.
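As an illustration of steps 2-4, here is a hedged Python sketch of that scatter-gather merge; query_shard is a stand-in for a real per-shard database call:

    import heapq
    from itertools import islice

    def query_shard(shard_id, followed_user_ids, limit=10):
        # Stand-in for a real query; returns (timestamp, tweet) pairs
        # already sorted newest-first by the shard.
        return [(1000 - shard_id - 10 * i, f"tweet-{shard_id}-{i}") for i in range(limit)]

    def build_timeline(shard_ids, followed_user_ids, limit=10):
        per_shard = [query_shard(s, followed_user_ids, limit) for s in shard_ids]
        # heapq.merge preserves each shard's sort order, so the final
        # merge is cheap and the app server never re-sorts everything.
        merged = heapq.merge(*per_shard, key=lambda pair: pair[0], reverse=True)
        return [tweet for _, tweet in islice(merged, limit)]

    print(build_timeline(shard_ids=range(4), followed_user_ids=[1, 2, 3]))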
We can further improve performance by introducing a cache in front of the database servers to store hot tweets.
Sharding based on Tweet creation time: Storing tweets based on creation time will give us the advantage of fetching all the top tweets quickly and we only have to query a very small set of servers. The problem here is that the traffic load will not be distributed.
What if we could combine sharding by TweetID and by tweet creation time? If we don't store the creation time separately but encode it in the TweetID itself, we get the benefits of both approaches: it becomes quite quick to find the latest tweets. For this, each TweetID must be universally unique in our system and must contain a timestamp.
We can use epoch time for this. Let's say our TweetID will have two parts: the first representing epoch seconds and the second an auto-incrementing sequence. To make a new TweetID, we take the current epoch time and append an auto-incrementing number to it. We can figure out the shard number from this TweetID and store the tweet there. How is this approach better than the ones above?
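For concreteness, here's a minimal sketch of the epoch-seconds + sequence TweetID just described, assuming a 32-bit split for each part (the exact split is an illustration, not part of the design above):

    import itertools
    import time

    _sequence = itertools.count()  # in a real system this would be coordinated

    def new_tweet_id() -> int:
        epoch_seconds = int(time.time())
        seq = next(_sequence) & 0xFFFFFFFF  # wrap the counter at 32 bits
        return (epoch_seconds << 32) | seq

    def shard_for_tweet(tweet_id: int, num_shards: int = 16) -> int:
        # Shard on the sequence part so creation time stays in the high bits.
        return (tweet_id & 0xFFFFFFFF) % num_shards

    tid = new_tweet_id()
    print(tid, "shard:", shard_for_tweet(tid), "created:", tid >> 32)

Because the epoch seconds sit in the high bits, numerically larger TweetIDs are newer, which is what makes "latest tweets" queries cheap.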
I'd like to store a log of byte counters for 10 million LAN devices.
Each device reports byte counter value every 15 minutes (96 samples/day), and each data sample has 500 columns. Each device is identified by its device serial dev_sn.
At the end of each day, I will process the data (compute the total bytes per device) for all the devices and store the results in Hive format.
The raw data would look like this (e.g., devices sn1, sn2, and sn3 report values at t1, t2, and t3):
Method 1: Use both dev_sn and timestamp as the composite row-key.
Method 2: Use dev_sn as the row-key and store each data as the version update of the existing values.
To find the total bytes,
Method 1: Search by the sn1 prefix of the composite key, sort the rows by time, and process the data
Method 2: Search by sn1, pull all the versions, and process the data
I think Method 2 is the better solution as it creates fewer row keys, but I'm not sure that it really is the better approach. Some advice would be really helpful.
This is subjective, but I always opt for a composite row key over versioning, for the following reasons:
You can store unlimited "versions" per device. With versioning, this property is limited (as set in configuration).
It's much easier to retrieve entries from specific timestamps/time ranges with an HBase command. Prefix scans are much easier to work with than the version API.
There's no reason for you to want to reduce the number of row keys - HBase is designed specifically to store huge numbers of row keys.
What if you need to delete last Tuesday's data? With versioning that's difficult; with composite keys it's a small piece of code.
As an aside, be sure to pre-split your table's regions so that the dev_sn values distribute evenly across region servers.
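For concreteness, here's a hedged sketch of Method 1 using the happybase HBase client; the table name, column family, and key format are assumptions for illustration:

    import happybase

    conn = happybase.Connection('hbase-host')  # hypothetical Thrift server
    table = conn.table('device_counters')

    def row_key(dev_sn: str, epoch_seconds: int) -> bytes:
        # Zero-padding the timestamp makes lexicographic order match time order.
        return f"{dev_sn}#{epoch_seconds:010d}".encode()

    # Write one 15-minute sample (only the byte counter shown here):
    table.put(row_key('sn1', 1700000000), {b'd:bytes': b'12345'})

    # Total bytes for sn1 via a simple prefix scan over its composite keys:
    total = sum(int(value[b'd:bytes']) for _, value in table.scan(row_prefix=b'sn1#'))
    print(total)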
I want to try implementing a simple chat system after reading many articles on Confluent's Kafka blog, but I've run into some problems with the structural design.
When using MySQL as my database, I can give an id to every meaningful record, like user_id in the user table and message_id in the message table. Once a model has an id, it is very convenient for the client and server to communicate about it.
But in Kafka Streams, how can I give every meaningful model a unique id in a KTable? Or is it even necessary for me to do this?
Maybe I can answer my own question.
In MySQL, we can directly use a sequence id because all data goes to one place and is automatically assigned a new id. But when the table grows too large, we need to split it into several smaller tables. In that case, we also have to regenerate a unique id for each record, because the auto-generated ids in those tables each start from 0.
Maybe it is the same in Kafka. When we have only one partition, we can use the id Kafka generates (the offset), because all messages go to one place and will never be duplicated. But when we want more partitions, we have to be careful: ids generated in different partitions are not globally unique.
So what we should do is generate ids ourselves. A UUID is a fast way to do this, but if we want a number, we can use a small algorithm. In a distributed environment we might use a structure like this:
[nodeid+threadId+current_time+auto_increased_number]
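Here's a minimal Python sketch of that structure; the bit widths are arbitrary assumptions chosen to fit 64 bits, and a real system would reset the counter per time unit and assign node ids centrally:

    import itertools
    import threading
    import time

    NODE_ID = 3  # assumed to be assigned per host/process (10 bits)
    _counter = itertools.count()

    def next_id() -> int:
        thread_id = threading.get_ident() & 0xFF            # 8 bits (truncated)
        millis = int(time.time() * 1000) & ((1 << 41) - 1)  # 41 bits
        seq = next(_counter) & 0x1F                         # 5 bits
        return (NODE_ID << 54) | (thread_id << 46) | (millis << 5) | seq

    print(next_id())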
I have a table to which I add records whenever the user views a particular resource. The key fields are
Username
Resource
Date Viewed
On a history page of my app, I want to present a set number (e.g., top 5) of the user's most recently viewed Resources, but I want to group by Resource, so that if some were viewed several times, only the most recent of each one is shown.
To be clear, if the raw data looked like this:
UserA | ResourceA | Jan 1
UserA | ResourceA | Jan 2
UserA | ResourceB | Jan 3
UserA | ResourceA | Jan 4
...
...only the bottom two records would appear in the history page.
I know you can get server-side chronological sorting by using a string derived from the date in the PartitionKey or RowKey fields.
I also see that you could enable a crude grouping mechanism by using Username and Resource as your PartitionKey and RowKey fields, and then using Insert-or-update, to maintain a table in which you kept pointers for the most recent value for each combination. However, those records wouldn't be sorted chronologically.
Is there any way to design a set of tables so that I can get the data I need without retrieving tons of extra entities and sorting on the client? I'm willing to get elaborate with the design if that's what it takes. Thanks in advance!
First, I would strongly recommend that you read this excellent Azure Storage Table Design Guide: Designing Scalable and Performant Tables document from Storage team.
Yes, I would agree that it is somewhat tricky with Azure Table Storage but it is doable :).
What you have to do is keep multiple copies of the same data. Each copy will serve a different purpose.
Considering the scenario where you want to fetch the most recent entries for Resources A and B, here's what your entity structure would look like:
PartitionKey: Date/time (in ticks) reversed, i.e. DateTime.MaxValue.Ticks - LastAccessedDateTime.Ticks. Reversed ticks are required so that the most recent entries show up at the top of the table.
RowKey: Resource name.
AccessDate: Indicates the last access date/time.
User: Name of the user who accessed that resource.
So when you are interested in just finding out most recently used resources, you could start fetching records from the top.
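For concreteness, here's a small Python sketch of computing that reversed-ticks PartitionKey; DateTime.MaxValue.Ticks is the .NET constant 3155378975999999999, and a tick is 100 nanoseconds counted from 0001-01-01:

    from datetime import datetime, timezone

    DOTNET_EPOCH = datetime(1, 1, 1, tzinfo=timezone.utc)
    MAX_TICKS = 3_155_378_975_999_999_999  # DateTime.MaxValue.Ticks

    def reversed_ticks_key(last_accessed: datetime) -> str:
        delta = last_accessed - DOTNET_EPOCH
        ticks = (delta.days * 86_400 + delta.seconds) * 10_000_000 + delta.microseconds * 10
        # Zero-pad so that string comparison matches numeric comparison.
        return f"{MAX_TICKS - ticks:019d}"

    print(reversed_ticks_key(datetime.now(timezone.utc)))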
In short, your data storage approach should be primarily governed by how you want to fetch the data. It would even mean you will have to save the same data multiple times.
UPDATE
As discussed in the comments below, Table Service doesn't directly support Server Side Grouping. This is something that you would need to do on your own. What you could do is create a separate table to store the access counts. As and when the resources are accessed, you basically either insert a new record in that table or update the count for that resource in that table.
Assuming you're always interested in finding out resource access count within a date/time range, here's what your entity structure would look like:
PartitionKey: Date/Time (in Ticks). The precision would depend on your reporting requirement. For example, if you want to maintain access counts by day then your precision would be a day.
RowKey: Resource name.
AccessCount: This field will constantly update as and when a resource is accessed.
LastAccessDateTime: This field will denote when a resource was last accessed.
For updating access counts, I would recommend that you make use of a background process. Basically in this approach, as a resource is accessed you add a message in a queue. This message will have resource name and date/time resource was last accessed. Then have a background process poll this queue and fetch messages. As the messages are received, you first get the current count and last access date/time for that resource. If no records are found, you simply insert a record in this table with count as 1. If a record is found then you compare the date/time from the table with the date/time sent in the message. If the date/time from the table is smaller than the date/time sent in the message, you update both count (increase that by 1) and last access date/time. If the date/time from the table is more than the date/time sent in the message, you only update the count.
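Here's a hedged sketch of that background process, with an in-process queue and dict standing in for the Azure queue and table:

    import queue
    from datetime import datetime

    access_queue = queue.Queue()  # stand-in for the Azure queue
    counts = {}                   # resource -> (count, last_access); stand-in for the table

    def process_one_message():
        resource, accessed_at = access_queue.get()
        count, last_access = counts.get(resource, (0, datetime.min))
        if accessed_at > last_access:
            counts[resource] = (count + 1, accessed_at)  # newer access: bump both
        else:
            counts[resource] = (count + 1, last_access)  # out-of-order: bump count only

    access_queue.put(("ResourceA", datetime(2024, 1, 2)))
    process_one_message()
    print(counts)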
Now to find the most accessed resources in a time span, you simply query this table. Assuming there are a limited number of resources (say in the 100s), you can get this information in as little as one request. Since you're dealing with a small amount of data, you can simply download it to the client side and order it any way you see fit. However, to see the access details for a particular resource, you would have to fetch the detailed data (1,000 entities at a time).
Part of your brain might still be unconsciously trapped in relational-table design paradigms; I'm still getting to grips with that issue myself.
Rather than think of table storage as a database table (with the "query-ability" that goes with it) try visualizing it in more simple (dumb) terms.
A design problem I'm working on now is storing financial transaction data, and I want to know what the total $ amount of these transactions are. Because Azure table storage doesn't (yet?) offer aggregate functions I can't simply go .Sum(). To get around that I'm going to:
Sum the values of the transactions in my app before I pass them to Azure.
I'll then pass the result of that sum to Azure as a separate piece of information, called RunningTotal.
Later on I can just return RunningTotal rather than pulling down all the transactions, and I can repeat the process by incrementing the value of RunningTotal each time I get new transactions.
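A minimal sketch of this precomputed-aggregate pattern (names are illustrative):

    running_total = 0.0  # in practice, loaded from a RunningTotal entity

    def record_transactions(amounts):
        global running_total
        running_total += sum(amounts)  # update the aggregate once per batch
        # ...the individual transaction entities would also be inserted here...

    record_transactions([19.99, 5.00, 42.50])
    print(running_total)  # read back later without scanning all transactions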
Of course there are risks to this, but the app is a personal one, so the risk level is low and manageable, at least as a proof-of-concept.
Perhaps you can use a similar approach for the design of your system: compute useful values in advance. I'm almost using table storage as a long-term cache rather than a database.
As part of trying to learn hadoop, I'm working on a project using a large number of tweets from the twitter streaming API. Of ~20M tweets, I've generated a list of the N most active users, who I want to try to cluster based on the text of all their tweets.
So I have a list of a few thousand user names, and what I want to do is concatenate the content of all the tweets from each user together, and eventually generate a word count vector for each user.
I can't figure out how to accomplish the concatenation though. I want to be able to write some mapper that takes in each tweet line, and says "if this tweet comes from a user I'm interested in, map it with key username and value tweetText, otherwise ignore it." Then it would be simple for the reducer to concatenate the tweets like I want to.
My problem is, how do I tell the mapper about this big list of users that I'm interested in? It seems like it would be nice if the mapper could have a Hashtable with all the users, but I have no idea if that's possible.
Is there a good way to accomplish this, or is the problem just not a good fit for Map/Reduce?
Aw, nevermind. I've been thinking about this for a while, but once I wrote it out here, I realized how I think I should be doing it. Instead of making a list of all the users with X number of tweets, and then going through the data again and trying to find their tweets, I can do it all at once.
Currently I am mapping [username,1] and then having the reducer sum all of the 1's together to generate tweet counts. Then I try to find the tweets of all users with more than X tweets.
To do it all at once, I should map [username,completeTweet] and then have the reducer concatenate and output data for only users who have more than X tweets, and just ignore the other users.
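As a hedged sketch, here's what that one-pass job could look like as Hadoop Streaming scripts in Python; the tab-separated input format is an assumption, and in practice the mapper and reducer would live in separate files:

    import sys

    X = 100  # minimum tweet count for a user to be kept

    def parse_tweet(line):
        # Assumption: each input line is "username<TAB>tweet text".
        return line.rstrip("\n").split("\t", 1)

    def mapper():
        # Emit the complete tweet keyed by username.
        for line in sys.stdin:
            username, tweet_text = parse_tweet(line)
            print(f"{username}\t{tweet_text}")

    def reducer():
        # Streaming delivers lines grouped by key; concatenate per user and
        # emit only users with more than X tweets.
        current_user, tweets = None, []
        for line in sys.stdin:
            username, tweet_text = line.rstrip("\n").split("\t", 1)
            if username != current_user:
                if current_user is not None and len(tweets) > X:
                    print(f"{current_user}\t{' '.join(tweets)}")
                current_user, tweets = username, []
            tweets.append(tweet_text)
        if current_user is not None and len(tweets) > X:
            print(f"{current_user}\t{' '.join(tweets)}")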