Redis Data Structure to Store All Clicks for All Links - data-structures

I'm trying to set up a system in which ALL links posted by users and clicked by their followers are stored in Redis so that the following requirements are met:
Able to get the most-clicked links (for example, the top 10%) within a time frame (today, this week, all time, or a custom range).
Able to query all users who posted the same link.
Since we already use many keys, ideally all of this is stored in a single Redis key.
Values can be encoded as JSON if needed.
Here is what I have come up with so far:
- I use a single Redis hash in which each field represents a single hour, so one day adds 24 fields to the hash.
- In each field, I store a JSON-encoded array with the format:
array("timestamp1" => array($url1, $url2, ...)
, "timestamp2" => array($url3, $url4, ...)
, ..., ...);
- The complete structure of the hash is:
[01/01/2010 00:00] => JSON(...),
[01/01/2010 01:00] => JSON(...),
....
This way, I can get all the clicks on any URL within any time-frame.
However, I can't seem to reuse this hash for getting all the users who posted the URL.
The question is: is there a better way to do this?
Updated 07/30/2011: I'm currently storing the minutes, the hours, the days, weeks, months, and years in the same hash.
So, one click is stored in many fields at once:
- in the field for the minute (format YmdHi)
- in the field for the hour (format YmdH)
- in the field for the day (format Ymd)
- in the field for the week (format YW)
- in the field for the month (format Ym)
- in the field for the year (format Y).
That way, when fetching a specific time frame, I only need to access the necessary fields without looping through every hour.
For example, if I need clicks from 07/26/2011 20:00 to 07/28/2011 02:00, I only need to query 7 fields: 4 fields for the hours from 20:00 to 23:00 on 07/26, 1 field for the full day of 07/27/2011, and 2 more fields for the hours from 00:00 to 01:00 on 07/28.
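The field selection in that example can be sketched in Python. This is only an illustration under assumptions, not the code I actually run: it uses just the hour (YmdH) and day (Ymd) fields, assumes the range is aligned to whole hours with an exclusive end, and the function name covering_fields is made up for the example.

```python
from datetime import datetime, timedelta


def covering_fields(start, end):
    """Return hash field names covering [start, end).

    Uses one day field (Ymd) for every full calendar day inside the range
    and hour fields (YmdH) for the partial days at either end.
    Assumes start and end are aligned to whole hours.
    """
    fields = []
    current = start
    while current < end:
        next_day = current + timedelta(days=1)
        if current.hour == 0 and next_day <= end:
            # A full calendar day fits inside the range: one field is enough.
            fields.append(current.strftime("%Y%m%d"))
            current = next_day
        else:
            fields.append(current.strftime("%Y%m%d%H"))
            current += timedelta(hours=1)
    return fields


fields = covering_fields(datetime(2011, 7, 26, 20), datetime(2011, 7, 28, 2))
print(fields)
```

The resulting field names could then be fetched from the hash in a single HMGET call.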

If you drop the third requirement it becomes a lot easier. A lot of people seem to think that you should always use hashes instead of keys, but this stems from misunderstanding of a post about using hashes to improve performance in a particular limited set of circumstances.
To get the most clicked links, create a sorted set for each hour or day, with the member being the link and the score being its click count, incremented using ZINCRBY. Use ZCARD to get the number of links in the set, then ZREVRANGEBYSCORE with a LIMIT (or ZREVRANGE with rank bounds) to fetch the top 10%. It is simplest if the set holds all links in the system, though there are strategies you can use to drop less popular items from the set if necessary.
To get all users posting a link, store a set of users for each link. You could do this with JSON and a key or hash storing details for the link, but a set makes updating and querying easier.
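Here is a rough sketch of both structures in Python. The functions take any client object exposing redis-py style methods (zincrby, zcard, zrevrange, sadd, smembers, with the argument order used since redis-py 3.0); the key names clicks:<day> and posters:<url> and all function names are assumptions made for the example. It fetches the top 10% by rank with ZREVRANGE, which I have swapped in for ZREVRANGEBYSCORE because rank bounds are simpler here.

```python
def record_click(r, url, day):
    """Count one click on url in the sorted set for the given day."""
    r.zincrby(f"clicks:{day}", 1, url)


def record_post(r, url, user_id):
    """Remember that user_id posted url (one set of users per link)."""
    r.sadd(f"posters:{url}", user_id)


def top_fraction(r, day, fraction=0.10):
    """Return the most-clicked links of a day as (url, clicks) pairs."""
    key = f"clicks:{day}"
    total = r.zcard(key)
    if total == 0:
        return []
    count = max(1, int(total * fraction))
    # ZREVRANGE ranks are zero-based and inclusive at both ends.
    return r.zrevrange(key, 0, count - 1, withscores=True)


def posters(r, url):
    """Return every user who posted url."""
    return r.smembers(f"posters:{url}")


# Usage against a real server (requires the redis package and a running Redis):
#   import redis
#   r = redis.Redis(decode_responses=True)
#   record_click(r, "http://example.com/a", "20110727")
#   print(top_fraction(r, "20110727"))
```

For a range such as "this week", the daily sets could be combined with ZUNIONSTORE into a temporary key before taking the top slice.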

I recommend a bucketing strategy, such as hashing keys or keeping link-to-user records month by month, because you have no control over how large the data structure may grow. There may be millions of users visiting a particular link, and returning the details of all of them at once would be of little use. I believe you could maintain counters or some metadata that represent the current state in memory and keep the full history in archival storage outside memory, or go for an in-memory data grid like GemFire.

Related

How do I extract multiple tables (35-40 tables) from an HTML website into one Excel file?

Currently, I am trying to retrieve data from this page: https://www.hdb.gov.sg/cs/infoweb/residential/renting-a-flat/renting-from-the-open-market/rental-statistics . As you can see, there are 4 quarters in a year, and each quarter has its own table. I want to extract all the tables, but I have not been able to automate the process and can only take one table at a time. On top of that, I want to add two columns, "Quarter" and "Year", to the retrieved data table. Any suggestions? Attached photos are my workflow and my Excel file.
Get the number of years and loop through them (start with the first year and go up to the last year).
For each year, try to get the data via data scraping. The elements exist; they are just hidden or not expanded. Do the data scraping for one table to build the data model, then reuse it within the loop. You need to change the selector so that it works for all tables by using the year and the quarter (just a generic example, like * year * quarter *). The columns are the same for all tables.
I haven't seen any details in the website menu or on the page, so it is good to check whether robots are allowed to scrape it for data.
The above would be the quickest way. Using the Find Children activity is more complex.

Sorting time stamp values that get constantly updated via Google Forms

I have a Google Sheet that gets filled in via a Google Form.
A time stamp is created every time a bar code (work order number) is scanned.
The work order number is in the first column.
One of the 4 time stamp types below is recorded in the 2nd column from the Google Form.
Setup start
Setup finish
Production start
Production finish
The time stamp is created in the 3rd column.
I am trying to do conditional formatting where the total setup time and production time are calculated, but tied to their respective work order number.
The difficulty is that the timestamp values all fall into one vertical column.
I don't want a mix up of timestamp values with different work order numbers.
The work order numbers along with the 4 unique time stamp values may be input at various times so the formula can't be order specific.
Is there a way to do this? Thanks!
Below is an example link of the spreadsheet I have:
https://drive.google.com/open?id=1YA86jGq_jMsx-wKe19TnZZyf9F4aW6_kUIbrz8hkLJI
Make a pivot table of the data from the form, then use simple formulas adjacent to the new pivot to get the results you are trying to get.

Advanced search in laravel

I am trying to implement a search in my application; specifically, it checks a table in my database against parameters specified by the user.
For example, if a user wants a list of records created within the last 7 days, they type last 7 days or last seven days in the search bar.
My issue is converting a string like this into a valid date that can be used in a where clause when querying my database.
The number can be arbitrary: the user can type last 2 days, last two days, last 100 days, or last hundred days. Inputs like week 25 and last month should also be allowed.
Instead of creating multiple drop-downs and input boxes for the user to select from on the front end, I would like to do something much simpler like this.
Is there a feature or package in Laravel that already takes care of this?
If there isn't, how would I go about doing something like that?
I think what you need to look up is "NLP" or "Natural Language Processing". There are many libraries and APIs in the field that can help you out so that you don't reinvent the wheel (so to speak).
Here are some packages for Laravel: Laravel Aylien Wrapper or nlp-tools, but there are many others for PHP in general.
Just do a quick search or look around at Mashape to find some examples.
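For the narrow "last N days" case a full NLP library may be more than you need. Below is a small language-neutral sketch in Python (not Laravel code) of a hand-rolled parser; the function name parse_relative_start and the word list are assumptions for the example, and it deliberately does not handle inputs like week 25 or last month.

```python
import re
from datetime import datetime, timedelta

# Hypothetical, deliberately short word list; extend as needed.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "hundred": 100,
}
UNIT_DAYS = {"day": 1, "week": 7}


def parse_relative_start(text, now):
    """Return the start datetime for phrases like 'last 7 days', else None."""
    match = re.fullmatch(r"last\s+(\w+)\s+(day|week)s?", text.strip().lower())
    if not match:
        return None
    amount, unit = match.groups()
    if amount.isdigit():
        n = int(amount)
    elif amount in NUMBER_WORDS:
        n = NUMBER_WORDS[amount]
    else:
        return None
    return now - timedelta(days=n * UNIT_DAYS[unit])


now = datetime(2020, 1, 10)
print(parse_relative_start("last 7 days", now))
print(parse_relative_start("Last seven days", now))
print(parse_relative_start("last month", now))  # unsupported phrase
```

The returned datetime would then go into the where clause as the lower bound on the created-at column; unsupported phrases come back as None so the caller can fall back to another parser.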

Cassandra DB: is it favorable, or frowned upon, to index multiple criteria per row?

I've been doing a lot of reading lately on Cassandra, and specifically how to structure rows to take advantage of indexing/sorting, but there is one thing I am still unclear on; how many "index" items (or filters if you will) should you include in a column family (CF) row?
Specifically: I am building an app and will be using Cassandra to archive log data, which I will use for analytics.
Example types of analytic searches will include (by date range):
total visits to specific site section
total visits by Country
traffic source
I plan to store the whole log object in JSON format, but to avoid having to go through each item to get basic data, or to create multiple CF just to get basic data, I am curious to know if it's a good idea to include these above "filters" as columns (compound column segment)?
Example:
Row Key          | timeUUID:data | timeUUID:country | timeUUID:source
=================|===============|==================|================
timeUUID:section | JSON Object   | USA              | example.com
So as you can see from the structure, the row key would be a compound key of timeUUID (say per day) plus the site section I want to get stats for. This lets me query a date range quite easily.
Next, my dilemma, the columns. Compound column name with timeUUID lets me sort & do a time based slice, but does the concept make sense?
Is this type of structure acceptable by the current "best practice", or would it be frowned upon? Would it be advisable to create a separate "index" CF for each metric I want to query on? (even when it's as simple as this?)
I would rather get this right the first time instead of having to restructure the data and refactor my application code later.
I think the idea behind this is OK. It's a pretty common way of doing timeslicing (assuming I've understood your schema anyway - a create table snippet would be great). Some minor tweaks ...
You don't need a timeUUID as your row key. Given that you suggest partitioning by individual days (which are inherently unique) you don't need a UUID aspect. A timestamp is probably fine, or even simpler a varchar in the format YYYYMMDD (or whatever arrangement you prefer).
You will probably also want to swap your row key composition around to section:time. The reason for this is that if you need to specify an IN clause (i.e. to grab multiple days) you can only do it on the last part of the key. This means you can do WHERE section = 'foo' and time IN (....). I imagine that's a more common use case - but the decision is obviously yours.
If your common case is querying the most recent data don't forget to cluster your timeUUID columns in descending order. This keeps the hot columns at the head.
Double storing content is fine (i.e. once for the JSON payload, and denormalised again for data you need to query). Storage is cheap.
I don't think you need indexes, but it depends on the queries you intend to run. If your queries are simple then you may want to store counters by (date:parameter) instead of values and just increment them as data comes in.
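To illustrate the counter idea, here is a plain Python stand-in (a Counter instead of a Cassandra counter column family, so it shows only the bookkeeping, not the storage). The key layout (day, metric, value) and the function names are assumptions for the example.

```python
from collections import Counter

# Stand-in for a counter column family keyed by (date:parameter).
counters = Counter()


def record_visit(day, section, country, source):
    """Increment one counter per metric as each log entry comes in."""
    counters[(day, "section", section)] += 1
    counters[(day, "country", country)] += 1
    counters[(day, "source", source)] += 1


def total(days, metric, value):
    """Sum a metric over a date range by reading one counter per day."""
    return sum(counters[(day, metric, value)] for day in days)


record_visit("20120101", "home", "USA", "example.com")
record_visit("20120101", "blog", "USA", "example.org")
record_visit("20120102", "home", "Canada", "example.com")

print(total(["20120101", "20120102"], "section", "home"))
print(total(["20120101", "20120102"], "country", "USA"))
```

A date-range query then reads one counter per day instead of scanning the stored JSON objects.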

Random exhaustive (non-repeating) selection from a large pool of entries

Suppose I have a large (300-500k) collection of text documents stored in a relational database. Each document can belong to one or more (up to six) categories. I need users to be able to randomly select documents in a specific category so that a single document is never repeated, much like how StumbleUpon works.
I don't see how I could implement this with slow NOT IN queries given a large number of users and documents, so I figured I might need to implement some custom data structure for this purpose. Perhaps there is already a paper describing some algorithm that might be adapted to my needs?
Currently I'm considering the following approach:
Read all the entries from the database.
Create a linked-list-based index for each category from the IDs of the documents belonging to that category, and shuffle it.
Create a Bloom filter containing all of the entries viewed by a particular user.
Traverse the index using an iterator, using the Bloom filter to skip items the user has already viewed.
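A minimal sketch of that approach in Python, under assumptions: a plain list stands in for the linked list, the Bloom filter is a toy built on hashlib with arbitrary sizes, and all names are made up for the example. A Bloom filter has no false negatives, so a viewed document is never served again; a false positive only means an unseen document is occasionally skipped.

```python
import hashlib
import random


class BloomFilter:
    """Toy Bloom filter; size and hash count are illustrative only."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def next_unseen(index_iter, seen):
    """Advance the iterator to the next document not reported as viewed."""
    for doc_id in index_iter:
        if doc_id not in seen:
            seen.add(doc_id)
            return doc_id
    return None  # index exhausted


category_index = list(range(1, 101))       # document IDs of one category
random.Random(42).shuffle(category_index)  # shuffled once per category
user_seen = BloomFilter()                  # one filter per user
user_iter = iter(category_index)           # one iterator per user

served = [next_unseen(user_iter, user_seen) for _ in range(20)]
print(served)
```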
If you track which entries the user has seen in a table, try this. I'm going to use MySQL because that's the quickest example I can think of, but the gist should be clear.
On a link being 'used'...
insert into viewed (userid, url_id) values ('jj', 123)
On looking for a link...
select p.url_id
from pages p left join viewed v on v.url_id = p.url_id and v.userid = 'jj'
where v.url_id is null
order by rand()
limit 1
This makes the database do a one-for-one join against this user's viewed rows, and you're limiting the query to return only one entry that the user has not seen yet.
Just a suggestion.
Edit: It is possible to make this one operation but there's no guarantee that the url will be passed successfully to the user.
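For anyone who wants to try the idea end to end, here is a small runnable sketch using Python's built-in sqlite3 module instead of MySQL (so random() replaces rand()). The table layout and the function name next_unseen are assumptions for the example. The join filters on userid so that one user's views do not hide pages from another user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (url_id INTEGER PRIMARY KEY);
    CREATE TABLE viewed (
        userid TEXT,
        url_id INTEGER,
        PRIMARY KEY (userid, url_id)
    );
    INSERT INTO pages (url_id) VALUES (1), (2), (3);
""")


def next_unseen(conn, userid):
    """Pick a random page this user has not seen and mark it as viewed."""
    row = conn.execute(
        """
        SELECT p.url_id
        FROM pages p
        LEFT JOIN viewed v ON v.url_id = p.url_id AND v.userid = ?
        WHERE v.url_id IS NULL
        ORDER BY random()
        LIMIT 1
        """,
        (userid,),
    ).fetchone()
    if row is None:
        return None  # the user has seen everything
    conn.execute(
        "INSERT INTO viewed (userid, url_id) VALUES (?, ?)", (userid, row[0])
    )
    return row[0]


seen = [next_unseen(conn, "jj") for _ in range(4)]
print(seen)
```

Note that ordering by a random value sorts every candidate row, so this is the simple version rather than the fast one for very large tables.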
It depends on how users get their random entries.
Option 1:
A user pages through a few entities and then stops. For example, the user sees the current random entity, moves to the next one, reads it, repeats this a couple of times, and that's it.
The next time this user (or another) gets an entity from this category, the set of already viewed entities has been cleared, so you may return an entity that was viewed before.
For that option I would recommend keeping a (hash) set of already viewed entity IDs; every time the user asks for a random entity, choose one at random from the DB and check that it is not already in the set.
Because the set is so small and your data is so big, the chance of picking an already viewed ID is so small that it will take O(1) most of the time.
Option 2:
A user pages through the entities, and the viewed entities are saved across all users and across every visit to your page.
In that case you will probably use up all the entities in each category, and saving all the viewed entities plus checking whether an entity has been viewed will take some time.
For that option I would get all the IDs for the topic, shuffle them, and store them in a linked list. When you want a random entity that has not been viewed, just take the head of the list and delete it (O(1)).
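Option 2 can be sketched in a few lines of Python. A collections.deque stands in for the linked list here, and the function names are assumptions for the example.

```python
import random
from collections import deque


def build_queue(doc_ids, rng=random):
    """Shuffle the IDs of one topic once and keep them in a queue."""
    ids = list(doc_ids)
    rng.shuffle(ids)
    return deque(ids)


def next_random(queue):
    """Take the head of the queue in O(1); None once everything was served."""
    return queue.popleft() if queue else None


queue = build_queue(range(5), random.Random(0))
drawn = [next_random(queue) for _ in range(6)]
print(drawn)
```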
I assume that for any given <user, category> pair, the number of documents viewed is pretty small relative to the total number of documents available in that category.
So can you just store indexed triples <user, category, document> indicating which documents have been viewed, and then just take an optimistic approach with respect to randomly selected documents? In the vast majority of cases, the randomly selected document will be unread by the user. And you can check quickly because the triples are indexed.
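A sketch of that optimistic approach in Python, where an in-memory set stands in for the indexed table of triples. The retry limit, the fallback scan, and the function name pick_unread are assumptions added so the example terminates even when a user has read almost everything.

```python
import random

# Stand-in for an indexed table of <user, category, document> triples.
viewed = set()


def pick_unread(user, category, doc_ids, rng=random, max_tries=20):
    """Pick random documents until one is unread; usually the first try."""
    for _ in range(max_tries):
        doc = rng.choice(doc_ids)
        if (user, category, doc) not in viewed:
            viewed.add((user, category, doc))
            return doc
    # Rare fallback when most documents are already read: scan for the rest.
    unread = [d for d in doc_ids if (user, category, d) not in viewed]
    if not unread:
        return None
    doc = rng.choice(unread)
    viewed.add((user, category, doc))
    return doc


docs = list(range(50))
rng = random.Random(3)
picks = [pick_unread("alice", "news", docs, rng) for _ in range(50)]
print(picks[:5])
```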
I would opt for a pseudorandom approach:
1.) Determine the number of elements in the category to be viewed (SELECT COUNT(*) FROM ... WHERE ...).
2.) Pick a random number in the range 1 ... count.
3.) Select the single document at that position (SELECT * FROM ... WHERE [same as when counting] ORDER BY [generate stable order]). Depending on the SQL dialect in use, there are different clauses that can be used to retrieve only the part of the result set you want (MySQL LIMIT clause, SQL Server TOP clause, etc.).
If the number of documents is large, the chance of serving the same user the same document twice is negligibly small. With the scheme described above you don't have to store any state information at all.
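The three steps can be tried out with Python's built-in sqlite3 module. The documents table, its columns, and the function name are assumptions for the example; the offset here is zero-based (0 ... count - 1) because that is what LIMIT ... OFFSET expects.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO documents (id, category) VALUES (?, ?)",
    [(i, "news" if i % 2 else "sports") for i in range(1, 101)],
)


def random_document(conn, category, rng=random):
    """Count the candidates, pick a random offset, fetch that single row."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM documents WHERE category = ?", (category,)
    ).fetchone()
    if count == 0:
        return None
    offset = rng.randrange(count)  # zero-based position in the stable order
    row = conn.execute(
        "SELECT id FROM documents WHERE category = ? "
        "ORDER BY id LIMIT 1 OFFSET ?",
        (category, offset),
    ).fetchone()
    return row[0]


print(random_document(conn, "news"))
```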
You may want to consider a nosql solution like Apache Cassandra. These seem to be ideally suited to your needs. There are many ways to design the algorithm you need in an environment where you can easily add new columns to a table (column family) on the fly, with excellent support for a very sparsely populated table.
edit: one of many possible solutions below:
Create a CF (column family, i.e. table) for each category (creating these on the fly is quite easy).
Add a row to each category CF for each document belonging to the category.
Whenever a user hits a document, you add a column named after the user to the row and set it to true. Obviously this table will be huge, with millions of columns, and probably quite sparsely populated, but that's no problem; reading it is still constant time.
Now finding a new document for a user in a category is simply a matter of selecting any result from select * where [the user's column] == null.
You should get constant-time writes and reads, amazing scalability, etc., if you can accept Cassandra's "eventually consistent" model (i.e., it is not mission critical that a user never gets a duplicate document).
I've solved a similar problem in the past by indexing the relational database into a document-oriented form using Apache Lucene. This was before the recent rise of NoSQL servers and is basically the same thing, but it's still a valid alternative approach.
You would create a Lucene Document for each of your texts with a textId (relational database id) field and multi valued categoryId and userId fields. Populate the categoryId field appropriately. When a user reads a text, add their id to the userId field. A simple query will return the set of documents with a given categoryId and without a given userId - pick one randomly and display it.
Store the user's past X selections in a cookie or something similar.
Return the last selections to the server along with the user's new criteria.
Randomly choose one of the texts satisfying the criteria, repeating until it is not a member of the user's last X selections.
Return this choice of text and update the list of last X selections.
I would experiment to find the best value of X but I have in mind something like an X of say 16?
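A sketch of those steps in Python, assuming the cookie contents arrive as a list that is loaded into a deque with maxlen X so the oldest selection drops off automatically. The function name and the early exit when every candidate is recent are assumptions for the example.

```python
import random
from collections import deque


def choose_text(candidates, recent, rng=random):
    """Pick a random candidate that is not among the last X selections."""
    recent_set = set(recent)
    if all(c in recent_set for c in candidates):
        return None  # nothing fresh to offer; avoids looping forever
    while True:
        choice = rng.choice(candidates)
        if choice not in recent_set:
            recent.append(choice)  # deque(maxlen=X) discards the oldest entry
            return choice


X = 16
recent = deque(maxlen=X)   # would be restored from the user's cookie
texts = list(range(40))    # IDs of texts matching the user's criteria
rng = random.Random(1)
picks = [choose_text(texts, recent, rng) for _ in range(30)]
print(picks)
```

Unlike tracking every viewed document, this only guarantees no repeats within the last X selections, which is the trade-off this approach accepts.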
