Question related to the use case of Redis/NoSQL database - spring

I am building a website where I will have two kinds of users - X and Y. X will be few (hundreds) and Y will be many (millions). Basically, for each X there will be some set of Ys. Y will add friends to their network. X will post a message that can be seen by a particular set of Y, and those people can forward that message to their friends. Friends can see that message and can either forward it to their friends or reply to the sender.
So this is my use case, and I have been exploring different kinds of databases, primarily NoSQL databases, because I consider scalability and performance to be the main concerns for my website. I have started using the Spring Data Redis APIs and found them quite useful for my use case. My question here is how we perform an 'update' operation in NoSQL databases, specifically in Redis. Let's say I want to update user information stored in Redis. Another question is how I perform operations like "get me all the users who are below 30 years of age", given that we have stored user information in Redis with an 'age' field.
I am quite new to the NoSQL world and have very little experience with it. I would also like to hear from experienced people about the right database for my use case. I have previously used a Spring and Hibernate combination with MySQL as the database and was not satisfied with the performance of the system when the load was heavy.
Thanks,
Sachin

My question here is how we perform an 'update' operation in NoSQL
databases, specifically in Redis. Let's say I want to update user
information stored in Redis.
This depends on the structure in which you store your users. The fifteen minute introduction to Redis data types tutorial can help you get a bigger picture of these data structures. Usually updates are done with SET-style operations, so if I had stored user information in a hash structure and wanted to update a certain field, I would use the HSET command.
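For instance, assuming each user is stored in a hash under a key such as user:1000 (the key naming here is just an illustration), updating a single field from redis-cli looks like this:
redis> HSET user:1000 age 31
(integer) 0
A reply of 0 means the field already existed and was overwritten; a reply of 1 would mean the field was newly created.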
Another question is how I perform operations like "get me all the
users who are below 30 years of age", given that we have stored user
information in Redis with an 'age' field.
Redis is an advanced key/value data store, and it doesn't support ad hoc querying of data the way you may be used to from the SQL world. If you need this querying functionality, you should look at other NoSQL solutions that support it; for example, have a look at MongoDB or others from this list.
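Purely as an illustration of that kind of ad hoc querying, the age filter from your question would look like this in the MongoDB shell (assuming a users collection with an age field):
db.users.find({ age: { $lt: 30 } })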

The best way to check whether Redis can do what you want is to install it on a development machine and give redis-cli a try. This built-in client has command completion and help available (just hit help <TAB> to see categories), and good documentation is available at http://redis.io/commands.
As for your request, you should study the sorted set data structure for storing the users' info. It is better than plain sets because it allows some basic form of querying, as you will see:
Suppose you want to be able to query all users between 18 and 30 years in your database, which contains (as an example) 3 users: Joe, 28 years; Bob, 17 years; and Adam, 50 years.
You would populate Redis as:
ZADD age 28 Joe
ZADD age 17 Bob
ZADD age 50 Adam
The syntax is ZADD [key] [score] [member]. The set lives at [key], and in sorted sets each [member] has a [score], which MUST BE A DOUBLE. That means you can't query scores using string patterns or even string values (on this point Mongo would be better).
So, time to query. I will paste redis-cli queries now:
To list all users between 18 and 30 years, you would do:
redis> ZRANGEBYSCORE age 18 30
1) "Joe"
redis> ZRANGEBYSCORE age 18 30 WITHSCORES
1) "Joe"
2) "28"
The WITHSCORES option shows the score of each member in the result.
To list all users under the age of 18:
redis> ZRANGEBYSCORE age 0 (18
1) "Bob"
redis> ZRANGEBYSCORE age 0 (18 WITHSCORES
1) "Bob"
2) "17"
The ( modifier makes that end of the interval open (exclusive). It would be like where age >= 0 and age < 18.
To list all users above the age of 30:
redis> ZRANGEBYSCORE age 30 +inf
1) "Adam"
redis> ZRANGEBYSCORE age 30 +inf WITHSCORES
1) "Adam"
2) "50"
+inf means infinity, so there is no upper limit on the result, like where age >= 30.
And so on.
This is really just the beginning. You can do intersections, unions and select count(*)-style operations on sorted sets, all very fast, especially if you use an optimized library (like phpredis). I hope this gives you a good impression of it.
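As a small illustration of the counting and set operations mentioned above (the premium set is hypothetical, only there to show the intersection):
redis> ZCOUNT age 18 30
(integer) 1
redis> ZCARD age
(integer) 3
redis> ZINTERSTORE premium_by_age 2 age premium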

Related

PowerBI filter table based on value of measure_A OR measure_B [duplicate]

We are trying to implement a dashboard that displays various tables, metrics and a map where the dataset is a list of customers. The primary filter condition is the disjunction of two numeric fields. We want the user to be able to select a threshold for [field 1] and a separate threshold for [field 2] and then impose the condition [field 1] >= <threshold> OR [field 2] >= <threshold>.
After that, we want to also allow various other interactive slicers so the user can restrict the data further, e.g. by country or account manager.
Power BI naturally imposes AND between all filters and doesn't have a neat way to specify OR. Can you suggest a way to define a calculation using the two numeric fields that is then applied as a filter within the same interactive dashboard screen? Alternatively, is there a way to first prompt the user for the two threshold values before the dashboard is displayed -- so when they click Submit on that parameter-setting screen they are then taken to the main dashboard screen with the disjunction already applied?
Added in response to a comment:
The data can be quite simple: no complexity there. The complexity is in getting the user interface to enable a disjunction.
Suppose the data was a list of customers with customer id, country, gender, total value of transactions in the last 12 months, and number of purchases in last 12 months. I want the end-user (with no technical skills) to specify a minimum threshold for total value (e.g. $1,000) and number of purchases (e.g. 10) and then restrict the data set to those where total value of transactions in the last 12 months > $1,000 OR number of purchases in last 12 months > 10.
After doing that, I want to allow the user to see the data set on a dashboard (e.g. with a table and a graph) and from there select other filters (e.g. gender=male, country=Australia).
The key here is to create separate parameter tables and combine conditions using a measure.
Suppose we have the following Sales table:
Customer  Value  Number
-----------------------
A           568       2
B          2451      12
C          1352       9
D           876       6
E           993      11
F          2208      20
G          1612       4
Then we'll create two new tables to use as parameters. You could do a calculated table like
Number = VALUES(Sales[Number])
Or something more complex like
Value = GENERATESERIES(0, ROUNDUP(MAX(Sales[Value]),-2), ROUNDUP(MAX(Sales[Value]),-2)/10)
Or define the table manually using Enter Data or some other way.
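If you would rather define such a parameter table in DAX than through Enter Data, a manual version could look like this (the column name and the list of thresholds are only an example):
Number = DATATABLE("MinNumber", INTEGER, {{0}, {5}, {10}, {15}, {20}, {25}})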
In any case, once you have these tables, name their columns what you want (I used MinNumber and MinValue) and write your filtering measure
Filter = IF(MAX(Sales[Number]) > MIN(Number[MinNumber]) ||
            MAX(Sales[Value]) > MIN('Value'[MinValue]),
            1, 0)
Then put your Filter measure as a visual-level filter where Filter is not 0, and use the MinNumber and MinValue columns as slicers.
If you select 10 for MinNumber and 1000 for MinValue, then your table should look like this:
Notice that E and G only exceed one of the thresholds and that A and D are excluded.
To my knowledge, there is no such built-in slicer feature in Power BI at the time of writing. There is, however, a suggestion in the Power BI forum that requests functionality like this. If you'd be willing to use the Power Query Editor, it's easy to obtain the values you're looking for, but only with hard-coded values for your limits or thresholds.
Let me show you how for a synthetic dataset that should fit the structure of your description:
Dataset:
CustomerID,Country,Gender,TransactionValue12,NPurchases12
51,USA,M,3516,1
58,USA,M,3308,12
57,USA,M,7360,19
54,USA,M,2052,6
51,USA,M,4889,5
57,USA,M,4746,6
50,USA,M,3803,3
58,USA,M,4113,24
57,USA,M,7421,17
58,USA,M,1774,24
50,USA,F,8984,5
52,USA,F,1436,22
52,USA,F,2137,9
58,USA,F,9933,25
50,Canada,F,7050,16
56,Canada,F,7202,5
54,Canada,F,2096,19
59,Canada,F,4639,9
58,Canada,F,5724,25
56,Canada,F,4885,5
57,Canada,F,6212,4
54,Canada,F,5016,16
55,Canada,F,7340,21
60,Canada,F,7883,6
55,Canada,M,5884,12
60,UK,M,2328,12
52,UK,M,7826,1
58,UK,M,2542,11
56,UK,M,9304,3
54,UK,M,3685,16
58,UK,M,6440,16
50,UK,M,2469,13
57,UK,M,7827,6
Desktop table:
Here you see an Input table and a subset table using two Slicers. If the forum suggestion gets implemented, it should hopefully be easy to change a subset like below to an "OR" scenario:
Transaction Value > 1000 OR Number of purchases > 10 using Power Query:
If you use Edit Queries > Advanced filter you can set it up like this:
The last step under Applied Steps will then contain this formula:
= Table.SelectRows(#"Changed Type2", each [NPurchases12] > 10 or [TransactionValue12] > 1000)
Now your original Input table will look like this:
Now, if only we were able to replace the hardcoded 10 and 1000 with a dynamic value, for example from a slicer, we would be fine! But no...
I know this is not what you were looking for, but it was the best 'negative answer' I could find. I guess I'm hoping for a better solution just as much as you are!

how to improve Neo4J performance in creating edges?

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get
things working for the traffic on a single day on the Berlin subway network. these are the grand totals
i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the
event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip,
which is a series of stoptimes, they happen on a specific date and time, and they happen on a given
stop for a given route (or 'line') in a transit network. so there's a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order
to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files
and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas,
with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J,
and 10% needed to feed the node data using plain HTTP post requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:linked]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query is still in the range of some minutes, which i find longish given that there are only
thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that
involve stoptimes take several tens of minutes each on my desktop machine.
(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible:
in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173
subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280,
close enough. there are some faulty? / double / extra stop nodes in the data (211
stops instead of 173), but those are few.)
frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J
for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added,
as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like
170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update: the above edge creation queries, which i perform one by one, have now been running for over an hour and are not yet finished, meaning that, if things scale in a linear fashion, the time needed to feed the Berlin public transport data for a single day would be something like a week. therefore, the code currently performs several orders of magnitude too slowly to be viable.
Update: @MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
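For reference, a minimal LOAD CSV run for GTFS stops could look like the following; the file location and node property names are assumptions here, while stop_id and stop_name are standard GTFS columns:
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:///stops.csv" AS row
CREATE (:stop {id: row.stop_id, name: row.stop_name});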
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with profile and look at the profile output to see whether it mentions using the index or just a label scan.
Do you use parameters for your insert queries, so that Neo4j can re-use the built query plans?
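For example, a profiled, parameterized lookup might look like this (the parameter name is arbitrary; Cypher of that era references parameters with curly braces):
profile MATCH (trip:`trip` {`route-id`: {routeId}}) RETURN count(trip);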
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point to your datasource? We can convert it into CSV if it isn't one already and then import it even more quickly.
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
For your initial import you also don't need CREATE UNIQUE, as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan to express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that @MichaelHunger's solution works like a charm. i modified the edge-building queries
from the question into the shapes below, which keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
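for illustration, a parametrized statement sent through the transactional HTTP endpoint could look roughly like this (the payload shape is the standard /db/data/transaction/commit format; the property values are made up):
POST /db/data/transaction/commit
{"statements": [{"statement": "CREATE (:`stoptime` {id: {id}, `trip-id`: {tripId}, idx: {idx}})",
                 "parameters": {"id": "st-1", "tripId": "t-77", "idx": 3}}]}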
as for the ~label field and also the question why i use dashes in names where underscores would be more
convenient, well, that's a long story about what i perceive as good and practical naming that sometimes clashes
with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more
interesting is the question: why is there a ~label attribute that repeats what the element label says (what
you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), to take
advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do
appear in the returned values.
mind you, labels are central to graph thinking the Neo4J way, but in query results, labels are
conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set,
then that edge will arrive as an empty object, telling you only that there is something but not what. so i
decided to duplicate the label on each single node and each single edge. not an optimal solution, but at least
now i get an informative graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of research for me right now. i guess it will
all start with a 'field of interest', like 'show all Berlin subway stops', or 'all buses departing within
the next 15 minutes from a bus stop near me'. the data already allows you to see which stops are directly connected
by a subway line, their geographical distance, what services are present and what routes they take. the idea
is to grab the data and present it in novel, usable and beautiful ways. 9292 is quite
close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

Real time data processing

I am parsing keywords several times per second. Every second I have 1,000 - 5,000 keywords. I want to find outliers, growing keywords and other such things, which is called technical analysis. One of the problems is how to store the data.
I will be able to do something like:
           20-01  20-02  20-03
brother        0      3      4
table          1      0      0
cup           34     54     78
But there might be a lot of keywords. For every new batch of data I need to check whether each word already exists. If it doesn't, then I must add the new word and a new row for it. What is the right way to organize the store? Should I use a key/value database, NoSQL or something else?
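For illustration only, the table above could map to something like a hash per time column in a key/value store (the key and field names below are made up); HINCRBY creates the field if it does not exist yet, so the "does this word exist?" check comes for free:
redis> HINCRBY keywords:20-03 cup 1
redis> HGETALL keywords:20-03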

Mongo multiple queries or database normalization

I'm using MongoDB for my database. The query that I'm currently working on revealed a possible deficiency in my schema. Below is the relevant layout of my collections. Note that games.players is an array of 2 players since the game is chess.
users {_id, username, ...}
games {_id, players[], ...}
msgs {_id, username, gameid, time, msg}
The data that I need is:
All msgs for games which a user is in, which are newer than a given timestamp.
In a SQL database, my query would look similar to:
SELECT * FROM msgs WHERE time>=$time AND gameid IN
(SELECT _id FROM games WHERE players=$username);
But Mongo isn't a relational database, so it doesn't support sub-queries or joins. I see two possible solutions. Which would be better performance-wise and efficiency-wise?
Multiple Queries
Select the games the user is in, then use $in to match msgs.gameid against them.
Other?
Normalization
Make users.games contain all games a user is in.
Copy games.players to msgs.players by msgs.gameid
etc.,
I'm a relative newbie to MongoDB, but I find myself frequently using a combination of the two approaches. Some things - e.g. user names - are frequently duplicated to simplify queries used for display, but any time I need to do more than display information, I wind up writing multiple queries, sometimes 2 or 3 levels deep, using $in, to gather all the documents I need to work with for a given operation.
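A sketch of that two-step approach in the mongo shell, with variable names chosen only for illustration:
var gameIds = db.games.find({ players: username }, { _id: 1 }).map(function (g) { return g._id; });
db.msgs.find({ time: { $gte: since }, gameid: { $in: gameIds } });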
You can "normalize" yourself. I would add an array to users that lists the games the user is a member of:
users {_id, username, games={game1,game2,game3}}
Now you can do a query on msgs where time >= $time and gameid is in users.games.
You will have to maintain the games list on each user.
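A sketch of what maintaining and querying that list could look like in the mongo shell (variable names are illustrative; $addToSet keeps repeated updates idempotent):
db.users.update({ _id: userId }, { $addToSet: { games: gameId } });
var user = db.users.findOne({ username: username });
db.msgs.find({ time: { $gte: since }, gameid: { $in: user.games } });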

Redis Data Structure to Store All Clicks for All Links

I'm trying to set up a system in which ALL links posted by users and clicked by their followers are stored in redis in such a way that the following requirements are met:
Able to get the most clicked links (for example, the top 10%) within a time-frame (which can be today, this week, all time, or custom).
Able to query all users who posted the same link.
Since we already use many keys, ideally we would store all this under a single Redis key.
Values can be encoded as JSON if needed.
Here is what I have come up with so far:
- I use a single Redis hash in which each field is a single hour, so in one day that hash will contain 24 fields.
- In each field, I store JSON encoded from an array with this format:
array("timestamp1" => array($url1, $url2, ...)
, "timestamp2" => array($url3, $url4, ...)
, ..., ...);
- The complete structure of the hash is:
[01/01/2010 00:00] => JSON(...),
[01/01/2010 01:00] => JSON(...),
....
This way, I can get all the clicks on any URL within any time-frame.
However, I can't seem to reuse this hash for getting all the users who posted the URL.
The question is: Is there any better way to do?
Updated 07/30/2011: I'm currently storing the minutes, the hours, the days, weeks, months, and years in the same hash.
So, one click is stored in many fields at once:
- in the field for the minute (format YmdHi)
- in the field for the hour (format YmdH)
- in the field for the day (format Ymd)
- in the field for the week (format YW)
- in the field for the month (format Ym)
- in the field for the year (format Y).
That way, when trying to get a specific timeframe, I only have to access the necessary fields without looping through the hours.
For example, if I need clicks from 07/26/2011 20:00 to 07/28/2011 02:00, I only need to query 7 fields: 1 field for the full day of 07/27/2011, 4 fields for the hours from 20:00 to 23:00 on 07/26, and then 2 more fields for hours from 00:00 to 01:00 on 07/28
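Assuming the hash lives under a single key (the name clicks below is only an illustration) and the fields use the formats listed above, that whole timeframe is one HMGET call:
redis> HMGET clicks 2011072620 2011072621 2011072622 2011072623 20110727 2011072800 2011072801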
If you drop the third requirement it becomes a lot easier. A lot of people seem to think that you should always use hashes instead of keys, but this stems from a misunderstanding of a post about using hashes to improve performance in a particular limited set of circumstances.
To get the most clicked links, create a sorted set for each hour or day, with the member being the link and the score being the click count, updated using ZINCRBY. Use ZCARD and ZREVRANGEBYSCORE to get the top 10%. It is simplest if the set holds all links in the system, though there are strategies you can use to drop less popular items from the set if necessary.
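For example, with one sorted set per day (the key naming and the example URL are just an illustration; ZCARD gives the total number of links, from which you can work out how many entries make up the top 10% and pass that number as the LIMIT count):
redis> ZINCRBY clicks:20110730 1 "http://example.com/article"
redis> ZCARD clicks:20110730
redis> ZREVRANGEBYSCORE clicks:20110730 +inf -inf WITHSCORES LIMIT 0 10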
To get all users posting a link, store a set of users for each link. You could do this with JSON and a key or hash storing details for the link, but a set makes updating and querying easier.
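And for the users who posted a link, a plain set per link works (the key scheme is an assumption; in practice you would likely use a short link id or hash rather than the full URL):
redis> SADD link:posters:abc123 user:42
redis> SMEMBERS link:posters:abc123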
I recommend using some bucketing strategy, such as hashing keys or keeping link-to-user records per month, as you have no control over how large the data structure may grow. There could be millions of users visiting a particular link, and returning the details of all of them at once would be of no use. I believe what can be done is to maintain a counter or some metadata that acts as the current state, and then keep archival storage that is not in memory, or go for an in-memory data grid like GemFire.
