Real time data processing - performance

I am parsing keywords several times per second. Every second i have 1000 - 5000 keywords. So i want to find outlier, growing and other stuff which called technical analysis. One of the problem is how to store data.
I will be able to do someting like:
20-01 20-02 20-03
brother 0 3 4
table 1 0 0
cup 34 54 78
But it might be a lot of keywords. For every new part of data i need to look is this word exists? If donnt then i must to add new words and add new rows for them. What is right way to organize store? Should i use key\value database, NoSQL or something else?

Related

Generate pairs from list that hasn't already historically existed

I'm building a pairing system that is supposed to create a pairing between two users and schedule them in a meeting. The selection is based upon a criteria that I am having a hard time figuring out. The criteria is that an earlier match cannot have existed between the pair.
My input is a list of size n that contains email addresses. This list is supposed to be split into pairs. The restriction is that this match hasn't occured previously.
So for example, my list would contain a couple of user ids
list = {1,5,6,634,533,515,61,53}
At the same time i have a database table where the old pairs exist:
previous_pairs
---------------------
id date status
1 2016-10-14 12:52:24.214 1
2 2016-10-15 12:52:24.214 2
3 2016-10-16 12:52:24.214 0
4 2016-10-17 12:52:24.214 2
previous_pair_users
---------------------
id userid
1 1
1 5
2 634
2 553
3 515
3 61
4 53
4 1
What would be a good approach to solve this problem? My test solution right now is to pop two random users and checking them for a previous match. If there exists no match, i pop a new random (if possible) and push one of the incorrect users back to the list. If the two people are last they will get matched anyhow. This doesn't sound good to me since i should predict which matches that cannot occur based on my list with already "existing" pairs.
Do you have any idea on how to get me going in regards to building this procedure? Java 8 streams looks interesting and might be a way to solve this, but i am very new to that unfortunately.
The solution here was to create a list with tuples that contain the old matches using group_concat feature of MySQL:
SELECT group_concat(MatchProfiles.ProfileId) FROM Matches
INNER JOIN MatchProfiles ON Matches.MatchId = MatchProfiles.MatchId
GROUP BY Matches.MatchId
old_matches = ((42,52),(12,52),(19,52),(10,12))
After that I select the candidates and generate a new list of tuples using my pop_random()
new_matches = ((42,12),(19,48),(10,36))
When both lists are done I look at the intersection to find any duplicates
duplicates = list(set(new_matches) & set(old_matches))
If we have duplicates we simply run the randomizer again X attemps until I find it impossible.
I know that this is not very effective when having a large set of numbers but my dataset will never be that large so I think it will be good enough.

Pulling data from Crystal Reports

My data is stored in Oracle and the only way I can run a report with it is using Crystal Reports. I have a set of data that looks like this ,,,,,,,,,,1, or ,1,,,,,,, or ,1,,,,,1,,,,1,. There are more variations.
Each one means a value is true for a record. There are about 54 'ticks/commas' What I want is all records with the one at the X spot. So for one report I may want all records in the 10th spot that have a 1. There may be other times where I want the records where the 1 is after spot 36. I agree it will pull other records but the main once I want is the X spot.
How do I get this? I tried a Like command but that does not narrow the data down far enough. I am familiar with SQL but not Crystal.
Any help would be great. TIA
In Crystal, you might try setting up a Parameter Field to hold a numeric value (1 to 54) then use that in a formula as the Record Selection. You'll be prompted to enter the parameter when you run the report.
In record selection i was initially going suggest the following which would bring back all records with 1 in the 12 spot. But this makes it hard to bring back a range.
split({yourfield},",")[12] = 1
This will bring back the same
instr({#test},"1") = 12
Then for your suggestion above you could use the following to bring back any record if it has a one in any spot after 36
instr({#test},"1") >= 37
As long as there is only one 1 in the field you can use this for other ranges as well.
Actually I don't want to disturb both answerts by Clayton Morris and CoSpringsGuy hence posting my answer.
You need to combine both the solutions to get the desired result.
Create a number parameter and provide either 1 to 57 numbers or keep just a filed to enter desired number.
Now in record selection use the formula given by CoSpringsGuy
if instr({databasefield.column},"1") = {?Inputnumber} --parameter field
then {databasefield.column}

Mahout recommender returns no results for a user

I'm curious why in the example below the Mahout recommender isn't returning a recommendation for user 1.
My input file is below. I added blank lines to enhance readability. This file will need the blank lines removed before it's run through Mahout.
The columns in this file are:
User ID | item number | item rating
1 101 0
1 102 0
1 103 5
1 104 0
2 101 4
2 102 5
2 103 4
2 104 0
3 101 0
3 102 5
3 103 5
3 104 3
You'll note that item 103 is the only common item that all 3 users rated.
I ran:
hadoop jar C:\hdp\mahout-0.9.0.2.1.3.0-1981\core\target\mahout-core-0.9.0.2.1.3.0-1981-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input small_data_set.txt --output small_data_set_output
The Mahout recommendation output file shows:
2 [104:4.5]
3 [101:5.0]
Which I believe means:
User 2 would be recommended item 104. Since user 3 rated item 104 a 3 this may account for the 4.5 recommendation score vs. the result below…
User 3 would be recommended item 101. Since user 2 rated item 101 a "4" this may account for a slightly higher recommendation score of 5.
Is this correct?
Why isn't user 1 included in the recommendation output file? User 1 could have received a recommendation for Item 102 because user 2 and user 3 rated it. Is the data set too small?
Thanks in advance.
Several mistakes may be present in your data, the first two here will cause undefined behavior:
IDs must be contiguous non-zero integers starting at 0 so you need to map your IDs above somehow. So your-user-ID = 1 will be a Mahout-user-ID = 0. The same for items, your-item-ID = 101 will be Mahout-user-ID = 0.
You should omit the 0 values from the input altogether if you mean that the user has expressed no preference, this makes the preference "undefined" in a sense. To do this omit the lines entirely.
Always use SIMILARITY_LOGLIKELIHOOD, it is widely measured as doing significantly better than the other methods unless you are trying to predict ratings, in that case use cosine.
If you use LLR similarity you should omit the values since they will be ignored.
There are very few uses for preference values unless you are trying to predict a user's rating for an item. The preference weights are useless in determining recommendation ranking, which is the typical thing to optimize. If you want to recommend the right things in the right order toss the values and use LLR.
The other thing that people sometimes do with values is show some weight of preference so 1 = a view of a product page and 5 = a product purchase. This will not work! I tried this with a large ecommerce dataset and found the recommendations were worse when adding in product views, even though there was 100 times more data. They are fundamentally different user actions with different user intent and so can't be mixed in this way.
If you really do want to mix different actions use the new multimodal recommender based on Mahout, Spark, and Solr described on the Mahout site here: It allows cross-cooccurrence type indicator calculations so you can use user location, likes and dislikes, view and purchase. Virtually the entire user clickstream can be used. But only with cross-cooccurrence correlating one action to the canonical "best" action, the one you want to recommend.

how to improve Neo4J performance in creating edges?

i'm building a traffic schedule application using Neo4J, NodeJS and GTFS-data; currently, i'm trying to get
things working for the traffic on a single day on the Berlin subway network. these are the grand totals
i've collected so far:
10 routes
211 stops
4096 trips
83322 stoptimes
to put it simply, GTFS (General Transit Feed Specification) has the concept of a stoptime which denotes the
event of a given train or bus stopping for passengers to board and alight. stoptimes happen on a trip,
which is a series of stoptimes, they happen on a specific date and time, and they happen on a given
stop for a given route (or 'line') in a transit network. so there's a lot of references here.
the problem i'm running into is the amount of data and the time it takes to build the database. in order
to speed up things, i've already (1) cut down the data to a single day, (2) deleted the database files
and have the server create a fresh one (very effective!), (3) searched a lot to get better queries. alas,
with the figures as given above, it still takes 30~50 minutes to get all the edges of the graph.
these are the indexes i'm building:
CREATE CONSTRAINT ON (n:trip) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stop) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:route) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:stoptime) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :trip(`route-id`);
CREATE INDEX ON :stop(`name`);
CREATE INDEX ON :stoptime(`trip-id`);
CREATE INDEX ON :stoptime(`stop-id`);
CREATE INDEX ON :route(`name`);
i'd guess the unique primary keys should be most important.
and here are the queries that take up like 80% of the running time (with 10% that are unrelated to Neo4J,
and 10% needed to feed the node data using plain HTTP post requests):
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
MATCH (stoptime:`stoptime`), (trip:`trip`)
WHERE stoptime.`trip-id` = trip.id
CREATE UNIQUE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]-(stoptime);
MATCH (stoptime:`stoptime`), (stop:`stop`)
WHERE stoptime.`stop-id` = stop.id
CREATE UNIQUE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]-(stoptime);
MATCH (a:stoptime), (b:stoptime)
WHERE a.`trip-id` = b.`trip-id`
AND ( a.idx + 1 = b.idx OR a.idx - 1 = b.idx )
CREATE UNIQUE (a)-[:linked]-(b);
MATCH (stop1:stop)-->(a:stoptime)-[:next]->(b:stoptime)-->(stop2:stop)
CREATE UNIQUE (stop1)-[:distance {`~label`: 'distance', value: 0}]-(stop2);
the first query is still in the range of some minutes which i find longish given that there are only
thousands (not hundreds of thousands or millions) of trips in the database. the subsequent queries that
involve stoptimes take several ten minutes each on my desktop machine.
(i've also calculated whether the schedule really contains 83322 stoptimes each day, and yes, it's plausible:
in Berlin, subway trains run on 10 lines for 20 hours a day with 6 or 12 trips per hour, and there are 173
subway stations: 10 lines x 2 directions x 17.3 stops per line x 20 hours x 9 trips per hour gives 62280,
close enough. there are some faulty? / double / extra stop nodes in the data (211
stops instead of 173), but those are few.)
frankly, if i don't find a way to speed up things at least tenfold (rather more), it'll make little sense to use Neo4J
for this project. just in order to cover the single city of Berlin many, many more stoptimes have to be added,
as the subway is just a tiny fraction of the overall public transport here (e.g. bus and tramway have like
170 routes with 7,000 stops, so expect around 7,000,000 stoptimes each day).
Update the above edge creation queries, which i perform one by one, have now been running for over an hour and not yet finished, meaning that—if things scale in a linear fashion—the time needed to feed the Berlin public transport data for a single day would consume something like a week. therefore, the code currently performs several orders of magnitude too slow to be viable.
Update #MichaelHunger's solution did work; see my response below.
I just imported 12M nodes and 12M rels into Neo4j in 10 minutes using LOAD CSV.
You should see your issues when you run profiling on your queries in the shell.
Prefix your query with profile and look a the profile output if it mentions to use the index or rather just label-scan.
Do you use parameters for your insert queries? So that Neo4j can re-use built queries?
For queries like this:
MATCH (trip:`trip`), (route:`route`)
WHERE trip.`route-id` = route.id
CREATE UNIQUE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
It will very probably not use your index.
Can you perhaps point to your datasource? We can convert it into CSV if it isn't and then import even more quickly.
Perhaps we can create a graph gist for your model?
I would rather use:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id` = route.id)
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]-(route);
For your initial import you also don't need create unique as you match every trip only once.
And I'm not sure what your "~label" is good for?
Similar for your other queries.
As the data is public it would be cool to work together on this.
Something I'd love to hear more about is how you plan do express your query use-cases.
I had a really great discussion about timetables for public transport with training attendees last time in Leipzig. You can also email me on michael at neo4j.org
Also perhaps you want to check out these links:
Tramchester
http://www.thoughtworks.com/de/insights/blog/transforming-travel-and-transport-industry-one-graph-time
http://de.slideshare.net/neo4j/graph-connect-v5
https://www.youtube.com/watch?v=AhvECxOhEX0
London Tube Graph
http://blog.bruggen.com/2013/11/meet-this-tubular-graph.html
http://www.markhneedham.com/blog/2014/03/03/neo4j-2-1-0-m01-load-csv-with-rik-van-bruggens-tube-graph/
http://www.markhneedham.com/blog/2014/02/13/neo4j-value-in-relationships-but-value-in-nodes-too/
detailed solution
i'm happy to report that #MichaelHunger's solution works like a charm. i modified the edge-building queries
from the question with the below shapes that keep to the suggested query outline:
MATCH (route:`route`)
MATCH (trip:`trip` {`route-id`: route.id})
CREATE (trip)-[:`trip/route` {`~label`: 'trip/route'}]->(route)
MATCH (trip:`trip`)
MATCH (stoptime:`stoptime` {`trip-id`: trip.id})
CREATE (trip)-[:`trip/stoptime` {`~label`: 'trip/stoptime'}]->(stoptime)
MATCH (stop:`stop`)
MATCH (stoptime:`stoptime` {`stop-id`: stop.id})
CREATE (stop)-[:`stop/stoptime` {`~label`: 'stop/stoptime'}]->(stoptime)
MATCH (a:stoptime)
MATCH (b:stoptime {`trip-id`: a.`trip-id`, `idx`: a.idx + 1})
CREATE (a)-[:linked {`~label`: 'linked'}]->(b)
MATCH (stop1:stop)--(a:stoptime)-[:linked]-(b:stoptime)--(stop2:stop)
CREATE (stop1)-[:distance {`~label`: 'distance', value: 0}]->(stop2)
as can be seen, the trick here is to give each participating node a MATCH statement of its own and to
move the WHERE clause inside the second match condition; presumably, as mentioned above, Neo4J can only
then take advantage of its indexes.
with these queries in place, the process of reading in nodes and building edges takes roughly 13 minutes;
of these 13 minutes, fetching the data from an external source, building the node representations and issuing CREATE queries
takes about 10 minutes, and building almost a half million edges between them is done in about 3 minutes.
right now none of my queries (especially the node CREATE statements and updates for stop distances) use
parametrized queries, which is another potential source for performance gains.
as for the ~label field and also the question why i use dahes in names where underscores would be more
convenient, well, that's a long story about what i perceive good and practical naming that sometimes clashes
with the syntax of some languages (of most languages, should i say). but that's boring detail. maybe more
intersting is the question: why is there a ~label attribute that repeats what the element label says (what
you write after the colon)? well, it's an attempt to comply with Neo4J conventions (we use labels here), take
advantage of the 'identifier, colon, label' syntax of cypher queries, AND to make it so the labels do
appear in the returned values.
mind you, labels are so central to graph thinking the Neo4J way, but *in query results, labels are
conspicuously absent. when you include a relationship that is marked with nothing but a label in your result set,
then that edge will arrive as an empty
object, telling you only that there is something but not what. so i decided i to duplicate the
label on each single node and each single edge. not an optimal solution but at least now i get an informative
graph display in the Neo4J browser.
as for how to express query use-cases, that's an active field of reserach for me right now. i guess it will
all start with a 'field of interest', like 'show all Berlin subway stops', or 'all busses departing within
the next 15 minutes from a bus stop near me'. the data already allows to see which stops are directly connected
by a subway line, their geographical distance, what services are present and what routes they take. the idea
is to grab the data and present them in novel, usable and beatiful ways. 9292 is quite
close to what i imagine; what's missing are graphical representations of spatial and temporal relationships.

Question related to the use case of Redis/NoSQL database

I am building a website where I will have two kind of users - X and Y. X will be few (hundreds) and Y will be many (millions). Basically for each X, there will be some set of Ys. Y will add friends in their network. X will post a message which can be seen to particular set of Y and those people can forward that message to their friends. Friends can see that message and can either forward to their friends or reply back to the sender.
So this is my use case and I was exploring different kind of databases primarily NoSQL databases because I am considering scalability and performance of the system as the main concerns in my website. I have started using Spring Data Redis APIs and found it quite useful to my use case. My question here is how do we perform 'update' operation in NoSQL databases specifically in Redis. Lets say I want to update user information stored in Redis. Another question is how do I perform operations like get me all the users who are below 30 years of age provided we have stored user information in Redis which has got 'age' field.
I am quite new in NoSQL world and have very less experience with it. I would also want to hear from experienced people about the right database for my use case. I have previously used Spring, Hibernate combination with MySQL as database and was not satisfied with the performance of the system when load is heavy on the system.
Thanks,
Sachin
My question here is how do we perform 'update' operation in NoSQL
databases specifically in Redis. Lets say I want to update user
information stored in Redis.
This depends on the structure to which you store your users. A fifteen minute introduction to Redis data types tutorial can help you to get a bigger picture of these data structures. Usually updates are done with SET operations, so if I have stored user information in hash structure and want to update certain field I would use HSET command.
Another question is how do I perform operations like get me all the
users who are below 30 years of age provided we have stored user
information in Redis which has got 'age' field.
Redis is an advanced key/value data store and it doesn't support ad hoc querying of data like you may be used to from SQL world. If you need this querying functionality you should rather look at other NoSQL solutions which supports it, for example try to look at MongoDB or some other from this list.
The best way to check if Redis can do what you want is to install it in a development machine and give redis-clia try. This built-in client has command completion and help available (just hit help <TAB> to see categories), and also a good documentation is available at http://redis.io/commands.
As for your request, you should study the sorted set datastructure, to store the user's info. This is better than sets alone, because it does allow you some basic form of queries, as you will see:
Suppose you want to be able to query all users between 18 and 30 years in your database, which contains (as an example) 3 users: Joe, 28 years; Bob, 17 years; and Adam, 50 years.
You would populate Redis as:
ZADD age 28 Joe
ZADD age 17 Bob
ZADD age 50 Adam
The syntax is ZADD [key] [score] [member]. The set is in [key], and in sorted sets each [member] has a [score] which MUST BE A DOUBLE. That means you can't query scores using string patterns or even string values (in this point Mongo should be better).
So, time to query. I will paste redis-cli queries now:
To list all users between 18 and 30 years, you would do:
redis> ZRANGEBYSCORE age 18 30
1) "Joe"
redis> ZRANGEBYSCORE age 18 30 WITHSCORES
1) "Joe"
2) "28"
The option WITHSCORES show the score of each member in the result.
To list all users under the age of 18:
redis> ZRANGEBYSCORE age 0 (18
1) "Bob"
redis> ZRANGEBYSCORE age 0 (18 WITHSCORES
1) "Bob"
2) "17"
The ( modifier allows the beginning or end interval to be open. It would be like where age >=0 and age < 18.
To list all users above the age of 30:
redis> ZRANGEBYSCORE age 30 +inf
1) "Adam"
redis> ZRANGEBYSCORE age 30 +inf WITHSCORES
1) "Adam"
2) "50"
The +inf is infinity, so there is no upper limit on the result, like where age >= 30.
And so on.
This is really just the beginning. You can do intersections, unions and select count(*) operations in sorted sets, all very fast, specially if you use an optimized library (like phpredis). Hope you get a good impression of it now.

Resources