The answer to this question may be obvious to someone with more experience in data-warehousing and BI, but I am looking for some guidance.
I'm building a system that uses multiple JMS queues to process millions of messages per day. I need visibility into the activity of these queues, so that I can create reports like: "Yesterday at 11:01am, how many messages entered queue A that had the word 'Foo' in them?"
To make matters worse, I have about 200k words I need to run this report on, for every minute of every day, across several queues, each processing millions of messages per day.
When I think of implementing a custom solution for this, I start going down a wormhole into a pit of despair. Surely I can't be the only person who has ever faced this problem before.
Anybody got any bright ideas?
In general for any MOM problem I suggest this site and the book it is based on.
In particular, check out the Message Store pattern and possibly the Detour pattern.
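To make the Message Store idea a bit more concrete, here is a minimal in-memory sketch. It assumes a copy of every message is handed to a `record` method along with the queue name and arrival time. `MessageStats`, `record` and `count` are made-up names for illustration, not part of JMS or of the pattern catalogue, and splitting on whitespace is a deliberate simplification.

```python
from collections import Counter
from datetime import datetime


class MessageStats:
    """Per-minute keyword counts, fed by a copy of each queued message."""

    def __init__(self, keywords):
        # Lowercase once so lookups are case-insensitive.
        self.keywords = {k.lower() for k in keywords}
        # (queue, minute, word) -> number of messages containing the word
        self.counts = Counter()

    def record(self, queue, received_at, body):
        # Truncate the timestamp to its minute bucket.
        minute = received_at.replace(second=0, microsecond=0)
        # Tokenize once and intersect with the keyword set, rather than
        # scanning the body once per keyword.
        words = set(body.lower().split())
        for word in words & self.keywords:
            self.counts[(queue, minute, word)] += 1

    def count(self, queue, minute, word):
        return self.counts[(queue, minute, word.lower())]


stats = MessageStats(["Foo", "bar"])
stats.record("A", datetime(2024, 1, 1, 11, 1, 5), "hello foo world")
stats.record("A", datetime(2024, 1, 1, 11, 1, 40), "Foo foo again")
stats.record("A", datetime(2024, 1, 1, 11, 2, 0), "foo")
print(stats.count("A", datetime(2024, 1, 1, 11, 1), "Foo"))
```

The point of intersecting with a set is that the cost per message depends on the message length, not on the 200k keywords. A real system would presumably flush these counters to a database every minute instead of keeping them in memory.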
I am working on a project that involves many clients connecting to a server (or servers, if need be) that holds a bunch of graph info (node attributes and edges). They will have the option to introduce a new node or edge anytime they want and then request some information from the graph as a whole (shortest distance between two nodes, graph coloring, etc).
It is obviously quite easy to develop the naive algorithm for this, but I am trying to learn how to scale it so that it can handle many users updating the graph at the same time, many users requesting information from the graph, and a very large number of nodes (500k+) and possibly a very large number of edges as well.
The challenges I can foresee:
with a constantly updating graph, I need to process the whole graph every time someone requests information...which will increase computation time and latency quite a bit
with a very large graph, the computation time and latency will obviously be a lot higher (I read that this was remedied by some companies by batch processing a ton of results and storing them with an index for later use...but then since my graph is being constantly updated and users want the most up to date info, this is not a viable solution)
a large number of users requesting information which will be quite a load on the servers since it has to process the graph that many times
How do I start facing these challenges? I looked at hadoop and spark, but they seem to offer either high-latency solutions (with batch processing) or solutions that address problems where the graph is not constantly changing.
I had the idea of maybe processing different parts of the graph and indexing them, then keeping track of where the graph is updated and re-processing that section of the graph (a kind of distributed dynamic programming approach), but I'm not sure how feasible that is.
Thanks!
How do I start facing these challenges?
I'm going to answer this question, because it's the important one. You've enumerated a number of valid concerns, all of which you'll need to deal with and none of which I'll address directly.
In order to start, you need to finish defining your semantics. You might think you're done, but you're not. When you say "users want the most up to date info", does "up to date" mean:
1. "everything in the past", which leads to total serialization of each transaction to the graph, so that answers reflect every possible piece of information?
2. Or "everything transacted more than X seconds ago", which leads to partial serialization, with multiple database states in the present that are progressively serialized into the past?
If 1. is required, you may well have unavoidable hot spots in your code, depending on the application. In return, you know immediately when to roll back a transaction because of inconsistency.
If 2. is acceptable, you have the possibility for much better performance. There are tradeoffs, though. You'll have situations where you have to roll back a transaction after initial acceptance.
Once you've answered this question, you've started facing your challenges and, I assume, will have further questions.
I don't know much about graphs, but I do understand a bit of networking.
One rule I try to keep in mind is... don't do work on the server side if you can get the client to do it.
All your server needs to do is maintain the raw data, serve raw data to clients, and notify connected clients when data changes.
The clients can have their own copy of raw data and then generate calculations/visualizations based on what they know and the updates they receive.
Clients only need to know if there are new records or if old records have changed.
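A rough sketch of that idea, assuming the server keeps an append-only change log with a version number per change, and each client remembers the last version it has applied. `GraphServer`, `GraphClient` and the operation names are invented for this example, and notification is reduced to the client calling `sync`.

```python
class GraphServer:
    """Holds only raw changes; does no graph computation itself."""

    def __init__(self):
        self.version = 0
        self.log = []  # entries are (version, op, payload)

    def apply(self, op, payload):
        self.version += 1
        self.log.append((self.version, op, payload))
        return self.version

    def changes_since(self, version):
        # Versions are contiguous from 1, so version v sits at index v - 1
        # and everything newer than v starts at index v.
        return self.log[version:]


class GraphClient:
    """Keeps a local copy of the graph and applies only the deltas."""

    def __init__(self):
        self.version = 0
        self.nodes = set()
        self.edges = set()

    def sync(self, server):
        for version, op, payload in server.changes_since(self.version):
            if op == "add_node":
                self.nodes.add(payload)
            elif op == "add_edge":
                self.edges.add(payload)
            self.version = version


server = GraphServer()
client = GraphClient()
server.apply("add_node", "a")
server.apply("add_node", "b")
server.apply("add_edge", ("a", "b"))
client.sync(server)
server.apply("add_node", "c")
client.sync(server)  # transfers only the one new change
```

Shortest paths, coloring and so on would then run on the client against `client.nodes` and `client.edges`. Whether a 500k-node graph fits comfortably on your clients is a separate question this sketch does not answer.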
If, for some reason, you ABSOLUTELY have to process data server side and send it to the client (for example, client is 3rd party software, not something you have control over and it expects processed data, not raw data), THEN, you do have a bit of an issue, so get a bad ass server... or 3 or 30. In this case, I would have to know exactly what the data is and how it's being processed in order to make any kind of suggestions on scaled configuration.
I have a GTFS feed defined for my fleet. This tells the routes, trips and timings. Now using this GTFS feed, is it possible to optimize the utilization of my fleet's vehicles? Can I schedule the vehicles such that once a vehicle completes a trip, it can be assigned to serve a trip on another route?
I have constraints such as no vehicle should be running more than 12 hours, every vehicle will undergo a health check for 2 hrs, etc.
To me this sounds like a case of the Knapsack problem.
If such a project exists, kindly let me know. Is there an algorithm that can solve this problem?
Thanks,
Yash
You're asking a question that is typically assigned to a scheduling system, one which would produce GTFS files from the get-go. In smaller systems, this actually is not difficult to do, but as the number of routes (or "trip patterns") increases, the process gets more complex.
Before you undertake any project like this, I suggest reading over the TCRP manual on scheduling, paying close attention to the terms "cycle time," "headway," and "interlining."
While I'd love to help more, I don't have time right now to get into the specifics. I performed a similar analysis with automatically collected cycle times on a limited set of routes in my master's thesis, starting on page 118.
I hope this helps. If you have any follow-up questions, post a comment and I'll respond when I have time.
I'm somewhat new to noSQL databases (I'm fairly good with relational databases though), and I'm wondering what the most efficient way to handle an inbox system with threaded messages would be.
Each 'message' will have a single sender and recipient. The number of received / sent messages will vary widely between users. This system should scale well to 1k+ users.
I've read up on fan out on write / read but I'm not sure how well this would work for threaded messages.
Since I'm new to MongoDB / NoSQL in general, I'm not really used to structuring data efficiently this way.
I'm guessing there's going to be nested objects in any sort of efficient way of handling this...but I can't settle on a design that seems both efficient and convenient for threaded conversations between 2 users.
I thought of storing data with an array of the 2 users, combined with an array of 'message' objects. But then there's the issue of the order of the 2 users' usernames (e.g. [UserA, UserB] and [UserB, UserA] are both possible and would be problematic, so that seemed like a bad idea).
I thought of doing the whole fan out on read / write thing, but that doesn't seem efficient for threaded messages (since if grabbing messages by recipient is convenient, grabbing messages by sender won't be and vice versa).
I'm leaning towards favoring grabbing messages by recipient (since the inbox loads multiple messages, and sending only involves one [albeit with a longer look-up time]). But I'd really like to grab a threaded conversation in one go, as well as the list of users that a user has threaded conversations with (for the list of threads).
If someone could give me an efficient schema for threaded conversations I'd be very grateful. I've been researching this and trying to settle on a design for hours, and I'm exhausted. I keep finding flaws in my designs and scrapping them and I'd really just like some input from someone more experienced with NoSQL databases / MongoDB so I can avoid making a huge design flaw and/or writing logic that could've been handled with a better database design.
Thanks in advance for any and all help.
On this particular topic you are in luck: there is a great post discussing the various approaches to the schema here (it's a slight twist on what you are looking at, but not much different):
http://blog.mongodb.org/post/65612078649/schema-design-for-social-inboxes-in-mongodb
Then, this topic was also covered in detail at MongoDB World 2014 in three parts by Darren Wood and Asya Kamsky:
Part 1 Outline and Video
Part 2 Outline and Video
Part 3 Outline and Video
Also at MongoDB World the guys at Dropbox talked about the lessons they learned when building their Mailbox:
http://www.mongodb.com/presentations/mongodb-mailbox
And then, to round it off, there is a full reference architecture with code called Socialite on Github written by the aforementioned Darren Wood:
https://github.com/10gen-labs/socialite
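On the specific worry about [UserA, UserB] versus [UserB, UserA]: one common way around it is to sort the pair and derive a single conversation key from it. The sketch below uses plain Python dicts to show the document shape. It is not taken from the linked post or from Socialite, the field names are made up, and it assumes usernames never contain the ":" separator.

```python
def conversation_id(user_a, user_b):
    """Return the same key no matter who is the sender."""
    first, second = sorted([user_a, user_b])
    return f"{first}:{second}"


def new_message(sender, recipient, text, sent_at):
    # One document per message; the conversation key ties a thread together.
    return {
        "conversation": conversation_id(sender, recipient),
        "participants": sorted([sender, recipient]),
        "sender": sender,
        "recipient": recipient,
        "text": text,
        "sent_at": sent_at,
    }


def thread(messages, user_a, user_b):
    """All messages between two users, oldest first."""
    cid = conversation_id(user_a, user_b)
    matching = (m for m in messages if m["conversation"] == cid)
    return sorted(matching, key=lambda m: m["sent_at"])


def threads_for(messages, user):
    """Everyone the given user has a conversation with."""
    partners = {
        p
        for m in messages
        if user in m["participants"]
        for p in m["participants"]
        if p != user
    }
    return sorted(partners)


msgs = [
    new_message("bob", "alice", "hi", 1),
    new_message("alice", "bob", "hey", 2),
    new_message("carol", "alice", "yo", 3),
]
print([m["text"] for m in thread(msgs, "bob", "alice")])
print(threads_for(msgs, "alice"))
```

In MongoDB the two in-memory filters would presumably become queries on the `conversation` and `participants` fields, each backed by an index, so a whole thread comes back in one query.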
We have some data that we are trying to synchronize between N machines and a centralized server, and I'm looking for a way to do this that is relatively efficient and robust.
Looking around, it appears that this is called a "set reconciliation problem". It's good to have a label for it, but searching on that turns up a lot of fairly academic work, which is at times a bit difficult to gauge in terms of its usefulness for our data, which is best described as contact lists in terms of its properties: objects (people) with multiple fields that do get updated, but not that often.
Our system involves a central server and machines connected to it. The central server, ideally, is the 'good' copy. A feature that's nice to have also, is the ability to force the machines to resend by tweaking something on the server.
So far, my thinking is along the lines of a UUID for each object and something like a version or timestamp (per object and or per collection of objects?) to use to tell which data to attempt to synchronize... but my thinking is still a bit fuzzy, and I thought asking would probably lead to a better solution than trying to invent this on my own.
It is not easy, and the perfect solution is academic. So you are on the right track.
You can craft a sync algorithm for your own problem, relaxing some of the requirements of the general solution.
I delivered a presentation on these topics at the last JsDay in Italy.
Here are my slides: http://www.slideshare.net/matteocollina/operational-transformation-12962149
Let me know if they help you, or if you need some assistance.
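As an example of a relaxed, home-grown scheme along the lines the question describes (a UUID and a version per object), here is a small sketch. It is not operational transformation and is not taken from the slides. It assumes machines push to the server, that only one machine edits a given object (so there is no conflict resolution), and that the server exposes an "epoch" number which it bumps to force a full resend. All class and method names are invented for the example.

```python
import uuid


def new_key():
    return str(uuid.uuid4())


class Server:
    def __init__(self):
        self.epoch = 1      # bump to make every machine resend everything
        self.objects = {}   # key -> (version, data)

    def receive(self, items):
        for key, (version, data) in items.items():
            stored = self.objects.get(key)
            # Keep an incoming object only if it is newer than what we hold.
            if stored is None or version > stored[0]:
                self.objects[key] = (version, data)

    def force_resend(self):
        self.epoch += 1


class Machine:
    def __init__(self):
        self.objects = {}   # key -> (version, data)
        self.acked = {}     # key -> last version the server has accepted
        self.epoch = None

    def put(self, key, data):
        version = self.objects.get(key, (0, None))[0] + 1
        self.objects[key] = (version, data)

    def sync(self, server):
        if self.epoch != server.epoch:
            # The server asked for a resend: forget what was acknowledged.
            self.acked.clear()
            self.epoch = server.epoch
        pending = {
            k: v for k, v in self.objects.items() if v[0] > self.acked.get(k, 0)
        }
        server.receive(pending)
        for key, (version, _) in pending.items():
            self.acked[key] = version
        return len(pending)  # number of objects actually sent
```

A machine that calls `sync` twice in a row sends nothing the second time, and after `server.force_resend()` its next `sync` sends every object again.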
I'm planning on creating a social network and I don't think I quite understand how the status update module of facebook is designed. Hoping I can find some help here. At algorithmic and datastructure level, what is the most efficient way to create a status update mechanism in a social network?
A full table scan for all friends and then sorting their updates is very naive and costly. Do we use some sort of mechanism based on hashing or something else? Please let me know.
P.S: I'm not talking about their EdgeRank algorithm but the basic status update. How do they find and fetch them from the database?
Thanks in advance for the help!
Here is a great presentation that answers your question. The specific answer comes up at around minute 55:40, but I suggest that you watch the entire presentation to understand how the solution fits into the entire architecture.
In short:
A particular server ("leaf") stores all feed items for a particular user. So data for each of your friends is stored entirely at a specific destination.
When you want to view your news feed, one of the aggregator servers sends requests to the leaf servers for all of your friends and ranks the results. The aggregator knows which servers to send requests to based on the userid of each friend.
This is terribly simplified, of course. This only works because all of it is memcached, the system is designed to minimize latency, some ranking is done at the leaf server that contains the friend's feed items, etc.
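A toy version of the leaf and aggregator split, to show the shape of the lookup. This is only a sketch under my own assumptions, not Facebook's code: user ids are integers, a friend's leaf is chosen with a simple modulo, ranking is plain recency, and everything lives in one process. `Leaf`, `leaf_for` and `news_feed` are made-up names.

```python
import heapq
from itertools import islice

NUM_LEAVES = 4  # arbitrary number of leaf servers for the example


class Leaf:
    """Stores all feed items for the users assigned to it."""

    def __init__(self):
        self.items = {}  # user_id -> list of (timestamp, text), newest first

    def post(self, user_id, timestamp, text):
        feed = self.items.setdefault(user_id, [])
        feed.append((timestamp, text))
        feed.sort(reverse=True)

    def recent(self, user_id, limit):
        return self.items.get(user_id, [])[:limit]


def leaf_for(user_id, leaves):
    # The aggregator derives the leaf from the user id alone.
    return leaves[user_id % len(leaves)]


def news_feed(friend_ids, leaves, limit=10):
    """Aggregator: ask each friend's leaf, then merge newest-first."""
    per_friend = [leaf_for(f, leaves).recent(f, limit) for f in friend_ids]
    merged = heapq.merge(*per_friend, reverse=True)
    return list(islice(merged, limit))


leaves = [Leaf() for _ in range(NUM_LEAVES)]
for uid, ts, text in [(1, 10, "a"), (2, 12, "b"), (1, 15, "c"), (7, 11, "d")]:
    leaf_for(uid, leaves).post(uid, ts, text)

print(news_feed([1, 2], leaves, limit=2))
```

Because each leaf returns its items already sorted, the aggregator only has to merge short sorted lists, which is cheap compared with scanning and sorting everything in one place.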
You really don't want to be hitting the database for any of this to work at a reasonable speed. FB uses MySQL mostly as a key-value store; JOINing tables is just impossible at their scale. Then they put memcache servers in front of the databases and application servers.
Having said that, don't worry about scaling problems until you have them (unless, of course, you are worrying about them for the fun of it.) On day one, scaling is the least of your problems.