All web developers run into this problem when the amount of data in their project grows, and I have yet to see a definitive, intuitive best practice for solving it. When you start a project, you often create forms with select tags to help pick related objects for one-to-many relationships.
For instance, I might have a system with Neighbors where each Neighbor belongs to a Neighborhood. In version 1 of the application I create an edit form with a drop-down for selecting the Neighborhood, which simply lists the 5 possible neighborhoods in my geographically limited application.
In the beginning, this works great. So long as I have maybe 100 records or fewer, my select box will load quickly and be fairly easy to use. However, let's say my application takes off and goes national. Instead of 5 neighborhoods I have 10,000. Suddenly my little drop-down takes forever to load, and once it loads, it's hard to find your neighborhood in the massive alphabetically sorted list.
Now, in this particular situation, with hierarchical data, letting users drill down using several dynamically generated drop-downs would probably work okay. However, what is the best solution when the objects/records being selected are not hierarchical in nature? In the past, I've done this with a popup containing a search box and a list, but this seems clunky and dated. In today's web 2.0 world, what is a good way to find one object amongst many for one's forms?
I've considered using an Ajaxified search box, but this seems to work best for free text, and falls apart a little when the data to be saved is just a reference to another object or record.
Feel free to cite specific libraries with generic solutions to this problem, or simply share what you have done in your projects in a more general way.
I think an auto-completing text box is a good approach in this case. Here on SO, they also use an auto-completing box for tags, where the entry already needs to exist, i.e. it is not free text but a selection. (Remember that creating new tags requires reputation!)
I personally prefer this anyway, because I can type faster than I can select something with the mouse, but that is programmer's disease I guess :)
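To address the concern about saving a reference rather than free text: a common pattern is to autocomplete on the display name but store the matched record's id in a hidden field, so the form still submits a clean foreign key. Here is a minimal, language-agnostic sketch of the server-side suggestion lookup in Python; the NEIGHBORHOODS list, the field names and the suggest function are all made up for illustration.

```python
# hypothetical in-memory index; a real app would query the database with a
# prefix filter (e.g. WHERE name LIKE 'riv%') and a LIMIT instead
NEIGHBORHOODS = [
    {"id": 17, "name": "Riverdale"},
    {"id": 42, "name": "River North"},
    {"id": 99, "name": "Oak Park"},
]

def suggest(prefix, limit=10):
    """Return (id, name) pairs for the autocomplete drop-down. The form stores
    only the chosen id in a hidden field, so what gets saved is a real
    foreign key, not free text."""
    prefix = prefix.lower()
    matches = [n for n in NEIGHBORHOODS if n["name"].lower().startswith(prefix)]
    return [(n["id"], n["name"]) for n in matches[:limit]]

print(suggest("riv"))  # [(17, 'Riverdale'), (42, 'River North')]
```

On the client side, each suggestion carries the id alongside the label; selecting one fills the hidden field, and a submitted name with no matching id can simply be rejected.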
Auto-complete is usually the best solution in my experience for searches, but only where the user is able to provide text tokens easily, either as part of the object name or taxonomy that contains the object (such as a product category, or postcode).
However this doesn't always work, particularly where 'browse' behavior would be more suitable - to give a real example, I once wrote a page for a community site that allowed a user to send a message to their friends. We used auto-complete there, allowing multiple entries separated by commas.
It works great when you know the names of the people you want to send the message to, but we found during user acceptance that most people didn't really know who was on their friend list and couldn't use the page very well - so we added a list popup with friend icons, and that was more successful.
(this was quite some time ago - everyone just copies Facebook now...)
Different methods of organizing large amounts of data:
Hierarchies
Spatial (geography/geometry)
Tags or facets
Different methods of searching large amounts of data:
Filtering (including autocomplete)
Sorting/paging (alphabetically-sorted data can also be paged by first letter)
Drill-down (assuming the data is organized as above)
Free-text search
Hierarchies are easy to understand and (usually) easy to implement. However, they can be difficult to navigate and lead to ambiguities. Spatial visualization is by far the best option if your data is actually spatial or can be represented that way; unfortunately this applies to less than 1% of the data we normally deal with day-to-day. Tags are great, but - as we see here on SO - can often be misused, misunderstood, or otherwise rendered less effective than expected.
If it's possible for you to reorganize your data in some relatively natural way, then that should always be the first step. Whatever best communicates the natural ordering is usually the best answer.
No matter how you organize the data, you'll eventually need to start providing search capabilities, and unlike organization of data, search methods tend to be orthogonal - you can implement more than one. Filtering and sorting/paging are the easiest, and if an autocomplete textbox or paged list (grid) can achieve the desired result, go for that. If you need to provide the ability to search truly massive amounts of data with no coherent organization, then you'll need to provide a full textual search.
If I could point you to some list of "best practices", I would, but HID is rarely so clear-cut. Use the aforementioned options as a starting point and see where that takes you.
Imagine: someone has a huge website selling, let's say, T-shirts.
we want to show paginated, sorted listings of offers, with options to filter by parameters, let's say T-shirt colour.
offers should be sortable by any of 5 properties (creation date, price, etc.)
Important requirement 1: we have to give a user an ability to browse all the 15 million offers, and not just the "top-N".
Important requirement 2: they must be able to jump to a random page at any time, not just flick through them sequentially
we use a traditional data store (MongoDB, to be precise).
The problem is that MongoDB (as well as other traditional databases) performs poorly when it comes to big offsets. Imagine if a user wants to fetch a page of results somewhere in the middle of this huge list sorted by creation date with some additional filters (for instance - by colour)
There is an article describing this kind of problem:
http://openmymind.net/Paging-And-Ranking-With-Large-Offsets-MongoDB-vs-Redis-vs-Postgresql/
Okay, so we are told that Redis is a solution for this kind of problem. You "just" need to prepare certain data structures and search them instead of your primary storage.
the question is:
What kind of structures and approaches would you suggest using in order to solve this with Redis?
Sorted Sets, paging through with ZRANGE.
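To make that concrete, here's a hedged sketch (Python with redis-py; the key names like offers:by_date and offers:colour:<colour>:by_date are made up for illustration). The idea is to maintain one sorted set per filter/sort combination you need to page through, scored by the sort property and containing only offer ids, so a page deep in the list is a rank-range lookup rather than a huge offset scan; the full documents are then fetched from the primary store by id.

```python
import redis

r = redis.Redis()

def index_offer(offer_id, created_ts, colour):
    # called whenever an offer is written to the primary store (MongoDB);
    # members are just ids, so the sorted sets stay comparatively small
    r.zadd("offers:by_date", {offer_id: created_ts})
    r.zadd(f"offers:colour:{colour}:by_date", {offer_id: created_ts})

def fetch_page(colour, page, page_size=50):
    # newest first; a rank-range read costs O(log N + page_size) no matter
    # how deep the requested page is, unlike a large OFFSET in MongoDB
    start = page * page_size
    ids = r.zrevrange(f"offers:colour:{colour}:by_date", start, start + page_size - 1)
    return ids  # then load the full offer documents by id from the primary store
```

The trade-off is write amplification: every offer insert or update has to touch one sorted set per (filter, sort) combination you want to support.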
I was recently asked this in an interview (Software Engineer) and didn't really know how to go about answering the question.
The question was focused on both the algorithm of the spreadsheet and how it would interact with the browser. I was a bit confused about what data structure would be optimal to handle the cells and their values. I guess any form of hash table would work, with the cell being the unique key and the value being the object in the cell? And then when something gets updated, you'd just update that entry in your table. The interviewer hinted at a graph, but I was unsure how a graph would be useful for a spreadsheet.
Other things I considered were:
Spreadsheet in a browser = auto-save. At any update, send all the data back to the server
Cells that are related to each other, i.e. C1 = C2+C3, C5 = C1-C4. If the value of C2 changes, both C1 and C5 change.
Usage of design patterns? Does one stand out over another for this particular situation?
Any tips on how to tackle this problem? Aside from the algorithm of the spreadsheet itself, what else could the interviewer have wanted? Does the fact that it's in a browser, as compared to a separate application, add any difficulties?
Thanks!
For an interview this is a good question. If this were asked as an actual task in your job, then there would be a simple answer: use a third-party component; there are a few good commercial ones.
While we can't say for sure what your interviewer wanted, for me this is a good question precisely because it is so open ended and has so many correct possible answers.
You can talk about the UI and how to implement the kind of dynamic grid you need for a spreadsheet and all the functionality of the cells and rows and columns and selection of cells and ranges and editing of values and formulas. You probably could talk for a while on the UI implications alone.
Alternatively you can go the data route: talk about data structures to hold a spreadsheet, about links between cells for formulas, about how to detect and deal with circular references, and about how in a browser you have less control over memory, so for very large spreadsheets you could run into problems earlier. You can talk about what is available in JavaScript vs a native language and how this impacts the data structures and calculations.

Also, along with data, a big important issue with spreadsheets is numerical accuracy and floating point calculations. Floating point numbers are made to be fast but are not necessarily accurate at extreme levels of precision, and this leads to a lot of confusing questions. I believe Excel fairly recently switched to its own fixed decimal representation, as it's now viable to do spreadsheet-level calculations without using the built-in floating point operations.

You can also talk about how data structures and calculation affect performance. In a browser you don't have threads (yet), so you can't run all the calculations in the background. If you have 100,000 rows with complex calculations and change one value that cascades across everything, you can get a warning about a slow script. You need to break up the calculation.
Finally, you can come at it from the user experience angle. How is the experience in a browser different from a native application? What are the advantages, and what cool things can you do in a browser that may be difficult in a desktop application? What things are far more complicated or even totally impossible (for example, associating your spreadsheet app with a file type so a user can double-click a file and open it in your online spreadsheet app, although I may be wrong about that still being unsupported)?
Good question, lots of right answers, very open ended.
On the other hand, you could also have had a bad interviewer who is specifically looking for the one answer they want, and in that case you're pretty much out of luck unless you're telepathic.
You can say hopelessly too much about this. I'd probably start with:
If most of the cells are filled, use a simple 2D array to store it.
Otherwise use a hash table of location to cell
Or perhaps something like a kd-tree, which should allow for more efficient "get everything in the displayed area" queries.
By graph, your interviewer probably meant having each cell be a vertex and each reference to another cell be a directed edge. This would allow you to check for circular references fairly easily, and allow for efficient updating of all cells that need to change.
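As a rough illustration of that graph (a sketch in Python; the Sheet class, its method names and the lambda-based formulas are all invented for the example): cycle detection is a reachability check before an edge is added, and recalculation is a topological walk over just the affected cells.

```python
from collections import defaultdict, deque

class Sheet:
    """Cells are vertices; a formula like C1 = C2 + C3 adds edges C2 -> C1 and C3 -> C1."""

    def __init__(self):
        self.values = {}                    # cell name -> current value
        self.formulas = {}                  # cell name -> (func, [referenced cells])
        self.dependents = defaultdict(set)  # cell name -> cells whose formulas reference it

    def set_value(self, cell, value):
        self.values[cell] = value
        self._recalculate_dependents(cell)

    def set_formula(self, cell, refs, func):
        if self._reaches(cell, set(refs)):
            raise ValueError(f"circular reference involving {cell}")
        _, old_refs = self.formulas.get(cell, (None, []))
        for ref in old_refs:                 # drop edges from any previous formula
            self.dependents[ref].discard(cell)
        self.formulas[cell] = (func, list(refs))
        for ref in refs:
            self.dependents[ref].add(cell)
        self._evaluate(cell)
        self._recalculate_dependents(cell)

    def _reaches(self, start, targets):
        # is any target reachable from `start` by following dependent edges?
        seen, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in targets:
                return True
            for nxt in self.dependents[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    def _evaluate(self, cell):
        func, refs = self.formulas[cell]
        self.values[cell] = func(*(self.values.get(r, 0) for r in refs))

    def _recalculate_dependents(self, changed):
        # collect every cell affected by the change
        affected, stack = set(), list(self.dependents[changed])
        while stack:
            cell = stack.pop()
            if cell not in affected:
                affected.add(cell)
                stack.extend(self.dependents[cell])
        # Kahn's algorithm: recalculate in topological order so each formula
        # only ever sees inputs that have already been updated
        indegree = {c: sum(1 for r in self.formulas[c][1] if r in affected) for c in affected}
        queue = deque(c for c in affected if indegree[c] == 0)
        while queue:
            cell = queue.popleft()
            self._evaluate(cell)
            for nxt in self.dependents[cell]:
                if nxt in affected:
                    indegree[nxt] -= 1
                    if indegree[nxt] == 0:
                        queue.append(nxt)

# usage, mirroring the question's example: C1 = C2 + C3, C5 = C1 - C4
sheet = Sheet()
sheet.set_value("C2", 2)
sheet.set_value("C3", 3)
sheet.set_value("C4", 1)
sheet.set_formula("C1", ["C2", "C3"], lambda a, b: a + b)
sheet.set_formula("C5", ["C1", "C4"], lambda a, b: a - b)
sheet.set_value("C2", 10)   # C1 becomes 13, C5 becomes 12
```

The same structure translates directly to JavaScript in the browser; the interesting part is really just the two graph traversals.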
"In a browser" (presumably meaning "over a network" - actually "in a browser" doesn't mean all that much by itself - one can write a program that runs in a browser but only runs locally) is significant - you probably need to consider:
What are you storing locally (everything, or just the subset of cells that are currently visible)?
How are you sending updates to the server (every change as it happens, a collection of changed cells sent on save, or the whole grid sent across during save)?
Auto-save should probably be considered as well.
Will you have an "undo"? Will it be local only? If not, how will you handle it on the server, and how will you send through the updates?
Is only one user allowed to work with it at a time, or do you have to cater for multi-user, which brings dealing with conflicts, among other things, to the table?
Looking at the CSS cursor property just begs for one to create a spreadsheet web application.
HTML table or CSS grid? HTML tables are purpose-built for tabular data.
Resizing cell height and width is achievable with offsetX and offsetY.
Storing the data is trivial. It can be Mongo, MySQL, Firebase, ...whatever. On blur, send the update.
JavaScript/ECMAScript is more than capable of delivering all the Excel built-in functions. Did I mention web workers?
Need to increment letters as in column IDs? I got you covered - see the sketch below.
Most importantly, don't do it. Why? Because it's already been done.
Find a need and work that project.
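On the column-ID point above, a small, purely illustrative sketch in Python: turning a zero-based column index into Excel-style letters is just bijective base 26.

```python
def column_letters(index):
    """0 -> 'A', 25 -> 'Z', 26 -> 'AA', 701 -> 'ZZ', 702 -> 'AAA'."""
    letters = ""
    index += 1                       # work in 1-based "bijective" numbering
    while index > 0:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters

print([column_letters(i) for i in range(29)])  # ['A', 'B', ..., 'Z', 'AA', 'AB', 'AC']
```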
I'm aggregating concert listings from several different sources, none of which are both complete and accurate. Some of the data comes from users (such as on last.fm), and may be incorrect. Other data sources are highly accurate, but may not contain every event. I can use attributes such as the event date, and the city/state to try to match listings from disparate sources. I'd like to be reasonably certain that the events are valid. It seems like it would be a good strategy to consume as many different sources as possible to validate listings on error-prone sources.
I'm not sure what the technical term for this is, as I'd like to research it further. Is it data mining? Are there any existing algorithms? I understand a solution will never be completely accurate.
Here is an approach that locates it within statistics - specifically, it uses a Hidden Markov Model (http://en.wikipedia.org/wiki/Hidden_Markov_model):
1) Use your matching process to produce a cleaned list of possible events. Consider each event to be marked "true" or "bogus", even though the markings are hidden from you. You might imagine that some source of events produces them, generating them as either "true" or "bogus" according to a probability which is an unknown parameter.
2) Associate unknown parameters with each source of listings. These give the probability that this source will report a true event produced by the source of events, and the probability that it will report a bogus event produced by the source.
3) Notice that if you could see the markings of "true" or "bogus" you could easily work out the probabilities for each source. Unfortunately, of course, you can't see these hidden markings.
4) Let's call these hidden markings "latent variables"; then you can use the EM algorithm (http://en.wikipedia.org/wiki/Em_algorithm) to hill-climb to promising solutions for this problem from random starts - see the sketch after this list.
5) You can obviously make the problem more complicated by dividing events up into classes, and giving sources of listing parameters which make them more likely to report some classes of events than others. This might be useful if you have sources that are extremely reliable for some sorts of events.
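For concreteness, here is a minimal sketch of that EM idea in Python, under the assumptions the list above spells out: one unknown prior that a candidate event is real, plus two unknown per-source probabilities. All names and the initialisation choices are invented for the sketch; a real implementation would work in log space and check for convergence rather than running a fixed number of iterations.

```python
import random

def em_source_reliability(listings, num_sources, iters=200):
    """listings[i][s] == 1 if source s reported candidate event i, else 0.
    Returns the posterior P(event is real) per event and (a_s, b_s) per source,
    where a_s = P(source reports event | real), b_s = P(source reports event | bogus)."""
    n = len(listings)
    p = 0.5                                                     # prior P(event is real)
    a = [random.uniform(0.5, 0.9) for _ in range(num_sources)]  # random start
    b = [random.uniform(0.05, 0.5) for _ in range(num_sources)]
    q = [0.5] * n

    for _ in range(iters):
        # E-step: posterior probability that each candidate event is real,
        # given the current guesses for p, a and b
        for i, x in enumerate(listings):
            lik_real, lik_bogus = p, 1.0 - p
            for s in range(num_sources):
                lik_real *= a[s] if x[s] else (1.0 - a[s])
                lik_bogus *= b[s] if x[s] else (1.0 - b[s])
            denom = lik_real + lik_bogus
            q[i] = lik_real / denom if denom > 0 else 0.5

        # M-step: re-estimate the prior and the per-source reliabilities
        mass_real = sum(q)
        mass_bogus = n - mass_real
        p = mass_real / n
        for s in range(num_sources):
            a[s] = sum(q[i] * listings[i][s] for i in range(n)) / max(mass_real, 1e-9)
            b[s] = sum((1 - q[i]) * listings[i][s] for i in range(n)) / max(mass_bogus, 1e-9)

    return q, list(zip(a, b))
```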
I believe the term you are looking for is Record Linkage -
the process of bringing together two or more records relating to the same entity (e.g., person, family, event, community, business, hospital, or geographical area)
This presentation (PDF) looks like a nice introduction to the field. One algorithm you might use is Fellegi-Holt - a statistical method for editing records.
One potential search term is "fuzzy logic".
I'd use a float or double to store a probability (0.0 = disproved ... 1.0 = proven) of some event details being correct. As you encounter sources, adjust the probabilities accordingly (a rough sketch of this follows the list below). There's a lot for you to consider though:
attempting to recognise when multiple sources have copied from each other and reduce their impact
giving more weight to more recent data or data that explicitly acknowledges the old data (e.g. given a 100% reliable site saying "concert X to be held on 4th August" and an unknown blog alleging "concert X moved from 4th August to 9th", you might keep the probability of there being such a concert at 100% but keep a list with both dates and whatever probabilities you think appropriate...)
beware assuming things are discrete; contradictory information may reflect multiple similar events, dual billing, same-surnamed performers etc. - the more confident you are that the same things are referenced, the more the data can be combined to reinforce or negate each other
you should be able to "backtest" your evolving logic by using data related to a set of concerts where you now have full knowledge of their actual staging or lack thereof; process data posted before various cut-off dates prior to the events to see how the predictions you derive reflect the actual outcomes, tweak and repeat (perhaps automatically)
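One simple way to do that adjustment, sketched in Python under the assumption that each source's reliability can be collapsed to a single number (everything here is illustrative, not a finished model): treat each source's report as evidence and accumulate it in log-odds.

```python
import math

def combine_evidence(prior, reports):
    """reports: list of (source_says_event_is_real, source_reliability) pairs,
    where reliability in (0.5, 1.0) means the source is right more often than not."""
    log_odds = math.log(prior / (1 - prior))
    for says_real, reliability in reports:
        shift = math.log(reliability / (1 - reliability))
        log_odds += shift if says_real else -shift
    return 1 / (1 + math.exp(-log_odds))

# e.g. a 0.95-reliable listing site and a 0.6-reliable blog both report the concert
print(round(combine_evidence(0.5, [(True, 0.95), (True, 0.6)]), 3))
```

The per-source reliabilities could themselves be learned (for instance with the EM approach in the other answer) or tuned by the backtesting suggested above.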
It may be most practical to start scraping from the sites you have, then consider the logical implications of the types of information you're seeing. Which aspects of the problem need to be handled using fuzzy logic can then be decided. An evolutionary approach may mean reworking things, but may end up faster than getting bogged down in a nebulous design phase.
Data mining is about finding information from structured sources like a database, or a post where the fields are separated for you. There's some text mining in here when you have to parse the information out of free text. In either case, you could keep track of how many data sources agree on a show as a confidence measure. Either display the confidence measure or use it to decide if your data is good enough. There's lots to play with. Having a list of legitimate cities, venues and acts can help you decide if a string represents a legitimate entity. Your lists might even be in a database that lets you compare city and venue for consistency.
In our application we have a repository that contains things (they are called methods and queries, but this is not particularly relevant for this question). Each thing has a title, description (though some may lack both) and some other data. Users save things to repository and load and use things from repository.
I wonder what is the best way to organize the repository from a usability point of view. There seem to be two major approaches. The first approach is to put things in folders, subfolders and so on, and have a hierarchical structure similar to a filesystem. The second approach (which has become fashionable) is to have a flat space and assign zero or more tags to each thing, so that users can view a list of things for a particular tag.
Currently we use flat space, tags and search. It appears to be somewhat unmanageable. I am not sure if switching to folders/subfolders will make it better.
I would like to learn more about the pros and cons of each approach and what properties of the collection and the things themselves suggest using one or another approach or a combination of both. If anybody can point me to some studies or discussions of those, I would really appreciate that.
There is no reason you can't use both methods. To some extent finding things depends upon what the thing is and why it is being looked for. A hierarchical design can work well when somebody knows what they are looking for, and a tag/keyword based system can work better when the structure is less obvious.
Also, a network structure that links similar things can be very good, as you can see with the internet or Wikipedia.
I use the law of symmetry to help me in this situation.
First you build the tree-like structure in the back end, and then build the tagging system for the front end.
You use both to organize your data collection.
A tag cloud works better than a hierarchy if
the taxonomy is uncertain
("Now is this a small car or a large truck?")
there is no central authority for classification
there is no obvious or natural order between the classes
(cars can be classified by color or by size, there is no obvious rank between color and size)
new categories may be created on the fly
Otherwise, a hierarchy gives more confidence in completeness, as every item has exactly one obviously correct location: did I find all documents about birds? Is there really no document about five-story houses?
Tag clouds need some maintenance; I am not sure this can be completely user-provided:
Dealing with synonyms, tag synonyms, merging tags, clarifying tags (e.g. is "blue" a feeling or a color?)
Another option is attribute-value pairs. They can be built upon a well-maintained tag cloud, e.g. grouping "red / black / blue" tags under "color". They can also work with continuous values, and search can be extended to similar values when there are not enough exact results (such as age, date, or even multidimensional values like color).
However, this requires knowing ahead of time what search criteria users need. If you need to introduce a new category, you need to re-tag the entire body of documents.
See also my request for clarification: what are the problems? Not enough tagging? Tagging too distinct? Users not finding what they are looking for? Users not confident in search results?
I want to know the effective algorithms/data structures to identify the below information in streaming data.
Consider a real-time streaming data like twitter. I am mainly interested in the below queries rather than storing the actual data.
I need my queries to run on actual data but not any of the duplicates.
As I am not interested in storing the complete data, it will be difficult for me to identify duplicate posts. I can hash all the posts and check against the hashes, but I would also like to identify near-duplicate posts. How can I achieve this?
Identify the top k topics being discussed by the users.
I want to identify the top topics being discussed by users. I don't want the top frequency words as shown by twitter. Instead I want to give some high level topic name of the most frequent words.
I would like my system to be real-time. I mean, my system should be able to handle any amount of traffic.
I can think of map reduce approach but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes and both of them could store them in the index.
In a typical news source, one will be removing any stop words in the data. In my system I would like to update my stop words list by identifying top frequent words across a wide range of topics.
What would be an effective algorithm/data structure to achieve this?
I would like to store the topics over a period of time to retrieve interesting patterns in the data. Say, on Friday evening everyone wants to go to a movie. What would be an efficient way to store this data?
I am thinking of storing it in the Hadoop distributed file system, but over a period of time these indexes become so large that I/O will be my major bottleneck.
Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?
There are 2 problems here. One is identifying the language being used. It can be identified based on the person tweeting, but this information might affect the privacy of the users. Another idea could be running it through a training algorithm; what is the best method currently used for this? The other problem is actually looking up the word in a dictionary and associating it with a common intermediate language, say English. How do you take care of word sense disambiguation, e.g. the same word being used in different contexts?
Identify the word boundaries
One possibility is to use some kind of training algorithm, but what is the best approach? This is in some ways similar to word sense disambiguation, because you will be able to identify word boundaries based on the actual sentence.
I am thinking of developing a prototype and evaluating the system rather than a concrete implementation. I think it's not possible to scrape the real-time Twitter data, so I am thinking this approach can be tested on some data freely available online. Any ideas where I can get this data?
Your feedback is appreciated.
Thanks for your time.
-- Bala
There are a couple of different questions buried in here. I can't address everything you're asking, but here's the big one as I understand it: you want to categorize messages by topic, and you also want to remove duplicates.
Removing duplicates is (relatively) easy. To remove "near" duplicates, you could first remove uninteresting parts of your data. You could start by removing capitalization and punctuation. You could also remove the most common words. Then you could add the resulting message to a Bloom filter. Plain hashing isn't good enough for Twitter, as the hashed messages wouldn't be much smaller than the full messages; you'd end up with a hash table that doesn't fit in memory. That's why you'd use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be smaller than the hash table.
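A minimal sketch of that pipeline in Python (the bit-array size, hash count and stop-word list are placeholder choices; a production system would size the filter from the expected message volume and the acceptable false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=10_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # derive several hash positions by salting one cryptographic hash
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def normalize(message, stop_words):
    # strip case, punctuation and very common words so "near" duplicates collide
    cleaned = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in message)
    return " ".join(w for w in cleaned.split() if w not in stop_words)

seen = BloomFilter()
stop_words = {"the", "a", "an", "rt", "to", "of"}  # assumed stop-word list

def is_near_duplicate(message):
    key = normalize(message, stop_words)
    if key in seen:
        return True
    seen.add(key)
    return False
```

The usual Bloom-filter caveat applies: membership answers can be false positives, so a "duplicate" verdict will occasionally throw away a genuinely new message, which is often acceptable at this scale.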
The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.
Identifying the language sounds less difficult. Don't try to do something smart like "training", instead put the most common words from each language in a dictionary. For each message, use the language whose words appeared most frequently.
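In that spirit, a tiny illustrative sketch in Python (the word lists here are made-up stand-ins for real per-language frequency dictionaries):

```python
# hypothetical dictionaries of very common words per language
COMMON_WORDS = {
    "en": {"the", "and", "is", "you", "to", "of"},
    "es": {"el", "la", "que", "de", "y", "en"},
    "de": {"der", "die", "und", "ist", "nicht", "das"},
}

def guess_language(message):
    words = set(message.lower().split())
    scores = {lang: len(words & vocab) for lang, vocab in COMMON_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None = no evidence either way
```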
If you want the algorithm to come up with categories on its own, good luck.
I'm not really sure if I'm answering your main question, but you could determine the similarity of two messages by calculating the Levenshtein distance between them. You can think of this as the "edit distance" between two strings (i.e., how many single-character edits would need to be made to one string to convert it into the other).
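For reference, here is the textbook dynamic-programming version as a sketch in Python; a real system would use an optimized library and would probably cap the distance for long messages.

```python
def levenshtein(a, b):
    # classic dynamic programming, O(len(a) * len(b)) time, two rows of memory
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```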
Hello, we have created a very similar demo using api.cortical.io functionality.
There you can create semantic fingerprints of each tweet (you could also extract the top keywords or some similar terms that don't actually need to be part of the tweet).
We have used the fingerprints to filter the twitter stream based on content.
On twistiller.com you can see the result. The public 1% Twitter stream is monitored for four different topic areas.