Can I filter large lists with angular on client-side? - performance

I'm trying to find some information about the performance of angular.
If I had a list of 10k (or 50k) objects with 20 attributes each, would an average pc be able to filter this array in a reasonable time?
(Assuming a good implementation and this being the only operation executed)
Does anybody have any experience in this direction?

Generally speaking, unless you can find a way to do this asynchronously, it's going to cause blocking UI issues. I've found the average "PC" (and let's not forget browsers are not created equal) is too slow for my use cases, and that was with simpler objects (fewer attributes).
If you have a very limited set of attributes you actually want to filter on, there is a chance you could speed up the work by leveraging IndexedDB.
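To make the asynchronous option concrete, here is a minimal sketch of pushing the filtering into a Web Worker so the main thread (and the Angular digest) never blocks. This is plain TypeScript; the file name, message shape, and the `name` attribute are invented for illustration.

```typescript
// filter.worker.ts - hypothetical dedicated worker: holds the full array
// and answers filter requests off the main thread.
interface Item {
  id: number;
  name: string;
  // ...up to ~20 attributes in the real data set
}

const ctx = self as unknown as DedicatedWorkerGlobalScope;
let items: Item[] = [];

ctx.onmessage = (e: MessageEvent) => {
  const msg = e.data;
  if (msg.type === 'load') {
    items = msg.items as Item[];            // receive the 10k-50k records once
  } else if (msg.type === 'filter') {
    const q = (msg.query as string).toLowerCase();
    const matches = items.filter(it => it.name.toLowerCase().includes(q));
    ctx.postMessage({ type: 'result', matches });
  }
};
```

On the main thread you would create the worker with `new Worker('filter.worker.js')`, post the array once, post a `filter` message as the user types, and apply the returned subset to the scope inside a digest. The list still gets scanned in full, but the page stays responsive while it happens.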

Related

Efficient way to represent locations, and query based on proximity?

I'm pondering over how to efficiently represent locations in a database, such that given an arbitrary new location, I can efficiently query the database for candidate locations that are within an acceptable proximity threshold to the subject.
Similar things have been asked before, but I haven't found a discussion based on my criteria for the problem domain.
Things to bear in mind:
Starting from scratch, I can represent data in any way (eg. long&lat, etc)
Any result set is time-sensitive, in that it loses validity within a short window of time (~5-15mins) so I can't cache indefinitely
I can tolerate some reasonable margin of error in results, for example if a location is slightly outside of the threshold, or if a row in the result set has very recently expired
A language agnostic discussion is perfect, but in case it helps I'm using C# MVC 3 and SQL Server 2012
A couple of first thoughts:
Use an external API like Google; however, this will generate thousands of requests and the latency will be poor
Use the Haversine function; however, this looks expensive and so should be performed on a minimal number of candidates (possibly even as a Stored Procedure!)
Build a graph of postcodes/zipcodes, such that from any node I can find postcodes/zipcodes that border it; however, this could involve a lot of data to store
Some optimization ideas to reduce possible candidates quickly:
Cache result sets for searches, and when we do subsequent searches, see if the subject is within an acceptable range to a candidate we already have a cached result set for. If so, use the cached result set (but remember, the results expire quickly)
I'm hoping the answer isn't just raw CPU power, and that there are some approaches I haven't thought of that could help me out?
Thank you
ps. Apologies if I've missed previously asked questions with helpful answers, please let me know below.
What about using GeoHash? (refer to http://en.wikipedia.org/wiki/Geohash)
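To sketch how that could look (plain TypeScript, no external libraries; the Location shape and the 5-character precision are illustrative choices, not from the question): encode each row's coordinates as a geohash, use a cheap prefix match to shrink the candidate set, and only then run the Haversine formula on what remains.

```typescript
const BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz';

// Encode a lat/lon pair as a geohash string of the given length.
function geohashEncode(lat: number, lon: number, precision = 6): string {
  let latMin = -90, latMax = 90;
  let lonMin = -180, lonMax = 180;
  let hash = '';
  let bits = 0, ch = 0;
  let evenBit = true;                        // geohash interleaves lon, lat, lon, ...

  while (hash.length < precision) {
    if (evenBit) {
      const mid = (lonMin + lonMax) / 2;
      if (lon >= mid) { ch = (ch << 1) | 1; lonMin = mid; } else { ch = ch << 1; lonMax = mid; }
    } else {
      const mid = (latMin + latMax) / 2;
      if (lat >= mid) { ch = (ch << 1) | 1; latMin = mid; } else { ch = ch << 1; latMax = mid; }
    }
    evenBit = !evenBit;
    if (++bits === 5) {                      // every 5 bits becomes one base32 character
      hash += BASE32[ch];
      bits = 0;
      ch = 0;
    }
  }
  return hash;
}

// Great-circle distance in km via the Haversine formula.
function haversineKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const toRad = (d: number) => (d * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a = Math.sin(dLat / 2) ** 2 +
            Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * 6371 * Math.asin(Math.sqrt(a));   // 6371 km = mean Earth radius
}

interface Location { id: number; lat: number; lon: number; geohash: string; }

// Cheap candidate pass (shared 5-char geohash prefix, roughly a 4.9 x 4.9 km cell),
// then the exact Haversine check only on that small set.
function nearby(subjLat: number, subjLon: number, all: Location[], thresholdKm: number): Location[] {
  const prefix = geohashEncode(subjLat, subjLon, 5);
  return all
    .filter(loc => loc.geohash.startsWith(prefix))
    .filter(loc => haversineKm(subjLat, subjLon, loc.lat, loc.lon) <= thresholdKm);
}
```

A prefix match alone misses points just across a cell boundary, so a production version would also check the eight neighbouring geohash cells, or accept that margin of error, which the question explicitly tolerates. On the SQL Server side the prefix pass maps to an indexed LIKE 'prefix%' over a geohash column, keeping the expensive distance check off the bulk of the table.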

jQuery Chosen is slow on IE, show results after x chars?

I'm using the chosen plugin (http://harvesthq.github.io/chosen/) with 10K items in the select box
On IE9 & IE10 it's very slow.
Is there a way to speed up the plugin?
I was thinking that results would only show up after x characters have been typed, but I can't find any documentation on that.
That number of entries will likely be slow, no matter which plugin you use (assuming the plugin keeps the elements in memory the whole time).
If you REALLY need to have all of those options available, it may be faster to have a search performed server-side, and return the resulting elements, rebuilding the select box afterwards. I'm not sure if 'Chosen' has this capability, but I'm sure there is a jQuery plugin around somewhere that would provide this functionality.
10k is a lot of elements to go through and prune, and it's fair to say IE has always been a tad on the slow side for JS.
With regard to why it's not speeding up after providing a certain number of characters, I imagine it's searching the whole data set every time, instead of (when a character is added) a subset (the previously returned results).
This could be improved, perhaps by using a result-set history of some kind, but that would require substantial development.
Edit: Something like this possibly? http://ivaynberg.github.io/select2/
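For what it's worth, the server-side variant suggested above is only a few lines even without a plugin. A hedged sketch in TypeScript, where the /items/search endpoint and element ids are invented: wait for a minimum number of characters, debounce, fetch the matching subset, and rebuild the select.

```typescript
// Hypothetical endpoint and element ids, for illustration only.
// Waits for at least 3 characters, debounces keystrokes, asks the server
// for matches, and rebuilds the <select> with just that subset.
const MIN_CHARS = 3;
const input = document.getElementById('item-search') as HTMLInputElement;
const select = document.getElementById('item-select') as HTMLSelectElement;
let timer: number | undefined;

input.addEventListener('input', () => {
  window.clearTimeout(timer);
  const term = input.value.trim();
  if (term.length < MIN_CHARS) return;             // don't search until x chars are typed

  timer = window.setTimeout(async () => {
    const res = await fetch(`/items/search?q=${encodeURIComponent(term)}`);
    const items: { id: string; label: string }[] = await res.json();

    select.innerHTML = '';                          // throw away the old 10k options
    for (const item of items) {
      const opt = document.createElement('option');
      opt.value = item.id;
      opt.textContent = item.label;
      select.appendChild(opt);
    }
    // If a plugin like Chosen/Select2 wraps the select, tell it to
    // re-read the rebuilt options here.
  }, 250);
});
```

Select2 (linked above) ships with remote-data support that wraps essentially this pattern; if you stay with Chosen you would need to trigger its update yourself after rebuilding the options.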

Scalable real time item based mahout recommender with precomputed item similarities using item similarity hadoop job?

I have the following setup:
boolean data: (userid, itemid)
Hadoop-based Mahout itemSimilarityJob with the following arguments:
--similarityClassname Similarity_Loglikelihood
--maxSimilaritiesPerItem 50 & others (input,output..)
item based boolean recommender:
-model MySqlBooleanPrefJDBCDataModel
-similarity MySQLJDBCInMemoryItemSimilarity
-candidatestrategy AllSimilarItemsCandidateItemsStrategy
-mostSimilarItemsCandidateStrategy AllSimilarItemsCandidateItemsStrategy
Is there a way to use similarity co-occurrence in my setup to get final recommendations? If I plug SIMILARITY_COOCCURENCE into the job, the MySqlJDBCInMemorySimilarity precondition checks fail since the counts become greater than 1. I know I can get final recommendations by running the recommender job on the precomputed similarities. Is there a way to do this in real time using the API, as in the case of similarity log-likelihood (and other similarity metrics with similarity values between -1 & 1), using MysqlInMemorySimilarity?
How can we cap the max number of similar items per item in the item similarity job? What I mean here is that the AllSimilarItemsCandidateItemsStrategy calls .allSimilarItems(item) to get all possible candidates. Is there a way I can get, say, the top 10/20/50 similar items using the API? I know we can pass --maxSimilaritiesPerItem to the item similarity job, but I am not completely sure what it stands for and how it works. If I set this to 10/20/50, will I be able to achieve what is stated above? Also, is there a way to accomplish this via the API?
I am using a rescorer for filtering out and rescoring final recommendations. With the rescorer, the calls to /recommend/userid?howMany=10&rescore={..} and to /similar/itemid?howMany=10&rescore={..} are taking way longer (300ms-400ms) compared to without the rescorer (30-70ms). I'm using Redis as an in-memory store to fetch rescore data. The rescorer also receives some run-time data as shown above. There are only a few checks that happen in the rescorer. The problem is that as the number of item preferences for a particular user increases (> 100), the number of calls to isFiltered() & rescore() increases massively. This is mainly because, for every user preference, the call to candidateStrategy.getCandidateItems(item) returns around 100+ similar items, and the rescorer is called for each of these items. Hence the need to cap the max number of similar items per item in the job. Is this correct, or am I missing something here? What's the best way to optimise the rescorer in this case?
The MysqlJdbcInMemorySimilarity uses GenericItemSimilarity to load item similarities in memory, and its .allSimilarItems(item) returns all possible similar items for a given item from the precomputed item similarities in MySQL. Do I need to implement my own item similarity class to return the top 10/20/50 similar items? And what if the user's number of preferences continues to grow?
It would be really great if anyone could tell me how to achieve the above. Thanks heaps!
What Preconditions check are you referring to? I don't see them; I'm not sure if similarity is actually prohibited from being > 1. But you seem to be asking whether you can make a similarity function that just returns co-occurrence, as an ItemSimilarity that is not used with Hadoop. Yes you can; it does not exist in the project. I would not advise this; LogLikelihoodSimilarity is going to be much smarter.
You need a different CandidateItemsStrategy; in particular, look at SamplingCandidateItemsStrategy and its javadoc. But this is not related to Hadoop; it is a run-time element, whereas you mention a flag to the Hadoop job. That is not the same thing.
If rescoring is slow, it means, well, the IDRescorer is slow. It is called so many times that you certainly need to cache any lookup data in memory. But, reducing the number of candidates per above will also reduce the number of times this is called.
No, don't implement your own similarity. Your issue is not the similarity measure but how many items are considered as candidates.
I am the author of much of the code you are talking about. I think you are wrestling with exactly the kinds of issues most people run into when trying to make item-based work at significant scale. You can, with enough sampling and tuning.
However I am putting new development into a different project and company called Myrrix, which is developing a sort of 'next-gen' recommender based on the same APIs, but which ought to scale without these complications as it's based on matrix factorization. If you have time and interest, I strongly encourage you to have a look at Myrrix. Same APIs, the real-time Serving Layer is free/open, and the Hadoop-based Computation Layer backing it is also available for testing.
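Not Mahout code, but to make the candidate-capping point concrete, here is a rough TypeScript sketch (all names invented) of keeping only the top-N most similar items per item and building candidates from those capped lists; this is roughly the effect a sampling candidate strategy has on the Java side.

```typescript
interface ScoredItem {
  itemId: number;
  similarity: number;   // e.g. a log-likelihood similarity in [-1, 1]
}

// Precomputed similarity lists keyed by item id (in the real setup this is
// the itemSimilarityJob output loaded from MySQL).
type SimilarityIndex = Map<number, ScoredItem[]>;

// Keep only the N most similar items per item, so candidate generation and
// the rescorer downstream touch far fewer entries.
function capSimilarItems(index: SimilarityIndex, maxPerItem: number): SimilarityIndex {
  const capped: SimilarityIndex = new Map();
  for (const [itemId, neighbours] of index) {
    const topN = [...neighbours]
      .sort((a, b) => b.similarity - a.similarity)
      .slice(0, maxPerItem);
    capped.set(itemId, topN);
  }
  return capped;
}

// Candidates for a user = union of the capped neighbour lists of the items
// the user already prefers.
function candidateItems(capped: SimilarityIndex, userItemIds: number[]): Set<number> {
  const candidates = new Set<number>();
  for (const itemId of userItemIds) {
    for (const s of capped.get(itemId) ?? []) candidates.add(s.itemId);
  }
  return candidates;
}
```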

2 approaches for tracking online users with Redis. Which one is faster?

Recently I found a nice blog post presenting 2 approaches for tracking online users of a web site with the help of Redis.
1) Smart-keys and setting their expiration
http://techno-weenie.net/2010/2/3/where-s-waldo-track-user-locations-with-node-js-and-redis
2) Set-s and intersects
http://www.lukemelia.com/blog/archives/2010/01/17/redis-in-practice-whos-online/
Can you judge which one should be faster and why?
For knowing whether or not a particular user is online, the first method will be a lot faster - nothing is faster than reading a single key.
Finding users on a particular page is not as clear (I haven't seen hard numbers on the performance of either intersection or wildcard keys), but if the set is big enough to cause performance problems in either implementation it isn't practical to display them all anyway.
For matching users to a friends list I would probably go with the first approach also - even a few hundred get operations (checking the status of everyone in the list) should outperform intersection on multiple sets if those sets have a large number of records and are difficult to maintain.
Redis sets are more appropriate for things that can't be done with keys, particularly where getting all items in the set is more important than checking if a particular item is in the set.
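A rough sketch of the two approaches, using the Node ioredis client in TypeScript; the key names, the 10-minute TTL, and the minute-bucket scheme are arbitrary illustrations rather than the exact layouts from the linked posts.

```typescript
import Redis from 'ioredis';

const redis = new Redis();   // assumes a local Redis instance

// Approach 1: one key per user with a TTL; "online" means the key still exists.
async function markOnlineWithKey(userId: string): Promise<void> {
  await redis.set(`online:${userId}`, '1', 'EX', 600);    // expire after 10 minutes
}

async function isOnline(userId: string): Promise<boolean> {
  return (await redis.exists(`online:${userId}`)) === 1;  // single O(1) lookup
}

// Approach 2: minute-bucketed sets; "online" means present in a recent bucket.
function minuteBucket(offsetMinutes = 0): string {
  const t = new Date(Date.now() - offsetMinutes * 60_000);
  return `online:minute:${t.getUTCHours()}:${t.getUTCMinutes()}`;
}

async function markOnlineWithSet(userId: string): Promise<void> {
  const bucket = minuteBucket();
  await redis.sadd(bucket, userId);
  await redis.expire(bucket, 900);   // old buckets fall out on their own
}

// Who on my friends list is online? Union the recent buckets, then intersect.
async function onlineFriends(userId: string): Promise<string[]> {
  const buckets = Array.from({ length: 10 }, (_, i) => minuteBucket(i));
  await redis.sunionstore('online:now', ...buckets);
  await redis.expire('online:now', 60);
  return redis.sinter('online:now', `friends:${userId}`);  // assumes a friends set per user
}
```

The single-key check is one O(1) lookup per user; the set approach pays for a union and an intersection but gives you whole groups ("everyone online", "everyone on this page") in one command.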

Ok to use memcache in this way? or need a system re-architecture?

I have a "score" I need to calculate for multiple items for multiple users. Each user has many, many scores unique to them, and calculating them can be time/processor intensive (the slowness isn't on the database end). To deal with this, I'm making extensive use of memcached. Without memcache some pages would take 10 seconds to load! Memcache seems to work well because the scores are very small pieces of information but take a while to compute. I'm actually setting the keys to never expire, and then I delete them on the occasional circumstances where the score changes.
I'm entering a new phase on this product, and am considering re-architecting the whole thing. There seems to be a way I can calculate the values iteratively and then store them in a local field. It would be a bit similar to what's happening now, just that the value updates would happen faster, the cache would be in the real database, and managing it would be a bit more work (I think I'd still use memcache on top of that, though).
If it matters, it's all in Python/Django.
Is depending on the cache like this bad practice? Is it OK? Why? Should I try to re-architect things?
If it ain't broke... don't fix it ;^) It seems your method is working, so I'd say stick with it. You might look at memcachedb (or Tokyo Cabinet), which is a persistent version of memcache. This way, when the memcache machine crashes and reboots, it doesn't have to recalc all the values.
You're applying several architectural patterns here, and each of them certainly has a place. There's not enough information here for me to evaluate whether your current solution needs rearchitecting or whether your ideas will work. It does seem likely to me that as your understanding of the user's requirements grows you may want to improve things.
As always, prototype, measure performance, consider the trade off between complexity and performance - you don't need to be as fast as possible, just fast enough.
Caching in various forms is often the key to good performance. The question here is whether there's merit in persisting the calculated, cached values. If they're stable over time then this is often an effective strategy. Whether to persist the cache or make space for them in your database schema will probably depend upon the access patterns. If there are various query paths then a carefully designed database schema may be appropriate.
Rather than using memcached, try storing the computed score in the same place as your other data; this may be simpler and require fewer boxes.
Memcached is not necessarily the answer to everything; it's intended for systems which need to read-scale very highly. It sounds like in your case, it doesn't need to, it simply needs to be a bit more efficient.
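For reference, the pattern the question describes (compute on miss, cache without a TTL, delete on change) looks roughly like this. The sketch is TypeScript with a deliberately abstract cache interface standing in for memcached, since the real code is Python/Django; the key format and score function are invented.

```typescript
// Abstract cache interface standing in for memcached / the Django cache API.
interface Cache {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;   // no TTL: "never expire"
  delete(key: string): Promise<void>;
}

const scoreKey = (userId: number, itemId: number) => `score:${userId}:${itemId}`;

// Stand-in for the real expensive per-user, per-item calculation.
async function computeScore(userId: number, itemId: number): Promise<number> {
  return (userId * 31 + itemId) % 100;
}

// Cache-aside read: return the cached score, or compute and store it on a miss.
async function getScore(cache: Cache, userId: number, itemId: number): Promise<number> {
  const cached = await cache.get(scoreKey(userId, itemId));
  if (cached !== null) return Number(cached);

  const score = await computeScore(userId, itemId);
  await cache.set(scoreKey(userId, itemId), String(score));
  return score;
}

// Invalidation: on the rare events that change a score, drop the key so the
// next read recomputes it (persisting the value to a database column here
// would be the "cache in the real database" variant from the question).
async function onScoreChanged(cache: Cache, userId: number, itemId: number): Promise<void> {
  await cache.delete(scoreKey(userId, itemId));
}
```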

Resources