We are defining in our profile the structure of services/procedures that a health care establishment offers.
Imagine, for example, a clinic that performs Endoscopy.
Should my HealthcareService simply be "Endoscopy" or should it be, for example, "Esophagogastroduodenoscopy" (SNOMED code: 760090000)? And in this second case would I have to have the entire SNOMED Endoscopy procedure list represented as HealthcareServices?
The granularity at which you code the HealthcareService instances depends on the granularity with which you want to manage them. You can have a HealthcareService that says "we offer imaging services", "we offer X-ray", "we offer X-ray of the teeth and jaw" down to very narrow procedure codes.
Low granularity is most useful when exposing a list of services such that other systems have an idea of what your site is capable of. ("Yes, we offer psychiatry", "No, we don't offer gynecology"). Fine-grained codes are useful if you're wanting to expose exactly what can be ordered.
SNOMED typically offers a wide range of granularity and other code systems can be used when the granularity offered by SNOMED doesn't fit what you need.
tldr: Use the granularity of coding that meets your business objectives.
Related
I want to build a recommendation system, and the target is to deal with really big data set, like 1 TB data.
And each user has really huge amount of items, however the number of user is small, like thousands or 10 thousands.
I have search from google, I found there is some open-source recommendation engine based on hadoop like Mahout, I guess it may have ability to deal with such big data, however I'm not sure.
I also find some engine write in C++ python, even php, I don't think script languages can deal with such big data, cause memory can't contain the whole dataset.
Or I'm wrong? Could some give me some recommendation?
Your question title is:
Which opensource recommendation system should I choose to deal with
big dataset?
and in the first line you say
I want to build a recommendation system, and the target is to deal with really big data set, > like 1 TB data.
And you are asking for an recommendation as an answer.
To answer your second question first. In my experience of building recommender systems I would advise you do not "build" a recommender system from the ground up if you can avoid it. Recommender Systems are complex and can use a wide range of techniques to provide a user with a recommendation. So my recommendation is unless you are really committed, and have a team of people with a range of experience and knowledge in recommender systems, statistics, and software engineering then look to implement an existing recommender system rather than building your own.
In terms of which open source recommender system you should choose, this is actually pretty difficult to answer with great accuracy. Let me try to answer this by breaking it down.
Consider the open source license, its restrictions and your requirements.
Consider which algorithm you want to use to make recommendations
Consider the environment you will be running your recommender system on.
I recommend you look more into the algorithm side as it will be the determining factor as to which tool you can use, or whether you will need to roll your own. Start reading here http://www.ibm.com/developerworks/library/os-recommender1/ for a very brief insight in to the different approaches that recommender systems use. In summary the different approaches are:
Content based
Neighbourhood / Collaborative filtering based
Constraint based
Graph-based
In your case to keep things relatively straightforward it sounds like you should consider a user-user collaborative filtering algorithm for this. The reasons being:
Neighbourhood Collaborative Filtering is quite intuitive to understand and it can be relatively easy to implement.
With this method you can also justify your recommendations to your users in a basic way
There is no requirement to build a model for training, and the processing of neighbours can be done "offline", to provide quick recommendations to the end user.
Storing neighbours is actually quite memory efficient, which means better scalability. Something it sounds like you will need lots of.
The user-based part of my suggestion is because it sounds like you have less users than you do items. In a user-based nearest neighbourhood a predicted rating of a new item I for user U is calculated by looking at the other users who have also rated item I and are most similar to user U. Because you have fewer users than items in your system it will be faster to compute user-based collaborative filtering compared with item-based collaborative filtering.
Within the user-based collaborative filtering you need to consider what rating normalisation (mean-centering vs z-score) you want to use, the similarity weight computation method (e.g. Cosine vs Pearsons correlation vs other similarity measures) you want to use, neighbourhood selection criteria (pre-filtering of neighbours, number of neighbours involved in the prediction), and any Dimensionality Reduction methods (SVD, SVD++) you want to implement (with a large dataset like yours you will want to seriously consider DM).
So really instead of looking for an open source that will be able to process your data set you should consider your algorithm choice first, then look to find a tool that has an implementation of this algorithm, and then assess whether it can process your the volume involved in your dataset.
In saying all of that, if you do choose to go down the user-based collaborative filtering route then I am confident that Apache Mahout will be able to solve your problem, and if not it will certainly help you understand the complexity involved in building your own (just look at their source code).
Please note the advice is really consider the algorithm choice. "Good" recommender systems are so much more than just being able to process a large dataset. You need to think about accuracy, coverage, confidence, novelty, serendipity, diversity, robustness, privacy, risk user trust, and finally scalability. You should also consider how you are going to perform experiments and evaluate your recommendations, remember if the recommendations you are churning out are rubbish and it is turning your users off then there is no point to have a recommender system!
It is such a big area with lots to think about, there is probably no one single tool that is going to help you with everything, so be prepared to do a lot of reading and research as well as implementing lots of different open source tools to help you.
In saying that, start looking at Apache Mahout. Going back to the break-down of the 3 areas I said you should think about.
It has a commercial-friendly open-source license,
it has really great implementation of the algorithms you are likely going to need to use, and
it can work on distributed environments (read scalable).
Hope that helps, and good luck.
I understood almost all types of interactions specified by the scorm data model element cmi.interactions.n.type(true_false, multiple_choice, fill_in, long_fill_in, matching, performance, sequencing, likert, numeric, other) ,it remains to understand the type performance. I found an explanation of Ostyn but it remains ambiguous .
The Performance interaction is the most flexible and rich of the
standard interaction types in SCORM. It allows the capture of a number
of arbitrary steps performed by a learner, along with information
about every step. (Claud Ostyn)
AFAIK it does exactly that, i.e. stores arbitrary data related to an ambiguous non-standard interaction (e.g. a 3D simulation). LMS are not supposed to do anything with interaction data anyway, at least not regarding completion and grading, so it is mostly used by instructional designers who need a deeper insight into what the learners are doing and then adjust the training, e.g. exercise difficulty.
I've been studying Queuing Theory and I have been searching for well known techniques/algorithms applied to customer queues for systems that can provide multiple services associated to the same queue. In other words, algorithms where the queue discipline isn't a pure FIFO discipline.
For example, the system provides the service A, B and C and each service may have a priority of service time: A (50%), B(30%) and C(20%). I'd like to find articles or books that focus on these scenarios and how to do a fair management of the queue to serve customers for real world scenarios.
I'm interested mostly in M/M/s queues.
UPDATE: I've been searching a lot on this subject and I've been reading about Weighted Fair Queuing and Start-Time Fair Queuing. Does anyone know of implementations or procedures describing these algorithms? I'm not working with routers or any network related device. I'm doing a software for customer attendance. I don't need to deal with bursts of packets and such things.
Best regards,
Manuel Felício.
In general, you should be searching for queueing systems with admission policies. I would begin with a google scholar search for the same. Next, you could go deeper depending on what exactly you want to study. For instance, there is a large amount of literature on achieveable performance in queueing systems. See, for instance, Characterization and Optimization of Achievable Performance in General Queueing Systems. In such problems, an admission scheme is researched that will result in certain exogeneously specified soujourn/waiting times for different customer classes (or classes with priority as in your case). Although queueing theory has been studied for a long time, analytically tractable models are restricted to M/M/s models generally. Study of other models (especially M/G/s systems) usually requires simulation/approximations.
You may want to consider WF2Q: worst-case fair weighted fair queueing. However if you are planning to implement as a quick algo then you may want to consider WF2Q+.
EDIT
Additionally some of the book resource
The word "Intelligence" has two meanings:
1: the ability to learn, apply knowledge, or think abstractly.
2: a: information concerning an enemy or organization or group with the task of gathering such information. b: news or information of any kind.
In the term Business Intelligence, which one is the meaning of Intelligence? 1 or 2?
Are there any articles on web about this distinction?
This is not a programming question. However, Business Intelligence systems are systems that are able to gather information, process them by analyzing the patterns and at the end of the day, forecast.
Well, the way we use it (with Cognos BI) is in the second sense.
Business intelligence is the information about your organisation and Cognos BI is one way to gather and present that information in a way that's easily adaptable and useful.
Having moved from BIRT to Cognos last year, I have to admit that, despite the increased cost of the product, things are a lot easier to do. It always seemed to me that BIRT was very little "BI" and mostly "RT" :-)
Sorry to sound like an advertorial back there, I was just stating how we use the term.
Business intelligence (BI) refers to computer-based techniques used in spotting, digging-out, and analyzing business data, such as sales revenue by products and/or departments or associated costs and incomes
It is pretty clear to me that retrieve and analyze data is a way to learn, apply knowledge, or think abstractly.
Yes business is about competition. Competition is about fighting.
So IMHO, two are valid
What technology goes in behind the screens of Amazon recommendation technology? I believe that Amazon recommendation is currently the best in the market, but how do they provide us with such relevant recommendations?
Recently, we have been involved with similar recommendation kind of project, but would surely like to know about the in and outs of the Amazon recommendation technology from a technical standpoint.
Any inputs would be highly appreciated.
Update:
This patent explains how personalized recommendations are done but it is not very technical, and so it would be really nice if some insights could be provided.
From the comments of Dave, Affinity Analysis forms the basis for such kind of Recommendation Engines. Also here are some good reads on the Topic
Demystifying Market Basket Analysis
Market Basket Analysis
Affinity Analysis
Suggested Reading:
Data Mining: Concepts and Technique
It is both an art and a science. Typical fields of study revolve around market basket analysis (also called affinity analysis) which is a subset of the field of data mining. Typical components in such a system include identification of primary driver items and the identification of affinity items (accessory upsell, cross sell).
Keep in mind the data sources they have to mine...
Purchased shopping carts = real money from real people spent on real items = powerful data and a lot of it.
Items added to carts but abandoned.
Pricing experiments online (A/B testing, etc.) where they offer the same products at different prices and see the results
Packaging experiments (A/B testing, etc.) where they offer different products in different "bundles" or discount various pairings of items
Wishlists - what's on them specifically for you - and in aggregate it can be treated similarly to another stream of basket analysis data
Referral sites (identification of where you came in from can hint other items of interest)
Dwell times (how long before you click back and pick a different item)
Ratings by you or those in your social network/buying circles - if you rate things you like you get more of what you like and if you confirm with the "i already own it" button they create a very complete profile of you
Demographic information (your shipping address, etc.) - they know what is popular in your general area for your kids, yourself, your spouse, etc.
user segmentation = did you buy 3 books in separate months for a toddler? likely have a kid or more.. etc.
Direct marketing click through data - did you get an email from them and click through? They know which email it was and what you clicked through on and whether you bought it as a result.
Click paths in session - what did you view regardless of whether it went in your cart
Number of times viewed an item before final purchase
If you're dealing with a brick and mortar store they might have your physical purchase history to go off of as well (i.e. toys r us or something that is online and also a physical store)
etc. etc. etc.
Luckily people behave similarly in aggregate so the more they know about the buying population at large the better they know what will and won't sell and with every transaction and every rating/wishlist add/browse they know how to more personally tailor recommendations. Keep in mind this is likely only a small sample of the full set of influences of what ends up in recommendations, etc.
Now I have no inside knowledge of how Amazon does business (never worked there) and all I'm doing is talking about classical approaches to the problem of online commerce - I used to be the PM who worked on data mining and analytics for the Microsoft product called Commerce Server. We shipped in Commerce Server the tools that allowed people to build sites with similar capabilities.... but the bigger the sales volume the better the data the better the model - and Amazon is BIG. I can only imagine how fun it is to play with models with that much data in a commerce driven site. Now many of those algorithms (like the predictor that started out in commerce server) have moved on to live directly within Microsoft SQL.
The four big take-a-ways you should have are:
Amazon (or any retailer) is looking at aggregate data for tons of transactions and tons of people... this allows them to even recommend pretty well for anonymous users on their site.
Amazon (or any sophisticated retailer) is keeping track of behavior and purchases of anyone that is logged in and using that to further refine on top of the mass aggregate data.
Often there is a means of over riding the accumulated data and taking "editorial" control of suggestions for product managers of specific lines (like some person who owns the 'digital cameras' vertical or the 'romance novels' vertical or similar) where they truly are experts
There are often promotional deals (i.e. sony or panasonic or nikon or canon or sprint or verizon pays additional money to the retailer, or gives a better discount at larger quantities or other things in those lines) that will cause certain "suggestions" to rise to the top more often than others - there is always some reasonable business logic and business reason behind this targeted at making more on each transaction or reducing wholesale costs, etc.
In terms of actual implementation? Just about all large online systems boil down to some set of pipelines (or a filter pattern implementation or a workflow, etc. you call it what you will) that allow for a context to be evaluated by a series of modules that apply some form of business logic.
Typically a different pipeline would be associated with each separate task on the page - you might have one that does recommended "packages/upsells" (i.e. buy this with the item you're looking at) and one that does "alternatives" (i.e. buy this instead of the thing you're looking at) and another that pulls items most closely related from your wish list (by product category or similar).
The results of these pipelines are able to be placed on various parts of the page (above the scroll bar, below the scroll, on the left, on the right, different fonts, different size images, etc.) and tested to see which perform best. Since you're using nice easy to plug and play modules that define the business logic for these pipelines you end up with the moral equivalent of lego blocks that make it easy to pick and choose from the business logic you want applied when you build another pipeline which allows faster innovation, more experimentation, and in the end higher profits.
Did that help at all? Hope that give you a little bit of insight how this works in general for just about any ecommerce site - not just Amazon. Amazon (from talking to friends that have worked there) is very data driven and continually measures the effectiveness of it's user experience and the pricing, promotion, packaging, etc. - they are a very sophisticated retailer online and are likely at the leading edge of a lot of the algorithms they use to optimize profit - and those are likely proprietary secrets (you know like the formula to KFC's secret spices) and guaarded as such.
This isn't directly related to Amazon's recommendation system, but it might be helpful to study the methods used by people who competed in the Netflix Prize, a contest to develop a better recommendation system using Netflix user data. A lot of good information exists in their community about data mining techniques in general.
The team that won used a blend of the recommendations generated by a lot of different models/techniques. I know that some of the main methods used were principal component analysis, nearest neighbor methods, and neural networks. Here are some papers by the winning team:
R. Bell, Y. Koren, C. Volinsky, "The BellKor 2008 Solution to the Netflix Prize", (2008).
A. Töscher, M. Jahrer, “The BigChaos Solution to the Netflix Prize 2008", (2008).
A. Töscher, M. Jahrer, R. Legenstein, "Improved Neighborhood-Based Algorithms for Large-Scale Recommender Systems", SIGKDD Workshop on Large-Scale Recommender Systems and the Netflix Prize Competition (KDD’08) , ACM Press (2008).
Y. Koren, "The BellKor Solution to the Netflix Grand Prize", (2009).
A. Töscher, M. Jahrer, R. Bell, "The BigChaos Solution to the Netflix Grand Prize", (2009).
M. Piotte, M. Chabbert, "The Pragmatic Theory solution to the Netflix Grand Prize", (2009).
The 2008 papers are from the first year's Progress Prize. I recommend reading the earlier ones first because the later ones build upon the previous work.
I bumped on this paper today:
Amazon.com Recommendations: Item-to-Item Collaborative Filtering
Maybe it provides additional information.
(Disclamer: I used to work at Amazon, though I didn't work on the recommendations team.)
ewernli's answer should be the correct one -- the paper links to Amazon's original recommendation system, and from what I can tell (both from personal experience as an Amazon shopper and having worked on similar systems at other companies), very little has changed: at its core, Amazon's recommendation feature is still very heavily based on item-to-item collaborative filtering.
Just look at what form the recommendations take: on my front page, they're all either of the form "You viewed X...Customers who also viewed this also viewed...", or else a melange of items similar to things I've bought or viewed before. If I specifically go to my "Recommended for You" page, every item describes why it's recommended for me: "Recommended because you purchased...", "Recommended because you added X to your wishlist...", etc. This is a classic sign of item-to-item collaborative filtering.
So how does item-to-item collaborative filtering work? Basically, for each item, you build a "neighborhood" of related items (e.g., by looking at what items people have viewed together or what items people have bought together -- to determine similarity, you can use metrics like the Jaccard index; correlation is another possibility, though I suspect Amazon doesn't use ratings data very heavily). Then, whenever I view an item X or make a purchase Y, Amazon suggests me things in the same neighborhood as X or Y.
Some other approaches that Amazon could potentially use, but likely doesn't, are described here: http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/
A lot of what Dave describes is almost certainly not done at Amazon. (Ratings by those in my social network? Nope, Amazon doesn't have any of my social data. This would be a massive privacy issue in any case, so it'd be tricky for Amazon to do even if they had that data: people don't want their friends to know what books or movies they're buying. Demographic information? Nope, nothing in the recommendations suggests they're looking at this. [Unlike Netflix, who does surface what other people in my area are watching.])
I don't have any knowledge of Amazon's algorithm specifically, but one component of such an algorithm would probably involve tracking groups of items frequently ordered together, and then using that data to recommend other items in the group when a customer purchases some subset of the group.
Another possibility would be to track the frequency of item B being ordered within N days after ordering item A, which could suggest a correlation.
As far I know, it's use Case-Based Reasoning as an engine for it.
You can see in this sources: here, here and here.
There are many sources in google searching for amazon and case-based reasoning.
If you want a hands-on tutorial (using open-source R) then you could do worse than going through this:
https://gist.github.com/yoshiki146/31d4a46c3d8e906c3cd24f425568d34e
It is a run-time optimised version of another piece of work:
http://www.salemmarafi.com/code/collaborative-filtering-r/
However, the variation of the code on the first link runs MUCH faster so I recommend using that (I found the only slow part of yoshiki146's code is the final routine which generates the recommendation at user level - it took about an hour with my data on my machine).
I adapted this code to work as a recommendation engine for the retailer I work for.
The algorithm used is - as others have said above - collaborative filtering. This method of CF calculates a cosine similarity matrix and then sorts by that similarity to find the 'nearest neighbour' for each element (music band in the example given, retail product in my application).
The resulting table can recommend a band/product based on another chosen band/product.
The next section of the code goes a step further with USER (or customer) based collaborative filtering.
The output of this is a large table with the top 100 bands/products recommended for a given user/customer
Someone did a presentation at our University on something similar last week, and referenced the Amazon recommendation system. I believe that it uses a form of K-Means Clustering to cluster people into their different buying habits. Hope this helps :)
Check this out too: Link and as HTML.