I'm currently trying to build a music generator. To get better at handling patterns in music, I read this article, which states: "This algorithm (exon-chaining algorithm) can be modified to accommodate the pattern selection problem by replacing the weight of an interval with its duration" (page 9).
However, I'm having trouble understanding the meaning of the exon-chaining problem. I have looked for this problem in many different presentations and articles but still couldn't find satisfying information. I would really appreciate it if someone could explain it to me.
Thanks in advance.
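For context, the exon-chaining problem as it is usually stated in bioinformatics texts takes a set of weighted intervals and asks for the heaviest chain of non-overlapping intervals. Below is a minimal sketch under that formulation; the function name `best_chain` and the rule that one interval must end strictly before the next starts are assumptions, not taken from the article.

```python
from bisect import bisect_left

def best_chain(intervals):
    """Return (total_weight, chain) for the heaviest chain of non-overlapping intervals.

    Each interval is (start, end, weight). Assumption: an interval may follow
    another only if the earlier one ends strictly before the later one starts.
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by right end
    ends = [iv[1] for iv in ordered]
    # best[i] = best (weight, chain) using only the first i intervals of `ordered`
    best = [(0, [])]
    for i, (start, end, weight) in enumerate(ordered):
        # k = number of earlier intervals that end strictly before `start`
        k = bisect_left(ends, start, 0, i)
        take = (best[k][0] + weight, best[k][1] + [(start, end, weight)])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]

# Hypothetical data for illustration only.
intervals = [(1, 3, 2), (2, 5, 4), (4, 6, 4), (6, 8, 2)]
total, chain = best_chain(intervals)

# The modification quoted from the article: use each interval's duration as its weight.
by_duration = [(s, e, e - s) for s, e, _ in intervals]
duration_total, duration_chain = best_chain(by_duration)
```

With the weights replaced by durations, the same routine picks the chain that covers the most time, which seems to be what the quoted sentence is describing.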
I am trying to find any open source big data application, but the only stuff I found was basic examples like word count and so on. Can anyone advise where I can find what I need?
You just need to search if someone is providing such things. For example, Wikipedia does. Weather data is also a famous candidate and one quick search gives: National Weather Service. Just search for data you want to harvest.
Could be tweets, weather information, cars sales, usenet archive, etc.
You can find a number of practical, real-life examples of using Map-Reduce here. Check chapter 2 of the newest edition of this book.
I've been thinking about this for a while now, so I thought I would ask for suggestions:
I have a crawler which enters the root of some site (it could be anything from www.StackOverFlow.com or www.SomeDudesPersonalSite.se to www.Facebook.com). Then I need to determine what "kind of homepage" I'm visiting. Different types could for instance be:
Forum
Blog
Link catalog
Social media site
News site
"One man site"
I've been brainstorming for a while, and the best solution seems to be some heuristic with a point system. By this I mean that different trends give points to the different types, and the program then makes a guess based on the totals.
But this is where I get stuck.. How do you detect trends?
Catalogs could be easy: If sitesIndexed/Outgoing links is very high, catalogs should get several points.
News sites/Blogs could be easy: If a large share of the indexed pages have a datetime, those types should get several points.
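The two heuristics above could be sketched as a small scoring function. This is only an illustration: the thresholds (20 links per page, 50% of pages dated) and the point values are made-up assumptions, and the catalog rule is read here as "outgoing links per indexed page".

```python
from collections import Counter

SITE_TYPES = ["forum", "blog", "link catalog", "social media", "news", "one man site"]

def score_site(pages_indexed, outgoing_links, pages_with_date):
    """Return a Counter of points per site type for one crawled site."""
    points = Counter({site_type: 0 for site_type in SITE_TYPES})
    if pages_indexed == 0:
        return points
    # Hypothetical threshold: many outgoing links per page hints at a catalog.
    if outgoing_links / pages_indexed > 20:
        points["link catalog"] += 3
    # Hypothetical threshold: most pages carrying a date hints at news or a blog.
    if pages_with_date / pages_indexed > 0.5:
        points["news"] += 2
        points["blog"] += 2
    return points

def guess_site_type(points):
    """Pick the type with the most points, or None if no trend fired."""
    best, score = points.most_common(1)[0]
    return best if score > 0 else None
```

Each new trend becomes one more `if` that hands out points, so the guess improves as trends are added without changing the overall structure.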
BUT I can't really find too many trends.
SO: My question is:
Any ideas on how to do this?
Thanks so much..
I believe you are attempting document classification, which is a well-researched topic.
http://en.wikipedia.org/wiki/Document_classification
You will see a considerable list of many different methods. But to suggest any one of those (or neural networks or the like) prior to determining the "trends" as you call them is to suggest it prematurely. I would recommend looking into "web document classification" or the like. It is evidently a considerable subset of document classification, and if you have access to academic journals there are plenty of incomprehensible articles for your enjoyment.
I did also find your idea as a homework assignment -- perhaps if you are particularly audacious you could contact the professor.
http://uhaweb.hartford.edu/compsci/ccli/wdc.htm
Lastly, I believe that this is an accessible (if strangely formatted) website that has a general and perhaps outdated discussion:
http://www.webology.ir/2008/v5n1/a52.html
I'm afraid I don't have much personal knowledge of the topic, so the most I could do was tell you the keyword "document classification" and provide some quick googling. However, if I wanted to play around with this concept, I think simply looking for the rate of certain keywords is a decent starting "trend." ("Sale" or "purchase" or "customers" are trends for shopping sites, "my," "opinion," "comment," for blogs, and so on)
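As a rough illustration of the keyword-rate idea, here is a minimal sketch. The keyword sets are just the examples mentioned above, and the category names and the function name `keyword_rates` are assumptions for the sake of the example.

```python
import re
from collections import Counter

# Keyword lists taken from the examples above; real lists would be much longer.
KEYWORDS = {
    "shopping": {"sale", "purchase", "customers"},
    "blog": {"my", "opinion", "comment"},
}

def keyword_rates(text):
    """Return, per category, the fraction of words in `text` that are keywords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return {category: 0.0 for category in KEYWORDS}
    counts = Counter(words)
    return {
        category: sum(counts[word] for word in keywords) / len(words)
        for category, keywords in KEYWORDS.items()
    }

rates = keyword_rates("My opinion: leave a comment on my blog")
```

The rates could then feed a point system or a trained classifier; on their own they are a weak signal.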
You could train a neural network to recognise them. Give it number/types of links, maybe types of HTML tags as well.
I think otherwise you're just going to be second-guessing what makes a site what it is.
Recently, I noticed that several websites, for example YouTube or Facebook, have something like "Recommended for You": the site studies my usage behavior and recommends content to me. I would like to know how they analyze this information. Is there any algorithm to do so? Thank you.
Amazon and Netflix (among others) use a technique called Collaborative filtering to suggest things you might like based on the likes/dislikes of others who have made purchases and selections similar to yours.
Is there any Algorithm to do so?
Yes
Yes. One fairly common one is to look at things you've selected in the past, find other people who've made those selections, then find the other selections most common among those other people, and guess that you're likely to be interested in those as well.
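A minimal sketch of that idea, assuming the data is simply a mapping from user to the set of items they selected (the function name `recommend` and the sample data are hypothetical):

```python
from collections import Counter

def recommend(user, selections, top_n=3):
    """Suggest items chosen by people who share at least one selection with `user`.

    `selections` maps each user to the set of items they have selected.
    """
    mine = selections[user]
    counts = Counter()
    for other, theirs in selections.items():
        if other == user or not (mine & theirs):
            continue  # skip the user themselves and people with nothing in common
        counts.update(theirs - mine)  # count only items the user has not selected
    return [item for item, _ in counts.most_common(top_n)]

# Hypothetical data for illustration only.
selections = {
    "ann": {"a", "b"},
    "bob": {"a", "c", "d"},
    "cat": {"b", "c"},
    "dan": {"x", "y"},
}
suggestions = recommend("ann", selections)
```

Real systems weight the overlap and work on far larger data, but the counting step is the core of it.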
Yup there are lots of algorithms. Things such as k-nearest neighbor: http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm.
Here is a pretty good book on the subject that covers making these sorts of systems along with others: http://www.amazon.com/gp/product/0596529325?ie=UTF8&tag=ianburriscom-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0596529325.
It's generally done by matching you with other users who have a similar usage history / profile and then recommending other things that they've purchased/watched/whatever.
Searching for "recommendation algorithm" yields lots of papers. Most algorithms incorporate "machine learning" algorithms to determine groups of things (comedy movies, books on gardening, orchestral music, etc.). Your matching with those groups yields recommendations. Some companies use humans to classify things, too.
Such an algorithm is going to vary wildly from company to company. In many cases, it analyzes some combination of your search history, purchase history, physical location, and other factors. It probably will also compare purchases/searches amongst other people to find what those people have purchased/searched for, and recommend some of those products to you.
There are probably hundreds of these algorithms out there, but I doubt you can use any of them (that are actually good). Probably you are better off figuring it out yourself.
If you can categorize your contents (i.e. by tagging or content analysis), you can also categorize your users and their preferences.
For example: you have a video portal with 5 million videos, 1 million of which are tagged mostly red. If 80% of all videos watched by a user (who is identified by an IP, a persistent user account, ...) are tagged mostly red, you might want to recommend even more red videos to him. You might want to refine your recommendations by looking at his further actions: does he like your recommendations -- if so, why not give him even more; if not, try the second-best guess, maybe he's not looking for color, but for the background music ...
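The red-video example could look roughly like this. It is a sketch only: it assumes each video carries a single dominant tag, and the names `dominant_tag` and `recommend_by_tag` and the default threshold of 0.8 (the 80% from the example) are illustrative.

```python
from collections import Counter

def dominant_tag(watched_tags, threshold=0.8):
    """Return the tag covering at least `threshold` of the watched videos, else None."""
    if not watched_tags:
        return None
    tag, count = Counter(watched_tags).most_common(1)[0]
    return tag if count / len(watched_tags) >= threshold else None

def recommend_by_tag(watched_tags, catalog, already_seen, limit=5):
    """Recommend unseen videos that share the user's dominant tag.

    `catalog` maps video id -> tag; `already_seen` is a set of video ids.
    """
    tag = dominant_tag(watched_tags)
    if tag is None:
        return []
    matches = [vid for vid, t in catalog.items() if t == tag and vid not in already_seen]
    return matches[:limit]
```

The refinement step (falling back to a second-best guess such as background music) would mean repeating the same check on another attribute when the user ignores the first batch.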
There's no absolute algorithm to do it, but all implementations will go in a similar direction. It's always based on observing users, which scares me from time to time :-)
There's a whole lot of algorithms tackling the issue: Wiki article. It's a machine learning problem. Computers can be taught using two main techniques: classification and clustering. Both require datasets as input. If the dataset is informative (it really holds some useful patterns), then those ML techniques can extract most of them.
Clustering could be the best fit for this kind of problem. Its main use is to find similarities among points in the provided dataset. If the points are, e.g., your search history, they can be grouped together to form clusters. If your search history closely resembles another user's, a hint can be given by picking the links most similar to yours.
The same goes for book recommendations - it's obvious what dataset they use: "Other people who bought this product also bought Product A, Product B,...". The key here is to match your profile to others' and use the most similar ones to recommend.
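Matching a profile to the most similar one can be sketched without a full clustering algorithm. The example below swaps in a simpler technique, a nearest-neighbour lookup using Jaccard similarity on sets of purchased items; the function names and the sample profiles are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def most_similar_user(user, profiles):
    """Return the other user whose profile is most similar, or None if alone."""
    others = [u for u in profiles if u != user]
    if not others:
        return None
    return max(others, key=lambda u: jaccard(profiles[user], profiles[u]))

def recommend_from_neighbour(user, profiles):
    """Recommend what the most similar user has that `user` does not."""
    nearest = most_similar_user(user, profiles)
    if nearest is None:
        return set()
    return profiles[nearest] - profiles[user]

# Hypothetical purchase profiles for illustration only.
profiles = {
    "ann": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "cat": {"x"},
}
```

A clustering approach would group many users at once instead of comparing one pair at a time, but the similarity measure plays the same role.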
The computer retrieves information from the human brain with complex memory scan process, sorts it accordingly and outputs results based on what you have experienced in your life so far.
I would use a hash table with the ISBN as the key, since the average lookup time in a hash table is O(1).
We could also use a binary search tree; lookup time is O(log n) if the tree is kept balanced.
What data structure would you guys use and why?
This sounds like a homework or interview question. If I were asking it, I would be interested in more than just whether you understand a couple of data structures. I would also want to know how you analyze a real-world problem and translate it to the world of computers and data structures.
As such, you should probably think about what operations you need to perform on the data before you pick a data structure. You should also think some about real libraries and some of the "gotchas" that could come up with any data structure you chose.
If all you need to do is translate from an ISBN to the catalog entry for the corresponding book, then a hash table might be a reasonable choice. But you might want to think about how you would deal with popular books, such as best sellers, that a library could have many copies of.
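One way to handle multiple copies with a hash table is to key on the ISBN and store a copy count in the entry. A minimal sketch, where the `CatalogEntry` fields and the `add_copy` helper are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    title: str
    author: str
    copies: int = 0  # how many physical copies the library owns

# A Python dict is a hash table: average O(1) lookup by ISBN.
catalog = {}

def add_copy(isbn, title, author):
    """Register one more physical copy, creating the entry on first sight."""
    entry = catalog.setdefault(isbn, CatalogEntry(title, author))
    entry.copies += 1
    return entry

# Hypothetical ISBN and book data for illustration only.
add_copy("isbn-0001", "Example Title", "Example Author")
add_copy("isbn-0001", "Example Title", "Example Author")
```

If individual copies need their own state (condition, branch, due date), the count would become a list of copy records instead.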
But is ISBN lookup really the important use case? I use my local library all the time, and I never look up books by ISBN. Some of the things that I do are:
Look up a specific book by title. Sometimes there are different books with the same title.
Browse the list of books by an author I like
Find where books on a particular subject are shelved, so I can browse them.
Librarians probably have additional uses for a catalog system:
Add new books to the catalog
Mark books as checked out
Change listing information, such as subject classification, for a book
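To make the point concrete, here is a sketch of how supporting these operations changes the design: one primary table by ISBN plus secondary indexes for title, author, and subject. The class and method names are illustrative assumptions, not a recommended design.

```python
from collections import defaultdict

class Catalog:
    """Primary hash table by ISBN, plus secondary indexes for other lookups."""

    def __init__(self):
        self.by_isbn = {}
        self.by_title = defaultdict(set)    # several books may share a title
        self.by_author = defaultdict(set)
        self.by_subject = defaultdict(set)
        self.checked_out = set()

    def add(self, isbn, title, author, subject):
        self.by_isbn[isbn] = {"title": title, "author": author, "subject": subject}
        self.by_title[title.lower()].add(isbn)
        self.by_author[author.lower()].add(isbn)
        self.by_subject[subject.lower()].add(isbn)

    def change_subject(self, isbn, new_subject):
        # Every secondary index that mentions the old value must be updated too.
        book = self.by_isbn[isbn]
        self.by_subject[book["subject"].lower()].discard(isbn)
        book["subject"] = new_subject
        self.by_subject[new_subject.lower()].add(isbn)

    def check_out(self, isbn):
        self.checked_out.add(isbn)

    def find_by_title(self, title):
        return sorted(self.by_title.get(title.lower(), ()))
```

Each extra index speeds up one kind of query but adds bookkeeping to every update, which is the trade-off to weigh against the operations actually needed.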
So I guess my recommendation would be to think more carefully about what problem you want to solve before you decide on the solution.
Apologies for asking more questions instead of providing an answer. I hope this is helpful anyway.
Well ... I don't think the hardest problem to solve with designing a data structure to store information about books is that of look-up speed.
And I would certainly not settle for a system that only allowed searching if you know the ISBN. What if you only remember the author, or a few words from the title? If there are to be any gains from having a computerized system for this, it must support flexible searches, in my opinion.
I would probably look into using Dublin Core, but I'm not at all sure that's the "right" thing to do. It seems people have spent a great deal of time thinking about that one, though.