I want to share some data between all the splits in the Hadoop framework. More specifically, I have a file that contains a lot of terms I want to search for, and I want to write out how many times each term appears in each document. The problem is that if a term does not appear in some splits, I still need to return 0 for it in the output, and I can't figure out how to pass the terms I'm searching for to all the nodes processing the splits. Can anybody give me some idea?
Generally, the DistributedCache is the way to share data across nodes. However, since it is deprecated, check this answer.
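Here is a minimal sketch of that approach with the newer API, where Job.addCacheFile replaces the old DistributedCache calls. The paths, the term-file format, and pairing it with a word-count-style sum reducer are my assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermCountDriver {

    public static class TermMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Set<String> terms = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The "#terms" fragment in addCacheFile below creates a symlink named
            // "terms" in each task's working directory, so the file can be read locally.
            try (BufferedReader reader = new BufferedReader(new FileReader("terms"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    terms.add(line.trim().toLowerCase());
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().toLowerCase().split("\\W+")) {
                if (terms.contains(token)) {
                    context.write(new Text(token), ONE);
                }
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Emit 0 once per term so terms that never appear still show up in the output.
            for (String term : terms) {
                context.write(new Text(term), new IntWritable(0));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "term count");
        job.setJarByClass(TermCountDriver.class);
        job.setMapperClass(TermMapper.class);
        // A word-count-style sum reducer (not shown) would be set here as well.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Ship the term list to every node; "/user/me/terms.txt" is a placeholder path.
        job.addCacheFile(new URI("/user/me/terms.txt#terms"));
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because cleanup() emits a 0 for every term, terms that never appear in any split still come out of the sum reducer with a count of 0.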
Related
I'm just trying to understand how Hadoop distinguishes multiple files in HDFS. I want to do sentiment analysis using Hadoop (just a test). I have two files, positive.json and negative.json. I'm trying to use Naive Bayes classification, so when I train the model I want to know which records are positive and which are negative. How do I do this? I haven't written any code to show; I'm stuck at the first part. Any suggestions? I did read tons of papers, and I think I have the basic concept. I want to see if I can use this concept in Rhipe. Or do you have any other better and easier solutions?
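One way to tell the two files apart inside a mapper is to look at the InputSplit the record came from; a minimal sketch, assuming the newer mapreduce API and line-oriented JSON input:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class LabelingMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Ask which input file the current record came from.
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        // Records from positive.json get the label "positive", the rest "negative".
        String label = fileName.startsWith("positive") ? "positive" : "negative";
        context.write(new Text(label), value);
    }
}
```

The emitted label could then serve as the class when training the classifier.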
Ok, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how many occurrences of a word are in a document is simple to understand, it does not really help me solve any "real world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks. Now I want to get orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.
The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should be really useful to you to figure out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of more realistic examples than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming data, and filtering out data. Think record-by-record, shared-nothing approach to record processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of the word.
So, if you wanted all the records for a given product in the month of May, you could use a map-only job to filter through all the data and only keep the records you want. However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records like you suggest.
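To make that concrete, here is a rough sketch of such a map-only filter; the record layout, column positions, and the product id are assumptions on my part:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MayOrdersMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    // Hypothetical product id; in a real job this would come from the configuration.
    private static final String TARGET_PRODUCT = "BK-M68B-42";

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed CSV layout: order_id,product_id,order_date(yyyy-MM-dd),...
        String[] fields = value.toString().split(",");
        String productId = fields[1];
        String month = fields[2].substring(5, 7);
        // Keep only orders for the target product that were placed in May.
        if (productId.equals(TARGET_PRODUCT) && month.equals("05")) {
            context.write(value, NullWritable.get());
        }
    }
}
```

In the driver you would call job.setNumReduceTasks(0) so the mapper output is written directly, with no reduce phase.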
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
Hadoop can be used for a wide variety of problems. Check this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce and it's easy to get lost. So, here is the consolidated list of resources on Hadoop.
BTW, Hadoop: The Definitive Guide, 3rd edition is due in May. Looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worth reading, as mentioned by orangeoctopus.
MapReduce can be a complex topic, so I found it easier to understand by applying its approach to a simple problem. I then go on to describe how MapReduce makes it straightforward to solve the same problem on a cluster. You can take a look at my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.
I want to allow people to put in simple text search terms, run a Pig job (if that's best? it's what I know best), and output the results (the TSV file results?) so I can show them in a web interface.
Is there anything that approaches this problem?
Is there anything known to link together the few disjointed pieces of the flow I am going for?
Thanks
Why don't you index the docs into Lucene or Solr? Then you can do text search in real time. Hadoop is designed for batch-oriented processing, which doesn't seem like what you want in this case.
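For a sense of what the Lucene route looks like, here is a rough sketch of indexing and then searching a document. The field names and index path are placeholders, and the exact API varies a bit between Lucene versions:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("doc-index"));
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index a document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "example document text to search", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search it with a user-supplied query string.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("body", analyzer);
            ScoreDoc[] hits = searcher.search(parser.parse("document"), 10).scoreDocs;
            for (ScoreDoc hit : hits) {
                System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }
}
```

The hits from each search could be rendered straight into the web interface rather than going through a batch job.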
Well, it depends on your project's requirements: does it need low latency, and how complex are the ad hoc searches? I think HBase + Pig might be a compromise solution. HBase can be used for real-time search (although its search capabilities are not as powerful as an RDBMS's), and Pig for batch processing of large amounts of data.
I want to incrementally cluster text documents, reading them as data streams, but there seems to be a problem. Most of the term weighting options are based on the vector space model using TF-IDF as the weight of a feature. However, in our case the IDF of an existing attribute changes with every new data point, so the previous clustering no longer remains valid, and hence popular algorithms like CluStream, CURE, and BIRCH, which assume fixed-dimensional static data, cannot be applied.
Can anyone redirect me to any existing research related to this or give suggestions? Thanks !
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for the IDF values. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run k-means or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc.) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
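Here is a toy sketch of that "rebuild the dictionary at some frequency, then re-cluster" idea in plain Java; the rebuild interval and tokenization are arbitrary choices of mine, and the actual clustering step is left as a stub:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class StreamingTfIdf {

    private static final int REBUILD_EVERY = 1000;          // arbitrary rebuild frequency

    private final Map<String, Integer> docFreq = new HashMap<>();
    private final Map<String, Double> idf = new HashMap<>(); // frozen between rebuilds
    private final List<String[]> corpus = new ArrayList<>();

    public void add(String document) {
        String[] tokens = document.toLowerCase().split("\\W+");
        corpus.add(tokens);
        // Update document frequencies as each document streams in.
        for (String term : new HashSet<>(Arrays.asList(tokens))) {
            docFreq.merge(term, 1, Integer::sum);
        }
        if (corpus.size() % REBUILD_EVERY == 0) {
            rebuildIdf();
            // ...re-cluster the corpus here (k-means or similar), possibly on another thread.
        }
    }

    private void rebuildIdf() {
        int n = corpus.size();
        idf.clear();
        docFreq.forEach((term, df) -> idf.put(term, Math.log((double) n / df)));
    }

    /** TF-IDF vector for one document, using the last rebuilt IDF table. */
    public Map<String, Double> vectorize(String[] tokens) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : tokens) {
            tf.merge(term, 1.0, Double::sum);
        }
        Map<String, Double> vector = new HashMap<>();
        tf.forEach((term, freq) -> vector.put(term, freq * idf.getOrDefault(term, 0.0)));
        return vector;
    }
}
```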
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might want to take a look at some of the recommender systems already out there.
I want to know effective algorithms/data structures to identify the information below in streaming data.
Consider real-time streaming data like Twitter. I am mainly interested in the queries below rather than storing the actual data.
I need my queries to run on actual data but not any of the duplicates.
As I am not interested in storing the complete data, it will be difficult for me to identify duplicate posts. However, I can hash all the posts and check against the hashes. But I would also like to identify near-duplicate posts. How can I achieve this?
Identify the top k topics being discussed by the users.
I want to identify the top topics being discussed by users. I don't want the top-frequency words as shown by Twitter. Instead, I want to assign some high-level topic name to the most frequent words.
I would like my system to be real-time. I mean, my system should be able to handle any amount of traffic.
I can think of a MapReduce approach, but I am not sure how to handle synchronization issues. For example, duplicate posts can reach different nodes, and both of them could store the post in the index.
In a typical news source, one would remove any stop words in the data. In my system, I would like to update my stop-word list by identifying the top frequent words across a wide range of topics.
What would be an effective algorithm/data structure to achieve this?
I would like to store the topics over a period of time to retrieve interesting patterns in the data. Say, on Friday evening everyone wants to go to a movie. What would be an efficient way to store this data?
I am thinking of storing it in the Hadoop Distributed File System, but over a period of time these indexes become so large that I/O will be my major bottleneck.
Consider multi-lingual data from tweets around the world. How can I identify similar topics being discussed across a geographical area?
There are two problems here. One is identifying the language being used. It could be identified based on the person tweeting, but that information might affect the users' privacy. Another idea is to run the text through a trained classifier; what is the best method currently used for this? The other problem is actually looking up words in a dictionary and associating them with a common intermediate language, say English. How do I take care of word sense disambiguation, i.e. the same word being used in different contexts?
Identify the word boundaries
One possibility is to use some kind of training algorithm, but what is the best approach in practice? This is somewhat similar to word sense disambiguation, because word boundaries can be identified based on the actual sentence.
I am thinking of developing a prototype and evaluating the system rather than building a concrete implementation. I think it's not possible to scrape the real-time Twitter data, so this approach could be tested on some data freely available online. Any ideas where I can get this data?
Your feedback is appreciated.
Thanks for your time.
-- Bala
There are a couple of different questions buried in here. I can't understand all that you're asking, but here's the big one as I understand it: you want to categorize messages by topic. You also want to remove duplicates.
Removing duplicates is (relatively) easy. To remove "near" duplicates, you could first remove uninteresting parts from your data. You could start by removing capitalization and punctuation. You could also remove the most common words. Then you could add the resulting message to a Bloom filter. Hashing isn't good enough for Twitter, as the hashed messages wouldn't be much smaller than the full messages. You'd end up with a hash table that doesn't fit in memory. That's why you'd use a Bloom filter instead. It might have to be a very large Bloom filter, but it will still be smaller than the hash table.
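As a rough sketch of that idea using Guava's BloomFilter (the normalization steps, stop-word list, and sizing numbers are assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class NearDuplicateFilter {

    // Hypothetical stop-word list; in practice this would be much larger.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "is", "to", "of");

    // Sized for ~100M messages with a ~1% false-positive rate.
    private final BloomFilter<String> seen =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 100_000_000, 0.01);

    /** Returns true if the message looks new; false if it is a (near-)duplicate. */
    public boolean offer(String message) {
        String canonical = canonicalize(message);
        if (seen.mightContain(canonical)) {
            return false;   // probably seen before (Bloom filters can false-positive)
        }
        seen.put(canonical);
        return true;
    }

    /** Strip case, punctuation, and very common words so near-duplicates collide. */
    private String canonicalize(String message) {
        return Arrays.stream(message.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .collect(Collectors.joining(" "));
    }
}
```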
The other part is a difficult categorization problem. You probably do not want to write this part yourself. There are a number of libraries and programs available for categorization, but it might be hard to find one that fits your needs. An example is the Vowpal Wabbit project, which is a fast online algorithm for categorization. However, it only works on one category at a time. For multiple categories, you would have to run multiple copies and train them separately.
Identifying the language sounds less difficult. Don't try to do something smart like "training"; instead, put the most common words from each language in a dictionary. For each message, use the language whose words appear most frequently.
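A toy sketch of that dictionary approach (the word lists here are tiny placeholders, not real frequency data):

```java
import java.util.Map;
import java.util.Set;

public class LanguageGuesser {

    // A few of the most common words per language; real lists would be much longer.
    private static final Map<String, Set<String>> COMMON_WORDS = Map.of(
            "en", Set.of("the", "and", "is", "to", "of", "you"),
            "es", Set.of("el", "la", "que", "de", "y", "en"),
            "de", Set.of("der", "die", "und", "ist", "das", "nicht"));

    public static String guess(String message) {
        String best = "unknown";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : COMMON_WORDS.entrySet()) {
            int hits = 0;
            for (String word : message.toLowerCase().split("\\W+")) {
                if (entry.getValue().contains(word)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```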
If you want the algorithm to come up with categories on its own, good luck.
I'm not really sure if I'm answering your main question, but you could determine the similarity of two messages by calculating the Levenshtein distance between them. You can think of this as the "edit distance" between two strings (i.e., how many edits would need to be made to one string to convert it into the other).
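For reference, the standard dynamic-programming implementation is short:

```java
public class Levenshtein {

    public static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;   // i deletions
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;   // j insertions
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Note that comparing every new message against every stored one this way is expensive, so it is more of a pairwise similarity check than a stream-scale de-duplication strategy.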
Hello, we have created a very similar demo using the api.cortical.io functionality.
There you can create semantic fingerprints of each tweet (you could also extract the top keywords or some similar terms, which don't actually need to be part of the tweet).
We have used the fingerprints to filter the twitter stream based on content.
On twistiller.com you can see the result. The public 1% twitter stream is monitored for four different topic areas.