Could you please tell me if I can achieve parallel sorting using Apache beam? For the documentation it is given that Apache Beam can sort using a single machine. Is there any way to achieve parallel sorting?
Ah, so you're doing merely a per-key sort, not a global sort. Please use the SortValues transform. Each individual key will be sorted using a single machine, but I presume that the amount of data you have per key is not that big. Please let me know if that's not the case and if after trying this transform you find that it performs unacceptably.
I have a large neo4j database. I need to check for multiple patterns existing across the graph, which I was thinking would be easily done in hadoop. However, I'm not sure of the best way to feed tuples from neo4j into hadoop. Any suggestions?
In my opinion, while it can be done, I don't think MapReduce (which I believe is what you mean when you say "Hadoop") is a good (or at least performant) choice for graph analytics. You want a Bulk Synchronous Parallel approach instead. If you want to perform cloud-scale graph analytics, you want Apache Giraph, which "understands" the Hadoop ecosystem.
Then again, I would ask why you need to use anything outside of Neo4J at all. I don't know your use case obviously, but first make sure you can't do what you need to do within Neo4J.
I want to allow people to put in simple text search terms, run a pig job (if that's best? it's what I know best) and output the results (the tsv file results?) so I can show them in a web interface.
Is there anything that approaches this problem?
Anything known to link a few disjointed pieces of the flow I am going for, together?
Thanks
Why don't you index the docs into Lucene or Solr? Then you can do text search in real-time. Hadoop is designed for batch oriented processes, which doesn't seem like what you want in this case.
Well, it depends on your project's requirements. Does it need low-latency, and how complex is the ad hoc search. Well I think hbase+pig might be a comprised solution. hbase can be used for search real-time search purpose (although its search function is not so powerful than RDBMS) and pig for batch_processing of large amount for data.
I want to incrementally cluster text documents reading them as data streams but there seems to be a problem. Most of the term weighting options are based on vector space model using TF-IDF as the weight of a feature. However, in our case IDF of an existing attribute changes with every new data point and hence previous clustering does not remain valid anymore and hence any popular algorithms like CluStream, CURE, BIRCH cannot be applied which assumes fixed dimensional static data.
Can anyone redirect me to any existing research related to this or give suggestions? Thanks !
Have you looked at
TF-ICF: A New Term Weighting Scheme for Clustering Dynamic Data Streams
Here's an idea off the top of my head:
What's your input data like? I'm guessing it's at least similarly themed, so you could start with a base phrase dictionary and use that for idf. Apache Lucene is a great indexing engine. Since you have a base dictionary, you can run kmeans or whatever you'd like. As documents come in, you'll have to rebuild the dictionary at some frequency (which can be off-loaded to another thread/machine/etc) and then re-cluster.
With the data indexed in a high-performance, flexible engine like Lucene, you could run queries even as new documents are being indexed. I bet if you do some research on different clustering algorithms you'll find some good ideas.
Some interesting paper/links:
http://en.wikipedia.org/wiki/Document_classification
http://www.scholarpedia.org/article/Text_categorization
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Without more information, I can't see why you couldn't re-cluster every once in a while. You might wanna take a look at some of the recommender systems already out there.
I recently spoke to someone, who works for Amazon and he asked me: How would I go about sorting terabytes of data using a programming language?
I'm a C++ guy and of course, we spoke about merge sort and one of the possible techniques is to split the data into smaller size and sort each of them and merge them finally.
But in reality, do companies like Amazon or eBay sort terabytes of data? I know, they store tons of information, but do they sorting them?
In a nutshell my question is: Why wouldn't they keep them sorted in the first place, instead of sorting terabytes of data?
But in reality, does companies like
Amazon/Ebay, sort terabytes of data? I
know, they store tons of info but
sorting them???
Yes. Last time I checked Google processed over 20 petabytes of data daily.
Why wouldn't they keep them sorted at
the first place instead of sorting
terabytes of data, is my question in a
nutshell.
EDIT: relet makes a very good point; you only need to keep indexes and have those sorted. You can easily and efficiently retrieve sort data that way. You don't have to sort the entire dataset.
Consider log data from servers, Amazon must have a huge amount of data. The log data is generally stored as it is received, that is, sorted according to time. Thus if you want it sorted by product, you would need to sort the whole data set.
Another issue is that many times the data needs to be sorted according to the processing requirement, which might not be known beforehand.
For example: Though not a terabyte, I recently sorted around 24 GB Twitter follower network data using merge sort. The implementation that I used was by Prof Dan Lemire.
http://www.daniel-lemire.com/blog/archives/2010/04/06/external-memory-sorting-in-java-the-first-release/
The data was sorted according to userids and each line contained userid followed by userid of person who is following him. However in my case I wanted data about who follows whom. Thus I had to sort it again by second userid in each line.
However for sorting 1 TB I would use map-reduce using Hadoop.
Sort is the default step after the map function. Thus I would choose the map function to be identity and NONE as reduce function and setup streaming jobs.
Hadoop uses HDFS which stores data in huge blocks of 64 MB (this value can be changed). By default it runs single map per block. After the map function is run the output from map is sorted, I guess by an algorithm similar to merge sort.
Here is the link to the identity mapper:
http://hadoop.apache.org/common/docs/r0.16.4/api/org/apache/hadoop/mapred/lib/IdentityMapper.html
If you want to sort by some element in that data then I would make that element a key in XXX and the line as value as output of the map.
Yes, certain companies certainly sort at least that much data every day.
Google has a framework called MapReduce that splits work - like a merge sort - onto different boxes, and handles hardware and network failures smoothly.
Hadoop is a similar Apache project you can play with yourself, to enable splitting a sort algorithm over a cluster of computers.
Every database index is a sorted representation of some part of your data. If you index it, you sort the keys - even if you do not necessarily reorder the entire dataset.
Yes. Some companies do. Or maybe even individuals. You can take high frequency traders as an example. Some of them are well known, say Goldman Sachs. They run very sophisticated algorithms against the market, taking into account tick data for the last couple of years, which is every change in the price offering, real deal prices (trades AKA as prints), etc. For highly volatile instruments, such as stocks, futures and options, there are gigabytes of data every day and they have to do scientific research on data for thousands of instruments for the last couple years. Not to mention news that they correlate with market, weather conditions and even moon phase. So, yes, there are guys who sort terabytes of data. Maybe not every day, but still, they do.
Scientific datasets can easily run into terabytes. You may sort them and store them in one way (say by date) when you gather the data. However, at some point someone will want the data sorted by another method, e.g. by latitude if you're using data about the Earth.
Big companies do sort tera and petabytes of data regularly. I've worked for more than one company. Like Dean J said, companies rely on frameworks built to handle such tasks efficiently and consistently. So,the users of the data do not need to implement their own sorting. But the people who built the framework had to figure out how to do certain things (not just sorting, but key extraction, enriching, etc.) at massive scale. Despite all that, there might be situations when you will need to implement your own sorting. For example, I recently worked on data project that involved processing log files with events coming from mobile apps.
For security/privacy policies certain fields in the log files needed to be encrypted before the data could be moved over for further processing. That meant that for each row, a custom encryption algorithm was applied. However, since the ratio of Encrypted to events was high (the same field value appears 100s of times in the file), it was more efficient to sort the file first, encrypt the value, cache the result for each repeated value.