Can Apache Beam support parallel sorting?

Could you please tell me whether I can achieve parallel sorting using Apache Beam? From the documentation, it appears that Apache Beam can only sort on a single machine. Is there any way to achieve a parallel sort?

Ah, so you're doing a per-key sort, not a global sort. Please use the SortValues transform. Each individual key will still be sorted on a single machine, but I presume the amount of data you have per key is not that big. Please let me know if that's not the case, or if after trying this transform you find that it performs unacceptably.
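As a rough sketch of how SortValues is typically wired up (the key/value types, sample data, and pipeline setup below are illustrative assumptions, and the beam-sdks-java-extensions-sorter dependency is assumed to be on the classpath):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarIntCoder;
import org.apache.beam.sdk.extensions.sorter.BufferedExternalSorter;
import org.apache.beam.sdk.extensions.sorter.SortValues;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

public class PerKeySortExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Records keyed by a primary key; the inner KV is (secondary sort key, value).
    PCollection<KV<String, KV<Integer, String>>> input = p.apply(
        Create.of(Arrays.asList(
            KV.of("user1", KV.of(3, "c")),
            KV.of("user1", KV.of(1, "a")),
            KV.of("user2", KV.of(2, "b"))))
        .withCoder(KvCoder.of(StringUtf8Coder.of(),
            KvCoder.of(VarIntCoder.of(), StringUtf8Coder.of()))));

    // Group by primary key. Groups are distributed across workers, so the
    // per-key sorts below run in parallel, even though each individual
    // group is sorted on one machine.
    PCollection<KV<String, Iterable<KV<Integer, String>>>> grouped =
        input.apply(GroupByKey.create());

    // Sort each group's values by the secondary key; large groups spill to
    // disk via the buffered external sorter.
    PCollection<KV<String, Iterable<KV<Integer, String>>>> sorted =
        grouped.apply(SortValues.<String, Integer, String>create(
            BufferedExternalSorter.options()));

    p.run().waitUntilFinish();
  }
}

The output for "user1" would then contain its KV pairs ordered by the Integer secondary key, with different primary keys handled on different workers.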

Related

Sorting an entire dataset in Apache Beam

Let's say that I have a massive collection of strings and I wish to use Apache Beam to sort it. Is this possible? I have only managed to find documentation about running a sort on a single machine, but what I'm looking for is a distributed sort algorithm.

ELKI cluster extraction HiSC HiCO

I'm computing the HiCO and HiSC clustering algorithms on my dataset. If I'm not mistaken, the algorithms use different approaches to define the relevant subspaces for clusters in the first step, and in the second step they apply OPTICS for clustering. I'm only getting a cluster order file after I run the algorithms.
Is there any way to extract clusters from it, for example like OPTICSXi? (I know there are three extraction methods under hierarchical clustering, but I can't see anything for HiCO or HiSC.)
Thank you in advance for any hints.
Use OPTICSXi as the algorithm, then use HiCO or HiSC "inside" it.
The Xi extraction can be parameterized to use a different OPTICS variant like HiCO, HiSC, and DeLi-Clu. It just defaults to using regular OPTICS.
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.correlation.HiCO
or, for HiSC:
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.subspace.HiSC
We don't have implementations of the other extraction methods in ELKI yet, sorry.

Grafana + elastic, find ratio of queries

I'm trying to find a ratio of users that have some attribute versus those that do not.
I assume that, if it is possible at all, I'd have to use the inline scripting option. I'm unable to figure out (or find good documentation on) what the limitations of inline scripts are. Can they be used for map-reduce?
Has anyone figured out how to find the ratio between counts in Grafana, using Elasticsearch as the data source?

How to use neo4j as input to hadoop?

I have a large Neo4j database. I need to check for multiple patterns across the graph, which I was thinking would be easy to do in Hadoop. However, I'm not sure of the best way to feed tuples from Neo4j into Hadoop. Any suggestions?
In my opinion, while it can be done, MapReduce (which I believe is what you mean when you say "Hadoop") is not a good, or at least not a performant, choice for graph analytics. You want a Bulk Synchronous Parallel approach instead. If you need to perform cloud-scale graph analytics, look at Apache Giraph, which "understands" the Hadoop ecosystem.
Then again, I would ask why you need anything outside of Neo4j at all. I obviously don't know your use case, but first make sure you can't do what you need to do within Neo4j itself.
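To illustrate the "check it within Neo4j first" suggestion, here is a minimal sketch that runs a pattern-matching Cypher query through the Neo4j Java driver. The connection details, credentials, the FRIEND relationship type, and the cycle pattern are all hypothetical, and it assumes the 4.x org.neo4j.driver API:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Result;
import org.neo4j.driver.Session;

public class PatternCheck {
  public static void main(String[] args) {
    // Hypothetical connection settings; replace with your own.
    try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
             AuthTokens.basic("neo4j", "password"));
         Session session = driver.session()) {
      // Example pattern: count matches of a three-node FRIEND cycle.
      Result result = session.run(
          "MATCH (a)-[:FRIEND]->(b)-[:FRIEND]->(c)-[:FRIEND]->(a) "
              + "RETURN count(*) AS matches");
      System.out.println("matches: " + result.single().get("matches").asLong());
    }
  }
}

If a pattern check like this is fast enough in Cypher, there may be no need to export the graph into Hadoop at all.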

Computing percentiles

I'm writing a program that's going to generate a bunch of data. I'd like to find various percentiles over that data.
The obvious way to do this is to store the data in some kind of sorted container. Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?
The alternative is to use an unordered container and perform sorting at the end. I don't know if that's going to be any faster. Either way, we're still left with needing a container which offers fast random access. (An array, perhaps...)
Suggestions?
(Another alternative is to build a histogram, rather than keep the entire data set in memory. But since the objective is to compute percentiles extremely accurately, I'm reluctant to go down that route. I also don't know the range of my data until I generate it...)
"Are there any Haskell libraries which offer a container which is automatically sorted and offers fast random access to arbitrary indexes?"
Yes, it's your good old Data.Map. See elemAt and the other functions under the «Indexed» category.
Data.Set doesn't offer these, but you can emulate it with Data.Map YourType ().
