Sorted results from HBase scanner - hadoop

How do I retrieve HBase column family values sorted by those values? For example:

column family    value
---------------------------------
column:1         1
column:3         2
column:4         3
column:2         4

HBase itself won't do that. Instead, you can retrieve the array of KeyValues using the Result.raw() method [1], put them in a List, and sort that list by passing your own Comparator to Collections.sort() [2].
[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#raw()
[2] http://download.oracle.com/javase/6/docs/api/java/util/Collections.html#sort(java.util.List, java.util.Comparator)
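A minimal sketch of that approach, assuming the classic HBase client API in which Result.raw() returns a KeyValue[] (newer releases deprecate it in favour of rawCells()); the printing at the end is just for illustration:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SortByValue {
    // 'result' is a Result obtained from a Get or from a ResultScanner.
    static void printSortedByValue(Result result) {
        List<KeyValue> kvs = new ArrayList<KeyValue>(Arrays.asList(result.raw()));

        // Sort by cell value instead of HBase's native qualifier order.
        Collections.sort(kvs, new Comparator<KeyValue>() {
            @Override
            public int compare(KeyValue a, KeyValue b) {
                return Bytes.compareTo(a.getValue(), b.getValue());
            }
        });

        for (KeyValue kv : kvs) {
            System.out.println(Bytes.toString(kv.getQualifier()) + " "
                    + Bytes.toString(kv.getValue()));
        }
    }
}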

Related

Cloudant CouchDB query custom sort

I want to sort the results of a CouchDB query (a.k.a. a Mango query) with a custom sort. I need a custom sort because Status can be one of the following:
Active = 1
Sold = 2
Contingent = 4
Pending = 3
I want to sort the results on Status, not alphabetically but by the weight I assign to each value, as shown in the list above. Here's the selector I'm using for the Status query:
{"type": "Property", "Status": {"$or": [{"$eq": "Pending"}, {"$eq": "Active"}, {"$eq": "Sold"}]}}
If I use the sort array in my JSON with Status, I think it'll sort alphabetically, which I don't want.
You are actually looking for results based on "Status". You can create a view similar to this:
function (doc) {
  if (doc.type == "Property") {
    emit(doc.Status, doc);
  }
}
When you use it, query it four times, once per status value in the order you need, and concatenate the results. This eliminates the need to sort.
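For instance, assuming the view above is saved as by_status in a design document named props in a database mydb (all three names are made up for illustration), the four calls matching the weights in the question would be:

GET /mydb/_design/props/_view/by_status?key="Active"
GET /mydb/_design/props/_view/by_status?key="Sold"
GET /mydb/_design/props/_view/by_status?key="Pending"
GET /mydb/_design/props/_view/by_status?key="Contingent"

Concatenating the four result sets in that order yields the custom weighting without any client-side sort.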

Map IDs to matrix rows in Hadoop/MapReduce

I have data about users buying products. I want to create a binary matrix of size |users| x |products| such that the element (i,j) in the matrix is 1 iff user_i has bought product_j, else the value is 0.
Now, my data looks something like
userA, productX
userB, productY
userA, productZ
...
UserIds and productIds are all strings. My problem is how to map these IDs to row indices (for users) and column indices (for products) in the matrix.
There are over a million unique userIds and roughly 3 million productIds.
To make the problem well defined: given userId, productId pairs like the input above, how do I convert them to something like
1,1
2,2
1,3
where userA is mapped to row 1 of the matrix, userB is mapped to row 2, productX is mapped to column 1, and so on.
Given the size of the data, I would have to use Hadoop MapReduce, but I can't think of a foolproof way of doing this efficiently.
This can be solved if we can do the following:
(1) Dump the unique userIds.
(2) Dump the unique productIds.
(3) Map each unique userId from (1) to a row index.
(4) Map each unique productId from (2) to a column index.
I can do (1) and (2) easily, but I'm having trouble coming up with an efficient approach to (3) ((4) is solved once (3) is).
I have a couple of solutions but they are not foolproof.
Solution 1 (naive) for step 3 above
Map all userIds and emit the same key (say "1") for all map tasks.
Have a long counter initialized to 0 in setup() of the reducer.
In the reduce(), emit the counter value along with the input userId and increment the counter by 1.
This would be very inefficient, since all of the unique userIds would be processed by a single reducer.
Solution 2 for step 3 above
While mapping userIds, emit each userId against a key that is an integer uniformly sampled from 1, 2, 3, ..., N (where N is configurable; N = 100, for example). In effect, we are partitioning the input set.
Within the mapper, use Hadoop counters to count the number of userIds assigned to each of those random partitions.
In the reducer setup, first access the counters from the map stage to determine how many IDs were assigned to each partition, and use them to compute the start and end values for that partition.
Iterate (while counting) over each userId in reduce() and generate the matrix rowId as start_of_partition + counter:
context.write(userId, rowId)
This method should work, but I am not sure how to handle cases where reducer tasks fail or are killed; a sketch of this reducer appears below.
I believe there should be ways of doing this which I am not aware of. Can we use hashing/modulo to achieve this? How would we handle collisions at scale?
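For reference, a minimal sketch of the reduce side of Solution 2 in Java. It assumes (neither assumption comes from the question) that a prior counting job stored each partition's start offset in the job configuration under a made-up key partition.offset.<n>, and that the partitioner routes integer key n to reduce task n, so each reducer owns exactly one partition. On the failure question: MapReduce commits a reduce task's output only when the task succeeds, so a restarted reducer replays its whole partition from scratch; a secondary sort on userId would make the replayed assignment fully deterministic.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdAssignReducer
        extends Reducer<IntWritable, Text, Text, LongWritable> {

    private long nextId;

    @Override
    protected void setup(Context context) {
        // One partition per reduce task, so the task number tells us
        // which precomputed offset to start counting from.
        int partition = context.getTaskAttemptID().getTaskID().getId();
        nextId = context.getConfiguration()
                        .getLong("partition.offset." + partition, 0L);
    }

    @Override
    protected void reduce(IntWritable key, Iterable<Text> userIds, Context context)
            throws IOException, InterruptedException {
        // Assign consecutive row indices to the userIds in this partition.
        for (Text userId : userIds) {
            context.write(new Text(userId), new LongWritable(nextId++));
        }
    }
}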

ElasticSearch sort by IDs in array

Is there a way to sort an Elasticsearch response in the same order as an array of IDs that I post?
Example: given the array [23, 45, 67], the results should be sorted the same way the IDs are: first come all rows with ID 23, after that all rows with ID 45, and at the end all rows with ID 67.
You can use scripting in sort; the other option is to use a bool/should query where you boost documents with these values.
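A sketch of the script-sort option, assuming a numeric field named id and the Painless script language (the field name is illustrative, not taken from the question):

{
  "query": { "terms": { "id": [23, 45, 67] } },
  "sort": {
    "_script": {
      "type": "number",
      "order": "asc",
      "script": {
        "lang": "painless",
        "source": "params.ids.indexOf((int) doc['id'].value)",
        "params": { "ids": [23, 45, 67] }
      }
    }
  }
}

Documents whose ID appears earlier in params.ids get a smaller script value and therefore come back first. The bool/should alternative achieves the same ordering by giving each term a decreasing boost and relying on the default _score sort.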

How do you write a query that returns an ordered list starting from a specific item?

If you have a table with {1, 3, 4, 5, 9, 15, 43}, how do you write a query that returns the three items following 5 in the natural sort order?
db.table
    .Where(x => x.field > 5)
    .OrderBy(x => x.field)
    .Take(3);
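This filters to values greater than 5, orders the remainder ascending, and takes the first three, which returns 9, 15 and 43 for the table above. The raw SQL equivalent would be along the lines of SELECT field FROM table WHERE field > 5 ORDER BY field LIMIT 3 (TOP 3 on SQL Server).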

Sorting a pandas DataFrame by the order of a list

So I have a pandas DataFrame, df, with columns that represent taxonomic classification (i.e. Kingdom, Phylum, Class, etc.). I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list corresponds to the DataFrame column df['Class']. I would like to sort all the rows of the whole dataframe based on the order of the list, since df['Class'] is currently in a different order. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Class
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
The solution above doesn't work if your original dataframe does not contain all of the elements in the ordered list, i.e. if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to collect the per-class slices of df in a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class'] == i])
ordered_df = pd.concat(df_list)
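A further option is to encode the desired order as an ordered categorical and sort on it. This tolerates classes that appear in the list but not in the data; values absent from the list become NaN and sort last. A minimal sketch:

import pandas as pd

class_list = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'],
                   'Number': [3, 5, 6]})

# The dtype itself carries the ordering, so sort_values follows class_list.
df['Class'] = pd.Categorical(df['Class'], categories=class_list, ordered=True)
ordered_df = df.sort_values('Class')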
