PIG (Hadoop) - rows with variable columns - hadoop

Playing with Pig, my input file is:
1, 4, 6
1, 2, 7, 9
2, 5, 1
1, 3, 5, 1
2, 6, 2, 8
The first value in each row is the ID; the remainder of the row are simply unique values (each row can have a different number of columns).
I want to transpose the above into:
1, 2, 4, 6, 7, 9, 3, 5, 1
2, 5, 1, 6, 2, 8
So basically GROUP by ID, then flatten the rest of the columns and output that as each row.
Is PIG even the right approach here? I have a way to do this in M/R, but thought Pig might be ideal for this sort of thing.
Many thanks for any hints provided
Duncan
PS I do not care about the order of the values.

Untested, but here's the general approach I'd take: get a variable containing the ID and a bag of values, flatten it so you get rows of just an ID and a single value, take the distinct rows, then group by the ID. This gives you a bag of values for each ID, which you can convert to a string if you want to output it.
A = LOAD 'input' USING TextLoader() as line:chararray;
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,',',2)) as (id:chararray,values:chararray);
C = FOREACH B GENERATE id, FLATTEN(TOBAG(STRSPLIT(values,','))) as value:chararray;
D = DISTINCT C; -- I'm assuming you actually want distinct values, wasn't clear.
E = GROUP D by id;
F = FOREACH E GENERATE group as id, BagToString(D.value) as valueString:chararray;
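For comparison, here is a minimal Python sketch of the same dataflow (split off the ID, flatten the values, keep distinct ones, group by ID). It is not Pig and not tested against a real cluster; the function name group_values is made up for illustration, and it assumes the whole input fits in memory.

```python
from collections import defaultdict

def group_values(lines):
    """Group the distinct values of each row under the row's ID (first field)."""
    grouped = defaultdict(list)
    for line in lines:
        fields = [f.strip() for f in line.split(",")]
        if not fields or not fields[0]:
            continue  # skip blank lines
        row_id, values = fields[0], fields[1:]
        for v in values:
            if v not in grouped[row_id]:  # keep only distinct values per ID
                grouped[row_id].append(v)
    return dict(grouped)

sample = ["1, 4, 6", "1, 2, 7, 9", "2, 5, 1", "1, 3, 5, 1", "2, 6, 2, 8"]
result = group_values(sample)
for row_id, values in result.items():
    print(", ".join([row_id] + values))
```

For the sample input from the question this prints one line per ID, with the ID first and the distinct values after it.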

Related

Identifying the number of unique entries in a column of csv and removing if <=1

I'm very new to Bash, so please no hate. I would like to learn how to count the number of unique values in a column of a csv file and remove the column entirely if that number is <= 1.
Something similar to this which identifies the empty columns
Code snippet
Okay so here is an example of what I'm talking about.
If the csv file was something like:
A, B, C, D .... (many columns)
1, 7, x, l
1, 3, , d
1, 2, , g
1, 6, , b
1, 8, x, j
1, 9, , y
(many rows)
An algorithm along the lines of the accompanying snippet would remove columns A and C, since each has one unique value or fewer.
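The question asks for Bash, but the logic can be sketched in Python as a starting point. This is a sketch under assumptions: the function name drop_low_variety_columns is made up, empty cells are not counted as a value (which matches column C being removed in the example), and the file is small enough to read into memory.

```python
import csv
import io

def drop_low_variety_columns(rows):
    """Drop every column whose data rows hold at most one distinct non-empty value."""
    header, body = rows[0], rows[1:]
    keep = []
    for i in range(len(header)):
        distinct = {r[i].strip() for r in body if i < len(r) and r[i].strip()}
        if len(distinct) > 1:
            keep.append(i)
    # Rebuild every row (header included) from the surviving column indices.
    return [[r[i] if i < len(r) else "" for i in keep] for r in rows]

SAMPLE = """A, B, C, D
1, 7, x, l
1, 3, , d
1, 2, , g
1, 6, , b
1, 8, x, j
1, 9, , y
"""

rows = list(csv.reader(io.StringIO(SAMPLE), skipinitialspace=True))
filtered = drop_low_variety_columns(rows)
for row in filtered:
    print(", ".join(row))
```

On the sample table only columns B and D survive. For a real file, replace io.StringIO(SAMPLE) with an opened file handle.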

Reorder columns and rows of Holoviews Heatmap based on similarity measure (e.g. cosine similarity etc.)

I was surprised that no one seems to have asked this before.
Assuming I have a pandas dataframe (random example), I can get a heatmap with Holoviews and Bokeh renderer:
import numpy as np
import pandas as pd
import seaborn as sns
import holoviews as hv
hv.extension('bokeh')
rownames = 'ABCDEFGHIJKLMNO'
df = pd.DataFrame(np.random.randint(0,20,size=(20, len(rownames))), columns=list(rownames))
hv.HeatMap({'x': df.columns, 'y': df.index, 'z': df},
           kdims=[('x', 'Col Categories'), ('y', 'Row Categories')],
           vdims='z').opts(cmap="viridis", width=520, height=520)
The data (x and y) is categorical, therefore the initial order of rows or columns is unimportant. I wanted to sort rows/columns based on some similarity measure.
One way is to use seaborn clustermap:
heatmap_sns = sns.clustermap(df, metric="cosine", standard_scale=1, method="ward", cmap="viridis")
The output looks like this:
Columns and rows have been ordered according to similarity (in this case, cosine based on dot product; others are available such as 'correlation' etc.).
However, I want to display the clustermap in Holoviews. How do I update ordering of the original dataframe from the seaborn matrix?
A much cleaner alternative to Alex's answer (previously the accepted answer) is to use the data2d property of the object returned by sns.clustermap(). This property contains the reordered data (i.e. the data after clustering). Note that if z_score or standard_scale is set, as it is in the question, data2d holds the scaled values rather than the original ones. So:
df_ro = heatmap_sns.data2d
replaces all the following lines:
# get col and row names by ID
colname_list = [df.columns[col_id] for col_id in
                heatmap_sns.dendrogram_col.reordered_ind]
rowname_list = [df.index[row_id] for row_id in
                heatmap_sns.dendrogram_row.reordered_ind]
# update dataframe
df_ro = df.reindex(rowname_list)
df_ro = df_ro[colname_list]
It is possible to access the indices of reordered columns/rows from the seaborn clustermap using:
> print(f'rows: {heatmap_sns.dendrogram_row.reordered_ind}')
> print(f'columns: {heatmap_sns.dendrogram_col.reordered_ind}')
rows: [5, 0, 13, 2, 18, 7, 4, 16, 12, 19, 14, 15, 10, 3, 8, 6, 17, 11, 1, 9]
columns: [7, 1, 10, 5, 9, 0, 8, 13, 2, 6, 14, 3, 4, 11, 12]
To update row/column order of the original dataframe:
# get col and row names by ID
colname_list = [df.columns[col_id] for col_id in heatmap_sns.dendrogram_col.reordered_ind]
rowname_list = [df.index[row_id] for row_id in heatmap_sns.dendrogram_row.reordered_ind]
# update dataframe
df_ro = df.reindex(rowname_list)
df_ro = df_ro[colname_list]
I've done it here by first getting the names, perhaps there's even a direct way to update columns/rows by indices.
hv.HeatMap({'x': df_ro.columns, 'y': df_ro.index, 'z': df_ro},
           kdims=[('x', 'Col Categories'), ('y', 'Row Categories')],
           vdims='z').opts(cmap="viridis", width=520, height=520)
Since I have used random data, there's little order in the categories, but the picture still looks a little less noisy. Note that the y axis in holoviews/df is simply inverted compared to the seaborn clustermap matrix, which is why the graphic looks flipped.
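On the remark about updating rows/columns directly by indices: positional indexing with DataFrame.iloc should do that without going through the names. A minimal sketch, with made-up index lists (row_order, col_order) standing in for the reordered_ind values from the clustermap, so it runs without seaborn:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 20, size=(4, 3)), columns=list("ABC"))

# Hypothetical orders; in the real case these would be
# heatmap_sns.dendrogram_row.reordered_ind and heatmap_sns.dendrogram_col.reordered_ind.
row_order = [2, 0, 3, 1]
col_order = [1, 2, 0]

# Direct positional reordering of rows and columns in one step.
df_ro = df.iloc[row_order, col_order]

# The name-based route from the answer, for comparison.
df_by_name = df.reindex(df.index[row_order])[df.columns[col_order]]

print(df_ro)
```

Both routes give the same frame here; iloc just skips the intermediate name lists.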

Computing number of sequences

I saw the following problem that I was unable to solve. What kind of algorithm will solve it?
We have been given a positive integer n. Let A be the set of all possible strings of length n where characters are from the set {1,2,3,4,5,6}, i.e. the results of a die thrown n times. How many elements of A contain at least one of the following strings as a substring:
1, 2, 3, 4, 5, 6
1, 1, 2, 2, 3, 3
4, 4, 5, 5, 6, 6
1, 1, 1, 2, 2, 2
3, 3, 3, 4, 4, 4
5, 5, 5, 6, 6, 6
1, 1, 1, 1, 1, 1
2, 2, 2, 2, 2, 2
3, 3, 3, 3, 3, 3
4, 4, 4, 4, 4, 4
5, 5, 5, 5, 5, 5
6, 6, 6, 6, 6, 6
I was thinking of some kind of recursive approach, but I only ended up with a mess when I tried to solve the problem.
I suggest reading up on the Aho-Corasick algorithm. This constructs a finite state machine based on a set of strings. (If your list of strings is fixed, you could even do this by hand.)
Once you have a finite state machine (with around 70 states), you should add an extra absorbing state to mark when any of the strings has been detected.
Now your problem is reduced to finding how many of the 6**n strings end up in the absorbing state after being pushed through the state machine.
You can do this by expressing the state machine as a matrix. Entry M[i,j] gives the number of ways of getting to state i from state j when one letter is added.
Finally, you compute the matrix raised to the power n, applied to an input vector that is all zeros except for a 1 in the position corresponding to the initial state. The number in the absorbing state position tells you the total number of strings.
(You can use the standard matrix exponentiation algorithm, i.e. repeated squaring, to generate this answer with O(log n) matrix multiplications.)
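To make the idea concrete, here is a minimal Python sketch. It builds the Aho-Corasick automaton for the twelve strings and then, instead of raising a matrix to the power n, simply steps a vector of state counts n times (linear in n, which is simpler to read; repeated squaring of the transition matrix would replace that loop). Rather than an explicit absorbing state, it counts the strings that never reach a matching state and subtracts from 6**n. The function names are made up for illustration.

```python
from collections import deque

PATTERNS = ["123456", "112233", "445566", "111222", "333444", "555666",
            "111111", "222222", "333333", "444444", "555555", "666666"]
ALPHABET = "123456"

def build_automaton(patterns):
    """Return (goto, accepting): a complete transition table and match flags."""
    goto = [{}]            # trie edges first, completed into a full DFA below
    accepting = [False]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({})
                accepting.append(False)
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        accepting[s] = True
    fail = [0] * len(goto)
    queue = deque()
    for ch in ALPHABET:
        if ch in goto[0]:
            queue.append(goto[0][ch])   # depth-1 states fail back to the root
        else:
            goto[0][ch] = 0
    while queue:                        # breadth-first, so fail[s] is already complete
        s = queue.popleft()
        accepting[s] = accepting[s] or accepting[fail[s]]
        for ch in ALPHABET:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = goto[fail[s]][ch]
                queue.append(t)
            else:
                goto[s][ch] = goto[fail[s]][ch]
    return goto, accepting

def count_with_pattern(n):
    """Number of length-n strings over 1..6 containing at least one pattern."""
    goto, accepting = build_automaton(PATTERNS)
    counts = [0] * len(goto)   # counts[s]: strings in state s with no match so far
    counts[0] = 1
    for _ in range(n):
        nxt = [0] * len(goto)
        for s, c in enumerate(counts):
            if c:
                for ch in ALPHABET:
                    t = goto[s][ch]
                    if not accepting[t]:
                        nxt[t] += c
        counts = nxt
    return 6 ** n - sum(counts)

print(count_with_pattern(6), count_with_pattern(7))
```

For n = 6 the answer is just the 12 strings themselves, which is a quick sanity check on the automaton.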
What went wrong with your recursive approach? Can you elaborate on that? In any case, this can be solved with a plain recursion in O(6^n). It can be optimized with DP, using the fact that you only need to track the last 6 elements: there are at most 6^6 distinct windows, each with 6 possible next characters, so the running time becomes linear in n.
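// Requires java.util.Set (Java 9+ for Set.of); LIST is the list described in the question.
static final Set<String> LIST = Set.of(
    "123456", "112233", "445566", "111222", "333444", "555666",
    "111111", "222222", "333333", "444444", "555555", "666666");

// Counts the completions of the current prefix that contain at least one pattern.
// cur holds the last (at most 6) characters; found is true once a pattern has been seen.
// Call as rec("", 0, false, n).
static long rec(String cur, int step, boolean found, int n) {
    if (step == n) return found ? 1 : 0;
    long ans = 0;
    for (char c = '1'; c <= '6'; c++) {
        String next = cur + c;                            // add the new element to the end
        if (next.length() > 6) next = next.substring(1);  // shift the window left by 1 step
        ans += rec(next, step + 1, found || LIST.contains(next), n);
    }
    return ans;
}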

equal value = equal rank

I would like to rank the elements of a list such that elements that have the same value also get the same rank:
list = {1, 2, 3, 4, 4, 5}
desired output:
ranks = {5, 4, 3, 2, 2, 1}
Ordering[] does almost what I want but assigns different ranks to the two instances of 4 in the list.
I am not sure that I cover everything you have in mind, but the following code gives the desired output. It presupposes that the smallest value gets the largest rank number, and should work with numerical values, or more generally as long as you are OK with the standard sorting order of Mathematica. The local variable dv is short for "distinct values".
FromListToRanks[k_List]:= Module[ {dv=Reverse[Union[k]]},
k /. Thread[dv -> Range[Length[dv]]] ]
FromListToRanks[list]
{5,4,3,2,2,1}
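The same idea (take the distinct values, sort them, map each value to its position) can be sketched in Python for readers who do not use Mathematica. The function name ranks_with_ties is made up for illustration, and like the code above it assumes the largest value gets rank 1.

```python
def ranks_with_ties(values):
    """Rank values so that equal values share a rank; the largest value gets rank 1."""
    distinct = sorted(set(values), reverse=True)          # distinct values, largest first
    rank_of = {v: i + 1 for i, v in enumerate(distinct)}  # value -> rank
    return [rank_of[v] for v in values]

ranks = ranks_with_ties([1, 2, 3, 4, 4, 5])
print(ranks)  # [5, 4, 3, 2, 2, 1]
```

Note that ranks are dense here: after the tie at rank 2 the next rank is 3, as in the desired output.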

Get distinct values based on selected field with LINQ

I have the following table
Members
Id, GroupId, Age
1, 1, 12
2, 1, 20
3, 1, 33
4, 2, 12
5, 2, 7
How can I write a LINQ query that will give me a list of the oldest member of each group?
The result should be
Id, GroupId, Age
3, 1, 33
4, 2, 12
from m in members
group m by m.GroupId into g
select g.OrderByDescending(m => m.Age).First()
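For anyone who wants to check the logic outside of LINQ, here is a small Python sketch of the same "oldest member per group" selection on the sample table. The tuple layout (Id, GroupId, Age) and the function name oldest_per_group are assumptions made for this example; ties on age keep the first row seen.

```python
def oldest_per_group(rows):
    """Return one (id, group_id, age) row per group: the one with the highest age."""
    best = {}
    for row in rows:
        _id, group_id, age = row
        if group_id not in best or age > best[group_id][2]:
            best[group_id] = row
    return list(best.values())

members = [(1, 1, 12), (2, 1, 20), (3, 1, 33), (4, 2, 12), (5, 2, 7)]
oldest = oldest_per_group(members)
print(oldest)  # [(3, 1, 33), (4, 2, 12)]
```

This makes a single pass over the rows instead of sorting each group, which is the main design difference from the OrderByDescending(...).First() query.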
