Convert a space-separated file (each row = vector) to SequenceFile - hadoop

I created a large text file (4 GB) like the following:
0 1 2 3 2 1
3 6 2 0 6 4
3 0 6 3 0 0
1 6 7 3 9 4
Each row describes a vector, and each column denotes each element of the vector. Each element is separated by one space.
Now, I would like to execute K-Means clustering for all the vectors with Apache Mahout, but I received the error "not a SequenceFile".
How can I create a file whose format meets Mahout's requirements?

Related

Sort string in Java

I need a way to sort by name.
The order is based on how many times each letter of the alphabet, from A to Z, occurs in the word.
That is, count how many times the letter a occurs in each of the two words; the word with the larger number of a's goes first (swap).
If the two words have the same number of a's, compare the next letter, b; if those counts are equal too, compare c, and so on. Assume that no two students in the same class have the same count for every letter.
My code contains a class with a name field of type String, and the main driver contains an array of objects.
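Read literally, the rule above compares words letter by letter on their counts, with higher counts first. A minimal sketch of that reading, in Python rather than Java, is shown here; the function name `letter_count_key` and the sample names are hypothetical, and ignoring upper/lower case is an assumption.

```python
from collections import Counter
import string

def letter_count_key(name):
    """Sort key: counts of a, b, ..., z, negated so larger counts sort first."""
    counts = Counter(name.lower())  # assumption: case is ignored
    return tuple(-counts[letter] for letter in string.ascii_lowercase)

# Hypothetical sample data, not taken from the question.
names = ["alpha", "abba", "banana"]
ordered = sorted(names, key=letter_count_key)
print(ordered)  # ['banana', 'abba', 'alpha']
```

"banana" has three a's and so comes first; "abba" and "alpha" both have two a's, and the tie is broken on b, where "abba" has more.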
I'm a C++ and Python developer, so I can't help you with the Java code, but based on your question I think counting sort is the most suitable approach for this kind of problem, because it sorts items by counting occurrences rather than by comparing them.
Example
Input data: 1, 4, 1, 2, 7, 5, 2
Take a count array to store the count of each unique object.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 2 0 1 1 0 1 0 0
Modify the count array such that each element at each index
stores the sum of previous counts.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
The modified count array indicates the position of each object in
the output sequence.
Rotate the array clockwise for one time.
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
Output each object from the input sequence followed by
increasing its count by 1.
Process the input data: 1, 4, 1, 2, 7, 5, 2. Position of 1 is 0.
Put data 1 at index 0 in output. Increase count by 1 to place next data 1 at an index 1 greater than this index.
The example above is taken from https://www.geeksforgeeks.org/counting-sort/
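The counting steps in that example can be sketched in Python as follows. This is an illustrative sketch, not the GeeksforGeeks code; the function name `counting_sort` and the `max_value` parameter are assumptions, and the input is assumed to be integers in the range 0 to `max_value`.

```python
def counting_sort(data, max_value=9):
    """Counting sort for integers in 0..max_value, following the steps above."""
    # Step 1: count how often each value occurs.
    count = [0] * (max_value + 1)
    for x in data:
        count[x] += 1
    # Step 2: running totals, so count[i] is the number of items <= i.
    for i in range(1, len(count)):
        count[i] += count[i - 1]
    # Step 3: shift right by one, so count[i] is the first output index for i.
    count = [0] + count[:-1]
    # Step 4: place each item at its index, then advance that index.
    output = [None] * len(data)
    for x in data:
        output[count[x]] = x
        count[x] += 1
    return output

print(counting_sort([1, 4, 1, 2, 7, 5, 2]))  # [1, 1, 2, 2, 4, 5, 7]
```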

Increase the numbers in APL

I have the following data:
a b c d
5 9 6 0
3 1 3 2
Characters in the first row, numbers in the second row.
How do I get the character corresponding to the highest number in the second row, and how do I increase the corresponding number in the second row? (For example, here, column b has the highest number, 9, so increase that number by 10%.)
I use Dyalog version 17.1.
With:
⎕←data←3 4⍴'a' 'b' 'c' 'd' 5 9 6 0 3 1 3 2
a b c d
5 9 6 0
3 1 3 2
You can extract the second row with:
2⌷data
5 9 6 0
Now grade it descending, that is, find the indices that would sort it from highest to lowest:
⍒2⌷data
2 3 1 4
The first number is the column we're looking for:
⊃⍒2⌷data
2
Now we can use this to extract the character from the first row:
data[⊂1,⊃⍒2⌷data]
b
But we only need the column index, not the actual character. The full index of the number we want to increase is:
2,⊃⍒2⌷data
2 2
Extracting the data to see that we got the right index:
data[⊂2,⊃⍒2⌷data]
9
Now we can either create a new array with the target value increased by 10%:
1.1×@(⊂2,⊃⍒2⌷data)⊢data
a b c d
5 9.9 6 0
3 1 3 2
Or change it in-place:
data[⊂2,⊃⍒2⌷data]×←1.1
data
a b c d
5 9.9 6 0
3 1 3 2
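For readers less familiar with APL, the same steps (find the column of the largest value in the second row, read the label above it, scale that one value) can be sketched in Python. The function name `bump_largest` and the list-based layout are assumptions made for illustration only.

```python
def bump_largest(labels, values, factor=1.1):
    """Return the label above the largest value and a copy of values
    with that one entry multiplied by factor."""
    # Index of the largest value; on ties the first one wins, like APL grade.
    col = max(range(len(values)), key=lambda i: values[i])
    updated = list(values)
    updated[col] = values[col] * factor
    return labels[col], updated

label, second_row = bump_largest(["a", "b", "c", "d"], [5, 9, 6, 0])
print(label, second_row)
```

The result is returned as a new list, which mirrors the first APL variant; assigning `values[col] *= factor` instead would correspond to the in-place variant.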

APL: how to search for a value's index in a matrix

In APL, matrices and vectors are used to hold data. I was wondering if there is a way to search within a matrix for a given value and have that value's index returned. For example, say I have the following two vectors:
VALUES ← 1 2 3 4 5 6 7 8 9 10 11... all the way up to 36
KINDS ← 0 0 0 2 0 0 0 3 0 ... filled with 0's the rest of the way to 36 length.
If I laminated these two matrices with
kinds,[.5] values
so that they are laminated one on top of the other
1 2 3 4 5 6 7 8 9 10...
0 0 0 2 0 0 0 3 0 ....
is there a functionally easy way to search for the index of the 2 value in the "second row" of the newly laminated matrix? E.g., the column containing
4
2
and return that matrix index?
The value 2 also appears in row 1 of your newly laminated matrix (nlm), and as you stated, you really do not want to search the whole matrix, but only the second row. So, since you're only searching within a given row, getting the column index in that row gives you the complete answer:
row←2
⎕←col←nlm[row;]⍳2
4
nlm[;col] ⍝ values in matched column
4 2
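The same row-restricted lookup can be sketched in Python for comparison. The name `find_in_row` is hypothetical, indices are kept 1-based to match the APL session, and returning `None` for a missing value is an assumption (APL's ⍳ returns one more than the row length instead).

```python
def find_in_row(matrix, row, target):
    """1-based column of the first target in the given 1-based row, or None."""
    try:
        return matrix[row - 1].index(target) + 1
    except ValueError:
        return None

values = list(range(1, 37))                 # 1 2 3 ... 36
kinds = [0, 0, 0, 2, 0, 0, 0, 3] + [0] * 28  # zeros up to length 36
nlm = [values, kinds]                        # values on top, kinds below

col = find_in_row(nlm, 2, 2)
print(col)                              # 4
print([r[col - 1] for r in nlm])        # [4, 2]
```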

Move square inside large matrix, find minimum number in overlapping

I have a square matrix and a smaller square which moves inside the matrix to all possible positions (it does not go outside the matrix). I need to find the smallest number in each such possible overlapping.
The problem is that the sizes of both can go up to thousands. Is there any fast way to do that?
I know one way - if there's an array instead of a matrix and a window instead of a square, we can do that in linear time using a deque.
Thanks in advance.
EDIT: Examples
Matrix:
1 3 6 2 5
8 2 3 4 5
3 8 6 1 5
7 4 8 2 1
8 0 9 0 5
For a square of size 3, total 9 overlappings are possible. For each overlapping the minimum numbers in matrix form are:
1 1 1
2 1 1
0 0 0
It is possible in O(k * n^2) with your deque idea:
If your smaller square is k x k, take the first band of rows, 1 to k, of your matrix and treat it as an array by precomputing, in each column of the matrix, the minimum of the elements in rows 1 to k, rows 2 to k + 1, etc. (this precomputation will take O(k * n^2)). This is what your first band of rows will be:
*********
1 3 6 2 5
8 2 3 4 5
3 8 6 1 5
*********
7 4 8 2 1
8 0 9 0 5
The precomputation I mentioned will give you the minimum in each of its columns, so you will have reduced the problem to your 1d array problem.
Then continue with the row of elements from 2 to k + 1:
1 3 6 2 5
*********
8 2 3 4 5
3 8 6 1 5
7 4 8 2 1
*********
8 0 9 0 5
There will be O(n) rows and you will be able to solve each one in O(n) because our precomputation allows us to reduce them to basic arrays.
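A minimal Python sketch of this approach follows, assuming the matrix is a list of equal-length rows; the function names `sliding_min` and `square_minima` are illustrative. Column minima are computed directly per band of k rows (the O(k * n^2) part), and each band is then solved with the deque-based window minimum.

```python
from collections import deque

def sliding_min(values, k):
    """Minimum of every length-k window of a 1-D list, in linear time."""
    window = deque()  # indices; their values increase from front to back
    result = []
    for i, v in enumerate(values):
        # Drop candidates that can never be a minimum again.
        while window and values[window[-1]] >= v:
            window.pop()
        window.append(i)
        # Drop the front index once it has slid out of the window.
        if window[0] <= i - k:
            window.popleft()
        if i >= k - 1:
            result.append(values[window[0]])
    return result

def square_minima(matrix, k):
    """Minimum of every k x k square, as a matrix of results."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = []
    for top in range(n_rows - k + 1):
        # Minimum of each column over rows top .. top + k - 1: O(k) per column.
        col_min = [min(matrix[r][c] for r in range(top, top + k))
                   for c in range(n_cols)]
        # The band is now a 1-D array; solve it with the deque.
        out.append(sliding_min(col_min, k))
    return out

matrix = [
    [1, 3, 6, 2, 5],
    [8, 2, 3, 4, 5],
    [3, 8, 6, 1, 5],
    [7, 4, 8, 2, 1],
    [8, 0, 9, 0, 5],
]
print(square_minima(matrix, 3))  # [[1, 1, 1], [2, 1, 1], [0, 0, 0]]
```

Running the deque down each column as well would remove the factor k, but that goes beyond what the answer describes.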

Sparse Matrices Storage formats - Conversion

Is there an efficient way of converting a sparse matrix in Compressed Row Storage (CRS) format to Coordinate List (COO) format?
Have a look at Yousef Saad's library SPARSKIT -- he has subroutines to convert back and forth between compressed sparse row and coordinate formats, as well as several other sparse matrix storage schemes.
Anyhow, to see how to get the coordinate format from the compressed one, it's easiest to consider how you could have come up with the compressed row format in the first place. Say you have a sparse matrix in COO, where you've put everything in order, for example
rows: 1 1 1 1 2 2 2 2 2 3 3 3 ...
cols: 1 3 5 9 2 3 7 9 11 1 2 3 ...
So the non-zero entries in row 1 are (1,1), (1,3), (1,5), (1,9) and so forth. You're storing a lot of redundant data in the array of rows; you can instead just have an array ia such that ia(i) tells you the starting address in the array cols for row i. In our example above, we would then have
ia : 1 5 10 ...
cols: 1 3 5 9 2 3 7 9 11 1 2 3 ...
To go from COO to CSR, we just use the fact that
ia(i+1) = ia(i) + number of non-zero entries in row i
for any i. Knowing that, you can work backwards to get the COO format from CSR.
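Working backwards can be sketched in Python as follows. This is an illustration, not SPARSKIT's routine: the function name `csr_to_coo_rows` is hypothetical, indices stay 1-based as in the example, and `ia` is assumed to carry one final entry pointing just past the last stored element. The sample is cut off after the three rows shown above, which is also an assumption.

```python
def csr_to_coo_rows(ia):
    """Expand 1-based CSR row pointers into the 1-based COO row array."""
    rows = []
    for i in range(len(ia) - 1):
        # ia[i + 1] - ia[i] is the number of non-zero entries in row i + 1.
        count = ia[i + 1] - ia[i]
        rows.extend([i + 1] * count)
    return rows

ia = [1, 5, 10, 13]  # assumed: the example truncated after row 3
cols = [1, 3, 5, 9, 2, 3, 7, 9, 11, 1, 2, 3]

rows = csr_to_coo_rows(ia)
print(rows)  # [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
```

The column and value arrays are identical in both formats, so only the row array needs rebuilding, and the conversion is linear in the number of non-zero entries.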
