sort a file by grouping column

sort a file by grouping column - sorting

I have a file like the following:
PSG1 B41M 3
PSG1 G03G 1
PSG1 C09D 2
PSG2 H01L 4
PSG2 C08L 3
PSG10 H01B 2
PSG10 C08J 4
I want to sort the values in the third column but only when they have the same PSG.
For the given example, I want the output file:
PSG1 B41M 3
PSG1 C09D 2
PSG1 G03G 1
PSG2 H01L 4
PSG2 C08L 3
PSG10 C08J 4
PSG10 H01B 2
I tried to sort the file based on the first and third column using command sort but it does not work as PSG10 appears exactly after PSG1 (before PSG2).
Any other ideas? I do not care if it is a script or Java code
Thank you.

I think you could use Map<K, V> data structure to hold data and sort on the values then you can have a reference how to sort a Map<K, V>

Related

Most efficiently insert a number in a maintained large sorted variable

I need to most efficiently insert a number in a maintained large sorted variable. Is there a better method than test1?
test1 is quite a bit faster vs test2 which is just to append a variable then resort.
q←1000000⍴0 ⋄ q←10 9 8 7 6 5 4 3 2,q ⍝q is kept sorted
test1←{
y←⍺(⍳∘1≤)⍵ ⍝ very fast
(y↑⍺),⍵,(y↓⍺) ⍝ is there a tacit version here and without copying?
}
10↑q test1 6
10 9 8 7 6 6 5 4 3 2
cmpx 'q test1 6'
3.2E¯4
test2←{y←⍵,⍺ ⋄ y[⍒y]}
10↑q test2 6
10 9 8 7 6 6 5 4 3 2
cmpx 'q test2 6'
1.5E¯3
I tried presorted variable. With test1 is quicker than appending then sorting. Perhaps test1 refactored with better tacit?

Possibly not the answer you are looking for, but in a production application, if access to these sorted keys with frequent appends was an important performance consideration in a Dyalog APL application, you might resort to something like the following class. The strategy is to have an unsorted variable data which can be appended to efficiently using a method called Append. Sorting is done on demand, if needed (there is room for further optimisation by checking whether the appended value is greater than the last element in the list, which would be worthwhile if that was a common case).
:Class Sorted
:Property Values
:Access Public
∇Set value
data←value
sorted←0
∇
∇r←Get value
:If ~sorted
sorteddata←data[⍒data]
sorted←1
:EndIf
r←sorteddata
∇
:EndProperty
∇ Make initial
:Implements Constructor
:Access Public
data←initial
sorted←0
∇
∇ r←Append values
:Access Public
data,←values
r←sorted←0
∇
:EndClass
Usage would be along the lines of:
s←⎕NEW Sorted (10 9 8 7 6 5 4 3 2,1E6⍴0)
s.Append 6
s.Append 7
≢s.Values
100011

How to show top N number of results with customization in spark rdd?

val sorting = sc.parallelize(List(1,1,1,2,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,7,8,8,8,8,8))
sorting.map(x=>(x,1)).reduceByKey((a,b)=>a+b).map(x=>(x._1,"==>",x._2)).sortBy(s=>s._2,false).collect.foreach(println)
output:
(8,==>,5)
(1,==>,3)
(2,==>,4)
(3,==>,3)
(4,==>,4)
(5,==>,3)
(6,==>,2)
(7,==>,1)
I want to show only top 3 results and remove , (comma) from the result.

use take(3) instead of collect to get the top 3 results, and then clean up the output manually:
sorting.map(x=>(x,1)).reduceByKey((a,b)=>a+b).sortBy(s=>s._2,false).map(x=>s"${x._1} ${x._2}").take(3).foreach(println)
8 5
2 4
4 4

spss search for value in dataset

I'd like to find any cases of a value (e.g., 0) in any cell in an SPSS database. What syntax would accomplish this?
(I came across a python script but don't have that option.)

It is still not very clear how you want to select those cases. But the below syntax will list in the output any cases which have ate least one "0" in any of the variables var1,var2 or var3. I am assuming CaseID is the case identifier variable.
TEMPORARY.
SELECT IF ANY(0,var1,var2,var3).
LIST CaseID var1 var2 var3.
You can use as many variables as you want in the ANY function, and also on the LIST command.

The following syntax will create a list of appearances of 0 within your data - In a separate file:
First creating some fake data to demonstrate on.
data list list/ID (a6) test1 to test6 (6f2).
begin data
ID_001 2 3 2 3 0 3
ID_002 3 4 0 4 3 4
ID_003 0 4 2 4 2 4
ID_004 7 0 1 2 8 3
ID_005 5 5 5 0 5 5
ID_006 4 5 4 5 4 0
end data.
dataset name origData.
Now to create the list:
dataset copy ForList.
dataset activate ForList. /* the list will be created from a copy of the data.
varstocases /make vals from test1 to test6/index testNum(vals).
select if vals=0.
You can use the list in the new file, or put it in the output window:
list ID testNum.

bash: add column if row name is repeated

I have a file with several variables in rows and values of these variables in columns. Some rows are repeated and only contain data for some of the columns (e.g. is the example below, the second time "A" appears, it only contains data in columns S1 and S2)
Example:
Variable S1 S2 S3
A 3 5 6
B 4 5 6
A some_string another_string
C 2 5 6
What I want is to add another (or several) columns that contain the data from the repeated row
Output example:
Variable S1 S2 S3 new_column1 new_column2
A 3 5 6 some_string another_string
B 4 5 6
C 2 5 6
I am thinking that something like the code below could get me there, but it's still erroneous and I'm not sure if it is even possible to do in bash?
My code would only be able to create ONE new column and I don't know how I can add the data to that new column.
I found those pieces of code in an other question that was similar, but not quite what I want, so I would appreciate any help!
awk 'NR==1{$5="new_column";print;next} seen[$1]++ {$5=$2}' file

Bash: find all pair of lines such that the difference of their first field is less than a threshold

my problem is the following. I have a BIG file with many rows containing ordered numbers (repetitions are possible)
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all the pair of row numbers such that their difference is smaller than a given threshold thr. Let us say for instance thr=2.01, I want
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote a thing in python but the file is huge and I think I need a smart way to do this in bash.
Actually, in the complete data structure there exists also a second column containing a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help, I am not too familiar with bash

In any language you can white a program implementing this pseudo code:
while read line:
row = line.split(sep)
new_kept_rows = []
for kr in kept_rows :
if abs(kr[0], row[0])<=thr:
print "".join(kr[1:]) "|" "".join(row[1:])
new_kept_rows.append(kr)
kept_rows = new_kept_rows
This program only keep the few lines which could match the condition. All other are freed from memory. So the memory footprint should remain small even for big files.
I would use awk language because I'm comfortable with. But python would fit too (the pseudo code I give is very close to be python).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

sort a file by grouping column - sorting

I think you could use Map<K, V> data structure to hold data and sort on the values then you can have a reference how to sort a Map<K, V>

Related

Most efficiently insert a number in a maintained large sorted variable

How to show top N number of results with customization in spark rdd?

spss search for value in dataset

bash: add column if row name is repeated

Bash: find all pair of lines such that the difference of their first field is less than a threshold

Categories

Resources