I have a file with several variables in rows and the values of these variables in columns. Some rows are repeated and only contain data for some of the columns (e.g. in the example below, the second time "A" appears, it only contains data in columns S1 and S2).
Example:
Variable S1 S2 S3
A 3 5 6
B 4 5 6
A some_string another_string
C 2 5 6
What I want is to add another (or several) columns that contain the data from the repeated row
Output example:
Variable S1 S2 S3 new_column1 new_column2
A 3 5 6 some_string another_string
B 4 5 6
C 2 5 6
I am thinking that something like the code below could get me there, but it's still erroneous and I'm not sure whether this is even possible in bash.
My code would only be able to create ONE new column, and I don't know how to add the data to that new column.
I found these pieces of code in another question that was similar, but not quite what I want, so I would appreciate any help!
awk 'NR==1{$5="new_column";print;next} seen[$1]++ {$5=$2}' file
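For illustration, the merge described above can be sketched in Python. This is only a sketch under assumptions that the question does not state: fields are whitespace-separated, the first occurrence of each variable is the complete row, and the function name merge_repeated_rows and the generated names new_column1, new_column2, ... are hypothetical.

```python
def merge_repeated_rows(lines):
    """Fold repeated rows into extra columns on the first occurrence."""
    header, *rows = [line.split() for line in lines if line.strip()]
    merged = {}  # variable name -> [fields of first row, fields from repeats]
    for name, *values in rows:
        if name in merged:
            merged[name][1].extend(values)   # repeated row: keep as extras
        else:
            merged[name] = [values, []]      # first occurrence
    # One new header column per extra field of the widest merged row.
    n_extra = max((len(extra) for _, extra in merged.values()), default=0)
    header = header + [f"new_column{i}" for i in range(1, n_extra + 1)]
    out = [" ".join(header)]
    for name, (values, extra) in merged.items():
        out.append(" ".join([name, *values, *extra]))
    return out
```

Rows are written in order of first appearance, so the example input yields A, B, C. If a first occurrence has fewer fields than the header, the extras would shift left, which this sketch does not handle.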
I need to insert a number into a large variable that is kept sorted, as efficiently as possible. Is there a better method than test1?
test1 is quite a bit faster than test2, which just appends the value and then re-sorts.
q←1000000⍴0 ⋄ q←10 9 8 7 6 5 4 3 2,q ⍝q is kept sorted
test1←{
y←⍺(⍳∘1≤)⍵ ⍝ very fast
(y↑⍺),⍵,(y↓⍺) ⍝ is there a tacit version here and without copying?
}
10↑q test1 6
10 9 8 7 6 6 5 4 3 2
cmpx 'q test1 6'
3.2E¯4
test2←{y←⍵,⍺ ⋄ y[⍒y]}
10↑q test2 6
10 9 8 7 6 6 5 4 3 2
cmpx 'q test2 6'
1.5E¯3
I tried keeping the variable presorted. With that, test1 is quicker than appending and then sorting. Perhaps test1 could be refactored into a better tacit form?
Possibly not the answer you are looking for, but in a production application, if access to these sorted keys with frequent appends was an important performance consideration in a Dyalog APL application, you might resort to something like the following class. The strategy is to have an unsorted variable data which can be appended to efficiently using a method called Append. Sorting is done on demand, if needed (there is room for further optimisation by checking whether the appended value is greater than the last element in the list, which would be worthwhile if that was a common case).
:Class Sorted
:Property Values
:Access Public
∇Set value
data←value
sorted←0
∇
∇r←Get value
:If ~sorted
sorteddata←data[⍒data]
sorted←1
:EndIf
r←sorteddata
∇
:EndProperty
∇ Make initial
:Implements Constructor
:Access Public
data←initial
sorted←0
∇
∇ r←Append values
:Access Public
data,←values
r←sorted←0
∇
:EndClass
Usage would be along the lines of:
s←⎕NEW Sorted (10 9 8 7 6 5 4 3 2,1E6⍴0)
s.Append 6
s.Append 7
≢s.Values
1000011
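The further optimisation mentioned above (skipping the re-sort when an append cannot break the order) can be sketched as follows. This is a Python illustration rather than Dyalog code, and it makes some assumptions: the data is kept in descending order, as ⍒ produces, so an append is harmless when the new value does not exceed the current last element; the class name LazySorted is hypothetical; and, unlike the APL class, it sorts one list in place instead of keeping a separate sorted copy.

```python
class LazySorted:
    """Descending collection that re-sorts only when read and only if needed."""

    def __init__(self, initial):
        self._data = list(initial)
        self._sorted = False          # order of the initial data is unknown

    def append(self, value):
        # Only an append larger than the current last element can break
        # descending order; otherwise the sorted flag is left as it was.
        if self._data and value > self._data[-1]:
            self._sorted = False
        self._data.append(value)

    @property
    def values(self):
        if not self._sorted:
            self._data.sort(reverse=True)   # sort on demand
            self._sorted = True
        return self._data
```

Usage mirrors the APL example: build it from the initial values, call append as often as needed, and read values only when the sorted order is actually required.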
I have a Google Sheet in which I have to calculate a moving average, conditioned on the 'ID', that averages the last 3 periods.
Any idea on how to do it?
I leave an example with the final results (column "Mean Average (last 3)").
Regards!
ID value Mean Average (last 3)
1 12 12,00
1 19 12,00
1 19 15,50
1 18 16,67
1 13 18,67
2 11 11,00
2 18 11,00
2 15 14,50
2 17 14,67
2 11 16,67
3 11 11,00
3 16 11,00
3 10 13,50
3 11 12,33
I've got an answer that may work for you. Assuming that your sample data is in columns A4:C (see my sample sheet), try the following formula in column D, in the same row as your data headers.
={"Mean Avg";ArrayFormula(
IF(ROW(A4:A18)<ROW(A$4)+2,
C$4,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-1,0))),
B4:B19,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-2,0))),
B3:B18,
IF(NOT(EQ(A4:A18,OFFSET(A4:A18,-3,0))),
(B2:B17+B3:B18)/2,
(B1:B16+B2:B17+B3:B18)/3)))))}
The first IF checks whether it is one of the first two data rows, to force the initial values.
The next IF checks if the ID is not equal to the row above, and forces the start of a new Average, with just one value. The next IF checks if it is the second ID in a series (NOT EQual to the ID 2 rows up), and if yes, also uses the single value from the row above.
The next IF checks up three rows, and if the IDs are different, it averages the values from the two rows above.
Otherwise, this is the fourth data row in a series with the same ID, and the formula takes the values from the three rows above, and averages them.
Due to the offsets, it seems quite sensitive to ranges, so it may need some tuning if you move it.
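To check the logic outside the sheet, the same rule can be written as a short Python sketch. It assumes what the expected column implies: each row reports the mean of up to three preceding values with the same ID, and the first row of an ID reports its own value. The function name mean_of_previous and its window parameter are hypothetical.

```python
from collections import defaultdict


def mean_of_previous(ids, values, window=3):
    """Per-ID mean of up to `window` preceding values."""
    history = defaultdict(list)   # ID -> values seen so far for that ID
    result = []
    for key, value in zip(ids, values):
        previous = history[key][-window:]
        if previous:
            result.append(sum(previous) / len(previous))
        else:
            # First row of an ID has no earlier values: report the value itself.
            result.append(float(value))
        history[key].append(value)
    return result
```

Rounded to two decimals, this reproduces the "Mean Average (last 3)" column of the sample data.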
Let me know if this helps.
I'd like to find any cases of a value (e.g., 0) in any cell in an SPSS database. What syntax would accomplish this?
(I came across a python script but don't have that option.)
It is still not very clear how you want to select those cases, but the syntax below will list in the output any cases that have at least one "0" in any of the variables var1, var2 or var3. I am assuming CaseID is the case identifier variable.
TEMPORARY.
SELECT IF ANY(0,var1,var2,var3).
LIST CaseID var1 var2 var3.
You can use as many variables as you want in the ANY function, and also on the LIST command.
The following syntax will create a list of the appearances of 0 within your data, in a separate file.
First, create some fake data to demonstrate on:
data list list/ID (a6) test1 to test6 (6f2).
begin data
ID_001 2 3 2 3 0 3
ID_002 3 4 0 4 3 4
ID_003 0 4 2 4 2 4
ID_004 7 0 1 2 8 3
ID_005 5 5 5 0 5 5
ID_006 4 5 4 5 4 0
end data.
dataset name origData.
Now to create the list:
dataset copy ForList.
dataset activate ForList. /* the list will be created from a copy of the data.
varstocases /make vals from test1 to test6/index testNum(vals).
select if vals=0.
You can use the list in the new file, or put it in the output window:
list ID testNum.
My problem is the following: I have a BIG file with many rows containing ordered numbers (repetitions are possible)
1
1.5
3
3.5
6
6
...
1504054
1504056
I would like to print all pairs of row numbers whose values differ by less than a given threshold thr. Say, for instance, thr=2.01; then I want
0 1
0 2
1 2
1 3
2 3
4 5
...
N-1 N
I wrote a thing in python but the file is huge and I think I need a smart way to do this in bash.
Actually, in the complete data structure there exists also a second column containing a string:
1 s0
1.5 s1
3 s2
3.5 s3
6 s4
6 s5
...
1504054 sN-1
1504056 sN
and, if easy to do, I would like to write in each row the pair of linked strings, possibly separated by "|":
s0|s1
s0|s2
s1|s2
s1|s3
s2|s3
s4|s5
...
sN-1|sN
Thanks for your help, I am not too familiar with bash
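In any language you can write a program implementing this pseudo code (spelled out here as Python that reads the file from standard input):
import sys
thr = 2.01  # threshold on the difference between first-column values
kept_rows = []  # earlier rows that may still pair with upcoming rows
for line in sys.stdin:
    row = line.split()
    row[0] = float(row[0])
    new_kept_rows = []
    for kr in kept_rows:
        if abs(kr[0] - row[0]) <= thr:
            print("".join(kr[1:]) + "|" + "".join(row[1:]))
            new_kept_rows.append(kr)
    new_kept_rows.append(row)  # the current row may pair with later ones
    kept_rows = new_kept_rows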
This program only keeps the few lines that could still match the condition; all others are freed from memory, so the memory footprint should remain small even for big files.
I would use awk because I'm comfortable with it, but Python would fit too (the pseudo code I give is very close to Python).
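For the row-number output the question asks for first, the same keep-only-nearby-rows idea can be sketched as a generator. It assumes the values arrive sorted in ascending order, as stated in the question, and it uses the same "<= thr" comparison as above; the function name close_pairs is hypothetical.

```python
from collections import deque


def close_pairs(values, thr):
    """Yield (i, j), i < j, with values[j] - values[i] <= thr (sorted input)."""
    window = deque()   # (index, value) of earlier rows still within thr
    for j, v in enumerate(values):
        # Input is ascending, so the oldest entry is the first to fall out
        # of range, and once out of range it can never match again.
        while window and v - window[0][1] > thr:
            window.popleft()
        for i, _ in window:
            yield i, j
        window.append((j, v))
```

Pairs come out grouped by the second index, which coincides with the order shown in the question for the sample data. Only the rows inside the current window are held in memory.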
I have a file like the following:
PSG1 B41M 3
PSG1 G03G 1
PSG1 C09D 2
PSG2 H01L 4
PSG2 C08L 3
PSG10 H01B 2
PSG10 C08J 4
I want to sort the values in the third column but only when they have the same PSG.
For the given example, I want the output file:
PSG1 B41M 3
PSG1 C09D 2
PSG1 G03G 1
PSG2 H01L 4
PSG2 C08L 3
PSG10 C08J 4
PSG10 H01B 2
I tried to sort the file based on the first and third columns using the sort command, but it does not work because PSG10 appears directly after PSG1 (before PSG2).
Any other ideas? I do not care if it is a script or Java code
Thank you.
I think you could use a Map<K, V> data structure to hold the data and sort on the values; any reference on how to sort a Map<K, V> by value should then get you the rest of the way.
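Since the question accepts a script, here is the map-based idea as a minimal Python sketch rather than Java. It assumes whitespace-separated fields, an integer third column, descending order within each group (as in the expected output), and groups kept in order of first appearance; the function name sort_within_groups is hypothetical.

```python
def sort_within_groups(lines):
    """Sort rows by the third column (descending) within each first-column group."""
    groups = {}   # first-column key -> rows, in order of first appearance
    for line in lines:
        fields = line.split()
        if fields:
            groups.setdefault(fields[0], []).append(fields)
    result = []
    for rows in groups.values():
        rows.sort(key=lambda f: int(f[2]), reverse=True)
        result.extend(" ".join(f) for f in rows)
    return result
```

Because the groups are never compared with each other, PSG10 stays after PSG2 exactly as it appears in the input.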