I am trying, in SAS, to fill in the ID1 variable with the ID number from the row where rdq=adq, for each permco. Here is an example of my data:
permco rdq adq ID ID1
1 333 331 1 .
1 333 332 2 .
1 333 333 3 3
1 333 334 4 .
1 333 335 5 .
1 333 336 6 .
1 555 552 1 .
1 555 553 2 .
1 555 554 3 .
1 555 555 4 4
1 555 556 5 .
1 555 557 6 .
1 555 558 7 .
2 333 331 1 .
2 333 332 2 .
2 333 333 3 3
2 333 334 4 .
2 333 335 5 .
2 333 336 6 .
2 555 552 1 .
2 555 553 2 .
2 555 554 3 .
2 555 555 4 4
2 555 556 5 .
2 555 557 6 .
2 555 558 7 .
And what I desire to have is...
permco rdq adq ID ID1
1 333 331 1 3
1 333 332 2 3
1 333 333 3 3
1 333 334 4 3
1 333 335 5 3
1 333 336 6 3
1 555 552 1 4
1 555 553 2 4
1 555 554 3 4
1 555 555 4 4
1 555 556 5 4
1 555 557 6 4
1 555 558 7 4
2 333 331 1 3
2 333 332 2 3
2 333 333 3 3
2 333 334 4 3
2 333 335 5 3
2 333 336 6 3
2 555 552 1 4
2 555 553 2 4
2 555 554 3 4
2 555 555 4 4
2 555 556 5 4
2 555 557 6 4
2 555 558 7 4
In other words, I would like to fill in ID1, for every row of a group, with the ID value from the row where rdq=adq.
Double DoW loop solution:
data have01;
infile cards truncover expandtabs;
input permco rdq adq ID ID1 ;
cards;
1 333 331 1 .
1 333 332 2 .
1 333 333 3 3
1 333 334 4 .
1 333 333 5 5
1 333 336 6 .
1 555 552 1 .
1 555 553 2 .
1 555 554 3 .
1 555 555 4 4
1 555 556 5 .
1 555 557 6 .
1 555 558 7 .
2 333 331 1 .
2 333 332 2 .
2 333 333 3 3
2 333 334 4 .
2 333 335 5 .
2 333 336 6 .
2 555 552 1 .
2 555 553 2 .
2 555 554 3 .
2 555 555 4 .
2 555 556 5 .
2 555 557 6 .
2 555 558 7 .
;
run;
data want;
   /* First pass: scan one permco/rdq group and remember ID1
      from the row(s) where adq = rdq. */
   do _n_ = 1 by 1 until (last.rdq);
      set have01;
      by permco rdq;
      if first.rdq then call missing(t_ID1);  /* reset the retained value between groups */
      if adq = rdq then t_ID1 = ID1;          /* the last match in the group wins */
   end;
   /* Second pass: re-read the same group and write it out with ID1 filled. */
   do _n_ = 1 to _n_;
      set have01;
      ID1 = t_ID1;
      output;
   end;
   drop t_ID1;
run;
This assumes that, if there are multiple matches, the last one should take precedence. If there are no matches, every row in that group gets a missing value.
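For anyone who wants to sanity-check the expected output outside SAS, here is a minimal pandas sketch of the same group-wise fill (the inline data frame is a toy subset of the example, and pandas is my assumption, not part of the original setup):
import pandas as pd

# Toy subset of the "have" data
df = pd.DataFrame({
    "permco": [1, 1, 1, 1, 1, 1],
    "rdq":    [333, 333, 333, 333, 333, 333],
    "adq":    [331, 332, 333, 334, 335, 336],
    "ID":     [1, 2, 3, 4, 5, 6],
})

# One row per (permco, rdq) group: the ID where adq == rdq;
# .last() makes the last match win, mirroring the DoW loop.
matches = (df[df["adq"] == df["rdq"]]
           .groupby(["permco", "rdq"], as_index=False)["ID"].last()
           .rename(columns={"ID": "ID1"}))

# Left-join back onto every row; groups with no match get NaN.
want = df.merge(matches, on=["permco", "rdq"], how="left")
print(want)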
If your data is such that, for each combination of Permco and RDQ, there is a unique value of ID1 and you need to fill that same value throughout the combination, an alternative can be:
Create a separate data set with only three columns: Permco, RDQ, and ID1, and remove the rows where ID1 is blank.
Merge your input data with this data set on Permco and RDQ.
data intermediate;
   set input_data (keep=Permco RDQ ID1);
   if ID1 = . then delete;
run;
proc sort data=input_data out=input_data_1 (drop=ID1); by Permco RDQ; run;
proc sort data=intermediate; by Permco RDQ; run;
data final;
   merge input_data_1 (in=a) intermediate (in=b);
   by Permco RDQ;
   if a;
run;
I think loops are unnecessary here. All you need to do is find the value you want in each group and merge it back onto the original dataset:
proc sql;
create table temp as select distinct
permco, rdq, id
from have (where = (rdq = adq));
quit;
proc sql;
create table want as select distinct
a.*, b.id as id_filled
from have as a
left join temp as b
on a.permco = b.permco and a.rdq = b.rdq;
quit;
I assume you want the same number within the by-group defined by permco and rdq. The possibility of two or more matches within one by-group is handled with the variable matchespergroup. If no matches are found in a group, id1 is set to missing.
data have01;
infile cards truncover expandtabs;
input permco rdq adq ID ID1 ;
cards;
1 333 331 1 .
1 333 332 2 .
1 333 333 3 3
1 333 334 4 .
1 333 333 5 5
1 333 336 6 .
1 555 552 1 .
1 555 553 2 .
1 555 554 3 .
1 555 555 4 4
1 555 556 5 .
1 555 557 6 .
1 555 558 7 .
2 333 331 1 .
2 333 332 2 .
2 333 333 3 3
2 333 334 4 .
2 333 335 5 .
2 333 336 6 .
2 555 552 1 .
2 555 553 2 .
2 555 554 3 .
2 555 555 4 4
2 555 556 5 .
2 555 557 6 .
2 555 558 7 .
;
run;
data want(drop=rv);
   if 0 then set have01;   /* make the variables known to the PDV at compile time */
   if _N_ = 1 then do;
      /* load the matching rows (adq = rdq), keyed by permco/rdq */
      declare hash hh(dataset:"have01(where=(adq=rdq))", ordered:'A', multidata:'Y');
      hh.definekey('permco','rdq');
      hh.definedata('id1');
      hh.definedone();
   end;
   do until(theend);
      set have01 end=theend;
      rv = hh.find();                         /* fills id1 when a match exists */
      hh.has_next(result: matchespergroup);   /* non-zero when the group has another match */
      if rv = 0 then do; matchespergroup + 1; output; end;
      else do; id1 = .; output; end;
   end;
run;
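For readers unfamiliar with SAS hash objects, the pattern is the same as a plain dictionary keyed by (permco, rdq). A rough Python sketch, assuming the data sits in a hypothetical have01.csv with the same columns (including ID1):
import csv

# Pass 1: build the lookup from the rows where adq == rdq
# (a later match simply overwrites an earlier one here).
lookup = {}
with open("have01.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["adq"] == row["rdq"]:
            lookup[(row["permco"], row["rdq"])] = row["ID1"]

# Pass 2: stream every row and fill ID1 from the lookup.
with open("have01.csv", newline="") as f, \
     open("want.csv", "w", newline="") as out:
    reader = csv.DictReader(f)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["ID1"] = lookup.get((row["permco"], row["rdq"]), "")
        writer.writerow(row)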
I have a blastn output file with tens of thousands of rows. I'm only interested in rows where part of the query sequence ID does not match part of the subject sequence ID; I'd like to put those rows into a new text file. Here is an excerpt of the massive output file from which I want to extract information, as an example:
qseqid qlen qstart qend sseqid slen sstart send evalue bitscore length pident nident mismatch gaps
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 121 679 OFAS003927-RA-EXON03_Anisoscelini_Anisoscelis_flavolineatus_CMF_0018_S7_L005_UQ_trinity_assembled 557 1 557 0 832 562 93.594 526 28 8
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 155 650 OFAS003927-RA-EXON03_Placoscelini_Plaxiscelis_limbata_CMF_0072_S29_L005_UQ_trinity_assembled 820 327 819 0 808 496 96.169 477 16 3
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 222 686 OFAS003927-RA-EXON03_Anisoscelini_Leptoscelis_tricolor_CMF_0079_S32_L005_UQ_trinity_assembled 465 1 465 0 793 465 97.419 453 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 429 635 OFAS003927-RA-EXON03B_Clavigrallini_Clavigralla_sp_CMF_0335_S81_L005_UQ_trinity_assembled 655 1 207 4.30E-87 316 207 94.203 195 12 0
OFAS003927-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 744 531 629 OFAS003927-RA-EXON07_Mictini_Anoplocnemis_sp_CMF_0052_S20_L005_UQ_trinity_assembled 668 1 99 9.92E-39 156 99 94.949 94 5 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 0 1286 696 100 696 0 0
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_declivis_CMF_0069_S26_L005_UQ_trinity_assembled 1060 332 1025 0 1212 696 98.132 683 11 2
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 696 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_thomasi_CMF_0028_S13_L005_UQ_trinity_assembled 814 50 745 0 1147 698 96.418 673 21 4
OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_alata_CMF_0025_S10_L005_UQ_trinity_assembled 696 1 695 OFAS007459-RA-EXON03_Acanthocephalini_Acanthocephala_confraterna_CMF_0123_S44_L005_UQ_trinity_assembled 1313 578 1274 0 1131 699 95.994 671 22 6
qseqid = query sequence ID
sseqid = subject sequence ID
What should match is the OFAS#-RA-EXON# part of the two IDs in each row. When this isn't the case, e.g., in the 4th and 5th rows, I want to extract the entire row and place it into a new text file. I know some regex pattern will need to be employed, but how to indicate columns and search on a per-row basis isn't clear to me.
This will work with GNU Awk. Comparing a fixed-length prefix is fragile here (row 4's EXON03B shares its first 20 characters with EXON03), so compare everything before the first underscore instead:
tail -n+2 input.txt | awk '{ split($1,q,"_"); split($5,s,"_"); if( q[1] != s[1] ) { print $0 } }'
Regards!
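If you'd rather not use awk, here is a short Python sketch of the same filter; the file names are placeholders, and it compares the part of each ID before the first underscore (the OFAS#-RA-EXON# token):
# Compare the token before the first "_" in qseqid (field 1) and sseqid
# (field 5); write the mismatching rows to a new file.
with open("blast_output.txt") as f, open("mismatches.txt", "w") as out:
    next(f)  # skip the header row
    for line in f:
        fields = line.split()
        if len(fields) > 4 and fields[0].split("_", 1)[0] != fields[4].split("_", 1)[0]:
            out.write(line)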
I do not want to wait for Oracle Data Pump (expdp) to finish writing its dump file.
So I start reading data from the moment the file is created.
Then I write this data to another file.
It worked OK: the file sizes are the same (the one that Oracle Data Pump created and the one my data monitoring script created).
But when I run cmp, it shows a difference in 27 bytes:
cmp -l ora.dmp monitor_10k_rows.dmp
3 263 154
4 201 131
5 174 173
6 103 75
48 64 70
58 0 340
64 0 1
65 0 104
66 0 110
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 15
20509 0 157
20510 0 230
20526 0 10
20532 0 15
20533 0 225
20534 0 150
913437 0 226
913438 0 37
913454 0 10
913460 0 1
913461 0 104
913462 0 100
ls -al ora.dmp
-rw-r--r-- 1 oracle oinstall 999424 Jun 20 11:35 ora.dmp
python -c 'print 999424-913462'
85962
od ora.dmp -j 913461 -N 1
3370065 000100
3370066
od monitor_10k_rows.dmp -j 913461 -N 1
3370065 000000
3370066
Even if I export more data, the difference is still 27 bytes, just at different addresses/values:
cmp -l ora.dmp monitor_30k_rows.dmp
3 245 134
4 222 264
5 377 376
6 54 45
48 36 43
57 0 2
58 0 216
64 0 1
65 0 104
66 0 120
541 60 61
545 60 61
552 60 61
559 60 61
20508 0 50
20509 0 126
20510 0 173
20526 0 10
20532 0 50
20533 0 174
20534 0 120
2674717 0 226
2674718 0 47
2674734 0 10
2674740 0 1
2674741 0 104
2674742 0 110
Some writes are the same.
Is there a way to know in advance the addresses of the bytes that will differ?
ls -al ora.dmp
-rw-r--r-- 1 bicadmin bic 2760704 Jun 20 11:09 ora.dmp
python -c 'print 2760704-2674742'
85962
How can I update my monitored copy after Data Pump updates the original at address 2674742, using Python for example?
The exact same thing happens if I use the COMPRESSION=DATA_ONLY option.
Update: I figured out how to sync the bytes that differ between the two files:
import os

def patch_file(fn, diff):
    # each diff line is: <1-based address> <octal byte to write> <octal byte previously there>
    with open(fn, 'r+b') as f:
        for line in diff.splitlines():
            if line.strip():
                addr, to_octal, _ = line.strip().split()
                f.seek(int(addr) - 1)
                f.write(chr(int(to_octal, 8)))
diff="""
3 157 266
4 232 276
5 272 273
6 16 25
48 64 57
58 340 0
64 1 0
65 104 0
66 110 0
541 61 60
545 61 60
552 61 60
559 61 60
20508 15 0
20509 157 0
20510 230 0
20526 10 0
20532 15 0
20533 225 0
20534 150 0
913437 226 0
913438 37 0
913454 10 0
913460 1 0
913461 104 0
913462 100 0
"""
patch_file(f3,diff)
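To avoid pasting the cmp output by hand, the diff text can be captured directly; a sketch in the same Python 2 style (cmp exits with status 1 when the files differ, which is why the return code is ignored rather than checked):
import subprocess

def cmp_diff(f1, f2):
    # `cmp -l` prints "<addr> <octal byte in f1> <octal byte in f2>"
    # for every differing byte
    p = subprocess.Popen(["cmp", "-l", f1, f2], stdout=subprocess.PIPE)
    out, _ = p.communicate()
    return out

# then e.g.: patch_file(f3, cmp_diff(f1, f3))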
I wrote a patch using Python:
import os
import binascii

# f1 and f3 are the two dump file paths, defined elsewhere.

# 1-based addresses near the start of the file that differ ...
addr = [3, 4, 5, 6, 48, 58, 64, 65, 66, 541, 545, 552, 559,
        20508, 20509, 20510, 20526, 20532, 20533, 20534]
# ... plus offsets measured back from the end of the file
last_range = [85987, 85986, 85970, 85964, 85963, 85962]

def get_bytes(addr):
    # read one byte of f1 at each 1-based address; return (addr, octal) pairs
    out = []
    with open(f1, 'r+b') as f:
        for a in addr:
            f.seek(a - 1)
            data = f.read(1)
            octa = oct(int(binascii.hexlify(data), 16))
            out.append((a, octa))
    return out

def patch_file(fn, bytes_to_update):
    with open(fn, 'r+b') as f:
        for (a, to_octal) in bytes_to_update:
            print (a, to_octal)
            f.seek(int(a) - 1)
            f.write(chr(int(to_octal, 8)))

if 1:
    from_file = f1
    fsize = os.stat(from_file).st_size
    bytes_to_read = addr + [fsize - x for x in last_range]
    bytes_to_update = get_bytes(bytes_to_read)
    to_file = f3
    patch_file(to_file, bytes_to_update)
The reason I monitor the dmp file is that it cuts the backup time in half.
I have a very bulky file, about 1M lines, like this:
4001 168991 11191 74554 60123 37667 125750 28474
8 145 25 101 83 51 124 43
2985 136287 4424 62832 50788 26847 89132 19184
3 129 14 101 88 61 83 32 1 14 10 12 7 13 4
6136 158525 14054 100072 134506 78254 146543 41638
1 40 4 14 19 10 35 4
2981 112734 7708 54280 50701 33795 75774 19046
7762 339477 26805 148550 155464 119060 254938 59592
1 22 2 12 10 6 17 2
6 136 16 118 184 85 112 56 1 28 1 5 18 25 40 2
1 26 2 19 28 6 18 3
4071 122584 14031 69911 75930 52394 89733 30088
1 9 1 3 4 3 11 2 14 314 32 206 253 105 284 66
I want to remove rows that have a value less than 100 in the second column.
How to do this with sed?
I would use awk to do this. Example:
awk ' $2 >= 100 ' file.txt
This will display only the rows of file.txt whose second column ($2) is greater than or equal to 100.
Use the following approach:
sed -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
(replace /tmp/test.txt with your current file path)
([0-9]{1,2}|[0][0-9]+) - matches either a one- or two-digit number OR a number with a leading zero (e.g. 012, 00982)
d - delete the pattern space;
-E (--regexp-extended) - use extended regular expressions rather than basic regular expressions
To remove the matched lines in place, use the -i option:
sed -i -E '/^\w+\s+([0-9]{1,2}|[0][0-9]+)\b/d' /tmp/test.txt
I have a set of files containing tab-separated values; on the third line from the end of each file is the value I want. I have extracted that value with
cat result1.tsv | tail -3 | head -1 > final1.tsv
cat result2.tsv | tail -3 | head -1 > final2.tsv
..... and so on (I have almost 30-40 files)
I want the contents of the final tsv files gathered, each on its own line, in a single new file.
I tried
cat final1.tsv final2.tsv > final.tsv
but this only works for a small number of files; it is difficult to write out the names of all the files.
I tried to put the file names in a loop as variables, but that did not work.
final1.tsv contains:
270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2.tsv contains:
1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
all the files (final1.tsv, final2.tsv, final3.tsv, final5.....) contain the same number of columns but different values
I want the rows of each file merged into a new file like this:
final.tsv
final1 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final2 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
final3 270 96 284 139 271 331 915 719 591 1679 1751 1490 968 1363 1513 1184 1525 490 839 425 967 855 356
final4 1 1 0 2 6 5 1 1 11 7 1 3 4 1 0 3 2 1 0 3 2 1 28
here you go...
for f in final{1..4}.tsv
do
    echo -en "$f\t" >> final.tsv
    cat "$f" >> final.tsv
done
Try this:
rm -f final.tsv
for FILE in result*.tsv
do
    tail -3 "$FILE" | head -1 >> final.tsv
done
As long as the files aren't enormous, it's simplest to read each file into an array and select the third record from the end
This solves your problem for you. It looks for all files in the current directory that match result*.tsv and writes the required line from each of them to final.tsv
use strict;
use warnings 'all';

# Sort the result files numerically by the number embedded in each name
my @results = sort {
    my ($aa, $bb) = map /(\d+)/, ($a, $b);
    $aa <=> $bb;
} glob 'result*.tsv';

open my $out_fh, '>', 'final.tsv' or die qq{Unable to open "final.tsv" for output: $!};

for my $result_file ( @results ) {
    open my $fh, '<', $result_file or die qq{Unable to open "$result_file" for input: $!};
    my @data = <$fh>;
    next unless @data >= 3;
    my ($name) = $result_file =~ /([^.]+)/;
    print { $out_fh } "$name\t$data[-3]";
}
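A rough Python equivalent of the same approach, under the same assumptions (result*.tsv files in the current directory, sorted by the number embedded in each name):
import glob
import re

def numeric_key(name):
    # sort result2.tsv before result10.tsv
    m = re.search(r"\d+", name)
    return int(m.group()) if m else 0

with open("final.tsv", "w") as out:
    for path in sorted(glob.glob("result*.tsv"), key=numeric_key):
        with open(path) as f:
            lines = f.read().splitlines()
        if len(lines) >= 3:
            name = path.split(".")[0]  # file name without the extension
            out.write("%s\t%s\n" % (name, lines[-3]))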
I have this table as the result of another query:
STATUS R1 R2 R3 R4 R5 R6 R7 R8 R9
----------------------------------------------------
ACCEPTED 322 241 278 473 575 595 567 449 605
ADECUACIONES 0 0 0 0 2 0 1 0 50
AET 0 0 2 0 0 0 0 0 11
EXECUTED 0 80 1 18 9 57 34 30 20
IN PROCESS 0 0 0 0 0 4 25 2 112
FREQ 0 55 2 76 25 117 7 73 48
INSTALL 1 4 1 10 5 14 2 13 62
WO INSTALL 9 2 51 24 143 17 15 59 16
WOT VL 0 1 0 0 1 0 0 0 0
OTHER 22 7 20 28 44 30 6 6 109
PROG 1 0 1 0 0 2 3 0 0
PTE PROG 0 5 0 0 0 0 3 19 93
TMX 0 0 0 28 4 8 11 3 14
PROJ 0 1 12 26 13 8 0 2 4
What I expect to have is this
STATUS R1 R2 R3 R4 R5 R6 R7 R8 R9 TOTAL
----------------------------------------------------------
ACCEPTED 322 241 278 473 575 595 567 449 605 4105
ADECUACIONES 0 0 0 0 2 0 1 0 50 53
AET 0 0 2 0 0 0 0 0 11 13
EXECUTED 0 80 1 18 9 57 34 30 20 249
IN PROCESS 0 0 0 0 0 4 25 2 112 143
FREQ 0 55 2 76 25 117 7 73 48 403
INSTALL 1 4 1 10 5 14 2 13 62 112
WO INSTALL 9 2 51 24 143 17 15 59 16 336
WOT VL 0 1 0 0 1 0 0 0 0 2
OTHER 22 7 20 28 44 30 6 6 109 272
PROG 1 0 1 0 0 2 3 0 0 7
PTE PROG 0 5 0 0 0 0 3 19 93 120
TMX 0 0 0 28 4 8 11 3 14 68
PROJ 0 1 12 26 13 8 0 2 4 66
TOTAL 355 396 368 683 821 852 674 656 1144 5949
I've been playing with grouping() and rollup(), but I always get duplicated rows and unwanted null values.
If you have problems, the grouping_id function will help you.
(You can select grouping_id(col), but also grouping_id(col1, col2, col3, etc.))
But your case is simpler.
It looks like this:
drop table fg_test_group;
create table fg_test_group (a number, b number, c number, d number);
insert into fg_test_group values (1, 2, 3, 4);
insert into fg_test_group values (2, 2, 3, 4);
insert into fg_test_group values (3, 2, 3, 4);
select nvl(to_char(a), 'total') as a , sum(b), sum(c), sum(d), grouping_id(a)
from fg_test_group
group by rollup (a)
;
where a is Status in your case.
CREATE TABLE TEST1 (STATUS VARCHAR2(10), R1 NUMBER, R2 NUMBER, R3 NUMBER);
INSERT INTO TEST1 VALUES ('ACCEPTED', 322,241,278);
INSERT INTO TEST1 VALUES ('EXECUTED', 0, 80, 1);
INSERT INTO TEST1 VALUES ('FREQ', 0, 55, 2);
COMMIT;
select NVL(TO_CHAR(STATUS), 'total') as STATUS ,SUM(R1) R1, SUM(R2) R2 , SUM(R3) R3, SUM(R1+R2+R3)
from TEST1
group by rollup (STATUS)
;
STATUS R1 R2 R3 SUM(R1+R2+R3)
ACCEPTED 322 241 278 841
EXECUTED 0 80 1 81
FREQ 0 55 2 57
total 322 376 281 979