Cluster faces into groups - cluster-computing

The data:
2328568501515627770 2328529760910617771 100.0
2328529760910617771 2328568501515627770 100.0
2328529760910617771 2328530052968393930 99.976524
2328529760910617771 2328514835899260501 99.69356
2328529760910617771 2328515153726841781 98.76936
2328529760910617771 2328515132252005201 98.741165
2328529760910617771 2328515149431874457 98.6116
2328529760910617771 2328515158021809084 98.47021
2328529760910617771 2328515145136907144 98.156456
2328529760910617771 2328515089302332012 97.53229
2328529760910617771 2328515153726841775 97.449005
2328568501515627770 2328530052968393930 99.976524
2328530052968393930 2328529760910617771 99.976524
2328530052968393930 2328568501515627770 99.976524
2328530052968393930 2328514835899260501 99.68713
2328530052968393930 2328515132252005201 98.70858
2328530052968393930 2328515158021809084 98.612816
2328530052968393930 2328515153726841781 98.59485
2328530052968393930 2328515149431874457 98.43197
2328530052968393930 2328515145136907144 98.12278
2328530052968393930 2328515089302332012 97.5466
2328530052968393930 2328515153726841775 97.299934
2328515153726841775 2328568501515627770 97.44901
2328515153726841775 2328530052968393930 97.299934
2328515153726841775 2328514835899260501 97.28116
2328515153726841775 2328515149431874457 96.93521
First column is face id
Second column is matched face id
Third column is similarity score
EDIT: How can I apply hierarchical clustering on the above dataset?

All similarities are well above 80%, so everything ends up in a single cluster.
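For reference, a minimal sketch of how that could look with SciPy (my assumptions, not from the original post: the triples live in a whitespace-separated file pairs.txt, pairs that never matched default to the maximum distance, and the dendrogram is cut at the 80% similarity threshold mentioned above, i.e. distance 20):
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Read "face_id matched_face_id similarity" triples (hypothetical file name).
pairs = []
ids = set()
with open("pairs.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 3:
            continue
        a, b, s = parts
        pairs.append((a, b, float(s)))
        ids.update((a, b))

ids = sorted(ids)
idx = {face: i for i, face in enumerate(ids)}

# Convert similarity (0-100) to distance; unseen pairs default to the
# maximum distance of 100.
dist = np.full((len(ids), len(ids)), 100.0)
np.fill_diagonal(dist, 0.0)
for a, b, s in pairs:
    d = 100.0 - s
    dist[idx[a], idx[b]] = dist[idx[b], idx[a]] = min(dist[idx[a], idx[b]], d)

# Single-linkage clustering, cutting the dendrogram at distance 20
# (i.e. 80% similarity). Single linkage chains faces together at the
# smallest pairwise distance, so with similarities this high everything
# merges into one cluster, matching the observation above.
Z = linkage(squareform(dist, checks=False), method="single")
labels = fcluster(Z, t=20.0, criterion="distance")
for face, label in zip(ids, labels):
    print(face, label)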

Related

saving dataframe and corresponding chart in a single pdf file in python matplotlib

I have a dataframe:
id1 id2 fields valid invalid missing
0 1001.0 0.0 State 158.0 0.0 0.0
1 1001.0 0.0 Zip 156.0 0.0 2.0
2 1001.0 0.0 Race 128.0 20.0 10.0
3 1001.0 0.0 LastName 158.0 0.0 0.0
4 1001.0 0.0 Email 54.0 0.0 104.0
... ... ... ... ... ... ...
28859 5276.0 36922.0 Phone 0.0 0.0 8.0
28860 5276.0 36922.0 Email 1.0 0.0 7.0
28861 5276.0 36922.0 State 8.0 0.0 0.0
28862 5276.0 36922.0 office ID 8.0 0.0 0.0
28863 5276.0 36922.0 StreetAdd 8.0 0.0 0.0
with the initial goal of grouping by individual id and creating a PDF file. I was able to create a PDF file from the plot, but I would also like to save the dataframe that goes with the graph in the same PDF file.
# read the csv file
cme_df = pd.read_csv('sample.csv')
# fill na with 0
cme_df = cme_df.fillna(0)
# iterate through the unique id2 in the file
for i in cme_df['id2'].unique():
    with PdfPages('home/' + 'id2_' + str(i) + '.pdf') as pdf:
        cme_i = cme_df[cme_df['id2'] == i].sort_values('fields')
        print(cme_i)
        # I feel this is where I must have something to create or save the table into pdf with the graph created below #
        # create the barh graph
        plt.barh(cme_i['fields'], cme_i['valid'], color='g', label='valid')
        plt.barh(cme_i['fields'], cme_i['missing'], left=cme_i['valid'], color='y', label='missing')
        plt.barh(cme_i['fields'], cme_i['invalid'], left=cme_i['valid'] + cme_i['missing'], color='r', label='invalid')
        plt.legend(bbox_to_anchor=(0.5, -0.05), loc='upper center', shadow=True, ncol=3)
        plt.suptitle('valid, invalid, missing', fontweight='bold')
        plt.title('id2: ' + str(i))
        pdf.savefig()
        plt.clf()
My code above prints the table in the results window, then creates the horizontal bar chart; the last few lines save the graph into a PDF. I would like to save both the dataframe and the graph in a single file.
Some searches suggested converting to HTML and then to PDF, but I cannot seem to make it work:
cme_i.to_html('id2_'+str(i)+'.html')
# then convert to pdf
pdf.from_file(xxxxx)
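One way that avoids the HTML detour is to render the dataframe itself with matplotlib and save it into the same PdfPages object as an extra page. A minimal sketch, assuming the column layout shown above; the two-pages-per-id layout and the ax.table styling are illustrative choices, not the only option:
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.backends.backend_pdf import PdfPages

cme_df = pd.read_csv('sample.csv').fillna(0)

for i in cme_df['id2'].unique():
    cme_i = cme_df[cme_df['id2'] == i].sort_values('fields')
    with PdfPages('home/id2_' + str(i) + '.pdf') as pdf:
        # Page 1: the dataframe rendered as a matplotlib table.
        fig, ax = plt.subplots()
        ax.axis('off')
        ax.table(cellText=cme_i.values, colLabels=cme_i.columns, loc='center')
        pdf.savefig(fig)
        plt.close(fig)

        # Page 2: the stacked horizontal bar chart from the question.
        fig, ax = plt.subplots()
        ax.barh(cme_i['fields'], cme_i['valid'], color='g', label='valid')
        ax.barh(cme_i['fields'], cme_i['missing'], left=cme_i['valid'], color='y', label='missing')
        ax.barh(cme_i['fields'], cme_i['invalid'], left=cme_i['valid'] + cme_i['missing'], color='r', label='invalid')
        ax.legend(bbox_to_anchor=(0.5, -0.05), loc='upper center', shadow=True, ncol=3)
        fig.suptitle('valid, invalid, missing', fontweight='bold')
        ax.set_title('id2: ' + str(i))
        pdf.savefig(fig)
        plt.close(fig)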

Gnuplot: frequency per min

I have a sample log file with 1000 lines that looks like this:
TIME,STATUS
09:00,OK
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:01,OK
09:01,OK
09:01,OK
09:01,PERM
09:01,TEMP
09:01,TEMP
09:02,OK
09:02,TEMP
09:02,TEMP
09:03,OK
09:03,PERM
09:03,PERM
09:03,TEMP
09:03,TEMP
09:04,OK
09:04,PERM
09:04,PERM
09:04,TEMP
09:04,TEMP
09:04,TEMP
09:05,OK
09:05,OK
09:05,OK
09:05,PERM
09:05,TEMP
09:05,TEMP
09:05,TEMP
09:05,TEMP
09:06,OK
09:06,OK
09:06,PERM
09:06,PERM
09:06,PERM
09:06,PERM
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:07,OK
09:07,OK
09:07,TEMP
09:07,TEMP
09:07,TEMP
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,TEMP
09:08,TEMP
09:08,TEMP
09:08,TEMP
09:09,OK
09:09,OK
09:09,OK
09:09,PERM
09:10,OK
09:10,PERM
09:10,PERM
09:10,TEMP
09:11,OK
09:11,OK
09:11,OK
09:11,OK
09:11,PERM
09:11,PERM
09:11,PERM
09:11,PERM
09:11,TEMP
09:11,TEMP
09:11,TEMP
09:12,PERM
09:12,TEMP
09:12,TEMP
09:13,OK
09:13,OK
09:13,OK
09:13,OK
09:13,OK
09:13,PERM
09:13,PERM
09:13,PERM
09:13,TEMP
09:13,TEMP
09:14,OK
09:14,OK
09:14,OK
09:14,PERM
09:14,PERM
09:14,PERM
09:14,PERM
09:14,TEMP
09:16,OK
09:16,OK
09:16,OK
09:16,PERM
09:16,PERM
09:16,TEMP
09:16,TEMP
09:17,OK
09:17,OK
09:17,PERM
09:17,PERM
09:18,OK
09:18,OK
09:18,OK
09:18,OK
09:18,OK
09:18,PERM
09:18,PERM
09:18,TEMP
09:18,TEMP
09:18,TEMP
09:19,OK
09:19,OK
09:19,OK
09:19,OK
09:19,OK
09:19,PERM
09:20,OK
09:20,OK
09:20,PERM
09:20,PERM
09:20,TEMP
09:20,TEMP
09:21,OK
09:21,OK
09:21,OK
09:21,PERM
09:21,TEMP
09:22,OK
09:22,OK
09:22,PERM
09:22,PERM
09:22,TEMP
09:22,TEMP
09:23,OK
09:23,PERM
09:23,PERM
09:23,PERM
09:23,TEMP
09:23,TEMP
09:23,TEMP
09:24,PERM
09:24,PERM
09:24,PERM
09:25,OK
09:25,OK
09:25,PERM
09:25,TEMP
09:26,OK
09:26,OK
09:26,OK
09:26,OK
09:26,OK
09:26,PERM
09:26,TEMP
09:27,OK
09:27,OK
09:27,OK
09:27,PERM
09:27,PERM
09:27,TEMP
09:27,TEMP
09:27,TEMP
09:28,PERM
09:28,PERM
09:28,PERM
09:28,PERM
09:29,OK
...
while the final file will have 10K lines in the same time frame.
I need to create a graph to show number of statuses per minute for TEMP, PERM and OK. So I would like to use a line for the status (TEMP, PERM and OK), plot time on the X axis, and frequency of occurrence on the Y axis.
I installed Gnuplot only 2 days ago on my Ubuntu 20.04.4 LTS from the standard repo:
bi#green:bin$ apt list gnuplot* 2>/dev/null | grep installed
gnuplot-data/focal,focal,now 5.2.8+dfsg1-2 all [installed,automatic]
gnuplot-qt/focal,now 5.2.8+dfsg1-2 amd64 [installed,automatic]
gnuplot/focal,focal,now 5.2.8+dfsg1-2 all [installed]
and so far I haven't managed more than this,
#!/bin/bash
x=logoutcol
cat $x
gnuplot -p <<-EOF
#set ytics scale 0
#set yzeroaxis
reset
set format x "%H:%M" time
set xdata time
set yrange [0:*]
set ylabel "Occurences"
set ytics 2
#set margin at screen 0.95
binwidth=60
bin(val) = binwidth * floor(val/binwidth)
set boxwidth binwidth
set datafile separator ","
set term png
set output "$x.png"
plot "$x" using (bin(timecolumn(1,"%H%M"))):(2) smooth freq with boxes
EOF
shotwell $x.png
rm $x.png
which produces this:
Any help will be much appreciated.
I am pretty sure that there was an almost identical question here on SO; however, I can't seem to find it, perhaps because I'm not hitting the right keywords for SO's search function.
The key point is the boolean expression (strcol(2) eq word(myKeys,i)) together with smooth frequency: if the value of the second column is identical to your keyword, the expression evaluates to 1, and to 0 otherwise.
You don't need bins as when creating other histograms, because you want a bin of 1 minute and your time resolution is already 1 minute.
Check the following example as a starting point for further optimization.
Script:
### count occurrences of keywords
reset session
$Data <<EOD
# TIME,STATUS
09:00,OK
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:00,TEMP
09:01,OK
09:01,OK
09:01,OK
09:01,PERM
09:01,TEMP
09:01,TEMP
09:02,OK
09:02,TEMP
09:02,TEMP
09:03,OK
09:03,PERM
09:03,PERM
09:03,TEMP
09:03,TEMP
09:04,OK
09:04,PERM
09:04,PERM
09:04,TEMP
09:04,TEMP
09:04,TEMP
09:05,OK
09:05,OK
09:05,OK
09:05,PERM
09:05,TEMP
09:05,TEMP
09:05,TEMP
09:05,TEMP
09:06,OK
09:06,OK
09:06,PERM
09:06,PERM
09:06,PERM
09:06,PERM
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:06,TEMP
09:07,OK
09:07,OK
09:07,TEMP
09:07,TEMP
09:07,TEMP
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,OK
09:08,TEMP
09:08,TEMP
09:08,TEMP
09:08,TEMP
09:09,OK
09:09,OK
09:09,OK
09:09,PERM
09:10,OK
09:10,PERM
09:10,PERM
09:10,TEMP
09:11,OK
09:11,OK
09:11,OK
09:11,OK
09:11,PERM
09:11,PERM
09:11,PERM
09:11,PERM
09:11,TEMP
09:11,TEMP
09:11,TEMP
09:12,PERM
09:12,TEMP
09:12,TEMP
09:13,OK
09:13,OK
09:13,OK
09:13,OK
09:13,OK
09:13,PERM
09:13,PERM
09:13,PERM
09:13,TEMP
09:13,TEMP
09:14,OK
09:14,OK
09:14,OK
09:14,PERM
09:14,PERM
09:14,PERM
09:14,PERM
09:14,TEMP
09:16,OK
09:16,OK
09:16,OK
09:16,PERM
09:16,PERM
09:16,TEMP
09:16,TEMP
09:17,OK
09:17,OK
09:17,PERM
09:17,PERM
09:18,OK
09:18,OK
09:18,OK
09:18,OK
09:18,OK
09:18,PERM
09:18,PERM
09:18,TEMP
09:18,TEMP
09:18,TEMP
09:19,OK
09:19,OK
09:19,OK
09:19,OK
09:19,OK
09:19,PERM
09:20,OK
09:20,OK
09:20,PERM
09:20,PERM
09:20,TEMP
09:20,TEMP
09:21,OK
09:21,OK
09:21,OK
09:21,PERM
09:21,TEMP
09:22,OK
09:22,OK
09:22,PERM
09:22,PERM
09:22,TEMP
09:22,TEMP
09:23,OK
09:23,PERM
09:23,PERM
09:23,PERM
09:23,TEMP
09:23,TEMP
09:23,TEMP
09:24,PERM
09:24,PERM
09:24,PERM
09:25,OK
09:25,OK
09:25,PERM
09:25,TEMP
09:26,OK
09:26,OK
09:26,OK
09:26,OK
09:26,OK
09:26,PERM
09:26,TEMP
09:27,OK
09:27,OK
09:27,OK
09:27,PERM
09:27,PERM
09:27,TEMP
09:27,TEMP
09:27,TEMP
09:28,PERM
09:28,PERM
09:28,PERM
09:28,PERM
09:29,OK
EOD
set datafile separator comma
myKeys = "OK TEMP PERM"
myKey(i) = word(myKeys,i)
myTimeFmt = "%H:%M"
set format x myTimeFmt timedate
plot for [i=1:words(myKeys)] $Data u (timecolumn(1,myTimeFmt)):(strcol(2) eq word(myKeys,i)) smooth freq w lp pt 7 ti word(myKeys,i)
### end of script
Result:

passing bash variable for awk column specifier

There are loads of threads about passing a shell variable to awk, and I've figured that out easily enough, but the variable I want to pass is the column specifier variable ($1, $2, etc.).
Given that the shell uses these variables as default command line argument variables as well, this is getting confusing.
In this script I'm just sorting and joining 2 files together, but in order to begin generalising the script a little, I want to be able to specify on the command line, the field in the key file that awk should be taking as its sort-specifier.
What am I doing wrong here? (I'm only just getting to grips with awk, and the one-liner was adapted slightly from here.)
keyfile="$1"
filetosort="$2"
field="$3"
awk -v a="$field"
paste "$keyfile" <(awk 'NR==FNR{o[FNR]=a; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' $keyfile $filetosort)
EDIT: Added example input/output.
Keyfile: (10 random lines from file)
PVClumt18 PAK_2199 PAK_01997
PVClopt2 PAK_2091 PAK_01895
PVCcif7 PAK_1975 PAK_01793
PVClopT12 PAU_02101 PAU_02063
PVCpnf20 PAK_3524 PAK_03184
PVClopt3 PAK_2090 PAK_01894
PVClopT11 PAU_02102 PAU_02064
PVCunit2_11 plu1698 PLT_01726
PVClumT9 afp10 PAU_02198
PVCunit2_17 plu1692 PLT_01720
File to sort:
PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}
PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
Thus I need to sort and match the rows based on column 3 in the keyfile, and column 1 in the file to sort.
And the resulting file (the duplication of columns 3 & 4 was something I was planning to sort out afterwards):
PVClumt18 PAK_2199 PAK_01997 PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PVClopt2 PAK_2091 PAK_01895 PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCcif7 PAK_1975 PAK_01793 PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
PVCpnf20 PAK_3524 PAK_03184 PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PVClopt3 PAK_2090 PAK_01894 PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClopT11 PAU_02102 PAU_02064 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PVCunit2_11 plu1698 PLT_01726 PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PVClumT9 afp10 PAU_02198 PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PVCunit2_17 plu1692 PLT_01720 PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}
When you pass awk -v a="$field", the specification of the awk variable a is only good for that single awk command. You can't expect a to be available in a completely different invocation of awk.
Thus, you need to put it in-place directly:
$ bashvar="2"
$ echo 'foo bar baz' | awk -v awkvar="$bashvar" '{print $awkvar}'
bar
Or in your case:
field=1
awk -v a="$field" '
    NR==FNR {
        o[FNR]=$a;
        next;
    }
    { t[$1] = $0 }
    END {
        for(x=1; x<=FNR; x++) {
            y=o[x]
            printf("%s\t%s\n", y, t[y])
        }
    }' "$keyfile" "$filetosort"
Points of note:
Our printf here is emitting both the key and the value, so there's no need to use paste to put the keyfile values back in.
$a is used to treat the awk variable a (assigned from shell variable field) as a variable name itself, and to perform an indirect reference -- thus, looking up the relevant column number.
Always, always quote your shell variables on expansion. Otherwise, you have no way of knowing how many arguments to awk will be generated by the expansion of $keyfile -- it could be 0 (if there are no characters in the string not found in IFS); it could be 1; but it could also be a completely unbounded number (input file.txt would become two arguments, input and file.txt; * input * .txt would have each * replaced with a list of files).
I'm going to add this as an answer because it does resolve the question as I posed it, notwithstanding Charles' excellent advice about the (myriad) areas I was going wrong.
Altering the above code with Charles' point about the separate awk commands, I can now invoke the following (sorry yes, it's still using paste)...
#!/bin/bash
keyfile="$1"
filetosort="$2"
indexfield="$3"
paste "$keyfile" <(awk -v field="$indexfield" 'NR==FNR{o[FNR]=$field; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' "$keyfile" "$filetosort")
I was missing the $ before the variable I was referencing in the awk command, which is part of why my original code wasn't working, along with not including the awk variable declaration in a single awk call.
Thus, bash sortandmatch.sh keyfile filetosort 3 produces the output I want:
PVCunit2_5 plu1704 PLT_01732 PLT_01732 4etv 4etv_A 39.0 12 0.00032 27.6 >4etv_A Ryanodine receptor 2; phosphorylation, cardiac, metal transport; 1.65A {Mus musculus}
PVCunit2_4 plu1705 PLT_01733 PLT_01733 3j9q 3j9q_A 99.9 7.2e-30 1.9e-34 219.0 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
XVC_pnf15 XBW1_RS06910 XBW1_RS06910 XBW1_RS06910 1fi0 1fi0_A 69.2 1.7 4.4e-05 22.8 >1fi0_A VPR protein, R ORF protein; helix, viral protein; NMR {Synthetic} SCOP: j.11.1.1
PVCcif7 PAU_01999 PAU_01967 PAU_01967 5a3a 5a3a_A 47.5 7.3 0.00019 30.9 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClumT15 PAU_02233 PAU_02192 PAU_02192 1tdp 1tdp_A 22.1 37.0 0.00096 27.2 >1tdp_A Carnobacteriocin B2 immunity protein; four-helix bundle, antimicrobial protein; NMR {Carnobacterium maltaromaticum} SCOP: a.29.8.1
XVC_pnf3 XBW1_RS06850 XBW1_RS06850 XBW1_RS06850 3eaa 3eaa_A 87.7 0.13 3.4e-06 35.7 >3eaa_A EVPC; T6SS, unknown function; 2.79A {Edwardsiella tarda}
PVCunit1_4 afp4 PAU_02778 PAU_02778 3j9q 3j9q_A 99.9 3.6e-29 9.5e-34 214.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCunit2_3 plu1706 PLT_01734 PLT_01734 3j9q 3j9q_A 100.0 1.6e-34 4.3e-39 253.7 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClumt17 PAK_2200 PAK_01998 PAK_01998 3k8p 3k8p_C 34.7 16.0 0.00041 34.1 >3k8p_C DSL1, KLLA0C02695P; intracellular trafficking, DSL1 complex, multisubunit tethering complex, snare proteins; 2.60A {Kluyveromyces lactis}
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*

SSRS Combining values within columns

I have a FetchXML report set up to pull data from our CRM instance. Inside Visual Studio 2010 it is laid out as follows when it pulls the information:
job number new lot rough start date city builder
30774-00c custom 8/4/2014 city1 builder1
30876-19 465 7/11/2014 city5 builder2
30876-19 466 7/11/2014 city5 builder2
30876-19 467 7/11/2014 city5 builder2
30876-19 489 7/12/2014 city5 builder2
30856-01 2 8/26/201 city3 builder5
I want to be able to combine the "new lot" values for rows where the job number and "roughstartdate" are the same, so it would look like:
job number new lot rough start date city builder
30774-00c custom 8/4/2014 city1 builder1
30876-19 465,466,467 7/11/2014 city5 builder2
30876-19 489 7/12/2014 city5 builder2
But I just can't seem to figure out the grouping correctly; any guidance would be great.
I thought I could do =Join(LookupSet(Fields!jobnumber.Value,Fields!jobnumber.Value,Fields!roughstartdate.Value,"DataSet1"),",")
but that seems to show only one item when they match rather than combining the lots onto a single line.
First group by "rough start date" and then by "job number", then use the expression below in "new lot":
=Join(LookupSet(Fields!roughstartdate.Value,Fields!roughstartdate.Value,Fields!newlot.Value,"DataSet2"),",")
DataSet2 should be the same as DataSet1.
I was just going to comment above but I can't, so: I think the issue where you get all lots coming back is that the group is just on the date.
You need to group on job number AND date, and then use the Join(LookupSet(...)) expression.
That way you will have separate groups for job number 30876-19 on 7/11/2014 and 30876-19 on 7/12/2014.

How do I conditionally merge hashes and add values?

I am stuck merging some hashes to get the results that I need.
The hashes contain a breakdown of the total price of an order, e.g. item price, taxes, & shipping, for all orders in a subscription. I'm trying to do this dynamically as not all orders charge tax, or even the same tax or shipping.
Here's what I would call the "worst case scenario" that I'm dealing with:
#First order - has charges for Canadian GST and HST
{:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>8.16, :base_amount=>8.16, :orig_base_amount=>149, :tax_delivery=>0, :tax_active=>0}
{:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>1, :local_amount=>0.41, :base_amount=>0.41, :state=>"ON", :tax_type=>"GST", :tax_rate_category=>"STD", :orig_base_amount=>7.45, :tax_rate=>5, :tax_delivery=>0, :tax_active=>1, :tx_incl=>1}
{:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>3, :order_item_break_type=>1, :local_amount=>0.65, :base_amount=>0.65, :state=>"ON", :tax_type=>"HST", :tax_rate_category=>"STD", :orig_base_amount=>11.92, :tax_rate=>8, :tax_delivery=>0, :tax_active=>1, :tx_incl=>1}
#Second order - has only one charge for tax
{:orderhdr_id=>1815296, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>76.52, :base_amount=>76.52, :orig_base_amount=>99.95, :tax_delivery=>0, :tax_active=>0}
{:orderhdr_id=>1815296, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>1, :local_amount=>4.59, :base_amount=>4.59, :orig_base_amount=>6, :tax_delivery=>0, :tax_active=>1}
#Third order - has charge for shipping
{:orderhdr_id=>6112412, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>21.34, :base_amount=>21.34, :orig_base_amount=>99.95, :tax_delivery=>0, :tax_active=>0}
{:orderhdr_id=>6112412, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>2, :local_amount=>4.7, :base_amount=>4.7, :orig_base_amount=>22, :tax_delivery=>0, :tax_active=>0}
The :order_item_break_type determines what type of charge it is, item, tax, shipping.
If :order_item_break_type is 0 or 2, then I subtract :base_amount from :orig_base_amount and add it to the overall total for that break type. When :order_item_break_type equals 1, I need to sum up the difference of :orig_base_amount and :base_amount as well, but keep the different taxes separate.
So here's what I should end up with for the above orders:
:break_type = 0 | total = 242.88
:break_type = 1, :state = ON, :tax_type = GST, :tax_rate_category = STD | total = 7.04
:break_type = 1, :state = ON, :tax_type = HST, :tax_rate_category = STD | total = 11.27
:break_type = 1 | total = 1.41
:break_type = 2 | total = 17.30
I have these hashes in an array called @amounts.
I have methods like merge!, inject, shift and more going through my head, but can't put it together.
How about this? I really don't know what you are doing there, though.
@amounts = [
  # First order - has charges for Canadian GST and HST
  {:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>8.16, :base_amount=>8.16, :orig_base_amount=>149, :tax_delivery=>0, :tax_active=>0},
  {:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>1, :local_amount=>0.41, :base_amount=>0.41, :state=>"ON", :tax_type=>"GST", :tax_rate_category=>"STD", :orig_base_amount=>7.45, :tax_rate=>5, :tax_delivery=>0, :tax_active=>1, :tx_incl=>1},
  {:orderhdr_id=>17654122, :order_item_seq=>1, :order_item_amt_break_seq=>3, :order_item_break_type=>1, :local_amount=>0.65, :base_amount=>0.65, :state=>"ON", :tax_type=>"HST", :tax_rate_category=>"STD", :orig_base_amount=>11.92, :tax_rate=>8, :tax_delivery=>0, :tax_active=>1, :tx_incl=>1},
  # Second order - has only one charge for tax
  {:orderhdr_id=>1815296, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>76.52, :base_amount=>76.52, :orig_base_amount=>99.95, :tax_delivery=>0, :tax_active=>0},
  {:orderhdr_id=>1815296, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>1, :local_amount=>4.59, :base_amount=>4.59, :orig_base_amount=>6, :tax_delivery=>0, :tax_active=>1},
  # Third order - has charge for shipping
  {:orderhdr_id=>6112412, :order_item_seq=>1, :order_item_amt_break_seq=>1, :order_item_break_type=>0, :local_amount=>21.34, :base_amount=>21.34, :orig_base_amount=>99.95, :tax_delivery=>0, :tax_active=>0},
  {:orderhdr_id=>6112412, :order_item_seq=>1, :order_item_amt_break_seq=>2, :order_item_break_type=>2, :local_amount=>4.7, :base_amount=>4.7, :orig_base_amount=>22, :tax_delivery=>0, :tax_active=>0},
]

@totals = Hash.new(0)
@amounts.group_by { |row| row[:order_item_break_type] }.each do |break_type, rows|
  rows.each do |row|
    key = [break_type, row[:tax_type]]
    @totals[key] += row[:orig_base_amount] - row[:base_amount]
  end
end
@totals
# => {[0, nil]=>242.88,
#     [1, "GST"]=>7.04,
#     [1, "HST"]=>11.27,
#     [1, nil]=>1.4100000000000001,
#     [2, nil]=>17.3}
