passing bash variable for awk column specifier - bash

There are loads of threads about passing a shell variable to awk, and I've figured that out easily enough, but the variable I want to pass is the column specifier variable ($1,$2 etc)
Given that the shell uses these variables as default command line argument variables as well, this is getting confusing.
In this script I'm just sorting and joining 2 files together, but in order to begin generalising the script a little, I want to be able to specify on the command line the field in the key file that awk should take as its sort-specifier.
What am I doing wrong here? (I'm only just getting to grips with awk, and the oneliner was adapted slightly from here.)
keyfile="$1"
filetosort="$2"
field="$3"
awk -v a="$field"
paste "$keyfile" <(awk 'NR==FNR{o[FNR]=a; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' $keyfile $filetosort)
EDIT Added example in/output
Keyfile: (10 random lines from file)
PVClumt18 PAK_2199 PAK_01997
PVClopt2 PAK_2091 PAK_01895
PVCcif7 PAK_1975 PAK_01793
PVClopT12 PAU_02101 PAU_02063
PVCpnf20 PAK_3524 PAK_03184
PVClopt3 PAK_2090 PAK_01894
PVClopT11 PAU_02102 PAU_02064
PVCunit2_11 plu1698 PLT_01726
PVClumT9 afp10 PAU_02198
PVCunit2_17 plu1692 PLT_01720
File to sort:
PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}
PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
Thus I need to sort and match the rows based on column 3 in the keyfile, and column 1 in the file to sort.
And the resulting file: (The duplication of columns 3 & 4 was something I was planning to sort out after)
PVClumt18 PAK_2199 PAK_01997 PAK_01997 5ftj 5ftj_A 99.9 1.6e-26 4.2e-31 229.2 >5ftj_A Transitional endoplasmic reticulum ATPase; hydrolase, single-particle, AAA ATPase; HET: ADP OJA; 2.30A {Homo sapiens} PDB: 3cf1_A* 3cf3_A* 3cf2_A* 5ftk_A* 5ftl_A* 5ftm_A* 5ftn_A* 1r7r_A* 5c19_A 5c1b_A* 5c18_A* 3cf0_A*
PVClopt2 PAK_2091 PAK_01895 PAK_01895 3j9q 3j9q_A 100.0 2.5e-35 6.4e-40 248.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCcif7 PAK_1975 PAK_01793 PAK_01793 5a3a 5a3a_A 50.8 6 0.00016 31.4 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*
PVCpnf20 PAK_3524 PAK_03184 PAK_03184 1xju 1xju_A 99.4 4.1e-17 1.1e-21 98.8 >1xju_A Lysozyme; secreted inactive conformation, hydrolase; 1.07A {Enterobacteria phage P1} SCOP: d.2.1.3
PVClopt3 PAK_2090 PAK_01894 PAK_01894 3j9q 3j9q_A 99.9 1.8e-29 4.6e-34 215.9 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClopT11 PAU_02102 PAU_02064 PAU_02064 1pqx 1pqx_A 37.4 13 0.00035 31.4 >1pqx_A Conserved hypothetical protein; ZR18,structure, autostructure,spins,autoassign, northeast structural genomics consortium; NMR {Staphylococcus aureus subsp} SCOP: d.267.1.1 PDB: 2ffm_A 2m6q_A 2m8w_A
PVCunit2_11 plu1698 PLT_01726 PLT_01726 3h2t 3h2t_A 96.8 8e-06 2.1e-10 82.6 >3h2t_A Baseplate structural protein GP6; viral protein, virion; 3.20A {Enterobacteria phage T4} PDB: 3h3w_A 3h3y_A
PVClumT9 afp10 PAU_02198 PAU_02198 4jiv 4jiv_D 69.6 1.6 4.2e-05 27.5 >4jiv_D VCA0105, putative uncharacterized protein; PAAR-repeat motif, membrane piercing, type VI secretion SYST vibrio cholerae VGRG2; HET: PLM STE ELA; 1.90A {Vibrio cholerae o1 biovar eltor}
PVCunit2_17 plu1692 PLT_01720 PLT_01720 3ggm 3ggm_A 54.2 4.9 0.00013 26.2 >3ggm_A Uncharacterized protein BT9727_2919; bacillus cereus group., structural genomics, PSI-2, protein structure initiative; 2.00A {Bacillus thuringiensis serovarkonkukian}

When you pass awk -v a="$field", the specification of the awk variable a is only good for that single awk command. You can't expect a to be available in a completely different invocation of awk.
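A minimal sketch of that scoping (the names with_a and without_a are only for illustration). In the second call a was never set, so awk treats it as 0 and $a becomes $0, the whole line:

```shell
#!/bin/bash
# First invocation: a is defined with -v, so $a selects field 2.
with_a=$(echo 'foo bar baz' | awk -v a=2 '{print $a}')

# Second, separate invocation: a was never set here, so it is empty,
# which awk treats as 0, and $a becomes $0 (the whole line).
without_a=$(echo 'foo bar baz' | awk '{print $a}')

echo "with -v:    $with_a"
echo "without -v: $without_a"
```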
Thus, you need to put it in-place directly:
$ bashvar="2"
$ echo 'foo bar baz' | awk -v awkvar="$bashvar" '{print $awkvar}'
bar
Or in your case:
field=1
awk -v a="$field" '
NR==FNR {
o[FNR]=$a;
next;
}
{ t[$1] = $0 }
END {
for(x=1; x<=FNR; x++) {
y=o[x]
printf("%s\t%s\n", y, t[y])
}
}' "$keyfile" "$filetosort"
Points of note:
Our printf here is emitting both the key and the value, so there's no need to use paste to put the keyfile values back in.
$a applies awk's field operator $ to the awk variable a (assigned from the shell variable field), performing an indirect reference -- thus, looking up the column whose number a holds.
Always, always quote your shell variables on expansion. Otherwise, you have no way of knowing how many arguments to awk will be generated by the expansion of $keyfile -- it could be 0 (if there are no characters in the string not found in IFS); it could be 1, but it could also be a completely unbounded number (input file.txt would become two arguments, input and file.txt; * input * .txt would have each * replaced with a list of files).
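To make the quoting point concrete, here is a small sketch that counts how many arguments a command receives with and without quotes. The helper count_args and the filename are made up for illustration, and the default IFS is assumed:

```shell
#!/bin/bash
# count_args is a hypothetical helper: it prints how many arguments it received.
count_args() { echo "$#"; }

keyfile='input file.txt'   # assumed filename containing a space

unquoted=$(count_args $keyfile)     # word-split on the space: two arguments
quoted=$(count_args "$keyfile")     # kept intact: one argument

echo "unquoted: $unquoted, quoted: $quoted"
```

With awk, the unquoted form would make it look for two input files, input and file.txt, neither of which is the file you meant.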

I'm going to add this as an answer because it does resolve the question as I posed it, notwithstanding Charles' excellent advice about the (myriad) areas I was going wrong.
Altering the above code with Charles' point about the separate awk commands, I can now invoke the following (sorry yes, it's still using paste)...
#!/bin/bash
keyfile="$1"
filetosort="$2"
indexfield="$3"
paste "$keyfile" <(awk -v field="$indexfield" 'NR==FNR{o[FNR]=$field; next} {t[$1]=$0} END{for(x=1; x<=FNR; x++){y=o[x]; print t[y]}}' "$keyfile" "$filetosort")
I was missing the $ before the variable I was calling in the awk command, which is part of why my original code wasn't working; the other part was that the awk variable declaration wasn't in the same awk call that used it.
Thus, bash sortandmatch.sh keyfile filetosort 3 produces the output I want:
PVCunit2_5 plu1704 PLT_01732 PLT_01732 4etv 4etv_A 39.0 12 0.00032 27.6 >4etv_A Ryanodine receptor 2; phosphorylation, cardiac, metal transport; 1.65A {Mus musculus}
PVCunit2_4 plu1705 PLT_01733 PLT_01733 3j9q 3j9q_A 99.9 7.2e-30 1.9e-34 219.0 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
XVC_pnf15 XBW1_RS06910 XBW1_RS06910 XBW1_RS06910 1fi0 1fi0_A 69.2 1.7 4.4e-05 22.8 >1fi0_A VPR protein, R ORF protein; helix, viral protein; NMR {Synthetic} SCOP: j.11.1.1
PVCcif7 PAU_01999 PAU_01967 PAU_01967 5a3a 5a3a_A 47.5 7.3 0.00019 30.9 >5a3a_A SIR2 family protein; transferase, P-ribosyltransferase, metalloprotein, NAD-depen lipoylation, regulatory enzyme, rossmann fold; 1.54A {Streptococcus pyogenes} PDB: 5a3b_A* 5a3c_A*
PVClumT15 PAU_02233 PAU_02192 PAU_02192 1tdp 1tdp_A 22.1 37.0 0.00096 27.2 >1tdp_A Carnobacteriocin B2 immunity protein; four-helix bundle, antimicrobial protein; NMR {Carnobacterium maltaromaticum} SCOP: a.29.8.1
XVC_pnf3 XBW1_RS06850 XBW1_RS06850 XBW1_RS06850 3eaa 3eaa_A 87.7 0.13 3.4e-06 35.7 >3eaa_A EVPC; T6SS, unknown function; 2.79A {Edwardsiella tarda}
PVCunit1_4 afp4 PAU_02778 PAU_02778 3j9q 3j9q_A 99.9 3.6e-29 9.5e-34 214.6 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVCunit2_3 plu1706 PLT_01734 PLT_01734 3j9q 3j9q_A 100.0 1.6e-34 4.3e-39 253.7 >3j9q_A Sheath; pyocin, bacteriocin, sheath, structural protein; 3.50A {Pseudomonas aeruginosa}
PVClumt17 PAK_2200 PAK_01998 PAK_01998 3k8p 3k8p_C 34.7 16.0 0.00041 34.1 >3k8p_C DSL1, KLLA0C02695P; intracellular trafficking, DSL1 complex, multisubunit tethering complex, snare proteins; 2.60A {Kluyveromyces lactis}
PVClopT12 PAU_02101 PAU_02063 PAU_02063 4yap 4yap_A 31.1 20 0.00052 29.1 >4yap_A Glutathione S-transferase homolog; GSH-lyase GSH-dependent; 1.11A {Sphingobium SP} PDB: 4g10_A 4yav_A*


Labeling points in gnuplot data series

I have been reading a post already written on this subject, but I cannot plot my data because an error appears.
%data.txt:
"Hf" 2233 13.31
"Ir" 2466 22.56
"B_4C" 2763 2.52
"Y_2O_3" 2425 5.03
"Nb" 2477 8.57
"NbN" 2573 8.47
"SrZrO_3" 2700 5.1
"SiC" 2830 3.16
"ZrO_2" 2715 5.68
"Mo" 2623 10.28
"VC" 2810 5.77
"TiB_2" 3230 4.52
"HfO_2" 2758 9.68
"UO_2" 2867 10.97
"TiN" 2930 5.22
"TiC" 3160 4.93
"ZrB_2" 3246 6.085
"ZrN" 2952 7.09
"TaB_2" 3140 11.15
"C" 3549 2.27
"ZrC" 3540 6.73
"ThO_2" 3390 10
"HfB_2" 3250 10.5
"HfN" 3305 13.8
"NbC" 3608 7.82
"Re" 3186 21.02
"W" 3422 19.25
"Ta" 3017 16.65
"WC" 2830 15.63
"TaC" 3880 14.6
"HfC" 3890 12.2
%code:
set terminal postscript enhanced color"Times-Roman" 20
set output "TemperatureVsDensity.eps"
set xlabel "Temperature [degrees]}"
set ylabel " Density [g/cc]"
plot "data.txt" using 2:3 , "" u 2:3:1 w labels rotate offset 1
Can someone help me with this?
Thanks in advance :D
It would be helpful if you can post the error message you got when trying to run your code. I copied your data, code, and ran gnuplot code on the terminal. gnuplot gives the following warning (not error):
"code", line 6: warning: enhanced text mode parser - ignoring spurious }
which tells you that you should remove the spurious } on your set xlabel statement (as pointed out in the comments already). However, that does not prevent the figure from being generated.

Using grep command to transfer data into a new file

This might be super simple, but I have a .txt file with seismic data in which I'm trying to use the grep command to print out specific data only from Nevada (data in the file is marked either CA or NV) and to put it into its own .txt file.
Sample data:
map 0.2 2016/09/26 18:36:51 39.330N 119.991W 4.7 9 km ( 6 mi) N of Incline Village, NV
map 1.5 2016/09/26 18:26:27 39.362N 122.781W 19.5 25 km (15 mi) NNE of Upper Lake, CA
map 1.5 2016/09/26 18:18:16 36.055N 117.857W 2.2 8 km ( 5 mi) E of Coso Junction, CA
map 0.2 2016/09/26 18:10:46 38.363N 118.324W 4.6 32 km (20 mi) SE of Hawthorne, NV
I'm typing: grep NV filename > newfilename
But nothing is showing up. What's wrong? (My homework is to specifically use the grep command.)
You want this:
cat *filename* | grep something > result.txt
Your command looks as though it should have worked, but to be safe you should make sure you're matching exactly what you want. The grep below only matches lines that end in NV:
grep " NV$" filename > newfilename
When you say you can't see anything are you viewing the file contents afterward?
I copy/pasted your sample data into a file called sample-data and tried the grep pattern I would have used (' NV$'), but then I found only one line came through, the last one, because there was [invisible] whitespace after the NV in the first line. So to guard against that, I put [ \t]* between the NV and the $ (end of line symbol) in the grep pattern, and I got the result I expected. See below:
$ grep ' NV$' sample-data > result.txt
$ cat result.txt
map 0.2 2016/09/26 18:10:46 38.363N 118.324W 4.6 32 km (20 mi) SE of Hawthorne, NV
$ cat sample-data
map 0.2 2016/09/26 18:36:51 39.330N 119.991W 4.7 9 km ( 6 mi) N of Incline Village, NV
map 1.5 2016/09/26 18:26:27 39.362N 122.781W 19.5 25 km (15 mi) NNE of Upper Lake, CA
map 1.5 2016/09/26 18:18:16 36.055N 117.857W 2.2 8 km ( 5 mi) E of Coso Junction, CA
map 0.2 2016/09/26 18:10:46 38.363N 118.324W 4.6 32 km (20 mi) SE of Hawthorne, NV
$ grep ' NV[ \t]*$' sample-data > result.txt
$ cat result.txt
map 0.2 2016/09/26 18:36:51 39.330N 119.991W 4.7 9 km ( 6 mi) N of Incline Village, NV
map 0.2 2016/09/26 18:10:46 38.363N 118.324W 4.6 32 km (20 mi) SE of Hawthorne, NV
$
In short, I think what you want, to be safe, is:
grep ' NV[ \t]*$' sample-data > result.txt
Or, even safer, if you don't trust there always to be a space between the comma and NV:
grep ',[ \t]*NV[ \t]*$' sample-data > result.txt
which, translated, means, "match lines that have a comma, zero or more spaces or tabs, NV, zero or more spaces or tabs, and nothing more before the end of the line."
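One caveat worth checking on your own system: in a POSIX bracket expression the backslash is literal, so grep reads [ \t] as "space, backslash or the letter t" and not as "space or tab". It handles trailing spaces, as in the run above, but not trailing tabs. A hedged sketch using the POSIX class [[:space:]] instead, with two made-up lines (one ending in a space, one in a tab):

```shell
#!/bin/bash
# Two made-up lines: the first ends in "NV" plus a space, the second in "NV" plus a tab.
sample=$(printf 'N of Incline Village, NV \nSE of Hawthorne, NV\t')

# [ \t] inside a bracket expression: space, backslash or the letter t (no tab).
bracket_t=$(printf '%s\n' "$sample" | grep -c ',[ \t]*NV[ \t]*$')

# [[:space:]] is the POSIX class covering spaces, tabs and carriage returns.
posix_class=$(printf '%s\n' "$sample" | grep -c ',[[:space:]]*NV[[:space:]]*$')

printf 'bracket form matched %s line(s); POSIX class matched %s line(s)\n' \
    "$bracket_t" "$posix_class"
```

[[:space:]] also covers a trailing carriage return, which is another common source of invisible characters in files that came from Windows.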
By the way, if this is homework, and not for your job or home project, technically you should probably admit to your teacher that you asked for help on StackOverflow. Your teacher will be more impressed with your honesty, and probably won't ding you if you can say, "I get it, I get it, look at these other examples I tested and they worked too!" A teacher's main goal is that you learn and understand, not that you get some score. My purpose in providing this answer is to help you understand a tiny bit more about grep, which millions of us use every single day in our lives to get our work done, so it really is worth learning. Probably I should have provided a "teaching" example that was not the exact answer, but this was such a small, trivial problem I just answered it during my coffee break. Just be honest with your teacher is all I ask.

Pig Join is returning no results

I have been stuck on this problem for over twelve hours now. I have a Pig script that is running on Amazon Web Services. Currently, I am just running my script in interactive mode. I am trying to get averages on a large data set of climate readings from weather stations; however, this data doesn't have country or state information so it has to be joined with another table that does.
State Table:
719990 99999 LILLOOET CN CA BC WKF +50683 -121933 +02780
719994 99999 SEDCO 710 CN CA CWQJ +46500 -048500 +00000
720000 99999 BOGUS AMERICAN US US -99999 -999999 -99999
720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720002 99999 HALLOCK(AWS) US US MN K03Y +48783 -096950 +02500
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800
720005 99999 GASTONIA US US NC K0A6 +35200 -081150 +02440
Climate Table: (I realize this doesn't contain anything to satisfy the join condition, but the full data set does.)
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
010010 99999 20090101 23.3 24 15.6 24 1033.2 24 1032.0 24 13.5 6 9.6 24 17.5 999.9 27.9* 16.7 0.00G 999.9 001000
010010 99999 20090102 27.3 24 20.5 24 1026.1 24 1024.9 24 13.7 5 14.6 24 23.3 999.9 28.9 25.3* 0.00G 999.9 001000
010010 99999 20090103 25.2 24 18.4 24 1028.3 24 1027.1 24 15.5 6 4.2 24 9.7 999.9 26.2* 23.9* 0.00G 999.9 001000
010010 99999 20090104 27.7 24 23.2 24 1019.3 24 1018.1 24 6.7 6 8.6 24 13.6 999.9 29.8 24.8 0.00G 999.9 011000
010010 99999 20090105 19.3 24 13.0 24 1015.5 24 1014.3 24 5.6 6 17.5 24 25.3 999.9 26.2* 10.2* 0.05G 999.9 001000
010010 99999 20090106 12.9 24 2.9 24 1019.6 24 1018.3 24 8.2 6 15.5 24 25.3 999.9 19.0* 8.8 0.02G 999.9 001000
010010 99999 20090107 26.2 23 20.7 23 998.6 23 997.4 23 6.6 6 12.1 22 21.4 999.9 31.5 19.2* 0.00G 999.9 011000
010010 99999 20090108 21.5 24 15.2 24 995.3 24 994.1 24 12.4 5 12.8 24 25.3 999.9 24.6* 19.2* 0.05G 999.9 011000
010010 99999 20090109 27.5 23 24.5 23 982.5 23 981.3 23 7.9 5 20.2 22 33.0 999.9 34.2 20.1* 0.00G 999.9 011000
010010 99999 20090110 22.5 23 16.7 23 977.2 23 976.1 23 11.9 6 15.5 23 35.0 999.9 28.9* 17.2 0.09G 999.9 000000
I load in the climate data using TextLoader, apply a regular expression to obtain the fields, and filter out the nulls from the result set. I then do the same with the state data, but I filter it for the country being the US.
The bags have the following schema:
CLIMATE_REMOVE_EMPTY: {station: int,wban: int,year: int,month: int,day: int,temp: double}
STATES_FILTER_US: {station: int,wban: int,name: chararray,wmo: chararray,fips: chararray,state: chararray}
I need to perform a join operation on (station,wban) so I can get a resulting bag with the station, wban, year, month, and temps. When I perform a dump on the resulting bag, it says that it was successful; however, the dump returns 0 results. This is the output.
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.9.2-amzn hadoop 2013-05-03 00:10:51 2013-05-03 00:12:42 HASH_JOIN,FILTER
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_201305030005_0001 2 1 36 15 25 33 33 33 CLIMATE,CLIMATE_REMOVE_NULL,RAW_CLIMATE,RAW_STATES,STATES,STATES_FILTER_US,STATE_CLIMATE_JOIN HASH_JOIN hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203,
Input(s):
Successfully read 30587 records from: "hiddenbucket"
Successfully read 21027 records from: "hiddenbucket"
Output(s):
Successfully stored 0 records in: "hdfs://10.204.30.125:9000/tmp/temp-204730737/tmp1776606203"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I have no idea why this contains 0 results. My data extraction seems correct, and the job is successful. It leads me to believe that the join condition is never satisfied. I know the input files have some data that should satisfy the join condition, but it returns absolutely nothing.
The only thing that looks suspicious is a warning that states:
Encountered Warning ACCESSING_NON_EXISTENT_FIELD 26001 time(s).
I'm not exactly sure where to go from here. Since the job isn't failing, I can't see any errors or anything in debug.
I'm not sure if these mean anything, but here are other things that stand out:
When I try to illustrate STATE_CLIMATE_JOIN, I get a nullPointerException - ERROR 2997: Encountered IOException. Exception : null
When I try to illustrate STATES, I get java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
Here is my full code:
--Piggy Bank Functions
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
--Load Climate Data
RAW_CLIMATE = LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
RAW_STATES= LOAD 'hiddenbucket' USING TextLoader as (line:chararray);
CLIMATE=
FOREACH
RAW_CLIMATE
GENERATE
FLATTEN ((tuple(int,int,int,int,int,double))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\d{4})(\\d{2})(\\d{2})\\s+(\\d{1,3}\\.\\d{1})')
)
AS (
station: int,
wban: int,
year: int,
month: int,
day: int,
temp: double
)
;
STATES=
FOREACH
RAW_STATES
GENERATE
FLATTEN ((tuple(int,int,chararray,chararray,chararray,chararray))
EXTRACT(line,'^(\\d{6})\\s+(\\d{5})\\s+(\\S+)\\s+(\\w{2})\\s+(\\w{2})\\s+(\\w{2})')
)
AS (
station: int,
wban: int,
name: chararray,
wmo: chararray,
fips: chararray,
state: chararray
)
;
CLIMATE_REMOVE_NULL = FILTER CLIMATE BY station IS NOT NULL;
STATES_FILTER_US = FILTER STATES BY (fips == 'US');
STATE_CLIMATE_JOIN = JOIN CLIMATE_REMOVE_NULL BY (station), STATES_FILTER_US BY (station);
Thanks in advance. I am at a loss here.
--EDIT--
I finally got it to work! My regular expression for parsing the STATE_DATA was invalid.
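The edit doesn't say what exactly was wrong with the regular expression, so the following is only a sketch of one way to check such a pattern outside Pig, not the actual fix. It rewrites the STATES pattern from the script with POSIX classes (an assumption made so that grep -E accepts it) and runs it against three rows copied from the state table sample above. The sample as posted may have had its whitespace collapsed, so the real file could behave differently.

```shell
#!/bin/bash
# The STATES pattern from the script, rewritten with POSIX classes for grep -E:
# \d -> [0-9], \s -> [[:space:]], \S -> [^[:space:]], \w -> [[:alnum:]_]
pattern='^[0-9]{6}[[:space:]]+[0-9]{5}[[:space:]]+[^[:space:]]+[[:space:]]+[[:alnum:]_]{2}[[:space:]]+[[:alnum:]_]{2}[[:space:]]+[[:alnum:]_]{2}'

# Three rows copied from the state table sample in the question.
rows='720001 99999 PEASON RIDGE/RANGE US US LA K02R +31400 -093283 +01410
720003 99999 DEER PARK(AWS) US US WA K07S +47967 -117433 +06720
720004 99999 MASON US US MI K09G +42567 -084417 +02800'

matched=$(printf '%s\n' "$rows" | grep -E "$pattern")   # rows the pattern accepts
count=$(printf '%s\n' "$rows" | grep -Ec "$pattern")    # how many rows matched

echo "$count of 3 rows matched:"
echo "$matched"
```

On these rows only the MASON line matches: a station name containing a space cannot be captured by a single run of non-space characters, so rows like that would presumably come out of EXTRACT as nulls.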

the output of stanford nlp classifier

We are learning to use the stanford-nlp classifier. As its wiki page says, it can be used to build a model for classifying numerical data such as the Iris data set:
http://www-nlp.stanford.edu/wiki/Software/Classifier#Iris_data_set
But we have difficulty interpreting part of the output: there are 4 columns for input attributes (1-Value, 2-Value, 3-Value, 4-Value) and one column for the output label (Iris-setosa, Iris-versicolor, Iris-virginica). But what is CLASS here? Is it the output column overall?
Built this classifier: Linear classifier with the following weights
Iris-setosa Iris-versicolor Iris-virginica
3-Value -2.27 0.03 2.26
CLASS 0.34 0.65 -1.01
4-Value -1.07 -0.91 1.99
2-Value 1.60 -0.13 -1.43
1-Value 0.69 0.42 -1.23
Total: -0.72 0.05 0.57
Prob: 0.15 0.32 0.54
CLASS is like the intercept term in a simple linear regression - it represents the relative frequency of different classes. It is a feature of every instance.

Merge and matching tables in Oracle

Does anyone know how to merge two tables with a common column name and data into a single table? The shared column is a date column. This is part of a project at work, no one here quite knows how it works. Any help would be appreciated.
table A
Sub Temp Weight Silicon Cast_Date
108 2675 2731 0.7002 18-jun-11 18:45
101 2691 3268 0.6194 18-jun-11 20:30
107 2701 6749 0.6976 18-jun-11 20:30
113 2713 2112 0.6616 18-jun-11 20:30
116 2733 3142 0.7382 19-jun-11 05:46
121 2745 2611 0.6949 19-jun-11 00:19
125 2726 1995 0.644 19-jun-11 00:19
table B
Si Temperature Sched_Cast_Date Treadwell
0.6622 2542 01-APR-11 02:57 114
0.6622 2542 01-APR-11 03:07 116
0.7516 2526 19-jun-11 05:46 116
0.7516 2526 01-APR-11 03:40 107
0.6741 2372 01-APR-11 04:03 107
0.6206 2369 01-APR-11 09:43 114
0.6741 2372 19-jun-11 00:19 125
the results would look like:
Subcar Temp Weight Silicon Cast_Date SI Temperature Sched_Cast_Date Treadwell
116 2733 3142 0.7382 19-jun-11 05:46 0.7516 2526 19-jun-11 05:46 116
125 2726 1995 0.644 19-jun-11 00:19 0.6741 2372 19-jun-11 00:19 125
I would like to run a query that returns data only where Sched_Cast_Date and Cast_Date are the same. A table with the same qualities would work just as well.
I hope that this makes more sense.
Are you asking how to join two tables on a common column? i.e.
select a.Sub, a.Temp, a.Weight, a.Silicon, a.Cast_Date, b.SI,
b.Temperature, b.Sched_Cast_Date, b.Treadwell
from a
join b on b.sched_cast_date = a.cast_date
