I'm trying to use the map clause with Hive but I'm tripping over syntax and not finding many examples of my use case around. I used the map clause before when I had to process one of the columns of a table using an external script.
I had a python script called, say, run, that took one command line parameter and spit out three space separated values. So I just did:
FROM(MAP
tablename.columnName
USING
'run' AS
result1, result2, result3
FROM
tablename
) map_output
INSERT OVERWRITE TABLE results SELECT *;
Now I have a Python script that takes many more parameters. I tried a few things that didn't work and couldn't find any examples of this. I did the obvious thing:
FROM
(MAP
numAgents, alpha, beta, burnin, nsteps, thin
USING
'runAuthorityMCMC' AS numAgents, alpha, beta, energy, avgDegree, maxDegree, accept
FROM
parameters
) map_output
INSERT OVERWRITE TABLE results SELECT *;
But I got an error A user-supplied transfrom script has exited with error code 2 instead of 0. When I run runAuthorityMCMC, with 6 command line parameters sampled from that table, it works perfectly well.
It seems to me it's trying to run the script without passing the parameters at all. In one of the error messages I got exactly the output I expected if this was the case. What is the correct syntax to do what I'm trying to do?
EDIT:
Confirming - this was part of the error message:
usage: runAuthorityMCMC [-h]
numAgents normalizedBrainCapacity ecologicalPressure
burnInSteps monteCarloSteps thiningRatio
runAuthorityMCMC: error: too few arguments
Which is exactly the output I'd expect with too few arguments. The script should take six arguments.
OK, perhaps there is a difference of vocabulary here, but Hive doesn't send the values as "arguments" to the script. They are read in through standard input (which is different from passing something as an argument). Also, you can try sending the data to /bin/cat to see what's actually being sent by Hive. If my memory serves me right, the values are sent tab-separated, and the result emitted by the script is also expected to be tab-separated.
Try printing debugging output from your script (to stdout or stderr); you will see the result in your jobtracker logs. That will help you debug.
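A minimal sketch of a wrapper along these lines, assuming the original script is left unchanged and still expects its six values as command-line arguments. The wrapper reads each tab-separated row from standard input and forwards the fields as arguments. The wrapper itself, the helper names parse_row and format_row, and the assumption that runAuthorityMCMC prints whitespace-separated values are all hypothetical.

```python
import subprocess
import sys


def parse_row(line):
    """Split one tab-separated input row into a list of string fields."""
    return line.rstrip("\n").split("\t")


def format_row(fields):
    """Join output fields with tabs, the delimiter expected back by default."""
    return "\t".join(str(f) for f in fields)


def main(command="runAuthorityMCMC"):
    """Run the wrapped script once per input row, passing columns as arguments."""
    for line in sys.stdin:
        if not line.strip():
            continue  # skip blank lines
        args = parse_row(line)
        result = subprocess.run(
            [command] + args, capture_output=True, text=True, check=True
        )
        # Assumption: the wrapped script prints whitespace-separated values.
        print(format_row(result.stdout.split()))
```

To use it as a script, call main() at the bottom of the file and name the wrapper in the USING clause in place of runAuthorityMCMC.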
I'm doing a check in my syntax to see if all of the string fields have values. This looks something like this:
IF(STRING ~= "").
Now, instead of filtering or computing a variable, I would like to force an error if there are fields with no values. I'm running a lot of these scripts and I don't want to keep checking the datasets manually. Instead, I just want to receive an error and stop execution.
Is this possible in SPSS Syntax?
First, you can efficiently count the blanks in your string variables with COUNT:
COUNT blanks = A B C(" ").
where you list the string variables to be checked. So if the sum of blanks is > 0, you want the error to be raised. First aggregate to a new dataset and activate it:
DATASET DECLARE sum.
AGGREGATE /OUTFILE=sum /count=sum(blanks).
The hard part is getting the job to stop when the blank count is greater than 0. You can put the job in a syntax file and run it using INSERT with ERROR=STOP, or you can run it as a production job via Utilities or via the -production flag on the command line of an spj (production facility) job.
Here is how to generate an error on a condition.
DATASET ACTIVATE sum.
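begin program.
import spss, spssdata
# Read the single aggregated case holding the total blank count.
curs = spssdata.Spssdata()
count = curs.fetchone()
curs.CClose()
if count[0] > 0:
    # Submitting an invalid command raises an error, which stops the job.
    spss.Submit("blank values exist")
end program.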
DATASET CLOSE sum.
This code reads the aggregate file and if the blank count is positive issues an invalid command causing an error, which stops the job.
I have the following XML data:
<CompactData><my:DataSet><my:Series VAL="A" AMOUNT_TYPE="FI" IDENTIFIER="1"><my:Obs AMT="24.25" UNIT_MEASURE="KG"></my:Obs></my:Series><my:Series VAL="B" AMOUNT_TYPE="GI" IDENTIFIER="2"><my:Obs AMT="21.22" UNIT_MEASURE="KG"></my:Obs></my:Series></my:DataSet></CompactData>
I am trying to convert it to a CSV format using the following commands in PIG:
A = LOAD '/testing/mydata.xml' using org.apache.pig.piggybank.storage.XMLLoader('CompactData') as (x:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"></my:Obs></my:Series>')) AS (val:chararray,amount_type:chararray,identifier:chararray,amt:chararray,unit_measure:chararray);
Putting the regex <my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"><\/my:Obs><\/my:Series> into Regexr gives two perfect matches, but Pig just does not want to work with it. It always gives me an empty result whereas I expect the following:
A,FI,1,24.25,KG
B,GI,2,21.22,KG
Update 1: This seems most likely related to the issue mentioned here: Pig xmlloader error when loading tag with colon
Assuming your code does not give an error, I can think of 3 potential problems here:
Your regex is not called
Your regex (in pig) is not returning the expected result
The output of your regex is not shown
To deal with the situation, I would recommend the following steps:
Create a Pig program that successfully uses a regex to find 'b' in 'aba'
Create a Pig program that successfully finds both occurrences of 'a' in 'aba'
Create a Pig program that successfully finds the first 'a' in 'aba'
Keep 'growing' this solution gradually until you reach your actual solution
If you still get stuck, please share the last solution that worked, and the first one that didn't work. (including input!)
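The incremental approach above can be sketched outside Pig first. This uses Python's re module, which is a different regex engine from the one Pig uses, so it only confirms that the pattern and sample data are sound. The variable names are illustrative. The last check records whether the pattern matches the whole input string, since a function that demands a whole-string match would behave differently from a substring search. Whether that applies to REGEX_EXTRACT_ALL is an assumption to verify, not a confirmed diagnosis.

```python
import re

# Sample input from the question (a single line of XML).
XML = (
    '<CompactData><my:DataSet>'
    '<my:Series VAL="A" AMOUNT_TYPE="FI" IDENTIFIER="1">'
    '<my:Obs AMT="24.25" UNIT_MEASURE="KG"></my:Obs></my:Series>'
    '<my:Series VAL="B" AMOUNT_TYPE="GI" IDENTIFIER="2">'
    '<my:Obs AMT="21.22" UNIT_MEASURE="KG"></my:Obs></my:Series>'
    '</my:DataSet></CompactData>'
)

PATTERN = (
    '<my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)">'
    '<my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"></my:Obs></my:Series>'
)

# Steps 1 to 3: the trivial cases on 'aba'.
found_b = re.search("b", "aba").group()
all_a = re.findall("a", "aba")
first_a = re.search("a", "aba").start()

# Final step: the full pattern, searched anywhere in the input.
rows = re.findall(PATTERN, XML)
csv_lines = [",".join(row) for row in rows]

# Whole-string matching is a separate question from substring search.
whole_match = re.fullmatch(PATTERN, XML)

for line in csv_lines:
    print(line)
```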
Using TeamCity 9.1.4.
I'm trying to get some server hostnames into a Command Line script with Configuration Parameters. I want each option to contain multiple hostnames.
My configuration:
vanmain => rad-ecr1,rad-ecr2,rad-ecr3,rad-myecr,rad-balancer
tor => rad2-bal,rad2-ecr1,rad2-ecr2,rad2-myecr
fvcdc => rad-fvcdc,rad-balancer
bccfa => rad-bccfa
When I select fvcdc in a build, I receive the following error message:
One of entered values 'rad-fvcdc' is not one of valid select item values: rad-ecr1,rad-ecr2,rad-ecr3,rad-myecr,rad-balancer,rad2-bal,rad2-ecr1,rad2-ecr2,rad2-myecr,rad-fvcdc,rad-balancer,rad-bccfa
How do I get the values into my script?
Dunkan,
I successfully reproduced your issue and was able to find out the root cause of it.
On my virtual installation I created a build with a select-type parameter; let's name it HostValue. Next, I copied the values from your initial post into the Items field and tried to reproduce the problem, but the build executed successfully. Then I reconfigured the parameter and ticked the Allow multiple checkbox, and voilà, the same error message as you got!
If you read the small text below the Value separator field, you will see that the default value is a comma (,), and since your values contain this symbol, you get an error.
So, to solve this problem, I can suggest these options:
If you don't need multiple choices, you can just turn off this feature and everything should work.
Replace the default Value separator with a custom one, for example <SEP>. Then whenever you select multiple values for this parameter, you will get something like:
"rad-ecr1,rad-ecr2,rad-ecr3,rad-myecr,rad-balancer"<SEP>"rad2-bal,rad2-ecr1,rad2-ecr2,rad2-myecr"<SEP>"rad-fvcdc,rad-balancer"
Replace the commas in your values with some other separator, for example | or :. In this case it would look like:
"rad-ecr1:rad-ecr2:rad-ecr3:rad-myecr:rad-balancer","rad-fvcdc:rad-balancer"
After that you can use the value of this parameter as usual, %HostValue%, and parse it depending on which variant you chose.
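As a rough sketch of that parsing step, assuming the value of %HostValue% reaches the script as a single string (for example as a command-line argument): the function name parse_host_groups is hypothetical. Whether TeamCity actually wraps each selected item in double quotes is not confirmed here, so the code simply strips them if present.

```python
def parse_host_groups(value, separator="<SEP>", host_delimiter=","):
    """Split a multi-select parameter value into groups of hostnames.

    separator      -- the Value separator configured on the parameter
    host_delimiter -- the character separating hostnames inside one item
    """
    groups = []
    for item in value.split(separator):
        # Drop surrounding whitespace and optional double quotes.
        item = item.strip().strip('"')
        if item:
            groups.append([h for h in item.split(host_delimiter) if h])
    return groups


# Custom separator, commas kept inside the values:
print(parse_host_groups('"rad-fvcdc,rad-balancer"<SEP>"rad-bccfa"'))

# Default comma separator, colons inside the values:
print(parse_host_groups("rad-fvcdc:rad-balancer,rad-bccfa",
                        separator=",", host_delimiter=":"))
```

Both calls yield the same two groups, so the rest of the script does not need to know which variant was configured.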
Maybe the error message from the server could be a little clearer. I hope this helps.
Also, I would like to recommend my plugin, teamcity-web-parameters. It allows you to create dynamic select values from an external web service.
Take a look at this thread: https://teamcity-support.jetbrains.com/hc/en-us/community/posts/206843785-How-to-specify-empty-value-for-Typed-Parameter -- looks very similar to your question.
I'm building a Sequence job that contains a UserVariables activity (ParamLoading) and a Job activity (ExtractJob), in that order. ParamLoading creates 4 user variables and invokes a routine to fill each one with the corresponding value, then invokes ExtractJob, passing it the parameters previously loaded.
ParamLoading invokes a server routine (GetParams) which simply executes a shell script (ShellQuery) and captures the result; that shell script executes an SQL query against an Oracle database and prints the result on screen.
As far as tests go, ShellQuery works as expected and GetParams returns the expected value. But when GetParams is invoked from the sequence job (no matter if it's in ParamLoading or ExtractJob) the job fails with the following error:
Test2..JobControl (@JOB033_TBK_026_EXT_PTLF): Controller problem: Error calling DSSetParam(RUTA_ORIGEN), code=-4
[ParamValue/Limitvalue is not appropriate]
I've checked data types, parameter names, all, without success or even a message saying what might be happening.
Code of ShellQuery:
value=$(sqlplus -s $1/$2@$3/$4 <<!
set heading off
set feedback off
set pages 0
select param_value from cfg_params where filter='$5' and param_name='$6';
!)
echo $value
Code of GetParams:
Call DSExecute("UNIX", Ruta_Programas:"getparams.sh ":Username:" ":Password:" ":Server:" ":ServiceId:" ":Filtro:" ":Parametro, Output, SystemReturnCode)
Ans = Output
Return(Ans)
What are you returning as values from GetParams?
Calling a function from a sequence expects an integer value back and any non-zero digit returned is evaluated as an error.
As a test, try changing the return value from the routines to values 0-4.
Solved. For those struggling with a similar problem: GetParams was returning the value captured from ShellQuery with a special character appended, the "field mark" delimiter. Since that character has code 254, any job receiving the parameter would complain of an invalid value, which was the error.
Changing the routine to the following solved it:
Call DSExecute("UNIX", Ruta_Programas:"getparams.sh ":Username:" ":Password:" ":Server:" ":ServiceId:" ":Filtro:" ":Parametro, Output, SystemReturnCode)
Ans = EReplace(Output, @FM, "")
Return(Ans)
Thanks to Matt Calderon for providing a clue for solving.
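For readers debugging a similar case outside the routine, here is a small Python analogue of the fix (not DataStage BASIC). It assumes the captured output is available as a string with the trailing character 254. The names suspicious_codes and clean_param are illustrative only.

```python
FIELD_MARK = chr(254)  # the delimiter character described above


def suspicious_codes(captured):
    """Return the codes of characters outside the printable ASCII range."""
    return [ord(c) for c in captured if ord(c) < 32 or ord(c) > 126]


def clean_param(captured):
    """Remove every field mark, mirroring the EReplace call in the routine."""
    return captured.replace(FIELD_MARK, "")


raw = "/data/in" + FIELD_MARK
print(suspicious_codes(raw))  # shows which hidden characters are present
print(clean_param(raw))
```

Listing the character codes first makes it easier to see an invisible trailing delimiter that a plain print of the value would hide.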
I want to see the number of affected rows of my target table. For that, I can write a shell script to which I pass a parameter as $PM@numAffectedRows. However, if my target table name is parameterized and I want to pass that to the same shell script, how can I do that?
E.g.
$ParamTgtTable=myTable
When I pass $PM'$ParamTgtTable'@numAffectedRows in the shell script, it echoes myTable@numAffectedRows. If I pass the same without the quotes, as $PM$ParamTgtTable@numAffectedRows, I get $ParamTgtTable@numAffectedRows as my output.
Is there any workaround for this? Appreciate your help on this.
Pass the two parameters separately, like:
$ParamTgtTable=myTable
$PM@numAffectedRows=your_count
Now create a third parameter as X=$ParamTgtTable$PM@numAffectedRows
If that doesn't work, try using single quotes (I don't have access to UNIX to test right now).