How to insert dummy map values in Pig - Hadoop

I am doing a conditional check for null and empty occurrences of a bag. The bag contains multiple maps. Whenever 'info' is null or empty I want to put a dummy map value into it, because in the next step I am doing a FLATTEN operation on 'info'.
I need this because FLATTEN on a null or empty bag removes the complete record from the data, which I don't want.
((info is null or IsEmpty(info)) ? {(['Unknown'#'unknown'])} : info) as info;
This is giving me the compilation error below:
2014-09-02 06:20:37,978 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " ": "" at line 24, column 70.
Was expecting one of:
"cat" ...
"clear" ...
"fs" ...
"sh" ...
"cd" ...
"cp" ...
"copyFromLocal" ...

It seems there is a syntax error in your map literal. There is an easier way to create a map using the TOMAP function, which you can use as below:
((info is null or IsEmpty(info)) ? {(TOMAP('Unknown','unknown'))} : info) as info;
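For context, a minimal sketch of how this fits into a full script (the relation name data, the field id, and the bag schema are hypothetical; only info comes from the question):
data = LOAD 'input' AS (id:chararray, info:bag{t:tuple(m:map[])});
-- substitute a single-tuple bag holding a dummy map when info is null or empty,
-- so the FLATTEN in the next step does not drop the record
safe = FOREACH data GENERATE id,
    ((info is null or IsEmpty(info)) ? {(TOMAP('Unknown','unknown'))} : info) as info;
flat = FOREACH safe GENERATE id, FLATTEN(info);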

Related

I can't extract the node text with an XPath

I have an XML file (test.xml) like this one:
<?xml version="1.0" encoding="ISO-8859-1"?>
<s2xResponse>
  <s2xData>
    <Name>This is the name</Name>
    <InfocomData>
      <DateOfUpdate day="07" month="02" year="2018">20180207</DateOfUpdate>
      <CompanyName>MY COMPANY</CompanyName>
      <TaxCode FlagCheck="0">XXXYYYWWWZZZ</TaxCode>
    </InfocomData>
    <AssessmentSummary>
      <Rating Code="2">Rating Description for Code 2</Rating>
    </AssessmentSummary>
    <AssessmentData>
      <SectorialDistribution>
        <CompaniesNumber>11650</CompaniesNumber>
        <ScoreDistribution />
        <CervedScoreDistribution>
          <DistributionData>
            <Rating Code="1">SICUREZZA</Rating>
            <Percentage>1.91</Percentage>
          </DistributionData>
          <DistributionData>
            <Rating Code="2">SOLVIBILITA' ELEVATA</Rating>
            <Percentage>35.56</Percentage>
          </DistributionData>
        </CervedScoreDistribution>
      </SectorialDistribution>
    </AssessmentData>
  </s2xData>
</s2xResponse>
I'm trying to get the "Name" node text ("This is the name") with a U-SQL script using the XmlExtractor. The following is the code I'm using:
USE TestXML; // It contains the registered assembly
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
@xml = EXTRACT xml_text string
FROM "textxpath/test.xml"
USING Extractors.Text(rowDelimiter: "^", quoting: false);
@xml_cleaned =
SELECT
xml_text.Replace("\r\n", "").Replace("\t", " ") AS xml_text
FROM @xml;
@values =
SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(xml_text, "s2xResponse/s2xData/Name")[1] AS value
FROM @xml_cleaned;
OUTPUT @values TO @"outputs/test_xpath.txt" USING Outputters.Text(quoting: false);
But I'm getting this runtime error:
Execution failed with error '1_SV1_Extract Error :
'{"diagnosticCode":195887116,"severity":"Error","component":"RUNTIME","source":"User","errorId":"E_RUNTIME_USER_EXPRESSIONEVALUATION","message":"Error
while evaluating expression
Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(xml_text.Replace(\"\r\n\",
\"\").Replace(\"\t\", \" \"),
\"s2xResponse/s2xData/Name\")[1]","description":"Inner exception from
user expression: Index was out of range. Must be non-negative and less
than the size of the collection.
I get the same error even if I use a zero index for the Evaluate result ([0]).
What's wrong with my query?
The problem here is that you are applying the subscript [1] to the result of XPath.Evaluate, which I believe returns the Name nodes. However, you are applying the [1] subscript in code, not in XPath, so it is zero-based, not 1-based as it is in XPath; hence the 'Index was out of range' error.
Here's one solution: simply apply the subscript operator in XPath (where it is still 1-based), and select the text() there:
.Evaluate("s2xResponse/s2xData/Name[1]/text()")
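Plugged back into the script from the question, the @values step might then look like this (a sketch; whether a code-side index is still needed depends on Evaluate's return type, so the zero-based [0] shown here is an assumption):
@values =
SELECT Microsoft.Analytics.Samples.Formats.Xml.XPath.Evaluate(xml_text, "s2xResponse/s2xData/Name[1]/text()")[0] AS value
FROM @xml_cleaned;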
Is there a particular reason you want to use the Evaluate method? I got this to work using the XmlDomExtractor, which would also allow you to extract multiple values from the XML, e.g.
REFERENCE ASSEMBLY [Microsoft.Analytics.Samples.Formats];
DECLARE @inputFile string = "/input/input100.xml";
@input =
EXTRACT Name string
FROM @inputFile
USING new Microsoft.Analytics.Samples.Formats.Xml.XmlDomExtractor(rowPath : "/s2xResponse",
columnPaths : new SQL.MAP<string, string>{
{ "s2xData/Name", "Name" },
}
);
@output =
SELECT *
FROM @input;
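The snippet stops at @output; to materialize the rows you would still add an OUTPUT statement, for example (the output path here is arbitrary):
OUTPUT @output
TO "/output/names.csv"
USING Outputters.Csv();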

Assigning a variable from another variable

I am trying to assign a variable from another variable. My code looks like this:
<#macro ctglink c rhs x y z m e b>
<#assign ctg>
<#if ctgroutes["${y}..${x}-${m}"]??>ctgroutes['${y}..${x}-${m}']
<#elseif ctgroutes["${x}..${y}-${m}"]??>ctgroutes['${x}..${y}-${m}']
<#else>{}</#if>
</#assign>
However, this ctg variable is evaluating to just the literal string ctgroutes['227..257-TPPMD04X02']; it's not actually evaluating the expression itself.
I have tried ?eval, ?interpret, and a bunch of other very hacky things to get this to work, with no luck. Even the {} is a string.
Basically, I need the assign function to work like the old PHP eval() function or something. I am trying to access values in a Map whose keys are derived from the state of the data, so I don't see any easy way to query my Map without evaluating keys.
Update:
I forgot to include the elseif in there.
Either way, I tried <#assign ctg = ctgroutes["${y}..${x}-${m}"]!ctgroutes["${x}..${y}-${m}"]> but I get the following error:
Caused by: freemarker.core.InvalidReferenceException: The following has evaluated to null or missing:
==> ctgroutes["${y}..${x}-${m}"]!ctgroutes["${x}..${y}-${m}"] [in template "RouteCompare-WptTable.ftlh" at line 5, column 24]
I would like a null result to just return an empty map; however, that doesn't seem possible:
Caused by: java.lang.RuntimeException: freemarker.core.InvalidReferenceException: The following has evaluated to null or missing:
==> ctgroutes["${y}..${x}-${m}"]!ctgroutes["${x}..${y}-${m}"] [in template "RouteCompare-WptTable.ftlh" at line 5, column 24]
So basically, my goal is I need to assign a variable that can take 1 of 3 values:
ctgroutes["${y}..${x}-${m}"] // Assuming it is not null
ctgroutes["${x}..${y}-${m}"] // Assuming it is not null
{} // An empty map
What is the best way to do that?
If I understand correctly what you want to achieve, you can write it like this:
<#assign ctg = ctgroutes["${y}..${x}-${m}"]!ctgroutes["${x}..${y}-${m}"]!{}>
Also note that <#assign target>...</#assign> is for capturing the output printed between the two tags into the target variable (instead of actually printing it), so target will always store a string or markup value. Also, things outside FreeMarker tags and ${} are just static text and won't be parsed. So the naive but working approach is just using #if/#elseif/#else and having a separate #assign ctg = ... inside each branch, but you can make this much shorter with the ! operator, as shown above.
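For completeness, the naive #if/#elseif/#else version just mentioned (a sketch using the names from the question) would be:
<#if ctgroutes["${y}..${x}-${m}"]??>
  <#assign ctg = ctgroutes["${y}..${x}-${m}"]>
<#elseif ctgroutes["${x}..${y}-${m}"]??>
  <#assign ctg = ctgroutes["${x}..${y}-${m}"]>
<#else>
  <#assign ctg = {}>
</#if>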

Pig: Relation and Schema name confusion

In Pig Latin, this works as expected:
filtered = FILTER records BY age > 27;
But this throws an exception when I DUMP filtered:
filtered = FILTER records BY records.age > 27;
This is the exception:
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (John,Wilk,27,M), 2nd :(Tri,Tim,27,F)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (John,Wilk,27,M), 2nd :(Tri,Tim,27,F)
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:119)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:345)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextInteger(POUserFunc.java:394)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:322)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.GreaterThanExpr.getNextBoolean(GreaterThanExpr.java:74)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNextTuple(POFilter.java:144)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:282)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:277)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
What is the difference between the two? Are they not the same?
No, the two statements are different.
The first statement is perfectly valid: Pig iterates through each row and applies the filter constraint (age > 27). This is the standard way of writing filter statements.
In the second case you used the dereference operator (.) to access the field. The dereference operator is mainly used to access values inside complex data types (tuples, bags, and maps); when you apply it to a relation, Pig treats the relation as a scalar and expects it to produce exactly one row. Unfortunately more than one row matches, which is why you got "Scalar has more than one row in the output".
If records held only a single row, records.age would be a valid scalar reference and the statement would work.
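For contrast, dereferencing a relation is legitimate when it is guaranteed to hold exactly one row, as with a GROUP ... ALL aggregate (a sketch reusing the records relation from the question):
-- GROUP ALL yields a single row, so max_age_rel can be read as a scalar
max_age_rel = FOREACH (GROUP records ALL) GENERATE MAX(records.age) AS max_age;
oldest = FILTER records BY age == max_age_rel.max_age;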

Pig: Unable to Load BAG

I have a record in this format:
{(Larry Page),23,M}
{(Suman Dey),22,M}
{(Palani Pratap),25,M}
I am trying to LOAD the record using this:
records = LOAD '~/Documents/PigBag.txt' AS (details:BAG{name:tuple(fullname:chararray),age:int,gender:chararray});
But I am getting this error:
2015-02-04 20:09:41,556 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 7, column 101> mismatched input ',' expecting RIGHT_CURLY
Please advise.
It's not a bag, since it's not made up of tuples. Try:
load ... as (name:tuple(fullname:chararray), age:int, gender:chararray)
For some reason Pig wraps each output line in curly braces, which makes it look like a bag, but it isn't one. If you saved this data using PigStorage, you can save it with the '-schema' parameter, which tells PigStorage to create a schema file (.pigschema or similar) that you can inspect to see what the saved schema is. The schema file is also picked up when loading with PigStorage, saving you the AS clause.
Yes, LiMuBei's point is absolutely right: your input is not in the right format. Pig expects a bag to hold a collection of tuples, but in your case each record is a mix of a tuple and plain fields. Pig will retain the tuple and discard the fields (age and gender) during load.
But this problem can easily be solved with a different (somewhat hacky) approach:
1. Load each input line as a chararray.
2. Remove the curly and round brackets from the line.
3. Use the STRSPLIT function to split the line into (fullname, age, sex) fields.
PigScript:
A = LOAD 'input' USING PigStorage AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(REPLACE(line,'[}{)(]+','')) AS (newline:chararray);
C = FOREACH B GENERATE FLATTEN(STRSPLIT(newline,',',3)) AS (fullname:chararray,age:int,sex:chararray);
DUMP C;
Output:
(Larry Page,23,M)
(Suman Dey,22,M)
(Palani Pratap,25,M)
Now you can access all the fields as fullname, age, and sex.

Pass an array argument to custom pig loader

I wrote a LoadFunc that allows me to select given keywords from a huge unstructured log file. How do I pass a Tuple into my function as an argument?
Something like
A = load '/input/*' using MyLoader('keyword1','keyword2');
or
A = load '/input/*' using MyLoader( ('keyword1','keyword2') );
Both cause errors:
grunt> a = LOAD '/input/*' USING MyLoader( ('keyword1','keyword2') );
2012-08-28 19:44:04,331 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: <line 3, column 36> mismatched input '(' expecting RIGHT_PAREN
Details at logfile: /home/hadoop/pig-0.10.0/pig_1346159261142.log
In practice, a Pig LoadFunc can only accept String parameters for its constructor. See http://mail-archives.apache.org/mod_mbox/pig-user/201302.mbox/%3CCAO8ATY27UOdcgSjdh19F=iHsnFEAwmzedWbsnZ66sNvcsjfgog@mail.gmail.com%3E.
For your purposes, I would pass a CSV as a String to your LoadFunc and then parse it within the LoadFunc's constructor.
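Based on that, a minimal sketch of such a LoadFunc (the class name MyLoader is taken from the question; the actual reading logic is stubbed out, as only the constructor pattern matters here):
import java.io.IOException;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.pig.LoadFunc;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
import org.apache.pig.data.Tuple;

public class MyLoader extends LoadFunc {
    private final String[] keywords;

    // Pig only passes String arguments to LoadFunc constructors, so the
    // keyword list arrives as one comma-separated value and is split here
    public MyLoader(String keywordCsv) {
        this.keywords = keywordCsv.split(",");
    }

    @Override
    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat();
    }

    @Override
    public void setLocation(String location, Job job) throws IOException {
        FileInputFormat.setInputPaths(job, location);
    }

    @Override
    public void prepareToRead(RecordReader reader, PigSplit split) throws IOException {
        // keep a reference to the reader for getNext()
    }

    @Override
    public Tuple getNext() throws IOException {
        // read lines via the reader and emit only those matching this.keywords
        return null;
    }
}
It would then be invoked with a single quoted string:
A = LOAD '/input/*' USING MyLoader('keyword1,keyword2');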
