Invalid format: "19690321" is too short - hadoop

I am trying to convert dates from yyyyMMdd format to yyyy/MM/dd format using Pig; for that I have written the code below.
Code:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate($8,'yyyy/MM/dd','UTC') AS dob;
When I try to dump the result, I get the error below.
ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 0:
Exception while executing [POUserFunc (Name:
POUserFunc(org.apache.pig.builtin.ToDate3ARGS)[datetime] - scope-209
Operator Key: scope-209) children: null at []]:
java.lang.IllegalArgumentException: Invalid format: "19690321" is too
short
Sample:
EXVORV##PDULD21F|ID|1|483|1020783||EXVORV##PDULD||19690321|F|
$8 seems valid to me, and I am not able to locate the reason the issue is occurring. Any help would be really appreciated.

You use:
ToDate($8,'yyyy/MM/dd','UTC')
but the input is
19690321
so the pattern you pass to ToDate must describe the input format, not the desired output. You should have:
ToDate($8,'yyyyMMdd','UTC')

The issue is most likely because of the load statement. Since you are not specifying a schema, the datatype will be bytearray by default. You will have to cast the field to chararray before passing it to ToDate:
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
MSH_DATA = FOREACH data GENERATE ToDate((chararray)$8,'yyyyMMdd','UTC') AS dob;
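Note that ToDate only parses the string into a datetime; to get the date back out in yyyy/MM/dd form, the built-in ToString can format it. A minimal sketch combining both fixes, reusing the question's own script (untested against the poster's data):
STOCK_A = LOAD '/user/root/xxxx/*' USING PigStorage('|');
data = FILTER STOCK_A BY ($1 matches '.*ID.*');
-- cast the bytearray, parse it as yyyyMMdd, then format it back as yyyy/MM/dd
MSH_DATA = FOREACH data GENERATE ToString(ToDate((chararray)$8,'yyyyMMdd','UTC'),'yyyy/MM/dd') AS dob;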

Related

fhir-net-api (STU3) - Hl7.Fhir.Model.PlanDefinition parsing error

Using HL7.FHIR.STU3.Core, I am getting an invalid cast exception when I try to parse a PlanDefinition FHIR file.
Do I need to set the schema for the PlanDefinition file?
string HL7FilePath = string.Format("{0}\\{1}", System.IO.Directory.GetCurrentDirectory(), "ANA3.xml");
string HL7FileData = File.ReadAllText(HL7FilePath);
var b = new FhirXmlParser().Parse<Bundle>(HL7FileData);
Error
InvalidCastException {"Unable to cast object of type 'Hl7.Fhir.Model.PlanDefinition' to type 'Hl7.Fhir.Model.Bundle'."}
You are trying to parse a PlanDefinition resource into a Bundle object, as the InvalidCastException tells you. If you change Parse<Bundle> to Parse<PlanDefinition>, your code should work fine.

1003 error (unable to find an operator for alias) in group function in Pig

I have written a .pig file whose content is:
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '/pigdata/salaryTravelReport.csv' using csvloader();
x = foreach xyz generate $0 as name:chararray, $1 as title:chararray, replace($2, ',','') as salary:bytearray, replace($3, ',', '') as travel:bytearray, $4 as orgtype:chararray, $5 as org:chararray, $6 as year:bytearray;
refined = foreach x generate name, title, (float)salary, (float)travel, orgtype, org, (int)year;
year2010 = filter refined by year == 2010;
byjobtitile = GROUP year2010 by title;
The purpose is to remove ',' from the dollar values in 2 columns and then group the data by job title. When I run this using the run command there is no error. Even dumping year2010 works fine. But dumping byjobtitle gives an error:
The output of the log file is:
Pig Stack Trace
---------------
ERROR 1003: Unable to find an operator for alias byjobtitle

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1003: Unable to find an operator for alias byjobtitle
    at org.apache.pig.PigServer$Graph.buildPlan(PigServer.java:1544)
    at org.apache.pig.PigServer.storeEx(PigServer.java:1029)
    at org.apache.pig.PigServer.store(PigServer.java:997)
    at org.apache.pig.PigServer.openIterator(PigServer.java:910)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:66)
    at org.apache.pig.Main.run(Main.java:565)
    at org.apache.pig.Main.main(Main.java:177)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
I am new to big data and don't have much knowledge, but it looks like there is a problem with a data type. Can anyone help me out?
The issue is due to the CSVLoader you are using, which has ',' as the default delimiter. Since your data also has ',' inside some of its fields, like salary and travel, the positional index gets shifted. So if your data is something like this
name title salary travel orgtype org year
A B 10,000 23,1357 ORG_TYPE ORG 2016
then using CSVLoader will make "A B 10" the first field, "000 23" the second field, and "1357 ORG_TYPE ORG 2016" the third field, based on ','.
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define csvloader org.apache.pig.piggybank.storage.CSVLoader();
xyz = load '<path to your file>' using csvloader();
a = foreach xyz generate $0;
2016-06-07 12:28:12,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(A B 10)
You can choose a delimiter that is not present in any field value. Try using CSVExcelStorage; its constructor lets you explicitly define the delimiter:
register /Users/rakesh/Documents/SVN/iReporter/iReporterJobFramework/avro/lib/1.7.5/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
It will work fine as long as the delimiter you choose does not appear in any field value.
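For completeness, here is a hedged sketch of the original script rewired to CSVExcelStorage with a '|' delimiter (the paths and schema come from the question; this assumes the source file is actually exported with '|' as its separator, and still strips the ',' from the dollar values with piggybank's REPLACE):
register /home/tuhin/Documents/PigWork/pigdata/piggybank.jar;
define replace org.apache.pig.piggybank.evaluation.string.REPLACE();
define CSVExcelStorage org.apache.pig.piggybank.storage.CSVExcelStorage('|','NO_MULTILINE','NOCHANGE');
xyz = load '/pigdata/salaryTravelReport.csv' using CSVExcelStorage();
-- fields now arrive whole, so the casts behave as intended
refined = foreach xyz generate $0 as name:chararray, $1 as title:chararray, (float)replace((chararray)$2,',','') as salary, (float)replace((chararray)$3,',','') as travel, $4 as orgtype:chararray, $5 as org:chararray, (int)$6 as year;
year2010 = filter refined by year == 2010;
byjobtitle = GROUP year2010 by title;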

How to process a multi-delimiter file in Pig 0.8

I have an input text file (named multidelimiter) with the following records:
1,Mical,2000;10
2,Smith,3000;20
I have written Pig code as follows:
A =LOAD '/user/input/multidelimiter' AS line;
B = FOREACH A GENERATE FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)[,](.*)[,](.*)[;]')) AS (f1,f2,f3,f4);
But this code does not work, giving the following error:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Lexical error at line 1, column 78. Encountered: <EOF> after : "\'(.*)[,](.*)[,](.*)[;"
I referred to the following links but was not able to resolve my error:
how to load files with different delimiter each time in piglatin
Please help me get past this error. Thanks.
Solution for your input example:
LOAD as comma-separated, then STRSPLIT by ';' and FLATTEN.
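A minimal sketch of that approach, with field names and types assumed from the sample records; the \u003B Unicode escape stands for ';' so that the string literal does not trip the Pig 0.8 lexer bug shown in the question:
A = LOAD '/user/input/multidelimiter' USING PigStorage(',') AS (empid:int, ename:chararray, rest:chararray);
-- STRSPLIT returns a tuple; FLATTEN unpacks it into separate fields
B = FOREACH A GENERATE empid, ename, FLATTEN(STRSPLIT(rest, '\\u003B')) AS (sal:chararray, deptno:chararray);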
I finally got the solution. Here it is; the \u003B Unicode escape stands for ';', which sidesteps the lexical error:
A =LOAD '/user/input/multidelimiter' using PigStorage(',') as (empid,ename,line);
B = FOREACH A GENERATE empid,ename, FLATTEN( REGEX_EXTRACT_ALL( line,'(.*)\\u003B(.*)')) AS (sal:int,deptno:int);

string concatenation not working in pig

I have a table in HCatalog which has 3 string columns. When I try to concatenate strings, I get the following error:
A = LOAD 'default.temp_table_tower' USING org.apache.hcatalog.pig.HCatLoader() ;
B = LOAD 'default.cdr_data' USING org.apache.hcatalog.pig.HCatLoader();
c = FOREACH A GENERATE CONCAT(mcc,'-',mnc) as newCid;
Could not resolve concat using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Could not infer the matching function for org.apache.pig.builtin.CONCAT as multiple or none of them fit. Please use an explicit cast
What might be the root cause of the problem?
Maybe this will help with concatenation in Pig.
data1 contains:
(Maths,abc)
(Maths,def)
(Maths,ef)
(Maths,abc)
(Science,ac)
(Science,bc)
(Chemistry,xc)
(Telugu,xyz)
considering the schema as sub (Maths, Maths, Science, etc.) and name (abc, def, ef, etc.):
X = FOREACH data1 GENERATE CONCAT(sub,CONCAT('#',name));
The output of X is:
(Maths#abc)
(Maths#def)
(Maths#ef)
(Maths#abc)
(Science#ac)
(Science#bc)
(Chemistry#xc)
(Telugu#xyz)
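The nesting in that example is the key point: in older Pig versions the built-in CONCAT resolves only two arguments, so a three-argument call like CONCAT(mcc,'-',mnc) finds no matching signature. A hedged rewrite of the question's line, with explicit chararray casts added since the error message asks for one (mcc and mnc are the question's own columns):
c = FOREACH A GENERATE CONCAT((chararray)mcc, CONCAT('-', (chararray)mnc)) AS newCid;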

Insert data into Cassandra from Pig using list datatype fails

I have the following scenario:
Table in Cassandra:
CREATE TABLE tb_st_test (
    id int,
    email list<text>,
    PRIMARY KEY ((id))
);
PIG Code:
teste = LOAD 'cql://main/tb_st_test' USING CqlStorage();
testing = FOREACH teste GENERATE $0 as cod, ['emailtest#test.com'] as field:();
insert_test =
FOREACH testing GENERATE
TOTUPLE(
TOTUPLE('id',cod)
),
TOTUPLE(field);
STORE insert_test INTO 'cql://main/tb_st_test?output_query=UPDATE tb_st_test set email %3D%3F' USING CqlStorage();
The idea here is to read the table tb_st_test, get the key values, and update the field email.
But when I run the script I get the following error:
Backend error message
java.io.IOException: org.apache.thrift.transport.TTransportException
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:256)
Caused by: org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
at org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at org.apache.cassandra.thrift.Cassandra$Client.recv_execute_prepared_cql3_query(Cassandra.java:1820)
at org.apache.cassandra.thrift.Cassandra$Client.execute_prepared_cql3_query(Cassandra.java:1805)
at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:240)
Does anyone know what is happening?
The insert_test format is wrong; for a list collection the format should be TOTUPLE(TOTUPLE('some email', 'email2')). Check https://issues.apache.org/jira/browse/CASSANDRA-5867
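Applied to the script in the question, the GENERATE might look like the sketch below. This is untested; the double-TOTUPLE wrapping follows the answer and the linked ticket, and everything else is the poster's own script:
insert_test = FOREACH teste GENERATE
    TOTUPLE(TOTUPLE('id', $0)),
    -- the list value is itself a tuple of tuples, one inner element per list entry
    TOTUPLE(TOTUPLE('emailtest#test.com'));
STORE insert_test INTO 'cql://main/tb_st_test?output_query=UPDATE tb_st_test set email %3D%3F' USING CqlStorage();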
