Pig throwing incompatible type error - hadoop

I am using the following code to generate sessionId in pig by using sessionize UDF in datafu.
SET mapred.min.split.size 1073741824;
SET mapred.job.queue.name 'marathon';
SET mapred.output.compress true;
--SET avro.output.codec snappy;
--SET pig.maxCombinedSplitSize 536870912;
page_view_pre = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range','start.date=20150226;end.date=20150226;error.on.missing=true'); -----logic is currently for 2015-02-26,will later replace them with date parameters
p_key = LOAD '/projects/dwh/dwh_dim/dim_page_key/#LATEST' USING LiAvroStorage();
page_view_pre = FILTER page_view_pre BY (requestHeader.userAgent != 'CRAWLER' and requestHeader.browserId != 'CRAWLER') and NOT IsTestMemberId(header.memberId);
page_view_pre = FOREACH page_view_pre GENERATE
(int) (header.memberId <0 ? -9 : header.memberId ) as member_sk,
(chararray) requestHeader.browserId as browserId,
--(chararray) requestHeader.sessionId as sessionId,
(chararray) UnixToISO(header.time) as pageViewTime,
header.time as pv_time,
(chararray) requestHeader.path as path,
(chararray) requestHeader.referer as referer,
(chararray) epochToFormat(header.time, 'yyyyMMdd', 'America/Los_Angeles') as tracking_date,
(chararray) requestHeader.pageKey as pageKey,
(chararray) SUBSTRING(requestHeader.trackingCode, 0, 500) as trackingCode,
FLATTEN(botLookup(requestHeader.userAgent, requestHeader.browserId)) as (is_crawler, crawler_type),
(int) totalTime as totalTime,
((int) totalTime < 20 ? 1 :0) as bounce_flag;
page_view_pre = FILTER page_view_pre BY is_crawler == 'N' ;
p_key = FILTER p_key By is_aggregate ==1;
page_view_agg = JOIN page_view_pre by pageKey ,p_key by page_key;
page_view_agg = FOREACH page_view_agg GENERATE
(chararray)page_view_pre::member_sk as member_sk,
(chararray)page_view_pre::browserId as browserId,
--page_view_pre::sessionId as sessionId,
(chararray)page_view_pre::pageViewTime as pageViewTime,
(long)page_view_pre::pv_time as pv_time,
(chararray)page_view_pre::tracking_date as tracking_date,
(chararray)page_view_pre::path as path,
(chararray)page_view_pre::referer as referer,
(chararray)page_view_pre::pageKey as pageKey,
(int)p_key::page_key_sk as page_key_sk,
(chararray)page_view_pre::trackingCode as trackingCode,
(int)page_view_pre::totalTime as totalTime,
(int)page_view_pre::bounce_flag as bounce_flag;
page_view_agg = FILTER page_view_agg By (member_sk is NOT null) OR (browserId IS NOT NULL) ;
pvs_by_member_browser_pair = GROUP page_view_agg BY (member_sk,browserId);
***session_groups = FOREACH pvs_by_member_browser_pair {
visits = ORDER page_view_agg BY pv_time;
GENERATE FLATTEN(Sessionize(visits)) AS (
pageViewTime,member_sk, pv_time,tracking_date, pageKey,page_key_sk,browserId,referer ,path, trackingCode,totalTime, sessionId
);
}***
The bolded part is giving me the following error:
ERROR 1031: Incompatable schema: left is "pageViewTime:NULL,member_sk:NULL,pv_time:NULL,tracking_date:NULL,pageKey:NULL,page_key_sk:NULL,browserId:NULL,referer:NULL,path:NULL,trackingCode:NULL,totalTime:NULL,sessionId:NULL", right is "datafu.pig.sessions.sessionize_visits_43::member_sk:chararray,datafu.pig.sessions.sessionize_visits_43::browserId:chararray,datafu.pig.sessions.sessionize_visits_43::pageViewTime:chararray,datafu.pig.sessions.sessionize_visits_43::pv_time:long,datafu.pig.sessions.sessionize_visits_43::tracking_date:chararray,datafu.pig.sessions.sessionize_visits_43::path:chararray,datafu.pig.sessions.sessionize_visits_43::referer:chararray,datafu.pig.sessions.sessionize_visits_43::pageKey:chararray,datafu.pig.sessions.sessionize_visits_43::page_key_sk:int,datafu.pig.sessions.sessionize_visits_43::trackingCode:chararray,datafu.pig.sessions.sessionize_visits_43::totalTime:int,datafu.pig.sessions.sessionize_visits_43::bounce_flag:int,datafu.pig.sessions.sessionize_visits_43::session_id:chararray"
I initially thought this had to do with null member or browser ids. I filtered for those too, but the error persists. I have been stuck here for hours. I would really appreciate some pointers or a solution to this problem.
Thanks

This is a classic case of schema mismatch:
page_view_pre = LOAD '/data/tracking/PageViewEvent/' USING LiAvroStorage('date.range','start.date=20150226;end.date=20150226;error.on.missing=true'); -----logic is currently for 2015-02-26,will later replace them with date parameters
Just run ILLUSTRATE (or DESCRIBE) on page_view_pre after this line to figure out the actual schema.
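In this particular case the error message itself reveals the mismatch: the left side is the 12-name AS clause from the script, while the right side shows that Sessionize returns the 13 fields of page_view_agg in their original column order, plus a trailing session_id (bounce_flag is also present but missing from the AS clause). A sketch of the nested FOREACH with the AS clause aligned to that reported order — names and order are taken from the error text, so verify them against your own DESCRIBE output:

```
session_groups = FOREACH pvs_by_member_browser_pair {
    visits = ORDER page_view_agg BY pv_time;
    -- The AS clause must match the UDF output exactly:
    -- same field count, same order as the right-hand schema in ERROR 1031
    GENERATE FLATTEN(Sessionize(visits)) AS (
        member_sk, browserId, pageViewTime, pv_time, tracking_date,
        path, referer, pageKey, page_key_sk, trackingCode,
        totalTime, bounce_flag, sessionId
    );
};
```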

Related

KAFKA JDBC Source connector adds the default schema

I use the KAFKA JDBC Source connector to read from the ClickHouse database (driver: clickhouse-jdbc-0.2.4.jar) in incrementing mode.
Settings:
batch.max.rows = 100
catalog.pattern = null
connection.attempts = 3
connection.backoff.ms = 10000
connection.password = [hidden]
connection.url = jdbc:clickhouse://<ip>:8123/<schema>
connection.user = user
db.timezone =
dialect.name =
incrementing.column.name = id
mode = incrementing
numeric.mapping = null
numeric.precision.mapping = false
poll.interval.ms = 5000
query =
query.suffix =
quote.sql.identifiers = never
schema.pattern = null
table.blacklist = []
table.poll.interval.ms = 60000
table.types = [TABLE]
table.whitelist = [<table_name>]
tables = [default.<schema>.<table_name>]
timestamp.column.name = []
timestamp.delay.interval.ms = 0
timestamp.initial = null
topic.prefix = staging-
validate.non.null = false
Why does the connector additionally prepend the default schema, and how can I avoid it?
Instead of the query
SELECT * FROM <schema>.<table_name> WHERE <schema>.<table_name>.id > ? ORDER BY <schema>.<table_name>.id ASC
the connector issues
SELECT * FROM default.<schema>.<table_name> WHERE default.<schema>.<table_name>.id > ? ORDER BY default.<schema>.<table_name>.id ASC
which fails.
You can create the ClickHouse data source object like below (where no schema name is passed in the URL):
final ClickHouseDataSource dataSource = new ClickHouseDataSource(
"jdbc:clickhouse://"+host+"/"+user+"?option1=one%20two&option2=y");
Then, in the SQL query, you can specify the schema name explicitly (schema.table), so the connector will not add the default schema to your query.

Pig Performance Issues

I have the following Pig script, which is taking a lot of time to process 342 files with a 256 MB split size (testing only). Can anybody suggest improvements?
SPLIT filteredalnumcdrs into splitalnumcdrs_1 IF (
(SUBSTRING(aparty,2,3) == '-')),
splitalnumcdrs_2 OTHERWISE;
tmpsplitalnumcdrs_1 = FOREACH splitalnumcdrs_1 GENERATE aparty,srcgt,destgt,SUBSTRING(aparty,0,2) as splitaparty,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_1 = GROUP tmpsplitalnumcdrs_1 BY (aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_1 = FOREACH groupsplitalnumcdrs_1 {
uniqsplitalnumcdrs_1 = DISTINCT tmpsplitalnumcdrs_1.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_1) as countalnumcdrs;
};
tmpsplitalnumcdrs_2 = FOREACH splitalnumcdrs_2 GENERATE aparty,srcgt,destgt,aparty as splitaparty_2,bparty,smscgt,status,prepost;
groupsplitalnumcdrs_2 = GROUP tmpsplitalnumcdrs_2 BY (aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs_2 = FOREACH groupsplitalnumcdrs_2 {
uniqsplitalnumcdrs_2 = DISTINCT tmpsplitalnumcdrs_2.(aparty,srcgt,destgt,splitaparty_2,bparty,smscgt,status,prepost);
GENERATE FLATTEN(group),COUNT(tmpsplitalnumcdrs_2) as countsplitalnumcdrs_2;
};
distinctalnumcdrs = UNION distinctsplitalnumcdrs_1,distinctsplitalnumcdrs_2;
alnumreportmap = FOREACH distinctalnumcdrs GENERATE aparty,smsiuc_udfs.mapgtabparty(srcgt,destgt,splitaparty,bparty),smscgt,status,prepost,countalnumcdrs PARALLEL 20;
alnumreportmapgroup = GROUP alnumreportmap BY (aparty,mappedreport,smscgt,status,prepost);
alnumreportmaprecord = FOREACH alnumreportmapgroup GENERATE FLATTEN(group),SUM(alnumreportmap.countalnumcdrs) as alnumsmscount;
You can avoid the UNION:
tmpsplitalnumcdrs = FOREACH filteredalnumcdrs GENERATE aparty,srcgt,destgt,(SUBSTRING(aparty,2,3) == '-' ? SUBSTRING(aparty,0,2) : aparty) as splitaparty,bparty,smscgt,status,prepost;
groupsplitalnumcdrs = GROUP tmpsplitalnumcdrs BY (aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);
distinctsplitalnumcdrs = FOREACH groupsplitalnumcdrs GENERATE FLATTEN(group), COUNT(tmpsplitalnumcdrs) as countsplitalnumcdrs;
Also, in your original script, why do you need this line? Its result is never used in the GENERATE:
uniqsplitalnumcdrs = DISTINCT tmpsplitalnumcdrs.(aparty,srcgt,destgt,splitaparty,bparty,smscgt,status,prepost);

NullReferenceException Error when trying to iterate a IEnumerator

I have a datatable and want to select some records with LinQ in this format:
var result2 = from row in dt.AsEnumerable()
where row.Field<string>("Media").Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
&& (String.Compare(row.Field<string>("StrDate"), dtStart.Year.ToString() +
(dtStart.Month < 10 ? '0' + dtStart.Month.ToString() : dtStart.Month.ToString()) +
(dtStart.Day < 10 ? '0' + dtStart.Day.ToString() : dtStart.Day.ToString())) >= 0
&& String.Compare(row.Field<string>("StrDate"), dtEnd.Year.ToString() +
(dtEnd.Month < 10 ? '0' + dtEnd.Month.ToString() : dtEnd.Month.ToString()) +
(dtEnd.Day < 10 ? '0' + dtEnd.Day.ToString() : dtEnd.Day.ToString())) <= 0)
group row by new { Year = row.Field<int>("Year"), Month = row.Field<int>("Month"), Day = row.Field<int>("Day") } into grp
orderby grp.Key.Year, grp.Key.Month, grp.Key.Day
select new
{
CurrentDate = grp.Key.Year + "/" + grp.Key.Month + "/" + grp.Key.Day,
DayOffset = (new DateTime(grp.Key.Year, grp.Key.Month, grp.Key.Day)).Subtract(dtStart).Days,
Count = grp.Sum(r => r.Field<int>("Count"))
};
and then I try to iterate it with the following code:
foreach (var row in result2)
{
//... row.DayOffset.ToString() + ....
}
this issue occurred:
Object reference not set to an instance of an object.
I think it happens when there is no record matching the above criteria.
I tried to switch to an enumerator like this, using MoveNext() to check whether there is any data:
var enumerator2 = result2.GetEnumerator();
if (enumerator2.MoveNext()) { /* ... */ }
but I still get the same error.
What's the problem?
I guess in one or more rows Media is null.
You then call Equals on null, which results in a NullReferenceException.
You could add a null check:
var result2 = from row in dt.AsEnumerable()
where row.Field<string>("Media") != null
&& row.Field<string>("Media").Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
...
or use a surrogate value like:
var result2 = from row in dt.AsEnumerable()
let media = row.Field<string>("Media") ?? String.Empty
where media.Equals(MediaTp, StringComparison.CurrentCultureIgnoreCase)
...
(note that the last approach is slightly different)

SUM, AVG, in Pig are not working

I am analyzing Cluster user log files with the following code in pig:
t_data = load 'log_flies/*' using PigStorage(',');
A = foreach t_data generate $0 as (jobid:int),
$1 as (indexid:int), $2 as (clusterid:int), $6 as (user:chararray),
$7 as (stat:chararray), $13 as (queue:chararray), $32 as (projectName:chararray), $52 as (cpu_used:float), $55 as (efficiency:float), $59 as (numThreads:int),
$61 as (numNodes:int), $62 as (numCPU:int),$72 as (comTime:int),
$73 as (penTime:int), $75 as (runTime:int), $52/($62*$75) as (allEff: float), SUBSTRING($68, 0, 11) as (endTime: chararray);
---describe A;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
B = group A by user;
f_data = foreach B {
grp = group;
count = COUNT(A);
avg = AVG(A.cpu_used);
generate FLATTEN(grp), count, avg;
};
f_data = limit f_data 10;
dump f_data;
Code works for group and COUNT but when I includes AVG and SUM, it shows the errors:
ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open
iterator for alias f_data
I checked the data types and all are fine. Do you have any suggestions on where I went wrong? Thank you in advance for your help.
It's a syntax error. Read http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#more_on_foreach (section: Nested foreach) for details.
Pig Script
A = LOAD 'a.csv' USING PigStorage(',') AS (user:chararray, cpu_used:float);
B = GROUP A BY user;
C = FOREACH B {
cpu_used_bag = A.cpu_used;
GENERATE group AS user, AVG(cpu_used_bag) AS avg_cpu_used, SUM(cpu_used_bag) AS total_cpu_used;
};
Input : a.csv
a,3
a,4
b,5
Output :
(a,3.5,7.0)
(b,5.0,5.0)
Your Pig script is full of errors:
Do not use the same alias on both sides of = ;
specify your schema appropriately in the AS clause of the loader;
A = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
change this to
F = foreach A generate jobid, indexid, clusterid, user, cpu_used, numThreads, runTime, allEff, endTime;
f_data = limit f_data 10;
Change the left-hand f_data to some other name.
Stop making your life complex.
General rules for debugging a Pig script:
run in local mode;
dump after every line.
I wrote a sample Pig script to mimic yours (working):
t_data = load './file' using PigStorage(',') as (jobid:int,cpu_used:float);
C = foreach t_data generate jobid, cpu_used ;
B = group C by jobid ;
f_data = foreach B {
count = COUNT(C);
sum = SUM(C.cpu_used);
avg = AVG(C.cpu_used);
generate FLATTEN(group), count,sum,avg;
};
never_f_data = limit f_data 10;
dump never_f_data;
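The two debugging rules above can be sketched as a workflow, assuming a small local sample in ./file (the file name from the sample script):

```
-- Run in local mode, no cluster needed: pig -x local debug.pig
t_data = load './file' using PigStorage(',') as (jobid:int, cpu_used:float);
describe t_data;  -- confirm the schema after each assignment
dump t_data;      -- inspect actual records before adding the next line
B = group t_data by jobid;
describe B;       -- note the nested bag: {group: int, t_data: {(jobid: int,cpu_used: float)}}
dump B;
```

Checking the schema and a few records after every line localizes type errors like the one above before the full job runs.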

Optimizing pig script

I am trying to generate aggregated output. The issue is that all the data is going to a single reducer (the FILTER and COUNT inside the FOREACH are creating the problem). How can I optimize the following script?
Expected output:
group, 10,2,12,34...
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
grp1 = GROUP data BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
fltrCol1 = FILTER data BY Col1 == 'Other';
fltrCol2 = FILTER data BY Col2 == 'Other';
fltrCol3 = FILTER data BY Col3 == 'Other';
fltrCol4 = FILTER data BY col4 == 'Other';
fltrCol5 = FILTER data BY col5 == 'Other';
cnt_fltrCol1 = COUNT(fltrCol1);
cnt_fltrCol2 = COUNT(fltrCol2);
cnt_fltrCol3 = COUNT(fltrCol3);
cnt_fltrCol4 = COUNT(fltrCol4);
cnt_fltrCol5 = COUNT(fltrCol5);
GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}
You could compute the filter logic before the GROUP BY, by adding fltrCol{1,2,3,4,5} as integer flag columns, then summing them up. Because SUM is algebraic, Pig can partially aggregate on the map side, so the counts no longer pile up in a single reducer. Off the top of my head, here is the script (note that filter is a reserved word in Pig, so the intermediate alias is named flags, and UA is added to the LOAD schema since the script references it):
data = LOAD '/input/useragents' USING PigStorage('\t') AS (UA:chararray,Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
flags = FOREACH data GENERATE UA,
((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
((col4 == 'Other') ? 1 : 0) AS fltrCol4,
((col5 == 'Other') ? 1 : 0) AS fltrCol5;
grp1 = GROUP flags BY UA PARALLEL 50;
fr1 = FOREACH grp1 GENERATE group,
SUM(flags.fltrCol1) AS cnt_fltrCol1,
SUM(flags.fltrCol2) AS cnt_fltrCol2,
SUM(flags.fltrCol3) AS cnt_fltrCol3,
SUM(flags.fltrCol4) AS cnt_fltrCol4,
SUM(flags.fltrCol5) AS cnt_fltrCol5;
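One way to sanity-check the rewrite, as a rough sketch: EXPLAIN prints Pig's logical, physical, and MapReduce plans for an alias. With an algebraic function such as SUM, the MapReduce plan should typically show a Combine phase (map-side partial aggregation), and PARALLEL 50 should be reflected in the reduce parallelism; exact plan output varies by Pig version.

```
-- Inspect the plans for the final alias; look for a Combine section
-- in the MapReduce plan, which confirms map-side partial SUMs.
EXPLAIN fr1;
```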
