apache-pig: ERROR 1066: Unable to open iterator for alias - hadoop

I am trying to run a Pig script on bulk Wikipedia page statistics data.
To start with, I am just doing a basic filter:
A = LOAD '/data' using PigStorage(' ') as (project:chararray, page:chararray, requests:int, size:int);
B = FILTER A BY project == 'en';
dump B;
This works fine if I load 2-3 files, but it errors out if I load all the files. The error is:
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias B
To confirm that there are no corrupted records, I made several copies of the file that was working and ran the above script, but no luck. Please advise!

Related

tensorflow_datasets.load() downloads but cannot extract certain datasets

Calling tensorflow_datasets.load('cycle_gan/apple2orange') works fine
but tensorflow_datasets.load('cycle_gan/vangogh2photo') gives me an error.
I've tried this on my desktop and laptop and both gave the same error message.
Here's the code I ran and the error message I got:
import tensorflow_datasets as tfds
dataset = tfds.load('cycle_gan/vangogh2photo',
                    data_dir='data', batch_size=1, download=True, in_memory=False)
InvalidArgumentError: Failed to create a NewWriteableFile: data\downloads\extracted\ZIP.peop.eecs.berk.edu_taes_park_Cycl_data_vanNiw0c-cL4JRL2gjUnWYOr9woVN9V1peDW4GG0decqv8.zip.incomplete_bf327518b23f41ee9a3a469cc0b541ba\vangogh2photo\testB\2014-12-10 12:08:40.jpg : The filename, directory name, or volume label syntax is incorrect.
; Unknown error
Then it says:
During handling of the above exception, another exception occurred:
(traceback)
ExtractError: Error while extracting data\downloads\peop.eecs.berk.edu_taes_park_Cycl_data_vanNiw0c-cL4JRL2gjUnWYOr9woVN9V1peDW4GG0decqv8.zip to data\downloads\extracted\ZIP.peop.eecs.berk.edu_taes_park_Cycl_data_vanNiw0c-cL4JRL2gjUnWYOr9woVN9V1peDW4GG0decqv8.zip : Failed to create a NewWriteableFile: data\downloads\extracted\ZIP.peop.eecs.berk.edu_taes_park_Cycl_data_vanNiw0c-cL4JRL2gjUnWYOr9woVN9V1peDW4GG0decqv8.zip.incomplete_bf327518b23f41ee9a3a469cc0b541ba\vangogh2photo\testB\2014-12-10 12:08:40.jpg : The filename, directory name, or volume label syntax is incorrect.
; Unknown error
How do I fix this?
Which OS are you using?
On Windows there is an issue with some datasets when composing the URLs to fetch the files from or the local paths to save them under.
For the Oxford-IIIT Pet dataset, the link below provides a fix:
https://github.com/tensorflow/tensorflow/issues/31171#issuecomment-529169445
Perhaps something similar may apply here?
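The failing path in the error ends in `2014-12-10 12:08:40.jpg`, and Windows does not allow `:` in file names, which would explain why extraction fails only for archives that contain such names. A minimal sketch of a check for this, assuming the zip archive has already been downloaded; `is_windows_unsafe` and `windows_unsafe_names` are hypothetical helpers, not part of tensorflow_datasets:

```python
import zipfile

# Characters Windows rejects inside a file or directory name.
WINDOWS_RESERVED_CHARS = set('<>:"|?*')


def is_windows_unsafe(name):
    """Return True if any path component of `name` contains a reserved character."""
    components = name.replace("\\", "/").split("/")
    return any(WINDOWS_RESERVED_CHARS & set(part) for part in components)


def windows_unsafe_names(zip_path):
    """List the archive members that cannot be created as-is on Windows."""
    with zipfile.ZipFile(zip_path) as archive:
        return [n for n in archive.namelist() if is_windows_unsafe(n)]
```

If `windows_unsafe_names` returns any entries for the downloaded archive, extracting it on Linux or macOS (or renaming those members during extraction) is one way around the problem.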

Required field 'uncompressed_page_size' was not found in serialized data! Parquet

I am getting the error below while trying to save a Parquet file from a local directory using PySpark.
I tried Spark 1.6 and 2.2; both give the same error.
It displays the schema properly but fails when writing the file.
base_path = "file:/Users/xyz/Documents/Temp/parquet"
reg_path = "file:/Users/xyz/Documents/Temp/parquet/ds_id=48"
df = sqlContext.read.option("basePath", base_path).parquet(reg_path)
out_path = "file:/Users/xyz/Documents/Temp/parquet/out"
df2 = df.coalesce(5)
df2.printSchema()
df2.write.mode('append').parquet(out_path)
org.apache.spark.SparkException: Task failed while writing rows
Caused by: java.io.IOException: can not read class org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' was not found in serialized data! Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
In my own case, I was writing a custom Parquet parser for Apache Tika when I ran into this error. It turned out that if the file is in use by another process, the ParquetReader cannot read uncompressed_page_size, which causes the error.
Verify that no other process is holding on to the file.
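One way to spot a file that is incomplete or still being written is to look at the Parquet magic bytes: a complete, unencrypted Parquet file begins and ends with the 4 bytes `PAR1`. A minimal sketch, assuming the files sit on the local filesystem as in the question; `looks_like_complete_parquet` is a hypothetical helper, and passing this check does not prove that the pages inside the file are intact:

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_complete_parquet(path):
    """Check that `path` starts and ends with the Parquet magic bytes."""
    # Smallest possible file: magic + 4-byte footer length + magic.
    if os.path.getsize(path) < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

Running it over every part file under the input directory narrows the problem down to specific files before involving Spark at all.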
Temporarily resolved with this Spark config:
"spark.sql.hive.convertMetastoreParquet": "false"
It comes at some extra cost, but it is a workaround for now.

Loading Multiple Files in PIG

I have 35 CSV files and I want to load their data using Pig. I have made the following attempts:
1) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' UsingPigStorage(',');
This attempt gave the error:
2014-10-06 00:32:07,130 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. Can not create a Path from an empty string
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580582549.log
In the next attempt I changed the script to use SomeLoader():
2) A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/{HLPCA-00000,HLPCA-01000,HLPCA-02000,HLPCA-03000,HLPCA-04000,HLPCA-05000,HLPCA-06000,HLPCA-07000,HLPCA-08000,HLPCA-09000,HLPCA-10000,HLPCA-11000,HLPCA-12000,HLPCA-13000,HLPCA-14000,HLPCA-15000,HLPCA-16000,HLPCA-17000,HLPCA-18000,HLPCA-19000,HLPCA-20000,HLPCA-21000,HLPCA-22000,HLPCA-23000,HLPCA-24000,HLPCA-25000,HLPCA-26000,HLPCA-27000,HLPCA-28000,HLPCA-29000,HLPCA-30000,HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}.csv' using SomeLoader();
But I got this error:
2014-10-06 00:39:42,905 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve SomeLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
Details at logfile: /home/mrinmoy/Desktop/Sampath Project/Household/pig_1412580912789.log
Pig will always load all files in a directory. So you just need to specify the directory with your CSV files.
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/' using PigStorage(',');
Please also note that UsingPigStorage() is missing a space. It should be using PigStorage().
And you have some double commas: ...HLPCA-31000,,HLPCA-32000,,HLPCA-33000,,HLPCA-34000,,HLPCA-35000}...
Pig supports glob patterns in file names. So you can provide something like:
A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' Using PigStorage(',');
and it will load all files in the Household directory whose names start with 'HLPCA'.
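If an explicit list of files is still needed, generating the brace pattern avoids hand-typed mistakes such as the doubled commas, which leave empty alternatives in the list and may be related to the `Can not create a Path from an empty string` error. A minimal sketch, assuming the files are numbered in steps of 1000 over the range listed in the question; `build_brace_glob` is a hypothetical helper:

```python
def build_brace_glob(directory, prefix, start, stop, step, suffix=".csv", width=5):
    """Build a brace glob such as dir/{P-00000,P-01000}.csv."""
    names = [f"{prefix}-{n:0{width}d}" for n in range(start, stop + 1, step)]
    # Guard against empty alternatives, which a doubled comma would produce.
    assert all(names), "empty alternative in brace list"
    return f"{directory.rstrip('/')}/{{{','.join(names)}}}{suffix}"


pattern = build_brace_glob(
    "/home/mrinmoy/Desktop/Sampath Project/Household", "HLPCA", 0, 35000, 1000
)
# Print a LOAD statement that can be pasted into the Pig script.
print(f"A = LOAD '{pattern}' USING PigStorage(',');")
```

The printed statement can be pasted into the script in place of the hand-written list.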

pig + hbase + hadoop2 integration

Has anyone successfully loaded data into hbase-0.98.0 from pig-0.12.0 on hadoop-2.2.0 without encountering this error:
ERROR 2998: Unhandled internal error.
org/apache/hadoop/hbase/filter/WritableByteArrayComparable
with a line of log trace:
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArra
I searched the web and found a handful of problems and solutions, but all of them refer to pre-hadoop2 and hbase-0.94.x, which are not applicable to my situation.
I have a 5-node hadoop-2.2.0 cluster, a 3-node hbase-0.98.0 cluster, and a client machine with hadoop-2.2.0, hbase-0.98.0, and pig-0.12.0 installed. Each of them works fine separately: HDFS, MapReduce, the region servers, and Pig all work. To complete a "loading data to HBase from Pig" example, I have the following export:
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop:$HBASE_PREFIX/lib/*.jar:$HBASE_PREFIX/lib/protobuf-java-2.5.0.jar:$HBASE_PREFIX/lib/zookeeper-3.4.5.jar
When I tried to run: pig -x local -f loaddata.pig
I got the following error: ERROR 2998: Unhandled internal error. org/apache/hadoop/hbase/filter/WritableByteArrayComparable (this must be the 100th+ time I have hit it during countless attempts to find a working setup).
The trace log shows: java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/filter/WritableByteArrayComparable
The following is my Pig script:
REGISTER /usr/local/hbase/lib/hbase-*.jar;
REGISTER /usr/local/hbase/lib/hadoop-*.jar;
REGISTER /usr/local/hbase/lib/protobuf-java-2.5.0.jar;
REGISTER /usr/local/hbase/lib/zookeeper-3.4.5.jar;
raw_data = LOAD '/home/hdadmin/200408hourly.txt' USING PigStorage(',');
weather_data = FOREACH raw_data GENERATE $1, $10;
ranked_data = RANK weather_data;
final_data = FILTER ranked_data BY $0 IS NOT NULL;
STORE final_data INTO 'hbase://weather' USING
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:date info:temp');
I have successfully created an HBase table 'weather'.
Has anyone got this working, and would you be willing to share how?
Try rebuilding Pig against Hadoop 2 and HBase 0.95+:
ant clean jar-withouthadoop -Dhadoopversion=23 -Dhbaseversion=95
By default it builds against HBase 0.94. 94 and 95 are the only options for hbaseversion.
If you know which jar file contains the missing class, e.g. org/apache/hadoop/hbase/filter/WritableByteArray, then you can use the pig.additional.jars property when running the pig command to ensure that the jar file is available to all the mapper tasks.
pig -D pig.additional.jars=FullPathToJarFile.jar bulkload.pig
Example:
pig -D pig.additional.jars=/usr/lib/hbase/lib/hbase-protocol.jar bulkload.pig
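To find out which jar, if any, contains the missing class, the jars can be scanned as plain zip archives. A minimal sketch, assuming the HBase jars live under a directory such as the `/usr/local/hbase/lib` used in the question; `jars_containing` is a hypothetical helper:

```python
import glob
import zipfile


def jars_containing(class_name, jar_glob):
    """Return the jars matching `jar_glob` that contain `class_name`.

    `class_name` may use dots or slashes, e.g.
    org/apache/hadoop/hbase/filter/WritableByteArrayComparable.
    """
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar_path in sorted(glob.glob(jar_glob)):
        with zipfile.ZipFile(jar_path) as jar:
            if entry in jar.namelist():
                matches.append(jar_path)
    return matches
```

For example, `jars_containing("org.apache.hadoop.hbase.filter.WritableByteArrayComparable", "/usr/local/hbase/lib/*.jar")` lists the candidate jars to pass to pig.additional.jars. An empty result suggests that no jar in that directory provides the class, in which case adding jars to the classpath will not help.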

Error in metadata: MetaException(message:java.lang.IllegalStateException: Can't overwrite cause)

I have created an external table in Hive, and when I provide the location of the data for this table I get the following error:
FAILED: Error in metadata: MetaException(message:java.lang.IllegalStateException: Can't overwrite cause)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask
I am able to load the same file from a Pig script using the PigStorage() loader function.
I have the following permissions on the file: rw-rw-r-
and on the folder where this file resides (this is the path I give as the location in the query): drwxrwxr-x
What could be the cause of this error, and how can I correct it?
The solution is to have write permission on the file.
Another possible cause of this issue is having your LOCATION wrong for your hive table (in case someone else has this issue and can't figure out what is going wrong).
