Getting an exception while trying to execute a Pig Latin script

I am learning Pig on my own, and while exploring a dataset I am hitting an exception. What is wrong with the script, and why?
movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE movies_data.title,movies_data.year,movies_data.rating,movies_data.duration;
DUMP high_rated;
At the end of the MapReduce execution I get the error below.
2018-07-22 20:11:07,213 [main] ERROR org.apache.pig.tools.grunt.Grunt
ERROR 1066: Unable to open iterator for alias high_rated.
Backend error : org.apache.pig.backend.executionengine.ExecException:
ERROR 0: Scalar has more than one row in the output.
1st : (1,The Nightmare Before Christmas,1993,3.9,4568.0),
2nd :(2,The Mummy,1932,3.5,4388.0)
(common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar" )

First, let's see how we can fix your problem. You don't need to access your fields using the alias name. Your third line could be simply:
high_rated = FOREACH high GENERATE title, year, rating, duration;
If you want to use the alias name for some reason, you should use the disambiguation operator (::), as the error message itself suggests. Then your line would look like:
high_rated = FOREACH high GENERATE movies_data::title, movies_data::year, movies_data::rating, movies_data::duration;
Next, let's try to understand the exact reason behind the error message. When you access the fields of a relation with the dot operator (.), Pig assumes the alias is a scalar (a relation containing only one row). Since your alias has more than one row, it complains. You can read more about scalars in Pig here: https://issues.apache.org/jira/browse/PIG-1434
In the JIRA's release notes section you will notice, at the end, that the documented error message matches the one you are getting:
If a relation contains more than single tuple, a runtime error is generated:
"Scalar has more than one row in the output"

This works without error:
movies_data = LOAD '/movies_data' using PigStorage(',') as (id:chararray,title:chararray,year:int,rating:double,duration:double);
high = FILTER movies_data by rating > 4.0;
high_rated = FOREACH high GENERATE title,year,rating,duration;
DUMP high_rated;
The FILTER statement passes through every record, with all of its columns, that satisfies the filter condition.

Related

Converting datatype of a column in dataframe and getting warning

I'm attempting to change the datatype of a few columns in a dataframe from boolean/float to integer and keep getting the following warning:
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
after removing the cwd from sys.path.
I've attempted to .copy() the columns before invoking .astype(), and also to take the columns' .values before invoking .astype(), but I keep getting this warning. I haven't encountered this before.
df[['Cancelled', 'Diverted']] are booleans and df['DepDel15'] is a float.
Here is the code:
# Convert target variables to integers
target_vars = ['Cancelled','Diverted','DepDel15']
for element in target_vars:
    df[element] = df[element].astype('int')
Does anyone have suggestions on how to avoid the warning I'm getting?
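A minimal sketch of one common fix, assuming df was itself produced by slicing a larger DataFrame (the usual trigger for SettingWithCopyWarning); the flights frame and its values below are made up for illustration. Taking an explicit copy of the slice means the later assignments no longer target a view:
import pandas as pd

# Hypothetical source frame standing in for the real data.
flights = pd.DataFrame({
    'Cancelled': [True, False],
    'Diverted': [False, True],
    'DepDel15': [1.0, 0.0],
})

# If df came from slicing/filtering another frame, make it an explicit copy
# so later column assignments modify an independent object, not a view.
df = flights[flights['DepDel15'] >= 0].copy()

# Convert target variables to integers, as in the question.
target_vars = ['Cancelled', 'Diverted', 'DepDel15']
for element in target_vars:
    df[element] = df[element].astype('int')

print(df.dtypes)  # all three columns are now integer dtype
If df really is the top-level frame rather than a slice, the same loop runs without the warning, so the warning usually points back to how df was created rather than to the .astype() call itself.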

Automate downloading of multiple xml files from web service with power query

I want to download multiple xml files from web service API. I have a query that gets a JSON document:
= Json.Document(Web.Contents("http://reports.sem-o.com/api/v1/documents/static-reports?DPuG_ID=BM-086&page_size=100"))
and manipulates it to get a list of file names, such as PUB_DailyMeterDataD1_201812041627.xml, in a column of an Excel spreadsheet.
I hoped to get a function to run against this list of names to get all the data, so first I worked on one file: PUB_DailyMeterDataD1_201812041627
= Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/PUB_DailyMeterDataD1_201812041627.xml"))
This gets an XML table which I manipulate to get the data I want (the half-hourly metered MWh for generator GU_401970).
Now I want to change the query into a function to automate the process across all XML files available from the service. The function requires a variable to be substituted for the filename. I tried this as preparation for the function:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = (Web.Contents("https://reports.sem-o.com/documents/Filename")),
(followed by the M code that does the manipulation)
This doesn't work.
Then I tried this:
let
Filename="PUB_DailyMeterDataD1_201812041627.xml",
Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/[Filename]")),
I get:
DataFormat.Error: Xml processing failed. Either the input is invalid or it isn't supported. (Internal error: Data at the root level is invalid. Line 1, position 1.)
Details:
Binary
So I am stuck here. Can you help?
Thanks,
Conor
You append strings with the "&" symbol in Power Query. [Somename] is the format for referencing a field within a table; a normal variable is referenced just with its name. So in your example,
let
Filename = "PUB_DailyMeterDataD1_201812041627.xml",
Source = Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & Filename)),
would work.
It sounds like you already have a query that drills down to a list of file names and you want to use it to import each file from the URL. Assuming the column you got the file names from is called "Filename", you could add a custom column with this in it:
Xml.Tables(Web.Contents("https://reports.sem-o.com/documents/" & [Filename]))
This will load the corresponding table onto each file name's row.

Force an error in SPSS Syntax when condition is met

I'm doing a check in my syntax to see if all of the string fields have values. This looks something like this:
IF(STRING ~= "").
Now, instead of filtering or computing a variable, I would like to force an error if there are fields with no values. I'm running a lot of these scripts and I don't want to keep checking the datasets manually. Instead, I just want to receive an error and stop execution.
Is this possible in SPSS Syntax?
First, you can efficiently count the blanks in your string variables with COUNT:
COUNT blanks = A B C(" ").
where you list the string variables to be checked. So if the sum of blanks is > 0, you want the error to be raised. First aggregate to a new dataset and activate it:
DATASET DECLARE sum.
AGGREGATE /OUTFILE=sum /count=sum(blanks).
The hard part is getting the job to stop when the blank count is greater than 0. You can put the job in a syntax file and run it using INSERT with ERROR=STOP, or you can run it as a production job via Utilities or via the -production flag on the command line of an spj (production facility) job.
Here is how to generate an error on a condition.
DATASET ACTIVATE sum.
begin program.
import spss, spssdata
curs = spssdata.Spssdata()
count = curs.fetchone()
curs.CClose()
if count[0] > 0:
    # Deliberately submit an invalid command so the job fails and stops here.
    spss.Submit("blank values exist")
end program.
DATASET CLOSE sum.
This code reads the aggregate file and, if the blank count is positive, issues an invalid command, causing an error that stops the job.

Pig XMLLoader: Cannot parse XML (Converting to CSV)

I have the following XML data:
<CompactData><my:DataSet><my:Series VAL="A" AMOUNT_TYPE="FI" IDENTIFIER="1"><my:Obs AMT="24.25" UNIT_MEASURE="KG"></my:Obs></my:Series><my:Series VAL="B" AMOUNT_TYPE="GI" IDENTIFIER="2"><my:Obs AMT="21.22" UNIT_MEASURE="KG"></my:Obs></my:Series></my:DataSet></CompactData>
I am trying to convert it to CSV format using the following commands in Pig:
A = LOAD '/testing/mydata.xml' using org.apache.pig.piggybank.storage.XMLLoader('CompactData') as (x:chararray);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(x,'<my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"></my:Obs></my:Series>')) AS (val:chararray,amount_type:chararray,identifier:chararray,amt:chararray,unit_measure:chararray);
Putting the regex <my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)"><my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"><\/my:Obs><\/my:Series> into Regexr gives two perfect matches, but Pig just does not want to work with it. It always gives me an empty result whereas I expect the following:
A,FI,1,24.25,KG
B,GI,2,21.22,KG
Update 1: This seems most likely related to the issue mentioned here: Pig xmlloader error when loading tag with colon
Assuming your code does not give an error, I can think of 3 potential problems here:
Your regex is not called
Your regex (in pig) is not returning the expected result
The output of your regex is not shown
To deal with the situation, I would recommend the following steps:
Create a Pig program that successfully uses a regex to find 'b' in 'aba'
Create a Pig program that successfully finds both occurrences of 'a' in 'aba'
Create a Pig program that successfully finds the first occurrence of 'a' in 'aba'
Keep 'growing' this solution gradually until you reach your actual solution
If you still get stuck, please share the last solution that worked, and the first one that didn't work. (including input!)
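Before growing the Pig-side pipeline, it can also help to confirm outside of Pig that the pattern and the sample data actually line up. Here is a minimal sanity check with Python's re module against the sample line from the question; it only verifies the regex itself, not how XMLLoader and REGEX_EXTRACT_ALL feed it:
import re

# The sample <CompactData> element from the question, as one string.
line = ('<CompactData><my:DataSet>'
        '<my:Series VAL="A" AMOUNT_TYPE="FI" IDENTIFIER="1">'
        '<my:Obs AMT="24.25" UNIT_MEASURE="KG"></my:Obs></my:Series>'
        '<my:Series VAL="B" AMOUNT_TYPE="GI" IDENTIFIER="2">'
        '<my:Obs AMT="21.22" UNIT_MEASURE="KG"></my:Obs></my:Series>'
        '</my:DataSet></CompactData>')

pattern = (r'<my:Series VAL="([^"]+)" AMOUNT_TYPE="([^"]+)" IDENTIFIER="([^"]+)">'
           r'<my:Obs AMT="([^"]+)" UNIT_MEASURE="([^"]+)"></my:Obs></my:Series>')

for match in re.finditer(pattern, line):
    print(','.join(match.groups()))
# Prints:
# A,FI,1,24.25,KG
# B,GI,2,21.22,KG
If the pattern passes a check like this, the remaining suspects are how the loader hands the text to the script and how the regex is invoked inside Pig, which is exactly what the incremental steps above isolate.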

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you want to process the input. The arguments can be individual column names, e.g. FAKEUDF(column1, column2, ...); you can pass every column with FAKEUDF(*); or you can pass a relation name. Inside the UDF you then pull the column values out of the tuple with tuple.get(index). Be careful to keep what you pass in consistent with how the UDF processes it; depending on what you send, the argument can even be a DataBag.
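If a Java UDF is not a hard requirement, Pig 0.12 also supports Python UDFs registered through Jython, where each column you list in the call arrives as an ordinary function argument. The sketch below is only an illustration of that alternative route; the file name udfs.py and the function name join_fields are made up, and the outputSchema decorator is supplied by Pig's Jython environment rather than imported:
# udfs.py -- illustrative Jython UDF; registered and called from Pig with:
#   REGISTER 'udfs.py' USING jython AS myudfs;
#   B = FOREACH A GENERATE myudfs.join_fields(f1, f2);

@outputSchema('joined:chararray')
def join_fields(f1, f2):
    # Each column named in the Pig call arrives here as a separate argument.
    return '%s|%s' % (f1, f2)
For the Java route in the question, the key point from above still stands: whatever you pass (columns, *, or a relation) arrives in the UDF's exec(Tuple) method, and you pull the values out with tuple.get(index); a passed relation shows up as a DataBag.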
