New to pig.
I'm loading data into a relation like so:
raw_data = LOAD '$input_path/abc/def.*';
It works great, but if it can't find any files matching def.* the entire script fails.
Is there a way to continue with the rest of the script when there are no matches, and just produce an empty set?
I tried to do:
raw_data = LOAD '$input_path/abc/def.*' ONERROR Ignore();
But that doesn't parse.
You could write a custom load UDF that returns either the file or an empty tuple.
http://wiki.apache.org/pig/UDFManual
No, there is no such feature, at least none that I've heard of.
Also, I would say that "producing an empty set" here really amounts to "not running the script at all".
If you don't want to run a Pig script under some circumstances then I recommend using wrapper shell scripts or Pig embedding:
http://pig.apache.org/docs/r0.11.1/cont.html
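To make the wrapper-script suggestion concrete, here is a minimal sketch of what such a wrapper might look like. The paths and the script name myscript.pig are hypothetical; for real HDFS data you would replace the local ls test with something like hadoop fs -ls on the same glob.

```shell
#!/bin/sh
# Skip the Pig run entirely when the input glob matches nothing.
# Local-filesystem illustration; swap the `ls` test for `hadoop fs -ls`
# when the input lives on HDFS.
input_path="${input_path:-/tmp/gr_pig_wrapper_demo}"
mkdir -p "$input_path/abc"

if ls "$input_path"/abc/def.* >/dev/null 2>&1; then
    echo "input found - running Pig"
    # pig -param input_path="$input_path" myscript.pig
else
    echo "no matching input - skipping Pig run"
fi
```

Because the check happens before Pig is invoked, a missing input simply skips the job instead of failing it.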
Is there a way to load an AppleScript library based on a variable?
What I try to achieve is this:
set basescript to "hello.scpt"
tell script basescript
dialoger("testing")
end tell
the basescript will contain something like:
on dialoger(message)
display dialog message
end dialoger
This works fine as long as I type it out, but as soon as I try to pass it as a variable it keeps giving errors...
Any help would be greatly appreciated.
I use script libraries all the time. Once you get the hang of it, it becomes a huge timesaver. There are a couple of ways of loading script commands from a “Library Script” into another script.
One way is to use the load script command, setting a variable to the result of load script followed by the path to the script file.
There is also another way which, in my opinion, is much more powerful: you can import library scripts using the use statement. This method removes the need for tell statements.
For example, I saved the following code from your question as “Hello.scpt” in my /Users/YOUR SHORT NAME/Library/Script Libraries/ folder.
on dialoger(message)
display dialog message
end dialoger
Next, in the script where I want to load the commands from the library script “Hello.scpt”, this is the code I used with the use statement:
use basescript : script "Hello"
use scripting additions
basescript's dialoger("testing")
By using use statements with multiple applications, you can combine terms from different sources in ways impossible using standard tell statements or tell blocks, because the tell construct only makes one terminology source available at a time.
Solution:
If you do set basescript to load script POSIX file "/path/to/Hello.scpt" then tell basescript to dialoger("testing") will work!
Assuming that these are script libraries, you can accomplish what you want using a handler like so:
-- send the name of the script library in the first parameter
-- and the message in the second
myHandler("hello", "My Message")
on myHandler(libName, message)
tell script libName
dialoger(message)
end tell
end myHandler
Since the handler isn't processed until runtime, it will dynamically reference whichever script library name is passed in libName.
Is there a way to automatically run a pig script when invoking pig from command line?
The reason I'm wondering about this is that I have several import and define statements that I use constantly over and over to set everything up. Is it possible to define this collection of statements somewhere so that when I start pig, it will automatically execute those lines? I apologize in advance if this is something trivial that I missed from the documentation.
Yes, you certainly can, from version 0.11 onwards.
You need to use the .pigbootup file.
Here is a nice blog post on setting up the .pigbootup file:
http://hadoopified.wordpress.com/2013/02/06/pig-specify-a-default-script/
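The .pigbootup file (Pig looks for it in your home directory by default) is just a list of Pig statements executed before your session starts. A hypothetical sketch, with an illustrative jar path and settings:

```pig
-- ~/.pigbootup: executed automatically when Pig starts (0.11+)
REGISTER /usr/lib/pig/piggybank.jar;
DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();
SET default_parallel 10;
```

Anything you would otherwise retype at the start of every grunt session can go here.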
If you want to include Pig macros from a file, you can use the IMPORT command.
Take a look at http://pig.apache.org/docs/r0.9.1/cont.html#import-macros for reference
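As a sketch of the macro route, a macro file and a script that imports it might look like this (file, relation, and field names are made up for illustration):

```pig
-- my_macros.pig: a reusable group-and-count macro
DEFINE count_by (in_relation, key) RETURNS counted {
    grouped = GROUP $in_relation BY $key;
    $counted = FOREACH grouped GENERATE group, COUNT($in_relation);
};

-- main.pig: pull the macro in and use it
IMPORT 'my_macros.pig';
logs = LOAD 'input' AS (user:chararray, url:chararray);
per_user = count_by(logs, user);
```

Note that IMPORT brings in macro definitions only; for the always-run setup statements themselves, .pigbootup is the better fit.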
I am using Pig via Azure HDInsight. I am able to submit a query that ends with a STORE, something like this:
STORE Ordered INTO 'results' USING PigStorage(',');
That works, storing the output in the directory /user/hdp/results/. However, I would like to control the output directory. I've tried both...
STORE Ordered INTO '/myOutDir/results' USING PigStorage(',');
and
STORE Ordered INTO 'wasb:///myOutDir/results' USING PigStorage(',');
Neither of those works. They both generate this error:
Ordered was unexpected at this time.
My question is, can I control the output directory for a Store command? Or does it have to go in the user directory?
If you want to set the output with a parameter, you can do this:
STORE Ordered INTO '$myOutDir/results' USING...
And then run your script with:
pig -param myOutDir=/blablabla/... myScript.pig
NB: you can also set a default value for your parameter by adding this at the top of your script:
%default myOutDir '/blablabla/...'
Hope this helps, good luck :)
Use an output path of the form:
wasb[s]://<BlobStorageContainerName>@<StorageAccountName>.blob.core.windows.net/<path>
If your output path is /example/data/sample.log, then use:
wasb://mycontainer@mystorageaccount.blob.core.windows.net/example/data/sample.log
or, for the default container:
wasb:///example/data/sample.log
I hope this may help you. :-)
It seems that Pig prevents us from reusing an output directory. In that case, I want to write a Pig UDF that will accept a filename as parameter, open the file within the UDF and append the contents to the already existing file at the location. Is this possible?
Thanks in advance
It may be possible, but I don't know that it's advisable. Why not just use a new output directory each time? For example, if ultimately you want all your results in /path/to/results, STORE the output of the first run into /path/to/results/001, the next run into /path/to/results/002, and so on. This way you can easily identify bad data from any failed jobs, and if you want all of it together, you can just do hdfs dfs -cat /path/to/results/*/*.
If you don't actually want to append but instead want to just replace the existing contents, you can use Pig's RMF shell command:
%declare output '/path/to/results'
RMF $output
STORE results INTO '$output';
I'm new to programming Pig, and currently I'm trying to implement my Hadoop jobs with Pig.
So far my Pig programs work. I've got some output files stored as *.txt with a semicolon as the delimiter.
My problem is that Pig adds parentheses around the tuples...
Is it possible to store the output in a file without these parentheses, storing only the values? Maybe by overriding PigStorage with a UDF?
Does anyone have a hint for me?
I want to read my output files into a RDBMS (Oracle) without the parentheses.
You probably need to write your own custom Storer. See: http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
It shouldn't be too difficult to just write it out as plain CSV or whatever. There's also a pre-existing DBStorage class that you might be able to use to write directly to Oracle if you want.
For people who find this topic first, the question is answered here:
Remove brackets and commas in output from Pig
Use the FLATTEN operator in your script like this:
output = FOREACH [variable] GENERATE FLATTEN(($1, $2, $3));
STORE output INTO '[path]' USING PigStorage(',');
Notice the second set of parentheses around the fields you want to flatten.