Pig removing parentheses when storing output

I'm new to Pig and am currently trying to implement my Hadoop jobs with it.
So far my Pig programs work. The output files are stored as *.txt with a semicolon as the delimiter.
My problem is that Pig adds parentheses around the tuple's...
Is it possible to store the output in a file without these parentheses, storing only the values? Maybe by overriding the PigStorage method with a UDF?
Does anyone have a hint for me?
I want to load my output files into an RDBMS (Oracle) without the parentheses.

You probably need to write your own custom Storer. See: http://wiki.apache.org/pig/Pig070LoadStoreHowTo.
Shouldn't be too difficult to just write it as a plain CSV or whatever. There's also a pre-existing DBStorage class that you might be able to use to write directly to Oracle if you want.

For people who find this topic first, the question is answered here:
Remove brackets and commas in output from Pig
Use the FLATTEN operator in your script like this:
flattened = FOREACH [variable] GENERATE FLATTEN(($1, $2, $3));
STORE flattened INTO '[path]' USING PigStorage(',');
Note the second set of parentheses around the fields you want to flatten.

Related

How to combine two lines in pig based on the given format?

I am trying to process a file. As of now I am getting the output shown below.
Input file:
c=1,2,3
a,b,c,d,a
d,e,f
g,h,i,i
c=2,3,4
j,k,l
m,n,a,h
c=3,2,5
d,g,a
s,fs,a
Expected output:
c=1,2,3,a,b,c,d,a
c=1,2,3,d,e,f
c=1,2,3,g,h,i,i
c=2,3,4,j,k,l
c=2,3,4,m,n,a,h
c=3,2,5,d,g,a
c=3,2,5,s,fs,a
Is there any other way to get output like the following?
Another output format:
c=1,2,3,{(a,b,c,d,a),(d,e,f),(g,h,i,i)}
c=2,3,4,{(j,k,l),(m,n,a,h)}
c=3,2,5,{(d,g,a),(s,fs,a)}
Could someone help me? I am trying with Pig but am nowhere close to this. I want to solve this problem with Pig to get some practice.
Thanks & Regards,
Ankush Reddy

I don't think it's possible with Pig. Pig processes records in parallel, so it cannot rely on the order of the records in the file. I suggest you pre-process the file with a bash script or another tool before processing it with Pig.
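The pre-processing step suggested above could be sketched in Ruby as follows. This is only an illustration: it assumes that a header is any line starting with "c=" and that every other non-empty line belongs to the most recent header. The method name attach_headers is hypothetical.

```ruby
# Prefix every data line with the most recent header line.
# Assumption: a header is any line that starts with "c=".
def attach_headers(lines)
  current_header = nil
  lines.each_with_object([]) do |line, out|
    line = line.chomp
    next if line.empty?

    if line.start_with?("c=")
      current_header = line            # remember the header for later lines
    elsif current_header
      out << "#{current_header},#{line}"
    end
  end
end

# A shortened version of the input from the question.
sample = <<~DATA.lines
  c=1,2,3
  a,b,c,d,a
  d,e,f
  c=2,3,4
  j,k,l
DATA

puts attach_headers(sample)
```

The resulting lines carry their header with them, so Pig can load them in any order afterwards.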

How to Delete an entry from MapFile in Hadoop

Is there any way to delete an entry from a MapFile in Hadoop? I am able to read and write entries to a MapFile, but I don't know how to delete or update an entry. Is there a good solution for this? Any help is appreciated. Thanks in advance.

HDFS is basically designed for data-warehousing-style workloads. You cannot modify the existing content of an HDFS file; at most you can append new content at the end of the file.
You can refer to a similar question.

Suppose the file contains the 2 lines below:
hi hello world
this is fine
Now, in the mapper, write logic that selects each string containing "hello" and passes it to the reducer phase.
The reducer output will then contain only "hi hello world".
If you want anything other than this, please specify a short use case.
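The filtering logic described above can be sketched in Ruby as a mapper-style function. This is a minimal illustration, not a full MapReduce job: the method name keep_matching is hypothetical, and running it through something like Hadoop Streaming is an assumption.

```ruby
# Keep only the lines that contain the keyword, as the mapper would.
def keep_matching(lines, keyword)
  lines.map(&:chomp).select { |line| line.include?(keyword) }
end

# The two lines from the example above.
input = ["hi hello world\n", "this is fine\n"]
puts keep_matching(input, "hello")
```

Because HDFS files cannot be edited in place, the same pattern with the condition inverted would write every entry except the unwanted one to a new file.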

PIG - LOAD continue on error

New to pig.
I'm loading data into a relation like so:
raw_data = LOAD '$input_path/abc/def.*';
It works great, but if it can't find any files matching def.* the entire script fails.
Is there a way to continue with the rest of the script when there are no matches, and just produce an empty set?
I tried:
raw_data = LOAD '$input_path/abc/def.*' ONERROR Ignore();
But that doesn't parse.

You could write a custom load UDF that returns either the file or an empty tuple.
http://wiki.apache.org/pig/UDFManual

No, there is no such feature, at least none that I've heard of.
Also, I would say that "producing an empty set" amounts to "not running the script at all".
If you don't want to run a Pig script under some circumstances, I recommend using wrapper shell scripts or Pig embedding:
http://pig.apache.org/docs/r0.11.1/cont.html
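The wrapper idea could be sketched in Ruby instead of shell. This is a rough sketch under stated assumptions: the input lives on HDFS, `hadoop fs -ls` exits with a non-zero status when the glob matches nothing, and the script name and argument layout are hypothetical. The command runner is injectable so the logic can be tried without a cluster.

```ruby
# Run a Pig script only when the input glob matches at least one file.
# Assumption: `hadoop fs -ls GLOB` exits non-zero when nothing matches.
DEFAULT_RUNNER = ->(*command) { system(*command) }

def input_present?(glob, runner: DEFAULT_RUNNER)
  runner.call("hadoop", "fs", "-ls", glob) == true
end

# Returns :ran, :failed or :skipped so the caller can decide what to do next.
def run_pig_if_input(input_path, script, runner: DEFAULT_RUNNER)
  glob = "#{input_path}/abc/def.*"
  return :skipped unless input_present?(glob, runner: runner)

  # -param passes input_path to the script for '$input_path' substitution.
  runner.call("pig", "-param", "input_path=#{input_path}", script) ? :ran : :failed
end

# Hypothetical command line: ruby run_pig.rb /data/in load_def.pig
if ARGV.size == 2
  status = run_pig_if_input(ARGV[0], ARGV[1])
  puts "pig job #{status}"
end
```

Skipping the job in the wrapper keeps the Pig script itself unchanged, which matches the point that an empty input is equivalent to not running the script.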

storing a file in an already occupied location in Pig

It seems that Pig prevents us from reusing an output directory. In that case, I want to write a Pig UDF that will accept a filename as parameter, open the file within the UDF and append the contents to the already existing file at the location. Is this possible?
Thanks in advance

It may be possible, but I don't know that it's advisable. Why not just use a new output directory? For example, if ultimately you want all your results in /path/to/results, STORE the output of the first run into /path/to/results/001, the next run into /path/to/results/002, and so on. This way you can easily identify bad data from any failed jobs, and if you want all of it together, you can just run hdfs dfs -cat /path/to/results/*/*.
If you don't actually want to append but instead want to replace the existing contents, you can use Pig's rmf command:
%declare output '/path/to/results';
rmf $output;
STORE results INTO '$output';

Parsing text files and sorting in Ruby?

I would like to write a Ruby program which can parse three separate text files, each containing different delimiters, then sort them according to certain criteria.
Can someone please point me in the right direction?

It is not clear what the data format of your files is or what criteria you want to sort by, so I can't give you an accurate answer.
However, you might need something like this:
File.read("file_name").split(",").sort_by { |x| x.length }
This:
Reads the whole file into a string using File.read. You can also read the file line by line using File.foreach.
Splits the string using split. The delimiter used here is a comma.
Uses sort_by to sort the pieces according to the criterion given in the block.

Enumerable#sort_by lets you sort an array (or any other enumerable object) by a key that you compute for each element in a block.

If by "text files with delimiters" you mean CSV files (comma-separated, or more generally character-separated, values), then you can use the csv library, which ships with Ruby, to parse them. CSV gives you objects that look and feel like Ruby Hashes and Arrays, so you can use all the standard Ruby methods for sorting, filtering and iterating, including the aforementioned Enumerable#sort_by.
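As a rough sketch of that approach, the example below parses three inputs that use different delimiters and sorts the combined rows. The sample data, the field order (last name, first name, year) and the sort criteria are all assumptions made for illustration, since the question does not specify them. On Ruby 3.4 and later, csv is a bundled gem rather than a default one, so it may need to be listed in a Gemfile.

```ruby
require "csv"

# Hypothetical sample data: the same three fields, three different delimiters.
# In practice each string would come from File.read on one of the files.
SOURCES = {
  "," => "Smith,Anna,1990\nJones,Bob,1985\n",
  "|" => "Lee|Cara|1992\n",
  " " => "Diaz Evan 1979\n"
}.freeze

# Parse every source with its own delimiter and merge the rows.
def parse_all(sources)
  sources.flat_map do |delimiter, text|
    CSV.parse(text, col_sep: delimiter)
  end
end

rows = parse_all(SOURCES)

# Sort by year first, then by last name (assumed criteria).
sorted = rows.sort_by { |last, _first, year| [year.to_i, last] }
sorted.each { |row| puts row.join(", ") }
```

Returning an array from the sort_by block gives a multi-level sort, because Ruby compares arrays element by element.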
