PIG : how to separate data by positions in a single line - hadoop

Normally, if there is a delimiter in a line, we do:
A = load 'pigtest.txt' using PigStorage(',') as (year:int, temp:float);
Below is a sample of the weather data, a single line.
0029029070999991901010106004+64333+023450FM12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
I need to extract the year, 1901 (4 characters starting at position 16), and the temperature (4 characters starting at position 89), so that I can define my key and value.
I also need to trim leading zeroes from the temperature.
Thanks in advance.

Yes, you can use FixedWidthLoader from piggybank to extract specific positions from the input data. Download piggybank.jar and try the approach below.
Input:
0029029070999991901010106004+64333+023450FM12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
PigScript:
REGISTER /tmp/piggybank.jar;
A = LOAD 'input' USING org.apache.pig.piggybank.storage.FixedWidthLoader('16-19,89-92') AS(year:int,temp:float);
DUMP A;
Output:
(1901,781.0)
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/FixedWidthLoader.html
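To make the column ranges concrete, here is a minimal Python sketch of what the spec '16-19,89-92' selects, assuming (as the output above suggests) that the ranges are 1-based and inclusive. The helper names parse_ranges and extract are hypothetical and are not part of Pig or piggybank.

```python
# Illustrative sketch only: mimics 1-based, inclusive column ranges such as
# '16-19,89-92'. parse_ranges and extract are hypothetical helper names.
LINE = "0029029070999991901010106004+64333+023450FM12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999"


def parse_ranges(spec):
    """Turn '16-19,89-92' into 0-based (start, end) slice bounds."""
    bounds = []
    for part in spec.split(","):
        first, last = (int(n) for n in part.split("-"))
        bounds.append((first - 1, last))  # 1-based inclusive -> Python slice
    return bounds


def extract(line, spec):
    """Return the substring for each column range in spec."""
    return [line[start:end] for start, end in parse_ranges(spec)]


year_text, temp_text = extract(LINE, "16-19,89-92")
year = int(year_text)
temp = float(temp_text)  # numeric conversion drops the leading zero
print((year, temp))
```

Converting the extracted text to a number is what removes the leading zeroes, which is also why casting to int or float in the AS clause is enough on the Pig side.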


How to transform one column into many columns of a matrix using Fortran 90

I have one column of data (im = 160648 rows, jm = 1 column). I want to transform it into a matrix of size im = 344 by jm = 467.
My program code is:
program matrix
parameter (im=160648, jm=1)
dimension h(im,jm)
integer::h
open (1,file="Hasil.txt", status='old')
open (2,file="HasilNN.txt", status='unknown')
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
do i=1,im
write(2,33)(h(i,j),j=1,jm)
end do
33 format(1x, 344f10.6)
end program matrix
An error appears at the line read(1,*)(h(i,j)),j=1,jm).
The data are floating-point values.
Your read loop is:
do i=1,jm
read(1,*)(h(i,j)),j=1,jm)
end do
Shouldn't do i=1,jm be do i=1,im?
This would imply there are "im" records (lines) in the formatted text file Hasil.txt, which your question suggests.
read(1,*)(h(i,j)),j=1,jm) implies each record (line of text) has "jm" values, which is 1 value per line. Is this what the file looks like? (Any blank lines will be skipped by this read (lu,*) ... statement.)
You appear to want to write this information to another file, HasilNN.txt, using 33 format (1x, 344f10.6), which suggests 3441 characters per line, although your write statement will write only 1 value per line (as jm=1). This would be a very long line for a text file and probably difficult to manage outside the program. If you did wish to do this, you could achieve it with an implied do loop, such as:
write(2,33) ((h(i,j),j=1,jm),I=1,im)
A few comments:
Using jm = 1 implies each row has only one value, which could equivalently be represented as a 1-D vector, "dimension h(im)", removing the need for j.
File unit numbers 1 and 2 are reserved for screen/keyboard on some compilers (5 and 6 are the more usual preconnected units). You would be safer using units 11 and 12.
When devising this code, you need to address the record structure in the 2 files, as a simple vector could be used. You can control the line length with the format. A format of (1x,8f10.6) would create a record of 81 characters, which would be much easier to manage.
Format descriptor f10.6 also limits the range of values you can manage in the files. Values >= 1000 or <= -100 will overflow this format, while values smaller in magnitude than about 1.e-6 will be written as zero.
As @francescalus has noted, you have declared "h" as integer, but use a real format descriptor. This will produce an "Error: format-data mismatch", so the declaration has to be changed to match what is expected in the file.
You should consider what you wish to achieve and adjust the code.
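The record-structure point can be illustrated with a short Python sketch (not Fortran): it splits a flat list of values into fixed-length rows and formats each row the way a (1x, Nf10.6) format would. The names to_rows and format_row and the stand-in data are hypothetical; the row length of 4 is chosen only to keep the output short.

```python
# Illustrative sketch only: reshape a flat vector into rows and format them.
def to_rows(values, per_row):
    """Split a flat list into consecutive rows of per_row values each."""
    if len(values) % per_row != 0:
        raise ValueError("value count is not a multiple of the row length")
    return [values[i:i + per_row] for i in range(0, len(values), per_row)]


def format_row(row):
    # Layout comparable to Fortran's (1x, Nf10.6): one blank, then fields of
    # width 10. Unlike Fortran, Python widens a field instead of printing
    # asterisks when a value does not fit.
    return " " + "".join(f"{v:10.6f}" for v in row)


values = [0.5 * k for k in range(12)]  # stand-in data, not from Hasil.txt
rows = to_rows(values, 4)
for row in rows:
    print(format_row(row))

# The sizes in the question are consistent: 344 * 467 == 160648.
```

With 8 values per row, each formatted line would be 81 characters long, matching the (1x,8f10.6) record length mentioned above.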

Hadoop Pig Latin Tuples: How to pass them to UDFs?

My goal is to pass every field in the input to a UDF as follows:
A = LOAD './input/file1' USING PigStorage(' ') AS (f1:chararray, f2:chararray);
B = FOREACH A GENERATE com.mycompany.udf.FAKEUDF(tuple(*));
NOTE: I am using Cloudera's version 0.12.0-cdh5.0.0.
The above FOREACH is just one of my many attempts. I have seen examples like
...FAKEUDF(*)
And so forth.
The main question is, what is the correct syntax? And has the syntax changed from earlier versions?
Here is a link which shows the lone asterisk syntax:
Chapter 10: Writing Evaluation & Filter Functions
It depends on how you want to process the data. The arguments can be one or more column names, like FAKEUDF(column1,column2,....), or you can pass all the columns with *, like FAKEUDF(*), or you can specify the relation name. In the UDF, you have to extract the column values from the tuple, e.g. tuple.get(index). You have to be careful about what you send as the argument, because the processing depends on it. It can even be a DataBag.

How to print the maximum of multiple functions in Gnuplot

My question is pretty basic. I am plotting several functions at once using gnuplot, and I want to print out (in either a file or on the graph itself) the maximum y-values of every function. Any idea how I could do that?
I looked into STATS and GPVAL_DATA_Y_MAX but I can't really figure out how to make them work with several functions at the same time.
Without going into too much detail, let's suppose that my script looks like this:
plot 'file1.dat' us 1:2 title "file1" w lines,\
'file2.dat' us 1:2 title "file2" w lines,\
'file3.dat' us 1:2 title "file3" w lines
You can use the name option of the stats command to save the maximum of every file in a different set of variables:
stats 'file1.dat' using 2 nooutput name 'file1'
stats 'file2.dat' using 2 nooutput name 'file2'
stats 'file3.dat' using 2 nooutput name 'file3'
Now you can print the values to an external file:
set print 'max.dat'
print file1_max
print file2_max
print file3_max
If you want to place a respective label near the maximum in your graph, you must also know the corresponding x-value where the data has its maximum. This data is not readily available from the first stats command, only its index in the data file. So you need an additional call to stats in order to get the x-value where the maximum y-value was:
stats 'file1.dat' using 1 every ::file1_index_max::file1_index_max name 'file1_x'
...
And then you can use
set label center at first file1_x_max,first file1_max sprintf('y = %.2f', file1_max) offset char 0,1
Unfortunately, most of the commands cannot be iterated properly with changing variable names.
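Because the per-file variable names are awkward to iterate inside gnuplot, one alternative is to compute the maxima outside gnuplot. The following Python sketch is a different technique from stats, not a gnuplot feature: it finds, for each dataset, the point with the largest y-value. The function name max_point and the in-memory stand-in data are assumptions; in practice each list would be read from file1.dat and so on.

```python
# Illustrative sketch only: find the (x, y) point with the largest y
# for several datasets in one loop.
def max_point(points):
    """Return the (x, y) pair with the largest y value."""
    return max(points, key=lambda p: p[1])


# Stand-in data; real data would come from the two-column .dat files.
datasets = {
    "file1": [(0.0, 1.0), (1.0, 3.5), (2.0, 2.0)],
    "file2": [(0.0, -1.0), (1.0, -0.5), (2.0, -4.0)],
}

maxima = {name: max_point(points) for name, points in datasets.items()}
for name, (x, y) in maxima.items():
    print(f"{name}: y_max = {y:.2f} at x = {x:.2f}")
```

The printed lines could be written to a file and then used to build the set label commands.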

PIG Latin : While loading how to discard the first line in any file?

I've been using Pig for some time and wanted to know how to skip the first line while loading a file. I have a file which has headers, so I should ignore the first line and go on to the next line to do the processing on the date columns and so on. How do I go about this?
Thanks
If you have Pig version 0.11 or later, you could try this:
input_file = load 'input' USING PigStorage(',') as (row1:chararray, row2:chararray);
ranked = rank input_file;
NoHeader = Filter ranked by (rank_input_file > 1);
New_input_file = foreach NoHeader generate row1, row2;
New_input_file should then hold your data without the header. Note that the rank operator is new in Pig 0.11, so this will not work with earlier versions.
EDIT: note that this solution only works with a single file; if you are loading a directory instead, try something else.
The given solution will work well if you just load in 1 file. However, if you load in all files in a directory (which is also possible by simply making sure that input is a directory path), the given solution will only cut off the top for the first file.
To remove the header in each file, you will probably want to use CSVExcelStorage from piggybank:
my_input = load 'inputfileordir' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'default', 'NOCHANGE', 'SKIP_INPUT_HEADER');
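The difference between the two approaches can be modelled in a few lines of Python. This is a simplified sketch, not Pig: it treats a directory load as plain concatenation of the files, and the function names and sample rows are hypothetical.

```python
# Illustrative sketch only: dropping the first row overall versus dropping
# the first row of every file.
def drop_first_overall(files):
    """Concatenate all files, then drop only the very first line."""
    merged = [line for lines in files for line in lines]
    return merged[1:]


def drop_first_per_file(files):
    """Drop the first line of each file before concatenating."""
    return [line for lines in files for line in lines[1:]]


files = [
    ["id,name", "1,a", "2,b"],  # first file, with header
    ["id,name", "3,c"],         # second file, with header
]

print(drop_first_overall(files))   # second header survives
print(drop_first_per_file(files))  # all headers removed
```

The first function corresponds to the rank-based filter applied to a whole directory, where the header of the second file remains in the data; the second corresponds to per-file header skipping.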

Apache Pig - Is it possible to serialize a variable?

Let's take the wordCount example:
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
bag_words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
Is it possible to serialize the "bag_words" variable so that we don't have to rebuild the entire bag each time we want to execute the script?
Thanks.
STORE bag_words INTO 'some-output-directory';
Then read it in later to skip the foreach generate, flatten, tokenize.
You can output any alias in pig using the STORE command: you could use standard formats (like CSV) or write your own PigLoader class to implement any specific behaviour. You can then LOAD this output in a separate script, thus bypassing the initial LOAD.
