I need some guidance with a simple task: creating a schema in Apache Pig for my data file. I have two files that contribute to this task. The first file is a data file that contains the data with no column header, and the second file contains the column headers for the data file. So basically, the column_header file is the schema for the data file. How do I outline this in a Pig script? Here's what I have so far.
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as column_header;
schema = foreach data generate column_header;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
This is the output of
DUMP column_header;
(accept_language,browser,browser_height,browser_width)
When I do
DUMP data;
only the first column of data is output, which is wrong:
en-US
en-US
en-US
en-US
Instead, it should be:
en-US 638 755 1600
en-US 638 655 1342
en-US 638 723 1612
en-US 638 231 1234
How can I trick Pig into using "column_header" as a string that can be used in the PigStorage AS clause on the second line of code?
Edit:
This code will work, but instead of hard-coding my column headers I would like the Pig script to read them from the file.
column_header = load 'sitecatalyst/coulmn_headers.tsv' using PigStorage('\t');
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as (accept_language,browser,browser_height,browser_width);
schema = foreach data generate accept_language,browser,browser_height,browser_width;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
You cannot achieve this kind of parameterization from within the Pig script directly, but you can do the same thing with parameter substitution:
data = load 'sitecatalyst/hit_data.tsv' using PigStorage('\t') as $column_header;
schema = foreach data generate *;
store schema into 'output1' using PigStorage('\t', '-schema');
withSchema = load 'output1';
describe withSchema;
and run the Pig script with:
pig -param_file <path to param file> <script.pig>
The param file should be of the format:
column_header = <complete schema>
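For example (a minimal sketch; params.txt and myscript.pig are placeholder names, and the schema value is copied from the DUMP of column_header above), the param file could contain:
column_header = (accept_language,browser,browser_height,browser_width)
and the script would then be run with:
pig -param_file params.txt myscript.pig
Parameter substitution is purely textual, so after expansion the LOAD statement becomes equivalent to the hard-coded version shown in the edit above.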
https://blogs.msdn.microsoft.com/bigdatasupport/2014/08/12/how-to-use-parameter-substitution-with-pig-latin-and-powershell/
Related
I get an error in Pig Latin when I run the basic command DUMP Students after loading the file with:
Students = LOAD 'C:\\Users\\avtar\\OneDrive\\Desktop\\student.txt' USING PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
I'm using Flume to collect tweets and store them on HDFS.
The collecting part is working fine, and I can find all my tweets in my file system.
Now I would like to combine all these tweets into one single file.
The problem is how the different tweets are stored: they end up spread across many small files that each sit in a 128 MB block but only use a few KB, which is normal behaviour for HDFS (correct me if I'm wrong).
However, how can I get all the different tweets into one file?
Here is my conf file, which I run with the following command:
flume-ng agent -n TwitterAgent -f ./my-flume-files/twitter-stream-tvseries.conf
twitter-stream-tvseries.conf:
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource
TwitterAgent.sources.Twitter.consumerKey=hidden
TwitterAgent.sources.Twitter.consumerSecret=hidden
TwitterAgent.sources.Twitter.accessToken=hidden
TwitterAgent.sources.Twitter.accessTokenSecret=hidden
TwitterAgent.sources.Twitter.keywords=GoT, GameofThrones
TwitterAgent.sinks.HDFS.channel=MemChannel
TwitterAgent.sinks.HDFS.type=hdfs
TwitterAgent.sinks.HDFS.hdfs.path=hdfs://ip-address:8020/user/root/data/twitter/tvseries/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType=DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat=Text
TwitterAgent.sinks.HDFS.hdfs.batchSize=1000
TwitterAgent.sinks.HDFS.hdfs.rollSize=0
TwitterAgent.sinks.HDFS.hdfs.rollCount=10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval=600
TwitterAgent.channels.MemChannel.type=memory
TwitterAgent.channels.MemChannel.capacity=10000
TwitterAgent.channels.MemChannel.transactionCapacity=1000
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
You can configure the HDFS sink to roll files based on time, event count, or size. So, if you want to keep writing events to the same file until a 120 MB limit is reached, set:
hdfs.rollInterval = 0   # disable rolling based on time
hdfs.rollSize = 125829120   # roll a new file once it reaches 120 MB
hdfs.rollCount = 0   # disable rolling based on event count (different tweets in your case)
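Note that the inline # comments above are only illustrative: Java properties files (which Flume configs are) do not support end-of-line comments, so drop them when copying these settings into your conf file.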
You can use the following command to concatenate the files into a single file:
find . -type f -name 'FlumeData*' -exec cat {} + >> output.file
Or, if you want to store the data in Hive tables for later analysis, create an external table over the HDFS directory and query it from Hive.
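If the FlumeData files are still sitting in HDFS rather than on the local filesystem, an alternative (a rough sketch, with the HDFS path taken from the config above and the output name made up) is to merge them with getmerge:
hdfs dfs -getmerge /user/root/data/twitter/tvseries/tweets ./all_tweets
hdfs dfs -put ./all_tweets /user/root/data/twitter/tvseries/all_tweets   # optional: push the merged file back to HDFS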
My sequence files are stored directly in HDFS, e.g.:
grunt> ls
grunt> ls /blabla
hdfs://namenode1:54310/blabla/0411f03a-db7f-48d0-9542-5203304e3e81.seq<r 3> 185284523
hdfs://namenode1:54310/blabla/05be8fc0-e967-42e1-b76a-0d7108a69d17.seq<r 3> 201489688
hdfs://namenode1:54310/blabla/06222427-519c-49c0-bbbf-49a9f43bbd13.seq<r 3> 196858576
hdfs://namenode1:54310/blabla/066da26a-48da-45b1-83f5-60d16475e40d.seq<r 3> 194832641
hdfs://namenode1:54310/blabla/07cbfc83-42a2-47bf-b364-d39da3a2d071.seq<r 3> 194806047
hdfs://namenode1:54310/blabla/10dea7b8-9ed3-4e66-b4bd-a3c07d8bf39e.seq<r 3> 166224702
How can I create a Pig script that reads every file from the directory "blabla" and performs an action?
I've tried multiple ways of loading the input, but none of them worked:
%default INPUT '/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'hdfs://namenode1:54310/blabla/f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
%default INPUT 'f8fbbe9a-aae3-413f-b3b9-37cdef71da8f.seq'
I always get the error:
Input(s):
Failed to read data from "hdfs://namenode1:54310/........."
You can try reading the sequence files in one of these ways:
Pig SequenceFileLoader (from Piggybank):
A = LOAD 'hdfs://namenode1:54310/blabla/*' using org.apache.pig.piggybank.storage.SequenceFileLoader();
(Or) Using Elephant Bird:
REGISTER 'elephant-bird-pig-3.0.5.jar';
REGISTER 'elephant-bird-core-4.1.jar';
REGISTER 'elephant-bird-hadoop-compat-4.1.jar';
A = LOAD 'hdfs://namenode1:54310/blabla/*' using com.twitter.elephantbird.pig.load.SequenceFileLoader();
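Either way, a minimal end-to-end sketch could look like the following (the piggybank.jar path is a placeholder, and the key/value schema assumes the sequence files hold text-convertible Writables, so adjust it to your data):
REGISTER '/path/to/piggybank.jar';
%default INPUT 'hdfs://namenode1:54310/blabla/*.seq'
A = LOAD '$INPUT' USING org.apache.pig.piggybank.storage.SequenceFileLoader() AS (key:chararray, value:chararray);
-- perform whatever action you need; here we just sample a few records
B = LIMIT A 10;
DUMP B;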
Did you try this way?
%default INPUT 'hdfs://namenode1:54310/blabla/*'
It should work if your .seq files are readable. It looks like they are not, because your attempts should at least have loaded one file. Could you give the complete log line?
Maybe you will have to use Pig's SequenceFileLoader.
I have a log file in HDFS which needs to be parsed and put into an HBase table.
I want to do this using Pig.
How can I go about it? The Pig script should parse the logs and then put them into HBase.
The Pig script would be (assuming tab is the data separator in your log file):
A= load '/home/log.txt' using PigStorage('\t') as (one:chararray,two:chararray,three:chararray,four:chararray);
STORE A INTO 'hbase://table1' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('P:one,P:two,S:three,S:four');
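Two caveats, added here as general HBaseStorage behaviour rather than something from the original answer: HBaseStorage uses the first field of each tuple as the HBase row key (it is not written to any of the listed columns), so the relation usually needs a row-key field in front of the four data fields; and the target table with its column families must already exist before the STORE runs, for example created from the HBase shell:
create 'table1', 'P', 'S'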
I am using Hadoop 1.0.3 and Pig 0.11.0 on Ubuntu 12.04. In the part-m-00000 file in HDFS the content is as below:
training#BigDataVM:~/Installations/hadoop-1.0.3$ bin/hadoop fs -cat /user/training/user/part-m-00000
1,Praveen,20,India,M
2,Prajval,5,India,M
3,Prathibha,15,India,F
I am loading it into a bag and then filtering it as below.
Users1 = load '/user/training/user/part-m-00000' as (user_id, name, age:int, country, gender);
Fltrd = filter Users1 by age <= 16;
But when I dump Users1, 5 records are shown in the console, while dumping Fltrd fetches no records.
dump Fltrd;
The below warning is shown in the Pig console
2013-02-24 16:19:40,735 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Encountered Warning ACCESSING_NON_EXISTENT_FIELD 12 time(s).
Looks like I have made some simple mistake, but I couldn't figure out what it is. Please help me with this.
Since you haven't specified any load function, Pig uses PigStorage, whose default delimiter is '\t'. With a tab delimiter each comma-separated line ends up in the first field only, so age is null for every record and the filter matches nothing, which is what the ACCESSING_NON_EXISTENT_FIELD warning points at.
If part-m-00000 is a text file, try setting the delimiter to ',':
Users1 = load '/user/training/user/part-m-00000' using PigStorage(',')
as (user_id, name, age:int, country, gender);
If it's a SequenceFile then have a look at Dolan's or my answer on this question.
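For completeness, a minimal sketch of the corrected script (same path, schema, and filter as in the question):
Users1 = load '/user/training/user/part-m-00000' using PigStorage(',') as (user_id, name, age:int, country, gender);
Fltrd = filter Users1 by age <= 16;
dump Fltrd;
-- with the comma delimiter the age field is populated, so the rows with age <= 16 (Prajval and Prathibha in the sample data) are returned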