Load multiple files in pig - extended - hadoop

Please help me out; I have spent many hours on this.
I have files in a folder that I want to load in the order of their file names.
I even went to the extent of writing Java code to convert the file names to match the formats described in the following links:
Load multiple files in pig
Pig Latin: Load multiple files from a date range (part of the directory structure)
http://netezzaadmin.wordpress.com/2013/09/25/passing-parameters-to-pig-scripts/
I am using Pig 0.11.0.
In my script.pig,
set io.sort.mb 10;
REGISTER 'path_to/lib/pig/piggybank.jar';
data_ = LOAD '$input' USING org.apache.pig.piggybank.storage.XMLLoader('Data') AS (data_:chararray);
DUMP data_;
In shell
[root@servername currentfolder]# pig -x local script.pig -param input=/20131217/{1..10}.xml
Error returned:
[main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected error. Undefined parameter : input

I don't know why you are using input parameters.
For example, to load every file in the folder MyFolder/CurrentDate/ (in YYYYMMDD format), I use the following script:
%default DATE `date +%Y%m%d`;
x_basic_table = LOAD '/MyFolder/$DATE';
Nice day
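As an aside on the original "Undefined parameter" error (an assumption, based on how the Pig launcher parses its arguments): options such as -param generally need to appear before the script name, and the {1..10} range should be quoted so the shell does not brace-expand it into multiple arguments before Pig sees it. Pig's own glob syntax uses comma-separated alternatives inside braces, so the bash range is rewritten as an explicit list here:

```
pig -x local -param input='/20131217/{1,2,3,4,5,6,7,8,9,10}.xml' script.pig
```

Unquoted, `-param input=/20131217/{1..10}.xml` expands in bash to ten separate words, so Pig would only see the first one as the parameter value.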

Related

Apache Pig: How to concat strings in load function?

I am new to Pig and I want to use Pig to load data from a path. The path is dynamic and is stored in a txt file. Say we have a txt file called pigInputPath.txt
In the pig script, I plan to do the following:
First load the path using:
InputPath = Load 'pigInputPath.txt' USING PigStorage();
Second load data from the path using:
Data = Load 'someprefix' + InputPath + 'somepostfix' USING PigStorage();
But this does not work. I also tried CONCAT, but it gives me an error as well. Can someone help me with this? Thanks a lot!
First, find a way to pass your input path as a parameter. (References: Hadoop Pig: Passing Command Line Arguments, https://wiki.apache.org/pig/ParameterSubstitution)
Let's say you invoke your script as pig -f script.pig -param inputPath=blah
You could then LOAD from that path with required prefix and postfix as follows:
Data = LOAD 'someprefix$inputPath/somepostfix' USING PigStorage();
The catch with the somepostfix string is that it needs to be separated from the parameter by a / or another such special character, to tell Pig where the parameter name ends.
One option to avoid using special characters is by doing the following:
%default prefix 'someprefix'
%default postfix 'somepostfix'
Data = LOAD '$prefix$inputPath$postfix' USING PigStorage();
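To connect this back to the pigInputPath.txt idea in the question: since a Pig script cannot build a LOAD path from data it has loaded, a common workaround is a small shell wrapper that reads the path file and passes its contents as the parameter. A sketch (the sample path is made up; script.pig and pigInputPath.txt are taken from the question):

```shell
# hypothetical wrapper: read the dynamic path from pigInputPath.txt
# and hand it to the script as the $inputPath parameter
echo "/data/2013/12" > pigInputPath.txt   # sample content for illustration
INPUT_PATH=$(cat pigInputPath.txt)
# echo instead of running pig, so the sketch works without a cluster
echo "pig -f script.pig -param inputPath=$INPUT_PATH"
```

In a real run you would drop the echo and invoke pig directly.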

An error message occurs on the command prompt while generating the Dashboard from JMeter

I have started using JMeter 3.1 recently for load testing, all I wanted to do was generate a report dashboard from a csv file.
When I run the following command from Command Prompt:
jmeter -g (csv file location) -o (Destination folder to save HTML Dashboard)
I get the error shown below:
Could not parse timestamp <1.487+12> using format defined by property jmeter.save.saveservice.timestamp_format=ms on sample 1.487+12 ...
Below is my saveservice properties that I copied into user properties file:
jmeter.save.saveservice.bytes = true
jmeter.save.saveservice.label = true
jmeter.save.saveservice.latency = true
jmeter.save.saveservice.response_code = true
jmeter.save.saveservice.response_message = true
jmeter.save.saveservice.successful = true
jmeter.save.saveservice.thread_counts = true
jmeter.save.saveservice.thread_name = true
jmeter.save.saveservice.time = true
jmeter.save.saveservice.print_field_names=true
# the timestamp format must include the time and should include the date.
# For example the default, which is milliseconds since the epoch:
jmeter.save.saveservice.timestamp_format = ms
# Or the following would also be suitable
#jmeter.save.saveservice.timestamp_format = dd/MM/yyyy HH:mm
#save service assertion
jmeter.save.saveservice.assertion_results_failure_message = true
I am not able to figure out the reason; any help in this regard will be much appreciated.
Please help, and let me know if any additional information is required.
I have followed the below link to generate Dashboard:
http://jmeter.apache.org/usermanual/generating-dashboard.html
The answer is in your question itself:
Could not parse timestamp<1.487+12>
According to your configuration, JMeter expects first column to be in Unix timestamp format like 1487047932355 (time since beginning of Unix epoch in milliseconds)
Another supported format is yyyy/MM/dd HH:mm:ss.SSS like 2017/02/14 05:52:12.355
So there are several constraints:
The value of jmeter.save.saveservice.timestamp_format = ms should be the same during test execution and dashboard generation
You need to restart JMeter to pick the properties up. For example if you ran the test, then amended properties and then tried to generate dashboard - it might fail
There are no duplicate properties
You don't do anything with the .jtl results file between test execution and dashboard generation
My expectation is that you opened the .jtl results file with MS Excel, which converted the timestamps into scientific notation when it saved; most probably you will be able to reverse the conversion the same way.
Just in case, I would also recommend getting familiar with the Apache JMeter Properties Customization Guide.
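If Excel did mangle the first column into scientific notation, the damage may be partially reversible outside Excel. A hedged sketch (assuming a comma-separated .jtl with the timestamp in column 1; results.jtl and its row are made-up samples):

```shell
# sample row with a scientific-notation timestamp, as Excel might have saved it
printf '1.487E12,200,OK\n' > results.jtl
# rewrite column 1 back to a plain integer millisecond timestamp
awk -F',' 'BEGIN{OFS=","} {$1=sprintf("%.0f",$1); print}' results.jtl > fixed.jtl
cat fixed.jtl
```

This only helps if Excel preserved enough digits; a value like 1.487E12 has already lost precision relative to the original millisecond timestamp, so rerunning the test is the safer fix.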
The default timestamp format in JMeter CSV files and logs is a Unix-style timestamp, but you can change it:
Go to (jmeterDirectory)/bin.
Open the jmeter.properties file.
Search for the following:
jmeter.save.saveservice.timestamp_format
You will find it commented out (starting with #). Uncomment it and restart JMeter.
You can update this property with the format you need.

how to load multiple text files in a folder in pig using load command?

I have been using this for loading one text file
A = LOAD '1try.txt' USING PigStorage(' ') as (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
You can use folder name instead of file name, like this:
A = LOAD 'myfolder' USING PigStorage(' ')
AS (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
Pig will load all files in the specified folder, as stated in Programming Pig:
When specifying a “file” to read from HDFS, you can specify directories. In this case, Pig will find all files under the directory you specify and use them as input for that load statement. So, if you had a directory input with two datafiles today and yesterday under it, and you specified input as your file to load, Pig will read both today and yesterday as input. If the directory you specify has other directories, files in those directories will be included as well.
Here is the link to the official pig documentation that indicates that you can use the load statement to load all the files in a directory:
http://pig.apache.org/docs/r0.14.0/basic.html#load
Syntax: LOAD 'data' [USING function] [AS schema];
Where: 'data': The name of the file or directory, in single quotes. If you specify a directory name, all the files in the directory are loaded.
data = load '/FOLDER/PATH' using PigStorage(' ') AS (<name> <type>, ..);
or, with a different loader, for example:
data = load '/FOLDER/PATH' using TextLoader() AS (line:chararray);
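Beyond loading a whole directory, Pig's load paths also accept Hadoop glob patterns, so you can match just a subset of the files in a folder (the folder name and .txt extension here are illustrative):

```
A = LOAD 'myfolder/*.txt' USING PigStorage(' ')
    AS (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
```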

How to transfer files between machines in Hadoop and search for a string using Pig

I have 2 questions:
I have a big file of records, a few million ones. I need to transfer this file from one machine to a hadoop cluster machine. I guess there is no scp command in hadoop (or is there?) How to transfer files to the hadoop machine?
Also, once the file is on my hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head-start.
This is the very first time I am working on Hadoop/Pig. So please pardon me if it is a "too basic" question.
EDIT 1
I tried what Jagaran suggested and I got the following error:
2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
<QUOTEDSTRING> ...
Also, please note that, I want to search for the string anywhere in the record, so I am reading the tab separated record as one single column:
A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);
For your first question, I think that Guy has already answered it.
As for the second question: if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:
A = LOAD '/user/abc/' USING PigStorage(',') AS (Y:chararray);
B = FILTER A BY Contains(Y, 'XYZTechnologies');
STORE B INTO 'output' USING PigStorage();
Keep in mind that PigStorage's default delimiter is tab, so pick a delimiter that does not appear in your file.
Then you should write a UDF named Contains that returns a Boolean, something like:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // true if the first argument contains the second as a substring
        return input.get(0).toString().contains(input.get(1).toString());
    }
}
I didn't test this, but this is the direction I would have tried.
For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run
hadoop dfs -copyFromLocal from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop.
For Pig:
Assuming you know that field 2 may contain XYZTechnologies:
A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = Filter A by Y matches '.*XYZTechnologies.*';
STORE B INTO '<output-hadoop-dir>' USING PigStorage();
Hi, you can pipe hadoop fs -text through grep to find a specific string in a file.
For example, my file contains the following data:
Hi myself xyz. i like hadoop.
hadoop is good.
i am practicing.
so the hadoop command is
hadoop fs -text 'file name with path' | grep 'string to be found out'
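The same pipeline can be tried locally before running it against HDFS; using the three sample lines above (the file name sample.txt is illustrative):

```shell
# recreate the sample file from the answer
printf 'Hi myself xyz. i like hadoop.\nhadoop is good.\ni am practicing.\n' > sample.txt
# local equivalent of: hadoop fs -text <file> | grep 'hadoop'
grep 'hadoop' sample.txt
```

This prints the two lines containing "hadoop".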
Pig shell:
-- Load the file data into the pig variable
data = LOAD 'file with path' USING PigStorage() AS (text:chararray);
-- find the required text
txt = FILTER data BY ($0 MATCHES '.*string to be found out.*');
-- display the data
DUMP txt; -- or use ILLUSTRATE txt;
-- store it in another file
STORE txt INTO 'path' USING PigStorage();

Comma separated list with AvroStorage in Pig

I tried to load several files with AvroStorage in Pig by using a comma separated list. The statement I used is:
test_data= LOAD 'repo_1/part-r-00000.avro,repo_2/part-r-00000.avro' USING org.apache.pig.piggybank.storage.avro.AvroStorage();
Pig states that no input paths were specified in job. Please see the stacktrace below.
I tried Pig versions 0.8.1-cdh3u2 and 0.9.1.
Does anyone observe the same behavior? Is it a bug or a feature?
Stacktrace:
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: No input paths specified in job
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:282)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:186)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:270)
... 7 more
Those part files are loaded automatically by Pig, so you only need to specify the directory.
Try
test_file1 = LOAD 'repo_1' using AvroStorage();
test_file2 = LOAD 'repo_2' using AvroStorage();
test_file = UNION test_file1, test_file2;
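If you do need both part files in a single LOAD, a Hadoop glob may also work as an alternative to UNION (untested here, and subject to the AvroStorage path-handling quirks the question describes):

```
test_data = LOAD '{repo_1,repo_2}/part-r-00000.avro'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage();
```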
