How to structure unstructured data using Apache Pig (Hadoop)

I have a file containing the following lines:
3124,"hello...",ku4
3125,"hello,hi",ab2
I want to load the file so that it has three columns. I used PigStorage(','), but it also splits "hello,hi" into two fields. I want it kept as a single field.
How can I achieve this?

You can write your own custom load UDF, or use CSVLoader from piggybank.jar; it splits on commas but keeps quoted fields such as "hello,hi" intact.
-- Get a piggybank.jar that is compatible with your Pig version and REGISTER
-- it in your script by pointing to the location of the jar file
REGISTER piggybank.jar;
A = LOAD 'test.txt' USING org.apache.pig.piggybank.storage.CSVLoader() AS (f1:int, f2:chararray, f3:chararray);
B = FOREACH A GENERATE f1, f2, f3;
DUMP B;
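With the sample input above, DUMP B should print something like this (untested; this is what a quote-aware loader is expected to produce):
(3124,hello...,ku4)
(3125,hello,hi,ab2)
Note that f2 in the second tuple holds the whole string hello,hi; Pig's tuple display just makes the embedded comma hard to see.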

Related

How to merge CSV files in Hadoop?

I am new to the Hadoop framework and I would like to merge 4 CSV files into a single file.
All four CSV files have the same headers, and the column order is also the same.
I don't think Pig STORE offers such a feature.
You could use Spark's coalesce(1) function; however, there is little reason to do this, as almost all Hadoop processing tools prefer to read directories, not files.
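For illustration, here is a minimal Spark (Java) sketch of that approach; the application name and the input/output paths are placeholders, not anything from the original question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MergeCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("MergeCsv").getOrCreate();
        // read every CSV file in the input directory as one dataset
        Dataset<Row> df = spark.read().option("header", "true").csv("hdfs:///input/csvdir");
        // coalesce(1) forces a single partition, i.e. a single output part file
        df.coalesce(1).write().option("header", "true").csv("hdfs:///output/merged");
        spark.stop();
    }
}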
Ideally you should not be storing raw CSV in Hadoop for very long anyway; convert it to a columnar format such as ORC or Parquet. In particular, if you are already reading CSV to begin with, do not output CSV again.
If the idea is to produce one CSV to later download, then I would suggest using Hive + Beeline. This will store the result in a file on the local file system:
beeline -u 'jdbc:hive2://[databaseaddress]' --outputformat=csv2 -f yourSQLFile.sql > theFileWhereToStoreTheData.csv
Try using the getmerge utility to merge the CSV files.
For example, suppose EMP_FILE1.csv, EMP_FILE2.csv and EMP_FILE3.csv are placed at some location on HDFS. You can merge all these files into one; note that getmerge writes the merged file to the local filesystem, not to HDFS, so follow up with hadoop fs -put if you need it back on the cluster.
hadoop fs -getmerge /hdfsfilelocation/EMP_FILE* /localfilelocation/MERGED_EMP_FILE.csv

Apache Pig: How to concat strings in load function?

I am new to Pig and I want to use Pig to load data from a path. The path is dynamic and is stored in a txt file, say pigInputPath.txt.
In the Pig script, I plan to do the following:
First load the path using:
InputPath = Load 'pigInputPath.txt' USING PigStorage();
Second load data from the path using:
Data = Load 'someprefix' + InputPath + 'somepostfix' USING PigStorage();
But this does not work. I also tried CONCAT, but it gives me an error too. Can someone help me with this? Thanks a lot!
First, find a way to pass your input path as a parameter. (References: Hadoop Pig: Passing Command Line Arguments, https://wiki.apache.org/pig/ParameterSubstitution)
Let's say you invoke your script as pig -f script.pig -param inputPath=blah
You could then LOAD from that path with the required prefix and postfix as follows:
Data = LOAD 'someprefix$inputPath/somepostfix' USING PigStorage();
The catch for the somepostfix string is that it needs to be separated from the parameter by a / or some other special character, so that Pig knows the string is not part of the parameter name.
One option to avoid using special characters is to do the following:
%default prefix 'someprefix'
%default postfix 'somepostfix'
Data = LOAD '$prefix$inputPath$postfix' USING PigStorage();

Export data to CSV using Hive SQL

How do I export a Hive table/select query to CSV? I have tried the command below, but it creates the output as multiple files. Are there any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1, field2, field3 FROM table1
Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, then add ORDER BY to force a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1, field2, field3 FROM table1 ORDER BY field1;
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
Also you can try to merge files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
Also, you can concatenate the files using cat after getting them from Hadoop:
hadoop fs -cat /hdfspath/* > some.csv
This collects the output in one local file.
If you want a header, then you can use sed along with Hive. See this link, which discusses various options for exporting Hive to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229

How to load multiple text files in a folder in Pig using the LOAD command?

I have been using this to load one text file:
A = LOAD '1try.txt' USING PigStorage(' ') as (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
You can use folder name instead of file name, like this:
A = LOAD 'myfolder' USING PigStorage(' ')
AS (c1:chararray,c2:chararray,c3:chararray,c4:chararray);
Pig will load all files in the specified folder, as stated in Programming Pig:
When specifying a “file” to read from HDFS, you can specify directories. In this case, Pig will find all files under the directory you specify and use them as input for that load statement. So, if you had a directory input with two datafiles today and yesterday under it, and you specified input as your file to load, Pig will read both today and yesterday as input. If the directory you specify has other directories, files in those directories will be included as well.
Here is the link to the official Pig documentation, which indicates that you can use the LOAD statement to load all the files in a directory:
http://pig.apache.org/docs/r0.14.0/basic.html#load
Syntax: LOAD 'data' [USING function] [AS schema];
Where: 'data': The name of the file or directory, in single quotes. If you specify a directory name, all the files in the directory are loaded.
data = LOAD '/FOLDER/PATH' USING PigStorage(' ') AS (<name>:<type>, ...);
or, if the data lives in HBase rather than in a folder (HBaseStorage reads from an HBase table and needs a column list):
data = LOAD 'hbase://table_name' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:col1 cf:col2');

How to transfer files between machines in Hadoop and search for a string using Pig

I have 2 questions:
I have a big file of records, a few million of them. I need to transfer this file from one machine to a Hadoop cluster machine. I guess there is no scp command in Hadoop (or is there?). How do I transfer files to the Hadoop machine?
Also, once the file is on my Hadoop cluster, I want to search for records which contain a specific string, say 'XYZTechnologies'. How do I do this in Pig? Some sample code would be great to give me a head start.
This is the very first time I am working with Hadoop/Pig, so please pardon me if it is a "too basic" question.
EDIT 1
I tried what Jagaran suggested and I got the following error:
2012-03-18 04:12:55,655 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Encountered " "(" "( "" at line 3, column 26.
Was expecting:
<QUOTEDSTRING> ...
Also, please note that I want to search for the string anywhere in the record, so I am reading the tab-separated record as one single column:
A = load '/user/abc/part-00000' using PigStorage('\n') AS (Y:chararray);
For your first question, I think that Guy has already answered it.
As for the second question, if you just want to search for records which contain a specific string, a bash script is better, but if you insist on Pig, this is what I suggest:
REGISTER myudfs.jar; -- the jar holding your Contains UDF (the jar name is illustrative)
A = LOAD '/user/abc/' USING PigStorage(',') AS (Y:chararray);
B = FILTER A BY Contains(Y, 'XYZTechnologies');
STORE B INTO 'output' USING PigStorage();
Keep in mind that PigStorage's default delimiter is tab, so pick a delimiter that does not appear in your file.
Then you should write a UDF that returns a Boolean for Contains, something like:
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class Contains extends EvalFunc<Boolean> {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        // true when the first argument contains the second as a substring
        return input.get(0).toString().contains(input.get(1).toString());
    }
}
I didn't test this, but this is the direction I would have tried.
For copying to Hadoop:
1. You can install the Hadoop client on the other machine and then run
hadoop fs -copyFromLocal from the command line.
2. You could simply write Java code that uses the FileSystem API to copy to Hadoop.
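A minimal sketch of option 2, assuming the cluster configuration (core-site.xml) is on the classpath; the local and HDFS paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        // copyFromLocalFile(delSrc, overwrite, src, dst): copy a local file into HDFS
        fs.copyFromLocalFile(false, true,
                new Path("/local/path/records.txt"),  // placeholder local path
                new Path("/user/abc/records.txt"));   // placeholder HDFS path
        fs.close();
    }
}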
For Pig, assuming you know that field 2 may contain XYZTechnologies:
A = load '<input-hadoop-dir>' using PigStorage() as (X:chararray,Y:chararray);
-- There should not be "(" and ")" after 'matches'
B = FILTER A BY Y MATCHES '.*XYZTechnologies.*';
STORE B INTO '<hadoop-output-path>' USING PigStorage();
Hi, you can use hadoop fs -text together with grep to find a specific string in the file.
For example, my file contains the following data:
Hi myself xyz. i like hadoop.
hadoop is good.
i am practicing.
So the Hadoop command is:
hadoop fs -text 'file name with path' | grep 'string to be found out'
Pig shell:
-- load the file data into a Pig relation
data = LOAD 'file with path' USING PigStorage() AS (text:chararray);
-- find the required text
txt = FILTER data BY ($0 MATCHES '.*string to be found out.*');
-- display the data
DUMP txt; -- or use ILLUSTRATE txt;
-- store it in another file
STORE txt INTO 'path' USING PigStorage();
