Parsing Text File and Importing to a Table in HBase - hadoop

I am new to HBase. I have exported the table data to a text file, in text format, and each line looks like this:
72 6f 77 31 keyvalues={row1/cf:a/1444817478342/Put/vlen=6/ts=0}
I want to import the same data back into the table. I tried passing this file to the HBase Import tool, but it expects SequenceFile format. I also tried tweaking the import by changing the input format class to TextInputFormat, but it still does not work. Are there any guidelines for achieving this?

Instead of using export, you can use a Java program to upload the data.
Sample code:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataInsert {
    Configuration conf;
    HTable hTable;

    public HBaseDataInsert() throws IOException {
        conf = HBaseConfiguration.create();
        // HTable(conf, name) and Put.add(...) belong to the older client API;
        // newer clients use Connection.getTable(...) and Put.addColumn(...).
        hTable = new HTable(conf, "emp_java");
    }

    public void upload_transactionFile() throws IOException {
        String currentLine = null;
        BufferedReader br = new BufferedReader(
                new FileReader("transactionsFile.csv"));
        while ((currentLine = br.readLine()) != null) {
            System.out.println(currentLine);
            String[] line = currentLine.split(",");
            // Row key is built from the first two CSV columns.
            Put p = new Put(Bytes.toBytes(line[0] + "_" + line[1]));
            p.add(Bytes.toBytes("details"), Bytes.toBytes("Name"), Bytes.toBytes(line[0]));
            p.add(Bytes.toBytes("details"), Bytes.toBytes("id"), Bytes.toBytes(line[1]));
            p.add(Bytes.toBytes("details"), Bytes.toBytes("DATE"), Bytes.toBytes(line[2]));
            p.add(Bytes.toBytes("transaction details"), Bytes.toBytes("TRANSACTION_TYPE"), Bytes.toBytes(line[3]));
            hTable.put(p);
        }
        br.close();
        hTable.close();
    }
}

Export and Import work with SequenceFile dumps by default. If your requirement is just to load data from one table into another, assuming both have similar formats, you can use the commands below. The input and output directories are HDFS directories.
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir>
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import <tablename> <inputdir>
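For the text line shown in the question, the following is a minimal sketch of how its fields could be pulled apart in plain Java. It assumes every line has exactly the shape of the sample (hex row key bytes, then keyvalues={row/family:qualifier/timestamp/type/vlen=N/...}) and only reads the first cell; the class and field names are made up for illustration. Note that, judging from the sample, the dump carries only the value length (vlen), not the value bytes, so the cell values themselves may not be recoverable from such a file.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase: parses one line of the text dump.
class ExportLineParser {

    /** Fields recoverable from one dumped line (first cell only). */
    static final class Cell {
        final String rowKey, family, qualifier, type;
        final long timestamp;
        final int valueLength;

        Cell(String rowKey, String family, String qualifier,
             long timestamp, String type, int valueLength) {
            this.rowKey = rowKey;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
            this.type = type;
            this.valueLength = valueLength;
        }
    }

    // Assumed layout: "<hex bytes> keyvalues={row/family:qualifier/ts/type/vlen=N/...}"
    private static final Pattern LINE = Pattern.compile(
        "^((?:[0-9a-fA-F]{2}\\s+)+)keyvalues=\\{([^/]+)/([^:]+):([^/]*)/(\\d+)/(\\w+)/vlen=(\\d+).*\\}$");

    static Cell parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized line: " + line);
        }
        // Decode the space-separated hex bytes into the row key.
        String[] hex = m.group(1).trim().split("\\s+");
        byte[] key = new byte[hex.length];
        for (int i = 0; i < hex.length; i++) {
            key[i] = (byte) Integer.parseInt(hex[i], 16);
        }
        String rowKey = new String(key, StandardCharsets.UTF_8);
        return new Cell(rowKey, m.group(3), m.group(4),
                Long.parseLong(m.group(5)), m.group(6), Integer.parseInt(m.group(7)));
    }

    public static void main(String[] args) {
        Cell c = parse("72 6f 77 31 keyvalues={row1/cf:a/1444817478342/Put/vlen=6/ts=0}");
        System.out.println(c.rowKey + " " + c.family + ":" + c.qualifier + " @ " + c.timestamp);
    }
}
```

The parsed row key, family, qualifier, and timestamp could then feed a Put as in the sample code above, provided the values are obtained from some other source.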


Oracle DB Java Stored Procedure: line starting with hash sign #

I have inherited an old Oracle 12c database, which I'm running on a local Oracle 19c server. The database contains the following Java Stored Procedure
create or replace and compile java source named fill_share_content as
import java.io.*;
import java.sql.*;
public class Fill_Share_Content
{
public static void execute(String directory) throws SQLException
{
File path = new File( directory );
String[] list = path.list();
String separator = path.separator;
for(int i = 0; i < list.length; i++)
{
String filename = list[i];
File datei = new File( directory + separator + filename );
if ( datei.isFile() )
{
Timestamp filedate = new Timestamp( datei.lastModified() );
#sql { insert into Share_Content (filename, filedate) values (:filename, :filedate) };
}
}
}
};
/
The problem occurs when attempting to execute the statement to create and compile the Java SP: line 20
#sql { insert into Share_Content (filename, filedate) values (:filename, :filedate) };
throws an error
error: illegal character: '#'
Generally I am able to create, compile and execute Java Stored Procedures on the database. Questions:
What is the meaning of the line starting with the hash sign? I am not familiar with such a construct in Java, something specific to Java stored procedures in Oracle?
How can I create and compile the Java stored procedure? Allegedly the code is already running on another instance, so the code should work.
The code is SQLJ and is documented here.
However, from Oracle 12.2, Oracle does not support running SQLJ.
You need to convert the SQLJ code to JDBC if you want to run it in Oracle 19.
This should give you a start:
CREATE AND COMPILE JAVA SOURCE NAMED fill_share_content as
import java.io.*;
import java.sql.*;
import oracle.jdbc.driver.OracleDriver;
public class Fill_Share_Content
{
public static void execute(String directory) throws SQLException
{
Connection con = null;
try {
OracleDriver ora = new OracleDriver();
con = ora.defaultConnection();
} catch (SQLException e) {
return;
}
Timestamp filedate = new Timestamp(System.currentTimeMillis());
PreparedStatement ps = con.prepareStatement(
"insert into Share_Content (filename, filedate) values (:filename, :filedate)"
);
ps.setString(1, directory);
ps.setTimestamp(2, filedate);
ps.execute();
}
};
/
db<>fiddle here

How to disable/avoid linesToSkip(1) from the next file onwards in Spring Batch while processing a large CSV file

We have a large CSV file with 100 million records and use Spring Batch to read it and write it to a database, splitting the file into chunks of 1 million records with a SystemCommandTasklet. Below is a snippet:
@Bean
@StepScope
public SystemCommandTasklet splitFileTasklet(@Value("#{jobParameters[filePath]}") final String inputFilePath) {
SystemCommandTasklet tasklet = new SystemCommandTasklet();
final File file = BatchUtilities.prefixFile(inputFilePath, AppConstants.PROCESSING_PREFIX);
final String command = configProperties.getBatch().getDataLoadPrep().getSplitCommand() + " " + file.getAbsolutePath() + " " + configProperties.getBatch().getDataLoad().getInputLocation() + System.currentTimeMillis() / 1000;
tasklet.setCommand(command);
tasklet.setTimeout(configProperties.getBatch().getDataLoadPrep().getSplitCommandTimeout());
executionContext.put(AppConstants.FILE_PATH_PARAM, file.getPath());
return tasklet;
}
and batch-config:
batch:
data-load-prep:
input-location: /mnt/mlr/prep/
split-command: split -l 1000000 --additional-suffix=.csv
split-command-timeout: 900000 # 15 min
schedule: "*/60 * * * * *"
lock-at-most: 5m
With the above config, I am able to read, load, and write to the database successfully. However, I found a bug with the snippet below: after splitting the file, only the first file has headers, and the remaining split files do not have headers in the first line. So I have to either disable or avoid the linesToSkip(1) config for the FlatFileItemReader (CSVReader).
@Configuration
public class DataLoadReader {
@Bean
@StepScope
public FlatFileItemReader<DemographicData> demographicDataCSVReader(@Value("#{jobExecutionContext[filePath]}") final String filePath) {
return new FlatFileItemReaderBuilder<DemographicData>()
.name("data-load-csv-reader")
.resource(new FileSystemResource(filePath))
.linesToSkip(1) // Need to avoid this from 2nd splitted file onwards as splitted file does not have headers
.lineMapper(lineMapper())
.build();
}
public LineMapper<DemographicData> lineMapper() {
DefaultLineMapper<DemographicData> defaultLineMapper = new DefaultLineMapper<>();
DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
lineTokenizer.setNames("id", "mdl65DecileNum", "mdl66DecileNum", "hhId", "dob", "firstName", "middleName",
"lastName", "addressLine1", "addressLine2", "cityName", "stdCode", "zipCode", "zipp4Code", "fipsCntyCd",
"fipsStCd", "langName", "regionName", "fipsCntyName", "estimatedIncome");
defaultLineMapper.setLineTokenizer(lineTokenizer);
defaultLineMapper.setFieldSetMapper(new DemographicDataFieldSetMapper());
return defaultLineMapper;
}
}
Note: Loader should not skip first row from second file while loading.
Thank you in advance. Appreciate any suggestions.
I would do it in the SystemCommandTasklet with the following command:
tail -n +2 data.csv | split -l 1000000 --additional-suffix=.csv
If you really want to do it with Java in your Spring Batch job, you can use a custom reader or an item processor that filters the header. But I would not recommend this approach as it introduces an additional test for each item (given the large number of lines in your input file, this could impact the performance of your job).
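To make the effect of the tail/split pipeline concrete, here is a small in-memory sketch in plain Java: it drops the first (header) line and cuts the rest into fixed-size chunks, so no chunk starts with a header and the reader would no longer need linesToSkip(1). The class and method names are hypothetical, and holding all lines in a list is only for illustration; a 100-million-row file would have to be streamed instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "tail -n +2 file | split -l N" on an in-memory list.
class HeaderlessSplitter {

    static List<List<String>> splitWithoutHeader(List<String> lines, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<String>> chunks = new ArrayList<>();
        // Start at index 1 to skip the header line once, before splitting.
        for (int start = 1; start < lines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, lines.size());
            chunks.add(new ArrayList<>(lines.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,firstName", "1,Ann", "2,Bob", "3,Cy");
        // Every chunk now contains data rows only.
        System.out.println(splitWithoutHeader(lines, 2));
    }
}
```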

Hadoop Map Reduce: How to create a reduce function for this?

I hit a brick wall. I have the following files I've generated from previous MR functions.
Product Scores (I have)
0528881469 1.62
0594451647 2.28
0594481813 2.67
0972683275 4.37
1400501466 3.62
where column 1 = product_id, and column 2 = product_rating
Related Products (I have)
0000013714 [0005080789,0005476798,0005476216,0005064341]
0000031852 [B00JHONN1S,B002BZX8Z6,B00D2K1M3O,0000031909]
0000031887 [0000031852,0000031895,0000031909,B00D2K1M3O]
0000031895 [B002BZX8Z6,B00JHONN1S,0000031909,B008F0SU0Y]
0000031909 [B002BZX8Z6,B00JHONN1S,0000031895,B00D2K1M3O]
where column 1 = product_id, and column 2 = array of also_bought products
The file I am trying to create now combines both of these files into the following:
Recommended Products (I need)
0000013714 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031852 [<0005476798, 4.58>,<0005080789, 2.34>,<0005476216, 2.32>]
0000031887 [<0005080789, 2.34>,<0005476798, 4.58>,<0005476216, 2.32>]
0000031895 [<0005476216, 2.32>,<0005476798, 4.58>,<0005080789, 2.34>]
0000031909 [<0005476216, 2.32>,<0005080789, 2.34>,<0005476798, 4.58>]
where column 1 = product_id and column 2 = array of tuples of
I'm just totally stuck at the moment, I thought I had a plan for this but it turned out that it was not a very good plan and it didn't work.
Two approaches based on your size of Product Scores data:
1. If your Product Scores file is not huge, you can load it into the Hadoop Distributed Cache (now available on the Job itself via Job.addCacheFile()). Then process the Related Products file, fetch the necessary ratings in the Reducer, and write them out. Quick and dirty. But if Product Scores is a huge file, this is probably not the right way to approach the problem.
2. Reduce-side joins. Various examples are available; refer to this link to get an idea.
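The lookup-based join behind the first approach can be sketched in plain Java, independent of Hadoop. This is only an illustration under assumptions: the class and method names are made up, both inputs are assumed to be whitespace-separated as in the question, and related products without a known score are simply skipped. In a real job, the score map would be filled once per task (for example from the cached file) and the join applied to each Related Products record.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of joining related products against a score lookup table.
class RecommendationJoin {

    /** Parses "product_id rating" lines into a lookup map. */
    static Map<String, Double> loadScores(List<String> scoreLines) {
        Map<String, Double> scores = new HashMap<>();
        for (String line : scoreLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2) {
                scores.put(parts[0], Double.parseDouble(parts[1]));
            }
        }
        return scores;
    }

    /** Turns "product_id [id1,id2,...]" into "product_id<TAB>[<id1, score1>,...]". */
    static String join(String relatedLine, Map<String, Double> scores) {
        String[] parts = relatedLine.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Unrecognized line: " + relatedLine);
        }
        String body = parts[1].trim();
        body = body.substring(1, body.length() - 1); // drop the surrounding [ ]
        List<String> tuples = new ArrayList<>();
        for (String related : body.split(",")) {
            String id = related.trim();
            Double score = scores.get(id);
            if (score != null) { // assumption: unknown products are left out
                tuples.add("<" + id + ", " + score + ">");
            }
        }
        return parts[0] + "\t[" + String.join(",", tuples) + "]";
    }

    public static void main(String[] args) {
        Map<String, Double> scores = loadScores(Arrays.asList("0005080789 2.34", "0005476798 4.58"));
        System.out.println(join("0000013714 [0005080789,0005476798,0005476216]", scores));
    }
}
```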
As you already have defined a schema, you can create hive tables on top of it and get the output using queries. This would save you a lot of time.
Edit: Moreover, if you already have map-reduce jobs to create this file, you can add Hive jobs that create external Hive tables on these reducer outputs and then query them.
I ended up using a MapFile. I transformed both the ProductScores and RelatedProducts data sets into two MapFiles and then made a Java program that pulled information out of these MapFiles when needed.
MapFileWriter
public class MapFileWriter {
public static void main(String[] args) {
Configuration conf = new Configuration();
Path inputFile = new Path(args[0]);
Path outputFile = new Path(args[1]);
Text txtKey = new Text();
Text txtValue = new Text();
try {
FileSystem fs = FileSystem.get(conf);
FSDataInputStream inputStream = fs.open(inputFile);
// MapFile.Writer requires keys to be appended in sorted order.
MapFile.Writer writer = new MapFile.Writer(conf, fs, outputFile.toString(), txtKey.getClass(), txtValue.getClass());
writer.setIndexInterval(1);
while (inputStream.available() > 0) {
String strLineInInputFile = inputStream.readLine();
String[] lstKeyValuePair = strLineInInputFile.split("\\t");
txtKey.set(lstKeyValuePair[0]);
txtValue.set(lstKeyValuePair[1]);
writer.append(txtKey, txtValue);
}
writer.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
MapFileReader
public class MapFileReader {
public static void main(String[] args) {
Configuration conf = new Configuration();
FileSystem fs;
Text txtKey = new Text(args[1]);
Text txtValue = new Text();
MapFile.Reader reader;
try {
fs = FileSystem.get(conf);
try {
reader = new MapFile.Reader(fs, args[0], conf);
reader.get(txtKey, txtValue);
} catch (Exception e) {
e.printStackTrace();
}
} catch (Exception e) {
e.printStackTrace();
}
System.out.println("The value for Key " + txtKey.toString() + " is " + txtValue.toString());
}
}

Loading Files in UDF

I have a requirement to populate a field based on the evaluation of a UDF. The input to the UDF would be some other fields in the input as well as a CSV sheet. Presently, the approach I have taken is to load the CSV file, GROUP it ALL, and then pass it as a bag to the UDF along with the other required parameters. However, it's taking a very long time to complete (roughly 3 hours) for source data of 170k records and a CSV of about 150k records.
I'm sure there must be a much more efficient way to handle this, hence I need your inputs.
source_alias = LOAD 'src.csv' USING
PigStorage(',') AS (f1:chararray,f2:chararray,f3:chararray);
csv_alias = LOAD 'csv_file.csv' USING
PigStorage(',') AS (c1:chararray,c2:chararray,c3:chararray);
grpd_csv_alias = GROUP csv_alias ALL;
final_alias = FOREACH source_alias GENERATE f1 AS f1,
myUDF(grpd_csv_alias, f2) AS derived_f2;
Here is my UDF on a high level.
public class myUDF extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
String f2Response = "N";
DataBag csvAliasBag = (DataBag)input.get(0);
String f2 = (String) input.get(1);
try {
Iterator<Tuple> bagIterator = csvAliasBag.iterator();
while (bagIterator.hasNext()) {
Tuple localTuple = (Tuple)bagIterator.next();
String col1 = ((String)localTuple.get(1)).trim().toLowerCase();
String col2 = ((String)localTuple.get(2)).trim().toLowerCase();
String col3 = ((String)localTuple.get(3)).trim().toLowerCase();
String col4 = ((String)localTuple.get(4)).trim().toLowerCase();
// Custom logic to populate f2Response based on the value in f2 as well as col1, col2, col3 and col4
}
return f2Response;
}
catch(Exception e){
throw new IOException("Caught exception processing input row ", e);
}
}
}
I believe the process is taking too long because of building and passing csv_alias to the UDF for each row in the source file.
Is there any better way to handle this?
Thanks
For small files, you can put them on the distributed cache. This copies the file to each task node as a local file, and then you load it yourself. Here's an example from the UDF section of the Pig docs. I would not recommend parsing the file on every call, however. Store your results in a class variable and check whether it has been initialized. If the CSV is on the local file system, use getShipFiles. If the CSV is on HDFS, use the getCacheFiles method. Notice that for HDFS there's a file path followed by a # and some text. To the left of the # is the HDFS path, and to the right is the name the file will have when it's copied to the local file system.
public class Udfcachetest extends EvalFunc<String> {
public String exec(Tuple input) throws IOException {
String concatResult = "";
FileReader fr = new FileReader("./smallfile1");
BufferedReader d = new BufferedReader(fr);
concatResult +=d.readLine();
fr = new FileReader("./smallfile2");
d = new BufferedReader(fr);
concatResult +=d.readLine();
return concatResult;
}
public List<String> getCacheFiles() {
List<String> list = new ArrayList<String>(1);
list.add("/user/pig/tests/data/small#smallfile1"); // This is hdfs file
return list;
}
public List<String> getShipFiles() {
List<String> list = new ArrayList<String>(1);
list.add("/home/hadoop/pig/smallfile2"); // This local file
return list;
}
}
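The advice to parse the file only once and keep the result in a class variable can be sketched in plain Java as follows. This is a hedged illustration, not Pig API: the class name, the Supplier standing in for reading the shipped or cached file, and the two-column key/value layout are all assumptions, and the "N" default simply mirrors the f2Response default in the question. Inside a real UDF, exec would call a lookup like this instead of iterating over a bag for every input row.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical lazily initialized lookup table built from CSV lines.
class CachedCsvLookup {
    // Stand-in for reading the local copy of the shipped/cached file.
    private final Supplier<List<String>> lineSource;
    // Stays null until the first lookup, then is reused for every later call.
    private Map<String, String> cache;
    // Counts how often the source was actually parsed (for demonstration).
    int loadCount = 0;

    CachedCsvLookup(Supplier<List<String>> lineSource) {
        this.lineSource = lineSource;
    }

    String lookup(String key) {
        if (cache == null) {
            cache = new HashMap<>();
            for (String line : lineSource.get()) {
                String[] cols = line.split(",", 2);
                if (cols.length == 2) {
                    cache.put(cols[0].trim().toLowerCase(), cols[1].trim());
                }
            }
            loadCount++;
        }
        return cache.getOrDefault(key.trim().toLowerCase(), "N");
    }

    public static void main(String[] args) {
        CachedCsvLookup lookup = new CachedCsvLookup(() -> Arrays.asList("a,1", "B,2"));
        System.out.println(lookup.lookup("a") + " " + lookup.lookup("b") + " loads=" + lookup.loadCount);
    }
}
```

With a hash map, each call costs a single lookup rather than a scan over all CSV records, which is where most of the per-row time in the bag-based approach is likely to go.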

Pig Not Interpreting Int Correctly -- Custom Loader

So this is my first time ever using Pig and I'm having a hard time getting it to interpret my data correctly. I don't want to have to define a schema for my input files until run time, so I wrote a super simple custom loader where the only change I made to PigStorage was changing the getSchema method to read the first two lines of my file and create a schema from them:
public ResourceSchema getSchema(String location,
Job job) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(location.replace("file://", "")));
String[] line = br.readLine().split(",");
String[] data = br.readLine().split(",");
List<FieldSchema> fields = new ArrayList<FieldSchema>();
for(int f = 0; f< line.length; f++)
{
Byte type = GetType(data[f].replace("\"", ""));
fields.add(new FieldSchema(line[f].replace("\"", ""), type));
}
schema = new ResourceSchema(new Schema(fields));
return schema;
}
private Byte GetType(Object Data)
{
try{
int number = Integer.parseInt(Data.toString());
return org.apache.pig.data.DataType.INTEGER;
}
catch(Exception e){}
try{
double dnumber = Double.parseDouble(Data.toString());
return org.apache.pig.data.DataType.DOUBLE;
}
catch(Exception e){}
return org.apache.pig.data.DataType.CHARARRAY;
}
When I load a file and run DESCRIBE on it, it looks like what I want, for instance:
{CU_NUMBER: int,CYCLE_DATE: chararray,JOIN_NUMBER: int,RSSD: int,CU_TYPE: int,CU_NAME: chararray}
And the first 10 Rows look like this:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
However, when I try to do stuff with the data like:
FOICU = LOAD 'file:///home/biadmin/NCUA/foicu.txt' USING org.apache.pig.builtin.PigStorageInferSchema(',', '-schema');
FirstSixColumns = FOREACH FOICU GENERATE CU_NUMBER, CYCLE_DATE, JOIN_NUMBER, RSSD, CU_TYPE, CU_NAME;
TopTen = LIMIT FirstSixColumns 10;
FOICUFiltered = FILTER TopTen BY CU_NUMBER > 20;
CU_FIVE = FILTER TopTen BY CU_NUMBER == 5;
DUMP FOICUFiltered;
DUMP CU_FIVE;
FOICUFiltered returns all 10 rows even though 7 of them have a CU_NUMBER less than 20:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
And CU_FIVE returns no rows at all.
Does anybody know what I've done wrong here and is there a better way to dynamically load the schema at run time without using schema files?
