When creating and loading HFiles programmatically into HBase, new entries are unavailable - hadoop

I'm trying to create HFiles programmatically and load them into a running HBase instance. I found a lot of information in HFileOutputFormat and in LoadIncrementalHFiles.
I managed to create the new HFile and send it to the cluster. In the cluster web interface the new store file appears, but the new key range is unavailable.
// Read the sample data and build row key -> line mappings
InputStream stream = ProgrammaticHFileGeneration.class.getResourceAsStream("ga-hourly.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(stream));
String line = null;
Map<byte[], String> rowValues = new HashMap<byte[], String>();
while ((line = reader.readLine()) != null) {
    String[] vals = line.split(",");
    String row = new StringBuilder(vals[0]).append(".").append(vals[1]).append(".").append(vals[2]).append(".").append(vals[3]).toString();
    rowValues.put(row.getBytes(), line);
}

// HFiles require keys in sorted order
List<byte[]> keys = new ArrayList<byte[]>(rowValues.keySet());
Collections.sort(keys, byteArrComparator);

HBaseTestingUtility testingUtility = new HBaseTestingUtility();
testingUtility.startMiniCluster();
testingUtility.createTable("table".getBytes(), "data".getBytes());

// Write the HFile under /tmp/hfiles/data ("data" is the column family)
Writer writer = new HFile.Writer(testingUtility.getTestFileSystem(),
        new Path("/tmp/hfiles/data/hfile"),
        HFile.DEFAULT_BLOCKSIZE, Compression.Algorithm.NONE, KeyValue.KEY_COMPARATOR);
for (byte[] key : keys) {
    writer.append(new KeyValue(key, "data".getBytes(), "d".getBytes(), rowValues.get(key).getBytes()));
}
writer.appendFileInfo(StoreFile.BULKLOAD_TIME_KEY, Bytes.toBytes(System.currentTimeMillis()));
writer.appendFileInfo(StoreFile.MAJOR_COMPACTION_KEY, Bytes.toBytes(true));
writer.close();

// Bulk load the generated HFile and scan the table
Configuration conf = testingUtility.getConfiguration();
LoadIncrementalHFiles loadTool = new LoadIncrementalHFiles(conf);
HTable hTable = new HTable(conf, "table".getBytes());
loadTool.doBulkLoad(new Path("/tmp/hfiles"), hTable);

ResultScanner scanner = hTable.getScanner("data".getBytes());
Result next = null;
System.out.println("Scanning");
while ((next = scanner.next()) != null) {
    System.out.format("%s %s\n", new String(next.getRow()), new String(next.getValue("data".getBytes(), "d".getBytes())));
}
Did anyone actually make this work? I have a compilable/testable version up on my GitHub.

Take a look at the LoadIncrementalHFiles test in the HBase source code: https://github.com/apache/hbase/blob/7c46646994b7a9d6f947cf12796579ef48d0b0bd/src/test/java/org/apache/hadoop/hbase/mapreduce/TestLoadIncrementalHFiles.java
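The pattern that test follows is roughly the sketch below - paraphrased with the same API calls used in the question rather than copied from the test, and with fs, conf, sortedKeys and the sample value as stand-ins. The detail worth double-checking is that doBulkLoad is pointed at a parent directory whose subdirectories are named exactly after the table's column families:
// Sketch: the bulk-load layout is <bulk dir>/<column family>/<hfile>
Path bulkDir = new Path("/tmp/hfiles");
Path familyDir = new Path(bulkDir, "data"); // must match the table's family name exactly
HFile.Writer w = new HFile.Writer(fs, new Path(familyDir, "hfile_0"),
        HFile.DEFAULT_BLOCKSIZE, Compression.Algorithm.NONE, KeyValue.KEY_COMPARATOR);
for (byte[] key : sortedKeys) { // keys must be appended in sorted order
    w.append(new KeyValue(key, Bytes.toBytes("data"), Bytes.toBytes("d"), Bytes.toBytes("value")));
}
w.appendFileInfo(StoreFile.BULKLOAD_TIME_KEY, Bytes.toBytes(System.currentTimeMillis()));
w.close();
new LoadIncrementalHFiles(conf).doBulkLoad(bulkDir, new HTable(conf, "table"));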

Related

Filling XFA data into an XFA form PDF with iText 7 in C#

PdfReader reader = new PdfReader(sourceXfaPath);
PdfWriter writer = new PdfWriter(exportPdf);
PdfDocument pdfDoc = new PdfDocument(reader, writer, new StampingProperties().UseAppendMode());
PdfAcroForm form = PdfAcroForm.GetAcroForm(pdfDoc, true);
XfaForm xfa = form.GetXfaForm();
xfa.FillXfaForm(new FileStream(exportXfaXml,FileMode.Open));
xfa.Write(pdfDoc);
pdfDoc.Close();
reader.Close();
The code is above, but it doesn't create the PDF document, and I'm not sure why.
I also tried changing xfa.FillXfaForm(new FileStream(exportXfaXml, FileMode.Open)); to xfa.FillXfaForm(XmlReader.Create(path));, but it shows me the same error.
public void customGenerate(string sourceFilePath, string destinationFilePath, string replacementXmlFilePath)
{
    PdfReader pdfReader = new PdfReader(sourceFilePath);
    using (MemoryStream ms = new MemoryStream())
    {
        using (PdfStamper stamper = new PdfStamper(pdfReader, ms, '\0', true))
        //using (PdfStamper stamper = new PdfStamper(pdfReader, ms))
        {
            XfaForm xfaForm = new XfaForm(pdfReader);
            XmlDocument doc = new XmlDocument();
            doc.Load(replacementXmlFilePath);
            xfaForm.DomDocument = doc;
            xfaForm.Changed = true;
            XfaForm.SetXfa(xfaForm, stamper.Reader, stamper.Writer);
            stamper.Close();
        }
        var bytes = ms.ToArray();
        System.IO.File.WriteAllBytes(destinationFilePath, bytes);
        ms.Close();
    }
    pdfReader.Close();
}
This is my code for adding JavaScript to an XFA PDF.
I want to add JavaScript (to handle a button and a checkbox) to the XFA PDF while avoiding invalidating the signature.

Java HivePreparedStatement example

Can someone show an example Java program that uses HivePreparedStatement to connect to Hive tables and retrieve data from them?
https://hive.apache.org/javadocs/r1.2.2/api/org/apache/hive/jdbc/HivePreparedStatement.html
I was trying the following code, but I am unable to complete it because the constructor takes many parameters.
TCLIService.Iface client = null;
EmbeddedThriftBinaryCLIService embeddedClient = new EmbeddedThriftBinaryCLIService();
embeddedClient.init(null);
client = embeddedClient;
TOpenSessionReq openReq = new TOpenSessionReq();
JdbcConnectionParams connParams = Utils.parseURL("jdbc:hive2://168.61.32.157:10000/nadb");
Map<String, String> openConf = new HashMap<String, String>();
// for remote JDBC client, try to set the conf var using 'set foo=bar'
for (Map.Entry<String, String> hiveConf : connParams.getHiveConfs().entrySet()) {
    openConf.put("set:hiveconf:" + hiveConf.getKey(), hiveConf.getValue());
}
// For remote JDBC client, try to set the hive var using 'set hivevar:key=value'
for (Map.Entry<String, String> hiveVar : connParams.getHiveVars().entrySet()) {
    openConf.put("set:hivevar:" + hiveVar.getKey(), hiveVar.getValue());
}
// switch the database
openConf.put("use:database", connParams.getDbName());
// set the fetchSize
openConf.put("set:hiveconf:hive.server2.thrift.resultset.default.fetch.size",
        Integer.toString(fetchSize));
// set the session configuration
Map<String, String> sessVars = connParams.getSessionVars();
if (sessVars.containsKey(HiveAuthFactory.HS2_PROXY_USER)) {
    openConf.put(HiveAuthFactory.HS2_PROXY_USER,
            sessVars.get(HiveAuthFactory.HS2_PROXY_USER));
}
openReq.setConfiguration(openConf);
// Store the user name in the open request in case no non-sasl authentication
if (JdbcConnectionParams.AUTH_SIMPLE.equals(sessConfMap.get(JdbcConnectionParams.AUTH_TYPE))) {
    openReq.setUsername(sessConfMap.get(JdbcConnectionParams.AUTH_USER));
    openReq.setPassword(sessConfMap.get(JdbcConnectionParams.AUTH_PASSWD));
}
TOpenSessionResp openResp = client.OpenSession(openReq);
// validate connection
Utils.verifySuccess(openResp.getStatus());
if (!supportedProtocols.contains(openResp.getServerProtocolVersion())) {
    throw new TException("Unsupported Hive2 protocol");
}
protocol = openResp.getServerProtocolVersion();
sessHandle = openResp.getSessionHandle();
HivePreparedStatement hivePreparedStatement = new HivePreparedStatement(con, client, sessHandle, sql);
Is there a simpler way to do this?
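For what it's worth, a minimal sketch of the plain JDBC route is below, under the assumption (not stated in the question) that HiveServer2 is reachable at the URL from the snippet and that the hive-jdbc driver is on the classpath. Connection.prepareStatement hands back the driver's own PreparedStatement implementation, so HivePreparedStatement normally never has to be constructed directly; the table name, column and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HivePreparedStatementExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://168.61.32.157:10000/nadb", "user", "password");
             // some_table and the id column are placeholders for illustration
             PreparedStatement ps = con.prepareStatement(
                     "SELECT * FROM some_table WHERE id = ?")) {
            ps.setInt(1, 42);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }
}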

Memory issues when running Spark job on relatively large input

I am running a Spark cluster with 50 machines. Each machine is a VM with 8 cores and 50GB of memory (41GB seems to be available to Spark).
I am running over several input folders; I estimate the input size to be ~250GB gz-compressed.
Although the number and configuration of machines seems sufficient, after about 40 minutes of running the job fails, and I can see the following errors in the logs:
2558733 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 345.0 in stage 1.0 (TID 345, hadoop-w-3.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: Java heap space
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
and also:
2653545 [Result resolver thread-2] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 122.1 in stage 1.0 (TID 392, hadoop-w-22.c.taboola-qa-01.internal): java.lang.OutOfMemoryError: GC overhead limit exceeded
java.lang.StringCoding$StringDecoder.decode(StringCoding.java:149)
java.lang.StringCoding.decode(StringCoding.java:193)
java.lang.String.<init>(String.java:416)
java.lang.String.<init>(String.java:481)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:699)
com.doit.customer.dataconverter.Phase0$3.call(Phase0.java:660)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:164)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
How do I go about debugging such an issue?
EDIT: I found the root cause of the problem. It is this piece of code:
private static final int MAX_FILE_SIZE = 40194304;
....
....
JavaPairRDD<String, List<String>> typedData = filePaths.mapPartitionsToPair(new PairFlatMapFunction<Iterator<String>, String, List<String>>() {
    @Override
    public Iterable<Tuple2<String, List<String>>> call(Iterator<String> filesIterator) throws Exception {
        List<Tuple2<String, List<String>>> res = new ArrayList<>();
        String fileType = null;
        List<String> linesList = null;
        if (filesIterator != null) {
            while (filesIterator.hasNext()) {
                try {
                    Path file = new Path(filesIterator.next());
                    // filter non-trc files
                    if (!file.getName().startsWith("1")) {
                        continue;
                    }
                    fileType = getType(file.getName());
                    Configuration conf = new Configuration();
                    CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
                    CompressionCodec codec = compressionCodecs.getCodec(file);
                    FileSystem fs = file.getFileSystem(conf);
                    ContentSummary contentSummary = fs.getContentSummary(file);
                    long fileSize = contentSummary.getLength();
                    InputStream in = fs.open(file);
                    if (codec != null) {
                        in = codec.createInputStream(in);
                    } else {
                        throw new IOException();
                    }
                    byte[] buffer = new byte[MAX_FILE_SIZE];
                    BufferedInputStream bis = new BufferedInputStream(in, BUFFER_SIZE);
                    int count = 0;
                    int bytesRead = 0;
                    try {
                        while ((bytesRead = bis.read(buffer, count, BUFFER_SIZE)) != -1) {
                            count += bytesRead;
                        }
                    } catch (Exception e) {
                        log.error("Error reading file: " + file.getName() + ", trying to read " + BUFFER_SIZE + " bytes at offset: " + count);
                        throw e;
                    }
                    Iterable<String> lines = Splitter.on("\n").split(new String(buffer, "UTF-8").trim());
                    linesList = Lists.newArrayList(lines);
                    // get rid of first line in file
                    Iterator<String> it = linesList.iterator();
                    if (it.hasNext()) {
                        it.next();
                        it.remove();
                    }
                    //res.add(new Tuple2<>(fileType,linesList));
                } finally {
                    res.add(new Tuple2<>(fileType, linesList));
                }
            }
        }
        return res;
    }
});
In particular, it allocates a 40MB buffer for each file in order to read the file's content through a BufferedInputStream. This eventually exhausts the heap.
The thing is:
- If I read line by line (which does not require such a buffer), the read will be very inefficient.
- If I allocate one buffer and reuse it for each file read - is that possible given the parallelism? Or will it get overwritten by several threads?
Any suggestions are welcome...
EDIT 2: I fixed the first memory issue by moving the byte array allocation outside the iterator, so it is reused for all elements of the partition. But there is still the new String(buffer, "UTF-8").trim() created for the split - that object is also created every time. I could use a StringBuffer/StringBuilder, but then how would I set the charset encoding without a String object?
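As an aside, one way to decode without materializing the intermediate String is to wrap only the filled portion of the reused buffer in a ByteBuffer and decode it to a CharBuffer, which Guava's Splitter accepts as a CharSequence. A minimal sketch, assuming count holds the number of valid bytes (the class and method names are made up for illustration, and omitEmptyStrings stands in for the trim):
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;
import com.google.common.base.Splitter;

public class BufferDecode {
    // Decode only buffer[0..count) into characters without building a String first.
    static Iterable<String> decodeLines(byte[] buffer, int count) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(buffer, 0, count));
        // Splitter works on any CharSequence, so the CharBuffer can be split directly.
        return Splitter.on("\n").omitEmptyStrings().split(chars);
    }
}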
Eventually I changed the code as follows:
// Transform list of files to list of all files' content in lines grouped by type
JavaPairRDD<String, List<String>> typedData = filePaths.mapToPair(new PairFunction<String, String, List<String>>() {
    @Override
    public Tuple2<String, List<String>> call(String filePath) throws Exception {
        Tuple2<String, List<String>> tuple = null;
        try {
            String fileType = null;
            List<String> linesList = new ArrayList<String>();
            Configuration conf = new Configuration();
            CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(conf);
            Path path = new Path(filePath);
            fileType = getType(path.getName());
            tuple = new Tuple2<String, List<String>>(fileType, linesList);
            // filter non-trc files
            if (!path.getName().startsWith("1")) {
                return tuple;
            }
            CompressionCodec codec = compressionCodecs.getCodec(path);
            FileSystem fs = path.getFileSystem(conf);
            InputStream in = fs.open(path);
            if (codec != null) {
                in = codec.createInputStream(in);
            } else {
                throw new IOException();
            }
            BufferedReader r = new BufferedReader(new InputStreamReader(in, "UTF-8"), BUFFER_SIZE);
            // Get rid of the first line in the file
            r.readLine();
            // Read all lines
            String line;
            while ((line = r.readLine()) != null) {
                linesList.add(line);
            }
        } catch (IOException e) { // Filtering of files whose reading went wrong
            log.error("Reading of the file " + filePath + " went wrong: " + e.getMessage());
        } finally {
            return tuple;
        }
    }
});
So now I do not use a 40MB buffer, but rather build the lines list dynamically using an ArrayList. This solved my current memory issue, but now I am getting other strange errors that fail the job. I will report those in a different question...

MiniProfiler - ProfiledDbDataAdapter

Trying to get MiniProfiler to profile loading a DataTable via a Stored Proc
// Use a DbDataAdapter to return data from a SP using a DataTable
var sqlConnection = new SqlConnection(@"server=.\SQLEXPRESS;database=TestDB;Trusted_Connection=True;");
DbConnection connection = new StackExchange.Profiling.Data.ProfiledDbConnection(sqlConnection, MiniProfiler.Current);
var table = new DataTable();
DbDataAdapter dataAdapter = new SqlDataAdapter("GetCountries", sqlConnection);
ProfiledDbDataAdapter prdataAdapter = new ProfiledDbDataAdapter(dataAdapter, null);
// null reference exception here - SelectCommand is null
prdataAdapter.SelectCommand.CommandType = CommandType.StoredProcedure;
prdataAdapter.SelectCommand.Parameters.Add(new SqlParameter("@countryID", 2));
prdataAdapter.Fill(table);
ViewBag.table = table;
Problem: Null reference exception on SelectCommand
Please ignore lack of usings / disposing
I can successfully profile a call to an SP using ProfiledDbCommand:
// call a SP from DbCommand
var sqlConnection2 = new SqlConnection(@"server=.\SQLEXPRESS;database=TestDB;Trusted_Connection=True;");
DbConnection connection2 = new StackExchange.Profiling.Data.ProfiledDbConnection(sqlConnection2, MiniProfiler.Current);
if (connection2 != null)
{
    using (connection2)
    {
        try
        {
            // Create the command.
            DbCommand command = new SqlCommand();
            ProfiledDbCommand prcommand = new ProfiledDbCommand(command, connection2, null);
            prcommand.CommandType = CommandType.StoredProcedure;
            prcommand.CommandText = "dbo.GetCountries";
            prcommand.Parameters.Add(new SqlParameter("@countryID", 3));
            prcommand.Connection = connection2;
            //command.CommandText = "SELECT Name FROM Countries";
            //command.CommandType = CommandType.Text;
            // Open the connection.
            connection2.Open();
            // Retrieve the data.
            DbDataReader reader = prcommand.ExecuteReader();
            while (reader.Read())
            {
                var text = reader["Name"];
                result += text + ", ";
            }
        }
        catch (Exception ex)
        {
        }
    }
}
Edit1:
This works:
// 2) SqlConnection, SqlDataAdapter.. dataAdapter command - works
var sqlConnection = new SqlConnection(@"server=.\SQLEXPRESS;database=TestDB;Trusted_Connection=True;");
var table = new DataTable();
//var dataAdapter = new SqlDataAdapter("GetCountries", sqlConnection);
var dataAdapter = new SqlDataAdapter("select * from countries", sqlConnection);
//dataAdapter.SelectCommand.CommandType = CommandType.StoredProcedure;
//dataAdapter.SelectCommand.Parameters.AddWithValue("@countryID", 2);
dataAdapter.Fill(table);
This doesn't work with a DataTable, but does with a DataSet:
var sqlConnection = new SqlConnection(@"server=.\SQLEXPRESS;database=TestDB;Trusted_Connection=True;");
DbConnection connection = new ProfiledDbConnection(sqlConnection, MiniProfiler.Current);
DbDataAdapter dataAdapter = new SqlDataAdapter("select * from countries", sqlConnection);
ProfiledDbDataAdapter prdataAdapter = new ProfiledDbDataAdapter(dataAdapter, null);
var table = new DataTable();
var dataset = new DataSet();
// Doesn't work
//prdataAdapter.Fill(table);
// Does work
prdataAdapter.Fill(dataset);
It turns out that although ProfiledDbDataAdapter inherits from DbDataAdapter, it did not override the default DbDataAdapter.Fill(DataTable) behavior, which led to the errors you saw.
I fixed this in the MiniProfiler code; the fix is available on NuGet in version 3.0.10-beta7 and higher.
I have tested this with your code from above and it works for me:
DbConnection connection =
new ProfiledDbConnection(sqlConnection, MiniProfiler.Current);
var sql = "select * from countries";
DbDataAdapter dataAdapter = new SqlDataAdapter(sql, sqlConnection);
ProfiledDbDataAdapter prdataAdapter = new ProfiledDbDataAdapter(dataAdapter);
var table = new DataTable();
prdataAdapter.Fill(table); // this now works

HBase "between" Filters

I'm trying to retrieve rows within a range using a FilterList, but I'm not successful.
Below is my code snippet.
I want to retrieve data between 1000 and 2000.
HTable table = new HTable(conf, "TRAN_DATA");
List<Filter> filters = new ArrayList<Filter>();

SingleColumnValueFilter filter1 = new SingleColumnValueFilter(Bytes.toBytes("TRAN"),
        Bytes.toBytes("TRAN_ID"),
        CompareFilter.CompareOp.GREATER, new BinaryComparator(Bytes.toBytes("1000")));
filter1.setFilterIfMissing(true);
filters.add(filter1);

SingleColumnValueFilter filter2 = new SingleColumnValueFilter(Bytes.toBytes("TRAN"),
        Bytes.toBytes("TRAN_ID"),
        CompareFilter.CompareOp.LESS, new BinaryComparator(Bytes.toBytes("2000")));
filters.add(filter2);

FilterList filterList = new FilterList(filters);

Scan scan = new Scan();
scan.setFilter(filterList);

ResultScanner scanner1 = table.getScanner(scan);
System.out.println("Results of scan #1 - MUST_PASS_ALL:");
int n = 0;
for (Result result : scanner1) {
    for (KeyValue kv : result.raw()) {
        System.out.println("KV: " + kv + ", Value: "
                + Bytes.toString(kv.getValue()));
        n++;
    }
}
scanner1.close();
I tried all possible variants, for example:
1. SingleColumnValueFilter filter2 = new SingleColumnValueFilter(Bytes.toBytes("TRANSACTIONS"),
       Bytes.toBytes("TRANS_ID"),
       CompareFilter.CompareOp.LESS, new SubstringComparator("5000"));
2. SingleColumnValueFilter filter2 = new SingleColumnValueFilter(Bytes.toBytes("TRANSACTIONS"),
       Bytes.toBytes("TRANS_ID"),
       CompareFilter.CompareOp.LESS, Bytes.toBytes("5000"));
None of the above approaches work :(
One thing that looks off here is that when creating the FilterList you should also specify a FilterList.Operator; otherwise it is not obvious how the FilterList will combine multiple filters. In your case it should be something like:
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);
See if this helps.
Looks OK. Check whether you persisted the values as the strings "1000" and "2000", not as the byte[] encodings of the ints 1000 and 2000.
The rest looks fine to me.
1. The FilterList default operator is Operator.MUST_PASS_ALL, so your code is fine in that respect.
2. If you stored the values as string bytes in HBase, your comparison values are wrong, because the comparison is lexicographic: "1000" < "2" < "5000".
Put put = new Put(rowKey_ForTest);
put.add(ColumnFamilyName, QName1, Bytes.toBytes("2"));
table.put(put);

List<Filter> filters = new ArrayList<Filter>();

SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
        ColumnFamilyName, QName1, CompareOp.GREATER,
        new BinaryComparator(Bytes.toBytes("1000")));
filters.add(filter1);

SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
        ColumnFamilyName, QName1, CompareOp.LESS,
        new BinaryComparator(Bytes.toBytes("5000")));
filters.add(filter2);

FilterList filterList = new FilterList(filters);

Scan scan = new Scan();
scan.setFilter(filterList);

List<String> resultRowKeys = new ArrayList<String>();
ResultScanner resultScanner = table.getScanner(scan);
for (Result result = resultScanner.next(); result != null; result = resultScanner.next()) {
    resultRowKeys.add(Bytes.toString(result.getRow()));
}
Util.close(resultScanner);

Assert.assertEquals(1, resultRowKeys.size());
3. If you stored the ints as bytes, your code is wrong: you should compare against Bytes.toBytes(int), not Bytes.toBytes(String) - see the sketch below.
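A minimal sketch of the int-typed variant, reusing the setup names from the test snippet above (ColumnFamilyName, QName1, rowKey_ForTest and table are assumed to exist as before; 1500 and the 1000/2000 bounds are example values):
// Store the value as an int, not as a string
Put put = new Put(rowKey_ForTest);
put.add(ColumnFamilyName, QName1, Bytes.toBytes(1500));
table.put(put);

// Compare against int-encoded bounds; for non-negative ints the byte-wise
// comparison done by BinaryComparator then matches numeric order
SingleColumnValueFilter lower = new SingleColumnValueFilter(
        ColumnFamilyName, QName1, CompareOp.GREATER,
        new BinaryComparator(Bytes.toBytes(1000)));
SingleColumnValueFilter upper = new SingleColumnValueFilter(
        ColumnFamilyName, QName1, CompareOp.LESS,
        new BinaryComparator(Bytes.toBytes(2000)));

List<Filter> filters = new ArrayList<Filter>();
filters.add(lower);
filters.add(upper);
FilterList filterList = new FilterList(FilterList.Operator.MUST_PASS_ALL, filters);

Scan scan = new Scan();
scan.setFilter(filterList);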
My test code is at https://github.com/zhang-xzhi/simplehbase; there are a lot of HBase tests there.
Or you can post your put code, so we can check how you saved your data to HBase.
Or you can debug it and check your value's format.
See if this helps.
