HBase scan with timerange returns old version - filter

I have a question about HBase scans using a time range.
I created a 'test' table with one family 'cf' and one version. After putting 4 rows of data into that table and scanning it with a time range, I get an old version of a row within that time range.
For example:
create 'test',{NAME=>'cf',VERSIONS=>1}
put 'test','row1','cf:u','value1'
put 'test','row2','cf:u','value2'
put 'test','row3','cf:u','value3'
put 'test','row3','cf:u','value4'
Then I scan this table; the following is the output:
hbase(main):008:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:u, timestamp=1340259691771, value=value1
row2 column=cf:u, timestamp=1340259696975, value=value2
row3 column=cf:u, timestamp=1340259704569, value=value4
That is correct: row3 has the newest version.
However, if I scan with a time range I get this:
hbase(main):010:0> scan 'test',{TIMERANGE=>[1340259691771,1340259704569]}
ROW COLUMN+CELL
row1 column=cf:u, timestamp=1340259691771, value=value1
row2 column=cf:u, timestamp=1340259696975, value=value2
row3 column=cf:u, timestamp=1340259701085, value=value3
It returns the old version of row3, even though I set VERSIONS to 1 on this table.
If I increase the max timestamp, I get:
hbase(main):011:0> scan 'test',{TIMERANGE=>[1340259691771,1340259704570]}
ROW COLUMN+CELL
row1 column=cf:u, timestamp=1340259691771, value=value1
row2 column=cf:u, timestamp=1340259696975, value=value2
row3 column=cf:u, timestamp=1340259704569, value=value4
3 row(s) in 0.0330 seconds
That is right, and I can understand it.
What I want is to scan a table within a time range and have it return only the newest version. I know there is a TimestampsFilter, but that filter only supports specific timestamps, not a time range.
Is there any way to scan a table within a time range and only return the newest version?
I tried to write my own TimeRangeFilter; the following is my code.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.filter.ParseFilter;

import com.google.common.base.Preconditions;

public class TimeRangeFilter extends FilterBase {

    private long minTimeStamp = Long.MIN_VALUE;
    private long maxTimeStamp = Long.MAX_VALUE;

    public TimeRangeFilter(long minTimeStamp, long maxTimeStamp) {
        Preconditions.checkArgument(maxTimeStamp >= minTimeStamp,
                "max timestamp %s must be greater than min timestamp %s", maxTimeStamp, minTimeStamp);
        this.maxTimeStamp = maxTimeStamp;
        this.minTimeStamp = minTimeStamp;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue v) {
        if (v.getTimestamp() >= minTimeStamp && v.getTimestamp() <= maxTimeStamp) {
            return ReturnCode.INCLUDE;
        } else if (v.getTimestamp() < minTimeStamp) {
            // The remaining versions of this column are guaranteed
            // to be lesser than all of the other values.
            return ReturnCode.NEXT_COL;
        }
        return ReturnCode.SKIP;
    }

    public static Filter createFilterFromArguments(ArrayList<byte[]> filterArguments) {
        if (filterArguments.size() < 2)
            return null;
        long minTime = ParseFilter.convertByteArrayToLong(filterArguments.get(0));
        long maxTime = ParseFilter.convertByteArrayToLong(filterArguments.get(1));
        return new TimeRangeFilter(minTime, maxTime);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(minTimeStamp);
        out.writeLong(maxTimeStamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.minTimeStamp = in.readLong();
        this.maxTimeStamp = in.readLong();
    }
}
I added this jar to HBASE_CLASSPATH in hbase-env.sh; however, I get the following error:
org.apache.hadoop.hbase.client.ScannerCallable#a9255c, java.io.IOException: IPC server unable to read call parameters: Error in readFields

Dape,
When you set the max versions to 1 and have more than one entry for a cell, HBase tombstones the older cells, and gets and scans cannot see them - unless of course you specify a timestamp range that qualifies only one cell. The tombstoned cells are only deleted after a major compaction is run on the table, which is when the older cells stop popping up.
To always get the latest cell from a scan, all you need to do is use the method below:
Result.getColumnLatest(family, qualifier)
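For illustration, a minimal sketch against the 0.94-era client API (the table and column names are taken from the question; the surrounding wiring is an assumption, not code from the question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class LatestCellScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");
        try {
            Scan scan = new Scan();
            scan.setTimeRange(1340259691771L, 1340259704570L); // [min, max)
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                // Pick the newest cf:u cell out of whatever the scan returned for this row.
                KeyValue kv = r.getColumnLatest(Bytes.toBytes("cf"), Bytes.toBytes("u"));
                if (kv != null) {
                    System.out.println(Bytes.toString(r.getRow()) + " -> "
                            + Bytes.toString(kv.getValue()) + " @ " + kv.getTimestamp());
                }
            }
            scanner.close();
        } finally {
            table.close();
        }
    }
}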

java.io.IOException: IPC server unable to read call parameters: Error in readFields
You need to copy the jars to all region servers and edit HBASE_CLASSPATH in hbase-env.sh on the region servers accordingly.
You can specify a time range and MaxVersions on the Scan to get old versions within the time range:
scan.setMaxVersions(Integer.MAX_VALUE);
scan.setTimeRange(startVersion, endVersion);
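Put together, a rough sketch of such a scan with the 0.94-era API (startVersion and endVersion are whatever epoch-millisecond bounds you choose; the table name is a placeholder):

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeVersionsScan {
    public static void main(String[] args) throws Exception {
        long startVersion = 1340259691771L; // placeholder lower bound (inclusive)
        long endVersion = 1340259704570L;   // placeholder upper bound (exclusive)

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test");
        try {
            Scan scan = new Scan();
            scan.setMaxVersions(Integer.MAX_VALUE);      // return every stored version...
            scan.setTimeRange(startVersion, endVersion); // ...that falls inside the range
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                List<KeyValue> cells = r.list(); // all cells of the row, newest first per column
                for (KeyValue kv : cells) {
                    System.out.println(Bytes.toString(r.getRow()) + " @ "
                            + kv.getTimestamp() + " = " + Bytes.toString(kv.getValue()));
                }
            }
            scanner.close();
        } finally {
            table.close();
        }
    }
}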

I think this is exactly the same problem I ran into here: HBase get returns old values even with max versions = 1
It turns out to be a bug in HBase.
See: https://issues.apache.org/jira/browse/HBASE-10102

Related

Why does TiDB performance drop 10x when the updated field value is random?

I set up a TiDB, TiKV and PD cluster in order to benchmark them with the YCSB tool, connected via the MySQL driver.
The cluster consists of 5 instances each of TiDB, TiKV and PD.
Each node runs a single TiDB, TiKV and PD instance.
However, when I play around with the YCSB code for the update statement, I notice that if the value of the updated field is fixed and hardcoded, the total throughput is ~30K tps and the latency is ~30ms. If the updated field value is random, the total throughput is ~2K tps and the latency is around ~300ms.
The update statement creation code is as follows:
@Override
public String createUpdateStatement(StatementType updateType) {
    String[] fieldKeys = updateType.getFieldString().split(",");
    StringBuilder update = new StringBuilder("UPDATE ");
    update.append(updateType.getTableName());
    update.append(" SET ");
    for (int i = 0; i < fieldKeys.length; i++) {
        update.append(fieldKeys[i]);
        String randStr = RandomCharStr(); // 1) 3K tps with 300ms latency
        //String randStr = "Hardcode-Field-Value"; // 2) 20K tps with 20ms latency
        update.append(" = '" + randStr + "'");
        if (i < fieldKeys.length - 1) {
            update.append(", ");
        }
    }
    // update.append(fieldKey);
    update.append(" WHERE ");
    update.append(JdbcDBClient.PRIMARY_KEY);
    update.append(" = ?");
    return update.toString();
}
How do we account for this performance gap?
Is it due to the DistSQL query cache, as discussed in this post?
I managed to figure this out from this post (Same transaction returns different results when I run it multiple times) and this issue (https://github.com/pingcap/tidb/issues/7644).
It is because TiDB will not perform the write if the updated field value is identical to the previous value.
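To make the difference concrete, here is a minimal JDBC sketch of the two variants the YCSB change toggles between (the table usertable, column field0, key column YCSB_KEY and the connection settings are placeholders, not taken from the question): with the fixed value, repeated runs leave the row unchanged, which is the case TiDB can skip; with a random value, every run is a real write.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.UUID;

public class TidbUpdateSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string; adjust host/port/database for your cluster.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://127.0.0.1:4000/test", "root", "")) {
            PreparedStatement ps = conn.prepareStatement(
                    "UPDATE usertable SET field0 = ? WHERE YCSB_KEY = ?");

            // Variant 1: fixed value. After the first run the row no longer changes,
            // so the update can be treated as a no-op write.
            ps.setString(1, "Hardcode-Field-Value");
            ps.setString(2, "user1");
            ps.executeUpdate();

            // Variant 2: random value. Every execution really modifies the row,
            // so the full write path (and its latency) is exercised each time.
            ps.setString(1, UUID.randomUUID().toString());
            ps.setString(2, "user1");
            ps.executeUpdate();
        }
    }
}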

jdbcTemplate.queryForList returns list of Map where all column values are NULL

Any call using jdbcTemplate.queryForList returns a list of Maps which have NULL values for all columns. The columns should have had string values.
I do get the correct number of rows when compared to the result I get when running the same query in a native SQL client.
I am using the JDBC-ODBC bridge and the database is MS SQL Server 2008.
I have the following code in my DAO:
public List internalCodeDescriptions(String listID) {
    List rows = jdbcTemplate.queryForList(
            "select CODE, DESCRIPTION from CODE_DESCRIPTIONS where LIST_ID=? order by sort_order asc",
            new Object[] {listID});
    // debug code start
    try {
        Connection conn1 = jdbcTemplate.getDataSource().getConnection();
        Statement stat = conn1.createStatement();
        boolean sok = stat.execute("select code, description from code_descriptions where list_id='TRIGGER' order by sort_order asc");
        if (sok) {
            ResultSet rs = stat.getResultSet();
            ResultSetMetaData rsmd = rs.getMetaData();
            String columnname1 = rsmd.getColumnName(1);
            String columnname2 = rsmd.getColumnName(2);
            int type1 = rsmd.getColumnType(1);
            int type2 = rsmd.getColumnType(2);
            String tn1 = rsmd.getColumnTypeName(1);
            String tn2 = rsmd.getColumnTypeName(2);
            log.debug("Testquery gave resultset with:");
            log.debug("Column 1 -name:" + columnname1 + " -typeID:" + type1 + " -typeName:" + tn1);
            log.debug("Column 2 -name:" + columnname2 + " -typeID:" + type2 + " -typeName:" + tn2);
            int i = 1;
            while (rs.next()) {
                String cd = rs.getString(1);
                String desc = rs.getString(2);
                log.debug("Row #" + i + ": CODE='" + cd + "' DESCRIPTION='" + desc + "'");
                i++;
            }
        } else {
            log.debug("Query execution returned false");
        }
    } catch (SQLException se) {
        log.debug("Something went haywire in the debug code:" + se.toString());
    }
    log.debug("Original jdbcTemplate list result gave:");
    Iterator<Map<String, Object>> it1 = rows.iterator();
    while (it1.hasNext()) {
        Map mm = (Map) it1.next();
        log.debug("Map:" + mm);
        String code = (String) mm.get("CODE");
        String desc = (String) mm.get("description");
        log.debug("CODE:" + code + " : " + desc);
    }
    // debug code end
    return rows;
}
As you can see, I've added some debugging code to list the results from queryForList, and I also obtain the connection from the jdbcTemplate object and use it to send the same query with plain JDBC methods (listID='TRIGGER').
What is puzzling me is that the log outputs something like this:
Testquery gave resultset with:
Column 1 -name:code -typeID:-9 -typeName:nvarchar
Column 2 -name:decription -typeID:-9 -typeName:nvarchar
Row #1: CODE='C1' DESCRIPTION='BlodoverxF8rin eller bruk av blodprodukter'
Row #2: CODE='C2' DESCRIPTION='Kodetilfelle, hjertestans/respirasjonstans'
Row #3: CODE='C3' DESCRIPTION='Akutt dialyse'
...
Row #58: CODE='S14' DESCRIPTION='Forekomst av hvilken som helst komplikasjon'
...
Original jdbcTemplate list result gave:
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
...
58 repetitions in total.
Why does the result from the queryForList method return NULL in all columns for every row? How can I get the result I want using jdbcTemplate.queryForList?
The xF8 should be the letter ø, so I have some encoding issues, but I can't see how that could cause all values - also strings not containing any strange letters (see row #2) - to turn into NULL values in the list of maps returned from jdbcTemplate.queryForList.
The same code ran fine on another server against a MySQL Server 5.5 database using the JDBC driver for MySQL.
The issue was resolved by using the MS SQL Server JDBC driver rather than the JDBC-ODBC bridge. I don't know why it didn't work with the bridge, though.
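For reference, a minimal sketch of wiring a JdbcTemplate to the Microsoft driver instead of the bridge (host, port, database name and credentials are placeholders; DriverManagerDataSource is used only because it is the simplest Spring DataSource for illustration):

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

public class SqlServerJdbcTemplateConfig {
    public static JdbcTemplate buildJdbcTemplate() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        // Microsoft JDBC driver instead of the JDBC-ODBC bridge (sun.jdbc.odbc.JdbcOdbcDriver).
        ds.setDriverClassName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
        ds.setUrl("jdbc:sqlserver://dbhost:1433;databaseName=mydb");
        ds.setUsername("user");
        ds.setPassword("password");
        return new JdbcTemplate(ds);
    }
}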

How do I transform a parameter in Pig?

I need to process a dataset in Pig which is available once per day at midnight. Therefore I have an Oozie coordinator that takes care of the scheduling and spawns a workflow every day at 00:00.
The file names follow the URI scheme
hdfs://${dataRoot}/input/raw${YEAR}${MONTH}${DAY}${HOUR}.avro
where ${HOUR} is always '00'.
Each entry in the dataset contains a UNIX timestamp, and I want to filter out those entries which have a timestamp before 11:45pm (23:45). As I need to run on datasets from the past, the value of the timestamp defining the threshold needs to be set dynamically according to the day currently processed. For example, processing the dataset from December 12th, 2013 needs the threshold 1418337900. For this reason, setting the threshold must be done by the coordinator.
To the best of my knowledge, there is no way to transform a formatted date into a UNIX timestamp in EL. I came up with a quite hacky solution:
The coordinator passes the date and time of the threshold to the respective workflow, which starts the parameterized instance of the Pig script.
Excerpt of the coordinator.xml:
<property>
    <name>threshold</name>
    <value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -15, 'MINUTE'), 'yyyyMMddHHmm')}</value>
</property>
Excerpt of the workflow.xml:
<action name="foo">
    <pig>
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <script>${applicationPath}/flights.pig</script>
        <param>jobInput=${jobInput}</param>
        <param>jobOutput=${jobOutput}</param>
        <param>threshold=${threshold}</param>
    </pig>
    <ok to="end"/>
    <error to="error"/>
</action>
The Pig script needs to convert this formatted datetime into a UNIX timestamp. Therefore, I have written a UDF:
import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UnixTime extends EvalFunc<Long> {

    private long myTimestamp = 0L;

    private static long convertDateTime(String dt, String format) throws IOException {
        DateFormat formatter = new SimpleDateFormat(format);
        Date date;
        try {
            date = formatter.parse(dt);
        } catch (ParseException ex) {
            throw new IOException("Illegal Date: " + dt + " format: " + format);
        }
        return date.getTime() / 1000L;
    }

    public UnixTime(String dt, String format) throws IOException {
        myTimestamp = convertDateTime(dt, format);
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        return myTimestamp;
    }
}
In the Pig script, a macro is created that initializes the UDF with the input from the coordinator/workflow. Then you can filter on the timestamps.
DEFINE THRESH mystuff.pig.UnixTime('$threshold', 'yyyyMMddHHmm');
d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= THRESH();
...
The problem that I have leads me to the more general question: is it possible to transform an input parameter in Pig and use it again as some kind of constant?
Is there a better way to solve this problem, or is my approach needlessly complicated?
Edit: TL;DR
After more searching I found someone with the same problem:
http://grokbase.com/t/pig/user/125gszzxnx/survey-where-are-all-the-udfs-and-macros
Thanks Gaurav for recommending the UDFs in piggybank.
It seems that there is no performant solution without using declare and a shell script.
You can put the Pig script into a Python script and pass the value.
#!/usr/bin/python
import sys
import time
from org.apache.pig.scripting import Pig

P = Pig.compile("""
d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= $thresh;
""")

jobinput = None  # whatever you defined
thresh = None    # whatever you defined in the UDF

Q = P.bind({'thresh': thresh, 'jobInput': jobinput})
results = Q.runSingle()
if not results.isSuccessful():
    raise RuntimeError("Pig job failed")

How can I export an HBase table using starttime and endtime?

I am trying to perform an incremental backup. I have already checked the Export option but couldn't figure out the start time option. Also, please advise on CopyTable: how can I restore from it?
Using CopyTable you just get a copy of the given table on the same or another cluster (it is actually a CopyTable MapReduce job). No miracle.
It's your own decision how to restore. The obvious options are:
Use the same tool to copy the table back.
Just get/put selected rows (which is what I think you need here). Please pay attention that you should keep the timestamps while putting the data back.
Actually, for incremental backup it's enough to write a job which scans the table and gets/puts rows with the given timestamps into a table whose name is calculated from the date. Restore should work in the reverse direction - read the table with the calculated name and put its records back with the same timestamps (see the sketch below).
I'd also recommend the following technique: table snapshots (CDH 4.2.1 uses HBase 0.94.2). It doesn't look applicable to incremental backup, but maybe you'll find something useful there, such as additional APIs. From the point of view of backup it looks nice now.
Hope this helps somehow.
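A rough sketch of such a job (0.94-era client API; the table names, the time window and running it as a plain client program rather than as MapReduce are assumptions for illustration): the important detail is passing each cell's original timestamp into the Put.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class IncrementalCopy {
    public static void main(String[] args) throws Exception {
        long startTime = Long.parseLong(args[0]); // lower bound of the backup window (inclusive)
        long endTime = Long.parseLong(args[1]);   // upper bound (exclusive)

        Configuration conf = HBaseConfiguration.create();
        HTable source = new HTable(conf, "mytable");                 // placeholder source table
        HTable backup = new HTable(conf, "mytable_backup_20130520"); // name calculated by date
        try {
            Scan scan = new Scan();
            scan.setTimeRange(startTime, endTime); // only cells written inside the window
            ResultScanner scanner = source.getScanner(scan);
            for (Result row : scanner) {
                Put put = new Put(row.getRow());
                for (KeyValue kv : row.raw()) {
                    // Keep the original timestamp so a later restore (the same loop in the
                    // other direction) reproduces the cells exactly.
                    put.add(kv.getFamily(), kv.getQualifier(), kv.getTimestamp(), kv.getValue());
                }
                backup.put(put);
            }
            scanner.close();
        } finally {
            source.close();
            backup.close();
        }
    }
}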
The source code suggests
int versions = args.length > 2? Integer.parseInt(args[2]): 1;
long startTime = args.length > 3? Long.parseLong(args[3]): 0L;
long endTime = args.length > 4? Long.parseLong(args[4]): Long.MAX_VALUE;
The accepted answer doesn't pass version as a parameter. How did it work then?
hbase org.apache.hadoop.hbase.mapreduce.Export test /bkp_destination/test 1369060183200 1369063567260023219
From the source code this boils down to:
1369060183200 - args[2] - version
1369063567260023219 - args[3] - starttime
Attaching the source for reference:
private static Scan getConfiguredScanForJob(Configuration conf, String[] args) throws IOException {
    Scan s = new Scan();
    // Optional arguments.
    // Set Scan Versions
    int versions = args.length > 2? Integer.parseInt(args[2]): 1;
    s.setMaxVersions(versions);
    // Set Scan Range
    long startTime = args.length > 3? Long.parseLong(args[3]): 0L;
    long endTime = args.length > 4? Long.parseLong(args[4]): Long.MAX_VALUE;
    s.setTimeRange(startTime, endTime);
    // Set cache blocks
    s.setCacheBlocks(false);
    // set Start and Stop row
    if (conf.get(TableInputFormat.SCAN_ROW_START) != null) {
        s.setStartRow(Bytes.toBytesBinary(conf.get(TableInputFormat.SCAN_ROW_START)));
    }
    if (conf.get(TableInputFormat.SCAN_ROW_STOP) != null) {
        s.setStopRow(Bytes.toBytesBinary(conf.get(TableInputFormat.SCAN_ROW_STOP)));
    }
    // Set Scan Column Family
    boolean raw = Boolean.parseBoolean(conf.get(RAW_SCAN));
    if (raw) {
        s.setRaw(raw);
    }
    if (conf.get(TableInputFormat.SCAN_COLUMN_FAMILY) != null) {
        s.addFamily(Bytes.toBytes(conf.get(TableInputFormat.SCAN_COLUMN_FAMILY)));
    }
    // Set RowFilter or Prefix Filter if applicable.
    Filter exportFilter = getExportFilter(args);
    if (exportFilter != null) {
        LOG.info("Setting Scan Filter for Export.");
        s.setFilter(exportFilter);
    }
    int batching = conf.getInt(EXPORT_BATCHING, -1);
    if (batching != -1) {
        try {
            s.setBatch(batching);
        } catch (IncompatibleFilterException e) {
            LOG.error("Batching could not be set", e);
        }
    }
    LOG.info("versions=" + versions + ", starttime=" + startTime +
        ", endtime=" + endTime + ", keepDeletedCells=" + raw);
    return s;
}
Found the issue. The HBase documentation says:
hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
So after trying a few combinations, I found that it translates to a real example like the command below:
hbase org.apache.hadoop.hbase.mapreduce.Export test /bkp_destination/test 1369060183200 1369063567260023219
where
test is the table name,
/bkp_destination/test is the backup destination folder,
1369060183200 is the starttime,
1369063567260023219 is the endtime.

prepared statement in multithreading

I have used the MERGE command in my prepared statement, and when I executed it in a single-threaded environment it worked fine. But in a multi-threaded environment it causes a problem: the data is duplicated, i.e. if I have 5 threads, each record is duplicated 5 times. I think there is no lock in the db to help the threads.
My code:
//db:oracle
sb.append("MERGE INTO EMP_BONUS EB USING (SELECT 1 FROM DUAL) on (EB.EMP_id = ?) WHEN MATCHED THEN UPDATE SET TA =?,DA=?,TOTAL=?,MOTH=? WHEN NOT MATCHED THEN "+ "INSERT (EMP_ID, TA, DA, TOTAL, MOTH, NAME)VALUES(?,?,?,?,?,?) ");
// sql operation, calling from run() method
public void executeMerge(String threadName) throws Exception {
    ConnectionPro cPro = new ConnectionPro();
    Connection connE = cPro.getConection();
    connE.setAutoCommit(false);
    System.out.println(sb.toString());
    System.out.println("Threadname=" + threadName);
    PreparedStatement pStmt = connE.prepareStatement(sb.toString());
    try {
        count = count + 1;
        for (Employee employeeObj : employee) { // data list of employees
            pStmt.setInt(1, employeeObj.getEmp_id());
            pStmt.setDouble(2, employeeObj.getSalary() * .10);
            pStmt.setDouble(3, employeeObj.getSalary() * .05);
            pStmt.setDouble(4, employeeObj.getSalary()
                    + (employeeObj.getSalary() * .05)
                    + (employeeObj.getSalary() * .10));
            pStmt.setInt(5, count);
            pStmt.setDouble(6, employeeObj.getEmp_id());
            pStmt.setDouble(7, employeeObj.getSalary() * .10);
            pStmt.setDouble(8, employeeObj.getSalary() * .05);
            pStmt.setDouble(9, employeeObj.getSalary()
                    + (employeeObj.getSalary() * .05)
                    + (employeeObj.getSalary() * .10));
            pStmt.setInt(10, count);
            pStmt.setString(11, threadName);
            // pStmt.executeUpdate();
            pStmt.addBatch();
        }
        pStmt.executeBatch();
        connE.commit();
    } catch (Exception e) {
        connE.rollback();
        throw e;
    } finally {
        pStmt.close();
        connE.close();
    }
}
If employee.size() = 5 and the thread count = 5, then after execution I get 25 records instead of 5.
If there is no constraint (i.e. a primary key or a unique key constraint on the emp_id column in emp_bonus), there would be nothing to prevent the database from allowing each thread to insert 5 rows. Since each database session cannot see uncommitted changes made by other sessions, each thread would see that there was no row in emp_bonus with the emp_id the thread is looking for (I'm assuming that employeeObj.getEmp_id() returns the same 5 emp_id values in each thread), so each thread would insert all 5 rows, leaving you with a total of 25 rows if there are 5 threads.
If you have a unique constraint that prevents the duplicate rows from being inserted, Oracle will make the other 4 threads block until the first thread commits, allowing the subsequent threads to do updates rather than inserts. Of course, this will cause the threads to be serialized, defeating any performance gains you would get from running multiple threads.
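If you want the MERGE to behave as an upsert under concurrency, the usual fix is exactly that unique (or primary key) constraint on emp_id. A minimal sketch of adding it from JDBC (the constraint name, connection details, and the choice to run the DDL from Java rather than SQL*Plus are assumptions for illustration):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AddEmpBonusConstraint {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement st = conn.createStatement()) {
            // With this constraint in place, concurrent MERGEs on the same EMP_ID can no
            // longer insert duplicates; as described above, later threads block until the
            // first transaction commits.
            st.execute("ALTER TABLE EMP_BONUS ADD CONSTRAINT EMP_BONUS_UK UNIQUE (EMP_ID)");
        }
    }
}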
