How can I export an HBase table using starttime and endtime? - hadoop

I am trying to perform an incremental backup. I have already checked the Export option but couldn't figure out the start time option. Also, please advise on CopyTable: how can I restore from it?

CopyTable simply gives you a copy of the given table on the same or another cluster (it runs as a MapReduce job). There is no magic involved.
How you restore is your own decision. The obvious options are:
Use the same tool to copy the table back.
Get and put only the selected rows (which I think is what you need here). Note that you must preserve the original timestamps when putting the data back.
For an incremental backup it is enough to write a job that scans the table and copies the rows whose timestamps fall in the given range into a table whose name is derived from the date. Restore works in the reverse direction: read the table with the derived name and put its records back with the same timestamps.
I would also recommend looking at table snapshots (CDH 4.2.1 uses HBase 0.94.2). They do not look applicable to incremental backup, but you may find something useful there, such as additional API. For plain backups they look like a good fit.
Hope this helps.
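The "table name derived from the date" and "given timestamps" parts of that approach can be sketched in plain Java. This is only an illustration: the class name, the `_bkp_` naming scheme and the choice of UTC day boundaries are assumptions, not something HBase prescribes. The two millisecond values are what such a job would pass to `Scan.setTimeRange(start, end)`.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: derives the per-day backup table name and the scan time range. */
final class BackupWindow {
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd, e.g. 20130520.
    private static final DateTimeFormatter SUFFIX = DateTimeFormatter.BASIC_ISO_DATE;

    /** Backup table name for one day; the "_bkp_" scheme is an assumption. */
    static String backupTableName(String sourceTable, LocalDate day) {
        return sourceTable + "_bkp_" + day.format(SUFFIX);
    }

    /** Returns {startMillis, endMillis} covering one UTC day, end being the next midnight. */
    static long[] dayWindowMillis(LocalDate day) {
        long start = day.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        long end = day.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        return new long[] {start, end};
    }
}
```

Restoring would use the same two helpers in reverse: compute the table name for the day to restore, scan it, and write each cell back with its original timestamp.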

The source code suggests
int versions = args.length > 2? Integer.parseInt(args[2]): 1;
long startTime = args.length > 3? Long.parseLong(args[3]): 0L;
long endTime = args.length > 4? Long.parseLong(args[4]): Long.MAX_VALUE;
The accepted answer doesn't pass version as a parameter. How did it work then?
hbase org.apache.hadoop.hbase.mapreduce.Export test /bkp_destination/test 1369060183200 1369063567260023219
According to the source code, this boils down to:
1369060183200 - args[2] - version
1369063567260023219 - args[3] - starttime
Source attached for reference:
private static Scan getConfiguredScanForJob(Configuration conf, String[] args) throws IOException {
    Scan s = new Scan();
    // Optional arguments.
    // Set Scan Versions
    int versions = args.length > 2 ? Integer.parseInt(args[2]) : 1;
    s.setMaxVersions(versions);
    // Set Scan Range
    long startTime = args.length > 3 ? Long.parseLong(args[3]) : 0L;
    long endTime = args.length > 4 ? Long.parseLong(args[4]) : Long.MAX_VALUE;
    s.setTimeRange(startTime, endTime);
    // Set cache blocks
    s.setCacheBlocks(false);
    // set Start and Stop row
    if (conf.get(TableInputFormat.SCAN_ROW_START) != null) {
        s.setStartRow(Bytes.toBytesBinary(conf.get(TableInputFormat.SCAN_ROW_START)));
    }
    if (conf.get(TableInputFormat.SCAN_ROW_STOP) != null) {
        s.setStopRow(Bytes.toBytesBinary(conf.get(TableInputFormat.SCAN_ROW_STOP)));
    }
    // Set Scan Column Family
    boolean raw = Boolean.parseBoolean(conf.get(RAW_SCAN));
    if (raw) {
        s.setRaw(raw);
    }
    if (conf.get(TableInputFormat.SCAN_COLUMN_FAMILY) != null) {
        s.addFamily(Bytes.toBytes(conf.get(TableInputFormat.SCAN_COLUMN_FAMILY)));
    }
    // Set RowFilter or Prefix Filter if applicable.
    Filter exportFilter = getExportFilter(args);
    if (exportFilter != null) {
        LOG.info("Setting Scan Filter for Export.");
        s.setFilter(exportFilter);
    }
    int batching = conf.getInt(EXPORT_BATCHING, -1);
    if (batching != -1) {
        try {
            s.setBatch(batching);
        } catch (IncompatibleFilterException e) {
            LOG.error("Batching could not be set", e);
        }
    }
    LOG.info("versions=" + versions + ", starttime=" + startTime +
        ", endtime=" + endTime + ", keepDeletedCells=" + raw);
    return s;
}
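To see what the three parsing lines in this excerpt do with a given command line, they can be isolated in a small stand-alone class. This is a sketch that only mirrors the excerpt; `ExportArgs` is a hypothetical name, not an HBase class, and other HBase versions may parse their arguments differently.

```java
/** Mirrors the positional parsing of the excerpt above; not the HBase Export class itself. */
final class ExportArgs {
    final int versions;
    final long startTime;
    final long endTime;

    private ExportArgs(int versions, long startTime, long endTime) {
        this.versions = versions;
        this.startTime = startTime;
        this.endTime = endTime;
    }

    /** args layout: <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]] */
    static ExportArgs parse(String[] args) {
        // Same defaults as the excerpt: 1 version, full time range.
        int versions = args.length > 2 ? Integer.parseInt(args[2]) : 1;
        long startTime = args.length > 3 ? Long.parseLong(args[3]) : 0L;
        long endTime = args.length > 4 ? Long.parseLong(args[4]) : Long.MAX_VALUE;
        return new ExportArgs(versions, startTime, endTime);
    }
}
```

With this parsing, a millisecond timestamp in the third position lands in the versions slot, and `Integer.parseInt` rejects it with a `NumberFormatException` because the value exceeds the `int` range. Passing an explicit version count first, for example `test /bkp_destination/test 1 1369060183200 1369063567260`, puts both timestamps where the excerpt expects them.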

I found the issue. The HBase documentation gives the syntax as
hbase org.apache.hadoop.hbase.mapreduce.Export <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
After trying a few combinations, I found that a real invocation looks like this:
hbase org.apache.hadoop.hbase.mapreduce.Export test /bkp_destination/test 1369060183200 1369063567260023219
where
test is tablename,
/bkp_destination/test is backup destination folder,
1369060183200 is starttime,
1369063567260023219 is endtime

Related

How to purge old content in firebase realtime database

I am using the Firebase Realtime Database. Over time a lot of stale data has accumulated in it, and I have written a script to delete the stale content.
My Node structure looks something like this:
store
- {store_name}
- products
- {product_name}
- data
- {date} e.g. 01_Sep_2017
- some_event
Scale of the data
#Stores: ~110K
#Products: ~25
Context
I want to clean up all data that is roughly 30 months old. I tried the following approach:
For each store, traverse all the products and, for each date, delete the node.
I ran ~30 threads/script instances, each responsible for deleting one particular date of data in the month. With the structure above, the scripts take ~12 hours to delete one month of data.
I have placed a cap on the number of pending calls in each script. The logs show that each script reaches the cap very quickly, i.e. delete calls are fired much faster than they complete, so Firebase is the bottleneck.
I am running the purge script on the client side; to gain performance it should presumably run close to the data to save network round-trip time.
Questions
Q1. How can I delete old Firebase nodes efficiently?
Q2. Is there a way to set a TTL on each node so that it is cleaned up automatically?
Q3. I have confirmed for multiple nodes that the data has been deleted, but the Firebase console does not show a decrease in data. I also took a backup, and it still contains some data that was not there when I checked the nodes manually. What is the reason for this inconsistency?
Does Firebase perform soft deletes, so that the data is still present in backups but hidden from the Firebase SDK and console, which honor soft deletes while backups do not?
Q4. For the whole time my script is running, the bandwidth graph rises continuously. The script below only fires delete calls and does not read any data, yet I still see matching database read traffic.
Is this caused by the callbacks of the deleted nodes?
Code
// Note: db, getStoreNames and getProductNames are defined elsewhere (not shown).
var stores = [];
var storeIndex = 0;
var products = [];
var productIndex = -1;
const month = 'Oct';
const year = 2017;

// Both $beginDate and $endDate are required.
if (process.argv.length < 4) {
    console.log("Usage: node purge.js $beginDate $endDate e.g. node purge 1 2 | Exiting..");
    process.exit();
}
// Parse as numbers so that comparisons and increments are numeric, not string-based.
var beginDate = parseInt(process.argv[2], 10);
var endDate = parseInt(process.argv[3], 10);
var numPendingCalls = 0;
const maxPendingCalls = 500;

/**
 * Url Pattern: /store/{domain}/products/{product_name}/data/{date}
 * date Pattern: 01_Jan_2017
 */
function deleteNode() {
    var storeName = stores[storeIndex],
        productName = products[productIndex],
        date = (beginDate < 10 ? '0' + beginDate : beginDate) + '_' + month + '_' + year;
    numPendingCalls++;
    db.ref('store')
        .child(storeName)
        .child('products')
        .child(productName)
        .child('data')
        .child(date)
        .remove(function() {
            numPendingCalls--;
        });
}

function deleteData() {
    productIndex++;
    // When all products for a particular store are complete, start on the next store for the given date
    if (productIndex === products.length) {
        if (storeIndex % 1000 === 0) {
            console.log('Script: ' + beginDate, 'PendingCalls: ' + numPendingCalls, 'StoreIndex: ' + storeIndex, 'Store: ' + stores[storeIndex], 'Time: ' + (new Date()).toString());
        }
        productIndex = 0;
        storeIndex++;
    }
    // When all stores have been completed, start deleting for the next date
    if (storeIndex === stores.length) {
        console.log('Script: ' + beginDate, 'Successfully deleted data for date: ' + beginDate + '_' + month + '_' + year + '. Time: ' + (new Date()).toString());
        beginDate++;
        storeIndex = 0;
    }
    // When endDate has been passed, all data has been deleted, so exit
    if (beginDate > endDate) {
        console.log('Script: ' + beginDate, 'Deletion script finished successfully at: ' + (new Date()).toString());
        process.exit();
        return;
    }
    deleteNode();
}

function init() {
    console.log('Script: ' + beginDate, 'Deletion script started at: ' + (new Date()).toString());
    getStoreNames(function() {
        getProductNames(function() {
            // Fire the next delete only while below the pending-call cap
            setInterval(function() {
                if (numPendingCalls < maxPendingCalls) {
                    deleteData();
                }
            }, 0);
        });
    });
}
PS: This is not the exact structure I have but it is very similar to what we have (I have changed the node names and tried to make the example a realistic example)
Whether the deletes can be done more efficiently depends on how you now do them. Since you didn't share the minimal code that reproduces your current behavior it's hard to say how to improve it.
There is no support for a time-to-live property on documents. Typically developers do the clean-up in an administrative program/script that runs periodically. The more frequently you run the cleanup script, the less work it has to do, and thus the faster it will be.
Also see:
Delete firebase data older than 2 hours
How to delete firebase data after "n" days
Firebase actually deletes the data from disk when you tell it to. There is no way through the API to retrieve it, since it is really gone. But if you have a backup from a previous day, the data will of course still be there.
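A periodic clean-up script first has to decide which date keys have passed the retention period. A minimal sketch of that selection step, written in Java rather than with the Node.js SDK and assuming the keys follow the `01_Sep_2017` pattern from the question; the class and method names are hypothetical, and the actual delete calls are left out.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper that picks the date keys older than the retention period. */
final class StaleDateKeys {
    // Key format taken from the question, e.g. "01_Sep_2017".
    private static final DateTimeFormatter KEY =
            DateTimeFormatter.ofPattern("dd_MMM_yyyy", Locale.ENGLISH);

    /** Returns the keys whose date is strictly before today minus retentionMonths. */
    static List<String> staleKeys(List<String> dateKeys, LocalDate today, int retentionMonths) {
        LocalDate cutoff = today.minusMonths(retentionMonths);
        List<String> stale = new ArrayList<>();
        for (String key : dateKeys) {
            if (LocalDate.parse(key, KEY).isBefore(cutoff)) {
                stale.add(key);
            }
        }
        return stale;
    }
}
```

Running this selection daily keeps the list of stale keys short, which matches the advice above that frequent runs have less work to do.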

jdbcTemplate.queryForList returns list of Map where all column values are NULL

Any calls using jdbcTemplate.queryForList returns a list of Maps which have NULL values for all columns. The columns should've had string values.
I do get the correct number of rows when compared to the result I get when I run the same query in a native SQL client.
I am using the JDBC ODBC bridge and the database is MS SQL server 2008.
I have the following code in my DAO:
public List internalCodeDescriptions(String listID) {
    List rows = jdbcTemplate.queryForList("select CODE, DESCRIPTION from CODE_DESCRIPTIONS where LIST_ID=? order by sort_order asc", new Object[] {listID});
    //debugcode start
    try {
        Connection conn1 = jdbcTemplate.getDataSource().getConnection();
        Statement stat = conn1.createStatement();
        boolean sok = stat.execute("select code, description from code_descriptions where list_id='TRIGGER' order by sort_order asc");
        if (sok) {
            ResultSet rs = stat.getResultSet();
            ResultSetMetaData rsmd = rs.getMetaData();
            String columnname1 = rsmd.getColumnName(1);
            String columnname2 = rsmd.getColumnName(2);
            int type1 = rsmd.getColumnType(1);
            int type2 = rsmd.getColumnType(2);
            String tn1 = rsmd.getColumnTypeName(1);
            String tn2 = rsmd.getColumnTypeName(2);
            log.debug("Testquery gave resultset with:");
            log.debug("Column 1 -name:" + columnname1 + " -typeID:" + type1 + " -typeName:" + tn1);
            log.debug("Column 2 -name:" + columnname2 + " -typeID:" + type2 + " -typeName:" + tn2);
            int i = 1;
            while (rs.next()) {
                String cd = rs.getString(1);
                String desc = rs.getString(2);
                log.debug("Row #" + i + ": CODE='" + cd + "' DESCRIPTION='" + desc + "'");
                i++;
            }
        } else {
            log.debug("Query execution returned false");
        }
    } catch (SQLException se) {
        log.debug("Something went haywire in the debug code:" + se.toString());
    }
    log.debug("Original jdbcTemplate list result gave:");
    Iterator<Map<String, Object>> it1 = rows.iterator();
    while (it1.hasNext()) {
        Map mm = (Map) it1.next();
        log.debug("Map:" + mm);
        String code = (String) mm.get("CODE");
        String desc = (String) mm.get("description");
        log.debug("CODE:" + code + " : " + desc);
    }
    //debugcode end
    return rows;
}
As you can see, I've added some debugging code to list the results from queryForList. I also obtain the connection from the jdbcTemplate object and use it to send the same query with the basic JDBC methods (listID='TRIGGER').
What is puzzling me is that the log outputs something like this:
Testquery gave resultset with:
Column 1 -name:code -typeID:-9 -typeName:nvarchar
Column 2 -name:decription -typeID:-9 -typeName:nvarchar
Row #1: CODE='C1' DESCRIPTION='BlodoverxF8rin eller bruk av blodprodukter'
Row #2: CODE='C2' DESCRIPTION='Kodetilfelle, hjertestans/respirasjonstans'
Row #3: CODE='C3' DESCRIPTION='Akutt dialyse'
...
Row #58: CODE='S14' DESCRIPTION='Forekomst av hvilken som helst komplikasjon'
...
Original jdbcTemplate list result gave:
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
Map:(CODE=null, DESCRIPTION=null)
CODE:null : null
...
58 repetitions total.
Why does the result from the queryForList method return NULL in all columns for every row? How can I get the result I want using jdbcTemplate.queryForList?
The xF8 should be the letter ø, so I have some encoding issues, but I can't see how that could cause all values, including strings that contain no unusual letters (see row #2), to turn into NULL values in the list of maps returned by jdbcTemplate.queryForList.
The same code ran fine on another server against a MySQL Server 5.5 database using the jdbc driver for MySQL.
The issue was resolved by using the MS SQL Server JDBC driver instead of the JDBC-ODBC bridge. I don't know why it didn't work with the bridge, though.

How do I transform a parameter in Pig?

I need to process a dataset in Pig, which is available once per day at midnight. Therefore I have an Oozie coordinator that takes care of the scheduling and spawns a workflow every day at 00:00.
The file names follow the URI scheme
hdfs://${dataRoot}/input/raw${YEAR}${MONTH}${DAY}${HOUR}.avro
where ${HOUR} is always '00'.
Each entry in the dataset contains a UNIX timestamp, and I want to filter out the entries with a timestamp before 11:45 pm (23:45). As I need to run on datasets from the past, the timestamp that defines the threshold must be set dynamically according to the day being processed. For example, processing the dataset from December 12th, 2014 needs the threshold 1418337900. For this reason, the threshold must be set by the coordinator.
To the best of my knowledge, there is no way to transform a formatted date into a UNIX timestamp in EL. I came up with a rather hacky solution:
The coordinator passes date and time of the threshold to the respective workflow which starts the parameterized instance of the Pig script.
Excerpt of the coordinator.xml:
<property>
<name>threshold</name>
<value>${coord:formatTime(coord:dateOffset(coord:nominalTime(), -15, 'MINUTE'), 'yyyyMMddHHmm')}</value>
</property>
Excerpt of the workflow.xml:
<action name="foo">
<pig>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<script>${applicationPath}/flights.pig</script>
<param>jobInput=${jobInput}</param>
<param>jobOutput=${jobOutput}</param>
<param>threshold=${threshold}</param>
</pig>
<ok to="end"/>
<error to="error"/>
</action>
The Pig script needs to convert this formatted datetime into a UNIX timestamp. Therefore, I have written a UDF:
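package mystuff.pig;

import java.io.IOException;
import java.text.DateFormat;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class UnixTime extends EvalFunc<Long> {
    private long myTimestamp = 0L;

    // Parses dt with the given SimpleDateFormat pattern and returns seconds since the epoch.
    private static long convertDateTime(String dt, String format) throws IOException {
        DateFormat formatter = new SimpleDateFormat(format);
        Date date;
        try {
            date = formatter.parse(dt);
        } catch (ParseException ex) {
            throw new IOException("Illegal Date: " + dt + " format: " + format);
        }
        return date.getTime() / 1000L;
    }

    // The constructor arguments come from the DEFINE statement in the Pig script.
    public UnixTime(String dt, String format) throws IOException {
        myTimestamp = convertDateTime(dt, format);
    }

    @Override
    public Long exec(Tuple input) throws IOException {
        return myTimestamp;
    }
}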
In the Pig script, a macro is created, initializing the UDF with the input of the coordinator/workflow. Then, you can filter the timestamps.
DEFINE THRESH mystuff.pig.UnixTime('$threshold', 'yyyyMMddHHmm');
d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= THRESH();
...
My problem leads to a more general question: is it possible to transform an input parameter in Pig and then use it as a kind of constant?
Is there a better way to solve this problem or is my approach needlessly complicated?
Edit: TL;DR
After more searching I found someone with the same problem:
http://grokbase.com/t/pig/user/125gszzxnx/survey-where-are-all-the-udfs-and-macros
Thanks Gaurav for recommending the UDFs in piggybank.
It seems that there is no performant solution without using declare and a shell script.
You can put the Pig script into a Python script and pass the value.
#!/usr/bin/python
import sys
import time
from org.apache.pig.scripting import Pig

P = Pig.compile("""d = LOAD '$jobInput' USING PigStorage(',') AS (time: long, value: chararray);
f = FILTER d BY time <= $thresh;
""")

jobinput = {whatever you defined}
thresh = {whatever you defined in the UDF}
# The keys must match the $parameter names used in the script above.
Q = P.bind({'thresh': thresh, 'jobInput': jobinput})
results = Q.runSingle()
if not results.isSuccessful():
    raise Exception("Pig job failed")
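Whichever way the threshold reaches the script, the conversion from the coordinator's `yyyyMMddHHmm` string to a UNIX timestamp can be sketched with java.time. The class name and the explicit time zone parameter are assumptions for illustration; the UDF in the question uses the JVM default time zone via SimpleDateFormat instead.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: formatted threshold string to seconds since the Unix epoch. */
final class ThresholdTime {
    // Same pattern the coordinator passes to coord:formatTime in the question.
    private static final DateTimeFormatter FORMAT = DateTimeFormatter.ofPattern("yyyyMMddHHmm");

    /** Interprets the formatted date-time in the given zone and returns epoch seconds. */
    static long toUnixSeconds(String formatted, ZoneId zone) {
        return LocalDateTime.parse(formatted, FORMAT).atZone(zone).toEpochSecond();
    }
}
```

For example, `ThresholdTime.toUnixSeconds("201412112345", ZoneOffset.ofHours(1))` yields 1418337900, the example value from the question, which corresponds to 23:45 on 11 December 2014 at UTC+1. Making the zone explicit avoids results that change with the server's default time zone.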

failed to manipulate my Arraylist

I have an ArrayList of objects. Each object contains several fields; for this question I am interested in two date fields (date_panne and date_mise_en_marche) and two time fields (heure_panne and heure_mise_en_marche).
I would like to obtain the sum of the differences between (date_panne, heure_panne) and (date_mise_en_marche, heure_mise_en_marche), which gives the total time of failure.
If someone can help me, I would be grateful. This is my function:
public String disponibile() throws Exception {
    int nbreArrets = 0;
    List<Intervention> allInterventions = interventionDAO.fetchAllIntervention();
    List<Intervention> listInterventions = new ArrayList<Intervention>();
    for (Intervention currentIntervention : allInterventions) {
        if (currentIntervention.getId_machine() == this.intervention.getId_machine()
                && currentIntervention.getDate_panne().compareTo(getProductionStartDate()) >= 0
                && currentIntervention.getDate_panne().compareTo(getProductionEndDate()) <= 0) {
            listInterventions.add(currentIntervention);
        }
    }
    savedInterventionList = listInterventions;
    return "successView";
}
Assuming that the dates are truncated to the day and are of type java.util.Date, and that the times contain only hours, minutes, seconds and milliseconds and are also of type Date, start by creating a method like
private Date combine(Date dateOnly, Date timeOnly) {
    Calendar dateCalendar = Calendar.getInstance();
    dateCalendar.setTime(dateOnly);
    Calendar timeCalendar = Calendar.getInstance();
    timeCalendar.setTime(timeOnly);
    dateCalendar.add(Calendar.HOUR_OF_DAY, timeCalendar.get(Calendar.HOUR_OF_DAY));
    dateCalendar.add(Calendar.MINUTE, timeCalendar.get(Calendar.MINUTE));
    dateCalendar.add(Calendar.SECOND, timeCalendar.get(Calendar.SECOND));
    dateCalendar.add(Calendar.MILLISECOND, timeCalendar.get(Calendar.MILLISECOND));
    return dateCalendar.getTime();
}
Now, it's simply a matter of looping through the interventions you want to sum, computing the difference between the dates as milliseconds, and add them:
long totalMillis = 0L;
for (Intervention intervention : interventions) {
    Date marche = combine(intervention.getDateMiseEnMarche(), intervention.getTimeMiseEnMarche());
    Date panne = combine(intervention.getDatePanne(), intervention.getTimePanne());
    long differenceInMillis = marche.getTime() - panne.getTime();
    totalMillis += differenceInMillis;
}
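If the fields can be mapped to java.time types (Java 8 or later), the same sum can be written without Calendar arithmetic. This is a sketch under that assumption: `Outage` is a hypothetical stand-in for the Intervention entity, and the calculation uses local date-times, so daylight-saving shifts are not taken into account.

```java
import java.time.Duration;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.List;

final class Downtime {
    /** Hypothetical stand-in for Intervention, holding the four fields from the question. */
    static final class Outage {
        final LocalDateTime failure;  // date_panne + heure_panne
        final LocalDateTime restart;  // date_mise_en_marche + heure_mise_en_marche

        Outage(LocalDate failureDate, LocalTime failureTime,
               LocalDate restartDate, LocalTime restartTime) {
            this.failure = LocalDateTime.of(failureDate, failureTime);
            this.restart = LocalDateTime.of(restartDate, restartTime);
        }
    }

    /** Sums the time between failure and restart over all outages. */
    static Duration totalDowntime(List<Outage> outages) {
        Duration total = Duration.ZERO;
        for (Outage outage : outages) {
            total = total.plus(Duration.between(outage.failure, outage.restart));
        }
        return total;
    }
}
```

The resulting `Duration` can be converted with `toMinutes()` or `toMillis()`, the latter giving the same kind of value as `totalMillis` above.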

Entity Framework SaveChanges() first call is very slow

I appreciate that this issue has been raised a couple of times before, but I can't find a definitive answer (maybe there isn't one!).
Anyway the title tells it all really. Create a new context, add a new entity, SaveChanges() takes 20 seconds. Add second entity in same context, SaveChanges() instant.
Any thoughts on this? :-)
============ UPDATE =============
I've created a very simple app running against my existing model to show the issue...
public void Go()
{
    ModelContainer context = new ModelContainer(DbHelper.GenerateConnectionString());
    for (int i = 1; i <= 5; i++)
    {
        DateTime start = DateTime.Now;
        Order order = context.Orders.Single(c => c.Reference == "AA05056");
        DateTime end = DateTime.Now;
        double millisecs = (end - start).TotalMilliseconds;
        Console.WriteLine("Query " + i + " = " + millisecs + "ms (" + millisecs / 1000 + "s)");
        start = DateTime.Now;
        order.Note = start.ToLongTimeString();
        context.SaveChanges();
        end = DateTime.Now;
        millisecs = (end - start).TotalMilliseconds;
        Console.WriteLine("SaveChanges " + i + " = " + millisecs + "ms (" + millisecs / 1000 + "s)");
        Thread.Sleep(1000);
    }
    Console.ReadKey();
}
Please do not comment on my code - unless it is an invalid test ;)
The results are:
Query 1 = 3999.2288ms (3.9992288s)
SaveChanges 1 = 3391.194ms (3.391194s)
Query 2 = 18.001ms (0.018001s)
SaveChanges 2 = 4.0002ms (0.0040002s)
Query 3 = 14.0008ms (0.0140008s)
SaveChanges 3 = 3.0002ms (0.0030002s)
Query 4 = 13.0008ms (0.0130008s)
SaveChanges 4 = 3.0002ms (0.0030002s)
Query 5 = 10.0005ms (0.0100005s)
SaveChanges 5 = 3.0002ms (0.0030002s)
The first query takes time which I assume is the view generation? Or db connection?
The first save takes nearly 4 seconds here; the more complex save in my app takes over 20 seconds, which is not acceptable.
Not sure where to go with this now :-(
UPDATE...
SQL Profiler shows that the first query and update are fast and no different from the later ones, so the delay is in Entity Framework, as suspected.
It might not be the SaveChanges call - the first time you make any call to the database in EF, it has to do some initial code generation from the metadata. You can pre-generate this though at compile-time: http://msdn.microsoft.com/en-us/library/bb896240.aspx
I would be surprised if that's the only problem, but it might help.
Also have a look here: http://msdn.microsoft.com/en-us/library/cc853327.aspx
I would run the following code on app start up and see how long it takes and if after that the first SaveChanges is fast.
public static void UpdateDatabase()
{
    //Note: Using SetInitializer is recommended by Ladislav Mrnka with reputation 275k
    //http://stackoverflow.com/questions/9281423/entity-framework-4-3-run-migrations-at-application-start
    Database.SetInitializer<DAL.MyDbContext>(
        new MigrateDatabaseToLatestVersion<DAL.MyDbContext,
            Migrations.MyDbContext.Configuration>());
    using (var db = new DAL.MyDbContext()) {
        db.Database.Initialize(false);//Execute the migrations now, not at the first access
    }
}
