How to set autoflush=false in HBase table - hadoop

I have this code that saves to an HBase table (HTABLE). The expected behavior is that the puts are pushed, or "flushed", to HBase once per partition.
NOTE: This is the updated code
rdd.foreachPartition(p => {
  val table = connection.getTable(TableName.valueOf(HTABLE))
  val mutator = connection.getBufferedMutator(TableName.valueOf(HTABLE))
  p.foreach(row => {
    val hRow = new Put(rowkey)
    hRow.addColumn....
    // use table.exists instead of table.checkAndPut (in favor of BufferedMutator's flushCommits)
    val exists = table.exists(new Get(rowkey))
    if (!exists) {
      hRow.addColumn...
    }
    mutator.mutate(hRow)
  })
  table.close()
  mutator.flush()
  mutator.close()
})
In HBase 1.1, HTable is deprecated and there is no flushCommits() available in org.apache.hadoop.hbase.client.Table.
Replacing Table.put(put) with BufferedMutator.mutate(put) is fine for normal puts, but BufferedMutator has no equivalent of Table's checkAndPut.

In the new API, BufferedMutator is used.
You could change Table t = connection.getTable(TableName.valueOf("foo")) to BufferedMutator t = connection.getBufferedMutator(TableName.valueOf("foo")), and then change t.put(p); to t.mutate(p);
It works for me!
I found very little information about this when I was searching, even in the official documentation. I hope my answer is helpful, and that someone can update the documentation.
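One thing to watch for when combining table.exists with a BufferedMutator, as in the question's code: the mutator holds puts on the client until it is flushed, while exists asks the server, so a row key that appears twice in the same partition can look "new" both times. Below is a toy in-memory model of that interaction in Python. It is not the HBase API; ToyServer, ToyBufferedMutator and write_partition are hypothetical stand-ins, and the column names are made up for illustration.

```python
class ToyServer:
    """Stand-in for the server side: holds only rows that were actually sent."""

    def __init__(self):
        self.rows = {}

    def exists(self, rowkey):
        return rowkey in self.rows


class ToyBufferedMutator:
    """Stand-in for a client-side write buffer (hypothetical, not the HBase class)."""

    def __init__(self, server):
        self.server = server
        self.pending = []

    def mutate(self, rowkey, columns):
        # Nothing is sent yet; the put only sits in the client-side buffer.
        self.pending.append((rowkey, dict(columns)))

    def flush(self):
        # Send every buffered put; later puts to the same row add or overwrite columns.
        for rowkey, columns in self.pending:
            self.server.rows.setdefault(rowkey, {}).update(columns)
        self.pending.clear()


def write_partition(server, rowkeys):
    """Mimic the question's loop: check existence on the server, buffer the put."""
    mutator = ToyBufferedMutator(server)
    treated_as_new = []
    for rowkey in rowkeys:
        columns = {"value": "x"}
        if not server.exists(rowkey):
            # The "only if the row is missing" branch from the question.
            columns["created"] = "yes"
            treated_as_new.append(rowkey)
        mutator.mutate(rowkey, columns)
    mutator.flush()
    return treated_as_new


server = ToyServer()
treated_as_new = write_partition(server, ["r1", "r1", "r2"])
print(treated_as_new)  # ['r1', 'r1', 'r2']
```

In this model the second "r1" is still treated as new because the first put has not been flushed yet. If that matters for your data, the check has to happen against something that already reflects the earlier write, which is what an atomic checkAndPut on Table gives you.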

You need to set autoFlush to false; see section 11.7.4 in http://hbase.apache.org/0.94/book/perf.writing.html

You don't need to do anything, since you do NOT want to buffer puts on the client side. By default, the HBase client does not buffer puts on the client side.
Explicit calls to flushCommits() are only required when the client itself controls when data is sent to the HBase RegionServers.

Related

Check if data already exists before inserting into BigQuery table (using Python)

I am setting up a daily cron job that appends a row to a BigQuery table (using Python); however, duplicate data is being inserted. I have searched online and I know that there is a way to manually remove duplicate data, but I wanted to see if I could avoid this duplication in the first place.
Is there a way to check a BigQuery table to see if a data record already exists first in order to avoid inserting duplicate data? Thanks.
CODE SNIPPET:
import logging
import uuid
import webapp2
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
PROJECT_ID = 'foo'
DATASET_ID = 'bar'
TABLE_ID = 'foo_bar_table'
class UpdateTableHandler(webapp2.RequestHandler):
    def get(self):
        credentials = GoogleCredentials.get_application_default()
        bigquery_service = discovery.build('bigquery', 'v2', credentials=credentials)
        try:
            # Stuff is the asker's datastore model; its definition is not shown.
            the_fruits = Stuff.query(Stuff.fruitTotal >= 5).filter(Stuff.fruitColor == 'orange').fetch()
            for fruit in the_fruits:
                #some code here
                basket = dict()
                basket['id'] = fruit.fruitId
                basket['Total'] = fruit.fruitTotal
                basket['PrimaryVitamin'] = fruit.fruitVitamin
                basket['SafeRaw'] = fruit.fruitEdibleRaw
                basket['Color'] = fruit.fruitColor
                basket['Country'] = fruit.fruitCountry
                body = {
                    'rows': [
                        {
                            'json': basket,
                            # A random insertId is different on every run.
                            'insertId': str(uuid.uuid4())
                        }
                    ]
                }
                response = bigquery_service.tabledata().insertAll(projectId=PROJECT_ID,
                                                                  datasetId=DATASET_ID,
                                                                  tableId=TABLE_ID,
                                                                  body=body).execute(num_retries=5)
                logging.info(response)
        except Exception as e:
            logging.error(e)
app = webapp2.WSGIApplication([
    ('/update_table', UpdateTableHandler),
], debug=True)
The only way to test whether the data already exists is to run a query.
If you have lots of data in the table, that query could be expensive, so in most cases we suggest you go ahead and insert the duplicate, and then merge duplicates later on.
As Zig Mandel suggests in a comment, you can query over a date partition if you know the date when you expect to see the record, but that may still be expensive compared to inserting and removing duplicates.
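To make the two options above concrete, here is a minimal sketch that only builds the query text for each one: an existence check for a single record, and a "keep one row per key" query for cleaning up after the fact. It assumes standard SQL, and it assumes the id column from the question's basket is the key; the function names are made up for illustration, and running the queries (and writing the deduplicated result back) is left to whichever client you use.

```python
def table_ref(project_id, dataset_id, table_id):
    """Return a backtick-quoted standard-SQL table reference."""
    return "`{}.{}.{}`".format(project_id, dataset_id, table_id)


def build_exists_query(project_id, dataset_id, table_id, key_column="id"):
    """Query that counts rows matching a key, passed as the named parameter @id."""
    return "SELECT COUNT(*) AS matches FROM {table} WHERE {key} = @id".format(
        table=table_ref(project_id, dataset_id, table_id), key=key_column
    )


def build_dedup_query(project_id, dataset_id, table_id, key_column="id"):
    """Query that keeps one row per key value (which duplicate survives is arbitrary)."""
    return (
        "SELECT * EXCEPT(row_num) FROM ("
        "SELECT *, ROW_NUMBER() OVER (PARTITION BY {key} ORDER BY {key}) AS row_num "
        "FROM {table}"
        ") WHERE row_num = 1"
    ).format(key=key_column, table=table_ref(project_id, dataset_id, table_id))


exists_sql = build_exists_query("foo", "bar", "foo_bar_table")
dedup_sql = build_dedup_query("foo", "bar", "foo_bar_table")
print(exists_sql)
print(dedup_sql)
```

The existence check scans the table on every insert, which is the cost the answer warns about; the dedup query can instead be run occasionally over the whole table or over a single date partition.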

Spring data Neo4j Affected row count

In a Spring Boot and Neo4j environment using Spring-Data-neo4j-4, I want to perform a delete and get an error message when it fails.
My problem is that since Repository.delete() returns void, I have no idea whether the delete modified anything or not.
First question: is there any way to get the number of rows affected by the last query? For example, in PL/SQL I could use SQL%ROWCOUNT.
Anyway, I tried the following code:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}
private something getExistingsomething(Long somethingId, int depth) {
    return Optional.ofNullable(somethingRepository.findOne(somethingId, depth))
        .orElseThrow(() -> new somethingNotFoundException(somethingId));
}
In the code above I query the database to check that the value exists before I delete it.
Second question: do you recommend a different approach?
Now, just to add some complexity: I have a clustered database where db1 can only create, update and delete, and db2 and db3 can only read (this is ensured by the cluster sockets). db2 and db3 receive the data from db1 through the replication process.
From what I have seen so far, replication can take up to 90s, which means that for up to 90s the databases can be in different states.
Looking again at the code above:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}
In the debugger that means:
getExistingsomething(somethingId).getId() // will hit db2
somethingRepository.delete(...) // will hit db1
So if replication has not yet inserted the value in db2, this code will throw the exception.
Third question: without changing those sockets, is there any way for me to delete and give the correct response?
This is not currently supported in Spring Data Neo4j; if you wish, please open a feature request.
In the meantime, perhaps the easiest workaround is to drop down to the OGM level of abstraction.
Create a class that is injected with org.neo4j.ogm.session.Session
Use the following method on Session
Example: (example is in Kotlin, which was on hand)
fun deleteProfilesByColor(color : String)
{
    var query = """
        MATCH (n:Profile {color: {color}})
        DETACH DELETE n;
        """
    val params = mutableMapOf(
        "color" to color
    )
    val result = session.query(query, params)
    val statistics = result.queryStatistics() //Use these!
}

Wakanda Datastore - Find and Replace?

I've got a lot of values in a legacy Wakanda datastore which I need to update to some new values. Is there a curl-like command in the wakanda data browser page that can be used to do a mass find-and-replace in a table?
If your dataclass is called MyDataClass and the attribute you want to update is myAttribute, you can use the following server-side script:
var newValue = "new value";
ds.MyDataClass.all().forEach(function(entity){
    entity.myAttribute = newValue;
    entity.save();
});
You can also use a transaction if you want to commit or roll back the whole operation.
I don't think there is a way to do a mass find/replace in the Data Browser, but I suggest using a server-side query that finds the records with the value you need to replace, and then looping over that collection to set the new values.
As mentioned in other answers, you are likely best off looping over a collection. Wakanda has no concept of a mass replace like you see in many other databases.
var myCollection = ds.DataClassName.query("attributeName == :1", "valueToFind");
myCollection.forEach(function(e){
    e.attributeName = "newValue";
    e.save();
});
So for a made-up "Person" dataclass it might look like this:
var blankFirsts = ds.Person.query("firstname == :1", "");
blankFirsts.forEach(function(person){
    person.firstname = "no name";
    person.save();
});

How can I change the column name of an existing Class in the Parse.com Web Browser interface?

I couldn't find a way to change the name of a column I just created, either in the browser interface or via an API call. It looks like all object-related API calls manipulate instances, not the class definition itself?
Does anyone know if this is possible without having to delete and re-create the column?
This is how I did it in Python:
import json,httplib,urllib
connection = httplib.HTTPSConnection('api.parse.com', 443)
params = urllib.urlencode({"limit":1000})
connection.connect()
connection.request('GET', '/1/classes/Object?%s' % params, '', {
    "X-Parse-Application-Id": "yourID",
    "X-Parse-REST-API-Key": "yourKey"
})
result = json.loads(connection.getresponse().read())
objects = result['results']
for object in objects:
    connection = httplib.HTTPSConnection('api.parse.com', 443)
    connection.connect()
    objectId = object['objectId']
    objectData = object['data']
    connection.request('PUT', ('/1/classes/Object/%s' % objectId), json.dumps({
        "clonedData": objectData
    }), {
        "X-Parse-Application-Id": "yourID",
        "X-Parse-REST-API-Key": "yourKEY",
        "Content-Type": "application/json"
    })
This is not optimized: you can batch 50 of the updates together at once, but since I was only running it once I didn't do that. Also, since Parse limits a query to 1000 results, you will need to run the load multiple times with a skip parameter, like
params = urllib.urlencode({"limit":1000, "skip":1000})
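The paging and batching mentioned above could be sketched roughly like this. This is written for Python 3 (so urllib.parse instead of the Python 2 urllib above) and only builds the query strings and request bodies; nothing is sent. It assumes you already know roughly how many objects the class holds, and that the updates go through Parse's REST batch endpoint with a "requests" list of method/path/body entries. The helper names page_queries and batch_payloads are made up for illustration.

```python
import json
from urllib.parse import urlencode

PAGE_SIZE = 1000   # maximum number of results per query
BATCH_SIZE = 50    # maximum number of operations per batch request


def page_queries(total_objects, page_size=PAGE_SIZE):
    """Return one 'limit=...&skip=...' query string per page of results."""
    return [
        urlencode({"limit": page_size, "skip": skip})
        for skip in range(0, total_objects, page_size)
    ]


def batch_payloads(objects, class_name="Object", batch_size=BATCH_SIZE):
    """Group the per-object PUTs into JSON bodies of at most batch_size operations."""
    payloads = []
    for start in range(0, len(objects), batch_size):
        chunk = objects[start:start + batch_size]
        requests = [
            {
                "method": "PUT",
                "path": "/1/classes/{}/{}".format(class_name, obj["objectId"]),
                # Copy the old column's value into the new column.
                "body": {"clonedData": obj["data"]},
            }
            for obj in chunk
        ]
        payloads.append(json.dumps({"requests": requests}))
    return payloads


queries = page_queries(2500)
print(queries)  # ['limit=1000&skip=0', 'limit=1000&skip=1000', 'limit=1000&skip=2000']

sample = [{"objectId": "id%d" % i, "data": i} for i in range(120)]
payloads = batch_payloads(sample)
print(len(payloads))  # 3
```

Each query string would replace params in the GET above, and each payload would be sent as the body of one batch request with the same headers as the PUT.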
From this Parse forum answer : https://www.parse.com/questions/how-can-i-rename-a-column
Columns cannot be renamed. This is to avoid breaking an existing app.
If your app is still under development, you can just query for all the
objects in your class and copy the value of the old column to the new
column. The REST API is very useful for this. You may then drop the
old column in the Data Browser.
Hope it helps
Yes, it's not a feature provided by Parse (yet). But there are some third-party API management tools that you can use to rename the fields in the response. One free tool is called apibond.com
It's a workaround, but I hope it helps

sqlmetal.exe run and output generated but how do I query my database?

I have run sqlmetal.exe against my database.
SqlMetal.exe /server:server /database:dbname /code:mapping.cs
I have included the output in my solution, so I can now create an object for each of the database tables. Great. I now wish to use LINQ to query my database. Can I presume that none of the connection handling is done by the output of sqlmetal.exe? If so, how can I use LINQ to query my database?
Does the generated code include a Data Context (a class which inherits from System.Data.Linq.DataContext)? If so, then that's probably what you're looking for. Something like this:
var db = new SomeDataContext();
// You can also specify a connection string manually in the above constructor if you want
var records = db.SomeTable.Where(st => st.id == someValue);
// and so on...
