Lead and lag in Hbase - hadoop

I'm trying to figure out how to do the equivalent of Oracle's LEAD and LAG in Hbase or some other sort of pattern that will solve my problem. I could write a MapReduce program that does this quite easily, but I'd love to be able to exploit the fact that the data is already sorted in the way I need it to be.
My problem is as follows: I have a rowkey and a value that looks like:
(employee name + timestamp) => data:salary
So, some example data might be:
miller, bob;2010-01-14 => data:salary=90000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2011-12-03 => data:salary=107000
monty, fred;2010-04-10 => data:salary=19000
monty, fred;2011-09-09 => data:salary=24000
What I want to do is calculate the changes of salary, record by record. I want to transform the above data into differences between records:
miller, bob;2010-01-14 => data:salarydiff=90000
miller, bob;2010-11-04 => data:salarydiff=12000
miller, bob;2011-12-03 => data:salarydiff=5000
monty, fred;2010-04-10 => data:salarydiff=19000
monty, fred;2011-09-09 => data:salarydiff=5000
I'm up for changing the rowkey strategy if necessary.

What I'd do is change the key so that the timestamp will be descending (newer salary first)
miller, bob;2011-12-03 => data:salary=107000
miller, bob;2010-11-04 => data:salary=102000
miller, bob;2010-01-14 => data:salary=90000
Now you can do a simple map job that will scan the table. Then in the map you create a new Scan to the current key. Scan.next to get the previous salary, calculate the diff and store it in a new column on the current row key
Basically in your mapper class (the one that inherits TableMapper) you override the setup method and get the configuration
#Override
protected void setup(Mapper.Context context) throws IOException,InterruptedException {
Configuration config = context.getConfiguration();
table = new HTable(config,<Table Name>);
}
Then inside the map you extract the row key from the row parmeter, create the new Scan and continue as explained above
In most cases the next record would be in the same region - occasionally it might go to another regionserver

Related

Problem with adding record to database after seeds in Laravel

I add test records to the database using seeds
public function run()
{
DB::table('categories')->insert([
['id' => 1,'name' => 'Select a category', 'slug' => null],
['id' => 2,'name' => 'Computers', 'slug' => 'computer-&-office'],
]);
}
But then, if I want to add a new record to the database, already through the form, I get the error
SQLSTATE[23505]: Unique violation: 7 ERROR: duplicate key value violates unique constraint "categories_pkey"
I understand that when I add a new record through the form, it is created with id = 1, and I already have this id in the database. How can I avoid this error?
You should remove id from insert() and make it auto increment in mysql,
It complains about a unique constraint, meaning your primary key is indexed as "categories_pkey" or you have another field that is unique.
This happens because you are inserting a record and a record already exists where that column must be unique.
In general production workflow, when you add a record you never specify an ID. Most cases (there are exceptions) ID is an autoincrement integer, meaning it adds up automatically. On the first insert the database set its ID to 1, the second to 2 and so on.
As a seeder, its generally a good idea to set up the ID so you know that a certain ID matches a certain item (as a base content of a project like user states or roles).
As a regular workflow (from a form submission), you can have something like this
DB::table('categories')->insert([
['name' => 'some value', 'slug' => 'some slug']
]);
However, I don't advise to use DB::table when Laravel provides ActiveRecords pattern (ORM, called Eloquent) which you should take a look here.
https://laravel.com/docs/8.x/eloquent#introduction
Besides the benefits of layer abstraction and working with activerecords, It produces a much cleaner code like
$data = ['slug' => 'some slug', 'name' => 'some name'];
Category::create($data);

Bulk Insert into Mongo - Ruby

I am new to Ruby and Mongo and am working with twitter data. I'm using Ruby 1.9.3 and Mongo gems.
I am querying bulk data out of Mongo, filtering out some documents, processing the remaining documents (inserting new fields) and then writing new documents into Mongo.
The code below is working but runs relatively slow as I loop through using .each and then insert new documents into Mongo one at a time.
My Question: How can this be structured to process and insert in bulk?
cursor = raw.find({'user.screen_name' => users[cur], 'entities.urls' => []},{:fields => params})
cursor.each do |r|
if r['lang'] == "en"
score = r['retweet_count'] + r['favorite_count']
timestamp = Time.now.strftime("%d/%m/%Y %H:%M")
#Commit to Mongo
#document = {:id => r['id'],
:id_str => r['id_str'],
:retweet_count => r['retweet_count'],
:favorite_count => r['favorite_count'],
:score => score,
:created_at => r['created_at'],
:timestamp => timestamp,
:user => [{:id => r['user']['id'],
:id_str => r['user']['id_str'],
:screen_name => r['user']['screen_name'],
}
]
}
#collection.save(#document)
end #end.if
end #end.each
Any help is greatly appreciated.
In your case there is no way to make this much faster. One thing you could do is retrieve the documents in bulks, processing them and the reinserting them in bulks, but it would still be slow.
To speed this up you need to do all the processing server side, where the data already exist.
You should either use the aggregate framework of mongodb if the result document does not exceed 16mb or for more flexibility but slower execution (much faster than the potential your solution has) you can use the MapReduce framework of mongodb
What exactly are you doing? Why not going pure ruby or pure mongo (well that's ruby too) ? and Why do you really need to load every single attribute?
What I've understood from your code is you actually create a completely new document, and I think that's wrong.
You can do that with this in ruby side:
cursor = YourModel.find(params)
cursor.each do |r|
if r.lang == "en"
r.score = r.retweet_count + r.favorite_count
r.timestamp = Time.now.strftime("%d/%m/%Y %H:%M")
r.save
end #end.if
end #end.each
And ofcourse you can import include Mongoid::Timestamps in your model and it handles your created_at, and updated_at attribute (it creates them itself)
in mongoid it's a little harder
first you get your collection with use my_db then the next code will generate what you want
db.models.find({something: your_param}).forEach(function(doc){
doc.score = doc.retweet_count + doc.favorite_count
doc.timestamp = new Timestamp()
db.models.save(doc)
}
);
I don't know what was your parameters, but it's easy to create them, and also mongoid really do lazy loading, so if you don't try to use an attribute, it won't load that. You can actually save a lot of time not using every attribute.
And these methods, change the existing document, and won't create another one.

complex projection from many-to-many tables, DbContext and entity framework and Linq

I have a DbContext with these four many-to-many related entities:
Classes <-> Students
Classes <-> Assignments
Classes <-> Contents
Classes <-> Announcements
Now I need a Linq code (or sth. better!) which will give us last 3 Assignments, last 3 Contents, last 3 Announcements of each Class where the student with StudentId = X is in those classes.
In another hand, the student logged in to website and we wanna show him/her last Assignments, Contents, Announcements of each Classes which he/she is study in it.
this code is not correct but may help u understand my need. Also this code takes many times to run (50ms isn't many?):
Edit: Code was almost correct so moved to answer , look at accepted answer. Any other (better, faster) solutions appreciated.
Thanks in advance.
You better start the query with student:
from s in db.Students
where s.StudentId == CurrentUser
from c in s.Classes
from as in c.Assignments.OrderByDescending(As => As.Id).Take(3)
from co in c.Contents.OrderByDescending(Co => Co.Id).Take(3)
from an in c.Announcements.OrderByDescending(An => An.Id).Take(3)
select new { <selected properties> }
The last part (selected properties) is important. By selecting only a subset of properties you narrow down the result set from the database. Without this, you create a very wide and long result set because of the large number of joins in the (SQL) query.
Based on #Gert Arnold recommendation I had changed a little my code:
start the query with student
narrow down the result by changing selected properties
db.Students.Where(st => st.StudentId.Equals(CurrentUser)).SelectMany(S => S.Classes, (S, C) => new
{
Name = C.Name,
Assignments = C.Assignments.Select(AS => new { AS.Id, AS.Name }).OrderByDescending(As => As.Id).Take(3),
Contents = C.Contents.Select(Co => new { Co.Id, Co.Title }).OrderByDescending(Co => Co.Id).Take(3),
Announcements = C.Announcements.Select(An => new { An.Id, An.Title }).OrderByDescending(An => An.Id).Take(3)
});

Sequel Migration update with a row's ID

How can you run a Sequel migration that updates a newly added column with a value from the row?
The Sequel documentation shows how you can update the column with a static value:
self[:artists].update(:location=>'Sacramento')
What I need to do is update the new column with the value of the ID column:
something like:
self[:artists].each |artist| do
artist.update(:location => artist[:id])
end
But the above doesn't work and I have been unable to figure out how to get it to go.
Thanks!
artist in your loop is a Hash, so you are calling Hash#update, which just updates the Hash instance, it doesn't modify the database. That's why your loop doesn't appear to do anything.
I could explain how to make the loop work (using all instead of each and updating a dataset restricted to the matching primary key value), but since you are just assigning the value of one column to the value of another column for all rows, you can just do:
self[:artists].update(:location=>:id)
if you need update all rows of a table, because it is a new column that need be populate
artists = DB[:artists]
artists.update(:column_name => 'new value')
or if you need, update only a unique row into your migration file you can:
artists = DB[:artists]
artists.where(:id => 1).update(:column_name1 => 'new value1', :column_name2 => "other")

Custom Magento Report for Taxable/Non-Taxable sales

Let me preface by saying I'm new to Magento as well as Data Collections in general (only recently begun working with OOP/frameworks).
I've followed the excellent tutorial here and I'm familiar with Alan Storm's overviews on the subject. My aim is to create a custom Magento report which, given a start/end date, will return the following totals:
Taxable Net (SUM subtotal for orders with tax)
Non-Taxable Net (SUM subtotal for orders without tax)
*Total Gross Sales (Grand total)
*Total Net Sales (Grand subtotal)
*Total Shipping
*Total Tax
*For these figures, I realize they are available in existing separate reports or can be manually calculated from them, however the purpose of this report is to give our store owner a single page to visit and file to export to send to his accountant for tax purposes.
I have the basic report structure already in place in Adminhtml including the date range, and I'm confident I can include additional filters if needed for order status/etc. Now I just need to pull the correct Data collection and figure out how to retrieve the relevant data.
My trouble is I can't make heads or tails of how the orders data is stored, what Joins are necessary (if any), how to manipulate the data once I have it, or how they interface with the Grid I've set up. The existing tutorials on the subject that I've found are all specifically dealing with product reports, as opposed to the aggregate sales data I need.
Many thanks in advance if anyone can point me in the right direction to a resource that can help me understand how to work with Magento sales data, or offer any other insight.
I have been working on something extremely similar and I used that tutorial as my base.
Expanding Orders Join Inner
Most of the order information you need is located in sales_flat_order with relates to $this->getTable('sales/order')
This actually already exists in her code but the array is empty so you need to populate it with the fields you want, here for example is mine:
->joinInner(
array('order' => $this->getTable('sales/order')),
implode(' AND ', $orderJoinCondition),
array(
'order_id' => 'order.entity_id',
'store_id' => 'order.store_id',
'currency_code' => 'order.order_currency_code',
'state' => 'order.state',
'status' => 'order.status',
'shipping_amount' => 'order.shipping_amount',
'shipping_tax_amount' => 'order.shipping_tax_amount',
'shipping_incl_tax' => 'base_shipping_incl_tax',
'subtotal' => 'order.subtotal',
'subtotal_incl_tax' => 'order.subtotal_incl_tax',
'total_item_count' => 'order.total_item_count',
'created_at' => 'order.created_at',
'updated_at' => 'order.updated_at'
))
To find the fields just desc sales_flat_order in mysql.
Adding additional Join Left
Ok so if you want information from other tables you need to add an ->joinLeft() for example I needed the shipment tracking number:
Create the Join condition:
$shipmentJoinCondition = array(
$orderTableAliasName . '.entity_id = shipment.order_id'
);
Perform the join left:
->joinLeft(
array('shipment' => $this->getTable('sales/shipment_track')),
implode(' AND ', $shipmentJoinCondition),
array(
'track_number' => 'shipment.track_number'
)
)
Sorry I couldn't go into more depth just dropping the snippet for you here.
Performing Calculations
To modify the data returned to the grid you have to change addItem(Varien_Object $item) in your model, basically whatever is returned from here get put in the grid, and well I am not 100% sure how it works and it seems a bit magical to me.
Ok first things first $item is an object, whatever you do to this object will stay with the object (sorry terrible explanation): Example, I wanted to return each order on a separate line and for each have (1/3, 2/3, 3/3), any changes I made would happen globally to the order object so they would all show (3/3). So keep this in mind, if funky stuff starts happening use PHP Clone.
$item_array = clone $item;
So now onto your logic, you can add any key you want to the array and it will be accessible in Grid.php
For example(bad since subtotal_incl_tax exists) :
$item_array['my_taxable_net_calc'] = $item['sub_total'] + $item['tax'];
Then at the end do:
$this->_items[] = $item_array;
return $this->_items;
You can also add more rows based on the existing by just adding more data to $this->_items[];
$this->_items[] = $item_array;
$this->_items[] = $item_array;
return $this->_items;
Would return same item on two lines.
Sorry I have started to lose the plot, if something doesn't make sense just ask, hope this helped.
Oh and to add to Block/Adminhtml/namespace/Grid.php
$this->addColumn('my_taxable_net_calc', array(
'header' => Mage::helper('report')->__('Taxable Net'),
'sortable' => false,
'filter' => false,
'index' => 'my_taxable_net_calc'
));

Resources