Documents not inserting into MongoDB? - ruby

I've set up a MongoDB instance on an Ubuntu AWS machine. I also have something like 920 files ranging in size from 5 MB to 2 GB or so.
Once each unzipped text file has been run through uniq, I run the following script to insert them into the DB:
require 'mongo'
require 'bson'

Mongo::Logger.logger.level = ::Logger::FATAL

puts "Working..."
db = Mongo::Client.new([ 'localhost:27017' ], :database => 'supers')
coll = db[:hashes]

# suppressors = File.open('_combined.txt')
suppressors = Dir['./_uniqued_*.txt']
count = suppressors.count
puts "Found #{count}"

suppressors.each_with_index do |fileroute, i|
  suppressor = File.open(fileroute, 'r')
  percentage = ((i+1) / count.to_f * 100).round(2)
  puts "Working on `#{fileroute}` (#{i+1}/#{count} - #{percentage})"
  c = 0
  suppressor.each_line do |hash|
    c += 1
    coll.update_one({ :_id => hash }, { :$inc => { :count => 1 } }, { upsert: true })
    puts "Processed 50k records for #{fileroute}" if c % 50_000 == 0
  end
end
The idea is, if the record already exists, the $inc will set the count to 2 or 3 so I'll be able to find all the duplicates by running a query on the DB later.
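Later, a query along these lines with the same driver should surface the duplicates (just a sketch; the count field is what the $inc above writes):

dupes = coll.find({ :count => { '$gt' => 1 } })
dupes.each { |doc| puts doc['_id'] }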
I connected to the instance via RoboMongo and at first every time I refreshed the following query:
db.getCollection('hashes').count({})
I'd see that it was filling up the DB very quickly. There are lots of files, but I figured I'd leave it overnight.
However, after some time the result got stuck at 3788104. I was worried there was some hard size limit (df says I'm only using 35% of the HDD space).
Is there something in the config file which automatically limits the amount of records which can be inserted or something?
PS: is it just me or is either upsert or .each_line incredibly slow?

MongoDB's update model is based on write concerns, meaning that calling updateOne alone is no guarantee of success.
If the MongoDB version is at least 2.6, updateOne will return a document with information about any errors. If the MongoDB version is older, an explicit call to the getLastError command will return the document with possible errors.
If the database does not contain all of the desired documents, it is likely that this returned document will contain errors.
In both cases, the write concern can be adjusted to the desired level, i.e., it controls how many mongod instances must have acknowledged the change for it to be considered a success.
(Note: I am not familiar with the Ruby driver, this is assuming it behaves like the shell).
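For illustration, a rough Ruby sketch (untested) of checking each write's outcome; the option and method names are assumed from the 2.x Ruby driver and may differ in other versions:

client = Mongo::Client.new([ 'localhost:27017' ],
                           :database => 'supers',
                           :write    => { :w => 1 })  # acknowledged writes
coll = client[:hashes]

result = coll.update_one({ :_id => 'somehash' },
                         { '$inc' => { :count => 1 } },
                         { upsert: true })

puts result.acknowledged?   # false means the server never confirmed the write
puts result.modified_count  # 1 when an existing document was incremented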

Related

How to handle simple geospatial queries (distances) in ruby with activerecord-postgis-adapter?

I've got a lot of lat/lon points in a CSV file, and I've created a table which has a point column in the 4326 projection (table postcode, field location).
I'm building data like this:-
factory = ::RGeo::Cartesian.preferred_factory(:has_z_coordinate => false)
p = factory.point(data_hash[:latitude], data_hash[:longitude])
and storing p in the location field.
The issue then is that I want to find "near" records to a given point.
I've seen some promising code at:-
https://github.com/rgeo/activerecord-postgis-adapter/blob/master/test/spatial_queries_test.rb
so I wrote the following:-
factory = ::RGeo::Cartesian.preferred_factory(:has_z_coordinate => false)
p = factory.point(53.7492, 1.6023)
res = Postcode.where(Postcode.arel_table[:location].st_distance(p).lt(1000));
res.each do |single|
  puts single.postcode
end
But I'm getting exceptions (unsupported: RGeo::Cartesian::PointImpl)
I assume I'm needing to do some converting or something, any pointers appreciated!
I think your problem lies in the factory you use. Try generating the point from a spherical factory:
p = RGeo::Geographic.spherical_factory(srid: 4326).point(53.7492, 1.6023)
Also check the Rails logs to see the generated query and run it manually in Postgres. Make sure that the query runs without problems.
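Putting that together with the query from the question, a minimal sketch (untested; it keeps the coordinate order used above, though RGeo's point(x, y) takes longitude first, so double-check which value is which):

factory = RGeo::Geographic.spherical_factory(srid: 4326)
p = factory.point(53.7492, 1.6023)

res = Postcode.where(Postcode.arel_table[:location].st_distance(p).lt(1000))
res.each do |single|
  puts single.postcode
end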

Getting the objects with similar secondary index in Riak?

Is there a way to get all the objects, in key/value format, that share the same secondary index value? I know we can get the list of keys for one secondary index (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the full objects as well. I don't want to perform a separate query for each key to get the object details if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is of novice level.
In Riak, it's sometimes the case that the better approach is to do a separate lookup for each key. Coming from other databases this seems strange, and likely inefficient; however, you may find that an index query plus a bunch of single-object gets is faster than a map/reduce that returns all the objects in a single go.
Try both of these approaches and see which turns out fastest for your dataset. The variables that affect this are: the size of the data being queried, the size of each document, the power of your cluster, the load the cluster is under, etc.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        return [v.key];
    }"""
)
results = query.run()

bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # .. do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj,
        } ];
    }"""
)
results = query.run()

Bulk Insert into Mongo - Ruby

I am new to Ruby and Mongo and am working with twitter data. I'm using Ruby 1.9.3 and Mongo gems.
I am querying bulk data out of Mongo, filtering out some documents, processing the remaining documents (inserting new fields) and then writing new documents into Mongo.
The code below is working but runs relatively slowly, as I loop through using .each and then insert new documents into Mongo one at a time.
My Question: How can this be structured to process and insert in bulk?
cursor = raw.find({'user.screen_name' => users[cur], 'entities.urls' => []}, {:fields => params})

cursor.each do |r|
  if r['lang'] == "en"
    score = r['retweet_count'] + r['favorite_count']
    timestamp = Time.now.strftime("%d/%m/%Y %H:%M")

    # Commit to Mongo
    @document = {:id             => r['id'],
                 :id_str         => r['id_str'],
                 :retweet_count  => r['retweet_count'],
                 :favorite_count => r['favorite_count'],
                 :score          => score,
                 :created_at     => r['created_at'],
                 :timestamp      => timestamp,
                 :user           => [{:id          => r['user']['id'],
                                      :id_str      => r['user']['id_str'],
                                      :screen_name => r['user']['screen_name'],
                                     }
                                    ]
                }
    @collection.save(@document)
  end #end.if
end #end.each
Any help is greatly appreciated.
In your case there is no way to make this much faster. One thing you could do is retrieve the documents in bulk, process them, and then reinsert them in bulk, but it would still be slow.
To speed this up you need to do all the processing server side, where the data already exists.
You should either use MongoDB's aggregation framework, if the result document does not exceed 16 MB, or, for more flexibility but slower execution (still much faster than your current approach can achieve), MongoDB's MapReduce framework.
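For illustration, a rough sketch (untested) of the "retrieve in bulk, reinsert in bulk" idea above; it only reduces round trips, so it is still slower than doing the work server side. insert_many is assumed from the 2.x Ruby driver (the legacy 1.x driver's insert also accepts an array of documents):

batch = []
cursor.each do |r|
  next unless r['lang'] == "en"
  batch << { :id    => r['id'],
             # ... remaining fields built as in the question ...
             :score => r['retweet_count'] + r['favorite_count'] }
  if batch.size >= 1_000
    @collection.insert_many(batch)  # one round trip per 1,000 documents
    batch.clear
  end
end
@collection.insert_many(batch) unless batch.empty?  # flush the remainder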
What exactly are you doing? Why not go pure Ruby or pure Mongo (well, that's Ruby too)? And why do you really need to load every single attribute?
What I've understood from your code is that you actually create a completely new document, and I think that's wrong.
You can do it like this on the Ruby side:
cursor = YourModel.find(params)

cursor.each do |r|
  if r.lang == "en"
    r.score     = r.retweet_count + r.favorite_count
    r.timestamp = Time.now.strftime("%d/%m/%Y %H:%M")
    r.save
  end #end.if
end #end.each
And of course you can include Mongoid::Timestamps in your model; it handles the created_at and updated_at attributes for you (it creates them itself).
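A tiny model sketch of that suggestion (the class and field names are just assumptions for illustration):

class YourModel
  include Mongoid::Document
  include Mongoid::Timestamps  # adds created_at and updated_at automatically

  field :lang,           type: String
  field :retweet_count,  type: Integer
  field :favorite_count, type: Integer
  field :score,          type: Integer
end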
In the mongo shell it's a little harder: first you select your database with use my_db, then the following code will do what you want:
db.models.find({something: your_param}).forEach(function(doc){
    doc.score = doc.retweet_count + doc.favorite_count
    doc.timestamp = new Timestamp()
    db.models.save(doc)
  }
);
I don't know what your parameters are, but they're easy to create. Also, Mongoid really does lazy loading, so if you don't touch an attribute it won't load it; you can actually save a lot of time by not using every attribute.
And these methods change the existing document rather than creating another one.

SQLAlchemy - when I iterate on a query, do I get a list or an iterator?

I'm starting to learn how to use SQLAlchemy and I'm running into some efficiency problems.
I created an object mapping an existing big table on our Oracle database:
from sqlalchemy import create_engine, MetaData, Table
from sqlalchemy.orm import mapper, sessionmaker

engine = create_engine(connectionString, echo=False)

class POI(object):
    def __repr__(self):
        return "{poi_id} - {title}, {city} - {uf}".format(**self.__dict__)

def loadSession():
    metadata = MetaData(engine)
    _poi = Table('tbl_ourpois', metadata, autoload=True)
    mapper(POI, _poi)
    Session = sessionmaker(bind=engine)
    session = Session()
    return session
This table has millions of records. When I do a simple query and try to iterate over it:
session = loadSession()
for poi in session.query(POI):
    print poi
I noticed two things: (1) it takes a few minutes before it starts printing objects on the screen, and (2) memory usage grows like crazy. So my conclusion was that this code fetches the whole result set into a list and then iterates over it. Is this correct?
With cx_Oracle, when I do a query like:
conn = cx_Oracle.connect(connectionString)
cursor = conn.cursor()
cursor.execute("select * from tbl_ourpois")

for poi in cursor:
    print poi
the resulting cursor behaves as an iterator that fetches results into a buffer and returns them as they are needed, instead of loading the whole thing into a list. This loop starts printing results almost instantly, and memory usage is low and constant.
Can I get this kind of behavior with SQLAlchemy? Is there a way to get a constant-memory iterator out of session.query(POI) instead of a list?

Ruby: DataMapper and has n, :xyz, :through => Resource

I've encountered the following issue:
There are 2 models, X and Y, associated with each other like this: has n, :<name>, :through => Resource. When I do something like x.ys = array_with_500_ys it takes a really long time, because DataMapper inserts only one association per query (insert into xs_ys(x_id, y_id) values(xid, yid)).
The question is: how to make this faster?
Thanks.
Because DataMapper has abstracted the 'back end', the standard behaviour is to insert one record at a time as SQL (or whatever you are using).
Assuming you are using an SQL backend, such as Postgres, you could drop back to raw SQL, and do the following:
x = X.first
query = "INSERT INTO xs_ys(x_id, y_id) VALUES"
vals = []
array_with_500_ys.each do |y|
  vals << "(#{x.id}, #{y.id})"
end

repository.adapter.execute(query + vals.join(','))
This creates one 'insert', passing all records to be inserted. Not sure if this would be any faster, but you could put it into a background job if you need the app not to time out for the user.
