Recommended spec for an OSM tile server on EC2 - amazon-ec2

I am building an OSM tile server with mod_tile/renderd, and osm2pgsql, as per instructions here: https://switch2osm.org/manually-building-a-tile-server-16-04-2-lts/
With my current spec EC2 server t2.xlarge, Ubuntu 16.04, I can just about work with a country-sized map, although rendering tiles on the fly is still slow, so render_list is needed. I have tried all performance tweaks I could find to speed up rendering, but what I really think I need is a more powerful server, particularly as the eventual aim is a planet sized import. Most server specs I can find for this are very outdated.
Would anybody have recommendations for an EC2 instance (or general cloud server specs) for building a planet sized OSM tile server in 2018?

I found upgrading to a m5.2xlarge server was sufficient to work with the planet database - on my previous 16gb server, I was running out of RAM for many DB tasks. Other important issues to resolve were:
Build spatial indexes on the entire table geometries, which was not done by osm2pgsql in my case. I did already have partial indexes from running openstreetmap-carto/scripts/indexes.py but these were not suitable for my style and not being used, so I needed to create these indexes:
CREATE INDEX planet_osm_polygon_index ON planet_osm_polygon USING GIST(way)
CREATE INDEX planet_osm_line_index ON planet_osm_line USING GIST(way)
Manually set a layer extent in the style xml file (I just used the map extent) - I had omitted it, which means it had to be calculated by a time consuming PostGIS query, see: https://github.com/mapnik/mapnik/wiki/OptimizeRenderingWithPostGIS
Run a VACUUM and ANALYZE
I have now been able to run render_list on zooms 0-11, and the server can generate further zoom levels on the demand without issue.

Related

How do I setup esrally to use with elassandra and my own tests?

I'm wondering whether others have attempt benchmarking Elassandra (more specifically, I'm using express-cassandra) using esrally. I'm hoping to not spend to much more time on esrally if that's not a good solution to test Elassandra.
Reading the documentation, it looks like Rally is capable of starting from scratch: Download Elasticsearch, install the source, build it, run it, connect, create a full schema, then start testing with data filling the schema (possibly done with some random data), do queries, ...
I already have everything in place and the only thing I really want to see a few things such as:
Which of 10 different memory setup is faster.
Which type of searches work, whether option 1, 2 and 3 from my existing software create drastic slow downs or not...
Whether insertion while doing searches have a effects on the speed of my searches.
I'm not going to change many parameters other than the memory (-Xmx, -Xms, maybe some others... like cached row in a separate heap.) For sure, I want to run all the tests with the latest Elassandra and not consider rebuilding or anything of the sort.
From reading the documentation, there is no mention of Elassandra. I found total of TWO PAGES in Google about testing with esrally and Elassandra and that did not boost my confidence that it's doable...
I would imagine that I have to use the benchmark-only pipeline. That at least removes all the gathering of the source, building, etc. I guess it also reduces the number of parameters I get in the resulting benchmark, but I don't need all the details...
Have you had any experience with such a setup? (Elassandra + esrally)
Yes, esrally works with Elassandra by using the --benchmark-only option.
To automate the creation of elassandra clusters to benchmark, you could either use ecm or k8s helm chart.
For instance, using ccm :
ecm create bench_cluster -v 6.2.3.10 -n 3 -s -e
esrally --pipeline=benchmark-only --target hosts=127.0.0.1:9200,127.0.0.2:9200,127.0.0.3:9200
ecm remove bench_cluster
For testing specific scenarios, you can write custom tracks.

Dynamically loaded Markers: DDOS prevention

My app shows a map where locations (or Markers) are dynamically loaded via an ajax (and database) request after every map Bounds changes.
I'm convinced that this solution is not scalable : at the moment, Europe area shows a total of 10 markers.
If the database grows and I display for instance 1000 locations, that means 1000 rows would be returned to the user.
This is not a JS / UI since I use the MarkerCluster plugin and I avoid the redraw of loaded locations's markers.
I made some tweaks :
- Delay the Ajax request thanks to an Idle gmaps event
- Increase the minimal zoom level, so the entire world can't be displayed.
But this is not enough.
There are lots of ways to approach this but I will just put here the two I think are most appropriate from your question.
First is to really control from your web app what information is asked for and when. You could write this all yourself in javascript and implement caching techniques ect. There are a number of libraries out there that do most of this work for you though.
I would recommend one of the following:
OpenGeo SDK
OpenLayers
GeoExt
Leaflet
All of these have ways of controlling local caching, when to get the data and what data is gathered from the server. Most of them can also be extended to add any functionality that is missing. The top two I know support google maps (as well as a number of others) as well.
If you need to add even more control over your data locally you could even look at implementing something like PouchDB. I think this is more suited to mobile applications or instances where the network connection is either really slow or intermittent.
This sort of solution should be able to easily handle 1000's to 10000's of features with 100's of users.
If you are really going to scale up to 100000's to 1000000's of features with 100's to 1000's of users then I would suggest adding a tile server to the soloution above. The tile server will sit between your web application and your data base. Most of them have lots of caching settings and optimistions for dealing with large datasets and pushing them out to a client. Because they push out tiles rather than features the data output remains reasonably constant even as the number of features grow. The OpenGeo SDK and Openlayers libraries I mentioned above can work really well with any of the following tile servers:
GeoServer
Mapserver
MapGuide
Quantum GIS Server
If you are reluctant to do any coding there are some offers that work out of the box for enterprise environments. They are all expensive and from your question I think they are probably not what you are looking for.

Store Images to display in SOLR search results

I have built a SOLR Index which has the image thumbnail urls that I want to render an image along with the search results. The problem is that those images can run into millions and I think storing the images in index as binary data would make the index humongous.
I am seeking guidance on how to efficiently store those images after rendering them from the URLs , should I use the plain file system and have them rendered by tomcat , or should I use a JCR repository like Apache Jackrabbit ?
Any guidance would be greatly appreciated.
Thank You.
I would evaluate the effective requiriments before finally deciding how to persist the images.
Do you require versioning?
Are you planning to stir eonly the images or additional metadata?
Do you have any requirements in horizontal scaling?
Do you require any image processing or scaling?
Do you need access to the image metatdata?
Do you require additional tooling for managing the images?
Are you willing to invest time in learning an additional technology?
Storing on the file system and making them available by an image sppoler implementation is the most simple way to persist your images.
But if you identify some of the above mentioned requirements (which are typical for a content repo or a dam system), then would end up reinventing the wheel with the filesystem approach.
The other option is using a kind of content repository. A JCR repo like for example Jackrabbit or it's commercial implementation CRX is one option. Alfresco (supports CMIS) would be the another valid.
Features like versioning, post processing (scaling ...), metadata extraction and management belong are supported by both mentioned repository solutions. But this requires you to learn a new technology which can be time consuming. Both mentioned repository technologies can get complex.
If horizontal scaling is a requirement I would consider a commercially supported repository implementations (CRX or Alfresco Enterprise) because the communty releases are lacking this functionality.
Me personally I would really depend any decision on the above mentioned requirements.
I extensively worked with Jackrabbit, CRX and Alfresco CE and EE and personally I would go for the Alfresco as I experienced it to scale better with larger amounts of data.
I'm not aware of a image pooling solution that fits your needs exactly but it shouldn't be to difficult to implement that, apart from the fact that recurring scaling operations may be very resource intensive.
I would go for the following approach if FS is enough for you:
Separate images and thumbnail into two locations.
The images root folder will remain, the thumbnails folder is
temporary.
Create a temporary thumbnail folder for each indexing run.
All thumbnails for that run are stored under that location, scaling
can be achived with i.e ImageMagick.
The temporary thumbnail folder can then easily be dropped as soon as
the next run has been completed.
If you are planning to store millions of images then avoid putting all files in the same directory. Browsing flat hierarchies with two many entries will be a nightmare.
Better create a tree structure by i.e. inverting the current datetime (year/month/day/hour/minute ... 2013/06/01/08/45).
This makes sure that the number of files inside the last folder get's not too big (Alfresco is using the same pattern for storing binary objects on the FS and it has proofen to work nicely).

How to access internal dashboards in Cube?

I am currently playing around with Cube . I have been able to send events to an evaluator and can query these via HTTP GET.
How do I go about making a customized dashboard to visualize the events/queries? I saw this website http://corner.squareup.com/2011/09/cube.html has a video for creating a "dashboard in 60 seconds" but I do not know where to find this.
Great question.
The last version of Cube on GitHub to have the fat, non-static, non-Cubism client side from the video was
e51cd376.
If you have a clone of the current repository use:
git reset --hard e51cd376
Then start your mongodb, and node evaluator.
The mongodb database format does not seem to have substantially changed.
You could: at least to some extent, use the current collector, dump the database into a duplicated db, or maybe run the two separate versions of cube evaluators side-by-side, depending on your dashboarding needs.
Uglier but a bit quicker for mashing around with the queries.

Rails, how to migrate large amount of data?

I have a Rails 3 app running an older version of Spree (an open source shopping cart). I am in the process of updating it to the latest version. This requires me to run numerous migrations on the database to be compatible with the latest version. However the apps current database is roughly around 300mb and to run the migrations on my local machine (mac os x 10.7, 4gb ram, 2.4GHz Core 2 Duo) takes over three days to complete.
I was able to decrease this time to only 16 hours using an Amazon EC2 instance (High-I/O On-Demand Instances, Quadruple Extra Large). But 16 hours is still too long as I will have to take down the site to perform this update.
Does anyone have any other suggestions to lower this time? Or any tips to increase the performance of the migrations?
FYI: using Ruby 1.9.2, and Ubuntu on the Amazon instance.
Dropping indices beforehand and adding them again afterwards is a good idea.
Also replacing .where(...).each with .find_each and perhaps adding transactions could help, as already mentioned.
Replace .save! with .save(:validate => false), because during the migrations you are not getting random inputs from users, you should be making known-good updates, and validations account for much of the execution time. Or using .update_attribute would also skip validations where you're only updating one field.
Where possible, use fewer AR objects in a loop. Instantiating and later garbage collecting them takes CPU time and uses more memory.
Maybe you have already considered this:
Tell the db not to bother making sure everything is on disk (no WAL, no fsync, etc), you now have an in memory db which should make a very big difference. (Since you have taken the db offline you can just restore from a backup in the unlikely event of power loss or similar). Turn fsync/WAL on when you are done.
It is likely that you can do some of the migrations before you take the db offline. Test this in staging env of course. That big user migration might very well be possible to do live. Make sure that you don't do it in a transaction, you might need to modify them a bit.
I'm not familiar with your exact situation but I'm sure there are even more things you can do unless this isn't enough.
This answer is more about approach than a specific technical solution. If your main criteria is minimum downtime (and data-integrity of course) then the best strategy for this is to not use rails!
Instead you can do all the heavy work up-front and leave just the critical "real time" data migration (i'm using "migration" in the non-rails sense here) as a step during the switchover.
So you have your current app with its db schema and the production data. You also (presumably) have a development version of the app based on the upgraded spree gems with the new db schema but no data. All you have to do is figure out a way of transforming the data between the two. This can be done in a number of ways, for example using pure SQL and temporary tables where necessary or using SQL and ruby to generate insert statements. These steps can be split up so that data that is fairly "static" (reference tables, products, etc) can be loaded into the db ahead of time and the data that changes more frequently (users, sesssions, orders, etc) can be done during the migration step.
You should be able to script this export-transform-import procedure so that it is repeatable and have tests/checks after it's complete to ensure data integrity. If you can arrange access to the new production database during the switchover then it should be easy to run the script against it. If you're restricted to a release process (eg webistrano) then you might have to shoe-horn it into a rails migration but you can run raw SQL using execute.
Take a look at this gem.
https://github.com/zdennis/activerecord-import/
data = []
data << Order.new(:order_info => 'test order')
Order.import data
Unfortunaltly the downrated solution is the only one. What is really slow in rails are the activerecord models. The are not suited for tasks like this.
If you want a fast migration you will have to do it in sql.
There is an other approach. But you will always have to rewrite most of the migrations...

Resources