I am currently building a backend that uses a neural network.
I need to save the weights (which may be 2-, 3-, or 4-dimensional) and restore them.
I am currently using Heroku, and thus need to save them either to PostgreSQL or to an S3 bucket and retrieve them every time the system boots.
What is the go-to solution for storing and restoring weights for ML applications in production, where the weights may contain several hundred thousand entries and a matrix can be well over 100 MB?
We store neural network weights on S3 and it works pretty well.
We also update those weights as we train on new data and upload the new versions to S3. The live system checks for a modified S3 object and updates the weights / neural network based on the stored data.
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectHEAD.html
aws s3api head-object --bucket my-bucket --key object.h5
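For illustration, here is a minimal Python sketch of that check using boto3; the bucket name, object key, and load_weights callback are placeholders, not part of the original setup:

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "object.h5"  # hypothetical names, matching the CLI example above

_last_etag = None

def refresh_weights_if_changed(load_weights):
    """Download the weights file only when the S3 object has changed."""
    global _last_etag
    head = s3.head_object(Bucket=BUCKET, Key=KEY)  # same call as `aws s3api head-object`
    if head["ETag"] != _last_etag:
        s3.download_file(BUCKET, KEY, "/tmp/weights.h5")
        load_weights("/tmp/weights.h5")  # your own model-loading function
        _last_etag = head["ETag"]
```

The check can run on boot and then periodically, so the live system picks up newly trained weights without a redeploy.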
You can also enable the transfer acceleration endpoint for speed. This helps if you are downloading the object over the public internet.
https://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html
Hope it helps.
Just for completeness: using Postgres and the Postgres ARRAY data type, it is also possible to store large matrices for production.
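A rough sketch of that approach with psycopg2, assuming a hypothetical table weights(layer_name text primary key, shape int[], data double precision[]) so each n-dimensional array is stored flattened alongside its shape:

```python
import numpy as np
import psycopg2

conn = psycopg2.connect("postgresql://user:pass@host/db")  # e.g. Heroku's DATABASE_URL

def save_layer(name, weights):
    """Store an n-dimensional weight array as a flat float8[] plus its shape."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO weights (layer_name, shape, data) VALUES (%s, %s, %s) "
            "ON CONFLICT (layer_name) DO UPDATE SET shape = EXCLUDED.shape, data = EXCLUDED.data",
            (name, list(weights.shape), weights.ravel().tolist()),
        )

def load_layer(name):
    """Rebuild the numpy array from the stored flat values and shape."""
    with conn, conn.cursor() as cur:
        cur.execute("SELECT shape, data FROM weights WHERE layer_name = %s", (name,))
        shape, data = cur.fetchone()
        return np.array(data).reshape(shape)
```

Keep in mind that for matrices well over 100 MB, a binary file on S3 is usually cheaper to move around than very large ARRAY columns.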
I am trying to do a full load of a very large table (600+ million records) that resides in an on-premises Oracle database. My destination is an Azure Synapse Dedicated SQL Pool.
I have already tried the following:
Using the ADF Copy activity with source partitioning, as the source table has 22 partitions
I increased the copy parallelism and DIUs to a very high level
Still, I am able to fetch only 150 million records in 3 hours, whereas the requirement is to complete the full load in around 2 hours, as the source will be frozen to users during that window so that Synapse can copy the data
How can a full copy of the data be done from Oracle to Synapse in that time frame?
For a change, I tried loading data from Oracle to ADLS Gen 2, but it's slow as well
There are a number of factors to consider here. Some ideas:
how fast can the table be read? What indexing / materialized views are in place? Is there any contention at the database level to rule out?
Recommendation: ensure database is set up for fast read on the table you are exporting
as you are on-premises, what is the local network card setup and throughput?
Recommendation: ensure local network setup is as fast as possible
as you are on-premises, you must be using a Self-hosted Integration Runtime (SHIR). What is the spec of this machine? eg 8GB RAM, SSD for spooling etc as per the minimum specification. Where is this located? eg 'near' the datasource (in the same on-premises network) or in the cloud. It is possible to scale out SHIRs by having up to four nodes but you should ensure via the metrics available to you that this is a bottleneck before scaling out.
Recommendation: consider locating the SHIR 'close' to the datasource (ie in the same network)
is the SHIR software version up-to-date? This gets updated occasionally so it's good practice to keep it updated.
Recommendation: keep the SHIR software up-to-date
do you have ExpressRoute or are you going across the internet? ExpressRoute would probably be faster
Recommendation: consider ExpressRoute. Alternatively, consider Data Box for a large one-off export.
you should almost certainly land directly in ADLS Gen 2 or blob storage. Going straight into the database could result in contention there, and you are dealing with Synapse concepts such as transaction logging, DWU, resource classes and queuing contention among others. View the metrics for the storage account in the Azure portal to determine whether it is under stress. If it is under stress (which I think unlikely), consider multiple storage accounts
Recommendation: load data to ADLS Gen 2. Although this might seem like an extra step, it provides a recovery point and avoids the contention issues of attempting to do the extract and load all at the same time. I would only load directly to the database if you can prove it goes faster and you definitely don't need the recovery point
what format are you landing in the lake? Converting to parquet is quite compute-intensive, for example. Landing to the lake does leave an audit trail and gives you a position to recover from if things go wrong
Recommendation: use parquet for a compressed format. You may need to optimise the file size.
ultimately the best thing to do would be one big bulk load (say taking the weekend) and then do incremental upserts using a CDC mechanism. This would allow you to meet your 2 hour window.
Recommendation: consider a one-off big bulk load and CDC / incremental loads to stay within the timeline
In summary, it's probably your network but you have a lot of investigation to do first, and then a number of options I've listed above to work through.
wBob provided a good summary of things you could look at to increase your transfer speed. In addition to that, you could try to bulk export your data into chunks of data files and transfer the files in parallel to Azure Data Lake or Azure Blob Storage; this way you can maximize your network throughput.
Once the data is on the datalake, you can scale up your Synapse instance and take advantage of fast loads using the COPY command.
I faced the same problem in our organization, and the fastest way to get the data out of SQL Server was using bcp into a fast storage layer.
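To make the COPY step concrete, here is a rough sketch of what loading the staged Parquet files into the dedicated pool could look like; the server, table name, storage path and managed-identity credential are placeholders for your environment:

```python
import pyodbc

# Connect to the Synapse dedicated SQL pool (placeholder connection details)
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myworkspace.sql.azuresynapse.net;DATABASE=mydw;UID=loader;PWD=..."
)

copy_sql = """
COPY INTO dbo.BigTable
FROM 'https://mystorageaccount.dfs.core.windows.net/staging/bigtable/*.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Managed Identity')
)
"""

with conn:
    conn.execute(copy_sql)  # bulk load from ADLS Gen 2 into the dedicated pool
```

The COPY statement reads the staged files in parallel on the Synapse side, so the export and the load can be tuned independently.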
My current approach is that I have a few containers:
raw (the actual raw files or exports, separated into folders like servicenow-cases, servicenow-users, playvox-evaluations, etc.)
staging (lightly transformed raw data)
analytics (these are Parquet file directories which consolidate and partition the files)
visualization (we use a 3rd-party tool which syncs with Azure Blob, but only CSV files currently. This is almost exactly the same as the analytics container)
However, it could also make some sense to create more containers and kind of use them like I would use a database schema. For example, one container for ServiceNow data, another for LogMeIn data, another for our telephony system, etc.
Is there any preferred approach?
Based on your description, it seems you are torn between using a small number of containers to store a large number of blobs, or a large number of containers each storing a small number of blobs. If your only concerns are parallelism and scalability, you can rest assured and simply design a storage structure that suits you, because partitioning in Azure Blob Storage is done at the blob level, not the container level.
Each of these two approaches has its advantages and disadvantages.
With a small number of containers, you save the cost of creating containers (container-creation operations are billed). But when you list the blobs in a container, everything in it is returned; if you have nested "folders" inside, you need to keep filtering by prefix, so listing is slower than with the many-containers approach. At the same time, any security boundary you set applies to all blobs in that container, which is not necessarily what you want.
With a large number of structured containers, more containers let you set more security boundaries (custom access permissions, container-level SAS signatures). Listing blobs is also easier, since there are no messy prefixes to filter through. The disadvantage, again, is the overhead of creating many containers (in extreme cases this can add significant cost; in general it does not matter. There is a pricing calculator at https://azure.microsoft.com/en-us/pricing/calculator/?cdn=disable).
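For example, with per-source containers you can hand out a container-scoped SAS to each consumer. A minimal sketch with the azure-storage-blob SDK; the account name, key and container name are placeholders:

```python
from datetime import datetime, timedelta
from azure.storage.blob import generate_container_sas, ContainerSasPermissions

ACCOUNT = "mystorageaccount"  # placeholder
ACCOUNT_KEY = "..."           # placeholder

def read_only_sas(container_name, hours=24):
    """Issue a read/list-only SAS scoped to a single container (e.g. 'servicenow')."""
    return generate_container_sas(
        account_name=ACCOUNT,
        container_name=container_name,
        account_key=ACCOUNT_KEY,
        permission=ContainerSasPermissions(read=True, list=True),
        expiry=datetime.utcnow() + timedelta(hours=hours),
    )

sas_token = read_only_sas("servicenow")
url = f"https://{ACCOUNT}.blob.core.windows.net/servicenow?{sas_token}"
```

With a single shared container you would instead have to scope access at the blob level or manage it with prefixes, which is harder to reason about.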
I am working on a project with a requirement to come up with a generic dashboard where users can do different kinds of grouping, filtering and drill-down on different fields. For this we are looking for a search store that allows slicing and dicing of data.
There would be multiple sources of data, and we would store it in the search store. There may be some pre-computation required on the source data, which can be done by an intermediate component.
I have looked through several blogs to understand whether ES can be used reliably as a primary datastore too. It mostly depends on the use case. Some information about our use case:
Around 300 million records each year, at 1-2 KB each.
Assuming we store 1 year of data, we are at 300 GB today, but the use case can grow to 400-500 GB as the data grows.
We are not yet sure how we will push data, but roughly it can go up to ~2-3 million records per 5 minutes.
Search request volume is low, but the requests require complex queries that can search data from the last 6 weeks to 6 months.
Documents will be indexed across almost all of their fields.
Some blogs say that it is reliable enough to use as a primary data store -
http://chrisberkhout.com/blog/elasticsearch-as-a-primary-data-store/
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
https://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/
And some blogs say that ES has a few limitations -
https://www.found.no/foundation/elasticsearch-as-nosql/
https://www.found.no/foundation/crash-elasticsearch/
http://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore
Has anyone used Elasticsearch as the sole source of truth, without a primary storage like PostgreSQL, DynamoDB or RDS? I have read that ES has certain issues like split brain and index corruption where data loss is possible, so I am looking to know whether anyone has used ES this way and run into any trouble with their data.
Thanks.
Short answer: it depends on your use case, but you probably don't want to use it as a primary store.
Longer answer: you should really understand all of the possible issues that can come up around resiliency and data loss. Elastic has some great documentation of these issues, which you should read before using it as a primary data store. In addition, Aphyr's post on the topic is a good resource.
If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.
It is generally a good idea to design redundant data storage solutions. For example, it could be a fast and reliable approach to first push everything as flat data to static storage like S3, then have ES pull and index the data from there. If you need more flexibility, leveraging some ORM, you could have an RDS or Redshift layer in between. This way the data can always be rebuilt in ES.
How you set the balance between redundancy and flexibility/performance depends on your needs and requirements. If there is a lot of data involved, you could store the raw data statically and index only parts of it with ES.
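A minimal sketch of that "S3 as source of truth, ES as index" pattern, assuming newline-delimited JSON objects in S3 with an id field and the official elasticsearch Python client (bucket, prefix and index names are made up):

```python
import json
import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def _actions(bucket, prefix, index):
    """Stream bulk-index actions from the flat files kept in S3."""
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                doc = json.loads(line)
                yield {"_index": index, "_id": doc["id"], "_source": doc}

def reindex_from_s3(bucket="raw-events", prefix="2022/", index="events"):
    """Rebuild the ES index from the raw data; ES is treated as disposable."""
    helpers.bulk(es, _actions(bucket, prefix, index))
```

Because the raw files remain in S3, losing or corrupting the ES index is recoverable by re-running the function.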
AWS Lambda offers great features:
Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and enable high speed search. AWS Lambda makes it easy to keep everything in sync by running a function to automatically update the index in Amazon DynamoDB every time objects are added or updated from Amazon S3.
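A minimal sketch of such a function, assuming an S3-triggered Lambda and a hypothetical DynamoDB table named object-index:

```python
import boto3

table = boto3.resource("dynamodb").Table("object-index")  # hypothetical table name

def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; mirrors object metadata into DynamoDB."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={"bucket": bucket, "key": key, "size": size})
    return {"indexed": len(event["Records"])}
```

The S3 objects stay the source of truth; DynamoDB (or ES) only holds the searchable metadata.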
Since 2015, when this question was originally posted, a lot of resiliency issues have been found and addressed, and in recent years many features, specifically stability and resiliency features, have been added, so it is definitely something to consider given the right use case and leveraging the right features in the right way.
So as of 2022, my answer to this question is: yes, you can, as long as you do it correctly and for the right use case.
Happy new year and best wishes!
We are collecting a large number of GPS positions for analytics purposes that we would like to store and process (2-3 GB of data daily) using Heroku / Amazon services, and we are looking for a suitable solution. We initially thought about a system where the data is uploaded directly to Amazon S3, a worker dyno constantly processes it and puts the GPS positions into a Heroku PostGIS database, and another worker dyno is used on demand to compute analytics output on the fly. We have also heard about Amazon Elastic MapReduce, which works directly with raw data in S3 without a PostGIS database. We need your guidance.
What are your recommendations for this kind of need for storing and processing data (Heroku add-ons, architectures, etc.)? What do you think of the two alternatives listed above?
Many thanks
It is difficult to give a precise answer, as the details of your processing are not clear. Do you need per-user analytics, per-region analytics, analytics across days, etc.?
I can point you to some related services:
Amazon Kinesis - a new service targeted at such use cases (Internet-of-Things-like). You can PUT your readings from various sources (including directly from the mobile devices) and read them on the server side; see the sketch after this list.
Amazon DynamoDB - a NoSQL database for which AWS recently added a geospatial library: http://www.allthingsdistributed.com/2013/09/dynamodb-geospatial.html
http://aws.typepad.com/aws/2013/09/new-geo-library-for-dynamodb-.html
RDS with PostgreSQL - PostgreSQL is very good for GIS calculations, and with RDS it is even easier to manage, as most of the DBA work needed (installation, updates, backup, restore, etc.) is done by the RDS service.
S3 - THE place to store your data for batch processing. Note that it is best to have larger files for most processing cases like EMR. You can have a connector that reads the data from Kinesis and stores it in S3 (see the GitHub example: https://github.com/awslabs/amazon-kinesis-connectors/tree/master/src/main/java/com/amazonaws/services/kinesis/connectors/s3)
Amazon EMR - this is the cluster management service that makes running jobs like Hadoop jobs much easier. You can find a presentation and video about using EMR for geospatial analytics in the re:Invent session BDT201.
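As an illustration of the Kinesis ingestion path mentioned above, here is a minimal boto3 sketch; the stream name and the record fields are assumptions, not something prescribed by the service:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_position(device_id, lat, lon, speed, ts):
    """PUT one GPS reading; partitioning by device id spreads load across shards."""
    kinesis.put_record(
        StreamName="gps-positions",  # hypothetical stream name
        Data=json.dumps({"device": device_id, "lat": lat, "lon": lon,
                         "speed": speed, "ts": ts}).encode("utf-8"),
        PartitionKey=device_id,
    )
```

A consumer (or a Kinesis-to-S3 connector, as linked above) can then batch the readings into larger files for EMR or load them into PostGIS.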
You should also consider pre-processing the data to limit the number of redundant records. Most of your positions are likely to be at the same location; in other words, the device will be sitting still much of the time.
One approach is to store a new position only if its speed is greater than 0, or if the last stored position's speed was greater than 0. That way, while the device is resting, you store only the first position after it stops moving. There will be noise in the GPS speed, so you will not get rid of every resting position.
Another option would be to store only when a new position is some distance from the previously stored position.
You can always return a result for any requested time by finding the closest record before the requested timestamp.
If you use the distance-based compression, consider setting the required distance at least as large as the expected RMS error of the GPS device, likely to be about 5 m minimum; use a longer distance if you can stand it.
Doing the math for the distance between geo locations can be resource-expensive; pre-calculate delta latitude/longitude thresholds to compare against incoming positions to speed that up (see the sketch below).
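A minimal sketch of that pre-filter, using pre-computed degrees-per-metre deltas instead of a full haversine calculation (the 10 m threshold is just an example):

```python
import math

THRESHOLD_M = 10.0               # example threshold, >= the GPS RMS error
LAT_DEG_PER_M = 1.0 / 111_320.0  # roughly one degree of latitude is 111,320 m

def should_store(prev, new):
    """Cheap filter: compare lat/lon deltas against pre-computed degree thresholds."""
    if prev is None:
        return True
    dlat = abs(new["lat"] - prev["lat"])
    # degrees of longitude shrink with latitude, so scale that threshold accordingly
    lon_deg_per_m = LAT_DEG_PER_M / max(math.cos(math.radians(new["lat"])), 1e-6)
    dlon = abs(new["lon"] - prev["lon"])
    return dlat > THRESHOLD_M * LAT_DEG_PER_M or dlon > THRESHOLD_M * lon_deg_per_m
```

Positions that pass this cheap check can then go through an exact distance calculation, or straight into storage, as you prefer.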
EMR launched a Kinesis connector, so you can process such a dataset using familiar tools in the Hadoop ecosystem. Did you see http://aws.typepad.com/aws/2014/02/process-streaming-data-with-kinesis-and-elastic-mapreduce.html ?
Currently, I have a Pig script running on top of Amazon EMR that loads a bunch of files from S3, then does the filtering and groups the data by phone number, so the data looks like (phonenumber:chararray, bag:{mydata:chararray}). Next I have to store each phone number in a different S3 bucket (possibly buckets in different accounts that I have access to). org.apache.pig.piggybank.storage.MultiStorage seems like the best fit here, but it doesn't work, as there are 2 problems I am facing:
There are a lot of phone numbers (approximately 20,000); storing each phone number in a different S3 bucket is very, very slow and the program even runs out of memory.
There is no way for me to consult my lookup table to decide which bucket to store into.
So I am wondering if anyone can help out? The second problem can probably be solved by writing my own UDF store function, but how do I solve the first one? Thanks.
S3 is limited to 100 buckets per account. On top of that, bucket creation is not immediate, as you need to wait for the bucket to be ready.
However, you can have as many objects as you want in a bucket, and you can write the phone numbers as different objects relatively quickly, especially if you take care with the names of your objects: objects in S3 are partitioned by key prefix. If you give all your objects the same prefix, S3 will try to put all of them in the same "hot" area and you will get lower performance. If you make the prefixes different (usually by simply reversing the ID or timestamp), you will improve performance significantly.
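A rough sketch of writing each phone number's bag as its own object in a single bucket, using the reversed phone number as the key prefix to spread the load (the bucket name and key layout are made up):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "phone-data"  # hypothetical single bucket instead of 20,000 buckets

def store_phone_records(phone_number, records):
    """One object per phone number; reversing the number spreads keys across prefixes."""
    key = f"{phone_number[::-1]}/{phone_number}.txt"
    s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(records).encode("utf-8"))
```

A lookup table mapping phone numbers to destination accounts can then be applied at the key or bucket level in this function rather than inside Pig.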
You can also take a look at DynamoDB, which is a scalable NoSQL database in AWS. You can provision very high throughput for the time it takes to build your index, and you can later export the data to S3 as well, using Hive on Elastic MapReduce.