We are testing out Azure Synapse Link for Dataverse because the Data Export Service is being deprecated. We did some initial tests on low-volume data and were able to get the service up and running fairly easily. I've now tried setting it up for a larger dataset, and the initial sync for these entities is extremely slow.
We tried to sync three of our larger entities: two have approximately 7 million records each and one has around 12 million. In three days it has managed to get through around 7 million records between them, so there are still around 19 million to go. At this rate it could easily take several more days to finish these three.
I'm trying to understand where the bottleneck is but struggling to find guidance on larger datasets.
The examples and documentation from Microsoft show setting up the link with a standard storage account. I've left the partitions on the entities as "month", and the workspace is using a serverless pool. I can see in the monitor that this pool isn't doing anything, which makes sense, as I understand it will only be used once we start issuing queries.
So, I have a few questions around the setup:
Is there a way to monitor where the bottleneck in this synchronisation is? I.e., how could I tell whether it's the workspace, the storage account, or something else?
Is "standard" tier on the storage account sufficient for these sorts of volumes? (I can't find any guidelines around this but maybe I'm looking in the wrong place)
If I set it to premium, I'm assuming we want a "Block Blob" account type. Is that right?
Are there any other aspects that will affect performance? For example, it's a non-production D365 instance; would that make a difference?
Microsoft suggests not adding more than 5 entities at a time. Seeing as there are only 3, but they're on the larger side, would I see improved performance syncing them one at a time?
Thanks!
I am working on a project that requires a generic dashboard where users can do different kinds of grouping, filtering, and drill-down on different fields. For this we are looking for a search store that allows slicing and dicing of the data.
There would be multiple sources of data, all of which would be stored in the search store. Some pre-computation may be required on the source data, which can be done by an intermediate component.
I have looked through several blogs to understand whether ES can be used reliably as a primary datastore. It mostly depends on the use case. Some information about our use case:
Around 300 million records each year, each 1-2 KB.
Assuming we store one year of data, that's around 300 GB today, but it could grow to 400-500 GB.
We're not yet sure how we will push data, but roughly it could go up to ~2-3 million records per 5 minutes.
Search request volume is low, but the queries are complex and can search data from the last 6 weeks to 6 months.
Documents will be indexed across almost all of their fields.
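For illustration, here is a minimal sketch of how such data might be laid out in time-based (monthly) indices so that 6-week to 6-month queries only touch the relevant indices, and old months can be dropped cheaply. The field names, index naming scheme, and the 7.x-style Python Elasticsearch client calls are all assumptions, not part of the original question.

```python
# Hypothetical sketch: one index per month so range queries only hit the
# months they need, and retention is just "delete old indices".
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

mapping = {
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
            "user_id":   {"type": "keyword"},   # hypothetical fields
            "category":  {"type": "keyword"},
            "amount":    {"type": "double"},
            "message":   {"type": "text"},
        }
    }
}

# One index per month, e.g. events-2015-01, events-2015-02, ...
es.indices.create(index="events-2015-01", body=mapping, ignore=400)  # ignore "already exists"

# A "last 6 weeks" style query hits only the matching monthly indices.
result = es.search(
    index="events-2015-*",
    body={
        "query": {"range": {"timestamp": {"gte": "now-6w/d"}}},
        "aggs": {"by_category": {"terms": {"field": "category"}}},
        "size": 0,
    },
)
print(result["aggregations"]["by_category"]["buckets"])
```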
Some blogs say that it is reliable enough to use as a primary data store:
http://chrisberkhout.com/blog/elasticsearch-as-a-primary-data-store/
http://highscalability.com/blog/2014/1/6/how-hipchat-stores-and-indexes-billions-of-messages-using-el.html
https://karussell.wordpress.com/2011/07/13/jetslide-uses-elasticsearch-as-database/
And some blogs say that ES has some limitations:
https://www.found.no/foundation/elasticsearch-as-nosql/
https://www.found.no/foundation/crash-elasticsearch/
http://www.quora.com/Why-should-I-NOT-use-ElasticSearch-as-my-primary-datastore
Has anyone used Elasticsearch as the sole source of truth, without a primary store like PostgreSQL, DynamoDB, or RDS? I have read that ES has certain issues, like split-brain and index corruption, that can lead to data loss. So I'd like to know whether anyone has used ES this way and run into any trouble with their data.
Thanks.
Short answer: it depends on your use case, but you probably don't want to use it as a primary store.
Longer answer: you should understand all of the issues that can come up around resiliency and data loss. Elastic has good documentation of these issues, which you should read before using it as a primary data store. Aphyr's post on the topic is also a good resource.
If you understand the risks you are taking and you believe that those risks are acceptable (e.g. because small data loss is not a problem for your application) then you should feel free to go ahead and try it.
It is generally a good idea to design redundant data storage. For example, a fast and reliable approach is to first push everything as flat data to static storage like S3, then have ES pull and index the data from there. If you need more flexibility, e.g. to leverage an ORM, you could have an RDS or Redshift layer in between. Either way, the data can always be rebuilt in ES.
How you balance redundancy against flexibility and performance depends on your requirements. If there's a lot of data involved, you could store the raw data statically and index only parts of it in ES.
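As a rough illustration of that pattern, here is a minimal sketch that treats S3 as the durable source of truth and rebuilds the ES index by replaying the stored objects. The bucket name, index name, and record layout (newline-delimited JSON with an "id" field) are assumptions for the sake of the example.

```python
# Hypothetical sketch: S3 holds the raw data; the ES index is disposable and
# can always be rebuilt by streaming the objects back through the bulk helper.
import json

import boto3
from elasticsearch import Elasticsearch, helpers

s3 = boto3.client("s3")
es = Elasticsearch("http://localhost:9200")

BUCKET = "my-raw-events"   # assumed bucket of newline-delimited JSON files
INDEX = "events"

def generate_actions():
    """Yield one bulk-index action per record stored in S3."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                record = json.loads(line)
                yield {"_index": INDEX, "_id": record["id"], "_source": record}

# Rebuild (or top up) the index from the raw data in S3.
helpers.bulk(es, generate_actions())
```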
AWS Lambda offers some useful features here:
"Many developers store objects in Amazon S3 while using Amazon DynamoDB to store and index the object metadata and enable high speed search. AWS Lambda makes it easy to keep everything in sync by running a function to automatically update the index in Amazon DynamoDB every time objects are added or updated from Amazon S3."
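For illustration, a minimal Lambda handler along those lines might look like the sketch below. It is triggered by S3 object-created events and writes the object metadata into a DynamoDB table; the table name and the attribute layout are assumptions.

```python
# Hypothetical sketch of the S3 -> Lambda -> DynamoDB metadata-index pattern.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("object-metadata")  # assumed table keyed on "key"

def handler(event, context):
    # An S3 event notification can contain several records.
    for record in event["Records"]:
        s3_info = record["s3"]
        table.put_item(
            Item={
                "key": s3_info["object"]["key"],
                "bucket": s3_info["bucket"]["name"],
                "size": s3_info["object"]["size"],
                "etag": s3_info["object"]["eTag"],
                "event_time": record["eventTime"],
            }
        )
    return {"indexed": len(event["Records"])}
```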
Since 2015, when this question was originally posted, a lot of resiliency issues have been found and addressed, and in recent years many features, specifically stability and resiliency features, have been added. It's definitely something to consider, given the right use case and leveraging the right features in the right way.
So as of 2022, my answer to this question is: yes, you can, as long as you do it correctly and for the right use case.
We're using GeoServer, and we have performance problems in production with a large number of users.
We've run some load tests with 250, 150, and 20 threads. We've noticed that GeoServer works better with 20 threads than with 150, and that when the thread count increases (150 or 250), performance decreases.
Is this normal? How does GeoServer manage user requests? Does GeoServer use an asynchronous strategy to manage user requests?
Thanks in advance.
Sounds pretty normal. Threads (and CPU context switches) aren't free, and at some point you are going to spend more time thrashing around switching threads than actually doing anything useful. It's often better to have a much smaller number of threads (number of cores * 2 is often reasonable) combined with some sort of front-end queue that will accept a connection and hold it until a worker is free.
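As a minimal illustration of that shape, here is a fixed-size worker pool with a bounded queue in front of it. The handler and the "requests" are dummies; this is just to show the structure, not a GeoServer configuration.

```python
# Hypothetical sketch: a small, fixed-size worker pool with a queue in front
# of it, instead of one thread per connection.
import os
import queue
import threading
import time

NUM_WORKERS = (os.cpu_count() or 4) * 2   # cores * 2 is often a reasonable start
pending = queue.Queue(maxsize=1000)       # front-end queue holding waiting requests

def handle(request):
    time.sleep(0.01)                      # stand-in for map rendering / DB work
    print(f"done: {request}")

def worker():
    while True:
        request = pending.get()           # blocks until a request is available
        try:
            handle(request)
        finally:
            pending.task_done()

for _ in range(NUM_WORKERS):
    threading.Thread(target=worker, daemon=True).start()

# The "accept loop": connections queue up here until a worker frees up,
# rather than each one getting (and fighting over) its own thread.
for i in range(200):
    pending.put(f"request-{i}")

pending.join()
```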
Here are some real-world use-case statistics for you. In production, for mobile/web apps serving 'Google Maps'-style users in the outdoor market, my company has tested various configurations (several of them discussed by theonlysandman, a contributor to this question); they also support the observation by Tyler Evans, another contributor to this question.
We need loads of greater than 5,000 requests per second (qps), and as our GeoServer instances consistently topped out at nearly 100 qps each, we'd need to scale horizontally and vertically to over 50 GeoServer instances.
Parameters: mostly vector sources; local PostGIS databases, each less than 2 TB, with no table > 1M records (or if greater than 1M, geometry simplified at nodes > 1 m apart); 60%-40%-10% WMS/WMTS/WFS requests; Google Cloud-hosted servers, each 32 cores, SSD drive cluster up to 4 TB.
The qps bottleneck appears to be GeoServer itself (styling, reprojection, all the niceties that come with it). I'm not suggesting it is poorly written, but the heavier a car gets, the slower it drives.
If we replicate the WFS requests using Go or Python (with or without GDAL) to access the PostGIS data directly, we get faster throughput than GeoServer (up to 1,000 qps or more per instance, at which point PostGIS becomes the bottleneck).
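As a rough sketch of that "skip GeoServer, hit PostGIS directly" approach for WFS-style reads, something like the following could serve GeoJSON straight from the database. The connection details, table, and column names are assumptions for illustration only.

```python
# Hypothetical sketch: bounding-box reads straight from PostGIS, relying on
# the spatial index via the && operator.
import json

import psycopg2

conn = psycopg2.connect("dbname=gis user=gis host=localhost")

def features_in_bbox(xmin, ymin, xmax, ymax, limit=1000):
    """Return GeoJSON features intersecting a bounding box."""
    sql = """
        SELECT id, name, ST_AsGeoJSON(geom)
        FROM   parcels
        WHERE  geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)
        LIMIT  %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (xmin, ymin, xmax, ymax, limit))
        return [
            {"type": "Feature",
             "properties": {"id": fid, "name": name},
             "geometry": json.loads(geojson)}
            for fid, name, geojson in cur.fetchall()
        ]

print(len(features_in_bbox(-122.6, 45.4, -122.5, 45.6)))
```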
The same goes for our homemade Java microservice that creates PBF/MVT tiles from PostGIS; it, too, was very quick, at about 1,000 qps.
For us, nginx performed slightly better than PHP (~110 qps vs ~89 qps), but this could be a result of our Apache configuration.
Where do we go from here? In all of our production use cases, serving miniature sharded SQLite/MBTiles databases (vector or raster), and maintaining them with custom code, was far more performant and scalable.
We may write a Java plugin for GeoServer that pushes GeoWebCache TMS tiles into a Google Cloud Storage bucket designed for slippy z/x/y calls; this way we could more easily maintain a tile pyramid with updates etc. using GeoServer tools.
The more threads, the heavier the load on the server. See the Wikipedia article on thrashing.
GeoServer performance is affected by many things. My advice is to look at each one and see where the bottleneck is occurring.
Here is a list of questions to set you on the right path:
What are the specs of your machine? It should have an SSD.
Are you generating your tiles on the fly, or are they pre-seeded?
If you are pre-seeding, is that still running?
NOTE: pre-seeding helps but hammers the system, so it's best done outside production hours.
What is the source of your data? If it's PostGIS, are you using spatial indexes? (See the index-check sketch after this list.)
Is PostgreSQL/PostGIS on the same machine?
How many types of tiles are you generating?
NOTE: you could be generating extra tiles which you don't need/use.
Do you use GeoWebCache?
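On the spatial-index question above, a quick way to check for and add one might look like the sketch below. The table and column names are assumptions; the point is simply that geometry columns should carry a GiST index.

```python
# Hypothetical sketch: list existing indexes on a table and create a GiST
# index on the geometry column if it is missing.
import psycopg2

conn = psycopg2.connect("dbname=gis user=gis host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # See whether a spatial index already exists on the table.
    cur.execute("SELECT indexname, indexdef FROM pg_indexes WHERE tablename = %s",
                ("parcels",))
    for name, definition in cur.fetchall():
        print(name, "->", definition)

    # Create the GiST index if it's missing, then refresh planner statistics.
    cur.execute("CREATE INDEX IF NOT EXISTS parcels_geom_idx "
                "ON parcels USING GIST (geom)")
    cur.execute("ANALYZE parcels")
```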
With some more details, I can help you out.
I'm new to everything that is 'the cloud.'
I will be developing a website/platform that will have around 15,000,000 estimated monthly visitors after the first year of production.
I'm assuming that the site will have 5 page views per visitor and 100 KB of data transfer per page.
I've contacted several cloud hosting companies, but they tell me that I need to have 'hardware requirements.'
Since I'm rather clueless about IT stuff, I'd like to know:
What are the factors that need to be analyzed in order to determine:
How many servers are required
vCPUs per server required
RAM per server required
Total storage per server required
Big thanks in advance!
I don't agree with the other answer, as it's nearly total guesswork, as will be anything you can generate yourself.
The only surefire way to know is to get some hardware, put your application on it, and run load tests to see whether you can reach the traffic level you want, with a certain amount of free headroom on the servers. Only then will you know what you need. No one else can answer this question, as every application is different. This is your application; only you can test it.
The data given won't help much in determining the numbers you want, but based on my experience I'll try to help with the analysis.
15,000,000 visitors a month means roughly 700K visits a day (15M/30 is about 500K, plus approximately 30-35% additional visits from repeat visitors).
700K × 5 = 3.5 million page views a day.
Assuming a 14-hour active period, typical for single-time-zone sites, that's about 70 requests/sec.
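Here is the same back-of-the-envelope arithmetic spelled out. The 14-hour active window and the repeat-visit share are the assumptions carried over from above, so the rounded figures come out slightly below the prose numbers.

```python
# Back-of-the-envelope traffic estimate.
visits_per_month = 15_000_000
repeat_factor = 1.35                      # ~30-35% extra visits from repeat visitors

visits_per_day = visits_per_month * repeat_factor / 30
page_views_per_day = visits_per_day * 5   # 5 page views per visit (from the question)

active_seconds = 14 * 3600                # 14-hour active period
requests_per_second = page_views_per_day / active_seconds

print(f"{visits_per_day:,.0f} visits/day")          # ~675,000
print(f"{page_views_per_day:,.0f} page views/day")  # ~3.4 million
print(f"{requests_per_second:.0f} req/s")           # ~67
```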
With this big a user base, one thing you will surely need is a high-performance DB server, with one replica (slave).
Configuration of these DB servers:
Enough memory that the whole active data set plus indexes fits in RAM (no swapping/thrashing should happen). You need to calculate this based on what you will be storing per user and for how long.
Use reliable storage like RAID 10 (higher read/write bandwidth).
Provision enough storage, and make sure it's elastic enough (like AWS EBS).
Make the frontend app servers lightweight and horizontally scalable. Put them behind a load balancer (use a software load balancer like nginx or HAProxy). You should be able to add as many as you need to reach your goal.
For the load balancer and frontends, use servers with 4 CPUs and 4-8 GB RAM.
How much load each frontend can handle needs to be tested with load testing and realistic test data.
Reduce load on the database/persistent store using in-memory and/or persistent caches like memcached, Membase, Redis, etc. Start with servers with 8 GB RAM and add more as needed.
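As a minimal illustration of that caching layer, here is a cache-aside sketch using Redis from Python. The key scheme, TTL, and the placeholder database call are assumptions for the example.

```python
# Hypothetical cache-aside sketch: check Redis first, fall back to the DB,
# then store the result so the next read skips the DB entirely.
import json

import redis

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL = 300  # seconds

def fetch_user_from_db(user_id):
    # Placeholder for the real (expensive) database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                   # cache hit: no DB round trip
    user = fetch_user_from_db(user_id)              # cache miss: hit the DB once...
    cache.setex(key, CACHE_TTL, json.dumps(user))   # ...and store for next time
    return user

print(get_user(42))
```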
I have not discussed DB partitioning. Do that only when you actually need it; don't over-invest at the start.
With 15M users a month, this setup should be enough, but again it all depends on your 1. memory footprint and 2. amount of active data.
I tried to answer as much as possible. Comment on any points you disagree with or want to discuss further.
I am curious whether anybody has done benchmarks for data access in NoSQL databases vs Oracle (particularly Oracle RAC)?
The project requires working with at least 10 million records and searching among them (not necessarily in real time); read speed is very important, and it's also very important to guarantee HA and reliability (we can't lose records!).
I can see for myself how, say, Cassandra/MongoDB might be a better fit (because key-value storage can provide faster reads than SQL once you go over 10 million records), but I find it difficult to articulate the arguments nicely. Any links? Suggestions? Bullet points?
Thanks!
10 million records at, say, 250 bytes per record is about 2.5 GB of data, which is well within the capacity of a basic desktop/laptop PC. The data volumes are insignificant (unless each record is sized in MB, such as pictures or audio).
What you do need to talk about is transaction volumes (separated into reads and writes) and what you consider HA. Read-only HA is easy relative to read-write HA. It can be trivial to replicate a read-only data set to multiple servers in different geographic locations and distribute the query workload across them.
It's much harder to scale out an update-heavy workload, which is why you often hear about systems going into meltdown when tickets for a big concert are released. Quite simply, there's a fixed number of seats, and you can't have ten duplicated systems each selling what they think is available. There has to be a single source of truth, which means a bottleneck (and potentially a single point of failure).
On the HA aspect, RAC is a shared storage technology which generally means your RAC nodes are in close proximity. That can make them vulnerable to localized events such as a building fire or telecoms breakdown. Data Guard is the Oracle technology that relates to off-site replication and failover.
When you come to compare NoSQL vs SQL, you have to understand a very important difference between them: data in NoSQL may be inconsistent, a cost paid in order to achieve HA.
What do I mean by inconsistent? It depends, but it usually takes around 3-5 seconds to propagate data across nodes. NoSQL databases provide mechanisms to manage and reduce that, but if you want all your data to be consistent in real time, then you simply use classic SQL, like Oracle RAC.
Coming back to the speed comparison: there's no simple answer to which one is faster, because it depends on factors like network infrastructure, computing power, the data model, etc. But the important thing is that at some point you may reach the moment where SQL is economically inefficient to maintain and you have to switch to NoSQL.
I have a large amount of data I need to store and be able to generate reports on; each record represents an event on a website (we're talking over 50 per second, so clearly older data will need to be aggregated).
I'm evaluating approaches to implementing this. Obviously it needs to be reliable and should be as easy to scale as possible. It should also be possible to generate reports from the data in a flexible and efficient way.
I'm hoping that some SOers have experience of such software and can make a recommendation, and/or point out the pitfalls.
Ideally I'd like to deploy this on EC2.
Wow. You are opening up a huge topic.
A few things right off the top of my head...
Think carefully about your schema: inserts in the transactional part and reads in the reporting part. You may be best off keeping them separate if you have really large data volumes.
Look carefully at the latency you can tolerate between real-time reporting on your transactions and aggregated reporting on your historical data. Maybe you should have a process that runs periodically and aggregates your transactions (see the sketch after this list).
Look carefully at any requirement that sees you reporting across your transactional and aggregated data, either in the same report or as a drill-down from one to the other.
Prototype with some meaningful queries and realistic data volumes.
Get yourself a real production-quality, enterprise-ready database, e.g. Oracle or MSSQL.
Think about using someone else's code/product for the reporting, e.g. Crystal, BO, or Cognos.
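As a rough sketch of the periodic-aggregation idea from the list above: roll the previous hour of raw events into a summary table on a schedule. The table and column names are assumptions, PostgreSQL/psycopg2 is used purely for illustration, and the ON CONFLICT clause assumes a unique constraint on (bucket, event_type) so re-runs are harmless.

```python
# Hypothetical sketch: hourly rollup from a raw event table into a reporting table.
import time

import psycopg2

conn = psycopg2.connect("dbname=events user=reporting host=localhost")

ROLLUP_SQL = """
    INSERT INTO events_hourly (bucket, event_type, event_count)
    SELECT date_trunc('hour', occurred_at), event_type, count(*)
    FROM   events_raw
    WHERE  occurred_at >= date_trunc('hour', now()) - interval '1 hour'
      AND  occurred_at <  date_trunc('hour', now())
    GROUP  BY 1, 2
    ON CONFLICT (bucket, event_type) DO NOTHING   -- safe to re-run
"""

def run_hourly_rollup():
    while True:
        with conn, conn.cursor() as cur:   # one transaction per rollup
            cur.execute(ROLLUP_SQL)
        time.sleep(3600)

run_hourly_rollup()
```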
As I say, it's a huge topic. As I think of more I'll keep adding to my list.
HTH and good luck
Simon made a lot of excellent points; I'll just add a few and reiterate/emphasize some others:
Use the right data type for timestamps; make sure the DBMS has the appropriate precision.
Consider queueing for the capture of events, allowing for multiple threads/processes to handle the actual storage of the events.
Separate the schemas for your transactional database and your data warehouse.
Seriously consider a periodic ETL from the transactional DB to the data warehouse.
Remember that you probably won't have 50 transactions/second 24x7x365; think about peak vs. average transaction rates.
Investigate partitioning tables in the DBMS. Oracle and MSSQL will both partition on a value (like date/time).
Have an archiving/data retention policy from the outset. Too many projects just start recording data with no plans in place to remove/archive it.
I'm surprised none of the answers here covers Hadoop and HDFS; I would suggest that is because SO is a programmers' Q&A site and your question is in fact a data science question.
If you're dealing with a large number of queries and long processing times, you could use HDFS (a distributed file system that you can run on EC2) to store your data and run batch queries (i.e. analytics) on commodity hardware.
You would then provision as many EC2 instances as needed (hundreds or thousands, depending on how big your data-crunching requirements are) and run MapReduce queries against your data to produce reports.
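As a tiny illustration of that MapReduce pattern, here is a sketch using the mrjob library, which can run the same job locally, on a Hadoop cluster, or on EMR. The tab-separated log format, with the page URL as the first field of each line, is an assumption for the example.

```python
# Hypothetical sketch: count website events per page with MapReduce.
from mrjob.job import MRJob

class EventsPerPage(MRJob):
    def mapper(self, _, line):
        # Each input line is one website event; emit the page it happened on.
        page = line.split("\t")[0]
        yield page, 1

    def reducer(self, page, counts):
        # Sum the events for each page to produce the report row.
        yield page, sum(counts)

if __name__ == "__main__":
    EventsPerPage.run()
```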
Wow.. This is a huge topic.
Let me begin with databases. First, get something good if you are going to have crazy amounts of data. I like Oracle and Teradata.
Second, there is a definite difference between recording transactional data and reporting/analytics. Put your transactional data in one area and then roll it up on a regular schedule into a reporting area (schema).
I believe you can approach this in two ways:
Throw money at the problem: buy best-in-class software (databases, reporting software) and hire a few slick tech people to help.
Take the homegrown approach: build only what you need right now and grow the whole thing organically. Start with a simple database and build a web reporting framework. There are a lot of decent open-source tools and inexpensive agencies that do this work.
As far as the EC2 approach goes, I'm not sure how it would fit into a data storage strategy. Your processing needs are limited, and processing is where EC2 is strong; your primary goal is efficient storage and retrieval.