Frequent changes to attributes: is it wise to use a Directory Service (OpenDJ)?

I am trying to design an access solution using OpenDJ. Down the line I realized the system will have frequent changes to accounts and users in the tree.
With your experience dealing with OpenDJ / Directory Services, is it wise to use it to manage frequent changes and bulk changes?

I think you need to define "frequent".
But OpenDJ / ForgeRock Directory Services have been designed to deal with a larger throughput of updates than traditional directory servers. On our lab machines we're able to sustain 15,000 modifications per second on a single server, in a replicated environment that is up to date within milliseconds.
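For illustration, here is a minimal sketch of how a batch of attribute modifications could be pushed to OpenDJ from Python using the ldap3 library; the host, credentials, and DNs are placeholders, not anything taken from the question.

```python
# Minimal sketch: apply a batch of attribute changes against OpenDJ via LDAP.
# Host, credentials, and DNs below are hypothetical placeholders.
from ldap3 import Server, Connection, MODIFY_REPLACE

server = Server("ldaps://opendj.example.com:1636")
conn = Connection(server, user="cn=Directory Manager",
                  password="password", auto_bind=True)

# A batch of pending account updates, e.g. collected from an upstream system.
pending_updates = [
    ("uid=jdoe,ou=people,dc=example,dc=com", {"mail": "jdoe@example.com"}),
    ("uid=asmith,ou=people,dc=example,dc=com", {"telephoneNumber": "+1 555 0100"}),
]

for dn, attrs in pending_updates:
    # Each attribute is replaced with its new value.
    changes = {name: [(MODIFY_REPLACE, [value])] for name, value in attrs.items()}
    conn.modify(dn, changes)
    if conn.result["description"] != "success":
        print("failed to update", dn, conn.result)

conn.unbind()
```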

Related

Copy activity taking too long to copy data from an on-prem Oracle database to Azure Synapse Analytics

I am trying to copy data from an Oracle database to Azure Synapse, and it is taking a long time: around 3 days to copy 900 GB of data.
My Oracle database is on-prem and I have configured a self-hosted IR.
I have also configured staging while copying the data from the on-prem Oracle database to Azure Synapse.
I'm not sure why it is taking this much time. How can we check and fix this data copy issue?
Here, the issue could be due to various reasons. You may need to host a growing number of concurrent workloads, or you may wish to improve performance at your current workload level.
You can scale up the self-hosted IR by increasing the number of concurrent jobs that can run on a node.
Scaling up is only possible if the node's CPU and memory are not fully utilized.
You can scale out the self-hosted IR by adding more nodes (machines).
You can specify the parallelism you want the copy activity to use by setting the parallelCopies property. Consider this property to be the maximum number of threads allowed within the copy operation. The threads work in tandem, either reading from the source or writing to the sink data store.
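As an illustration only, here is roughly what those settings look like in a copy activity definition, written out as a Python dict mirroring the pipeline JSON; the linked service and container names are placeholders.

```python
# Illustrative sketch of the copy activity settings controlling parallelism
# and staging, expressed as a Python dict mirroring the ADF pipeline JSON.
# "StagingBlobStorage" and "staging-container" are placeholder names.
copy_activity = {
    "name": "CopyOracleToSynapse",
    "type": "Copy",
    "typeProperties": {
        "source": {"type": "OracleSource"},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
        "parallelCopies": 8,       # upper bound on parallel copy threads
        "enableStaging": True,     # stage data in Blob storage before loading Synapse
        "stagingSettings": {
            "linkedServiceName": {
                "referenceName": "StagingBlobStorage",  # placeholder linked service
                "type": "LinkedServiceReference",
            },
            "path": "staging-container",
        },
    },
}
```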
Here are the referenced docs: Self-hosted integration runtime scalability & High availability and scalability

How to serve data from HDFS fast / in real time to my services?

For now, in my company, every team that needs to serve data from HDFS to users creates its own tool for that task.
We want to create a generic tool for serving that data fast / in real time over HTTP from HDFS to my services. By generic I mean the tool should serve data only for the selected services added to its configuration, and adding a service to that configuration should be the only action users have to perform to use the tool. This new tool should be informed about new data appearing in HDFS and then invoke some kind of job that moves the data from HDFS to our fast storage.
Applications can update their data every day or every hour, but every service can do it at a different time (service A can be updated every day at 7 AM and service B can be updated every hour). I think we do not want to use any schemas, and we want to access our data using only a key and a partition date. Querying is not necessary.
We do not know yet how much capacity or read/writes per second our tool needs to withstand.
We've worked out a solution for our problem, but we are interested in whether similar solutions already exist in open source, or maybe some of you have had a similar use case?
This is our proposed architecture:
[architecture diagram]
If you need to access HDFS over HTTP then WebHDFS might fit your use case. You could add a caching layer to speed up requests for hot files, but I think as long as you are using HDFS you'll never get sub-second response for any file that isn't already cached. You must decide if that is acceptable for you.
I'm unsure of how well WebHDFS works with huge files.
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
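As a rough sketch, assuming a default Hadoop 2.x setup, reading a file over WebHDFS from Python could look like this; the NameNode host, port, and file path are placeholders.

```python
# Minimal sketch of reading a file over WebHDFS with plain HTTP.
# NameNode host, port, and file path are placeholders.
import requests

NAMENODE = "http://namenode.example.com:50070"
path = "/data/serviceA/2016-01-01/part-00000"

# OPEN is redirected by the NameNode to the DataNode holding the data.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1{path}",
    params={"op": "OPEN"},
    allow_redirects=True,
)
resp.raise_for_status()
data = resp.content
print(len(data), "bytes read")
```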

How to schedule backups for development, testing, and production environments?

I was educating myself about website launches, and I was reading this document to prepare an implementation checklist for my future launches.
On page 11, they mentioned
Schedule backups for development, testing, and production environments
I can imagine taking a backup of a database and website code.
How can someone automate and schedule a backup of an entire x, y, or z environment?
If this is a very broad question, I do not need a step-by-step document for doing this; some starting point that helps me imagine it would be good enough.
Update:
I have sent an inquiry to Pantheon and am looking forward to their response.
It all depends on how you set up your environments, how complex they are, and what you are trying to back up.
How often do your environments change?
You can use something like Docker to keep track of the changes to your environments. It basically gives you version control for environments.
If you're using a hosting provider like DigitalOcean, they provide snapshots and backups of entire virtual images.
Read about rsync - https://wiki.archlinux.org/index.php/Full_system_backup_with_rsync
You could use rsync if you manually keep track of all of your config files and handle the scheduling and syncing yourself.
As for the actual site:
What's changing on your sites?
For changes to the code of a site, you should be using version control, so that every single change is backed up to a repository host like GitHub.
Most of the time, though, the only thing that is changing is data in a database. Your database can easily be managed by a third party like Amazon RDS, which offers automated backups. If you want to manage your own database on your own server, then you can have a cron job automatically perform a backup of your database at scheduled times. Then you can use rsync to sync that backup to a separate machine. (You want to spread your backups across a couple of machines in different locations.)
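As a rough illustration of that cron + dump + rsync idea, assuming a MySQL database and placeholder hosts and paths, the nightly job could look something like this:

```python
# Sketch of a nightly backup job (run from cron): dump a MySQL database,
# compress it, and rsync the backups directory to a second machine.
# Database name, paths, and the off-site host are placeholders.
import datetime
import subprocess

STAMP = datetime.date.today().isoformat()
DUMP_FILE = f"/backups/db-{STAMP}.sql.gz"

# Dump the database and compress it on the fly.
with open(DUMP_FILE, "wb") as out:
    dump = subprocess.Popen(
        ["mysqldump", "--single-transaction", "mydatabase"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(["gzip"], stdin=dump.stdout, stdout=out, check=True)
    dump.wait()

# Sync the backups directory to an off-site machine.
subprocess.run(
    ["rsync", "-az", "--delete", "/backups/", "backup@offsite.example.com:/backups/"],
    check=True,
)
```

You would then schedule it from cron, for example with a crontab line like 0 2 * * * python3 /opt/backup/nightly_backup.py (the path and time are placeholders).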
You can run cron jobs at regular intervals to back up your data. The most important data could be your transactional information or the entire database itself.
Code you can manage on the developers' end using GitHub. But if you have certain folders that are populated by website users, then plan to back those up as well.

How many users can a single small Windows Azure web role support?

We are creating a simple website, but with heavy user logins (about 25,000 concurrent users). How can I calculate the number of instances required to support it?
Load testing and performance testing are really the only way you're going to figure out the performance metrics and instance requirements of your app. You'll need to define "concurrent users" - does that mean 25,000 concurrent transactions, or does that simply mean 25,000 active sessions? And if the latter, how frequently does a user visit web pages (e.g. think-time between pages)? Then, there's all the other moving parts: databases, Azure storage, external web services, intra-role communication, etc. All these steps in your processing pipeline could be a bottleneck.
Don't forget SLA: Assuming you CAN support 25,000 concurrent sessions (not transactions per second), what's an acceptable round-trip time? Two seconds? Five?
When thinking about instance count, you also need to consider VM size in your equation. Depending again on your processing pipeline, you might need a medium or large VM to support specific memory requirements, for instance. You might get completely different results when testing different VM sizes.
You need to have a way of performing empirical tests that are repeatable and remove edge-case errors (for instance: running tests a minimum of 3 times to get an average; and methodically ramping up load in a well-defined way and observing results while under that load for a set amount of time to allow for the chaotic behavior of adding load to stabilize). This empirical testing includes well-crafted test plans (e.g. what pages the users will hit for given usage scenarios, including possible form data). And you'll need the proper tools for monitoring the systems under test to determine when a given load creates a "knee in the curve" (meaning you've hit a bottleneck and your performance plummets).
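To make the ramp-up idea concrete, here is a very rough sketch in Python using threads and the requests library against a placeholder URL; a real test would use one of the dedicated tools mentioned below, with proper think-time and scenario modelling.

```python
# Rough sketch of stepwise load ramp-up: add batches of simulated users,
# hold each level for a while, and record average response times.
# The target URL, step size, and hold time are placeholders, not a real test plan.
import threading
import time
import requests

TARGET = "https://myapp.example.com/"   # placeholder URL
results = []                            # (timestamp, elapsed_seconds)
stop = threading.Event()

def user_loop():
    while not stop.is_set():
        start = time.time()
        try:
            requests.get(TARGET, timeout=10)
        except requests.RequestException:
            pass
        results.append((start, time.time() - start))
        time.sleep(1.0)  # crude "think time" between page hits

threads = []
for step in range(1, 6):                # ramp: 50, 100, 150, ... simulated users
    for _ in range(50):
        t = threading.Thread(target=user_loop, daemon=True)
        t.start()
        threads.append(t)
    time.sleep(120)                     # hold the load so behaviour can stabilise
    recent = [e for ts, e in results if ts > time.time() - 120]
    avg = sum(recent) / max(len(recent), 1)
    print(f"{step * 50} users -> avg response {avg:.2f}s")

stop.set()
```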
Final thought: Be sure your load-generation tool is not the bottleneck during the test! You might want to look into using Microsoft's load-testing solution with Visual Studio, or a cloud-based load-test solution such as Loadstorm (disclaimer: Loadstorm interviewed me about load/performance testing last year, but I don't work for them in any capacity).
EDIT June 21, 2013 Announced at TechEd 2013, Team Foundation Service will offer cloud-based load-testing, with the Preview launching June 26, coincident with the //build conference. The announcement is here.
No one can answer this question without a lot more information... like what technology you're using to build the website, what happens on a page load, what backend storage is used (if any), etc. It could be that for each user who logs on, you compute a million digits of pi, or it could be that for each user you serve up static content from a cache.
The best advice I have is to test your application (either in the cloud or equivalent hardware) and see how it performs.
It all depends on the architecture design, the persistence technology, and the number of read/write operations you are performing per second (average/peak).
I would recommend looking into CQRS-based architectures for this kind of application. They fit cloud computing environments and allow for elastic scaling.
I was recently at a Cloud Summit and there were a few case studies. The one that sticks in my mind is an exam app. It has a burst load of about 80,000 users over 2 hours, for which they spin up about 300 instances.
Without knowing your load profile it's hard to add more value; just keep in mind that concurrent and continuous are not the same thing. Remember the Stack Overflow versus Digg debacle (http://twitter.com/#!/spolsky/status/27244766467)?

Images in a load-balanced environment

I have a load-balanced environment with over 10 web servers running IIS. All websites access a single file storage that hosts all the pictures. We currently have 200 GB of pictures, stored in directories of 1,000 images per directory. Right now all the images are on a single storage device (RAID 10) connected to a single server that serves as the file server. All web servers are connected to the file server on the same LAN.
I am looking to improve the architecture so that we would have no single point of failure.
I am considering two alternatives:
Replicate the file storage to all of the web servers so that they all access the data locally
Replicate the file storage to another storage device so that if something happens to the current storage we can switch over to it
Obviously the main operations done on the file storage are read, but there are also a lot of write operations. What do you think is the preferred method? Any other idea?
I am currently ruling out use of CDN as it will require an architecture change on the application which we cannot make right now.
Certain things I would normally consider before going for an architecture change are:
What are the issues with the current architecture?
What am I doing wrong with the current architecture? (If it has been working for a while, minor tweaks will normally solve a lot of issues.)
Will it allow me to grow easily? (There will always be an upper limit; based on the past growth of data, you can plan for it effectively.)
Reliability
Ease of maintenance / monitoring / troubleshooting
Cost
200 GB is not a lot of data; you can go for some homegrown solution or use something like a NAS, which will allow you to expand later on, and keep a hot-swappable replica of it.
Replicating the storage to all of the web servers is a very expensive setup, and since, as you said, there are a lot of write operations, replicating to all of the servers will have a large overhead (which will only increase with the number of servers and growing data). There is also the issue of stale data being served by one of the other nodes. Apart from that, troubleshooting replication issues will be a mess with 10 and growing nodes.
Unless the lookup / read / write of files is very time-critical, replicating to all the web servers is not a good idea. Web users will hardly notice a difference of 100-200 ms in load time.
There are some enterprise solutions for this sort of thing, but I don't doubt that they are expensive. NAS doesn't scale well, and you have a single point of failure, which is not good.
There are some ways that you can write code to help with this. You could cache the images on the web servers the first time they are requested; this will reduce the load on the image server.
You could get a master-slave setup, so that you have one main image server and other servers that copy from it. You could load balance these, and put some logic in your code so that if a slave doesn't have a copy of an image, you check on the master. You could also assign these in priority order, so that if the master is not available, the first slave then becomes the master.
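A minimal sketch of that "check a slave, fall back to the master" logic, with placeholder host names standing in for the real image servers:

```python
# Sketch of fetching an image by trying slave hosts first and falling back
# to the master. Host names are placeholders; the priority list stands in
# for real failover configuration.
import requests

IMAGE_HOSTS = [
    "http://img-slave1.example.internal",  # preferred (load-balanced slaves)
    "http://img-slave2.example.internal",
    "http://img-master.example.internal",  # last resort: the master
]

def fetch_image(relative_path):
    """Return the image bytes, trying hosts in priority order."""
    for host in IMAGE_HOSTS:
        try:
            resp = requests.get(f"{host}/{relative_path}", timeout=2)
            if resp.status_code == 200:
                return resp.content
        except requests.RequestException:
            continue  # host down or slow: try the next one
    raise RuntimeError(f"no host could serve {relative_path}")
```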
Since you have so little data in your storage, it makes sense to buy several big HDs or use the free space on your web servers to keep copies. It will reduce the strain on your backend storage system, and when it fails, you can still deliver content to your users. Even better, if you need to scale (more downloads), you can simply add a new server and the stress on your backend won't change much.
If I had to do this, I'd use rsync or unison to copy the image files in the exact same space on the web servers where they are on the storage device (this way, you can swap out the copy with a network file system mount any time).
Run rsync every now and then (for example after any upload, or once a night; you'll know best which schedule fits you).
A more versatile solution would be to use a P2P protocol like BitTorrent. This way, you could publish all the changes on the storage backend to the web servers and they'd optimize the updates automatically.
