How to run Memgraph database without any disk persistence - memgraphdb

How can I run the Memgraph database in memory only, with no snapshots to disk, no write-ahead logging, and no disk persistence at all? I am happy to lose all data if the database is stopped.

This is an as-yet untested answer, but looking at the configuration file in /etc/memgraph/memgraph.conf, there are parameters related to storage durability. I would set this one to turn off durability:
--durability-enabled=false
There are other configuration parameters after this one that also relate to persistence. I would set them as follows, just to be sure, although I assume setting the flag above to false already turns off all durability:
--db-recover-on-startup=false
--snapshot-cycle-sec=-1
--snapshot-on-exit=false
--snapshot-max-retained=3
--synchronous-commit=false
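
For completeness, a minimal sketch of how the change might be applied, assuming the stock package layout (the config path comes from the answer above; the systemd service name is an assumption, not something the question confirms):

# /etc/memgraph/memgraph.conf - keep the database in memory only
--durability-enabled=false
--db-recover-on-startup=false
--snapshot-on-exit=false
--synchronous-commit=false

# restart so the new flags take effect (assumes a systemd-managed service called memgraph)
sudo systemctl restart memgraph

With durability off nothing is written to disk, so a stopped or crashed instance always starts empty, which matches the question's intent.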

Related

How much space does a Spark Streaming checkpoint take?

I am new to Spark Streaming and have little knowledge about checkpoints. Is the streaming data stored in the checkpoint? Is the data stored in HDFS or in memory? How much space will it take?
According to Spark: The Definitive Guide:
The most important operational concern for a streaming application is failure recovery. Faults are inevitable: you're going to lose a machine in the cluster, a schema will change by accident without a proper migration, or you may even intentionally restart the cluster or application. In any of these cases, Structured Streaming allows you to recover an application by just restarting it. To do this, you must configure the application to use checkpointing and write-ahead logs, both of which are handled automatically by the engine. Specifically, you must configure a query to write to a checkpoint location on a reliable file system (e.g., HDFS, S3, or any compatible filesystem). Structured Streaming will then periodically save all relevant progress information (for instance, the range of offsets processed in a given trigger) as well as the current intermediate state values to the checkpoint location. In a failure scenario, you simply need to restart your application, making sure to point to the same checkpoint location, and it will automatically recover its state and start processing data where it left off. You do not have to manually manage this state on behalf of the application; Structured Streaming does it for you.
I conclude that what is stored in the checkpoint is job progress information and intermediate state, not the data itself. The checkpoint location has to be a path on an HDFS-compatible file system, and the required space depends on the intermediate output that is generated.
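
As a hedged illustration of the checkpointing described in the quote, here is a small Java sketch of a Structured Streaming query writing its progress to a checkpoint location; the "rate" source, the HDFS path and the application name are placeholders for this example, not details from the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointExample {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("checkpoint-example")
                .getOrCreate();

        // Toy source; in practice this would be Kafka, files, etc.
        Dataset<Row> stream = spark.readStream()
                .format("rate")                    // generates (timestamp, value) rows
                .option("rowsPerSecond", "10")
                .load();

        // Offsets and intermediate state are written to the checkpoint location,
        // which must live on a reliable, HDFS-compatible file system.
        StreamingQuery query = stream.writeStream()
                .format("console")
                .outputMode("append")
                .option("checkpointLocation", "hdfs://namenode:8020/checkpoints/rate-demo")
                .start();

        query.awaitTermination();
    }
}

Restarting the application with the same checkpointLocation makes it resume from the last committed offsets rather than reprocessing everything, which is the recovery behaviour the quote describes; the space used grows with the offsets and intermediate state kept there.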

How to perform Greenplum 6.x backup and recovery

I am using Greenplum 6.x and facing issues while performing backup and recovery. Do we have any tool to take a physical backup of the whole cluster, like pgBackRest for Postgres? Further, how can we purge the WAL of the master and each segment, given that we can't take a pg_basebackup of the whole cluster?
Are you using open source Greenplum 6 or a paid version? If paid, you can download the gpbackup/gprestore parallel backup utility (separate from the database software itself), which will back up the whole cluster with a wide variety of options. If using open source, your options are pretty much limited to pg_dump/pg_dumpall.
There is no way to purge the WAL logs that I am aware of. In Greenplum 6, the WAL logs are used to keep all the individual Postgres engines in sync throughout the cluster. You would not want to purge these individually.
Jim McCann
VMware Tanzu Data Engineer
I would like to better understand the issues you are facing when performing your backup and recovery.
For open source users of the Greenplum Database, the gpbackup/gprestore utilities can be downloaded from the Releases page on the GitHub repo:
https://github.com/greenplum-db/gpbackup/releases
v1.19.0 is the latest.
There currently isn't a pg_basebackup / WAL-based backup/restore solution for Greenplum Database 6.x.
WAL logs are periodically purged from the master and segments individually (as they get replicated to the mirrors and flushed), so no manual purging is required. Have you looked into why the WAL logs are not getting purged? One possible reason is that a mirror in the cluster is down; if that happens, WAL will keep accumulating on the primary and won't get purged. Run select * from pg_replication_slots; on the master or on the segment where WAL is building up to learn more.
If the cause of the WAL build-up is a replication slot (for example, because a mirror is down), you can use the GUC max_slot_wal_keep_size to configure the maximum amount of WAL it may consume; beyond that limit the replication slot is disabled and no longer consumes disk space for WAL.
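
A hedged sketch of the checks described above; the query and the GUC name come from the answer, while the concrete size value and the use of gpconfig/gpstop are illustrative assumptions and depend on your Greenplum minor version:

-- run on the master, or on the segment where WAL is accumulating
select slot_name, active, restart_lsn from pg_replication_slots;

# if a replication slot is holding WAL back, cap how much WAL it may retain (value is illustrative)
gpconfig -c max_slot_wal_keep_size -v 10GB
gpstop -u   # reload the configuration without restarting the cluster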

Which distributed database should I choose for a medium-sized data project?

We currently have a Java project with a PostgreSQL database on Spring Boot 2 with Spring Data JPA (Hibernate).
Requirements for the new architecture:
On N computers we have a workplace. Each workplace uses the same program with a different configuration (a configured client for the distributed database).
The number of computers is not big - around 10-20 PCs. The database must be scalable (a lot of data can be stored on disk, ~1-2 TB).
Every day, up to 1 million rows can be inserted into the database from a single workplace.
Each workplace works with the distributed database - meaning each node must be able to read/write data modified by the others, and make decisions at runtime based on data modified by another workplace (transactional).
The datastore (on-disk database archive) must be able to be archived and copied as a backup snapshot.
The project must be portable to the new architecture with Spring Data JPA 2 and database backups with Liquibase. It must work on Windows/Linux.
A quick overview shows me that the most popular free distributed databases right now are:
1) Redis
2) Apache Ignite
3) Hazelcast
I need help understanding how to architect the described system.
First of all, I tried to use Redis and Ignite. Redis starts easily, but it works like a simple IMDG (in-memory data grid), and I need to store all the data in a persistent database (on disk, like Ignite persistence). Is there a way to use Redis with the existing PostgreSQL database? Postgres would be synchronized with all nodes, and Redis would act as an in-memory cache holding the fresh data produced by each workplace, flushed to disk every 10 minutes.
1) Is this possible? How?
I also tried to use Ignite, but my project runs on Spring Boot 2 / Spring Data 2, and the latest released Ignite version is 2.6; Spring Data 2 support will only appear in Apache Ignite 2.7!
2) I would have to download a 2.7 nightly build, but how can I use it in my project? (Do I need to install it to my local Maven repository?)
3) And finally, what would be the best architecture in this case? A datastore provider that stores persistent data on disk, synchronized with each workplace's in-memory cache, persisting the in-memory data to disk on a timeout?
What would be the best solution, and which database should I choose?
(Maybe something that works with the existing PostgreSQL?)
Thx)
Your use case sounds like a common one with Hazelcast. You can store your data in memory (i.e. in a Hazelcast IMap) and use a MapStore/MapLoader to persist changes to your database or to read from the database. Persisting changes can be done in a write-through or write-behind manner based on your configuration. There is also Spring Boot and Spring Data JPA integration available.
Also, the amount of data you want to store is pretty big for 10-20 machines, so you might want to look into Hazelcast's High-Density Memory Store option to store large amounts of data on commodity hardware without running into GC problems.
The following links should give you a further idea:
https://opencredo.com/spring-booting-hazelcast/
https://docs.hazelcast.org//docs/3.11/manual/html-single/index.html#loading-and-storing-persistent-data
https://hazelcast.com/products/high-density-memory-store/
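
A minimal sketch of the MapStore approach described above, written against the Hazelcast 3.11 API that the linked manual covers; the map name, the key/value types and the write-behind delay are illustrative assumptions, and the "database" here is just an in-memory map so the sketch stays self-contained (in a real setup it would delegate to your Spring Data JPA repository):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.MapStore;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

// Persists IMap entries to the backing store; swap the HashMap for JPA calls in practice.
class MeasurementStore implements MapStore<Long, String> {
    private final Map<Long, String> fakeDatabase = new HashMap<>();

    @Override public void store(Long key, String value) { fakeDatabase.put(key, value); }
    @Override public void storeAll(Map<Long, String> map) { fakeDatabase.putAll(map); }
    @Override public void delete(Long key) { fakeDatabase.remove(key); }
    @Override public void deleteAll(Collection<Long> keys) { keys.forEach(fakeDatabase::remove); }
    @Override public String load(Long key) { return fakeDatabase.get(key); }
    @Override public Map<Long, String> loadAll(Collection<Long> keys) {
        Map<Long, String> result = new HashMap<>();
        for (Long k : keys) {
            String v = fakeDatabase.get(k);
            if (v != null) result.put(k, v);
        }
        return result;
    }
    @Override public Iterable<Long> loadAllKeys() { return fakeDatabase.keySet(); }
}

public class HazelcastWriteBehindExample {
    public static void main(String[] args) {
        MapStoreConfig mapStoreConfig = new MapStoreConfig()
                .setEnabled(true)
                .setImplementation(new MeasurementStore())
                .setWriteDelaySeconds(10); // > 0 means write-behind, 0 means write-through

        Config config = new Config();
        config.addMapConfig(new MapConfig("measurements").setMapStoreConfig(mapStoreConfig));

        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
        hz.getMap("measurements").put(1L, "row produced by workplace A");
    }
}

With a write delay greater than zero, Hazelcast batches changes and writes them behind; setting the delay to zero gives write-through semantics, matching the two modes mentioned in the answer.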
Ignite is not suitable for these options because it supports only JPA 1.
Redis doesn't support SQL queries.
Our choice is a plain PostgreSQL master with slave replication. Maybe CockroachDB applies as well.
Thx for help))

How to build a Vertica slave?

I have a Vertica instance running on our prod. Currently, we are taking regular backups of the database. I want to build a master/slave configuration for Vertica so that I always have the latest backup in case something goes wrong. I tried to Google this but did not find much on the topic. Your help will be much appreciated.
There is no concept of a Master/Slave in Vertica. It seems that you are after a DR solution which would give you a standby instance if your primary goes down.
The standard practice with Vertica is to use a dual load solution which streams data into your primary and DR instances. The option you're currently using would require an identical standby system and take time to restore from your backup. Your other option is to do storage replication which is more expensive.
Take a look at the best practices for disaster recovery in the documentation.
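
As a hedged sketch of the dual-load idea above, the same batch load can simply be issued against both clusters; the host names, database, table and file path below are placeholders, not details from the question:

# load the same file into the primary and the DR cluster
vsql -h primary-vertica -d mydb -c "COPY events FROM LOCAL '/data/events_batch.csv' DELIMITER ','"
vsql -h dr-vertica -d mydb -c "COPY events FROM LOCAL '/data/events_batch.csv' DELIMITER ','"

If one load fails, the two clusters diverge until the batch is replayed, which is why dual load is usually driven by an ETL tool or scheduler that tracks per-cluster success.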

Hadoop logging facility?

If I were to use ZooKeeper as a work queue and connect individual consumers/workers to it, what would you recommend as a good distributed setup for logging these workers' activities?
Assume the following:
1) At any time we could be down to a single computer housing the Hadoop cluster. The system will autoscale up and down as needed but has a lot of downtime where only a single computer is needed.
2) I just need the ability to access all of the workers' logs without accessing the individual machine a worker is located on. Bear in mind that by the time I get to read one of these logs, that machine might very well be terminated and long gone.
3) We'll need easy access to the logs, i.e. being able to cat/grep and tail them, or alternatively query them in a more SQL-ish manner - we'll need the ability to both query and monitor output for short periods in real time (e.g. tail -f /var/log/mylog.1).
I appreciate your expert ideas here!
Thanks.
Have you looked at using Flume, Chukwa or Scribe? Ensure that your Flume (etc.) process has access to the log files that you are trying to aggregate onto a centralized server.
flume reference:
http://archive.cloudera.com/cdh/3/flume/Cookbook/
chukwa:
http://incubator.apache.org/chukwa/docs/r0.4.0/admin.html
scribe:
https://github.com/facebook/scribe/wiki/_pages
hope it helps.
The Fluentd log collector just released its WebHDFS plugin, which allows users to stream data into HDFS instantly. It's really easy to install and manage.
Fluentd + Hadoop: Instant Big Data Collection
Of course, you can import data directly from your applications. Here's a Java example of posting logs to Fluentd. Fluentd's Java library is clever enough to buffer locally when the Fluentd daemon is down, which lessens the possibility of data loss.
Fluentd: Data Import from Java Applications
A high-availability configuration is also available, which basically enables you to have a centralized log aggregation system.
Fluentd: High Availability Configuration
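
The Java example itself is behind the link above; a minimal sketch with the fluent-logger-java library might look like the following, where the tag, host, port and event fields are assumptions for illustration:

import org.fluentd.logger.FluentLogger;

import java.util.HashMap;
import java.util.Map;

public class WorkerActivityLogger {
    // Tag prefix "worker", sending to a local Fluentd daemon on the default forward port.
    private static final FluentLogger LOG = FluentLogger.getLogger("worker", "localhost", 24224);

    public static void main(String[] args) {
        Map<String, Object> event = new HashMap<>();
        event.put("task_id", "task-42");
        event.put("status", "finished");
        event.put("duration_ms", 1234);

        // Emitted with tag "worker.activity"; the library buffers locally if the daemon is down.
        LOG.log("activity", event);
    }
}

On the Fluentd side, a match rule for these tags could route the events to the WebHDFS output mentioned above, giving the centralized, tail-able log store the question asks for.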
