I'm new to Pentaho. In my local environment I use Pentaho's Spoon to develop ETL, and on our server we run Kettle as the Pentaho server. I often run into OutOfMemory errors, both locally and on the server.
On the Kettle server we have 4 GB of memory. The server runs imports daily; each day we import more than 100k records in 10 separate imports. We have figured out that when an import finishes, the memory that import used is kept and never released.
My workaround is to restart the Kettle server before each import, and to restart Spoon in my local environment. The image below shows the memory analysis.
We have a Databricks job that has suddenly started to fail consistently. Sometimes it runs for an hour, other times it fails after a few minutes.
The inner exception is
ERROR MicroBatchExecution: Query [id = xyz, runId = abc] terminated with error
shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: Could not verify copy source.
The job runs a notebook which consumes from an event hub with PySpark Structured Streaming, calculates some values based on the data, and streams the results back to another event hub topic.
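For context, a stripped-down sketch of what the notebook does, assuming the azure-eventhubs-spark connector; the connection strings, storage account name and the actual calculation below are placeholders:

# Simplified sketch of the streaming notebook (runs on Databricks, so spark/sc exist).
from pyspark.sql.functions import col, to_json, struct

input_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<input connection string>")
}
output_conf = {
    "eventhubs.connectionString":
        sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt("<output connection string>")
}

raw = spark.readStream.format("eventhubs").options(**input_conf).load()

# Decode the body and derive some values (the real calculation is more involved).
parsed = raw.select(col("body").cast("string").alias("value"))
result = parsed.select(to_json(struct(col("value"))).alias("body"))

query = (result.writeStream
         .format("eventhubs")
         .options(**output_conf)
         .option("checkpointLocation",
                 "wasbs://checkpoints@<standard storage account>.blob.core.windows.net/job1")
         .start())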
The cluster is a pool with 2 workers and 1 driver running on standard Databricks 9.1 ML.
We've tried restarting the job many times, also with clean input data and a clean checkpoint location.
We are struggling to determine what is causing this error.
We cannot see any 403 Forbidden errors in the logs, which is sometimes mentioned on forums as a cause.
Any assistance is greatly appreciated.
The issue was resolved by moving the checkpoint location (used internally by Spark) from standard storage to premium storage. I don't know why it suddenly started failing after months of running with hardly a hiccup.
Premium storage is probably a better place for checkpointing anyway, since I/O is cheaper there.
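Concretely, the only change was the checkpointLocation option on the writeStream (reusing the names from the sketch in the question; the account and container names are placeholders):

# Same writeStream as before; only the checkpoint URI changed.
(result.writeStream
    .format("eventhubs")
    .options(**output_conf)
    .option("checkpointLocation",
            "wasbs://checkpoints@<premium storage account>.blob.core.windows.net/job1")
    .start())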
I have a Tableau data extract that refreshes on a schedule. Our Tableau Production Server is on-premise, and when I run this extract on the Tableau Production Server, it takes forever to finish. There is another server VM (let's call it VM 'X') which is in the same location as the Tableau Production Server. When I run the extract from that machine, it finishes in 10 minutes. Our data lake is in Oracle Exadata.
What I have tried so far:
I ran a trace route, which didn't help much.
I thought it might be a traffic issue, since the Tableau Production Server is usually pretty busy. So I went to our Tableau Dev Server VM, which is in the same location, and ran the extract there. It takes the same time as on the production server.
Any ideas on what else I can try before reaching out to our networking team?
So I'm on a Windows workstation running a Python script for GIS processing of very large .tif files. There is a Linux server whose processing power I want to use. I've SSH'ed into the server (netmiko) and set up pathos multiprocessing to run on the node. It works great on small projects, but when I scaled it up, it crashed due to memory allocation on the workstation.
I realized that the workstation was trying to load everything into memory.
I have mapped the working .tif directory on the Ubuntu server.
How do I build and store the file paths relative to the server in Python, bypassing the workstation's file directories, and reference the objects relative to the worker node?
Currently I'm looking into Celery with RabbitMQ.
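To illustrate, a rough sketch of what I think it should look like: only path strings, as the Linux node sees them, get passed around, and the workers open the rasters themselves. The directory, worker count and processing function are placeholders for my actual setup.

# Sketch: pass server-side path strings to the workers; never load rasters on the workstation.
import glob
import os
from pathos.pools import ProcessPool

SERVER_TIF_DIR = "/mnt/gis_data/tifs"   # path as the Linux server sees it (placeholder)

def process_tif(server_path):
    # Runs in a worker process on the server; the real version would open the
    # file locally there (e.g. with rasterio/GDAL) and write results back to disk.
    return server_path, os.path.getsize(server_path)

# Collect only the paths, not the file contents.
tif_paths = sorted(glob.glob(os.path.join(SERVER_TIF_DIR, "*.tif")))

pool = ProcessPool(nodes=8)             # worker count is a placeholder
results = pool.map(process_tif, tif_paths)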
Well, I talked to my networking guys and they built me a cluster to code on directly, woot woot. I think gRPC could have handled this too.
We have a Hadoop cluster with a very old PostgreSQL database (9.2) storing cluster metadata.
Is it possible to replace it with a more up-to-date version? I am concerned about breaking the cluster; what should I consider?
@Luis Sisamon
My recommendation would be to dump the 9.2 database and import it into the version of your choice. Assuming the import completes without any errors, you should be able to move from the old database to the new one. If you have concerns, I would test this on a dev cluster first before trying it on the live/prod system.
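For example, something roughly like the following, run from a host that can reach both database servers, preferably using the newer version's client tools; the host names and database name are placeholders, and the Hadoop services that use the database should be stopped while you switch over:

pg_dump -Fc -h old-db-host -U postgres metadata_db > metadata_db.dump
createdb -h new-db-host -U postgres metadata_db
pg_restore -h new-db-host -U postgres -d metadata_db metadata_db.dump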
I have the following use case:
We have several SQL databases in different locations and we need to load some data from them into HDFS.
The problem is that we do not have access to those servers from our Hadoop cluster (due to security concerns), but we can push data to our cluster.
Is there any tool like Apache Sqoop to do such bulk loading?
Dump the data from your SQL databases as files in some delimited format, for instance CSV, then do a simple hadoop put to copy all the files to HDFS.
That's it.
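If you want to script the second step, a minimal sketch (assuming the CSV files have already been exported to a local directory on an edge node with the hadoop client configured; both paths are placeholders):

# Push previously exported CSV files into HDFS using the hadoop CLI.
import glob
import subprocess

LOCAL_EXPORT_DIR = "/data/exports"        # local staging directory (placeholder)
HDFS_LANDING_DIR = "/user/etl/landing"    # HDFS target directory (placeholder)

# Create the target directory once, then upload every exported file.
subprocess.run(["hadoop", "fs", "-mkdir", "-p", HDFS_LANDING_DIR], check=True)
for path in glob.glob(LOCAL_EXPORT_DIR + "/*.csv"):
    subprocess.run(["hadoop", "fs", "-put", "-f", path, HDFS_LANDING_DIR], check=True)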
Let us assume I am working at a small company with a 30-node cluster that processes 100 GB of data daily. This data comes from different sources, such as RDBMSs like Oracle, MySQL, IBM Netezza, DB2, etc. We do not need to install Sqoop on all 30 nodes; the minimum number of nodes Sqoop has to be installed on is 1. After installing it on one machine, we can reach those source databases and import the data using Sqoop.
As far as security is concerned, no import can be done until the administrator runs the following two commands:
mysql> GRANT ALL PRIVILEGES ON mydb.table TO ''@'<IP address of the Sqoop machine>';
mysql> GRANT ALL PRIVILEGES ON mydb.table TO '%'@'<IP address of the Sqoop machine>';
These two commands have to be run by the admin.
Then we can use our sqoop import commands, etc.
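For reference, a typical import then looks something like this (the host, credentials, table and target directory are placeholders):

sqoop import --connect jdbc:mysql://<db host>/mydb --username <user> --password <password> --table <table> --target-dir /user/hadoop/<table> -m 4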