Aerospike 3.7.4 does not see changes in UDF file - user-defined-functions

We have an Aerospike cluster with 3 nodes (v. 3.7.4).
We encountered a problem with a UDF. We registered a new version of the UDF file, but when we query we get the same results as before.
In aql, show modules shows that the hash of the file has changed. For testing purposes I added a new function to the same file, but when I call it from aql with aggregate it reports that the function is not found. We removed the module, dropped the cache on each node, and registered the module again; nothing seems to solve the problem. Here is the screen:
asinfo -v build
3.7.4
aql> aggregate invoices_udf.filter_by_merchant('FOOD-CULTURA-ESENTAI') on dareco.invoices
2017-01-06 15:13:27 ERROR Lua Runtime Error: function not found
Error: (100) UDF: Execution Error 2 : function not found
aql> show modules
+--------------------------------------------+--------------------+-------+
| hash                                       | module             | type  |
+--------------------------------------------+--------------------+-------+
| "18092674658a4edb345acd1e44941755ab962db1" | "statistics.lua"   | "lua" |
| "95357d821687372af7d9d8d3f8cc591df5ccfee3" | "invoices_udf.lua" | "lua" |
+--------------------------------------------+--------------------+-------+
We also tried reloading the cluster and turning off the UDF cache. Nothing helped.

For aggregations, your UDF has to be loaded on both the server nodes and the client node. AQL loads it on the server nodes, which is why it works fine for record UDFs. For aggregations (stream UDFs), register the UDF through the client API so it is also loaded on the client node. In addition, in your connection configuration, identify the Lua user_path for the client node, make sure the client user ID has write access to that path, and make sure the system path for the Aerospike Lua modules loaded on the client node is also correctly identified, especially if you are using the Node.js or Python clients. For details, see the discussion at: https://discuss.aerospike.com/t/aerospikeerror-udf-execution-error-1/3801/13
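For illustration, a minimal, hedged sketch with the Aerospike Python client; the host address, Lua paths, and file location below are assumptions, not values taken from the question:
import aerospike

# Client-side Lua paths (assumed values): user_path must be writable by the
# client user; system_path points at the client's bundled Aerospike Lua modules.
config = {
    'hosts': [('127.0.0.1', 3000)],
    'lua': {
        'user_path': '/home/user/aerospike/lua',
        'system_path': '/usr/local/aerospike/lua'
    }
}
client = aerospike.client(config).connect()

# Register (or re-register) the module on the server through the client API
client.udf_put('/home/user/aerospike/lua/invoices_udf.lua')

# Run the stream UDF; the client also needs the module available under user_path
query = client.query('dareco', 'invoices')
query.apply('invoices_udf', 'filter_by_merchant', ['FOOD-CULTURA-ESENTAI'])
results = query.results()
client.close()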
(Posting this because restarting the cluster or rebuilding SIs is probably not related to your original aggregation issue.)

We solved the problem by backing up the data, restarting the cluster, importing the data from the backup, recreating all the indexes we had, and deleting old unnecessary data.

Related

Couchbase query error 5000 (open C:\Couchbase\Server\var\lib\couchbase\tmp\scan-results5960831968761: The system cannot find the path specified)

We have a 3-node Couchbase Server 6.0.2 EE cluster on Windows 2016. We fire N1QL queries against a bucket using the Go SDK. Every third query execution generates this error:
[5000] open D:\\Couchbase\\Server\\var\\lib\\couchbase\\tmp\\scan-results5960831968761: The system cannot find the path specified.
We tried restarting/killing the query job on all three nodes, but that didn't solve the issue; Couchbase still insisted on finding the scan-results[\d]+ files.
We didn't find anything in the Couchbase public forums.
Any ideas?
We solved the problem by creating an empty tmp folder in var\lib\couchbase on all nodes. After that, the query error above didn't occur again.

NiFi: Could not find Process Group with ID

After installing NiFi, I am trying to create a flow to test HDFS-NiFi connectivity, but I am getting the following error continuously for every click on the dashboard.
I am root, so I have complete access to the components.
Did you copy a flow.xml.gz file from an existing instance of NiFi, or do you have controller services, reporting tasks, or other components which reference a group that no longer exists?
Try searching for the UUID using the search bar in the top right, or shut down NiFi and use the following terminal commands to look for any references to this process group ID (double-check or copy/paste the UUID, because I typed it from looking at your screenshot):
cd $NIFI_HOME
gunzip -k conf/flow.xml.gz
grep '1861df7a-0168-1000-3931-028d9eb92cbd' conf/flow.xml
You should remove the referenced process group. You can back up the flow.xml.gz first if you are concerned about data loss (this would be the flow definition, not any flowfiles, content, or provenance data).

How do you use s3a with spark 2.1.0 on aws us-east-2?

Background
I have been working on getting a flexible setup for myself to use spark on aws with docker swarm mode. The docker image I have been using is configured to use the latest spark, which at the time is 2.1.0 with Hadoop 2.7.3, and is available at jupyter/pyspark-notebook.
This is working, and I have just been going through to test out the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol rather than the s3n protocol.
I finally came across the hadoop aws guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
Being on us-east-2 put me far enough off the standard configuration that I was uncertain whether I had a problem with the jar files. To eliminate the region issue, I set things back up on the regular us-east-1 region, and I was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.
Question
What is the correct way to use the configuration variables for hadoop in spark to use us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the notebook console that these packages download after creating the context, and adding them took me from being completely broken to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf().setMaster('local[1]')
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the AWS config, I tried both the method below and, alternatively, just using the conf object above with a conf.set('spark.hadoop.fs.<config_string>', '<config_value>') pattern equivalent to what I do below, except that done this way the values are set on conf before creating the Spark context (see the sketch after the snippet below).
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note is that I also tried an alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. To me, it seems like the endpoint config isn't being used for some reason.
us-east-2 is a V4-auth S3 region, so, as you attempted, the fs.s3a.endpoint value must be set.
If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.
Some tactics
set the value in spark-defaults
using the config you've just created, try to explicitly load the filesystem: a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf) will return the filesystem with that config (unless it's already there). I don't know how to make that call in .py though; see the sketch below for one possible route.
set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get call
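A hedged, untested sketch of how that might look from PySpark via the py4j gateway (the bucket name is just a placeholder):
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl.disable.cache", "true")   # bypass the per-URI filesystem cache
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
# Force a FileSystem instance to be created with the current configuration
jvm = sc._jvm
uri = jvm.java.net.URI("s3a://bucket-name/")
fs = jvm.org.apache.hadoop.fs.FileSystem.get(uri, hadoop_conf)
print(fs.getUri())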
Adding more diagnostics on bad-auth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I could review it and get it in.

This Distributed Cache host may cause cache reliability problems after SharePoint servers removed

I recently removed 2 SharePoint servers from a 4-server farm, and I get the following errors:
This Distributed Cache host may cause cache reliability problems;
More Cache hosts are running in this deployment than are registered with SharePoint.
Both errors are referring to the two removed servers.
The cache cluster shows only the 2 remaining servers as cache hosts.
I re-provisioned Distributed Cache, but I still get the error.
Also tried everything listed here.
Any thoughts?
Taken from here: http://alstechtips.blogspot.com/2015/02/sharepoint-2013-this-distributed-cache.html
First, get the ID of the Distributed Cache Service Instance (with PowerShell).
Make sure you edit the following command to add your WFE server name.
Get-SPServiceInstance | Where-Object {$_.Server -Like "*<yourWFE>*"} | Select-Object TypeName, ID, Status | Sort-Object TypeName
Look for Distributed Cache in this listing, and copy its ID, then edit this command to include the ID:
(Get-SPServiceInstance <yourWFE-ID>).delete()
Then finally:
Remove-SPDistributedCacheServiceInstance
Reanalysing the alert should show it disappear (fixed).
It's then up to you to decide whether you want to deploy another Distributed Cache Service Instance.
I removed the errors from SharePoint's "Review problems and solutions" and ran the rules again. The errors did not show up the second time.

Golang file and folder replication / mirroring across multiple servers

Consider this scenario. In a load-balanced environment, I have 3 separate instances of a CMS running on 3 different physical servers. These 3 separate running instances of the application share the same database.
On each server, the CMS has a /media folder where all media subfolders and files reside. My question is how I'd implement/code a file replication service/functionality in Golang, so that when a subfolder or file is added/changed/deleted on one of the servers, it gets copied/replicated/deleted on all the other servers?
What packages would I need to look in to, or perhaps you have a small code snippet to help me get started? That would be awesome.
Edit:
This question has been marked as "duplicate", but it is not. It is, however, an alternative to setting up a shared network file system. I'm thinking that keeping a copy of the same files on all servers, synchronized and kept up to date, might be better than sharing them.
You probably shouldn't do this. Use a distributed file system, object storage (à la S3 or GCS), or a syncing program like btsync or syncthing.
If you still want to do this yourself, it will be challenging. You are basically building a distributed database and they are difficult to get right.
At first blush you could check out something like etcd or raft, but unfortunately etcd doesn't work well with large files.
You could, on upload, also copy the file to every other server using ssh. But then what happens when a server goes down? Or what happens when two people update the same file at the same time?
Maybe you could design it such that every file gets a unique id (perhaps based on the hash of its contents so you can safely dedupe) and those files can never be updated or deleted, only added. That would solve the simultaneous update problem, but you'd still have the downtime problem.
One approach would be for each server to maintain an append-only version log when a file is added:
VERSION | FILE HASH
1 | abcd123
2 | efgh456
3 | ijkl789
With that, you can pull every file from a server, and a single number is sufficient to know when a file has been added. (For example, if you think Server A is on version 5 and you are told it is now on version 7, you know you need to sync 2 files.)
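A minimal sketch of that version-log idea, shown here in Python for brevity (the structure ports directly to Go; the hashing choice and names are just illustrative):
import hashlib

class VersionLog:
    # Append-only log: a monotonically increasing version number per file hash
    def __init__(self):
        self.entries = []  # index i holds the file hash added at version i + 1

    def add(self, content: bytes) -> int:
        digest = hashlib.sha1(content).hexdigest()  # content-addressed id, safe to dedupe
        self.entries.append(digest)
        return len(self.entries)  # the new version number

    def since(self, known_version: int):
        # Hashes a peer still needs if it has only seen `known_version`
        return self.entries[known_version:]

log = VersionLog()
for blob in (b"first file", b"second file", b"third file"):
    log.add(blob)
print(log.since(1))  # a peer on version 1 must fetch the two newer file hashes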
You could do this with a database table:
ID | LOCAL_SERVER_ID | REMOTE_SERVER_ID | VERSION | FILE HASH
You could periodically poll that table and do your syncing via SSH or HTTP between machines. If a server was down, you could just retry until it works.
Or, if you didn't want a centralized database for this, you could use a library like memberlist. The local metadata for each node could be its version.
Either way, there will be some delay between when a file is uploaded to a single server and when it's available on all of them. Handling that well is hard, which is why you probably shouldn't do this.
