I am using fs (the Node.js module) to manage files and to get each file's creation time (birthtime). It works absolutely fine when I run the app on my local machine, but when I run the same code against EFS from a Node.js Lambda function, it returns 1970-01-01T00:00:00.000Z, which is not the actual creation time of the file.
const fs = require("fs");
const path = require("path");

var efsDirectory = "/mnt/data/";
var filePath = path.join(efsDirectory, file); // `file` holds the file name being processed
console.log("This file is going to be executed :", file);

var response = fs.statSync(filePath);   // stat the file on the EFS mount
let fileBirthTime = response.birthtime; // creation time as reported by the filesystem
console.log("File path is : ", filePath);
After joining the path, my file path looks like this: filePath = /mnt/data/172.807056.json, which is the actual path of the file.
In the CloudWatch logs the birthtime comes back as the Unix epoch (1970-01-01T00:00:00.000Z).
On my local machine it works fine and gives the actual file birthtime. Can anyone tell me why I am getting this?
I posted the same question on AWS re:Post, and an engineer responded with the following answer. I am pasting it here in case someone else is facing the same problem.
You are getting this result because birthtime is not supported on most NFS filesystems, including EFS. Even on Linux it depends on the kernel and the type of file system whether this field is supported. The default file system on Amazon Linux 2 on EBS doesn't return a value for birthtime, whereas the latest Ubuntu image does support it. This is why you see a difference between running the code locally and against EFS.
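If you still need a usable timestamp on EFS, one workaround is to fall back to mtime (last modification time), which NFS does report. A minimal sketch, assuming the same filePath as above (the epoch check is just illustrative):

const fs = require("fs");

const stats = fs.statSync(filePath);
// On filesystems that don't support birthtime (e.g. EFS/NFS), it comes back as the Unix epoch.
const bestCreationTime = stats.birthtimeMs > 0 ? stats.birthtime : stats.mtime;
console.log("Best available creation time:", bestCreationTime);

Note that mtime changes whenever the file is rewritten, so this is only a reasonable stand-in for files that are written once.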
I'm extremely new to k6 + InfluxDB + Grafana, and I was given a task to execute certain k6 scripts locally but save/push the resulting data to a remote InfluxDB instance.
Right now I'm stuck, because I'm not sure what configuration I'm missing: every time I run the script pointing at the InfluxDB instance, I get an error.
The command that I'm executing is:
k6 run --out influxdb="https://my_influxdb_url/write" //sampleScript.js
But the original URL that was handed over to me was something like this:
https://my_influxdb_url/write?db=DB_NAME&u=USERNAME&p=PASSWORD
When I execute the first command above, I get the following error:
ERRO[000X] Couldn't write stats error="404 page not found\n" output=InfluxDB1
So I've tried creating K6_INFLUXDB_USERNAME and K6_INFLUXDB_PASSWORD as environment variables but I'm still getting the same error.
I'm not sure if I'm missing some .yaml file, like a datasource definition, in which I should fill in those three values (DB_NAME, USERNAME, PASSWORD)?
Or maybe I'm just doing it all wrong and not calling the execution command properly for this scenario.
Another weird thing I noticed is that the output field says InfluxDB1 instead of my actual InfluxDB URL, which I guess might be where my issue lies.
Any kind of tip would be greatly appreciated, since the documentation I've found so far always assumes either a Docker container running Grafana + InfluxDB or a purely local setup, which is not my case :(
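For reference, from what I can tell in the k6 docs, the InfluxDB v1 output expects the database name as the URL path (rather than a ?db= query parameter) and reads the credentials from environment variables, so the shape I plan to try next looks like this (DB_NAME, USERNAME and PASSWORD are placeholders):

export K6_INFLUXDB_USERNAME=USERNAME
export K6_INFLUXDB_PASSWORD=PASSWORD
k6 run --out influxdb=https://my_influxdb_url/DB_NAME sampleScript.js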
Thanks a lot in advance as always!!
Background
I have been working on a flexible setup for myself to use Spark on AWS with Docker swarm mode. The Docker image I have been using is configured to use the latest Spark, which at the time of writing is 2.1.0 with Hadoop 2.7.3, and is available as jupyter/pyspark-notebook.
This is working, and I have been testing out the various connectivity paths that I plan to use. The issue I came across is uncertainty around the correct way to interact with S3. I have followed the trail on how to provide the dependencies for Spark to connect to data on AWS S3 using the s3a protocol rather than the s3n protocol.
I finally came across the Hadoop-AWS guide and thought I was following how to provide the configuration. However, I was still receiving the 400 Bad Request error, as seen in this question that describes how to fix it by defining the endpoint, which I had already done.
Being on us-east-2 put me far enough off the standard configuration that I wasn't sure whether the problem was with the JAR files. To eliminate the region as a variable, I set things back up in the regular us-east-1 region, and there I was finally able to connect with s3a. So I have narrowed the problem down to the region, but I thought I was doing everything required to operate in the other region.
Question
What is the correct way to set the Hadoop configuration variables in Spark so that it works against us-east-2?
Note: This example uses local execution mode to simplify things.
import os
import pyspark
I can see in the notebook console that these packages download after creating the context, and adding them took me from being completely broken to getting the Bad Request error.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
conf = pyspark.SparkConf().setMaster('local[1]')  # SparkConf's first positional argument is loadDefaults, not a master URL
sc = pyspark.SparkContext(conf=conf)
sql = pyspark.SQLContext(sc)
For the AWS config, I tried both the method below and the equivalent conf.set("spark.hadoop.fs.<config_string>", <config_value>) pattern on the conf above; the only difference is that done that way I set the values on conf before creating the Spark context.
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
One thing to note is that I also tried an alternative endpoint for us-east-2, s3-us-east-2.amazonaws.com.
I then read some parquet data off of s3.
df = sql.read.parquet('s3a://bucket-name/parquet-data-name')
df.limit(10).toPandas()
Again, after moving the EC2 instance to us-east-1 and commenting out the endpoint config, the above works for me. It seems as though the endpoint config isn't being used for some reason.
us-east-2 is a V4-auth-only S3 region, so, as you attempted, the fs.s3a.endpoint value must be set.
If it's not being picked up, then assume the config you are setting isn't the one being used to access the bucket. Know that Hadoop caches filesystem instances by URI, even when the config changes. The first attempt to access a filesystem fixes the config, even when it's lacking auth details.
Some tactics
Set the value in spark-defaults.conf.
Using the config you've just created, try to explicitly load the filesystem: a call to FileSystem.get(new URI("s3a://bucket-name/parquet-data-name"), myConf) will return the filesystem bound to that config (unless it's already cached). I don't know how to make that call from .py, though.
Set the property "fs.s3a.impl.disable.cache" to true to bypass the cache before the get call (a PySpark sketch of the conf-based approach is below).
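For the Python side, here is a minimal sketch of the first and third tactics combined: set the endpoint and the cache-disable flag as spark.hadoop.* properties on the conf before the context is created (access_id and access_key as in the question; the endpoint is the one already being used):

import pyspark

conf = (pyspark.SparkConf()
        .setMaster('local[1]')
        .set("spark.hadoop.fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
        .set("spark.hadoop.fs.s3a.access.key", access_id)
        .set("spark.hadoop.fs.s3a.secret.key", access_key)
        # bypass the cached filesystem instance so the endpoint above is actually used
        .set("spark.hadoop.fs.s3a.impl.disable.cache", "true"))

sc = pyspark.SparkContext(conf=conf)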
Adding more diagnostics on BadAuth errors, along with a wiki page, is a feature listed for S3A phase III. If you were to add it, along with a test, I could review it and get it in.
In a lot of systems, when you add a file to a directory where a file with the same name already exists, the new file gets something appended to the end of its name, for example a unique, incrementing number.
So, let's say the Hadoop system I'm connecting to has a folder called "/input", and there's already a file there called "sample.txt", so the full path is "/input/sample.txt". If I tried to create a new file at "/input/sample.txt", it would be saved as "/input/sample1.txt" or something like that, and then there would be two files in that directory: sample.txt and sample1.txt.
I'm new to Hadoop; my company has me building an interface that will allow our application to connect to Hadoop systems. I've got some simple client code working nicely, but I don't see anything in the API about how to do this. It's a behavior that other components of our product have, and while it's not strictly necessary, I would like to provide it for consistency's sake.
Thanks in advance.
PS. The client code I'm working on is in Java and uses Apache's Hadoop Client library.
Use the FileSystem exists API and change the file name to your needs: add an incrementing suffix, or whatever you prefer.
Sample Java code to do that:
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://namenode:9000");
FileSystem fs = FileSystem.get(conf);
Path path = new Path("/input/sample.txt");
if (fs.exists(path)) {
    // the file is already there - pick a different name before writing
}
https://hadoop.apache.org/docs/r2.6.1/api/org/apache/hadoop/fs/FileSystem.html
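A rough sketch of the incrementing-suffix idea built on the snippet above (HDFS does not do this for you; the namenode address, class name and naming scheme are just placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UniqueHdfsName {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // Try sample.txt, then sample1.txt, sample2.txt, ... until a free name is found.
        Path candidate = new Path("/input/sample.txt");
        int suffix = 1;
        while (fs.exists(candidate)) {
            candidate = new Path("/input/sample" + suffix++ + ".txt");
        }
        // candidate now points at a path that does not exist yet; write to it as usual.
        System.out.println("Writing to " + candidate);
    }
}

Note that an exists-then-create sequence is not atomic, so two concurrent writers could still collide; for a stricter guarantee you would create the file with overwrite set to false and handle the resulting exception.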
After following this simple tutorial http://www.louisaslett.com/RStudio_AMI/ and video guide http://www.louisaslett.com/RStudio_AMI/video_guide.html I have set up an RStudio environment on EC2.
The only problem is, I can't upload large files (> 1GB).
I can upload small files just fine.
When I try to upload a file via RStudio, it gives me the following error:
Unexpected empty response from server
Does anyone know how I can upload these large files for use in RStudio? This is the whole reason I am using EC2 in the first place (to work with big data).
OK, so I had the same problem myself and it was incredibly frustrating, but eventually I realised what was going on. The default home directory size for an AWS instance is less than 8-10 GB regardless of the size of your instance. Because RStudio was trying to upload into the home directory, there was not enough room. An experienced Linux user would not have fallen into this trap, but hopefully any other Windows users new to this who come across the problem will see this. If you upload to a different volume on the instance, the problem goes away. Since the Louis Aslett RStudio AMI lives in this 8-10 GB space, you will have to set your working directory outside the home directory, which is not intuitively apparent from the RStudio Server interface. While this is an advanced forum and this is a rookie error, I am hoping no one deletes this question, as I spent months on this and I think someone else will too. I hope this makes sense to you.
Don't you have shell access to your Amazon server? Don't rely on RStudio's upload (which may reasonably have a 2 GB limit); use proper Unix dev tools:
rsync -avz myHugeFile.dat amazonusername@my.amazon.host.ip:
run on your local PC's command line (install Cygwin or another Unix-compatibility layer). It will transfer your huge file to your Amazon server, resume from where it left off if interrupted, and compress the data for transfer too.
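If the instance only accepts key-pair logins (the usual EC2 default), the same idea with an explicit key and resumable partial transfers might look like this (the key path and user name are assumptions):

rsync -avz --partial --progress -e "ssh -i ~/keys/my-ec2-key.pem" myHugeFile.dat amazonusername@my.amazon.host.ip: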
For a Windows GUI for something like this, WinSCP is what we used in the bad old days before Linux.
This could have something to do with your web server. Are you using nginx or Apache as your web server? If so, you can raise the upload limit. If you are running nginx as the front end of the web server, I would recommend the following fix in your nginx.conf file:
http {
...
client_max_body_size 100M;
}
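After changing client_max_body_size, nginx has to re-read its configuration; on a typical systemd-based setup something like this works (test the config first):

sudo nginx -t && sudo systemctl reload nginx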
https://www.tecmint.com/limit-file-upload-size-in-nginx/
I had a similar problem with a 5 GB file. What worked for me was to use SQLite to create a database from the CSV file that I needed, and then use a function in RStudio to communicate with the local database. In that way, I was able to bring the CSV data in. I can track down the R code that I used if you like; a rough sketch of the idea is below.
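Roughly (the table and file names here are placeholders, not my actual code): import the CSV into a SQLite database once with the sqlite3 command-line shell, then query it from RStudio instead of reading the whole CSV into memory.

# One-off import in a shell first:
#   sqlite3 bigdata.db
#   sqlite> .mode csv
#   sqlite> .import bigfile.csv mytable
# Then, from RStudio, query the database instead of loading the whole CSV:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), "bigdata.db")
preview <- dbGetQuery(con, "SELECT * FROM mytable LIMIT 10")
dbDisconnect(con)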
I am trying to stay away from another batch file, if possible. I have an SSIS package that pulls data from SQL, creates text files, and then sends those files to an external FTP site. The package works if I run it manually, either from my workstation or from the server. However, when I run it as a step in a SQL job, the step fails: the package fails to connect.
The data is pulled using a SQL task, a For Each Loop Container is used to create the text files, then a Script Task is used for the ftp portion.
For troubleshooting purposes, within the script task I write text to a log table after each action in the code, with information to help show where the error happens. I capture the user running the job, the values of certain variables, and the connection string. That is how I know the package fails when it tries to connect: I receive a false boolean value from ftp.Connect().
The package has variables for the ftp connection information and I use configuration files for those values. However, I have hardcoded the values into the code for testing and it produces the same results.
I have also tried adding a new connection to the connection manager vs. using an existing one - no difference.
I have ruled out any networking/blocking/firewall issues. The server allows the ftp connection.
I have also tried running the package manually as myself, and as the domain user that runs the SSIS package from within the job (sqlagent domain user). From both my workstation and the server, running it manually from either account works. To run it manually from my workstation, I use Visual Studio debugging. To run it manually from the server, I use "Run Package" while connected to the Integration Services instance on the server.
In the code snippet below, I copied and pasted parts of the script task; it should include all the FTP code. I didn't paste the part that gets the file names and sends the files. The error always happens at the connect portion.
What I would like is either 1) help in making it work, OR 2) help in determining how to get more of a response from ftp.Connect() - the boolean alone isn't telling me much, OR 3) an expert telling me I should just use a batch file and stop beating my head against the wall :)
I promise I have done a lot of research, and I have found many posts about how to do the FTP - but none that solve this issue. Again, the code works, so I know how to do the FTP portion itself. If there is a duplicate post that I didn't see, I apologize and will read from there instead of making someone post an answer here. :)
Supplemental Data: Server - Windows 2008 Enterprise SP2; Workstation - Windows XP SP3; SQL/SSIS: SQL Server 2008 R2
SQL job step - Type: SSIS Package; Run as: SQL Server Agent Service Account (domain account called sqlagent); Authentication: Windows Authentication
Script Task Code:
string errMsg = null;
errMsg = "user: " + System.Security.Principal.WindowsIdentity.GetCurrent().Name + "; ";
ConnectionManager cm = Dts.Connections["FTP_ABN"];
cm.ConnectionString = Dts.Variables["FTP_url"].Value.ToString() + ":21." + Dts.Variables["FTP_username"].Value.ToString();
cm.Properties["ServerName"].SetValue(cm, Dts.Variables["FTP_url"].Value.ToString());
cm.Properties["ServerPort"].SetValue(cm,"21");
cm.Properties["ServerUserName"].SetValue(cm, Dts.Variables["FTP_username"].Value.ToString());
cm.Properties["ServerPassword"].SetValue(cm, Dts.Variables["FTP_password"].Value.ToString());
FtpClientConnection ftp = new FtpClientConnection(cm.AcquireConnection(null));
bool ftpConnected = ftp.Connect(); // Connect() only reports success or failure as a boolean here
if (ftpConnected) {
errMsg += " - I connected: True; " + cm.ConnectionString.ToString() + "; ";
ftp.Close();
if (errMsg != null) { Dts.Log("YES - I am ok: " + errMsg, 0, new byte[0]); }
Dts.TaskResult = (int)ScriptResults.Success;
}
else {
errMsg += " - I connected: False; " + cm.ConnectionString.ToString() + "; ";
Dts.Log("Error: " + errMsg, 0, new byte[0]);
Dts.TaskResult = (int)ScriptResults.Failure;
}
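For item 2 above, the furthest I have gotten so far is a sketch (not yet in the package) that wraps the connect call so that any exception detail the connection surfaces ends up in the log rather than just a boolean:

try
{
    bool ok = ftp.Connect();
    Dts.Log("Connect returned: " + ok, 0, new byte[0]);
}
catch (Exception ex)
{
    // If the underlying FTP connection throws, surface the full message in the log/job history.
    Dts.Log("FTP connect exception: " + ex.ToString(), 0, new byte[0]);
    Dts.Events.FireError(0, "FTP Script Task", ex.Message, String.Empty, 0);
    Dts.TaskResult = (int)ScriptResults.Failure;
}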
I had this exact problem yesterday. The script failed when run as a scheduled task but succeeded when I ran it myself, and the script uses secure FTP. The problem for me was that, while logged in with my own account, I had manually accepted the secure FTP encryption certificate, so the job only ran successfully under my login.
I fixed it by logging in as the user that owns/runs the scheduled tasks, invoking the FTP program (Core FTP), and accepting the secure FTP certificate.
I hope that will fix it for you too!
Out of interest have you tried using an FTP Task?
I had a few issues with getting FTP working and I found using one of those tasks helped me.
If you have tried using the FTP Task, did you get any error messages?
I appreciate everyone who responded! I wanted to mark this as answered in case anyone else hits the issue too. Making the step within the SQL job run in 32-bit mode solved my issue. I am going to research why and blog about it - hopefully someone else won't have to go through all the moaning and groaning that was going on in my head :)
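For anyone mapping this to their own setup: in my case the setting was the "Use 32 bit runtime" checkbox under the job step's Execution options, which as I understand it just makes SQL Agent launch the 32-bit dtexec. Running the same package by hand with the 32-bit dtexec looks roughly like this (the paths are from my environment and will differ on yours):

"C:\Program Files (x86)\Microsoft SQL Server\100\DTS\Binn\DTExec.exe" /F "D:\Packages\MyFtpPackage.dtsx"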
I am really going to have to start answering some questions here - since you were all so helpful and responsive. Thank you again! :)