Using ACCEPTINVCHARS with a remote host - amazon-ec2

I am using a scraper and uploading my data into Redshift from EC2. I would prefer not to upload the data into S3 first. My code is in a Jupyter Notebook. However, I get the "String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: 80 (error 3)" error that a lot of other people have asked about previously. I even found a page in the Redshift documentation that walks through using a Remote Desktop. However, as I said before, I would prefer not to go through S3. Is this possible?
I am currently using psycopg2 to connect to the database. I figured it wouldn't work, but I tried just putting acceptinvchars after the database user/password line, and it said that ACCEPTINVCHARS isn't defined.

If you want to copy data to Redshift straight from your notebook, you have to compose valid INSERT statements and execute them against the existing table in Redshift (see the sketch below). However, the throughput of this approach is quite low. I don't know how much data you plan to write, but I would guess a scraper produces more data than that approach can comfortably handle. Alternatively, you can first write the output of your Python script to a file on the same EC2 instance and then use the COPY command.
More info on copying from an EC2 instance here: COPY from Remote Host (SSH)
As for your error, you likely have accented characters in your input, so you need to use LATIN1 encoding everywhere.
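For what it's worth, ACCEPTINVCHARS is an option of Redshift's COPY command, not something psycopg2 knows about, which is why Python reported it as undefined. A minimal sketch of the INSERT approach, scrubbing invalid UTF-8 bytes before they reach Redshift, might look like this (the connection details, table name, and column names are placeholders, not anything from the question):

import psycopg2

# Placeholder connection details -- substitute your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

def clean(value):
    # Replace byte sequences that are not valid UTF-8 (such as 0x80) before inserting.
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value

# Example scraped rows; the second value contains the kind of bad byte that triggers the error.
rows = [(b"example", b"caf\x80"), (b"another", b"plain ascii")]

with conn, conn.cursor() as cur:
    for row in rows:
        cur.execute(
            "INSERT INTO my_table (col1, col2) VALUES (%s, %s)",  # hypothetical table and columns
            tuple(clean(v) for v in row),
        )

Batching the statements (for example with psycopg2.extras.execute_values) speeds this up a little, but as the answer says, writing the cleaned output to a file on the instance and running COPY will still be much faster.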

Related

"There are no OCR keys; creating a new key encrypted with given password" Crashes when running Chainlink node

I am setting up a chainlink node in AWS ec2 + AWS RDS (PostgreSQL) and have followed every step in the documentation (https://docs.chain.link/docs/running-a-chainlink-node/).
Everything runs smoothly until the OCR keys creation step. Once it gets here, it shows "There are no OCR keys; creating a new key encrypted with given password". This is supposed to happen but the docker container exits right after (see image below).
[Image: output after OCR keys creation]
I have tried the following:
Checking whether there is a problem with the specific PostgreSQL table these keys are stored in (public.encrypted_ocr_key_bundles), which gets populated when this step succeeds. Nothing suspicious here so far.
Using a different version of the Chainlink Docker image (see the Chainlink Docker Hub). I am currently using version 0.10.0. No success either, even with the latest ones.
Using AWS CloudFormation to "let AWS + Chainlink" take care of this, but even so I have encountered similar problems, so no success.
I have thought about populating the OCR table manually with a query, but I am far from having proper OCR key generation knowledge or a script in hand, so I do not like this option.
Does anybody know what else to try/where the problem could be?
Thanks a lot in advance!
UPDATE: It was a simple memory problem. The AWS micro instance (1GB RAM) was running out of memory when OCR keys were generated. I only got a log of the error after switching to an updated version of the CL docker image. In conclusion: migrate to a bigger instance. Should've thought of that but learning never stops!

Trouble Uploading Large Files to RStudio using Louis Aslett's AMI on EC2

After following this simple tutorial (http://www.louisaslett.com/RStudio_AMI/) and video guide (http://www.louisaslett.com/RStudio_AMI/video_guide.html) I have set up an RStudio environment on EC2.
The only problem is, I can't upload large files (> 1GB).
I can upload small files just fine.
When I try to upload a file via RStudio, it gives me the following error:
Unexpected empty response from server
Does anyone know how I can upload these large files for use in RStudio? This is the whole reason I am using EC2 in the first place (to work with big data).
OK, so I had the same problem myself and it was incredibly frustrating, but eventually I realised what was going on. The default home directory size on AWS is only 8-10GB, regardless of the size of your instance, and since the upload was going to the home directory, there was not enough room. An experienced Linux user would not have fallen into this trap, but hopefully other Windows users who are new to this and come across the problem will see this. It can be solved by uploading to a different volume on the instance. Because the Louis Aslett RStudio AMI lives in this 8-10GB space, you will have to set your working directory outside the home directory, which is not intuitively apparent from the RStudio Server interface. Whilst this is an advanced forum and this is a rookie error, I am hoping no one deletes this question, as I spent months on this and I think someone else will too. I hope this makes sense to you.
Don't you have shell access to your Amazon server? Don't rely on RStudio's upload (which may reasonably have a 2GB limit) and use proper Unix dev tools:
rsync -avz myHugeFile.dat amazonusername@my.amazon.host.ip:
Run this on your local PC's command line (install Cygwin or another Unix compatibility layer). It will transfer your huge file to your Amazon server, resume from where it left off if interrupted, and compress the data for transfer too.
For a windows gui on something like this, WinSCP was what we used to do in the bad old days before Linux.
This could have something to do with your web server. Are you using nginx or Apache as your web server? If so, you can raise the upload size limit on your nginx server. If you are running nginx on the front end of the web server, I would recommend the following fix in your nginx.conf file:
http {
    ...
    client_max_body_size 100M;
}
https://www.tecmint.com/limit-file-upload-size-in-nginx/
I had a similar problem with a 5GB file. What worked for me was to use SQLite to create a database from the CSV file that I needed. I used SQLite commands to create the database, then used a function in RStudio to communicate with the local database. In that way, I was able to bring in the CSV file. I can track down the R code that I used if you like.

Load data onto an ec2 instance with no associated key-pair (generated by NotebookCloud)

I'm trying to run IPython Notebook in an Amazon EC2 instance (I'm using the free tier, if that makes any difference), using NotebookCloud (https://notebookcloud.appspot.com/) to handle the IPython notebook interface. However, the code I want to run in the notebook needs access to a variety of data files and supplemental Python files. When NotebookCloud generates a new EC2 instance, it doesn't assign a key pair to it, and I can't find a way to make it do so. As far as I can tell from other questions, there's no way to SSH into an instance if it doesn't have an associated key pair. Is there still some sneaky way to get data onto the instance, though?
Okay, I figured it out. I put the data on an EBS volume and attached it to the instance. Since IPython lets you send commands directly to the operating system by prefacing them with "!", it was then possible to mount the volume on the instance as specified here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-using-volumes.html (roughly as sketched below).
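For reference, the mounting step from inside a notebook cell might look roughly like the following. The device name (/dev/xvdf) and mount point (/data) are assumptions that depend on how the volume was attached, so check the output of lsblk first:

# Run these in an IPython notebook cell; the '!' prefix hands the line to the shell.
!lsblk                          # find the device name of the attached EBS volume
!sudo mkfs -t ext4 /dev/xvdf    # ONLY if the volume is brand new and has no filesystem yet
!sudo mkdir -p /data
!sudo mount /dev/xvdf /data     # /data is an arbitrary mount point
!ls /data                       # the data files should now be visible to the notebook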
Earlier I also tried just enabling x-forwarding and installing a browser on the instance and running ipython notebook in that browser. This, however, proved to be painfully slow, although that might have just been because I was using a micro instance, which I have since decided is not remotely large enough for my task.
I recommend using SSH and port forwarding instead of putting the IPython Notebook on the internet or using X forwarding (e.g. ssh -i private_key user@ip_address -L 8889:localhost:8888).
Point your browser to http://localhost:8889 for your remote IPython Notebook

Can you connect to a MS Access database from Ruby running on a Mac?

I'm pretty sure the answer is "no" but I thought I'd check.
Background:
I have some legacy data in Access, need to get it into MySQL, which will be the DB server for a Ruby application that uses this legacy data.
Data has to be processed and transformed. Access and MySQL schemas are totally different. I want to write a rake task in Ruby to do the migration.
I'm planning to use the techniques outlined in this blog post: Using Ruby and ADO to Work with Access Databases. But I could use a different technique if it solves the problem.
I'm comfortable working on Unix-like computers, such as Macs. I avoid working in Windows because it fills me with deep existential horror.
Is there a practical way that I can write and run my rake task on my Mac and have it reach across the network to the grunting Mordor that is my Windows box and delicately pluck the data out like a team of commandos rescuing a group of hostages? Or do I have to just write this and run it on Windows?
Why don't you export it from MS-Access into Excel or CSV files and then import it into a separate MySQL database? Then you can rake the new one to your heart's content.
Mac ODBC drivers that open Access databases are available for about $30.00.
http://www.actualtechnologies.com/product_access.php is one. I just run Access inside VMware on my Mac and export to CSV/Excel as CodeSlave mentioned.
ODBC might be handy in case you want to use the Access database to do a more direct transfer.
Hope that helps.
I had a similar issue where I wanted to use Ruby with SQL Server. The best solution I found was using JRuby with the Java JDBC drivers. I'm guessing this will work with Access as well, but I don't know anything about Access.

How do I verify the integrity of a Sybase dump file, without trying to load it?

Here's the scenario: a client uploads a gzipped Sybase dump file to our local FTP server. We have an automated process which picks these up and then moves them to a different server within the network where the database server resides. Unfortunately, this transfer is over a WAN, which for large files takes a long time, and sometimes our clients forget to FTP in binary mode, which results in 10GB of transfer over our WAN all for nothing, as the dump file can't be loaded at the other end. What I'd like to do is verify the integrity of the dump file on the local server before sending it out over the WAN, but I can't just try to "load" the dump file, as we don't have Sybase installed (and can't install it). Are there any tools or bits of code that I can use to do this?
There are a few things you can do from the command line. The first, on the sending side, is to generate md5sums of the files.
$ md5sum *.dmp
2bddf3cd8b04010183dd3295ce7594ff pubs_1.dmp
7510e0250c8d68bae3e0e794c211e60b pubs_2.dmp
091fe54fa5fd81d8c109cc7835d37f4a pubs_3.dmp
On the client side, they can run the same. Secondly, Sybase dumps are usually done with the compress option. If this option is used, you can also test the file integrity by uncompressing the files via the command line. This isn't as complete a check, but it will verify the CRC-32 checksum stored in the compressed file's 8-byte trailer.
$ gunzip --test *.dmp
gunzip: pubs_3.dmp: unexpected end of file
Neither of these methods validates that Sybase will be able to load the file, but they do help ensure the file isn't corrupt.
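If you would rather script these checks than run them by hand, here is a rough Python equivalent of the two ideas above (an MD5 checksum to compare against the sender's, plus a streaming decompress that exercises gzip's CRC check). The file names and chunk size below are arbitrary choices, not anything from the question:

import gzip
import hashlib
import sys

def md5_of(path, chunk=1 << 20):
    # Same check as md5sum: compare this against the value the client computed.
    h = hashlib.md5()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def gzip_ok(path, chunk=1 << 20):
    # Same idea as `gunzip --test`: read the whole stream so the trailing CRC gets verified.
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk):
                pass
        return True
    except (OSError, EOFError):
        return False

if __name__ == "__main__":
    for dump in sys.argv[1:]:  # e.g. python check_dump.py pubs_1.dmp pubs_2.dmp
        status = "gzip OK" if gzip_ok(dump) else "gzip CORRUPT"
        print(md5_of(dump), dump, status)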
There is no way to really verify the integrity of the dump file without loading it in some way via a backup server. The client should know whether the dump was successful or not from the backup log or the output produced during the dump.
But to solve your problem, you should switch to SFTP or SCP: all transfers are done in binary, which eliminates the ASCII-mode issue entirely.
Also ensure that they are using compression in the dump; a value of 1-3 is more than enough, and this should reduce your network traffic as well.
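If the client side can be scripted, an SFTP push is easy to automate and is always binary. A minimal sketch using the paramiko library (the host name, credentials, and file paths are placeholders, not values from the question):

import paramiko

# SFTP has no ASCII mode, so the binary/ASCII mix-up cannot happen.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # fine for a sketch; pin host keys in production
client.connect("ftp.example.com", username="uploader", key_filename="/home/uploader/.ssh/id_rsa")

sftp = client.open_sftp()
sftp.put("/local/dumps/pubs_1.dmp.gz", "/incoming/pubs_1.dmp.gz")  # placeholder local and remote paths
sftp.close()
client.close()

The same result can of course be had with a plain scp command; the point is simply that neither SFTP nor SCP has a text mode to get wrong.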
