Writing parquet files without using Hadoop and Apache - etl

Just wanted to ask: can I create a Parquet file without Hadoop/Apache? I'm using Talend, by the way.

Check these links:
Writing a Parquet file in Python:
https://mungingdata.com/python/writing-parquet-pandas-pyspark-koalas/
Writing a Parquet file in Java; this was answered earlier and I think you might have seen it:
create parquet files in java

Related

Spark textFileStream [duplicate]

Should the file name contain a number for textFileStream to pick it up? My program is picking up new files only if the file name contains a number, ignoring all other files even if they are new. Is there any setting I need to change to pick up all files? Please help.
No. It scans the directory for new files that appear within the window. If you are writing to S3, do a direct write with your code; the file doesn't appear until the final close(), so there is no need to rename. In contrast, if you are working with file-streaming sources against normal filesystems, you should create the file outside the scanned directory and rename it in at the end; otherwise work-in-progress files may get read. And once read, a file is never re-read.
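The write-outside-then-rename pattern described above can be sketched with the Python standard library alone (the directory and file names here are hypothetical):

```python
import os
import tempfile

watched_dir = "incoming"   # the directory the stream scans (illustrative name)
staging_dir = "staging"    # work-in-progress files live here
os.makedirs(watched_dir, exist_ok=True)
os.makedirs(staging_dir, exist_ok=True)

# 1. Write the file completely in the staging directory.
fd, tmp_path = tempfile.mkstemp(dir=staging_dir, suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write("record 1\nrecord 2\n")

# 2. Move it into the scanned directory only once it is complete,
#    so the stream never observes a half-written file.
final_path = os.path.join(watched_dir, "batch-0001.txt")
os.replace(tmp_path, final_path)
```

On a single filesystem the rename is atomic, which is exactly why this pattern is safe for directory-scanning consumers.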
After spending hours analyzing the stack trace, I figured out that the problem was the S3 address. I was providing "s3://mybucket", which worked on Spark 1.6 and Scala 2.10.5. On Spark 2.0 (and Scala 2.11), it must be provided as "s3://mybucket/". Maybe some regex-related parsing. Working fine now. Thanks for all the help.

Ruby - Write Data to Existing .xlsx Document

I have an existing xlsx file that I am trying to write data to programmatically. Is there any modern solution for this in Ruby? I also looked into the Google Sheets API, but it only offers Java and .NET clients.
I've searched quite a bit, and so far have checked out the following gems with no luck:
https://github.com/roo-rb/roo
https://github.com/randym/axlsx
https://github.com/weshatheleopard/rubyXL
https://github.com/cxn03651/write_xlsx
In the meantime, it seems my best solution is to write to CSV, then import the CSV into the xlsx file. But that's tough to do programmatically going forward.
Any help would be appreciated, thanks.

Adding support for Zip files in hadoop

Hadoop has built-in support for reading .gz compressed files; I want similar support for .zip files. I should be able to read the content of zip files with the hadoop -text command.
I am looking for an approach where I don't have to implement an InputFormat and RecordReader for zip files. I want my jobs to be completely agnostic of the input file format: they should work whether the data is zipped or unzipped, similar to how it works for .gz files.
I'm sorry to say that I only see two ways to do this from "within" Hadoop: either use a custom InputFormat and RecordReader based on ZipInputStream (which you clearly specified you were not interested in), or detect .zip input files and unzip them before launching the job.
I would personally do this from outside Hadoop, converting to gzip (or indexed LZO if I needed splittable files) via a script before running the job, but you most certainly already thought about that...
I'm also interested to see if someone can come up with an unexpected answer.
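The convert-before-the-job script mentioned above could look like this minimal Python sketch, which re-compresses each zip entry as a standalone .gz file that Hadoop reads natively (the archive and entry names are invented for the demo):

```python
import gzip
import zipfile

def zip_to_gzip(zip_path):
    """Re-compress every entry of a .zip archive as a standalone
    .gz file, a format Hadoop handles out of the box."""
    out = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):      # skip directory entries
                continue
            gz_path = name.replace("/", "_") + ".gz"
            with zf.open(name) as src, gzip.open(gz_path, "wb") as dst:
                dst.write(src.read())
            out.append(gz_path)
    return out

# Demo: build a small archive, then convert it.
with zipfile.ZipFile("demo.zip", "w") as zf:
    zf.writestr("part-0000.txt", "line 1\nline 2\n")

converted = zip_to_gzip("demo.zip")
```

Note that gzip output is not splittable, which is why the answer mentions indexed LZO as the alternative when the files are large.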

Apache spark - dealing with auto-updating inputs

I'm new to spark and using it a lot recently to do some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file that has to be processed, but this file can be periodically updated. I want the initial file to be processed, and whenever there is an update, I want Spark operations to be triggered and to operate only on the updated parts. Any approach would be helpful.
I'm open to using any other technology in combination with spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize that a file has been updated.
It does its job when reading the file for the first time, and that's all.
By default, Spark won't know that a file has been updated, and won't know which parts of the file are updates.
You should instead work with folders: Spark can run on a folder and recognize when there is a new file in it to process, e.g. sc.textFile(PATH_FOLDER)...
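The folder-based idea can be illustrated outside Spark with a plain-Python sketch of the same pattern: remember which files were already handled, and only process newcomers (the folder and file names are made up; in Spark the bookkeeping is done by the streaming source itself):

```python
import os

def process_new_files(folder, seen):
    """Read only files not handled in a previous pass.
    `seen` is the caller's persistent set of processed names."""
    lines = []
    for name in sorted(os.listdir(folder)):
        if name in seen:
            continue
        with open(os.path.join(folder, name)) as f:
            lines.extend(f.read().splitlines())
        seen.add(name)
    return lines

os.makedirs("updates", exist_ok=True)
seen = set()

with open("updates/v1.txt", "w") as f:
    f.write("first batch\n")
first = process_new_files("updates", seen)

with open("updates/v2.txt", "w") as f:
    f.write("second batch\n")
second = process_new_files("updates", seen)   # only v2.txt is new
```

In other words: have the updater drop each change as a new file into the watched folder, and each Spark (streaming) batch then naturally operates only on the delta.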

How to read from and write to the same EXCEL file using Ruby?

I want to read from and write to the same Excel file (file.xls) using Ruby. I tried the Roo gem, which doesn't allow writing to the file. Now I am using the Spreadsheet gem, but I can't update existing data in the same Excel file.
Is it possible to read from and write to the same Excel file, without changing its macros, using Ruby?
Here's a similar post that received an answer: I want to edit a wellformatted excel file with ruby
Basically, you may want to check out win32ole: http://davidsulc.com/blog/2011/03/27/using-ruby-and-win32ole-to-manipulate-excel/
I was unable to make Roo read and write from the same file. This could be because Roo creates a temporary file before it writes. Writing to a new file is cleaner, easier, and supported. I keep the data I want to read under db/data/import in my project, then write to db/data/export, and everything got much easier once I went with this approach.
