Hadoop has built-in support for reading .gz compressed files, and I want similar support for .zip files: I should be able to read the contents of zip files with the hadoop fs -text command.
I am looking for an approach where I don't have to implement an InputFormat and RecordReader for zip files. I want my jobs to be completely agnostic of the format of the input files; they should work whether the data is zipped or unzipped, just as they do for .gz files.
I'm sorry to say that I only see two ways to do this from "within" Hadoop: either use a custom InputFormat and RecordReader based on ZipInputStream (which you clearly stated you are not interested in), or detect .zip input files and unzip them before launching the job.
I would personally do this from outside Hadoop, converting to gzip (or indexed LZO if I needed splittable files) via a script before running the job, but you most certainly already thought of that...
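For what it's worth, here is a minimal sketch of that pre-conversion step in Scala against the Hadoop FileSystem API. The input/output directories are placeholders I made up, and it assumes each zip entry is a plain file you want re-compressed one-to-one as .gz:

```scala
import java.util.zip.ZipInputStream

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.hadoop.util.ReflectionUtils

object ZipToGzip {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val fs = FileSystem.get(conf)
    val codec = ReflectionUtils.newInstance(classOf[GzipCodec], conf)

    // Hypothetical input/output locations; adjust to your own layout.
    val inputDir = new Path("/data/incoming")
    val outputDir = new Path("/data/ready")

    for (status <- fs.listStatus(inputDir) if status.getPath.getName.endsWith(".zip")) {
      val zipIn = new ZipInputStream(fs.open(status.getPath))
      var entry = zipIn.getNextEntry
      while (entry != null) {
        if (!entry.isDirectory) {
          // Re-compress each archived file as .gz so plain TextInputFormat
          // (and `hadoop fs -text`) can read it transparently.
          val target = new Path(outputDir, entry.getName + ".gz")
          val gzOut = codec.createOutputStream(fs.create(target))
          IOUtils.copyBytes(zipIn, gzOut, conf, false)
          gzOut.close()
        }
        entry = zipIn.getNextEntry
      }
      zipIn.close()
    }
    fs.close()
  }
}
```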
I'm also interested to see if someone can come up with an unexpected answer.
I just want to ask whether I can create a Parquet file without Hadoop/Apache? I'm using Talend, by the way.
Check these links:
Writing a Parquet file in Python:
https://mungingdata.com/python/writing-parquet-pandas-pyspark-koalas/
Writing a Parquet file in Java; this was answered earlier and I think you might have seen it:
create parquet files in java
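For a quick self-contained illustration (not Talend-specific), here is a rough Scala sketch using parquet-avro. Note that it still needs the Hadoop client jars on the classpath, though no cluster or HDFS; the schema and output path are just examples I picked:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.hadoop.fs.Path
import org.apache.parquet.avro.AvroParquetWriter
import org.apache.parquet.hadoop.metadata.CompressionCodecName

object WriteParquet {
  def main(args: Array[String]): Unit = {
    // Hypothetical Avro schema describing the records to write.
    val schema = new Schema.Parser().parse(
      """{"type": "record", "name": "Person", "fields": [
        |  {"name": "name", "type": "string"},
        |  {"name": "age",  "type": "int"}
        |]}""".stripMargin)

    // Write to the local filesystem; no cluster involved.
    val writer = AvroParquetWriter
      .builder[GenericRecord](new Path("file:///tmp/people.parquet"))
      .withSchema(schema)
      .withCompressionCodec(CompressionCodecName.SNAPPY)
      .build()

    val record: GenericRecord = new GenericData.Record(schema)
    record.put("name", "Alice")
    record.put("age", 30)
    writer.write(record)
    writer.close()
  }
}
```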
I am doing a project about MapReduce task failures. According to Hadoop Beginner's Guide (Garry Turkington), all of the skipped data is stored in the _logs/skip/ folder. The author used Hadoop 1.0. I am working with Hadoop 2.7.4. Although I tested with skipped data, neither the output folder nor _logs/skip/ was created. Is the _logs/skip folder related to the Hadoop version? If I want to skip data in Hadoop 2.7.4, what should I do?
The short answer is no, it is not related to the Hadoop version at all.
Many temporary folders are created at execution time and removed once execution completes; these include log folders, temporary output folders, and other scratch directories.
You should not be confused by them. The only guarantee is that the job will generate an output folder with a _SUCCESS file, even if there is no output.
I hope this answers your query.
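Regarding the second half of the question (how to actually skip bad records on 2.7.4): as far as I know, record skipping is only wired into the old org.apache.hadoop.mapred API via the SkipBadRecords class. A hedged sketch of the relevant job configuration might look like the following; the thresholds and output path are arbitrary examples I chose, not defaults:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapred.{JobConf, SkipBadRecords}

object SkipConfigSketch {
  // Call this on the JobConf before submitting a job that uses the old mapred API.
  def configure(conf: JobConf): Unit = {
    // Enter skipping mode after two failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(conf, 2)
    // Allow up to 100 records around the offending one to be skipped per failure.
    SkipBadRecords.setMapperMaxSkipRecords(conf, 100L)
    // Explicitly choose where skipped records are written; the path is only an example.
    SkipBadRecords.setSkipOutputPath(conf, new Path("/tmp/skipped-records"))
  }
}
```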
I would like to know whether it is possible under macOS to store "information" about some files, but not the files themselves. In essence, I would like to off-load certain large files from many of my projects to external drives/machines, but still retain the ability to get information about those off-loaded files when I look into the project directory for a listing and information about its contents. I understand that what I am asking is not supported by the file system by default, but is there any third-party framework under which this is possible?
Should the file name contain a number for textFileStream to pick it up? My program picks up new files only if the file name contains a number, and ignores all other files even if they are new. Is there any setting I need to change to pick up all the files? Please help.
No, it scans the directory for new files which appear within the window. If you are writing to S3, do a direct write with your code, as the file doesn't appear until the final close(), so there is no need to rename. In contrast, if you are working with file streaming sources against normal filesystems, you should create the file outside the scanned directory and rename it in at the end, otherwise work-in-progress files may get read. And once a file has been read, it is never re-read.
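For context, a minimal sketch of the file-streaming pattern described above; the directory, batch interval, and app name are arbitrary placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object FileStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("file-stream").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Any new file that appears in the directory within a batch window is picked up,
    // regardless of its name; file names do not need to contain a number.
    val lines = ssc.textFileStream("hdfs:///data/incoming/")
    lines.foreachRDD { rdd =>
      // Process each batch of newly discovered files here.
      rdd.take(5).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```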
After spending hours analyzing the stack trace, I figured out that the problem is the S3 address. I was providing "s3://mybucket", which worked with Spark 1.6 and Scala 2.10.5. On Spark 2.0 (and Scala 2.11), it must be provided as "s3://mybucket/". Maybe it is some regex-related handling. Working fine now. Thanks for all the help.
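In other words, something like the following sketch; the bucket name is the asker's placeholder, and the commented line is the form that stopped working:

```scala
import org.apache.spark.sql.SparkSession

object S3ReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("s3-read").getOrCreate()
    val sc = spark.sparkContext

    // Worked on Spark 1.6 / Scala 2.10.5:
    //   sc.textFile("s3://mybucket")
    // On Spark 2.0 / Scala 2.11, the trailing slash was required:
    val rdd = sc.textFile("s3://mybucket/")
    println(rdd.count())
  }
}
```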
I'm new to Spark and have been using it a lot recently for some batch processing.
Currently I have a new requirement and am stuck on how to approach it.
I have a file which has to be processed, but this file can get periodically updated. I want the initial file to be processed, and whenever there is an update to the file, I want Spark operations to be triggered and to operate only on the updated parts this time. Any way to approach this would be helpful.
I'm open to using any other technology in combination with Spark. The files will generally sit on a file system and could be several GBs in size.
Spark alone cannot recognize that a file has been updated.
It does its job when it reads the file for the first time, and that's all.
By default, Spark won't know that a file has been updated, and won't know which parts of the file are the updates.
You should instead work with folders: Spark can read a whole folder and recognize when there is a new file in it to process -> sc.textFile(PATH_FOLDER)...
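A rough sketch of that folder-based approach in Scala: the batch read mirrors the sc.textFile(PATH_FOLDER) idea, and the streaming read (Structured Streaming's file source, my addition for the "only the updated parts" requirement) picks up only files newly added to the folder. The directory path is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

object FolderWatchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("folder-watch").getOrCreate()

    // One-shot batch read of everything currently in the folder
    // (the sc.textFile(PATH_FOLDER) equivalent in the DataFrame API).
    val batch = spark.read.text("/data/updates/")
    println(s"initial snapshot has ${batch.count()} lines")

    // Streaming read of the same folder: every file later dropped into it is
    // treated as new input, so only that delta is processed per trigger.
    // Note it reacts to new files, not to appends inside an existing file.
    val updates = spark.readStream.text("/data/updates/")
    val query = updates.writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```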