Can a text file be outputted to the local filesystem directly from Spark? - hadoop

With an RDD, I can call rdd.saveAsTextFile('directory'), which saves the files under hdfs://directory. Can the text file be saved directly to a directory on the local filesystem (i.e. directory)?

Of course you can... since saveAsTextFile('directory') writes as many files as there are partitions, you first need to merge the files before you copy to local (unless you wish to copy each file individually). Therefore first call
FileUtil.copyMerge(sourceFileSystem, new Path(sourceFullPath), destFileSystem, new Path(destinationFullPath), true, sparkContext.hadoopConfiguration, null)
and afterwards use
FileSystem fs = FileSystem.get(yourConfiguration)
fs.copyToLocalFile(true, new Path(destinationFullPath), new Path(localFilePath))
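Putting both steps together, here is a minimal Java sketch of the same flow, assuming a Hadoop 2.x cluster (FileUtil.copyMerge was removed in Hadoop 3) and hypothetical paths; the Configuration can be taken from the running job via sparkContext.hadoopConfiguration():

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SaveToLocal {
    public static void mergeAndCopyToLocal(Configuration hadoopConf) throws IOException {
        // Hypothetical paths: the directory written by saveAsTextFile, the merge target, and the local copy
        Path sourceDir  = new Path("hdfs:///user/me/directory");
        Path mergedFile = new Path("hdfs:///user/me/directory-merged.txt");
        Path localFile  = new Path("file:///tmp/directory.txt");

        FileSystem fs = FileSystem.get(hadoopConf);

        // 1) Merge all part-* files into a single HDFS file (deletes the source dir because of 'true')
        FileUtil.copyMerge(fs, sourceDir, fs, mergedFile, true, hadoopConf, null);

        // 2) Pull the merged file down to the local filesystem (deletes the HDFS copy because of 'true')
        fs.copyToLocalFile(true, mergedFile, localFile);
    }
}
```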

Related

Create a CSV file in ADLS from Databricks

I am creating a CSV file in an ADLS folder.
For example: sample.txt is the file name.
Instead of a single file, I see a sample.txt directory containing part-000* files.
My question is: is there a method to create a sample.txt file instead of a directory in pyspark?
df.write() or df.save() both create a folder with multiple files inside that directory.
Using coalesce(1) I can combine the multiple part-000 files into one file, but how do I create a single csv file?
Unfortunately, Spark doesn't support creating a data file without a folder.
As a workaround, first use coalesce or repartition to create a single part (partition) file.
df\
.coalesce(1)\
.write\
.format("csv")\
.mode("overwrite")\
.save("mydata")
The above example produces a /mydata directory containing a single part-000* file along with some hidden metadata files.
However, our data is contained in only one CSV file, and the name of that file is not user-friendly. We can rename this file and move it out of the directory:
data_location = "/mydata/"
files = dbutils.fs.ls(data_location)
csv_file = [x.path for x in files if x.path.endswith(".csv")][0]
dbutils.fs.mv(csv_file, data_location.rstrip('/') + ".csv")
dbutils.fs.rm(data_location, recurse = True)
Then set up the account key and configure the storage account for access, and copy the file from the Databricks location to ADLS. To copy the file, we use dbutils.fs.cp:
```python
storage_account_name = "Storage account name"
storage_account_access_key = "storage account access key"
# The key suffix must match the endpoint used below (dfs.core.windows.net for abfss://)
spark.conf.set("fs.azure.account.key." + storage_account_name + ".dfs.core.windows.net", storage_account_access_key)
dbutils.fs.cp('/mydata.csv', 'abfss://demo12@pratikstorage1.dfs.core.windows.net/mydata1.csv')
```

Shell Script - Iterate through each line in text file and rename HDFS file

I have a text file in HDFS with records like below. The number of lines in the file may vary every time.
hdfs://myfile.txt
file_name_1
file_name_2
file_name_3
I have the following HDFS directory and file structure:
hdfs://myfolder/
hdfs://myfolder/file1.csv
hdfs://myfolder/file2.csv
hdfs://myfolder/file3.csv
Using a shell script I am able to count the number of files in the HDFS directory and the number of lines in my HDFS text file. Only if the two counts match will I proceed further with the process.
Now, I am trying to rename hdfs://myfolder/file1.csv to hdfs://myfolder/file_name_1.csv using the first record from my text file.
The second file should be renamed to hdfs://myfolder/file_name_2.csv and the third file to hdfs://myfolder/file_name_3.csv.
I am having difficulty looping through both the text file and the files in the HDFS directory.
Is there an optimal way to achieve this using a shell script?
You cannot do this directly within HDFS; you'd need to stream the file contents and then issue individual move commands, e.g.
#!/bin/sh
COUNTER=1
for NAME in $(hdfs dfs -cat hdfs://myfile.txt)
do
  # TODO: massage $NAME with sed if needed, and make sure the extension isn't duplicated
  hdfs dfs -mv "hdfs://myfolder/file${COUNTER}.csv" "hdfs://myfolder/${NAME}.csv"
  COUNTER=$((COUNTER + 1))
done

Adding output files to an existing output directory in Mapreduce

I want to add the output files of a MapReduce program to the same directory every time I run the job, by appending a timestamp to the end of the file name.
Currently I am able to append the timestamp to the end of the output file name, but I am unable to figure out how to add files to the same output directory instead of overwriting it every time.
You can write the output files to a temporary folder and move them to the target folder after the job finishes. Here is an example of a method that moves all files from one folder to another:
public static void moveFiles(Path from, Path to, Configuration conf) throws IOException {
FileSystem fs = from.getFileSystem(conf); // get file system
for (FileStatus status : fs.listStatus(from)) { // list all files in 'from' folder
Path file = status.getPath(); // get path to file in 'from' folder
Path dst = new Path(to, file.getName()); // create new file name
fs.rename(file, dst); // move file from 'from' folder to 'to' folder
}
}
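For context, here is a hedged sketch of how such a method might be called from the driver, assuming (hypothetically) that the job writes into a fresh temporary folder on every run, that the shared target folder already exists, and that moveFiles is the helper shown above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "timestamped-output-job");
        // ... configure mapper, reducer, input format, input paths, etc.

        // Hypothetical paths: a per-run temporary folder and the shared output directory
        Path tmpOutput   = new Path("/tmp/job-output-" + System.currentTimeMillis());
        Path finalOutput = new Path("/data/job-output");
        FileOutputFormat.setOutputPath(job, tmpOutput);

        if (job.waitForCompletion(true)) {
            // moveFiles is the helper defined above (assumed to be in the same class);
            // it moves the freshly produced, timestamp-suffixed files into the shared directory
            moveFiles(tmpOutput, finalOutput, conf);
        }
    }
}
```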
Alternatively, the output can be controlled from the reduce method; you could try handling this logic in the reducer.
Please note that the number of reducers equals the number of output files.

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz files and instead reads binary gibberish.
So, I would like to know if I can set it to somehow read the text, or alternatively access the file name using sc.textFile(...)
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat, which cannot read gzipped files since they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with a WholeFileInputFormat (not built into Hadoop, but available all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZIPInputStream.
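For what it's worth, here is a hedged Java sketch of that second option, using binaryFiles (available since Spark 1.2) plus GZIPInputStream; the input directory is hypothetical, and IOUtils comes from commons-io, which is normally already on Spark's classpath:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

import org.apache.commons.io.IOUtils;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.input.PortableDataStream;

public class GzWholeFiles {
    public static JavaPairRDD<String, String> readGzFiles(JavaSparkContext sc) {
        // binaryFiles keeps the file name as the key and hands back the raw (still compressed) bytes
        return sc.binaryFiles("hdfs:///data/small-gz-files")
                 .mapValues((PortableDataStream pds) -> {
                     // Decompress each file ourselves and turn it into one big String
                     try (InputStream in = new GZIPInputStream(pds.open())) {
                         return IOUtils.toString(in, StandardCharsets.UTF_8);
                     }
                 });
    }
}
```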
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus [] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
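Building on that listing, and purely as a hedged sketch of one way to carry it forward: plain textFile does decompress .gz files based on their extension, so each path could be read on its own and every line tagged with its file name, which is what the original question was after (class and method names here are made up):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class TagLinesWithFileName {
    public static JavaPairRDD<String, String> readTagged(JavaSparkContext sc, List<Path> paths) {
        List<JavaPairRDD<String, String>> perFile = new ArrayList<>();
        for (Path p : paths) {
            final String name = p.toString();
            // textFile handles the gzip codec by extension, so each .gz file is decompressed transparently
            perFile.add(sc.textFile(name).mapToPair(line -> new Tuple2<>(name, line)));
        }
        // Union everything back into a single (fileName, line) pair RDD
        JavaPairRDD<String, String> result = perFile.get(0);
        for (int i = 1; i < perFile.size(); i++) {
            result = result.union(perFile.get(i));
        }
        return result;
    }
}
```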
I faced the same issue while using spark to connect to S3.
My file was a gzipped CSV with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the code below:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (It may seem like a hacky approach, but it solved my problem.)

Save twitter search result to JSON file

I am using the twitter Ruby gem to fetch Twitter search results. The example code from GitHub extracts the information from the search results. I am wondering how to save the search results, which I think are JSON, to a separate JSON file.
Here is part of the example code:
results = @search.perform("$aaa", 1000)
aFile = File.new("data.txt", "w")
results.map do |status|
myStr="#{status.from_user}: #{status.text} #{status.created_at}"
aFile.write(myStr)
aFile.write("\n")
end
Is there any way to save all the search results to a separate JSON file instead of writing strings to a file?
Thanks in advance.
If you want to save to a file, all you need to do is open the file, write to it, then close it:
File.open("myFileName.txt", "a") do |mFile|
mFile.syswrite("Your content here")
mFile.close
end
When you use open, the file will be created if it doesn't exist.
One thing to be aware of is that there are different modes for opening a file, which determine where the program writes. The "a" indicates that everything you write will be appended to the end of the current content.
Here are some of the options:
r Read-only mode. The file pointer is placed at the beginning of the file. This is the default mode.
r+ Read-write mode. The file pointer will be at the beginning of the file.
w Write-only mode. Overwrites the file if the file exists. If the file does not exist, creates a new file for writing.
w+ Read-write mode. Overwrites the existing file if the file exists. If the file does not exist, creates a new file for reading and writing.
a Write-only mode. The file pointer is at the end of the file if the file exists. That is, the file is in the append mode. If the file does not exist, it creates a new file for writing.
a+ Read and write mode. The file pointer is at the end of the file if the file exists. The file opens in the append mode. If the file does not exist, it creates a new file for reading and writing.
So in your case, you would want to pull out the data you want to save, then write it to a file as I have shown. You can also specify file paths by doing:
File.open("/the/path/to/yourfile/myFileName.txt", "a") do |mFile|
mFile.syswrite("Your content here")
mFile.close
end
Another thing to be aware of is that open does not create directories, so you will need to create any missing directories yourself, either by hand or from within your program. Here is a link that is helpful for file input/output:
http://www.tutorialspoint.com/ruby/ruby_input_output.htm
