I am using the Python code below to transfer large files between a server and a client using ZeroMQ.
Server implementation (sending the file):
CHUNK_SIZE = 250000
message = pair.recv() # message is the path to the file
filename = open(message, 'rb')
filesize = os.path.getsize(message)
offsets = (int(ceil(filesize / CHUNK_SIZE)), 0)[filesize <= CHUNK_SIZE]
for offset in range(offsets + 1):
    filename.seek(offset)
    chunksize = CHUNK_SIZE
    if offset == offsets:
        chunksize = filesize - (CHUNK_SIZE * (offset - 1)) # calculate the size of the last chunk
    data = filename.read(chunksize)
    pair.send(data)
pair.send(b'')
Client implementation (receiving the file):
while True:
    data = pairs.recv()
    if data is not '':
        target.write(data)
    else:
        break
However, after transferring a large file with this implementation, for some reason extra data is appended to the end of the file:
Server side:
$ stat file.zip
File: `file.zip'
Size: 1503656416 Blocks: 2936840 IO Block: 4096 regular file
Client side:
$ stat file.zip
File: `file.zip'
Size: 1503906416 Blocks: 2937328 IO Block: 4096 regular file
The size and block counts differ between the two.
That said, do you have any suggestions for calculating/sending the end of the file properly?
Thanks
Just found the solution: the seek() call was not positioning the file for each chunk properly.
-filename.seek(offset)
+filename.seek(0, 1)
With seek(0, 1), the file is moved 0 bytes from the current position, so every read simply continues from where the previous chunk ended.
Now everything is working as expected :)
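For reference, here is a minimal sketch of an alternative send loop that drops the seek() bookkeeping entirely. It assumes pair is the same already-connected ZeroMQ PAIR socket as in the snippets above, and simply reads sequentially until read() returns an empty bytes object:

CHUNK_SIZE = 250000

message = pair.recv()              # path of the file to send, as in the original snippet
with open(message, 'rb') as f:
    while True:
        data = f.read(CHUNK_SIZE)  # each read continues from the current file position
        if not data:               # an empty bytes object means end of file
            break
        pair.send(data)
pair.send(b'')                     # empty message tells the client the transfer is finished

On the receiving side, testing the chunk with a plain truthiness check (if not data: break) is more robust than data is not '', which relies on object identity and will not match an empty bytes object on Python 3.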
I'm trying to test the size of an upload in order to validate it. Setting the upload logic aside, simply checking the size of a file on disk is where I'm having the issue. I have a test file on my desktop named test1.png, which is 115 KB.
a = '/users/rich/desktop/test1.png'
s = File.open(a, 'wb')
r = File.size(a)
p r       # => 0
p s.size  # => 0
Not sure what I'm doing wrong here, but both report 0, which is not the real size.
How can I get the size of a file?
The problem is the 'w' flag: it truncates an existing file to zero length, so because you open the file before reading its size, you get 0.
To get the size you could just use the path of the file without using File.open:
a = '/users/rich/desktop/test1.png'
File.size(a)
Or, if you need to create the File object, just use the 'r' flag:
a = '/users/rich/desktop/test1.png'
s = File.open(a, 'r')
Now you can either use File.size(a) or s.size to get the size of the file.
I am writing to HDFS using a Flume spooling directory source. Here is my configuration:
#initialize agent's source, channel and sink
agent.sources = test
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.test.type = spooldir
agent.sources.test.spoolDir = /johir
agent.sources.test.fileHeader = false
agent.sources.test.fileSuffix = .COMPLETED
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = /user/root/
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# roll over the file based on size only (rollSize is in bytes, here about 1 MB)
agent.sinks.flumeHDFS.hdfs.rollSize = 1000000
agent.sinks.flumeHDFS.hdfs.batchSize = 1000
# never roll over based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# never roll over based on elapsed time
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.test.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
But the problem is that the data written to HDFS ends up under a generated temporary name. How can I keep the original file names from the source directory in HDFS? For example, I have the files day1.txt, day2.txt and day3.txt, holding data for different days. I want them stored in HDFS as day1.txt, day2.txt and day3.txt, but instead all three are merged and stored as a single FlumeData.1464629158164.tmp file. Is there any way to do this?
If you want to retain the original file name, you should attach the filename as a header to each event.
Set the basenameHeader property to true. This will create a header with the basename key unless set to something else using the basenameHeaderKey property.
Use the hdfs.filePrefix property to set the filename using basenameHeader values.
Add the below properties to your configuration file.
#source properties
agent.sources.test.basenameHeader = true
#sink properties
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.filePrefix = %{basename}
I am trying to uncompress an 823,000-line file, but I'm only getting 26,000 lines back. I'm new to I/O and for some reason I'm not grasping why this is the case. Here is my code:
Zlib::GzipReader.open( file_path ) do |gz|
  puts gz.readlines.count
end
Any direction would be appreciated.
Thanks in advance.
Ok, so I managed to fix this.
It turns out the server log file I was using had about 29 gzip streams concatenated in it, and Zlib::GzipReader only reads the first one. To fix it, I had to loop until all 29 streams had been read:
require 'zlib'

uncompressed = ''
File.open( file_path ) do |file|
  zio = file
  loop do
    io = Zlib::GzipReader.new( zio )
    uncompressed += io.read   # append this stream's decompressed data
    unused = io.unused        # bytes consumed past the end of the current gzip stream
    break if unused.nil?
    zio.pos -= unused.length  # rewind to the start of the next stream
  end
end
I have successfully configured Flume to transfer text files from a local folder to HDFS. My problem is that when a file is transferred into HDFS, unwanted text (an org.apache.hadoop.io.LongWritable header plus binary characters) is prefixed to my text file.
Here is my flume.conf
agent.sources = flumedump
agent.channels = memoryChannel
agent.sinks = flumeHDFS
agent.sources.flumedump.type = spooldir
agent.sources.flumedump.spoolDir = /opt/test/flume/flumedump/
agent.sources.flumedump.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://bigdata.ibm.com:9000/user/vin
agent.sinks.flumeHDFS.fileType = DataStream
#Format to be written
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 10
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollSize = 10485760
# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 mi
agent.sinks.flumeHDFS.hdfs.rollInterval = 60
#Specify the channel the sink should use
agent.sinks.flumeHDFS.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
My source text file is very simple, containing the text:
Hi My name is Hadoop and this is file one.
The sink file I get in hdfs looks like this :
SEQ !org.apache.hadoop.io.LongWritable org.apache.hadoop.io.Text������5����>I <4 H�ǥ�+Hi My name is Hadoop and this is file one.
Please let me know what I am doing wrong.
Figured it out.
I had to fix this line
agent.sinks.flumeHDFS.fileType = DataStream
and change it to
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
this fixed the issue.
I am trying to store some large files on S3 with the Ruby aws/s3 gem, using:
S3Object.store("video.mp4", open(file), 'bucket', :access => :public_read)
For files of around 100 MB everything is fine, but with files over 200 MB I get a "Connection reset by peer" error in the log.
Has anyone come across this weirdness? From what I can find on the web, it seems to be an issue with large files, but I have not yet come across a definitive solution.
I am using Ubuntu.
EDIT:
This seems to be a Linux issue as suggested here.
No idea where the original problem might be, but as a workaround you could try a multipart upload.
filename = "video.mp4"
min_chunk_size = 5 * 1024 * 1024 # S3 minimum chunk size (5Mb)
#object.multipart_upload do |upload|
io = File.open(filename)
parts = []
bufsize = (io.size > 2 * min_chunk_size) ? min_chunk_size : io.size
while buf = io.read(bufsize)
md5 = Digest::MD5.base64digest(buf)
part = upload.add_part(buf)
parts << part
if (io.size - (io.pos + bufsize)) < bufsize
bufsize = (io.size - io.pos) if (io.size - io.pos) > 0
end
end
upload.complete(parts)
end
S3 multipart upload is a little tricky because every part except the last must be at least 5 MB, but the code above takes care of that.