Reading line by line from blob storage in Windows Azure
Is there any way to read a text file line by line from blob storage in Windows Azure?
Thanks
Yes, you can do this with streams, and it doesn't necessarily require that you pull the entire file, though please read to the end (of the answer... not the file in question) because you may want to pull the whole file anyway.
Here is the code:
StorageCredentialsAccountAndKey credentials = new StorageCredentialsAccountAndKey(
    "YourStorageAccountName",
    "YourStorageAccountKey");

CloudStorageAccount account = new CloudStorageAccount(credentials, true);
CloudBlobClient client = new CloudBlobClient(account.BlobEndpoint.AbsoluteUri, account.Credentials);
CloudBlobContainer container = client.GetContainerReference("test");
CloudBlob blob = container.GetBlobReference("CloudBlob.txt");

using (var stream = blob.OpenRead())
{
    using (StreamReader reader = new StreamReader(stream))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}
I uploaded a text file called CloudBlob.txt to a container called test. The file was about 1.37 MB in size (I actually used the CloudBlob.cs file from GitHub copied into the same file six or seven times). I tried this out with a BlockBlob which is likely what you'll be dealing with since you are talking about a text file.
This gets a reference to the blob as usual, then calls the OpenRead() method on the CloudBlob object, which returns a BlobStream that you can wrap in a StreamReader to get the ReadLine method. I ran Fiddler with this and noticed that it ended up making three additional requests for blocks to complete the file. It looks like the BlobStream has a few properties you can use to tweak how much read-ahead it does, but I didn't try adjusting them. According to one reference I found, the retry policy also works at the level of the last read, so it won't attempt to re-read the whole thing, just the last request that failed. Quoted here:
Lastly, the DownloadToFile/ByteArray/Stream/Text() methods performs it’s entire download in a single streaming get. If you use CloudBlob.OpenRead() method it will utilize the BlobReadStream abstraction which will download the blob one block at a time as it is consumed. If a connection error occurs, then only that one block will need to be re-downloaded (according to the configured RetryPolicy). Also, this will potentially help improve performance as the client may not need cache a large amount of data locally. For large blobs this can help significantly, however be aware that you will be performing a higher number of overall transactions against the service. -- Joe Giardino
I think it is important to note the caution Joe points out: this will lead to an overall larger number of transactions against your storage account. However, depending on your requirements, this may still be the option you are looking for.
If these are massive files and you are doing a lot of this, it could mean many, many transactions (though you could see if you can tweak the properties on the BlobStream to increase the number of blocks retrieved at a time, etc.). It may still make sense to do a DownloadToStream on the CloudBlob (which will pull the entire contents down), then read from that stream the same way I did above.
The only real difference is that one is pulling smaller chunks at a time and the other is pulling the full file immediately. There are pros and cons for each, and it will depend heavily on how large these files are and whether you plan on stopping at some point in the middle of reading the file (such as "yeah, I found the string I was searching for!") or whether you plan on reading the entire file anyway. If you plan on pulling the whole file no matter what (because you are processing the entire file, for example), then just use DownloadToStream and wrap that in a StreamReader.
Note: I tried this with the 1.7 SDK. I'm not sure in which SDK version these options were introduced.
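If you do decide to pull the whole file, a minimal sketch of that DownloadToStream variant might look like the following, reusing the container reference from the code above (still the 1.x client; treat this as a sketch rather than something verified against every SDK version):
// Pull the entire blob down in a single streaming GET, then read it line by line.
using (MemoryStream buffer = new MemoryStream())
{
    blob.DownloadToStream(buffer);
    buffer.Position = 0; // rewind before reading

    using (StreamReader reader = new StreamReader(buffer))
    {
        while (!reader.EndOfStream)
        {
            Console.WriteLine(reader.ReadLine());
        }
    }
}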
In case anyone finds themselves here, the Python SDK for Azure Blob Storage (v12) now has the simple download_blob() method, which accepts offset and length parameters.
Using Python, my goal was to extract the header row from (many) files in blob storage. I knew the locations of all of the files, so I created a list of the blob clients - one for each file. Then, I iterated through the list and ran the download_blob method.
Once you have created a blob client (either directly via connection string or using the BlobServiceClient.get_blob_client() method), just download the first (say) 4 KB to cover any long header rows, then split the text on the end-of-line character ('\n'). The first element of the resulting list will be the header row. My working code (just for a single file) looked like:
from azure.storage.blob import BlobServiceClient

MAX_LINE_SIZE = 4096  # You can change this; it should exceed your longest line.

my_blob_service_client = BlobServiceClient(account_url=my_url, credential=my_shared_access_key)
my_blob_client = my_blob_service_client.get_blob_client('my-container', 'my_file.csv')

# BlobClient has no size attribute; fetch the blob properties to get the size.
file_size = my_blob_client.get_blob_properties().size
offset = 0
You can then write a loop that downloads the text line by line, advancing the byte offset past the first end-of-line each time and requesting the next MAX_LINE_SIZE bytes. For optimum efficiency it helps to know the maximum length of a line, but if you don't, guess a sufficiently large value.
while offset < file_size - 1:
    # download_blob returns a StorageStreamDownloader; readall() gives the raw bytes.
    next_text_block = my_blob_client.download_blob(offset=offset, length=MAX_LINE_SIZE).readall().decode('utf-8')
    line = next_text_block.split('\n')[0]
    offset += len(line.encode('utf-8')) + 1  # advance past this line and its newline
    # Do something with your line..
Hope that helps. The obvious trade-off here is network overhead: each call for a line of text is not fast, but it achieves your requirement of reading line by line.
To directly answer your question, you will have to write code to download the blob locally first and then read the content from it. This is mainly because you cannot just peek into a blob and read its content in the middle. If you have used Windows Azure Table Storage, you certainly can read specific content from a table.
As your text file is a blob located in Azure Blob storage, what you really need is to download the blob locally (as a local file or a memory stream) and then read the content from it. You will have to download the blob in full or in part, depending on what type of blob you have uploaded. With page blobs you can download a specific range of content locally and process it. It is worth understanding the difference between block and page blobs in this regard.
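For what it's worth, later versions of the storage client library (Microsoft.WindowsAzure.Storage 2.x and up, rather than the 1.x client used earlier on this page) expose a DownloadRangeToStream method that pulls just a byte range of a blob. A minimal sketch, assuming a container reference obtained through that newer client and the blob name used above:
// Sketch only: requires the newer Microsoft.WindowsAzure.Storage client (2.x+),
// which added DownloadRangeToStream for partial (range) downloads.
CloudBlockBlob rangeBlob = container.GetBlockBlobReference("CloudBlob.txt");

using (MemoryStream partial = new MemoryStream())
{
    // Pull only the first 4096 bytes of the blob.
    rangeBlob.DownloadRangeToStream(partial, 0, 4096);
    partial.Position = 0;

    using (StreamReader reader = new StreamReader(partial))
    {
        Console.WriteLine(reader.ReadLine()); // e.g. just the header row
    }
}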
This is the code I used to fetch a file line by line. The file was stored in Azure Storage; the File service was used, not the Blob service.
// https://learn.microsoft.com/en-us/azure/storage/storage-dotnet-how-to-use-files
// https://<storage account>.file.core.windows.net/<share>/<directory/directories>/<file>
public void ReadAzureFile() {
    CloudStorageAccount account = CloudStorageAccount.Parse(
        CloudConfigurationManager.GetSetting("StorageConnectionString"));
    CloudFileClient fileClient = account.CreateCloudFileClient();
    CloudFileShare share = fileClient.GetShareReference("jiosongdetails");
    if (share.Exists()) {
        CloudFileDirectory rootDir = share.GetRootDirectoryReference();
        CloudFile file = rootDir.GetFileReference("songdetails(1).csv");
        if (file.Exists()) {
            using (var stream = file.OpenRead()) {
                using (StreamReader reader = new StreamReader(stream)) {
                    while (!reader.EndOfStream) {
                        Console.WriteLine(reader.ReadLine());
                    }
                }
            }
        }
    }
}
Related
How can I append bytes of data to an already uploaded file in StorJ?
I was getting a segment error while uploading a large file. I have read the file data in chunks of bytes using the Read method through io.Reader. Now I need to upload those bytes of data continuously into StorJ.
Storj, architected as an S3-compatible distributed object storage system, does not allow changing objects once uploaded. Basically, you can delete or overwrite, but you can't append. You could build something that behaves like append, however, using Storj as the backend. For example, append an ordinal number to your object's path and increment it each time you want to add to it. When you want to download the whole thing, you iterate over all the parts and fetch them all. Or, if you only want to seek to a particular offset, you can calculate which part that offset falls in and download from there.
sj://bucket/my/object.name/000
sj://bucket/my/object.name/001
sj://bucket/my/object.name/002
sj://bucket/my/object.name/003
sj://bucket/my/object.name/004
sj://bucket/my/object.name/005
Of course, this leaves unsolved the problem of what to do when multiple clients are trying to append to your "file" at the same time. Without some sort of extra coordination layer, they would sometimes end up overwriting each other's objects.
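As a rough illustration of that naming scheme, here is a small hypothetical helper in C# (the fixed part size and the path format are my own assumptions, not anything Storj prescribes) that maps a logical offset to the numbered part it would live in:
// Hypothetical helper: maps a logical offset in the "file" to the numbered part
// object it falls in, assuming every part is written at a fixed size.
static class AppendParts
{
    const long PartSize = 64L * 1024 * 1024; // assumed fixed part size (64 MB)

    public static string PartPathFor(string basePath, long offset)
    {
        long partIndex = offset / PartSize;
        return $"{basePath}/{partIndex:D3}";   // e.g. sj://bucket/my/object.name/002
    }

    public static long OffsetWithinPart(long offset)
    {
        return offset % PartSize;
    }
}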
How to protect a file with Win32 API from being corrupted if the power is reset?
In a C++ Win32 app I write a large file by appending blocks of about 64K using code like this:
auto h = ::CreateFile(
    "uncommited.dat",
    FILE_APPEND_DATA,        // open for writing
    FILE_SHARE_READ,         // share for reading
    NULL,                    // default security
    CREATE_NEW,              // create new file only
    FILE_ATTRIBUTE_NORMAL,   // normal file
    NULL);                   // no attr. template
for (int i = 0; i < 10000; ++i) { ::WriteFile(h, 64K); }
As far as I see, if the process is terminated unexpectedly, some blocks with numbers i >= N are lost, but blocks with numbers i < N are valid and I can read them when the app restarts, because the blocks themselves are not corrupted. But what happens if the power is reset? Is it true that the entire file can be corrupted, or even have zero length? Is it a good idea to do FlushFileBuffers(h); MoveFile("uncommited.dat", "commited.dat"); assuming that MoveFile is some kind of atomic operation, and when the app restarts open "commited.dat" as valid and delete "uncommited.dat" as corrupted? Or is there a better way?
MoveFile can work all right in the right situation. It has a few problems though--for example, you can't have an existing file by the new name. If that might occur (you're basically updating an existing file you want to assure won't get corrupted by making a copy, modifying the copy, then replacing the old with the new), rather than MoveFile you probably want to use ReplaceFile.
With ReplaceFile, you write your data to uncommited.dat (or whatever name you prefer). Then yes, you probably want to do FlushFileBuffers, and finally ReplaceFile to replace the old file with the new one. This makes use of the NTFS journaling (which applies to file system metadata, not the contents of your files), assuring that only one of two possibilities can happen: either you have the old file (entirely intact) or else the new one (also entirely intact). If power dies in the middle of making a change, NTFS will use its journal to roll back the transaction.
NTFS does also support transactions, but Microsoft generally recommends against applications trying to use this directly. It apparently hasn't been used much since they added it (in Windows Vista), and MSDN hints that it's likely to be removed in some future version of Windows.
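If it helps to see the pattern in code, here is a minimal sketch of the same write-flush-replace idea in C# (the language used elsewhere on this page); System.IO.File.Replace maps to the same Win32 ReplaceFile call on Windows, and the file names below are just the ones from the question:
// Write the new contents to a temporary file, flush it to disk, then atomically
// swap it in for the committed file.
string uncommitted = "uncommited.dat";
string committed = "commited.dat";
string backup = "commited.bak";

using (var fs = new FileStream(uncommitted, FileMode.Create, FileAccess.Write))
{
    byte[] block = new byte[64 * 1024];
    fs.Write(block, 0, block.Length);
    fs.Flush(true); // flush to disk, analogous to FlushFileBuffers
}

if (File.Exists(committed))
    File.Replace(uncommitted, committed, backup); // atomic replace of the old file
else
    File.Move(uncommitted, committed);            // first run: nothing to replace yet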
For an append-only scenario you can split the data into blocks (of constant or variable size). Each block should be accompanied by some form of checksum (SHA, MD5, CRC). After a crash you can read each block sequentially and verify its checksum. The first damaged block and all blocks following it should be treated as lost (you can eventually inspect them and recover data manually). To append more data, truncate the file to the end of the last correct block. You can also write two copies in parallel and, after a crash, select the one with more good blocks.
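A minimal sketch of that layout in C# (the record format, the choice of SHA-256, and the class name are my assumptions, not something from the answer): each record stores a length prefix, the payload, and a hash of the payload; recovery scans records until the first bad checksum and truncates there.
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class CheckpointedLog
{
    // Append one record: [4-byte length][payload][32-byte SHA-256 of the payload].
    public static void AppendBlock(string path, byte[] payload)
    {
        using (var fs = new FileStream(path, FileMode.Append, FileAccess.Write))
        using (var sha = SHA256.Create())
        {
            fs.Write(BitConverter.GetBytes(payload.Length), 0, 4);
            fs.Write(payload, 0, payload.Length);
            byte[] hash = sha.ComputeHash(payload);
            fs.Write(hash, 0, hash.Length);
            fs.Flush(true); // force the record to disk before treating it as durable
        }
    }

    // After a crash: scan records from the start and truncate at the first bad one.
    public static void RecoverAfterCrash(string path)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.ReadWrite))
        using (var sha = SHA256.Create())
        {
            long lastGoodEnd = 0;
            var lenBuf = new byte[4];
            while (fs.Read(lenBuf, 0, 4) == 4)
            {
                int len = BitConverter.ToInt32(lenBuf, 0);
                if (len < 0) break;
                var payload = new byte[len];
                var stored = new byte[32];
                // Sketch: assumes Read fills the buffer, which is typical for local files.
                if (fs.Read(payload, 0, len) != len || fs.Read(stored, 0, 32) != 32)
                    break; // record truncated mid-write
                if (!sha.ComputeHash(payload).SequenceEqual(stored))
                    break; // checksum mismatch: this record and everything after is lost
                lastGoodEnd = fs.Position;
            }
            fs.SetLength(lastGoodEnd); // drop the damaged tail
        }
    }
}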
Are temporary files temporal? If so, how long?
I'm building a web app that allows users to upload files < 5MB, and for this I'm using Request.ParseMultipartForm(5000000), but I'm wondering what happens if a funny guy tries to upload a file bigger than 5MB; the documentation is not clear enough: https://golang.org/pkg/net/http/#Request.ParseMultipartForm "The whole request body is parsed and up to a total of maxMemory bytes of its file parts are stored in memory, with the remainder stored on disk in temporary files" So how long do "temporary files" really last? It's a little ambiguous. Does that mean the remaining data will be erased after the handler function returns, or does it have a determined lifetime? I wouldn't want my app to crash if some guys try to do this and I run out of disk space.
Temporary files live for the duration of the request. Parsing of the form and the creation of the temp files are handled by the mime/multipart package. When the server finishes the request, it calls Form.RemoveAll to delete any temporary files associated with the form data.
Ruby PStore file too large
I am using PStore to store the results of some computer simulations. Unfortunately, when the file becomes too large (more than 2GB from what I can see) I am not able to write the file to disk anymore and I receive the following error: Errno::EINVAL: Invalid argument - <filename> I am aware that this is probably a limitation of IO, but I was wondering whether there is a workaround. For example, to read large JSON files, I would first split the file and then read it in parts. Probably the definitive solution would be to switch to a proper database in the backend, but because of some limitations of the specific Ruby environment (SketchUp) I am using, this is not always possible.
I am going to assume that your data has a field that could be used as a crude key. Therefore I would suggest that instead of dumping data into one huge file, you could put your data into different files/buckets. For example, if your data has a name field, you could take the first 1-4 chars of the name, create a file with those chars like rojj-datafile.pstore and add the entry there. Any records with a name starting 'rojj' go in that file. A more structured version is to take the first char as a directory, then put the file inside that, like r/rojj-datafile.pstore. Obviously your mechanism for reading/writing will have to take this new file structure into account, and it will undoubtedly end up slower to process the data into the pstores.
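For illustration only (written in C#, the language used elsewhere on this page, rather than Ruby, and with made-up names rather than anything PStore-specific), the bucketing rule boils down to deriving a file path from the first few characters of the record's key:
using System;
using System.IO;

static class PStoreBuckets
{
    // e.g. recordName "rojj-something" with rootDir "data" -> data/r/rojj-datafile.pstore
    public static string BucketPathFor(string recordName, string rootDir, int prefixLength = 4)
    {
        string prefix = recordName.Substring(0, Math.Min(prefixLength, recordName.Length)).ToLowerInvariant();
        string subDir = prefix.Substring(0, 1);
        return Path.Combine(rootDir, subDir, prefix + "-datafile.pstore");
    }
}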
Are there alternatives for creating large container files that are cross platform?
Previously, I asked the question. The problem is the demands of our file structure are very high. For instance, we're trying to create a container with up to 4500 files and 500MB of data. The file structure of this container consists of:
- SQLite DB (under 1MB)
- Text-based XML-like file
- Images inside a dynamic folder structure that make up the rest of the 4,500-ish files
After the initial creation, the image files are read only, with the exception of deletion. The small DB is used regularly when the container is accessed. Tar, Zip and the like are all too slow (even with 0 compression). Slow is subjective, I know, but to untar a container of this size takes over 20 seconds. Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it. There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT). Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on; I'll go into more detail.
2) 4500 files and 500MB of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. Just I/O truth.
3) As others have mentioned, there's no detail on the use case.
If we assume a read-only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do. You concatenate all of the files into a single blob. While you are concatenating them, you track their filename, the file length, and the offset that the file starts at within the blob. You write that information out into a block of data, sorted by name. We'll call this the Table of Contents, or TOC block. Next, you concatenate the two files together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add in the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then, append at the very end the offset to the start of the TOC. Then you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1K bytes in size), then you can easily perform a binary search on the TOC. Simply fill each block with the file information entries, and when you run out of room, write a marker, pad with zeroes and advance to the next block. To do the binary search, you already know the size of the TOC: start in the middle, read the first file name, and go from there. Soon you'll find the block, and then you read in the block and scan it for the file. This makes it efficient for reading without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well; disks like to work with regular-sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look into some of the simpler file system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
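A minimal sketch in C# of the simple TOC-first layout described above (the entry format, delimiter, and class name are my own assumptions): Pack writes a sorted table of name|offset|length entries followed by the concatenated file data, and ReadEntry seeks straight to the requested slice.
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class SimpleContainer
{
    // Pack: write a 4-byte TOC size, the TOC (one line per entry: name|offset|length),
    // then the raw file bytes in the same order.
    public static void Pack(string containerPath, IEnumerable<string> files)
    {
        var entries = files.OrderBy(f => Path.GetFileName(f), StringComparer.Ordinal).ToList();

        long offset = 0;
        var toc = new StringBuilder();
        foreach (var f in entries)
        {
            long len = new FileInfo(f).Length;
            toc.AppendLine($"{Path.GetFileName(f)}|{offset}|{len}");
            offset += len;
        }
        byte[] tocBytes = Encoding.UTF8.GetBytes(toc.ToString());

        using (var output = new FileStream(containerPath, FileMode.Create, FileAccess.Write))
        {
            output.Write(BitConverter.GetBytes(tocBytes.Length), 0, 4);
            output.Write(tocBytes, 0, tocBytes.Length);
            foreach (var f in entries)
                using (var input = File.OpenRead(f))
                    input.CopyTo(output);
        }
    }

    // Read one file back: parse the TOC, then seek to (header + TOC size + entry offset).
    public static byte[] ReadEntry(string containerPath, string name)
    {
        using (var fs = File.OpenRead(containerPath))
        {
            var sizeBuf = new byte[4];
            fs.Read(sizeBuf, 0, 4);
            int tocSize = BitConverter.ToInt32(sizeBuf, 0);
            var tocBytes = new byte[tocSize];
            fs.Read(tocBytes, 0, tocSize); // sketch: assumes Read fills the buffer

            foreach (var line in Encoding.UTF8.GetString(tocBytes)
                                              .Split(new[] { '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                var parts = line.TrimEnd('\r').Split('|');
                if (parts[0] != name) continue;
                long entryOffset = long.Parse(parts[1]);
                int length = int.Parse(parts[2]);
                fs.Seek(4 + tocSize + entryOffset, SeekOrigin.Begin);
                var data = new byte[length];
                fs.Read(data, 0, length);
                return data;
            }
            throw new FileNotFoundException(name);
        }
    }
}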
Working on the assumption that you're only going to need read-only access to the files, why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length? All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language, but it's pretty straightforward in most of them. The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank you for expanding your question; it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based around SQL Server 2000/2005/2008, so I'm not positive of the capabilities of SQLite, but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update options.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500MB for 4500 images, you're looking at a little over 100K per image, right? If you're using a dynamic path to store the images, then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; then you can look for images with that PathID and load the data from the BLOB column as needed.
The XML file(s) could also be in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OSX without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage. For example, if you frequently need to get all images at a certain path, then having a PathID (as an integer for performance) would be fast; but if you're showing all images that start with "A" and simply show the path as a property, then an index on the ImageName column would be of more use.
I am a little concerned, though, that this sounds like premature optimization, as you really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps if you have both Mac and PC versions) uses a simple repository or similar, and then you can change the storage/retrieval method at will without any implication to your application.
Check Solid File System - it seems to be what you need.