Is there a way I can break a big torrent download (say, more than 4 GB) into smaller parts (say, less than 1 GB each) and download them separately?
I should later be able to combine them, and the final result should be the same as if I had downloaded the whole torrent in one go.
Yes.
A torrent file contains modular information about the content it tracks in the form of 'pieces', which are slices of the entire file you're downloading; pieces are generally between 16 KB and 4 MB.
So, in essence, a torrent is already downloaded as n parts and combined afterwards. All you'd need is a client that lets you specify that you only want the first, second, third or fourth quarter of the pieces, and that is where you'll run into trouble: I don't know of any client that supports such a thing.
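To make the piece arithmetic concrete, here is a minimal Python sketch (not tied to any particular client; the function name and the example numbers are only illustrative) that works out which piece indices make up each quarter of a download, given the total size and piece length from the .torrent metadata:

import math

def piece_ranges(total_size, piece_length, parts=4):
    """Split the piece index space into `parts` contiguous ranges (inclusive)."""
    num_pieces = math.ceil(total_size / piece_length)
    per_part = math.ceil(num_pieces / parts)
    ranges = []
    for i in range(parts):
        first = i * per_part
        last = min((i + 1) * per_part, num_pieces) - 1
        if first <= last:
            ranges.append((first, last))
    return ranges

# Example: a 4 GiB payload with 1 MiB pieces -> 4096 pieces, 1024 per quarter
print(piece_ranges(4 * 1024**3, 1024**2))

A hypothetical client supporting this would only fetch the pieces in the requested range; "combining" later just means letting it fill in the remaining ranges into the same files.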
In addition to this answer, it's possible to split a file into smaller chunks with a file archiver or similar software.
You can then send those files with torrents.
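If you go the archiver route, the split-and-rejoin step itself is only a few lines of Python; this is just a sketch of the idea, with the 1 GiB chunk size and the .partN naming chosen arbitrarily:

CHUNK = 1024**3  # 1 GiB per part (arbitrary choice)

def split_file(path):
    """Write path.part0, path.part1, ... each at most CHUNK bytes."""
    index = 0
    with open(path, "rb") as src:
        while True:
            data = src.read(CHUNK)  # note: reads a full chunk into memory
            if not data:
                break
            with open(f"{path}.part{index}", "wb") as dst:
                dst.write(data)
            index += 1
    return index  # number of parts written

def join_files(path, parts):
    """Concatenate path.part0 .. path.part{parts-1} back into path."""
    with open(path, "wb") as dst:
        for index in range(parts):
            with open(f"{path}.part{index}", "rb") as src:
                dst.write(src.read())

In practice an archiver such as 7-Zip (or the Unix split tool) does the same job, and archive formats also give you built-in integrity checks before you seed the parts.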
Related
I need to stream my videos using HLS byte-range HTTP requests.
FFmpeg has an option to keep all the segments in one single large ".ts" file.
Are there any pros and cons of splitting the ts files versus keeping one big ts file?
Does a big ts file make requests slower, because HDD seeks are slow?
I recommend using the split variant because
better performance (lower CPU load) on the webserver (no need to seek within the same file every time; it just delivers the whole, small file)
better support by clients (some, mostly older clients or smart TVs, have issues with the big file, either during seeking or at the start of playback; some of them also try to download the whole file first, for whatever reason, before starting playback)
small files can be cached better (within your webserver or through a CDN; most CDNs have a size limit, and some don't cache partial files)
On the other side, I don't see any relevant benefit for the large-file variant.
Maybe faster cleanup, because only one file instead of a whole folder has to be deleted.
ffmpeg has an -hls_time option to set the length of the *.ts files. Apple suggests 6 seconds. In our case we have set it to 30 seconds (-hls_time 30), which results in larger but fewer *.ts files, which is more practical for uploading, storing and retrieving to/from object storage.
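For reference, a minimal sketch of that invocation, here driven from Python; the input and output names are placeholders, and -c copy assumes the source is already HLS-compatible (H.264/AAC):

import subprocess

# Segment an input file into ~30-second .ts parts plus an index.m3u8 playlist.
subprocess.run([
    "ffmpeg",
    "-i", "input.mp4",
    "-c", "copy",                  # remux only, no re-encoding
    "-f", "hls",
    "-hls_time", "30",             # target segment length in seconds
    "-hls_list_size", "0",         # keep every segment in the playlist
    "-hls_segment_filename", "segment_%05d.ts",
    "index.m3u8",
], check=True)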
My use case is that I'm writing entries to a file throughout the day. I can either write these entries compressed, or compress the entire file after the fact. These files can get fairly big (~10 GB uncompressed) and I'm writing to multiple files at the same time. Another consideration is that I can split the files into smaller granularities to address the buffer issue for compressing per file. There probably isn't a definitive right or wrong answer to this, but I'm just seeing if there are other considerations that I should look at.
Once compressed, these files will be uploaded to some sort of storage medium for archival purposes and possible later analysis.
Compress Per Line
Pros:
- More space efficient while writing
- More space efficient while reading, since I can decompress at a per-entry granularity
Cons:
- More complicated to implement
- Less efficient in terms of disk space usage vs. compressing an entire file

Compress Per File
Pros:
- Better compression on a per-file basis, since there is more data that can be compressed
- Simpler to implement: write normally to the file and compress afterwards using simple Linux tools
Cons:
- Requires a bigger buffer of disk space to handle writes throughout the day before compressing
Unless you have really, really long lines, you will get almost no compression on a single line. Have you tried it?
You can get the best of both worlds by accumulating lines until you have enough to compress, and then writing those to the file. gzlog does that.
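If gzlog itself isn't an option, the same accumulate-then-flush idea is easy to sketch in Python: buffer lines until a threshold, then append them to the log as a new gzip member (concatenated gzip members still form one valid gzip stream). The class name and the 64 KB threshold below are only illustrative:

import gzip

class BufferedGzipLog:
    """Accumulate lines in memory and flush them as appended gzip members."""

    def __init__(self, path, flush_bytes=64 * 1024):
        self.path = path
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.buffered = 0

    def write_line(self, line):
        data = (line.rstrip("\n") + "\n").encode("utf-8")
        self.buffer.append(data)
        self.buffered += len(data)
        if self.buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # "ab" appends a new gzip member; gunzip/zcat read the members as one stream.
        with gzip.open(self.path, "ab") as f:
            f.writelines(self.buffer)
        self.buffer = []
        self.buffered = 0

log = BufferedGzipLog("entries.log.gz")
log.write_line("first entry of the day")
log.flush()  # call once more at end of day (or on shutdown) to write the tail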
"Everything" is a file-searching program. As its author hasn't released the source code, I am wondering how it works.
How could it index files so efficiently?
What data structures does it use for file searching?
How can its file searching be so fast?
To quote its FAQ,
"Everything" only indexes file and folder names and generally takes a
few seconds to build its database. A fresh install of Windows 10
(about 120,000 files) will take about 1 second to index. 1,000,000
files will take about 1 minute.
If it takes only one second to index a whole Windows 10 install, and only 1 minute to index one million files, does this mean it can index 120,000 files per second?
To make the search fast, there must be a special data structure. Searching by file name doesn't only match from the start of the name but, in most cases, also from the middle, which makes some widely used indexing structures such as tries and red–black trees ineffective.
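For scale, even the brute-force baseline, a case-insensitive substring scan over an in-memory list of names, is already quite fast for around a million short strings; whether Everything does exactly this internally is not documented, so the sketch below is only the baseline such a tool has to beat:

def search(index, query):
    """Return the original names whose lowercased form contains the query."""
    q = query.lower()
    return [original for lowered, original in index if q in lowered]

names = ["C:\\Windows\\notepad.exe", "C:\\Users\\alice\\report_2023.docx"]
index = [(n.lower(), n) for n in names]   # built once at index time, kept in RAM
print(search(index, "note"))              # -> ['C:\\Windows\\notepad.exe']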
The FAQ clarifies further.
Does "Everything" hog my system resources?
No, "Everything" uses very little system resources. A fresh install of
Windows 10 (about 120,000 files) will use about 14 MB of ram and less
than 9 MB of disk space. 1,000,000 files will use about 75 MB of ram
and 45 MB of disk space.
Short Answer: MFT (Master File Table)
Getting the data
Many search engines recursively walk the entire directory structure to find all the files, so the indexing process takes a long time to complete even when contents are not indexed. If contents were also indexed, it would take far longer.
From my analysis of Everything, it does not recurse at all. Observing the speed: it indexed an entire 1 TB drive (an SSD) in about 5 seconds. Had it recursed, it would have taken much longer, since there are thousands of small files, each with its own size, date and so on, spread all across the disk.
Instead, Everything does not even touch the actual data; it reads and parses the 'index' of the drive. On NTFS, the MFT stores all the file names and their physical locations (similar in concept to inodes on Linux), so all of that metadata sits in one small, contiguous area (a file). The indexer therefore does not have to waste time finding where the information about the next file is; it does not have to seek. The MFT is contiguous by design (with a rare exception: if there are far more files and the MFT fills up or gets corrupted, it links to a new one, which does cause a seek, but that edge case is very rare).
The MFT is not plain text; it needs to be parsed. The folks at Everything have designed a super-fast parser and decoder for the MFT, and hence all is well.
FSCTL_GET_NTFS_VOLUME_DATA (declared in winioctl.h) will get you the cluster locations of the MFT. Alternatively, you could use NTFSInfo (Microsoft Sysinternals - NTFSInfo v1.2), which reports, for example:
MFT zone clusters : 90400352 - 90451584
Storing and retrieving
The .db file from my index lives at C:\Users\XXX\AppData\Local\Everything; I assume this is a regular NoSQL file-based database. Using a DB rather than a flat file contributes to the speed. Also, at program start it loads this db file into RAM, so queries never look up the DB on disk, only in RAM. All of this combined makes it slick.
How could it index files so efficiently?
First, it indexes only file/directory names, not contents.
I don't know if it's efficient enough for your needs, but the ordinary way is the FindFirstFile function. Write a simple C program to list folders/files recursively, and see if it's fast enough for you. The second optimization step would be running threads in parallel, but maybe disk access would be the bottleneck; if so, multiple threads would add little benefit.
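As a rough stand-in for that experiment (the answer suggests C with FindFirstFile; this Python sketch just measures how long a plain recursive enumeration of names takes, with the root path as a placeholder):

import os
import time

def walk_count(root):
    """Recursively count files under root, touching only directory entries."""
    count = 0
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            with os.scandir(path) as entries:
                for entry in entries:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    else:
                        count += 1
        except (PermissionError, FileNotFoundError):
            continue  # skip folders we can't read
    return count

start = time.perf_counter()
total = walk_count("C:\\")
print(f"{total} files in {time.perf_counter() - start:.1f}s")

If that number is already acceptable, there is no need to go lower-level.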
If this is not enough, you could finally try digging into the even lower-level Native API functions; I have no experience with these, so I can't give you further advice. You'd be pretty close to the metal, and maybe the Linux NTFS project has some concepts you'd need to learn.
What data structures does it use for file searching?
How can its file searching be so fast?
Well, you know there are many different data structures designed for fast searching... probably the author ran a lot of benchmarks.
I have a Delphi app that references a datafile of 28-byte records. The file is written sequentially but read randomly. The datafile is split into N physical files which are rolled over at 10 megs or so, to provide some insurance against disk problems, because we only ever write to the most recent one, and because I found writes became slower and slower if the file were allowed to grow too big. On startup I read the entire file set and build an index so that I can quickly know which file to seek into, given a virtual record number.
As part of the splitting into N files I implemented a read cache. I realise now that Windows does a fair amount of caching on its own, and I wonder if I'm gaining anything by sticking another cache between myself and the disk files.
Any thoughts appreciated.
No. Use file mappings to use the existing file cache more efficiently.
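The question is about Delphi, but the idea of memory-mapping a file of fixed-size records is easy to illustrate; here is a minimal Python sketch (the 28-byte record size comes from the question, everything else is illustrative). The OS file cache backs the mapping, so repeated random reads are served from memory without any user-level cache:

import mmap

RECORD_SIZE = 28  # fixed-size records, as in the question

def read_record(path, record_number):
    """Return one record from a memory-mapped file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = record_number * RECORD_SIZE
            return mm[offset:offset + RECORD_SIZE]

In a real program you would keep the mapping open across reads rather than re-mapping on every call.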
I have the following loop:
for fileName in fileList:
    with open(fileName) as f:   # opening dominates the loop's running time
        txt = f.read()
    analyze(txt)
The fileList is a list of more than 1 million small files. Empirically, I have found that the call to open(fileName) takes more than 90% of the loop's running time. What would you do in order to optimize this loop? This is a "software only" question; buying new hardware is not an option.
Some information about this file collection:
Each file name is a 9-13 digit ID. The files are arranged in subfolders according to the first 4 digits of the ID. The files are stored on an NTFS disk, and I'd rather not change the disk format for reasons I won't get into, unless someone here has a strong belief that such a change will make a huge difference.
Solution
Thank you all for the answers.
My solution was to pass over all the files, parsing them and putting the results in an SQLite database. Now the analyses that I perform on the data (select several entries, do the math) take only seconds. As already said, the reading part took about 90% of the time, so parsing the XML files in advance had little effect on performance compared to the effect of not having to read the actual files from the disk.
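A hedged sketch of that import pass (table and column names here are made up; the stored payload stands in for whatever the real per-file parsing produces):

import sqlite3

conn = sqlite3.connect("entries.db")
conn.execute("CREATE TABLE IF NOT EXISTS entries (id TEXT PRIMARY KEY, payload TEXT)")

def import_files(file_list):
    # One transaction for the whole batch is much faster than per-row commits.
    with conn:
        for file_name in file_list:
            with open(file_name, encoding="utf-8") as f:
                conn.execute(
                    "INSERT OR REPLACE INTO entries (id, payload) VALUES (?, ?)",
                    (file_name, f.read()),
                )

# Later analysis becomes a query instead of a million open() calls, e.g.:
# conn.execute("SELECT payload FROM entries WHERE id = ?", (some_id,)).fetchone()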
Hardware solution
You should really benefit from using a solid state drive (SSD). These are a lot faster than traditional hard disk drives, because they don't have any hardware components that need to spin and move around.
Software solution
Are these files under your control, or are they coming from an external system? If you're in control, I'd suggest you use a database to store the information.
If a database is too much of a hassle for you, try to store the information in a single file and read from that. If that file isn't fragmented too much, you'll have much better performance compared to having millions of small files.
If opening and closing files is taking most of your time, a good idea would be to use a database or data store for your storage rather than a collection of flat files.
To address your final point:
unless someone here has a strong belief that such a change will make a huge difference
If we're really talking about 1 million small files, merging them into one large file (or a small number of files) will almost certainly make a huge difference. Try it as an experiment.
Store the files in a single .zip archive and read them from that. You are just reading these files, right?
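A minimal sketch of that read-from-archive idea (the archive name is a placeholder; analyze() is the function from the question):

import zipfile

# The archive's central directory is read once, so iterating over members
# avoids a separate filesystem open() for every small file.
with zipfile.ZipFile("all_entries.zip") as archive:
    for name in archive.namelist():
        txt = archive.read(name).decode("utf-8")
        analyze(txt)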
So, let's get this straight: you have sound empirical data that shows that your bottleneck is the filesystem, but you don't want to change your file structure? Look up Amdahl's law. If opening the files takes 90% of the time, then without changing that part of the program, you will not be able to speed things up by more than 10%.
Take a look at the properties dialog box for the directory containing all those files. I'd imagine the "size on disk" value is much larger than the total size of the files, because of the overhead of the filesystem (things like per-file metadata that is probably very redundant, and files being stored with an integer number of 4k blocks).
Since what you have here is essentially a large hash table, you should store it in a file format that is better suited to that kind of usage. Depending on whether you will need to modify these files and whether the data set will fit in RAM, you should look into using a full-fledged database, a lightweight embeddable one like sqlite, your language's hash table/dictionary serialization format, a tar archive, or a key-value store program that has good persistence support.