Some issues when editing large files of more than 100 GB - EmEditor

Sometimes I edit large files of more than 100 GB (my PC has 128 GB of physical memory and an NVMe SSD).
Small change vs. fast save?
When I make a small change to the file, such as deleting the first line, is there a more efficient way to complete the save?
A 200 GB file takes half an hour to save.
Sometimes EmEditor detects invalid JSON or CSV rows. Is there an easy way to mark these rows as bookmarks? That would make it easy to extract or delete these lines.
Can sequence-number auto-filling be used in Replace?
When editing more than 100 million rows, the normal procedure, as I understand it, is to switch into CSV mode, insert a new column, and then fill it with sequence numbers. These steps are also time-consuming. Can they be accomplished with the Replace function? An example is below.
Example:
{"Genres":"Drama","Product":"Ice Cream - Super Sandwich","Title":"White Lightnin'"}
{"Genres":"Drama|War","Product":"Raspberries - Frozen","Title":"Leopard, The (Gattopardo, Il)"}
{"Genres":"Crime|Drama|Film-Noir","Product":"Cookie Dough - Chunky","Title":"Limits of Control, The"}
{"Genres":"Drama|Mystery","Product":"Watercress","Title":"Echoes from the Dead (Skumtimmen)"}
{"Genres":"Drama|Thriller","Product":"Cumin - Whole","Title":"Good People"}
needs to be converted into
{"id":1,"Genres":"Drama","Product":"Ice Cream - Super Sandwich","Title":"White Lightnin'"}
{"id":2,"Genres":"Drama|War","Product":"Raspberries - Frozen","Title":"Leopard, The (Gattopardo, Il)"}
{"id":3,"Genres":"Crime|Drama|Film-Noir","Product":"Cookie Dough - Chunky","Title":"Limits of Control, The"}
{"id":4,"Genres":"Drama|Mystery","Product":"Watercress","Title":"Echoes from the Dead (Skumtimmen)"}
{"id":5,"Genres":"Drama|Thriller","Product":"Cumin - Whole","Title":"Good People"}
Data created by Mockaroo.

Assuming you are running a relatively recent version of EmEditor:
Find (Ctrl+F): {
Options: Match Case, Close when Finished, (None)
Click [Select All] (every { in the file should now be selected)
Edit menu - Advanced - Numbering (or Alt+N)
First Line: {"id":1
Increment: 1
Make sure Decimal is selected
Click [OK]
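If you prefer to do this outside EmEditor, the same transformation can be done with a small streaming script. A minimal Python sketch, assuming the data is newline-delimited JSON in UTF-8 and that writing the result to a new file is acceptable (the file names here are placeholders):

# add_ids.py - prepend a sequential "id" field to every line of a JSON Lines file.
# Streams line by line, so memory use stays flat even for 100+ GB files.
import io

SRC = "movies.jsonl"            # input: one JSON object per line (placeholder name)
DST = "movies_with_id.jsonl"    # output file (placeholder name)

with io.open(SRC, "r", encoding="utf-8") as src, \
     io.open(DST, "w", encoding="utf-8", newline="\n") as dst:
    for n, line in enumerate(src, start=1):
        line = line.rstrip("\r\n")
        if line.startswith("{"):
            # Insert the id right after the opening brace, as in the example above.
            dst.write('{"id":%d,%s\n' % (n, line[1:]))
        else:
            dst.write(line + "\n")  # pass blank or unexpected lines through unchanged

Writing a new file sequentially like this also sidesteps the save-time issue: deleting or inserting bytes near the start of a file forces everything after the edit to be rewritten, which is why a small change to a 200 GB file can still take a long time to save.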

Related

R307 Fingerprint Sensor working with more than 1000 fingerprints

I want to integrate a fingerprint sensor into my project. For now I have shortlisted the R307, which has a capacity of 1000 fingerprints. But since the project requires more than 1000 prints, I am going to store the prints on the host.
The procedure I understand from reading the datasheet for achieving the project requirements is:
1. I will register the fingerprint with "GenImg".
2. I will download the template with "upchr".
Now, whenever a fingerprint comes in, I will follow steps 1 and 2, then start some sort of matching algorithm that matches the freshly downloaded template file with the template files stored in the database.
So below are the points on which I want your thoughts:
Is the procedure I have written above correct and optimized?
Is the matching algorithm straightforward, just a direct comparison, or is it something trickier? How can I implement it? Please suggest a library if one already exists.
The sensor stores the image as 256 * 288 pixels, and if I transfer this file to the host at the maximum data rate it takes about 5 seconds (256 * 288 * 8 / 115200), which seems very long.
Thanks
Abhishek
PS: I just said "host" for the device I am going to connect the sensor to; it could be an Arduino, a Pi, or any other computing device. I will choose one depending on how much computing this task requires.
You most probably figured it out yourself, but for anyone stumbling here in the future:
You're correct for the most part.
You will take finger image (GenImg)
You will then generate a character file (Img2Tz) at BufferID: 1
You'll repeat the above 2 steps again, but this time store the character file in BufferID: 2
You're now supposed to generate a template file by combining those 2 character files (RegModel).
The device combines them for you and stores the template in both character buffers.
As a last step, you need to store this template in your storage (Store).
For searching the finger: You'll take finger image once, generate a character file in BufferID : 1 and search the library (Search). This performs a linear search and returns the finger id along with confidence score.
There's also another method (GR_Identify) that does all of the above automatically.
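For anyone driving the sensor from a host, here is a rough sketch of that enroll-and-search flow in Python. send_command() is a hypothetical placeholder for the R307 serial packet exchange (wire it up yourself or use an existing R307/ZFM library such as Adafruit's); only the command names come from the datasheet.

# r307_flow.py - outline of the enroll and search sequences described above.
# send_command() is a stub, not a real library call; it must be implemented
# over the sensor's serial packet protocol.

def send_command(name, **params):
    """Placeholder for the R307 packet exchange; should return the reply."""
    raise NotImplementedError("implement this over your serial connection")

def enroll(page_id):
    send_command("GenImg")                                # capture finger image
    send_command("Img2Tz", buffer_id=1)                   # character file -> buffer 1
    send_command("GenImg")                                # capture the same finger again
    send_command("Img2Tz", buffer_id=2)                   # character file -> buffer 2
    send_command("RegModel")                              # merge buffers into one template
    send_command("Store", buffer_id=1, page_id=page_id)   # save the template in the library

def identify():
    send_command("GenImg")                                # capture finger image
    send_command("Img2Tz", buffer_id=1)                   # character file -> buffer 1
    # Search scans the sensor's library linearly and returns the matching
    # page id together with a confidence score.
    return send_command("Search", buffer_id=1, start_page=0, page_count=1000)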
The question about optimization isn't really applicable here: you're using a third-party device, and you have to follow its working instructions whether they are optimized or not.
"The sensor stores the image as 256 * 288 pixels, and if I transfer this file to the host at the maximum data rate it takes about 5 seconds (256 * 288 * 8 / 115200), which seems very long."
I don't really get what you mean by this, but the template file (the one you intend to upload to your host) is only 512 bytes; at 115200 baud that transfers in a small fraction of a second, so it shouldn't take much time.
If you want an overview of how this system is implemented, Adafruit's library is a good reference.

Reset MeasurementIndex to 0 in Vector CANoe?

Is there a way to reset the MeasurementIndex of a CANoe Simulation to 0?
I'm currently working on a myCANoe.cfg simulation that has been saved multiple times. I'm creating log files named myCANoe_{MeasurementIndex}.blf, and MeasurementIndex is currently 800. I'd like to tweak the text in myCANoe.cfg to reset it to zero. So far, searching for the string has not been effective, nor has changing the preview text myCANoe_800.blf in myCANoe.cfg. Can we achieve this somehow?
It turns out there is a simple way. Please be sure to have a backup in case the following edits go south: you will be manually editing the myCANoe.cfg file, which could corrupt the configuration completely. I was able to achieve the result with the following steps:
Note the current measurement index (e.g. 800)
Delete myCANoe.stcfg compiled file
Open the simulation
Check current measurement index again and close simulation
Delete myCANoe.stcfg again
Edit myCANoe.cfg with a text editor
Search for the measurement index value (800). I found two occurrences: one on row 609, and one in the format <VFileName V7 QL> 1 "myCANoe_800.blf"
Edit them to 0 and 000, respectively, and save
Open the CANoe configuration. My measurement index was reset.
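If you need to repeat this, steps 6-7 can be scripted. A minimal sketch, assuming the two occurrences look exactly as described above and that you have a backup and have already deleted the .stcfg file:

# reset_measurement_index.py - rewrite myCANoe.cfg so the index restarts at 0.
# Assumes the two occurrences described above; keep a backup of the .cfg file.
import re

CFG = "myCANoe.cfg"
OLD_INDEX = "800"   # the measurement index you noted in step 1

with open(CFG, "r", encoding="latin-1", newline="") as f:   # latin-1 round-trips any byte
    text = f.read()

# Occurrence 1: the bare counter value on its own line -> 0
text = re.sub(r"^%s(\r?)$" % re.escape(OLD_INDEX), r"0\1", text, count=1, flags=re.M)
# Occurrence 2: the preview file name -> three-digit zero index
text = text.replace('"myCANoe_%s.blf"' % OLD_INDEX, '"myCANoe_000.blf"')

with open(CFG, "w", encoding="latin-1", newline="") as f:
    f.write(text)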

Labview Saving multiple segments into one file

I am converting an SDK VI provided by a data acquisition card company to suit my needs. The original VI records multiple data segments in the card memory and displays them in a waveform graph on the front panel, without any save-to-file function. I can input "Number of Records" to set how many segments (waveforms) I want to acquire. Once the acquisition is over, I can click on "Segment" (a control on the front panel that takes a number) to view the nth segment. To save all the data segments into one file, I put the "Write Delimited Spreadsheet VI" into this VI, with append-to-file and transpose enabled.
My problem is that once I add the save function, the VI saves only one segment if "Run" is set to False; then, by clicking the arrow to increment the "Segment" control on the front panel, the next segment is saved into the same file, appended after the previous one. Alternatively, the VI keeps saving data without stopping if "Run" is set to True. What I want is that when I set "Number of Records" to X (an integer), the file saves X segments. I tried to add a counter that automatically increments and replaces the "Segment" input, but it did not work.
I feel that I am getting very close to what I want, but after a week I decided to ask for help. Any comments and suggestions are welcome. Thank you.
A counter is needed. Add it to a shift register of the While Loop (not the For Loop!).
Move the saving of data into the While Loop (out of the For Loop).
Increment the counter in the While Loop and save data until the counter reaches the Segment value, as in the sketch below.
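Since LabVIEW is graphical, here is the same loop logic written out as a short Python sketch. read_segment and append_to_file are hypothetical stand-ins for the acquisition code and the Write Delimited Spreadsheet VI, and the counter variable plays the role of the shift register.

# Textual rendering of the suggested While Loop with a counter in a shift register.
def save_all_segments(target_count, read_segment, append_to_file):
    """target_count is how many segments to save (the "Number of Records")."""
    counter = 0                           # value carried in the shift register
    while counter < target_count:         # the While Loop
        data = read_segment(counter)      # read segment n from the card memory
        append_to_file(data)              # append it to the same spreadsheet file
        counter += 1                      # incremented value feeds the next iteration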
Let me also give one recommendation: try to change the current implementation to a more flexible design pattern (such as a State Machine or Producer-Consumer). Currently, your code is quite messy and tightly coupled, so it could be a challenge to debug it or to add new features. Both of the patterns mentioned can be explored by looking at the Project Templates (available when you create a new project).

How can you identify a file without a filename or filepath?

Suppose I were to give you a file. You can read the file, but you can't change it or copy it. Then I take the file, rename it, and move it to a new location. How could you identify that file (fairly reliably)?
The context: I have a database of media files for a program, and if the user alters the location or name of a file, could I find the file again by searching a directory and looking for something?
I have done exactly this; it's not hard.
I take a 256-bit hash of the file (I forget which routine I used off the top of my head) along with the file size and write them to a table. If they match, the files match. (And I think tracking the size is more paranoia than necessity.) To speed things up I also fold that hash down to a 32-bit value. If the 32-bit values match, then I check all the data.
For the sake of performance I persist the last 10 million files I have examined. The 32-bit values go in one file, which is read in its entirety; when a main record needs to be examined I pull in a "page" of them (I forget exactly how big), padded to align it with the disk.
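A minimal sketch of that scheme in Python, using SHA-256 from hashlib as the 256-bit hash (the answer doesn't say which routine was used) plus the file size, with the digest folded down to 32 bits for the cheap first comparison:

# file_fingerprint.py - identify a file by its content, independent of name or path.
import hashlib
import os

def fingerprint(path, chunk_size=1 << 20):
    """Return (size, folded 32-bit value, full SHA-256 digest) for a file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    digest = h.digest()
    folded = 0
    for i in range(0, len(digest), 4):          # XOR 4-byte words to get 32 bits
        folded ^= int.from_bytes(digest[i:i + 4], "big")
    return os.path.getsize(path), folded, digest

# Two files are considered identical if size, folded value and full digest all
# match; compare the cheap 32-bit value first, then confirm with the full digest.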

Are there alternatives for creating large container files that are cross platform?

Previously, I asked a question about this.
The problem is that the demands of our file structure are very high.
For instance, we're trying to create a container with up to 4,500 files and 500 MB of data.
The file structure of this container consists of
SQLite DB (under 1mb)
Text based xml-like file
Images inside a dynamic folder structure that make up the rest of the ~4,500 files
After the initial creation, the image files are read-only, with the exception of deletion.
The small DB is used regularly when the container is accessed.
Tar, Zip and the like are all too slow (even with 0 compression). Slow is subjective, I know, but untarring a container of this size takes over 20 seconds.
Any thoughts?
As you seem to be doing arbitrary file system operations on your container (say, creation, deletion of new files in the container, overwriting existing files, appending), I think you should go for some kind of file system. Allocate a large file, then create a file system structure in it.
There are several options for the file system available: for both Berkeley UFS and Linux ext2/ext3, there are user-mode libraries available. It might also be possible that you find a FAT implementation somewhere. Make sure you understand the structure of the file system, and pick one that allows for extending - I know that ext2 is fairly easy to extend (by another block group), and FAT is difficult to extend (need to append to the FAT).
Alternatively, you can put a virtual disk format yet below the file system, allowing arbitrary remapping of blocks. Then "free" blocks of the file system don't need to appear on disk, and you can allocate the virtual disk much larger than the real container file will be.
Three things.
1) What Timothy Walters said is right on; I'll go into more detail.
2) 4,500 files and 500 MB of data is simply a lot of data and disk writes. If you're operating on the entire dataset, it's going to be slow. That's just I/O reality.
3) As others have mentioned, there's no detail on the use case.
If we assume a read only, random access scenario, then what Timothy says is pretty much dead on, and implementation is straightforward.
In a nutshell, here is what you do.
You concatenate all of the files into a single blob. While you are concatenating them, you track each file's name, its length, and the offset at which it starts within the blob. You write that information out into a block of data, sorted by name. We'll call this the Table of Contents, or TOC, block.
Next, you concatenate the two pieces together. In the simple case, you have the TOC block first, then the data block.
When you wish to get data from this format, search the TOC for the file name, grab the offset from the beginning of the data block, add the TOC block size, and read FILE_LENGTH bytes of data. Simple.
If you want to be clever, you can put the TOC at the END of the blob file. Then append, at the very end, the offset to the start of the TOC. To read, you lseek to the end of the file, back up 4 or 8 bytes (depending on your number size), take THAT value, and lseek even farther back to the start of your TOC. Then you're back to square one. You do this so you don't have to rebuild the archive twice at the beginning.
If you lay out your TOC in blocks (say 1 KB in size), then you can easily perform a binary search on the TOC. Simply fill each block with the file-information entries, and when you run out of room, write a marker, pad with zeroes, and advance to the next block. To do the binary search, you already know the size of the TOC: start in the middle, read the first file name, and go from there. Soon you'll find the right block, and then you read in the block and scan it for the file. This makes reading efficient without having the entire TOC in RAM. The other benefit is that the blocking requires less disk activity than a chained scheme like TAR (where you have to crawl the archive to find something).
I suggest you pad the files to block sizes as well; disks like to work with regularly sized blocks of data, and this isn't difficult either.
Updating this without rebuilding the entire thing is difficult. If you want an updatable container system, then you may as well look into some of the simpler file-system designs, because that's what you're really looking for in that case.
As for portability, I suggest you store your binary numbers in network order, as most standard libraries have routines to handle those details for you.
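A compact sketch of that layout in Python, using the TOC-at-the-end variant with a trailing 8-byte offset; the entry format (length-prefixed name, then 64-bit offset and length, all big-endian/network order) is my own choice for illustration, not an established spec:

# blobpack.py - minimal "data blob + table of contents" container, as described above.
import os
import struct

def pack(container_path, file_paths):
    toc = []                                        # (name, offset, length) per file
    with open(container_path, "wb") as out:
        for p in file_paths:                        # 1. concatenate the files into one blob
            offset = out.tell()
            with open(p, "rb") as src:
                data = src.read()
            out.write(data)
            toc.append((os.path.basename(p), offset, len(data)))
        toc_offset = out.tell()                     # 2. write the TOC, sorted by name
        for name, offset, length in sorted(toc):
            encoded = name.encode("utf-8")
            out.write(struct.pack(">H", len(encoded)) + encoded)
            out.write(struct.pack(">QQ", offset, length))
        out.write(struct.pack(">Q", toc_offset))    # 3. trailing pointer to the TOC start

def read_file(container_path, wanted):
    with open(container_path, "rb") as f:
        f.seek(-8, os.SEEK_END)                     # read the trailing TOC pointer
        (toc_offset,) = struct.unpack(">Q", f.read(8))
        toc_end = f.seek(0, os.SEEK_END) - 8
        f.seek(toc_offset)
        while f.tell() < toc_end:                   # linear scan; fixed-size TOC blocks
            (name_len,) = struct.unpack(">H", f.read(2))   # would allow a binary search
            name = f.read(name_len).decode("utf-8")
            offset, length = struct.unpack(">QQ", f.read(16))
            if name == wanted:
                f.seek(offset)
                return f.read(length)
    raise FileNotFoundError(wanted)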
Working on the assumption that you're only going to need read-only access to the files, why not just merge them all together and have a second "index" file (or an index in the header) that tells you the file name, start position and length? All you need to do is seek to the start point and read the correct number of bytes. The method will vary depending on your language, but it's pretty straightforward in most of them.
The hardest part then becomes creating your data file + index, and even that is pretty basic!
An ISO disk image might do the trick. It should be able to hold that many files easily, and is supported by many pieces of software on all the major operating systems.
First, thank you for expanding your question; it helps a lot in providing better answers.
Given that you're going to need a SQLite database anyway, have you looked at the performance of putting it all into the database? My experience is based on SQL Server 2000/2005/2008, so I'm not positive about the capabilities of SQLite, but I'm sure it's going to be a pretty fast option for looking up records and getting the data, while still allowing for delete and/or update operations.
Usually I would not recommend putting files inside the database, but given that the total size of all images is around 500 MB for 4,500 images, you're looking at a little over 100 KB per image, right? If you're using a dynamic path to store the images, then in a slightly more normalized database you could have an "ImagePaths" table that maps each path to an ID; you can then look up images by that PathID and load the data from the BLOB column as needed.
The XML file(s) could also go in the SQLite database, which gives you a single 'data file' for your app that can move between Windows and OS X without issue. You can simply rely on your SQLite engine to provide the performance and compatibility you need.
How you optimize it depends on your usage. For example, if you frequently need to get all images at a certain path, then having a PathID (as an integer, for performance) would be fast; but if you're showing all images that start with "A" and simply display the path as a property, then an index on the ImageName column would be of more use.
I am a little concerned, though, that this sounds like premature optimization. You really need to find a solution that works 'fast enough', abstract the mechanics of it so your application (or both apps, if you have Mac and PC versions) uses a simple repository or similar, and then you can change the storage/retrieval method at will without any implications for your application.
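A rough sketch of the BLOB-plus-ImagePaths idea with Python's built-in sqlite3 module; all table and column names here are just for illustration:

# media_store.py - store images as BLOBs with a normalized path table in SQLite.
import sqlite3

conn = sqlite3.connect("container.db")              # the single cross-platform data file
conn.executescript("""
CREATE TABLE IF NOT EXISTS ImagePaths (
    PathID INTEGER PRIMARY KEY,
    Path   TEXT UNIQUE NOT NULL
);
CREATE TABLE IF NOT EXISTS Images (
    ImageID   INTEGER PRIMARY KEY,
    PathID    INTEGER NOT NULL REFERENCES ImagePaths(PathID),
    ImageName TEXT NOT NULL,
    Data      BLOB NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_images_name ON Images(ImageName);
""")

def add_image(path, name, data):
    conn.execute("INSERT OR IGNORE INTO ImagePaths(Path) VALUES (?)", (path,))
    (path_id,) = conn.execute("SELECT PathID FROM ImagePaths WHERE Path = ?",
                              (path,)).fetchone()
    conn.execute("INSERT INTO Images(PathID, ImageName, Data) VALUES (?, ?, ?)",
                 (path_id, name, sqlite3.Binary(data)))
    conn.commit()

def images_in_path(path):
    return conn.execute("SELECT ImageName, Data FROM Images "
                        "JOIN ImagePaths USING (PathID) WHERE Path = ?",
                        (path,)).fetchall()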
Check Solid File System - it seems to be what you need.
