I'm interested in what happens at the hardware level (on the hard disk) when a file is deleted.
That is, I'm looking for information about what the hardware does when the user empties the Recycle Bin.
I'm interested in the Windows OS. So far I've only found technical information like this: Data erasure http://en.wikipedia.org/wiki/Data_erasure
If anything is unclear, please let me know.
The Recycle Bin is not related to hardware; it is just a special folder. When you move a file or folder into the Recycle Bin, Windows essentially calls ZwSetInformationFile to 'rename' (move) it. As you know, when a file is open you can't delete it directly, but you can rename it; the Recycle Bin works like this. Then, when you empty the Recycle Bin, the files/folders are removed from the file system. (Actually, it just sets a flag on the file in the file system; the data is not erased directly.)
That's all.
Thanks.
Each file system has a way of removing or releasing a file. The sectors on disk do not get wiped; eventually they may get reused for some other file, and over time the old, deleted file is no longer there.
If you read even the first sentence of the page you linked you'll see "a software-based method". The hardware knows nothing of files or file systems, and definitely not of an abstraction like the recycle bin (the directory entry is simply moved to another directory; the file is neither moved nor deleted when it goes into the recycle bin). The hardware deals with spinning motors, moving heads, and finding/reading/writing sectors. The concepts of partitions and files, deleted or not, are all in the software realm; the hardware does not know or care.
The article you referenced has to do with the media. Think about writing something with a pencil on paper, then erasing it and writing something else. The paper has been compressed by the pencil both times; with the right tools you can probably figure out some or all of the original text from the indentations in the paper. If you want to sell, donate, or throw out a computer, how do you ensure that someone doesn't extract your bank account or other sensitive information? With that piece of paper, you could burn it and grind up the chunks of ash (though you can't sell the paper at that point). Or you could, in a very chaotic and random way, scribble over the parts where you have written, so that the indentations from your original and second writing are buried in the noise. In addition to random scribbles you also write words, real words or letters but nothing sensitive, just to throw off any attempt to distinguish scribbles from real letters. The hard disk hardware is doing nothing special here: it is spinning motors, moving heads, seeking to sectors, and reading and writing them. What the software is doing is trying to lay down random scribbles that look just enough like real information that the real information doesn't stand out from the noise. You also have to understand a bit about the encoding of the data: a value like 0x12345678 is not stored on the platter as those literal bits; to make the read-back more reliable the real bits are translated to a different bit pattern, with the reverse translation on the way back. So you want to choose chaotic patterns that, when laid down on the disk, actually exercise all the points on the disk rather than hitting some and skipping others, ideally causing each location on the disk (for lack of a better term) to be written with both ones and zeros many times.
An interesting related history lesson, if you'll bear with me: there were these things called floppy disks (http://en.wikipedia.org/wiki/Floppy_disk). There was a long history, but in particular, for the same physical size of disk, the density changed (and this happened more than once). The older technology did what it could: it laid down sectors using "bits", for lack of a better term, as small as it could. Later, the technology got better and could lay down bits less than half that size. You could take a disk written in the old days and read it on the new drive. You could overwrite files on that disk with the new drive and reuse the disk (with the new drive). You could take a new disk, write files on the new drive, and read them on the old drive. But if you took an old disk with files written by the old drive, deleted them, and wrote new files on the new drive, you couldn't necessarily read those files on the old drive; the old drive might see the old files, or the new files, or just fail to read anything. To reuse that disk from the new drive on the old drive you had to format the disk on the old drive, then write files on the new drive, then read on the old drive. Why? On a whiteboard, write some words in block letters a foot tall. Take the eraser, erase only a two-inch path through the middle, then write some words two inches tall. Can you read both? It depends on what you wrote, but often, yes you can. On a clean whiteboard, write two-inch letters; can you read the words you wrote? Yep. The newer drives always had a smaller focus: they didn't write big fat bits on disks formatted at the older, lower density and small bits on disks formatted at the higher density; they always wrote the small-sized bits. When reading the old disks they read the bits okay despite the huge size, but erasing and re-writing was like the big letters on the whiteboard: they only erased a path through the middle and wrote in that small path. The new drives could only read along that narrow path; they only saw the two-inch letters and didn't see the big one-foot letters at all. The old drive saw both the old one-foot letters and the two-inch ones, and depending on which had the dominant bits it would read that one, or often just fail to read either.
These disk erasures want to do the same kind of thing. Every time you spin the media and move the heads it's not exactly perfect; there is some error, and you are not changing the charge on the exact same set of molecules on the media every time; there is a bit of a wiggle as you drive down that road. Take a road, for example. The lanes are wider than the car. If you had a paintbrush the width of the car and painted a line the first time you drove down the road, and now you want to paint over that line so that nobody can determine what your original secret color was, you need to drive that road many, many times (no cheating: you can't hug one side of the road one time and the other side the next; every time you need to pretend to be hardware and do your best to stay as close to the middle as you can, because as hardware you don't know what the goal of the software is) to allow the error in position on the first pass to be covered by the error in the later passes. You want to use a different color of paint on each pass so that eventually the edges of the painted stripe are a rainbow of colors, making it impossible to tell which one was the original color. Same here: beat up the hard drive with many, many passes of writes, using ever-changing and different data on each pass, until the original charge from the original write cannot be isolated and interpreted even at the edges.
Note that a solid-state, flash-based drive works differently. There is likely a wear-leveling scheme to prevent some portions of the flash from wearing out before others. You might get away with the same software-level solution (the software doesn't necessarily know whether it is an SSD or a mechanical drive), or it may not work and a new solution is needed. The problem with an SSD is that, being flash-based, there is a limited number of write cycles before you wear it out; pounding on it with lots of chaotic writes just wears it out faster.
What does any of this have to do with Windows and the recycle bin? Absolutely nothing. Sending something to the recycle bin is not much different from copying it to another directory; nothing is destroyed. When you delete the file, most of it is still there, intact. The directory entry, and perhaps some sort of file allocation table (something that separates free sectors from used ones), is changed; the sectors themselves do not necessarily change. Your data is there, and it is really easy for someone with the right tools to read all of your "deleted" files (shortly after deleting them).
If you don't want people to see your old data, remove the drive, open it, remove the platters, and grind them into dust. That is the only guaranteed way to destroy your sensitive information.
There is no "process in hardware". Emptying the recycle bin just performs a bunch of file delete operations, which means marking some blocks as no longer allocated, and removing directory entries from directories. At the hardware level, these are just ordinary disk writes. The data isn't destroyed in any way. For more details, look up a reference to the filesystem you're using (i.e. NTFS).
Is it possible to determine the physical location (e.g. angle + radius) for a particular bit on a CD/DVD/BluRay disk?
The reason I'm asking is this: I want to design a data structure which stores recovery information approximately on the opposite side of the medium, to avoid a single scratch making the whole exercise moot.
CD/DVD/Blu-ray encoding schemes include error correction. This is basically another layer of data redundancy that allows the decoding algorithm not only to detect errors but also to fix them. When the original engineers and computer scientists designed the encoding schemes for these discs, they took scratch resistance into account.
See Reed-Solomon codes, which are used on CDs, DVDs, and Blu-ray discs.
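To make the redundancy idea concrete, here is a minimal sketch using the third-party reedsolo Python package (an assumed dependency; any Reed-Solomon library would do). The decoder recovers the original bytes even after a few of them are corrupted, which is the same principle, applied at much larger scale and with interleaving, that lets a disc survive scratches.

    # pip install reedsolo  (assumed third-party library)
    from reedsolo import RSCodec

    rsc = RSCodec(10)                      # 10 parity bytes: can correct up to 5 byte errors
    encoded = rsc.encode(b"recovery data") # message + parity, as it would be laid on the medium

    damaged = bytearray(encoded)
    damaged[0] ^= 0xFF                     # simulate a "scratch" corrupting a couple of bytes
    damaged[3] ^= 0xFF

    decoded = rsc.decode(bytes(damaged))   # newer reedsolo versions return a tuple
    message = decoded[0] if isinstance(decoded, tuple) else decoded
    print(message)                         # the original b'recovery data' is recovered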
I have a question I'd like to discuss. I'm a fresh graduate and just got a job as an IT programmer. My company is making a game; the images/graphics used inside the game live in one folder, but as many separate image files. My task is to figure out how we can combine the different image files into a single file while the program can still access each image. If you have any ideas, please share them. Thanks.
I'm not really sure what the advantage of this approach is for a game that runs on the desktop, but if you've already carefully considered that and decided that having a single file is important, then it's certainly possible to do so.
Since the question, as Oded points out, shows very little research or other effort on your part, I won't provide a complete solution. And even if I wanted to do so, I'm not sure I could, because you don't give us any information on what programming language and UI framework you're using. Visual Studio 2010 supports a lot of different ones.
Anyway, the trick involves creating a sprite. This is a fairly common technique for web design, where it actually is helpful to reduce load times by using only a single image, and you can find plenty of explanation and examples by searching the web. For example, here.
Basically, what you do is make one large image that contains all of your smaller images, offset from each other by a certain number of pixels. Then, you load that single large image and access the individual images by specifying the offset coordinates of each image.
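Since the question doesn't say which language or framework is in use, here is only a rough sketch of the idea in Python with the Pillow imaging library; the sheet file name and tile size are made-up placeholders. The same offset arithmetic applies in any framework that can draw a sub-rectangle of an image.

    from PIL import Image  # pip install Pillow

    TILE_W, TILE_H = 64, 64            # assumed size of each small image in the sheet
    sheet = Image.open("sprites.png")  # hypothetical combined image

    def get_sprite(col, row):
        """Crop one small image out of the big sheet by its grid position."""
        left, top = col * TILE_W, row * TILE_H
        return sheet.crop((left, top, left + TILE_W, top + TILE_H))

    icon = get_sprite(2, 0)            # third image in the first row
    icon.save("icon.png")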
I do not, however, recommend doing as Jan recommends and compressing the image directory (into a ZIP file or any other format), because then you'll just have to pay the cost of uncompressing it each time you want to use one of the images. That also buys you extremely little; disk storage is cheap nowadays.
I've been thinking about writing a text editor control that can edit text that can have any arbitrary length (say, hundreds of megabytes), similar in some ways to the Scintilla editor. The goal is to lazy-read the file, so the user doesn't have to read five hundred megabytes of data just to view a small portion of it. I'm having two problems with this:
It seems to me to be impossible to implement any sensible scrolling feature for such an editor, unless I pre-read the entire file once, in order to figure out line breaks. Is this really true? Or is there a way to approximate things, that I'm not thinking of?
Because of various issues with Unicode (e.g. it allows many bytes to represent just one character, not just because of variable-length encoding but also because of accents and such), it seems nearly impossible to determine exactly how much text will fit on the screen -- I'd have to use TextOut() or something to draw one character, measure how big it was, and then draw the next character. And even then, that still doesn't say how I'd map the user's clicks back to the correct text position.
Is there anything I could read on the web regarding algorithms for handling these issues? I've searched, but I haven't found anything.
Thank you!
You can set a "coarse" position based on data size instead of lines. The "fine" position of your text window can be based on a local scan around an arbitrary entry point.
This means you will need to write functions that can scan locally (backwards and forwards) to find line starts, count Unicode characters, and so forth. This should not be too difficult; UTF8 is designed to be easy to parse in this way.
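As a rough sketch of such a local scan (assuming a UTF-8 file and an arbitrary chunk size), here is one way to find the start of the line containing a given byte offset without reading the whole file; UTF-8 helps because the newline byte 0x0A never occurs inside a multi-byte character.

    CHUNK = 4096  # arbitrary read size for backward scanning

    def find_line_start(f, pos):
        """Return the byte offset of the start of the line containing `pos`.

        `f` is a binary file object. In UTF-8 the byte 0x0A only ever encodes
        '\n', so scanning raw bytes backwards is safe.
        """
        while pos > 0:
            start = max(0, pos - CHUNK)
            f.seek(start)
            chunk = f.read(pos - start)
            nl = chunk.rfind(b"\n")
            if nl != -1:
                return start + nl + 1   # the line begins just after the newline
            pos = start
        return 0                         # hit the beginning of the file

    # usage sketch:
    # with open("huge.txt", "rb") as f:
    #     line_start = find_line_start(f, 500_000_000)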
You may want to give special consideration to what to do about extremely long lines. Since there is no upper limit on how long a line can be, finding the beginning (or end) of a line is an unbounded task; I believe everything else you need for a screen-editor display should be local.
Finally, if you want a general text editor, you need to figure out what you're going to do when you want to save a file in which you've inserted/deleted things. The straightforward thing is to rewrite the file; however, this is obviously going to take longer with a huge file. You can expect the user to run into problems if there is not enough room for a modified copy, so at the very least, you will want to check to make sure there is enough room on the filesystem.
#comingstorm is basically right. For display, you start at the cursor and scan backwards until you're sure you're past the top of the screen. Then you scan backwards to a line end, assuming you can identify a line end scanning backwards. Now you scan forwards, calculating and saving screen line start positions until you've gone far enough. Finally, you pick the line you want to start displaying on and off you go.
For simple text this can be done on an archaic processor fast enough to redraw a memory mapped video display every keystroke. [I invented this technology 30 years ago]. The right way to do this is to fix the cursor in the middle line of the screen.
For actually modifying files, you might look at using Gnu's ropes. A rope is basically a linked list of buffers. The idea is that all local edits can be done in just one small buffer, occasionally adding a new buffer, and occasionally merging adjacent buffers.
I would consider combining this technology with differential storage: the kind of thing all modern source control systems do. You basically have to use this kind of transaction based editing if you want to implement the undo function.
The key to this is invertible transactions, i.e. one which contains enough information to be applied backwards to undo what it did when applied forwards. The core editor transaction is:
at pos p replace old with new
which has inverse
at pos p replace new with old
This handles insert (old is empty) and delete (new is empty) as well as replace. Given a transaction list, you can undo in-place modifications to a string by applying the inverse transactions in reverse order.
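A minimal sketch of that transaction, using a plain Python string as the buffer for clarity (a real editor would use a rope or gap buffer as described above):

    from dataclasses import dataclass

    @dataclass
    class Replace:
        pos: int   # offset in the buffer
        old: str   # text being removed ("" for a pure insert)
        new: str   # text being added   ("" for a pure delete)

        def apply(self, text):
            assert text[self.pos:self.pos + len(self.old)] == self.old
            return text[:self.pos] + self.new + text[self.pos + len(self.old):]

        def inverse(self):
            return Replace(self.pos, self.new, self.old)

    # usage: apply edits, then undo by applying inverses in reverse order
    text = "hello world"
    edits = [Replace(6, "world", "there"), Replace(0, "hello", "hi")]
    for e in edits:
        text = e.apply(text)            # "hi there"
    for e in reversed(edits):
        text = e.inverse().apply(text)  # back to "hello world"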
Now you use the old checkpointing concept: you store a fairly recent in-place modified image of the file together with some recent transactions that haven't been applied yet. To display, you apply the transactions on the fly. To undo, you just throw away some transactions. Occasionally, you actually apply the transactions, making a "checkpoint" image. This speeds up the display, at the cost of making the undo slower.
Finally: to rewrite a huge sequential text file, you would normally rewrite the whole text, which is horrible. If you can cheat a bit, allow arbitrary NUL (0) characters in the text, and have access to the virtual memory system's page manager and low-level disk access, you can do much better by keeping all the unchanged pages of text and just reorganising them: in other words, the ropes idea on disk.
For an open source project I have I am writing an abstraction layer on top of the filesystem.
This layer allows me to attach metadata and relationships to each file.
I would like the layer to handle file renames gracefully and maintain the metadata if a file is renamed / moved or copied.
To do this I will need a mechanism for calculating the identity of a file. The obvious solution is to calculate an SHA1 hash for each file and then assign metadata against that hash. But ... that is really expensive, especially for movies.
So, I have been thinking of an algorithm that though not 100% correct will be right the vast majority of the time, and is cheap.
One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.
Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.
I need this algorithm to work for very large files (1 GB and up) as well as for tiny files (around 5 KB).
EDIT
I need this algorithm to work on NTFS and on all SMB shares (Linux- or Windows-based), and I would like it to support situations where a file is copied from one spot to another (two physical copies exist but are treated as one identity). I may even consider wanting this to work in situations where MP3s are re-tagged (the physical file is changed, so I may have an identity provider per file type).
EDIT 2
Related question: Algorithm for determining a file’s identity (Optimisation)
Bucketing, with multiple layers of comparison, should be the fastest and most scalable approach across the range of files you're discussing.
First level of indexing is just the length of the file.
Second level is hash. Below a certain size it is a whole-file hash. Beyond that, yes, I agree with your idea of a sampling algorithm. Issues that I think might affect the sampling speed:
To avoid hitting regularly spaced headers, which may be highly similar or identical, you need to step by a non-conforming amount, e.g. multiples of a prime or successive primes.
Avoid steps which might end up encountering regular record headers, so if you are getting the same value from your sample bytes despite different location, try adjusting the step by another prime.
Cope with anomalous files with large stretches of identical values, either because they are unencoded images or just filled with nulls.
Do the first 128k, another 128k at the 1mb mark, another 128k at the 10mb mark, another 128k at the 100mb mark, another 128k at the 1000mb mark, etc. As the file sizes get larger, and it becomes more likely that you'll be able to distinguish two files based on their size alone, you hash a smaller and smaller fraction of the data. Everything under 128k is taken care of completely.
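A rough sketch of that scheme, assuming SHA-1 and 128k blocks at the offsets suggested above (the details are arbitrary choices):

    import hashlib, os

    BLOCK = 128 * 1024
    OFFSETS_MB = [0, 1, 10, 100, 1000]  # block positions in megabytes, as suggested above

    def sparse_fingerprint(path):
        size = os.path.getsize(path)
        h = hashlib.sha1()
        h.update(str(size).encode())     # the length is the first-level key
        with open(path, "rb") as f:
            for mb in OFFSETS_MB:
                off = mb * 1024 * 1024
                if off >= size:
                    break
                f.seek(off)
                h.update(f.read(BLOCK))  # a short read near EOF is fine
        return h.hexdigest()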
Believe it or not, I use the ticks of the last write time for the file. It is as cheap as it gets and I have yet to see a clash between different files.
If you can drop the Linux share requirement and confine yourself to NTFS, then NTFS Alternate Data Streams will be a perfect solution that:
doesn't require any kind of hashing;
survives renames; and
survives moves (even between different NTFS volumes).
You can read more about it here. Basically you just append a colon and a name for your stream (e.g. ":meta") and write whatever you like to it. So if you have a directory "D:\Movies\Terminator", write your metadata using normal file I/O to "D:\Movies\Terminator:meta". You can do the same if you want to save the metadata for a specific file (as opposed to a whole folder).
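For example (a sketch only; the stream name ":meta" and the JSON payload are arbitrary choices, and this works only on NTFS volumes), the stream is opened with ordinary file I/O by appending the colon and name:

    import json

    path = r"D:\Movies\Terminator.avi"   # hypothetical file on an NTFS volume
    meta = {"rating": 5, "tags": ["scifi", "action"]}

    # write metadata into the alternate stream "<path>:meta"
    with open(path + ":meta", "w", encoding="utf-8") as f:
        json.dump(meta, f)

    # read it back later; the stream travels with the file on renames and
    # moves within NTFS
    with open(path + ":meta", encoding="utf-8") as f:
        print(json.load(f))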
If you'd prefer to store your metadata somewhere else and just be able to detect moves/renames on the same NTFS volume, you can use the GetFileInformationByHandle API call (see MSDN /en-us/library/aa364952(VS.85).aspx) to get the unique ID of the folder (combine VolumeSerialNumber and FileIndex members). This ID will not change if the file/folder is moved/renamed on the same volume.
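A short sketch of reading that ID from Python: my understanding (treat it as an assumption) is that on Windows os.stat fills st_dev and st_ino from the volume serial number and file index returned by GetFileInformationByHandle; calling the API directly via ctypes gives you the raw BY_HANDLE_FILE_INFORMATION fields if you prefer.

    import os

    def file_identity(path):
        """(volume, file index) pair; stays the same across renames and moves
        on the same NTFS volume."""
        st = os.stat(path)
        return (st.st_dev, st.st_ino)

    # usage: key your metadata store by file_identity(path); after a rename on
    # the same volume the key is unchanged, so the metadata is still found.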
How about storing some random integers ri, and looking up the bytes at (ri mod n), where n is the size of the file? For files with headers, you can skip the header first and then do this process on the remaining bytes.
If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would fail to notice with probability 1/e ~ 37%; increasing the number of bytes you look at makes this probability go down exponentially.
The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren't susceptible to some of the problems with other sequences (e.g. happening to look at every 256-th byte of a file format where that byte is required to be 0 or something).
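A rough sketch of the random-sampling idea, with a fixed seed so every file is probed at the same relative positions (in practice you would grab chunks rather than single bytes, as the advice below suggests):

    import hashlib, os, random

    K = 100                  # number of sampled bytes (see the 1/e argument above)
    rng = random.Random(42)  # fixed seed: every file is sampled at the same fractions
    FRACTIONS = [rng.random() for _ in range(K)]

    def sampled_hash(path):
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())
        if size:
            with open(path, "rb") as f:
                for frac in FRACTIONS:
                    f.seek(int(frac * size))   # same relative positions for every file
                    h.update(f.read(1))
        return h.hexdigest()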
Some more advice:
Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
At least weigh the cost/benefit of something like a CRC of the entire file. It's not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
Well, first you need to look more deeply into how filesystems work. Which filesystems will you be working with? Most filesystems support things like hard links and soft links and therefore "filename" information is not necessarily stored in the metadata of the file itself.
Actually, this is the whole point of a stackable layered filesystem: you can extend it in various ways, say to support compression or encryption. This is what "vnodes" are all about. You could do this in several ways, and some of it is very dependent on the platform you are looking at. It is much simpler on UNIX/Linux systems that use a VFS concept. You could implement your own layer on top of ext3, for instance, or what have you.
After reading your edits, a couple more things. File systems already do this, as mentioned before, using things like inodes. Hashing is probably going to be a bad idea, not just because it is expensive but because two or more preimages can share the same image; that is to say, two entirely different files can have the same hash value. I think what you really want to do is exploit the metadata that the filesystem already exposes. This would be simpler on an open source system, of course. :)
Which bytes should I choose for the sample?
I think I would try sampling at offsets that follow a progression like the Fibonacci numbers. These are easy to calculate, and they have a diminishing density: small files would get a higher sample ratio than big files, yet the sample would still cover spots across the whole file.
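A tiny sketch of that sampling pattern (SHA-1 and single-byte reads are arbitrary choices): the Fibonacci offsets thin out as the file grows, so small files are covered densely and large files sparsely.

    import hashlib, os

    def fib_offsets(size):
        a, b = 1, 2
        while a < size:
            yield a
            a, b = b, a + b   # offsets thin out: 1, 2, 3, 5, 8, 13, ...

    def fib_sample_hash(path):
        size = os.path.getsize(path)
        h = hashlib.sha1(str(size).encode())
        with open(path, "rb") as f:
            for off in fib_offsets(size):
                f.seek(off)
                h.update(f.read(1))
        return h.hexdigest()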
This work sounds like it could be more effectively implemented at the filesystem level or with some loose approximation of a version control system (both?).
To address the original question, you could keep a database of (file size, bytes hashed, hash) for each file and try to minimize the number of bytes hashed for each file size. Whenever you detect a collision you either have an identical file, or you increase the hash length to go just past the first difference.
There are undoubtedly optimizations to be made, and CPU vs. I/O tradeoffs as well, but it's a good start for something that won't have false positives.
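A rough sketch of that collision-driven scheme, with an in-memory dict standing in for the database and a prefix-doubling policy that is my own choice:

    import hashlib, os

    START = 4096  # initial number of bytes hashed per file (arbitrary)

    def prefix_hash(path, nbytes):
        with open(path, "rb") as f:
            return hashlib.sha1(f.read(nbytes)).hexdigest()

    # path -> (size, bytes_hashed, digest); a real system would persist this
    index = {}

    def add(path):
        size = os.path.getsize(path)
        n, digest = START, prefix_hash(path, START)
        for other, (osize, on, odigest) in list(index.items()):
            if osize != size:
                continue                          # different length: different file
            m = max(n, on)
            d_new, d_old = prefix_hash(path, m), prefix_hash(other, m)
            while d_new == d_old and m < size:    # still colliding: hash a longer prefix
                m = min(size, m * 2)
                d_new, d_old = prefix_hash(path, m), prefix_hash(other, m)
            # if d_new == d_old at this point, the two files are byte-identical
            n, digest = m, d_new
            index[other] = (osize, m, d_old)
        index[path] = (size, n, digest)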
I have a list representing products which are more or less the same. For instance, in the list below, they are all Seagate hard drives.
Seagate Hard Drive 500Go
Seagate Hard Drive 120Go for laptop
Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive
New and shinny 500Go hard drive from Seagate
Seagate Barracuda 7200.12
Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail
To a human, hard drives 3 and 5 are the same. We could go a little further and suppose that products 1, 3, 4, and 5 are the same, and put products 2 and 6 into other categories.
We have a huge list of products that I would like to classify. Does anybody have an idea of what the best algorithm to do such a thing would be? Any suggestions?
I thought of a Bayesian classifier, but I am not sure if it is the best choice. Any help would be appreciated!
Thanks.
You need at least two components:
First, you need something that does "feature" extraction, i.e. that takes your items and extracts the relevant information. For example, "new and shinny" is not as relevant as "500Go hard drive" and "seagate". A (very) simple approach would consist of a simple heuristic extracting manufacturers, technology names like "USB2.0" and patterns like "GB", "RPM" from each item.
You then end up with a set of features for each item. Some machine learning people like to put this into a "feature vector", i.e. a vector with one entry per feature, set to 0 or 1 depending on whether the feature exists or not. This is your data representation. On these vectors you can then do distance comparisons.
Note that you might end up with a vector of thousands of entries. Even then, you still have to cluster your results.
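A minimal sketch of that representation, with a hand-rolled feature list and binary vectors (the vocabulary and the normalisation rules are made up for illustration; a real system would use a proper tokenizer and sparse vectors):

    import re

    VOCAB = ["seagate", "barracuda", "500gb", "7200rpm", "sata", "usb2.0", "external", "laptop"]

    def extract_features(title):
        t = title.lower().replace("500go", "500gb")      # crude unit normalisation
        t = re.sub(r"(\d+)\s*rpm", r"\1rpm", t)
        return {w for w in VOCAB if w in t}

    def to_vector(features):
        return [1 if w in features else 0 for w in VOCAB]

    a = to_vector(extract_features("Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive"))
    b = to_vector(extract_features("New and shinny 500Go hard drive from Seagate"))
    distance = sum(x != y for x, y in zip(a, b))         # simple Hamming distance between items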
Possibly useful Wikipedia articles:
Feature Extraction
Nearest Neighbour Search
One of the problems you will encounter is deciding on nearest neighbours for non-linear or non-ordered attributes. I'm building on Manuel's entry here.
One problem you will have is to decide on proximity of (1) Seagate 500Go, (2) Seagate Hard Drive 120Go for laptop, and (3) Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail:
Is 1 closer to 2 or to 3? Do the differences justify different categories?
A human would say that 3 is between 1 and 2, as an external HD can be used with both kinds of machines. Which means that if somebody searches for an HD for his desktop and broadens the scope of selection to include alternatives, external HDs will be shown too, but not laptop HDs. Probably SSDs, USB memory sticks, and CD/DVD drives will even show up before laptop drives as the scope is enlarged.
Possible solution:
Present users with pairs of attributes and let them weight proximity. Give them a scale to tell you how close together certain attributes are. Broadening the scope of a selection will then use this scale as a distance function on this attribute.
To actually classify a product, you could use something like an "enhanced neural network" with a blackboard. (This is just a metaphor to get you thinking in the right direction, not a strict use of the terms.)
Imagine a set of objects that are connected through listeners or events (just like neurons and synapses). Each object has a set of patterns and tests the input against these patterns.
An example:
One object tests for ("seagate"|"connor"|"maxtor"|"quantum"| ...)
Another object tests for [:digit:]*(" ")?("gb"|"mb")
Another object tests for [:digit:]*(" ")?"rpm"
All these objects connect to another object that, if certain combinations of them fire, categorizes the input as a hard drive. The individual objects themselves would enter certain characterizations onto the blackboard (a common writing area for saying things about the input) such as manufacturer, capacity, or speed.
So the neurons do not fire based on a threshold, but on recognition of a pattern. Many of these neurons can work on the blackboard in parallel and even correct categorizations made by other neurons (maybe introducing certainties?).
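A rough sketch of those pattern objects in Python, with each "neuron" as a regular expression that writes what it recognised onto a shared blackboard dict (the patterns are the ones listed above; the firing rule at the end is an arbitrary example):

    import re

    # each "neuron" is a (field, compiled pattern) pair that tests the input
    NEURONS = [
        ("manufacturer", re.compile(r"\b(seagate|connor|maxtor|quantum)\b", re.I)),
        ("capacity",     re.compile(r"\b(\d+)\s*(gb|mb|go)\b", re.I)),
        ("speed",        re.compile(r"\b(\d+)\s*rpm\b", re.I)),
    ]

    def classify(item):
        blackboard = {}                  # shared area the objects write into
        for field, pattern in NEURONS:
            m = pattern.search(item)
            if m:
                blackboard[field] = m.group(0)
        # a downstream object fires "hard drive" only for certain combinations
        if "manufacturer" in blackboard and ("capacity" in blackboard or "speed" in blackboard):
            blackboard["category"] = "hard drive"
        return blackboard

    print(classify("Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive"))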
I used something like this in a prototype for a product used to classify products according to UNSPSC and was able to get 97% correct classification on car parts.
There's no easy solution for this kind of problem. Especially if your list is really large (millions of items). Maybe those two papers can point you in the right direction:
http://www.cs.utexas.edu/users/ml/papers/normalization-icdm-05.pdf
http://www.ismll.uni-hildesheim.de/pub/pdfs/Rendle_SchmidtThieme2006-Object_Identification_with_Constraints.pdf
MALLET has implementations of CRFs and MaxEnt that can probably do the job well. As someone said earlier you'll need to extract the features first and then feed them into your classifier.
To be honest, this seems more like a Record Linkage problem than a classification problem. You don't know ahead of time what all of the classes are, right? But you do want to figure out which product names refer to the same products, and which refer to different ones?
First I'd use a CountVectorizer to look at the vocabulary generated. There'd be words like 'from', 'laptop', 'fast', 'silver', etc. You can use stop words to discard such words, which give us no information. I'd also go ahead and discard 'hard', 'drive', 'hard drive', etc., because I know this is a list of hard drives, so they provide no information. Then we'd have a list of entries like
Seagate 500Go
Seagate 120Go
Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s
500Go Seagate etc.
You can use a list of feature rules: things that end with RPM are likely to give RPM information, and the same goes for strings ending with Mb/s or Gb/s. Then I'd discard alphanumeric tokens like '1234FBA5235', which are most likely model numbers and won't give us much information. Now, if you already know the hard drive brands that appear in your list, like 'Seagate' or 'Kingston', you can use string similarity or simply check whether they are present in the given sentence. Once that's done you can use clustering to group similar objects together: objects with similar RPM, GB, Gb/s, and brand-name features will be clustered together. Again, if you use something like KMeans you'll have to figure out the best value of K, which takes some manual work. What you could do is use a scatter plot and eyeball which value of K classifies the data best.
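A minimal sklearn sketch of that pipeline, using the titles from the question (the stop-word list and K=3 are arbitrary choices; in practice you would tune K as described above):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.cluster import KMeans

    titles = [
        "Seagate Hard Drive 500Go",
        "Seagate Hard Drive 120Go for laptop",
        "Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s Hard Drive",
        "New and shinny 500Go hard drive from Seagate",
        "Seagate Barracuda 7200.12",
        "Seagate FreeAgent Desk 500GB External Hard Drive Silver 7200RPM USB2.0 Retail",
    ]

    # discard uninformative words, as suggested above
    stop_words = ["hard", "drive", "from", "for", "new", "and", "shinny"]
    vectorizer = CountVectorizer(stop_words=stop_words, lowercase=True)
    X = vectorizer.fit_transform(titles)

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    for title, label in zip(titles, kmeans.labels_):
        print(label, title)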
The problem with the above approach is that if you don't know the list of brands beforehand, you'd be in trouble. In that case I'd use a Bayesian classifier to look at every sentence and get the probability of it being about a hard drive brand. I'd look for two things:
Look at the data, most of the times the sentence would explicitly mention the word 'hard drive' then I'd know it's definitely talking about a hard drive. Chances for something like 'Mercedes Benz hard drive' are slim.
This is a bit laborious, but I'd write a Python web scraper for Amazon (or, if you can't write one, just Google for the most-used hard drive brands and create a list). That would give me entries like 'Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s'; then, for every sentence, I'd use something like Naive Bayes to give me the probability that it mentions a brand (see the sketch below). sklearn comes in pretty handy for this kind of thing.
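A rough sklearn sketch of that last step, with a tiny made-up training set labelled hard-drive-or-not (in practice the scraped list would supply the positive examples):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # tiny illustrative training set; a scraped list would be much larger
    train = [
        ("Seagate Barracuda 7200.12 ST3500418AS 500GB 7200 RPM SATA 3.0Gb/s", 1),
        ("Western Digital Caviar Blue 1TB 7200 RPM SATA hard drive", 1),
        ("Kingston DataTraveler 16GB USB flash drive", 0),
        ("Mercedes Benz C-Class floor mats", 0),
    ]
    texts, labels = zip(*train)

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)

    # probability that a new title refers to a hard drive
    print(model.predict_proba(["New and shinny 500Go hard drive from Seagate"])[0][1])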