I am not a programmer but a law student, and I am currently researching a project involving artificial intelligence and copyright law. I am looking at whether the learning process of a machine learning algorithm may be copyright infringement if a protected work is used by the algorithm. However, this turns on whether or not the algorithm copies the work or does something else.
Can anyone tell me whether machine learning algorithms typically copy the data (picture/text/video/etc.) they are analysing (even if only briefly), or whether they can obtain the required information from the data through other methods that do not require copying (akin to a human looking at a stop sign and recognising it as a stop sign without necessarily copying the image)?
Apologies for my lack of knowledge and I'm sorry if any of my explanation flies in the face of any established machine learning knowledge. As I said, I am merely a lowly law student.
Thanks in advance!
A few machine learning algorithms actually retain a copy of the training set, for example k-nearest neighbours. See https://en.wikipedia.org/wiki/Instance-based_learning. Not all do this; in fact it is usually regarded as a disadvantage, because the training set can be large.
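To make that concrete, here is a minimal, hypothetical sketch of a nearest-neighbour classifier (not any particular library's implementation): "training" is literally storing a copy of the data, and every prediction reads from that stored copy.

```python
import numpy as np

class NearestNeighbourClassifier:
    """Minimal 1-nearest-neighbour classifier: training stores the data verbatim."""

    def fit(self, X, y):
        # "Learning" here is nothing more than keeping a copy of the training set.
        self.X_train = np.asarray(X, dtype=float).copy()
        self.y_train = np.asarray(y).copy()
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        labels = []
        for x in X:
            # Every prediction compares the query against the stored copies.
            distances = np.linalg.norm(self.X_train - x, axis=1)
            labels.append(self.y_train[np.argmin(distances)])
        return np.array(labels)

# Toy usage with made-up points
clf = NearestNeighbourClassifier().fit([[0, 0], [1, 1]], ["a", "b"])
print(clf.predict([[0.1, 0.2]]))  # -> ['a']
```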
Also, computers are built around a number of different stores of data of different sizes and speeds. They usually copy data they are working on to small, fast stores while they are working on it, because the larger stores take much longer to read and write. One of many possible examples of this has been the subject of legal wrangling of which I know little; see e.g. https://law.stackexchange.com/questions/2223/why-does-browser-cache-not-count-as-copyright-infringement and others on browser-cache copyright. If a computer has added two numbers, it will certainly have stored them in its internal memory. It is very likely that it will have stored at least one of them in what are called internal registers: very small, very fast memory intended for storing numbers to be worked on.
If a computer (or any other piece of electronic equipment) has been used to process classified data, it is usual to treat it as classified from then on, making the worst case assumption that it might have retained some copy of any of the data it has been used to process, even if retrieving that data from it would in practice require a great deal of specialised expertise with specialised equipment.
Typically, no. The first thing that typical ML algorithms do with their inputs is not to copy or store them, but to compute something based on them and then forget the original. That is a fair description of what neural networks, regression algorithms and statistical methods do. There is no 'eidetic memory' in mainstream ML. I imagine anything doing that would be marketed as a database or a full-text indexing engine or some such.
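A rough illustration of "compute something and forget the original", using a made-up regression task: fitting a straight line retains only two numbers (slope and intercept), however many data points went in.

```python
import numpy as np

# Hypothetical training data: thousands of (x, y) pairs.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10_000)
y = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=x.size)

# Ordinary least-squares fit: the only thing retained is (slope, intercept).
slope, intercept = np.polyfit(x, y, deg=1)

del x, y  # the original data can be discarded; the "model" is just two floats
print(f"model: y = {slope:.3f} * x + {intercept:.3f}")
```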
But how will you present your data to an algorithm running on a machine without first copying the data to that machine?
Does a machine learning algorithm copy the data it learns from?
There are many different machine learning algorithms. If you are talking about k nearest neighbor (k-NN) then the answer is simply yes.
However, k-NN is rarely used. Most (all?) other models are not that simple. Usually, a machine learning developer wants the training data to be compressed (a lot, lossily) by the model, for several reasons: (1) the amount of training data is large (many GB); (2) generalization might be better if the training data is compressed; (3) inference on new examples might take really long if the data is not compressed. (By "compress", I mean that the relevant information for the task is extracted and irrelevant data is removed, not compression in the usual sense.)
For models other than k-NN, the answer is more complicated. It depends on what you consider a "copy". For example, from artificial neural networks (especially the sub-type of convolutional neural networks, CNNs for short) the training data can partially be restored. Those models are state of the art for many (all?) computer vision tasks.
I could not find papers which show that you can (partially) restore / extract training data from CNNs with the focus on possible privacy / copyright problems, but I'm ~70% certain I have read an abstract about this problem. I think I've also heard a talk where a researcher said this was a problem when building a detector for child pornography. However, I don't think that was recorded or anything published about this.
Here are two papers which indicate that restoring training data from CNNs might be possible:
Understanding deep learning requires rethinking generalization
Visualizing Deep Convolutional Neural Networks Using Natural Pre-Images and the Zeiler & Fergus paper
It depends on what you mean by the word "copy". If you run any program, it will copy the data from the hard disk to RAM for processing. I am assuming this is not what you meant.
So if the copyrighted data is already on a particular machine and you run your machine learning algorithms on that data, there is no reason for the algorithm to copy the data off the machine.
On the other hand, if you use a cloud ML service (AWS/IBM Bluemix/Azure), then you need to upload the data to the cloud before you can run ML algorithms. That would mean you are copying the data.
Hopefully this sheds more light!
Lowly ML student
Some models do copy the data set, such as k-NN. Unfortunately, such algorithms are not commonly used in practice because they can't be scaled to large data sets.
Most ML algorithms use the data set to identify a pattern; that's why pattern recognition is another name for machine learning. The pattern is almost always much smaller (in terms of memory, variables, etc.) than the original data set.
My question is about design and possible suggestions for the following scenario:
I am writing a 3d visualizer. For my renderable objects I would like to store the minimum data possible (so quaternions are naturally nice for rotation).
At some point I must extract a matrix for rendering, which requires computation and temporary storage on every frame update (even for objects that do not change spatially).
Given that many objects remain static and don't need to be rotated locally, would it make sense to store the matrix instead and thereby avoid the computation for each object each frame? Is there any best-practice approach to this, perhaps from a game engine design point of view?
I am currently a bit torn between storing the two extremes of either position + quaternion or a 4x3/4x4 matrix. Looking at openframeworks (not necessarily trying to achieve the same goal as me), they seem to do a hybrid where they store a quaternion AND a matrix (the matrix always reflects the quaternion), so it's always ready when needed but needs to be updated along with every change to the quaternion.
More compact storage requires only 3 scalars, so Euler angles or exponential maps (Rodrigues) can be used. Quaternions are a good compromise between conversion-to-matrix speed and compactness.
From a design point of view, there is a good rule: "make all design decisions as LATE as possible". In your case, just encapsulate (isolate) the rotation (transformation) representation, so that you are able, in the future, to change the physical storage of the data in its different states (file, memory, rendering and more). It also enables different platform optimisations, such as keeping the data on the GPU or the CPU, and more.
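As a hedged sketch of that encapsulation idea (written in Python for brevity; the same pattern ports directly to C++/openframeworks), the class below stores position plus unit quaternion and rebuilds a cached 4x4 matrix only when the rotation has actually changed, so static objects pay the conversion cost once.

```python
import numpy as np

class Transform:
    """Stores position + unit quaternion; the 4x4 matrix is a lazily rebuilt cache."""

    def __init__(self, position=(0.0, 0.0, 0.0), quaternion=(1.0, 0.0, 0.0, 0.0)):
        self._position = np.array(position, dtype=float)
        self._quat = np.array(quaternion, dtype=float)   # (w, x, y, z), unit length
        self._matrix = None                               # cached 4x4, rebuilt on demand

    def set_rotation(self, quaternion):
        self._quat = np.array(quaternion, dtype=float)
        self._matrix = None                               # invalidate the cache

    def matrix(self):
        if self._matrix is None:                          # only recompute when dirty
            w, x, y, z = self._quat
            m = np.eye(4)
            m[:3, :3] = [
                [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
                [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
                [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
            ]
            m[:3, 3] = self._position
            self._matrix = m
        return self._matrix
```

Whether you cache the matrix or recompute it each frame then becomes a local implementation detail that you can change later without touching the rest of the renderer.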
Been there.
First: keep in mind the omnipresent struggle of time against space (in computer science, processing time against memory requirements).
You said that you want to keep the minimum information possible at first (space), and next talked about some temporary matrix reflecting the quaternions, which is more of a time worry.
If you accept a tip, I would go for the matrices. They are generally the performance-wise standard for 3D graphics, and their size easily becomes irrelevant next to the object data itself.
Just to have an idea: on most GPUs, transforming a vector by the identity (no change) is actually faster than checking whether it needs transformation and then doing nothing.
As for engines, I can't think of one that does not apply the transformations for every vertex every frame. Even if the objects stay in place, their positions have to go through the projection and view matrices.
(does this answer? Maybe I got you wrong)
I'm interested in what happens in the hardware (hard disk) when a file is deleted.
I mean, I'm looking for some information about the process in the hardware when the user decides to erase all the files from the recycle bin.
I'm interested in Windows OS. I just found technical information like this: Data erasure http://en.wikipedia.org/wiki/Data_erasure
If you have any doubts or questions, please let me know.
The recycle bin is not related to the hardware. It is just a special folder. When you move a file/folder into the recycle bin, Windows just calls ZwSetInformationFile to 'rename' (move) it. As you know, while a file is open you can't delete/remove it directly, but you can rename it; the recycle bin works like this. Then, when you empty the recycle bin, the files/folders are just removed from the file system. (Actually, only a flag is set on the file in the file system; the data is not erased directly.)
That's all.
Thanks.
Each file system has a way of removing or releasing a file. The sectors on disk do not get wiped; eventually they may get reused for some other file, and over time the old, deleted file is no longer there.
If you read even the first sentence of the page you linked you see "a software-based method". The hardware knows nothing of files or file systems, and definitely not of files with an abstraction like a recycle bin (the directory entry is simply moved to some other directory; the file is not moved or deleted when it goes into the recycle bin). The hardware deals with spinning motors, moving heads, finding/reading/writing sectors. The concepts of partitions, files, deleted or not, are all in the software realm; the hardware does not know or care.
The article you referenced has to do with the media. Think about writing something with pencil on paper, then erasing it and writing something else. The paper has been compressed by the pencil both times; with the right tools you can probably figure out some or all of the original text from the indentations in the paper. If you want to sell or donate or throw out a computer, how do you ensure that someone doesn't extract your bank account or other sensitive information? With that piece of paper, well, you could burn it and grind up the chunks of ashes (you can't sell that paper for money at that point). Or you could, in a very chaotic and random way, scribble over the parts where you have written, such that the indentations in the paper from your original and second writing are buried in the noise. In addition to random scribbles you also write words, real words or letters but nothing sensitive, just to throw off any attempt to distinguish scribbles from real letters.

The hard disk hardware is doing nothing special here: it is spinning motors, moving heads, seeking to sectors and reading and writing them, nothing special. What the software is doing is trying to make those random scribbles look just enough like real information that the real information does not stand out in the noise. You have to understand a bit about the encoding of the data: a 0x12345678 value does not use those exact bits when stored on the hard disk; to make the read-back more reliable the real bits are translated to different bits, and the reverse translation happens on the way back. So you want to choose chaotic patterns that, when laid down on the disk, actually exercise all the points on the disk rather than hitting some and skipping others, ideally causing each location on the disk (for lack of a better term) to be written with both ones and zeros many times.
Interesting related history lesson, if you bear with me. There were these things called floppy disks: http://en.wikipedia.org/wiki/Floppy_disk. There was a long history, but in particular, within the same physical size of disk the density changed (and this happened more than once). The older technology did what it could: it laid down sectors using "bits", for lack of a better term, as small as it could. Later, technology got better and could lay down bits of less than half the size. You could take a disk written in the old days and read it on the new drive. You could overwrite files on that disk with the new drive and reuse the disk (with the new drive). You could take a new disk, write files on the new drive and read them on the old drive. But if you took an old disk with files written by the old drive, then deleted them and wrote new files on the new drive, you couldn't necessarily read those files on the old drive; the old drive might actually see the old files, or the new files, or just fail to read anything. To reuse that disk from the new drive on the old drive, you had to format the disk on the old drive, then write files on the new drive, then read on the old drive.

Why? On a whiteboard, write some words in block letters, big letters one foot tall. Take the eraser and erase only a two-inch path through the middle, then write some words two inches tall. Can you read both? Depends on what you wrote, but often, yes you can. On a clean whiteboard, write two-inch letters; can you read the words you wrote? Yep. The newer drives always had a smaller focus: they didn't write big fat bits when the disk was formatted at the older, lower density and small bits when writing on a disk formatted at the higher density; they always wrote the small-sized bits. When reading the old disks they read the bits okay despite the huge size, but erasing and re-writing was like the big letters on the whiteboard: they only erased a path through the middle and wrote in that small path. The new drives could only read along the narrow path; they could only read the two-inch letters and didn't see the big one-foot letters at all. The old drive saw both the old one-foot letters and the new two-inch ones, and depending on which one had the dominant bits it would read that, or often just fail to read either.
These disk erasures want to do the same kind of thing. Every time you spin the media and move the heads it's not exactly perfect; there is some error. You are not changing the charge on the exact same set of molecules on the media every time; there is a bit of a wiggle as you drive down that road. Take a road, for example. The lanes on a road are wider than the car. Suppose you had a paintbrush the width of the car and painted a line the first time you drove down the road, and now you want to paint over that line so that nobody can determine what your original secret color was. You need to drive that road many, many times (no cheating: you can't hug one side of the road one time and the other side the other time; every time you need to pretend to be hardware and do your best to stay as close to the middle as you can, because as hardware you don't know what the goal of the software is) to allow the error in position on the first pass to be covered by the error in the later passes. You want to use a different color of paint on each pass so that eventually the edges of the painted stripe are a rainbow of colors, making it impossible to tell which one was the original color. Same here: beat up the hard drive with many, many passes of writes, using ever-changing, different data on each pass, until the original charge from the original write cannot be isolated and interpreted even on the edges.
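For what it's worth, here is a crude sketch of the software side of such an erasure, assuming a hypothetical file path and an ordinary file system that overwrites data in place (which, as the next paragraph notes, is exactly what an SSD's wear leveling does not guarantee):

```python
import os

def overwrite_passes(path, passes=3):
    """Crude multi-pass overwrite: fill the file with fresh random bytes, pass after pass."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(os.urandom(size))   # different chaotic data on every pass
            f.flush()
            os.fsync(f.fileno())        # push the writes past the OS cache to the device

# overwrite_passes("secret.txt")  # hypothetical file; delete it afterwards as usual
```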
Note that a solid-state, flash-based drive works differently: there is likely a wear leveling scheme to prevent some portions of the flash from wearing out before others. You might get away with the same software-level solution (the software doesn't necessarily know whether it is an SSD or a mechanical drive), or it may not work and a new solution is needed. The problem with an SSD is that it is flash based, so there is a limited number of write cycles before you wear it out; pounding on it with lots of chaotic writes just wears it out.
What does any of this have to do with Windows and the recycle bin? Absolutely nothing. Sending something to the recycle bin is not much different from copying it to another directory; nothing is destroyed. When you delete the file, most of it is still there, intact; only the directory entry and perhaps some sort of file allocation table, something that distinguishes free sectors from used ones, is changed. The sectors themselves do not necessarily change. Your data is there, and it is really easy for someone with the right tools to read all of your "deleted" files (shortly after deleting them).
If you don't want people to see your old data, remove the drive, open it, remove the platters and grind them into dust. That is the only guaranteed method of destroying your sensitive information.
There is no "process in hardware". Emptying the recycle bin just performs a bunch of file delete operations, which means marking some blocks as no longer allocated and removing directory entries from directories. At the hardware level, these are just ordinary disk writes. The data isn't destroyed in any way. For more details, look up a reference for the filesystem you're using (e.g. NTFS).
Update
I asked this question quite a while ago now, and I was curious if anything like this has been developed since I asked the question?
I don't even know if there is a term for this kind of algorithm, and I guess there won't be if nobody has invented it yet. However it also makes googling for this a bit hard. Does anybody know if there is a term for this algorithm/principle yet?
This is an idea I have been thinking about, but I do not quite know how to solve it. I would like to know if any solutions like this exists out there, or if you guys have any idea how this could be implemented.
Steganography
Steganography is basically the art of hiding messages. In modern days we do this digitally by e.g. modifying the least significant bits in an image such as the one below. Thus for every pixel, and for every colour component of that pixel, we might be able to hide a bit or two.
This alteration is not visible to the naked eye, but analysing the least significant bits might reveal patterns that expose the existence and possibly the content of a hidden message. To counter this we simply encrypt the message before embedding it in the image, which keeps the message safe and also helps prevent discovery of the existence of a hidden message.
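For illustration, a minimal sketch of that LSB idea (one message bit per colour byte, on a random array standing in for a real image; a real system would encrypt the bits first, as described above):

```python
import numpy as np

def embed_bits(pixels, bits):
    """Hide a bit string in the least significant bit of each colour byte."""
    flat = pixels.flatten()
    if len(bits) > flat.size:
        raise ValueError("message too long for this image")
    for i, bit in enumerate(bits):
        flat[i] = (flat[i] & 0xFE) | bit        # clear the LSB, then set it to the message bit
    return flat.reshape(pixels.shape)

def extract_bits(pixels, n_bits):
    """Read the first n_bits least significant bits back out."""
    return [int(b) & 1 for b in pixels.flatten()[:n_bits]]

# Toy usage on a random "image"
image = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
message = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_bits(image, message)
assert extract_bits(stego, len(message)) == message
```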
Thus, in principle, steganography provides the following:
Hiding an encrypted message in any kind of media data (images, music, video, etc.).
Complete deniability of the existence of a hidden message without the correct key.
Extraction of the hidden message with the correct key.
[example image omitted] (source: cs.vu.nl)
Semacodes
Semacodes are a way of encoding data in a visual representation that may be printed, copied, and scanned easily. The Data Matrix shown below is an example of a semacode containing the famous Lorem Ipsum text. This is essentially a 2D barcode with a higher capacity than usual barcodes. Programs for generating semacodes are readily available, and ditto for software for reading them, especially for cell phones. Semacodes usually contain error-correcting codes, are generally very robust, and can be read even in very damaged condition.
Thus semacodes have the following properties:
Data encoding that may be printed and copied.
May be scanned and interpreted even in damaged (dirty) conditions, and generally a very robust encoding.
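Generators for such codes are indeed easy to come by. As a rough illustration (using a QR code rather than a Data Matrix, simply because the Python qrcode package is widely available; treat the exact parameters as an example), you can dial the error-correction level up so that roughly 30% of the symbol can be damaged and the data still read back:

```python
# Assumes the third-party "qrcode" package (pip install qrcode[pil]).
import qrcode

qr = qrcode.QRCode(
    error_correction=qrcode.constants.ERROR_CORRECT_H,  # highest level: ~30% of the symbol may be damaged
    box_size=10,
    border=4,
)
qr.add_data("Lorem ipsum dolor sit amet")
qr.make(fit=True)
qr.make_image(fill_color="black", back_color="white").save("lorem.png")
```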
Combining it
So my idea is to create something that combines these two, with all of the combined properties. This means it would have to:
Embed an encrypted message in any media, probably a scanned image.
The message should be extractable even if the image is printed and scanned, and even partly damaged.
The existence of an embedded message should be undetectable without the key used for encryption.
So, first of all I would like to know if any solutions, algorithms or research is available on this? Secondly I would like to hear any ideas/thoughts on how this might be done?
I really hope to get a good discussion going on the possibilities and feasibility of implementing something like this, and I am looking forward to reading your answers.
Update
Thanks for all the good input on this. I will probably work a bit more on this idea when I have more time. I am convinced it must be possible. Think about research in embedding watermarks in music and movies.
I imagine part of the robustness of a semacode to damage/dirt/obscuration is the high contrast between the two states of any "cell". The reader can still make a good guess as to the actual state, even with some distortion.
That sort of contrast is not available in a photographic image, and is the very reason why steganography works: the LSB bit-flipping has almost no visual effect on the image itself, while digital fidelity ensures that a non-visual system can still very accurately read the embedded data.
As the two applications are sort of at opposite ends of the analog/digital spectrum (semacodes are all about being decipherable by analog (visual) processing but live on paper, not in the digital domain; steganography is all about the bits in the file and cares nothing for the analog representation, whether light or sound or something else), I imagine a combination of the two will be extremely difficult, if not impossible.
Essentially what you're thinking of is being able to steganographically embed something in an image, print the image, make a colour photocopy of it, scan it in, and still be able to extract the embedded data.
I'm afraid I can't help, but if anyone achieves this, I'll be DAMN impressed! :)
It's not a complete answer, but you should look at watermarking. This technique addresses your first two goals (embeddable in a printed image and readable even from a partly damaged scan).
Part of watermarking's resilience to distortion and transcription errors (from going from digital to analog and back) comes from redundancy (e.g. repeating the data several times). That would make the watermark detectable even without a key. However, you might be able to use redundancy techniques that are more subtle, maybe something related to erasure coding or secret sharing.
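A crude sketch of that repetition idea (not a real watermarking scheme, just the redundancy principle): write each payload bit many times and take a majority vote on the way back, which survives a fair number of flipped bits.

```python
def encode_repeat(bits, copies=9):
    # Each payload bit is written `copies` times in a row.
    return [b for b in bits for _ in range(copies)]

def decode_repeat(channel_bits, copies=9):
    out = []
    for i in range(0, len(channel_bits), copies):
        chunk = channel_bits[i:i + copies]
        out.append(1 if sum(chunk) > len(chunk) // 2 else 0)  # majority vote per block
    return out

payload = [1, 0, 1, 1, 0, 0, 1, 0]
sent = encode_repeat(payload)
# Simulate transcription damage: the first 3 of every 9 copies get flipped.
received = [b ^ 1 if i % 9 < 3 else b for i, b in enumerate(sent)]
assert decode_repeat(received) == payload
```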
I know that's not a complete answer, but hopefully those leads will point you in the right direction!
What language/environment are you using? It shouldn't be that hard to write code that opens both the image and the semacode as bitmaps (the latter as a monochrome) and sets the lowest bit(s) of each byte of each pixel in the color image to the value of the corresponding pixel of the monochrome bitmap.
(optionally expand the semacode bitmap first to the same pixel-dimensions extending with white)
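Something like the following sketch, assuming both images are already loaded as NumPy arrays of the same width and height (the semacode as a 0/1 monochrome bitmap); note, as discussed above, that this bit-level layer would not survive printing and re-scanning:

```python
import numpy as np

def embed_semacode(cover_rgb, code_mono):
    """Write each semacode pixel (0 or 1) into the LSB of the cover image's red channel."""
    out = cover_rgb.copy()
    out[..., 0] = (out[..., 0] & 0xFE) | (code_mono & 1)
    return out

def extract_semacode(stego_rgb):
    """Recover the monochrome bitmap from the red-channel LSBs."""
    return stego_rgb[..., 0] & 1

# Toy usage with random arrays standing in for real images
cover = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)
code = np.random.randint(0, 2, size=(64, 64), dtype=np.uint8)   # stand-in for the semacode bitmap
stego = embed_semacode(cover, code)
assert np.array_equal(extract_semacode(stego), code)
```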
I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.
The PDF417 format allows for varying levels of duplication/redundancy in its content. The level of redundancy used will affect how much of the barcode can be obscured or removed while still leaving the contents readable.
PDF417 does not use anything in particular; it's a specification of how data is encoded.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm and there are libraries at least in Java and C from what I've been dealing with.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
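For example, with the third-party Python reedsolo package (one of many Reed-Solomon implementations; the decode return value differs slightly between versions, so treat this as a sketch), the redundancy level is simply the number of parity bytes you ask for:

```python
# Assumes the third-party "reedsolo" package (pip install reedsolo).
from reedsolo import RSCodec

rsc = RSCodec(10)                        # 10 parity bytes -> can correct up to 5 corrupted bytes
encoded = rsc.encode(b"hello world")     # payload + parity

damaged = bytearray(encoded)
damaged[0] ^= 0xFF                       # corrupt a couple of bytes
damaged[3] ^= 0xFF

# Recent reedsolo versions return (message, full codeword, error positions).
decoded = rsc.decode(bytes(damaged))[0]
assert decoded == b"hello world"
```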
Note: QR seems slightly different, with explicit zones for redundancy data.
I don't know PDF417, but I know that QR codes use Reed-Solomon correction. It is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of those seven is corrupted, you lose the whole message. To work around this issue, you extract a larger number of points from the polynomial and write them down. As long as you have at least seven of the bunch, that will be enough to reconstruct your original information.
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
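A small numerical sketch of that polynomial picture, with made-up numbers: sample a degree-6 polynomial at 12 points, lose 5 of them, and any 7 survivors still determine the original coefficients.

```python
import numpy as np

# A degree-6 polynomial is fully determined by its 7 coefficients.
coeffs = np.array([1.0, -2.0, 0.5, 3.0, 0.0, -1.0, 4.0])

xs = np.linspace(-1.0, 1.0, 12)          # oversample: 12 points instead of the minimal 7
ys = np.polyval(coeffs, xs)

# Lose 5 samples (the barcode got ripped); keep any 7 of them.
keep = [0, 2, 3, 6, 8, 10, 11]
recovered = np.polyfit(xs[keep], ys[keep], deg=6)

assert np.allclose(recovered, coeffs)
```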
I do not think the concept of a trade-off between space and robustness is any different here than anywhere else. Think of RAID, let's say RAID 5: you can yank a disk out of the array and the data is still available. The price? An extra disk. Or, in terms of the barcode, the extra space the label occupies.