Compare 2 files and copy source if different from destination - vbscript?

I'm working on Windows XP and I need to make a script that would compare 2 files (1 on a server and 1 on a client). Basically, I need my script to check if the file from the client is different from the server version and replace the client version if it finds a difference (in the file itself, not only the modification date).

As you suggest, you can skip the date check as that can be changed without the contents changing.
First check whether the sizes differ. If they do, that may be enough to conclude that the files are different. Depending on the file types, though, this can produce false positives: for example, a Unicode text file may contain exactly the same content as an ANSI text file but be encoded with two bytes per character. If it's a script, it would execute with exactly the same results but be twice the size.
If the sizes are the same, they may still contain different bytes. The brute force test would be to load each file into a string and compare them for equality. If they are big files and you don't want to read them all into memory if not necessary, then read them line by line until you encounter a difference. That's assuming they are text files. If they aren't text files, you can do something similar by reading them in fixed size chunks and comparing those.
Another option would be to run the "fc" file compare command on the two files, capture the result, and do your update based on that.
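The size check followed by a chunk-by-chunk comparison can be sketched as follows. This is only an illustration in Python rather than VBScript; the function names, the 64 KB chunk size, and the choice of shutil.copy2 for the copy are assumptions of this sketch, not part of the original question.

```python
import os
import shutil

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def files_differ(path_a, path_b, chunk_size=CHUNK_SIZE):
    """Return True if the two files differ in size or in any byte."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return True  # different sizes: contents cannot be identical
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            chunk_a = fa.read(chunk_size)
            chunk_b = fb.read(chunk_size)
            if chunk_a != chunk_b:
                return True  # stop at the first differing chunk
            if not chunk_a:
                return False  # both files ended without a difference


def update_client_copy(server_path, client_path):
    """Copy the server file over the client file when they differ.

    Returns True if a copy was made.
    """
    if not os.path.exists(client_path) or files_differ(server_path, client_path):
        shutil.copy2(server_path, client_path)  # copy2 also preserves timestamps
        return True
    return False
```

Because the files are opened in binary mode, the comparison treats a Unicode and an ANSI version of the same text as different, which is the false positive described above.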

Related

Converting text file with spaces between CR & LF

I've never seen this line ending before and I am trying to load the file into a database.
The lines all have a fixed width. After the CSV text which contains the data (the length varies line-by-line), there is a CR followed by multiple spaces and ending with LF. The spaces provide the padding to equalize the line width.
Line1,Data 1,Data 2,Data 3,4,50D20202020200A
Line2,Data 11,Data 21,Data 31,41,510D2020200A
Line3,Data12,Data22,Data 32,42,520D202020200A
I am about to handle this with a stream reader / writer in C#, but there are 40 files that come in each month and if there is a way to convert them all at once instead of one line at a time, I would rather do that.
Any thoughts?
Line-by-line processing of a stream doesn't have to be a bottleneck if you implement it at the right point in your overall process.
When I've had to do this kind of preprocessing I put a folder watch on the inbound folder, then automatically pick up each file and process it upon arrival, putting the original into an archive folder and writing the processed file into another location from which data will be parsed or loaded into the database. Unless you have unusual real-time requirements, you'll never notice this kind of overhead. If you do have real-time requirements, this issue will pale in comparison to all the other issues you'll face with batched data files :)
But you may not even have to go through a preprocessing step at all. You didn't indicate what database you will be using or how you plan to load the data, but many databases do include utilities to process fixed-length records. In the past, fixed-format files came with every imaginable kind of bizarre format (and contained all kinds of stuff that had to be stripped out or converted). As a result those utilities tend to be very efficient at this kind of task. In my experience they can easily be at least an order of magnitude faster than line-by-line processing, which can make a real difference on larger bulk loads.
If your database doesn't have good bulk import tools, there are a number of open-source or freeware utilities already written that do pretty much exactly what you need. You can find them on GitHub and other places; two examples are NPM replace and zzzprojects findandreplace.
For a quick and dirty approach that allows you to preview all the changes as you develop a more robust solution, many text editors have the ability to find and replace in multiple files. I've used that approach successfully in the past. For example, the find-and-replace window in Notepad++ lets you use RegEx to remove or change whatever you like in all files matching defined criteria.
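The same regular-expression idea can be applied to a whole folder in one pass. The sketch below is in Python rather than C#, assumes each file fits comfortably in memory, and uses hypothetical folder arguments and a hypothetical "*.csv" pattern; it replaces every CR + spaces + LF sequence with a plain CR LF.

```python
import re
from pathlib import Path

# CR, then zero or more padding spaces, then LF
PADDED_EOL = re.compile(rb"\r *\n")


def normalize_line_endings(data: bytes) -> bytes:
    """Replace every CR + spaces + LF sequence with a plain CR LF."""
    return PADDED_EOL.sub(b"\r\n", data)


def convert_folder(inbound: Path, outbound: Path, pattern: str = "*.csv") -> int:
    """Convert every matching file in `inbound`, writing results to `outbound`.

    The originals are left untouched. Returns the number of files converted.
    """
    outbound.mkdir(parents=True, exist_ok=True)
    count = 0
    for src in sorted(inbound.glob(pattern)):
        converted = normalize_line_endings(src.read_bytes())
        (outbound / src.name).write_bytes(converted)
        count += 1
    return count
```

Working on bytes rather than decoded text avoids any guess about the files' character encoding, since only the CR, space, and LF bytes are touched.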

How can you identify a file without a filename or filepath?

Suppose I give you a file that you can read but can't change or copy. Then I take the file, rename it, and move it to a new location. How could you identify that file (fairly reliably)?
I'm asking because I have a database of media files for a program: if the user alters the location or name of a file, could I find the file again by searching a directory and looking for something?
I have done exactly this, it's not hard.
I take a 256-bit hash (I forget which routine I used off the top of my head) of the file and the filesize and write it to a table. If they match the files match. (And I think tracking the size is more paranoia than necessity.) To speed things up I also fold that hash to a 32-bit value. If the 32-bit values match then I check all the data.
For the sake of performance I persist the last 10 million files I have examined. The 32-bit values go in one file which is read in its entirety; when a main record needs to be examined I pull in a "page" (I forget exactly how big) of them, which is padded to align it with the disk.
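A minimal sketch of the hash-plus-size approach, in Python. The answer does not say which 256-bit hash or which folding scheme was used, so SHA-256 and an XOR of the digest's eight 32-bit words are assumptions here, as are all function names.

```python
import hashlib
import os


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, sha256_digest) for the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large media files are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.digest()


def fold_to_32_bits(digest: bytes) -> int:
    """XOR the 32-bit words of a digest into a single 32-bit value."""
    folded = 0
    for i in range(0, len(digest), 4):
        folded ^= int.from_bytes(digest[i:i + 4], "big")
    return folded


def find_moved_file(known_fingerprint, search_dir):
    """Walk `search_dir` and return the first path whose fingerprint matches."""
    known_size, _ = known_fingerprint
    for root, _dirs, names in os.walk(search_dir):
        for name in names:
            candidate = os.path.join(root, name)
            if os.path.getsize(candidate) != known_size:
                continue  # cheap size check before hashing
            if file_fingerprint(candidate) == known_fingerprint:
                return candidate
    return None
```

The folded 32-bit value is only a quick pre-filter for an index; a match on it still has to be confirmed against the full digest, as the answer describes.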

Each run of the same Hadoop SequenceFile creation routine creates a file with different crc. Is it ok?

I have some simple code which creates a Hadoop SequenceFile. Each time the code is run, it leaves two files in the working directory:
mySequenceFile.txt
.mySequenceFile.txt.crc
After each run the sizes of both files remain the same. But the crc file contents become different!
Is this a bug or an expected behaviour?
This is a confusing, but expected behaviour.
According to the SequenceFile format, each sequence file has a sync block that is 16 bytes long. The sync block repeats after each record in block-compressed sequence files, and after some records or one very long record in uncompressed or record-compressed sequence files.
The thing is that the sync block is a more or less random value. It is written in the header, which is how the reader recognizes it. It stays the same within one sequence file, but it can be (and actually is) different from one sequence file to another.
So the files are logically the same, but binary different. A CRC is a binary checksum, so it differs between the two files too.
I haven't found any way to set this sync block manually. If someone finds a way, please write it here.
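The effect can be reproduced with a toy example in Python. This is not the real SequenceFile layout and does not use Hadoop's checksum file format; build_toy_file, the "TOY" header, and the use of zlib.crc32 are all assumptions made purely for illustration.

```python
import os
import zlib


def build_toy_file(records, sync_marker):
    """Build a toy byte layout: marker in the header, then after each record.

    This is NOT the real SequenceFile layout, only an illustration.
    """
    parts = [b"TOY", sync_marker]
    for record in records:
        parts.append(record)
        parts.append(sync_marker)
    return b"".join(parts)


records = [b"key1=value1", b"key2=value2"]

# Two "runs" of the same routine: same records, fresh 16-byte marker each time.
run_a = build_toy_file(records, os.urandom(16))
run_b = build_toy_file(records, os.urandom(16))

print(len(run_a) == len(run_b))              # the sizes match
print(zlib.crc32(run_a), zlib.crc32(run_b))  # the checksums almost certainly differ
```

Passing the same marker to both calls produces byte-identical output, which shows that the marker is the only source of the difference in this toy.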

In-depth understanding of binary files

I am learning C++, especially binary file structure and manipulation, and since I am totally new to the subject of binary files, bits, bytes and hexadecimal numbers, I decided to take one step back and establish a solid understanding of these subjects.
In the picture I have included below, I wrote two words (blue thief) in a .txt file.
The reason for this is that when I decode the file using a hex editor, I want to understand how the information is really stored in hex format. Now, don't get me wrong, I am not trying to make a living out of reading hex formats all day, but only to have a minimum level of understanding of the basics of a binary file's composition. I also know that all files have different structures, but just for the sake of understanding, I wanted to know how exactly the words "blue thief" and a single ' ' (space) were converted into those characters.
One more thing: I have heard that binary files contain three types of information:
header, ftm and the data. Does that apply only to multimedia files such as audio and video? I can't seem to see anything in this file other than what looks like the data chunk.
The characters in your text file are encoded in a Windows extension of ASCII--one byte for each character that you see in Notepad. What you see is what you get.
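As a small illustration of that one-byte-per-character mapping, the following Python snippet prints the hex values a hex editor would show for the two words. It assumes plain ASCII text and Python 3.8 or later (for the separator argument of bytes.hex); C++ would show the same byte values.

```python
text = "blue thief"
raw = text.encode("ascii")  # one byte per character for plain ASCII text

print(raw.hex(" "))         # 62 6c 75 65 20 74 68 69 65 66
print(len(text), len(raw))  # 10 10

# Decoding the bytes gives the original text back.
print(raw.decode("ascii"))  # blue thief
```

The 20 in the middle is the space character; each letter is simply its ASCII code written in hexadecimal.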
Generally, a hard distinction is made between text and binary files on Windows systems. On Unix/Linux systems, the distinction is fuzzier... you could argue that there is no distinction, in fact.
On Windows systems, the distinction is enforced by file extensions. All files with the extension ".TXT" are assumed to be text files (i.e., to contain only hex codes that represent visible onscreen characters, where "visible" includes whitespace).
Binary files are a whole different kettle of fish. Most, as you mention, include some sort of header describing how the data that follows is encoded. These headers can vary tremendously in size depending on the type of data (again, assumed to be indicated by the extension on Windows systems as well as Unix). A simple example is the WAV format for uncompressed audio. If you open a WAV file in your hex editing program, you'll see that the first four bytes are "RIFF"--this is a marker, often called a "magic number" even though it is readable as text, identifying a RIFF container; the "WAVE" tag at bytes 8-11 indicates that the contents are audio. Newer versions of the WAV specification have complicated this somewhat, but originally the WAV header was just these tags plus a few dozen bytes describing the format of the following data, such as the sample rate. (You can see this by comparing the raw data in a track on an audio CD to the WAV file created by ripping an uncompressed copy of that track at 44.1 kHz--the data should be the same, with just a header section added at the start of the WAV file.)
Executable files (compiled programs) are a special type of binary file, but they follow roughly the same scheme of a header followed by data in a prescribed format. In this case, though, the "data" is executable machine code, and the header indicates, among other things, what operating system the file runs on. (For example, most Linux executables begin with the byte 0x7F followed by the characters "ELF".)
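Checking for such markers takes only a few lines. The Python sketch below recognizes just the two formats mentioned here; the function names are hypothetical, and real type detection would cover many more signatures. In a WAV file the "RIFF" tag at bytes 0-3 is followed by a 4-byte size and then "WAVE" at bytes 8-11; an ELF file starts with the byte 0x7F followed by "ELF".

```python
def sniff_format(header: bytes) -> str:
    """Guess a file type from its first bytes (only two formats are checked)."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "wav"
    if header[:4] == b"\x7fELF":
        return "elf"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(12))
```

A plain text file such as the "blue thief" example has no such marker, so it falls through to "unknown": the file consists of nothing but its data.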

How to do large file integrity check

I need to do an integrity check for a single big file. I have read the SHA code for Android, but it will need another file for the result digest. Is there another method using a single file?
I need a simple and quick method. Can I merge the two files into a single file?
The file is binary and the file name is fixed. I can get the file size using fstat. My problem is that I can only have one single file. Maybe I should use CRC, but it would be very slow because it is a large file.
My goal is to ensure the file on the SD card is not corrupt. I write it on a PC and read it on an embedded platform. The file is around 200 MB.
You have to store the hash somehow, no way around it.
You can try writing it to the file itself (at the beginning or end) and skip it when performing the integrity check. This can work for things like XML files, but not for images or binaries.
You can also put the hash in the filename, or just keep a database of all your hashes.
It really all depends on what your program does and how it's set up.
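The option of storing the hash in the file itself could look like the following Python sketch, which appends a SHA-256 digest as a 32-byte trailer and skips it when verifying. The trailer layout, the choice of SHA-256, and the function names are assumptions of this sketch; it only suits cases where whatever consumes the file can ignore or strip the trailing bytes.

```python
import hashlib
import os

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def _hash_prefix(f, length, chunk_size=1 << 20):
    """Hash the first `length` bytes of an open binary file, chunk by chunk."""
    digest = hashlib.sha256()
    remaining = length
    while remaining > 0:
        chunk = f.read(min(chunk_size, remaining))
        if not chunk:
            break  # file is shorter than expected
        digest.update(chunk)
        remaining -= len(chunk)
    return digest.digest()


def append_digest(path):
    """Writer side (PC): append the SHA-256 of the current contents to the file."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        digest = _hash_prefix(f, size)
        f.seek(0, os.SEEK_END)
        f.write(digest)


def verify_digest(path):
    """Reader side: recompute the payload hash and compare it with the trailer."""
    payload_size = os.path.getsize(path) - DIGEST_SIZE
    if payload_size < 0:
        return False  # too short to contain a trailer
    with open(path, "rb") as f:
        actual = _hash_prefix(f, payload_size)
        stored = f.read(DIGEST_SIZE)
    return actual == stored
```

Reading in chunks keeps memory use constant, so the same structure carries over to a 200 MB file on a constrained device.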
