Archiving differences between time sequence of text files - bash

There is a sensor network from which I download measurements every ten minutes or on demand. Each download is a text file consisting of several lines with a timestamp and values. The name of the text file also contains a timestamp of when the download occurred. So as time progresses I collect a lot of text files, which form a sequence. Because of the physical parameters the values are taken from, there are little to no differences between adjacent text files.
I want to archive all of the downloaded text files into a (compressed) file in an efficient way, and archiving only the differences between adjacent text files seems like one such approach.
I would like some ideas for working this out in bash, using well-known tools like tar and diff. I also know about git, but it is not useful for creating an archive file.
I will try to clarify a bit. A text file consists of several lines in the following space-separated format:
timestamp sensor_uuid value_1 ... value_N
Not every line has exactly the same number of values (say N), but there is little variation in the number of tokens per line, and the values themselves vary little over time. Since the values come from sensors, with a single sensor per line, the number of lines in a text file depends on how many responses I got for each call; zero lines is possible.
Finally, the text filename carries its own timestamp, a concatenation of a base name with a date-time string:
sensors_2019-12-11_153043.txt for today’s 15:30:43 request.
Needless to say, the timestamps in the lines of this example file are usually earlier than the filename's timestamp, and lines and timestamps may even be repeated from text files created earlier.
So my idea for efficient archiving is to put the first text file into the archive and then add only the updates, i.e. the differences between adjacent text files, which eventually trace back to the first text file, the only one actually archived in full. But on retrieval I need to get back a complete text file, as if it had been archived itself and not just its difference from the past.
tar takes in the whole text files, and a couple of differing lines between text files does not produce a repeating pattern suitable for strong compression.

Compressing the tar archive already exploits repeating patterns. But if you want to eliminate the repeated parts yourself, you can use the diff command with some simple manipulation of its output and then redirect everything to a file.
Let's say we have two files, file1.txt and file2.txt. You can use this command line to get only the lines added in the second file (file2.txt):
diff -u file1.txt file2.txt | grep '^+' | grep -v '^+++' | sed 's/^+//'
Then we just need to redirect that output either to the same file (e.g. file2.txt) or to another file, and delete the original file2.txt before the tar operation.
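For the bigger picture, here is a minimal sketch of the whole archive/restore cycle along those lines, assuming the snapshot files sort chronologically by name (directory and archive names below are placeholders, not part of the question): the first snapshot is stored in full, every later one only as a unified diff against its predecessor, and any snapshot can be rebuilt later by replaying the diffs in order with patch.

#!/bin/bash
# Sketch only: names and layout are assumptions.
set -euo pipefail

files=(sensors_*.txt)          # adjacent snapshots, sorted chronologically by name
mkdir -p archive
cp "${files[0]}" archive/      # the first snapshot is kept in full

prev="${files[0]}"
for cur in "${files[@]:1}"; do
    # diff exits with status 1 whenever the files differ, so ignore that status
    diff -u "$prev" "$cur" > "archive/${cur}.diff" || true
    prev="$cur"
done

tar -cJf sensors_archive.tar.xz archive/

# To rebuild a given snapshot later, copy the one full file and apply the
# .diff files with patch, in filename order, up to the snapshot you want:
#   cp archive/<first_snapshot>.txt rebuilt.txt
#   patch rebuilt.txt archive/<next_snapshot>.txt.diff   # repeat for each diff in order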

Related

Merge two text files

I'm using Windows and Notepad++ to work with separate txt files. I have 2 files which I have to merge side by side, line by line, for my data analysis.
Here is the example:
file1.txt
Abcdefghijk
abcdefghijk
file2.txt
123456
123456
then the output I want is like this:
Abcdefghijk123456
abcdefghijk123456
in a new output file. Does anybody here know how to do this?
Your question is answered here by TheMadTechnician. Using PowerShell, you read both source files (1 and 2) as arrays of lines; then comes a simple loop: merge line x from file1 with line x from file2 for as long as file1 has lines.
Unfortunately it's impossible with pure cmd.
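If a Unix-style shell is available on the machine (for example through Cygwin or WSL, which is an assumption beyond the question), paste does this line-by-line join directly; the filenames are the ones from the example above:

# join corresponding lines with no separator between them ('\0' means an empty delimiter, not a NUL byte)
paste -d '\0' file1.txt file2.txt > merged.txt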
@riki, you could also write a batch program to do this programmatically. There should probably be no limit on the number of lines.
It may depend on the number of lines you have in each file. If it's fewer than 50 lines, I suggest just copy-pasting.
Otherwise, use a more powerful language like Python, C or PHP, and run it before performing the data analysis.
There is a free utility you can download and run on your computer called txtcollector. I read about it here. I used it because I had a whole folder of files to concatenate, and it was a breeze. The only slight imperfection I noticed was that I couldn't paste the path to the specific folder into the first step (choosing the folder containing the files to be concatenated). However, I could do this when choosing where to save the result.

How to extract specific lines from a huge data file?

I have a very large data file, about 32 GB. The file is made up of about 130k lines, each of which mainly contains numbers but also a few characters.
The task I need to perform is very clear: I have to extract 20 lines and write them to a new text file.
I know the exact line number for each of the 20 lines that I want to copy.
So the question is: how can I extract the content at a specific line number from the large file? I am on Windows. Is there a tool that can do this sort of operation, or do I need to write some code?
If there is no direct way of doing that, I was thinking that a possible approach is to first extract small blocks of the original file (so that each block contains one or more of the lines to extract) and then use a standard editor to find the lines within each block. In this case the question would be: how can I split a large file into blocks by line on Windows? I use a tool named HJ-Split which works very well with large files, but it can only split by size, not by line.
Install[1] the Babun shell (or Cygwin, but I recommend Babun), and then use the sed command as described here: How can I extract a predetermined range of lines from a text file on Unix?
[1] Installing Babun really just means unzipping it somewhere, so you don't need Administrator rights on the server.
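For concreteness, a sketch of that sed approach with made-up line numbers and filenames (GNU sed; the final q command stops reading right after the last wanted line, which matters for a 32 GB file):

# print only lines 1200, 5034 and 120001, then stop reading the input
sed -n '1200p; 5034p; 120001p; 120001q' bigfile.txt > extracted.txt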

Split and rejoin wav files

I'm trying to edit around 200 wav files with a Windows program that doesn't support command-line batch processing. So it seems like the easiest way to do it would be to combine the wavs into one file (they're all short) and then split them back up the way they were after editing.
Sox will give me the length, and I already have the names of course. Is there any way to combine all the wavs in a directory into a single wav file, recording the names, lengths, and combine order in a txt file, and then use that txt file to turn them back into wavs with the original names and lengths?
Edit: I seem to be doing something wrong. I ran this script first:
#!/bin/bash
# strip the 44-byte wav header from each file
for f in *.wav
do
    dd if="$f" of="new_$f" bs=1 skip=44
done
Then I moved all of the original files out of the folder, deleted the first of the new files, and copied the first of the originals back in. Then I did this:
cat *.wav > merged.wav
This gives me one file that's as big as it should be, but when I open it with a media player, it just plays the portion that was the first file, and then stops before playing the others.
I don't know how creative you want to get. A wav is just a header followed by binary data. So long as they're all the same format, sample size and everything, you can use cut or split to strip 44 bytes off the beginning of all of them, keep one copy of the header at the beginning, cat them into one file, do what you want to them, and split it back up using another script with the same list of filenames.
Sox can do this.
Assuming all wav files are in c:\temp, the command is
sox c:\temp\*.wav c:\temp\merged.wav
(The example is for Windows; for Linux, use Linux path notation.)
To preserve the lengths and names, I'd use sox to get the length of each file and then create a cue file from that info. This cue file can later be used to split the audio.
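To make that concrete, here is a rough sketch of that workflow (file and list names are placeholders; it assumes sox and soxi are installed, all wavs share the same sample rate and channel count, the filenames contain no spaces, and the directory holds only the source wavs):

#!/bin/bash
set -euo pipefail

# 1) Record each file's name and length in seconds, in order, then concatenate.
: > lengths.txt
for f in *.wav; do
    printf '%s %s\n' "$f" "$(soxi -D "$f")" >> lengths.txt
done
sox *.wav merged.wav            # same sorted glob order as the loop above

# ... edit merged.wav in the other program, save the result as merged_edited.wav ...

# 2) Cut the edited file back into pieces with the original names and lengths.
start=0
while read -r name len; do
    sox merged_edited.wav "restored_$name" trim "$start" "$len"
    start=$(echo "$start + $len" | bc)
done < lengths.txt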

Diff for 3 binary files

I have 3 binary files. Let's call them file1.bin, file2.bin and file3.bin.
file1.bin and file2.bin have some common parts.
file2.bin and file3.bin have some common parts.
I want to find the common parts between file1.bin and file2.bin that are different between file2.bin and file3.bin.
How do you recommend accomplishing that? I have already dumped the binary files to text files using xxd and then did a 3-way diff using vim -d file1.txt file2.txt file3.txt.
However, vim marks a part as changed in all the files even if it has only changed in one file and remains the same in the other two. I want those particular occurrences to be marked differently.
Perhaps you can use the built-in Unix diff (I think it is part of OS X), but with the --unchanged-group-format option to list the similarities. Do that for file1 and file2, then do it for file2 and file3. You can then do a regular diff on the two resulting files.
For an idea of how to get the similarities, have a look at this post.
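A sketch of that idea with GNU diff's group-format options (the filenames follow the question; note that because xxd prints byte offsets on every line, common regions only line up when they sit at the same offsets in both files):

# dump each binary to text first
xxd file1.bin > file1.txt
xxd file2.bin > file2.txt
xxd file3.bin > file3.txt

# keep only the regions that are identical within each pair ('%=' emits the common lines)
diff --changed-group-format='' --unchanged-group-format='%=' file1.txt file2.txt > common_1_2.txt
diff --changed-group-format='' --unchanged-group-format='%=' file2.txt file3.txt > common_2_3.txt

# regions present in common_1_2.txt but absent from common_2_3.txt are shared by
# file1/file2 yet changed between file2 and file3
diff common_1_2.txt common_2_3.txt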
The tool that I work on (ECMerge) does that. You just have to diff the 3 binary files; it will present equal portions opposite each other, with modified bytes appropriately placed in between. There is no need to first get a hex dump. You can script in JavaScript to output whatever you like based on the diff results and the bytes in the files (it also works from the command line).
Chromium used bsdiff and then switched to Courgette for binary diffing, as explained in their blog here. You might find useful leads in their blog.

Very long lines - a character-based (not line-based) grep tool for Windows

Is there a grep-like tool for Windows with which I can restrict the number of characters it outputs from a line in which a searched-for pattern is found?
One of the upstream software systems generates huge text files which we then feed as the input to our system.
Sometimes the input files get corrupted and I need to do a quick textual search to find out whether particular bits of data are missing or not. To make it even worse, the input file is just one very, very long line of text, so when I use grep or findstr, the result of the search is a huge chunk of text.
I am wondering how I can limit the number of characters grep shows before/after the pattern I searched for.
Cheers.
Two things spring to my mind:
Call grep with the --only-matching option so that only the text that matches is emitted. Depending on your regex, this may or may not help.
Write a very simple executable, call it trunc, which reads from stdin line by line and outputs the first n characters to stdout. Then simply pipe the output from grep to trunc.
The latter option is relatively simple. If you didn't want to go the whole hog and produce a proper native exe, it could quite easily be done with a Perl/Python/Ruby etc. script.
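As a concrete variant of the first suggestion, assuming GNU grep is available (the pattern and character counts below are placeholders), you can fold the context into the match itself so that --only-matching prints a bounded window around each hit:

# print each match plus up to 40 characters on either side, one match per output line
grep -oE '.{0,40}SEARCH_PATTERN.{0,40}' huge_input.txt

Note that this only trims what gets printed; grep still has to read the whole (single) line into memory.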
