diff -u -s, line count (+, -) not giving correct value - shell

I am using diff -u -s file1 file2 and counting + and - lines for added and deleted lines, as part of a file-comparison automation. (A modified line counts as one + and one -.) These counts match the Araxis compare statistics (total Added+Deleted from the script = Changed+deleted+new in Araxis) for most of the files, but the script total and the Araxis total do not match for a few files.
P.S. I am using Cygwin to run the script on Windows. I tried dos2unix, tail -c 4, etc. in the hope of removing BOM characters, but some of the culprit files do not have a BOM and the counts still do not match. Following are a few sample culprit files.
(1) SIACPO_ActivacionDesactivacionBlacklist.aspx.vb - script gives a total count of 57, while Araxis gives 55
(2) SIACPO_Suspension_Servicio.aspx - script gives a total count of 2509, while Araxis gives 2473
(3) repCuadreProceso.aspx - script gives a total count of 1165, while Araxis gives 1163
(4) detaPago.aspx.vb - this is a strange file. There is no change at all except a BOM character on the first line. The script gives a 0, 0 count, so why is it in the list of modified files at all?
Now how can I attach these 4 culprit files (Dev as well as Prod versions) for your troubleshooting?
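One common pitfall (which may or may not explain the mismatches above) is that the unified diff header lines also start with +++ and ---, so a naive grep -c '^+' / grep -c '^-' counts one extra line each. A minimal sketch, with placeholder file names, that excludes the headers before counting:

added=$(diff -u file1 file2 | grep -v '^+++' | grep -c '^+')
deleted=$(diff -u file1 file2 | grep -v '^---' | grep -c '^-')
echo "added=$added deleted=$deleted total=$((added + deleted))"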


What is the numerical difference in the number of files in two different directories for every sequence (seq 1-current)?

Every time I write a new amount of data, two new directories are created, together called a sequence.
Directory 1 should always be 9 files larger than Directory 2.
I'm using ls | wc -l to output the number of files in each directory and then manually computing the difference.
For example
Sequence 151
Directory 1 - /raid2/xxx/xxxx/NHY274938WSP1151-OnlineSEHD-hyp (1911 files) - the seq number follows WSP1.
Directory 2 - /raid/xxx/ProjectNumber/xxxx/seq0151 (1902 files)
Sequence 152
Directory 1 /raid2/xxx/xxxx/NHY274938WSP1152-OnlineSEHD-hyp (1525 files)
Directory 2 - /raid/xxx/ProjectNumber/xxxx/seq0152 (1516 files)
Is there a script that will output the difference (minus 9) for every sequence?
i.e.
151 diff=0
152 diff=0
That works great. However, I can now see that for some sequences Directory 1 (RAW / all files) contains extra files that I don't want compared against Directory 2. These are:
Warmup files at the beginning (not a set amount every sequence)
Duplicate files with an _
For example :
20329.uutt -warmup
20328.uutt -warmup
.
.
21530.uutt First good file after warmup
.
.
19822.uutt
19821.uutt
19820.uutt
19821_1.uutt
Directory 2 (reprocessed /missing files) doesn’t include warmup shots or Duplicate files with an _
For example :
Missing shots
*021386 – first available file (files are missing before).
*021385
.
.
*019822
*019821
*019820
If we remove the warmup files and any duplicates, I should be left with the number of missing files.
Or output:
diff, D1#warmup files, D1#duplicate files, TOTdiff
To get D1#duplicate files, maybe I could count the total number of occurrences of _.uutt.
To get D1#warmup files, I have a log file where warmup shots have "WARM" at the end of each line, in /raid2/xxx/xxxx/NHY274938WSP1151.log
i.e.
"01/27/21 15:33:51 :FLD211018WSP1004: SP:21597: SRC:2: Shots:1037: Manifold:2020:000 Vol:4000:828 Spread: 1.0:000 FF: nan:PtP: 0.000:000 WARM"
"01/27/21 15:34:04 :FLD211018WSP1004: SP:21596: SRC:4: Shots:1038: Manifold:2025:000 Vol:4000:000 Spread: 0.2:000 FF: nan:PtP: 0.000:000 WARM"
Is there a script that will output the difference (minus 9) for every sequence? i.e. 151 diff=0, 152 diff=0
Here it is:
#!/bin/bash
d1p=/raid2/xxx/xxxx/NHY274938WSP1 # Directory 1 prefix
d1s=-OnlineSEHD-hyp # Directory 1 suffix
d2=/raid/xxx/ProjectNumber/xxxx/seq0
for d in $d2*
do s=${d: -3} # extract sequence from Directory 2
echo $s diff=$(expr `ls $d1p$s$d1s|wc -l` - `ls $d|wc -l` - 9)
done
With filename expansion * we get all the directory names, and by removing the fixed part with the parameter expansion ${parameter:offset} we get the sequence.
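Run as-is, it prints one line per sequence found under Directory 2, in the requested format, e.g.:

151 diff=0
152 diff=0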
For comparison here's a variant using arrays as suggested by tripleee:
#!/bin/bash
d1p=/raid2/xxx/xxxx/NHY274938WSP1 # Directory 1 prefix
d1s=-OnlineSEHD-hyp # Directory 1 suffix
d2=/raid/xxx/ProjectNumber/xxxx/seq0
shopt -s nullglob # make it work also for 0 files
for d in $d2*
do s=${d: -3} # extract sequence from Directory 2
f1=($d1p$s$d1s/*) # expand files from Directory 1
f2=($d/*) # expand files from Directory 2
echo $s diff=$((${#f1[@]} - ${#f2[@]} - 9))
done
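For the follow-up about warmup shots and duplicate files, here is a sketch building on the array variant above. The duplicate pattern (*_*.uutt), the log file location (Directory 1 prefix + sequence + .log) and the TOTdiff formula (D1 minus warmups, duplicates and D2) are assumptions based on the description, not tested against the real data:

#!/bin/bash
d1p=/raid2/xxx/xxxx/NHY274938WSP1 # Directory 1 prefix
d1s=-OnlineSEHD-hyp # Directory 1 suffix
d2=/raid/xxx/ProjectNumber/xxxx/seq0
shopt -s nullglob
for d in $d2*
do s=${d: -3} # extract sequence from Directory 2
f1=($d1p$s$d1s/*) # all files in Directory 1
f2=($d/*) # all files in Directory 2
dup=($d1p$s$d1s/*_*.uutt) # duplicate files with an _ (assumed pattern)
warm=$(grep -c 'WARM$' "$d1p$s.log" 2>/dev/null) # warmup shots from the log (assumed location)
warm=${warm:-0} # default to 0 if the log is missing
echo "$s diff=$((${#f1[@]} - ${#f2[@]})) warmup=$warm dup=${#dup[@]} TOTdiff=$((${#f1[@]} - warm - ${#dup[@]} - ${#f2[@]}))"
done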

Faster way of Appending/combining thousands (42000) of netCDF files in NCO

I seem to be having trouble properly combining thousands of netCDF files (42000+, about 3 GB in size for this particular folder/variable). The main variable that I want to combine has a structure of (6, 127, 118), i.e. (time, lat, lon).
I'm appending the files one by one, since the list of files is too long to pass at once.
I have tried:
for i in input_source/**/**/*.nc; do ncrcat -A -h append_output.nc $i append_output.nc ; done
but this method seems to be really slow (on the order of kB/s, and it seems to get slower as more files are appended), and it also gives a warning:
ncrcat: WARNING Intra-file non-monotonicity. Record coordinate "forecast_period" does not monotonically increase between (input file file1.nc record indices: 17, 18) (output file file1.nc record indices 17, 18) record coordinate values 6.000000, 1.000000
which basically just repeats the variable "forecast_period" 1-6 n times, where n = 42000 files, i.e. [1,2,3,4,5,6,1,2,3,4,5,6,...].
Despite this warning I can still open the file, and ncrcat does what it's supposed to; it is just slow, at least with this particular method.
I have also tried adding in the option:
--no_tmp_fl
but this gives an error:
ERROR: nco__open() unable to open file "append_output.nc"
full error attached below
If it helps, I'm using WSL with Ubuntu on Windows 10.
I'm new to bash and any comments would be much appreciated.
Either of these commands should work:
ncrcat --no_tmp_fl -h *.nc
or
ls input_source/**/**/*.nc | ncrcat --no_tmp_fl -h append_output.nc
Your original command is slow because you open and close the output file N times. These commands open it once, fill it up, then close it.
I would use CDO for this task. Given the huge number of files it is recommended to first sort them on time (assuming you want to merge them along the time axis). After that, you can use
cdo cat *.nc outfile
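For example, one way to feed the files in time order (assuming the file names sort chronologically; the output name is a placeholder, and a list of 42000 files may hit the shell's argument-length limit) would be:

cdo cat $(ls input_source/**/**/*.nc | sort) outfile.nc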

Rsync ignores folders ending in slashes

I have this following content in a file named "/rsync/include.txt"
+ /home/**
+ /opt**
- *
I then call the rsync command as follows:
rsync -avr --include-from="/rsync/include.txt" . ../backup
This produces the following output:
sending incremental file list
created directory ../archive
./
opt/
opt/some-file
opt/include-me/
opt/include-me/me-too
opt/include-me/and-me/
sent 299 bytes received 106 bytes 810.00 bytes/sec
total size is 0 speedup is 0.00
The /home directory exists, and contains files.
Why does the + /home/** pattern not work? I do not want to use the + /home** pattern, as that could match other folder names, e.g., /homeopathy.
Can anybody help me understand why this command doesn't work, and point me in the direction of the working command?
EDIT: While I'd still like an answer to this question, I sincerely suggest using rdiff-backup as it uses similar filtering files and patterns, but is substantially easier to use. I've spent a good deal of time today on this issue with rsync, which was resolved in a few minutes using rdiff-backup.
General debugging info
The easiest way to see whether your filter rules do what you want is to use -vv (increase verbosity twice) in combination with -n (dry run).
With double verbosity you will see which pattern caused which file to be ignored.
You may grep the output to only see the relevant parts.
Example
% rsync -avv --include 'opt/1' --exclude 'opt/*' -n ./ /tmp \
| grep '\[sender\]'
[sender] showing directory opt/1 because of pattern opt/1
[sender] hiding directory opt/2 because of pattern opt/*
[sender] hiding directory opt/3 because of pattern opt/*
Specific answer
Your example fails because + /home/** is excluded by - *.
man rsync states:
Note that, when using the --recursive (-r) option (which is implied by -a), every subdir component of every path is visited left to right, with each directory having a chance for exclusion before its content.
So the pattern /home/** will be evaluated after /home is traversed, but this will never happen, because - * excludes /home.
To include /home/ you just have to insert it before /home/**, so your filter file becomes:
+ /home/
+ /home/**
+ /opt/
+ /opt/**
- *
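With the corrected file in place, a dry run with double verbosity (as described under the general debugging info above) should confirm that /home/ is now traversed:

rsync -avvn --include-from="/rsync/include.txt" . ../backup | grep '\[sender\]'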

How can I compare the file sizes match between duplicate directories?

I need to compare two directories to validate a backup.
Say my directory looks like the following:
Filename      Filesize               Filename      Filesize
user@main_server:~/mydir/            user@backup_server:~/mydir/
file1000.txt  4182410737             file1000.txt  4182410737
file1001.txt  8241410737             -                          <-- missing on backup_server!
...                                  ...
file9999.txt  2410418737             file9999.txt  1111111111   <-- size != main_server
Is there a quick one liner that would get me close to output like:
Invalid Backup Files:
file1001.txt
file9999.txt
(with the goal to instruct the backup script to refetch these files)
I've tried to get variations of the following to no avail.
[main_server] $ rsync -n ~/mydir/ user@backup_server:~/mydir
I cannot use rsync to back up the directories themselves because it takes way too long (8-24 hrs). Instead I run multiple scp threads to fetch files in batches, which regularly completes in under 1 hr. However, occasionally I find a few files that were somehow missed (perhaps a dropped connection).
Speed is a priority, so file sizes should be sufficient. But I'm open to including a checksum, provided it doesn't slow the process down like I find with rsync.
Here's my test process:
# Generate Large Files (1GB)
for i in {1..100}; do head -c 1073741824 </dev/urandom >foo-$i ; done
# SCP them from src to dest
for i in {1..100}; do ( scp ~/mydir/foo-$i user@backup_server:~/mydir/ & ) ; sleep 0.1 ; done
# Confirm destination has everything from source
# This is the point of the question. I've tried:
rsync -Sa ~/mydir/ user@backup_server:~/mydir
# Way too slow
What do you recommend?
By default, rsync uses the quick-check method, which only transfers files that differ in size or last-modified time. As you report that the sizes are unchanged, that would seem to indicate that the timestamps differ. Two options to handle this are:
Use -t to preserve timestamps when transferring files.
Use --size-only to ignore timestamps and transfer only files that differ in size.
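A sketch of the corresponding one-liner: a size-only dry run lists everything rsync would (re)transfer, i.e. files missing on the backup or differing in size (the listing also contains directory names, which can be filtered out if needed):

rsync -rvn --size-only ~/mydir/ user@backup_server:~/mydir/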

rsync script expands variable incorrectly

I have a script that takes in a unique location number from a file. These are formatted like 7325-05, 5269-09 and 7479-14, for example. The first four digits are the folder name, and the last two digits are the first two characters of the filename, which is unique within each folder.
So I wrote this script to use locate and find to get the full path of the folder and then use a wildcard to download the specific file using rsync. Here's the script that I have right now:
#!/bin/bash
#IFS='
#'
oIFS=$IFS
IFS=$'\n'
while read line;
do
    name=$line;
    folder=${line:0:4}
    track=${line: -2}
    folderlocation="$(locate -r '/'$folder'$')"
    filelocation="$(find "$folderlocation" -type f -name "$track*")"
    rsync -vazhn --progress "$filelocation" /cygdrive/c/
    # mkdir /cygdrive/c/test/"$folder"
    # cp -rvi "$filelocation" /cygdrive/c/test/"$folder"
    echo "";
done < $1
The code using cp that is commented out works just fine. I would just really prefer to use rsync, mainly due to better feedback and more accurate progress reporting, as far as I can tell.
Using the code as pasted above (with rsync) throws this error:
./filelocator classic-locations.txt
sending incremental file list
rsync: change_dir "/home/emil//\\sandrew-nas/SMMUSIC/MMIMUSIC/7001-8000/7201-7300/7252/Unknown Album (29-12-2012 09-52-02)" failed: No such file or directory (2)
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1165) [sender=3.1.1]
sending incremental file list
rsync: change_dir "/home/emil//\\sandrew-nas/SMMUSIC/MMIMUSIC/7001-8000/7201-7300/7252/Unknown Album (29-12-2012 09-52-02)" failed: No such file or directory (2)
sent 20 bytes received 12 bytes 64.00 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1165) [sender=3.1.1]
As you can see, my home folder (where I issue the command) is suddenly included in the path, leading me to believe that a variable or wildcard is being expanded in the local shell, but no amount of escape characters seems to accomplish what I want with rsync.
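One way to rule the shell in or out is to print the variables right before the rsync call, for example with printf %q (a debugging aid, not part of the original script), and compare the result with the path rsync complains about:

printf 'folderlocation=%q\n' "$folderlocation"
printf 'filelocation=%q\n' "$filelocation"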
