I'm trying to use look to speed up searches in a sorted file. My understanding of look is that it works as long as the input data (up to some termination character specified by -t) is sorted. My data is numerically sorted, which look doesn't seem to like.
My look is look from util-linux 2.23.2. Is there any way I can make it play nicely with my numerically sorted data?
Small reproducible example:
$ seq 100 | sed "s/$/,[data]/" > temp
$ look 11 temp
$ look 11, temp
$ look -t, 11 temp
$ look -d 11 temp
$ look -d -t, 11 temp
$ look -f -d -t, 11 temp
look finds nothing. grep works just fine.
$ grep 11 temp
11,[data]
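For what it's worth, look does a binary search, so the input has to be sorted in the order look itself compares lines (plain byte order, or dictionary order with -d). A numerically sorted file breaks that assumption: in byte order 100,[data] comes before 11,[data], so the binary search can land in the wrong half and miss the match. I'm not aware of a look option that accepts numeric order, but one workaround sketch is to keep a lexicographically sorted copy of the data for look to search, which here should find the line:
$ LC_ALL=C sort temp > temp.lex
$ look -t, 11 temp.lex
11,[data]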
Let's say I have these files in folder Test1
AAAA-12_21_2020.txt
AAAA-12_20_2020.txt
AAAA-12_19_2020.txt
BBB-12_21_2020.txt
BBB-12_20_2020.txt
BBB-12_19_2020.txt
I want to move the latest files below to folder Test2:
AAAA-12_21_2020.txt
BBB-12_21_2020.txt
This code would work:
ls "$1" -U | sort | cut -f 1 -d "-" | uniq | while read -r prefix; do
ls "$1/$prefix"-* | sort -t '_' -k3,3V -k1,1V -k2,2V | tail -n 1
done
We first iterate over every prefix in the directory specified as the first argument; we get the prefixes by sorting the list of files, cutting off everything after the first -, and deleting duplicates. Then, for each prefix, we sort its filenames on three fields separated by _ using sort's -k option (primarily by the year in the third field, then by the month at the end of the first field, and lastly by the day in the second field) and take the last line, which is the latest file. Version sort (the V modifier) lets sort interpret the numbers correctly despite the surrounding text (as opposed to a plain lexicographic sort).
I'm not sure whether this is the best way to do this, as I used only basic shell tools. Because of the date format and the fact that you have to differentiate prefixes, you have to parse the string fully, which is a job better suited for AWK or Perl.
Nonetheless, I would suggest a year-month-day (ISO 8601) format for machine-readable filenames, since it sorts chronologically as plain text.
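If the goal is to actually get those files into Test2, the same loop can copy each selected file there. A minimal sketch along those lines, assuming Test2 already exists alongside Test1:
ls "$1" -U | sort | cut -f 1 -d "-" | uniq | while read -r prefix; do
    latest=$(ls "$1/$prefix"-* | sort -t '_' -k3,3V -k1,1V -k2,2V | tail -n 1)
    cp "$latest" Test2/   # or mv, if the files should be moved rather than copied
done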
Using awk:
ls -1 Test1/ | awk -v src_dir="Test1" -v target_dir="Test2" -F '(-|_)' '{p=$4""$2""$3; if (!($1 in b) || b[$1] < p) {a[$1]=$0; b[$1]=p}} END {for (i in a) {system("mv " src_dir "/" a[i] " " target_dir "/")}}'
My Problem:
I have two large CSV files with millions of lines.
One file contains a backup of a database from my server and looks like:
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
...
Now I have another CSV file containing new codes, with the exact same schema.
I would like to compare the two and find only the codes that are not already on the server. Because a friend of mine generates random codes, we want to be certain to only update codes that are not already on the server.
I tried sorting them with sort -u serverBackup.csv > serverBackupSorted.csv and sort -u newCodes.csv > newCodesSorted.csv
First I tried to use grep -F -x -f newCodesSorted.csv serverBackupSorted.csv, but the process got killed because it took too many resources, so I figured there had to be a better way.
I then used diff to only find new lines in newCodesSorted.csv like diff serverBackupSorted.csv newCodesSorted.csv.
I believe you can tell diff directly that you only want the lines unique to the second file, but I didn't understand how, so I grepped diff's output, knowing that I could cut/remove the unwanted characters later:
diff serverBackupSorted.csv newCodesSorted.csv | grep '>' > greppedCodes
But I believe there has to be a better way.
So I'm asking whether you have any ideas on how to improve this method.
EDIT:
comm works great so far. But one thing I forgot to mention is that some of the codes on the server are already scanned.
New codes, however, are always initialized with isScanned = false. So newCodes.csv would look something like:
securityCode,isScanned
ALBSIBFOEA,false
OUVOENJBSD,false
NAPOIDFNLE,false
NALEJNSIDO,false
NPIAEBNSIE,false
...
I don't know whether it would be sufficient to use cut -d',' -f1 to reduce the files to just the codes and then use comm.
I tried that, once with grep and once with comm, and got different results, so I'm not sure which one is the correct way ^^
Yes! comm, a highly underrated tool, is great for this.
Examples stolen from here.
Show lines that only exist in file a: (i.e. what was deleted from a)
comm -23 a b
Show lines that only exist in file b: (i.e. what was added to b)
comm -13 a b
Show lines that only exist in one file or the other: (but not both)
comm -3 a b | sed 's/^\t//'
As noted in the comments, for comm to work the files do need to be sorted beforehand. The following will sort them as a part of the command:
comm -12 <(sort a) <(sort b)
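Applied to the files from the question, and keyed on just the securityCode column (since isScanned may differ between the server backup and the new codes), a sketch would be:
$ comm -13 <(cut -d',' -f1 serverBackup.csv | sort) <(cut -d',' -f1 newCodes.csv | sort)
This prints the codes that appear only in newCodes.csv.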
If you do prefer to stick with diff, you can get it to do what you want without the grep:
diff --changed-group-format='%>' --unchanged-group-format='' 1.txt 2.txt
You could then alias that diff command to "comp" or something similar to allow you to just:
comp 1.txt 2.txt
That might be handy if this is a command you are likely to use often in future.
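For instance (a sketch; the name comp is just a placeholder):
$ alias comp="diff --changed-group-format='%>' --unchanged-group-format=''"
$ comp serverBackupSorted.csv newCodesSorted.csv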
I would think that sorting the files uses a lot of resources.
If you only want the new lines, you can try grep with the option -v:
grep -vFxf serverBackup.csv newCodes.csv
or first split serverBackup.csv
split -a 4 --lines 10000 serverBackup.csv splitted
cp newCodes.csv newCodes.csv.org
for f in splitted*; do
grep -vFxf "${f}" newCodes.csv > smaller
mv smaller newCodes.csv
done
rm splitted*
Given:
$ cat f1
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,false
NAPOIDFNLE,true
$ cat f2
securityCode,isScanned
NALEJNSIDO,false
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
You could use awk:
$ awk 'FNR==NR{seen[$0]; next} !($0 in seen)' f1 f2
NALSKIFKEA,true
NAPOIDFNLE,false
SOMETHINGELSE,true
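If, as in the edit, only the securityCode column should decide whether a code is new (isScanned can legitimately differ), the same idea keyed on the first comma-separated field would be:
$ awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' f1 f2
SOMETHINGELSE,true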
Given two files (either of which may contain duplicates) in the following format:
file1 (file that contains only numbers) for example:
10
40
20
10
10
file2 (file that contains only numbers) for example:
30
40
10
30
0
How can I print the contents of the files so that the duplicates within each file are removed?
For example, the output for the two files above should be:
10
40
20
30
40
10
0
Note: the output may contain duplicates (at most, a number can appear twice, once from each file), but the contribution from each individual file must be free of duplicates!
How can I do it with sort, uniq and cat, using only one command?
That is, something like cat file1 file2 | sort | uniq (of course, this command is not right; it doesn't solve the problem, it's only to explain what I mean by "using only one command").
I'd be happy to hear your ideas on how to do it :)
If I understood the question correctly, this awk should do it while preserving the order:
awk 'FNR==1{delete a}!a[$0]++' file1 file2
FNR==1 clears the seen array at the start of each file, and !a[$0]++ prints a line only the first time it appears in the current file; run on the sample files above, this should reproduce the expected output.
If you don't need to preserve the order, it can be as simple as:
sort -u file1; sort -u file2
If you don't want to use a list (;), something like this is also an option:
cat <(sort -u file1) <(sort -u file2)
I'm trying to join several files, which look like below
file1
DATE;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M01;AZ;PO;
2013M12;WT;UF;
file2
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;
2014M02;BA;LA;
2014M01;BR;ON;
I'm trying to merge them to have the following results
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M02;BA;LA;
2014M01;BR;ON;AZ;PO;
2013M12;WT;UF;
or
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;
2014M02;BA;LA;;
2014M01;BR;ON;AZ;PO;
2013M12;;WT;UF;
I tried join but it says filenameX is not sorted.
If you have any ideas, they are welcome.
Best.
Will this work for you:
$ awk '
BEGIN{FS=OFS=";"}
NR==FNR{a[$1]=$0;next}
{$0=($1 in a)?a[$1] $2 FS $3:$0; delete a[$1]}1;END{for(x in a) print a[x]}' file2 file1
DATE;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME
2014M01;BR;ON;AZ;PO
2013M12;WT;UF;
2014M02;BA;LA;
We set the field separators (input and output) to ;.
We scan the first file and create an array indexed by column 1, whose value is the entire line.
Once the first file is done, we start reading the second file. If its first column is present in our array, we append the current line's second and third fields to the stored line, then delete that array entry.
Once all lines of the second file are processed, we loop through the array to see whether any items are left (dates present only in the file read first) and print them.
Bash has a wonderful feature, process substitution, that allows you to sort both files inline:
$ join -t ';' -a 1 -a 2 -o 0 1.2 1.3 2.2 2.3 <(sort -n file1 ) <(sort -n file2)
DATE;BAL_RO,ET-CAP,EXT_EA16;LRW_RT,AY-LME;WALU-TF,TZ-AN;BAL_OP,WZ-CPI,WXZ-JUM
2013M12;WT;UF;;
2014M01;AZ;PO;BR;ON
2014M02;;;BA;LA
Explanation:
-t ';': use ; as both input and output separator.
-a 1 -a 2: also print unpairable lines from both file1 and file2.
-o 0 1.2 1.3 2.2 2.3: each line is formatted as 0 (the join field), 1.2 (2nd field of file1), 1.3 (3rd field of file1), etcetera.
<(sort -n file1): numeric sort file1 via bash process substitution.
<(sort -n file2): numeric sort file2 via bash process substitution.
For details on bash process substitution, see: http://tldp.org/LDP/abs/html/process-sub.html.
I am working on a specific project where I need to work out the make-up of a large extract of documents so that we have a baseline for performance testing.
Specifically, I need a command that can recursively go through a directory and, for each file type, inform me of the number of files of that type and their average size.
I've looked at solutions like:
Unix find average file size,
How can I recursively print a list of files with filenames shorter than 25 characters using a one-liner? and https://unix.stackexchange.com/questions/63370/compute-average-file-size, but nothing quite gets me to what I'm after.
This du and awk combination should work for you:
du -a mydir/ | awk -F'[.[:space:]]' '/\.[a-zA-Z0-9]+$/ { a[$NF]+=$1; b[$NF]++ }
END{for (i in a) print i, b[i], (a[i]/b[i])}'
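Note the sizes come from du, so they are in block units (typically 1 KiB with GNU du). If you want the averages in bytes instead, GNU du's -b option (apparent size in bytes) should drop straight in, something like:
du -ab mydir/ | awk -F'[.[:space:]]' '/\.[a-zA-Z0-9]+$/ { a[$NF]+=$1; b[$NF]++ }
END{for (i in a) print i, b[i], (a[i]/b[i])}'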
To give you something to start with: with the script below, you will get a list of files and their sizes, line by line.
#!/usr/bin/env bash
DIR=ABC
cd "$DIR" || exit 1
find . -type f | while read -r line
do
# size=$(stat --format="%s" "$line") # for systems with a stat command
size=$(perl -e 'print -s $ARGV[0],"\n"' "$line") # @Mark Setchell provided the command, but I have no OS X system to test it.
echo "$size $line"
done
Output sample
123 ./a.txt
23 ./fds/afdsf.jpg
Then it is your homework: with the above output, it should be easy to work out each file type and its average size.
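A sketch of that homework step, assuming the script above is saved as filesizes.sh (a made-up name), paths contain no spaces, and we group on the text after the last dot:
./filesizes.sh | awk '{
    n = split($2, parts, ".")                           # split the path on dots
    ext = parts[n]                                      # candidate extension: text after the last dot
    if (ext == "" || index(ext, "/")) ext = "(noext)"   # the last dot was not in the filename itself
    sum[ext] += $1; cnt[ext]++                          # accumulate total size and count per extension
}
END { for (e in sum) print e, cnt[e], sum[e] / cnt[e] }'   # type, count, average size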
Maybe you can use du:
du -a -c *.txt
Sample output:
104 M1.txt
8 in.txt
8 keys.txt
8 text.txt
8 wordle.txt
136 total
The output is in 512-byte blocks on BSD systems (GNU du defaults to 1 KiB blocks), but you can change the unit with -k or -m.