I intend to check submitted homework answers written in C.
Does anyone have a link, or bash shell script code, that checks files for similarity (percentage of similar lines, etc.)?
Ready-to-use program
On the one hand, there is a little C program called Sherlock from the University of Sydney which does exactly what you want: it displays the percentage of similarity. You do have to compile it yourself, but I think that won't be a problem.
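If I remember correctly, you simply pass it the files to compare and it prints a similarity percentage for each pair. Treat the invocation below as an assumption from memory (the file names are placeholders) and check the documentation that ships with the source:
gcc -o sherlock sherlock.c
./sherlock submission1.c submission2.c
The output is something along the lines of "submission1.c and submission2.c: 57%".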
Do it yourself
On the other hand, if you're using a Unix-based system and want to do it all by yourself, there is the comm command:
compare two sorted files line by line and write to standard output:
the lines that are common, plus the lines that are unique.
(taken from the manpage)
Important to note here is that comm only works on sorted files, so you have to sort both of them first. If you have two files, say first.txt and second.txt, you can use comm like this:
comm -12 <(sort first.txt) <(sort second.txt)
The -12 option suppresses the lines that are unique to the first file (-1) and the lines that are unique to the second file (-2), so you only get the lines that appear in both files.
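From there, a rough similarity percentage can be computed by counting the common lines and dividing by the line count of one of the files. A minimal sketch, assuming first.txt is taken as the reference and is non-empty:
common=$(comm -12 <(sort first.txt) <(sort second.txt) | wc -l)
total=$(wc -l < first.txt)
echo "similarity: $(( 100 * common / total ))%"
Note this is only a line-based heuristic; reordered or renamed code will fool it.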
Related
I have the following data containing a subset of record numbers, formatted like so:
>head pilot.dat
AnalogPoint,206407
AnalogPoint,2584
AnalogPoint,206292
AnalogPoint,206278
AnalogPoint,206409
AnalogPoint,206410
AnalogPoint,206254
AnalogPoint,206266
AnalogPoint,206408
AnalogPoint,206284
I want to compare this list of entries to another subset file called "disps.dat", formatted in the same way, to find duplicates:
>head disps.dat
StatusPoint,280264
StatusPoint,280266
StatusPoint,280267
StatusPoint,280268
StatusPoint,280269
StatusPoint,280335
StatusPoint,280336
StatusPoint,280334
StatusPoint,280124
I used the command:
grep -f pilot.dat disps.dat > duplicate.dat
However, the output file "duplicate.dat" lists records that exist in the second file "disps.dat" but do not exist in the first file.
(Note: both files are big, so the samples shown above don't share any duplicates, but I do expect, and have confirmed, at least 10-12k duplicates in total.)
> head duplicate.dat
AnalogPoint,208106
AnalogPoint,208107
StatusPoint,1235220
AnalogPoint,217270
AnalogPoint,217271
AnalogPoint,217272
AnalogPoint,217273
AnalogPoint,217274
AnalogPoint,217275
AnalogPoint,217277
> grep "AnalogPoint,208106" pilot.dat
>
I tested the above command with a smaller sample of data (10 records), also formatted the same, and the results were fine, so I'm a little confused about why it fails on the larger run.
I also tried feeding the patterns in as fixed strings with -F, thinking that the comma might be the source of the issue. Right now I am feeding the data through a 'for' loop and echoing each line, which executes very, very slowly, but at least it will help me rule out the regex possibility.
The -x or -w option is needed to do an exact match.
-x matches the whole line, while -w matches whole words by blocking non-word characters on either side, which works in my case because it stops patterns from matching longer trailing numbers.
The issue is that a record in the first file such as:
"AnalogPoint,1"
would end up flagging records in the second file like:
"AnalogPoint,10"
"AnalogPoint,123"
"AnalogPoint,100200"
And so on.
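With one of those options added, the command becomes something like this sketch (same file names as above; -F additionally treats the patterns as fixed strings, so the comma is never interpreted as regex syntax):
grep -Fwf pilot.dat disps.dat > duplicate.dat
or, for a whole-line match:
grep -Fxf pilot.dat disps.dat > duplicate.dat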
Thanks to @Barmar for pointing out my issue.
I have several text files in a directory, all of them unrelated. Words don't repeat within a file. Each line has 1 to 3 words in it, such as:
apple
potato soup
vitamin D
banana
guinea pig
life is good
I know how to randomize each file:
sort -R file.txt > file-modified.txt
That's great, but I want to do this for over 500 files in a directory, and doing them one by one would take ages. There must be a better way.
I would like to do something like:
sort -R *.txt -o KEEP-SAME-NAME-AS-ORIGINAL-FILE-ADD-SUFFIX-TO-ALL.txt
Maybe this is possible with a script that goes through each file in the directory until it's finished.
Very importantly, every file should only randomize the words within itself and not mix with the other files.
Thank you.
Something like this one-liner:
for file in !(*-modified).txt; do shuf "$file" > "${file%.txt}-modified.txt"; done
Just loop over the files and shuffle each one in turn.
The !(*-modified).txt pattern uses bash's extended pattern matching to skip .txt files that already end in -modified, so you don't shuffle a pre-existing shuffled output file and end up with file-modified-modified.txt. It might require a shopt -s extglob first, though that's usually turned on already in an interactive shell session.
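As a standalone script, a sketch could look like this (assuming shuf is available, e.g. from GNU coreutils, and that the .txt files are in the current directory):
#!/usr/bin/env bash
# Shuffle the lines of every .txt file into a -modified copy, one file at a time.
shopt -s extglob nullglob   # extglob for the !( ) pattern, nullglob so no match expands to nothing
for file in !(*-modified).txt; do
    shuf "$file" > "${file%.txt}-modified.txt"
done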
I have a task where I need to parse through files and extract information. I can do this easily using bash, but I have to get it done through Unix commands only.
For example, I have a file similar to the following:
Set<tab>one<tab>two<tab>three
Set<tab>four<tab>five<tab>six
ENDSET
Set<tab>four<tab>two<tab>nine
ENDSET
Set<tab>one<tab>one<tab>one
Set<tab>two<tab>two<tab>two
ENDSET
...
So on and so forth. I want to be able to extract a certain number of sets, say the first 10. Also, I want to be able to extract info from the columns.
Once again, this is a trivial thing to do with bash scripting, but I am unsure how to do it with Unix commands only. I can combine the commands in a shell script but, once again, only Unix commands.
Without an output example, it's hard to know your goal, but anyway, one UNIX command you can use is AWK.
Examples:
Extract 2 sets from your data sample (without including "ENDSET" or blank lines):
$ awk '/ENDSET/{ if(++count==2) exit(0);next; }NF{print}' file.txt
Set one two three
Set four five six
Set four two nine
Extract 3 sets and print only the 3rd column (the 1st column is always "Set"):
$ awk '/ENDSET/{ if(++count==3) exit(0);next; }$3{print $3}' file.txt
two
five
two
one
two
And so on... (more info: $ man awk)
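For the "first 10 sets" case, a parameterised sketch could pass the set count in with -v (NF on its own prints every non-blank line):
$ awk -v n=10 '/ENDSET/{ if (++count == n) exit; next } NF' file.txt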
I have a huge file that I split into several small chunks to divide and conquer. Now I have a folder containing a list of files like below:
output_aa  # (this output file is already done: cat input_aa | python parse.py > output_aa)
output_ab
output_ac
output_ad
...
I am wondering whether there is a way to merge those files back together following the index order.
I know I could do it by using
cat * > output.all
but I am more curious whether there is already a magical command that comes with split to do it.
The magic command would be:
cat output_* > output.all
There is no need to sort the file names as the shell already does it (*).
As its name suggests, cat's original design was precisely to conCATenate files, which is basically the opposite of split.
(*) Edit:
Should you use a (hypothetical?) locale whose collating order for a-z is not abcdefghijklmnopqrstuvwxyz, here is one way to overcome the issue:
LC_ALL=C sh -c 'cat output_* > output.all'
There are other ways to concat files together, but there is no magical "opposite of split" in "linux".
Of course, talking about "Linux" in general is a bit far-fetched, as distributions ship different tools (many even use a different shell by default, like sh, bash, csh, zsh, ksh, ...), but speaking of Debian-based Linux at least, I don't know of any distribution that provides such a tool.
For sorting you can use the sort command.
Also be aware that using ">" to redirect stdout will overwrite any existing contents of the target file, while ">>" will append to it.
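A tiny illustration of the difference, with placeholder file names:
cat output_aa > output.all    # creates output.all, overwriting it if it already exists
cat output_ab >> output.all   # appends to the end of output.all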
I don't want to copy jlliagre's answer, but to make this one complete: what jlliagre said about the cat command should also be considered. cat was made to con-"cat" files, which effectively makes it possible to reverse the split command, provided you keep the same ordering of files; so it's not exactly the "opposite of split", but it will work that way in close to 100% of cases (see the comments under jlliagre's answer for specifics).
I’ve got two text files, each with several hundred lines. Some of the lines exist in both files, and I want to remove those so that they exist in only one of the files. Basically, I want to reduce them to get a unique set of lines. The catch is that I can’t sort them (they are stripped-down dumps of my Chromium history).
What is the easiest way to do this?
I tried WinDiff, but that gave incorrect results. I figure that I could knock together a PHP script in a while, but am hoping that there is an easier way (preferably a command-line tool).
Well, I ended up writing a PHP script after all.
I read both files into a string, then exploded the strings into arrays using \r\n as the delimiter. I then iterated through the arrays to remove any elements that exist, and finally dumped them back out to a file.
The only problem was that when I tried to refactor the stripping routine into a function, I found that passing the array that gets changed (elements removed) by reference caused it to slow down to the point of needing to be Ctrl-C'd, so I just passed it by value and returned the new array (counterintuitively). Also, using unset to delete the elements was slow no matter what, so I just set each removed element to an empty string and skipped those during the dump.
If you have a bash shell (e.g. Cygwin), the following shell commands remove from a.txt every line that appears in both files:
comm -12 <(sort a.txt | uniq) <(sort b.txt | uniq) | while IFS= read -r dupe; do
    dupe_escaped=$(printf '%s\n' "$dupe" | sed 's/[][\.*^$/]/\\&/g')   # escape regex metacharacters
    sed -e "/^${dupe_escaped}\$/d" -i a.txt                            # delete that exact line from a.txt
done
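If you only need one file with the shared lines removed and want to preserve the original order without sorting, a simpler sketch uses grep's fixed-string, whole-line, inverted match, writing to a new file (the output name is just a placeholder):
grep -Fvxf b.txt a.txt > a.unique.txt   # keep only the lines of a.txt that do not appear in b.txt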