How to merge two CSV files not including duplicates [duplicate] - bash

This question already has answers here:
Remove duplicate lines without sorting [duplicate]
(8 answers)
Closed 9 years ago.
If I have a CSV file like this
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
and I have another CSV file like this
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
I want the output to be like this:
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
Some of the entries in the first CSV file are also present in the second; the two files overlap quite a bit. How can I combine these CSV files while preserving the correct order? It is guaranteed that the new entries always appear in the first few lines of the first CSV file.

Solution 1:
awk '!a[$0]++' file1.csv file2.csv
Solution 2 (if you don't care about the original order):
sort -u file1 file2
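For example, on the two sample files above (saved here as file1 and file2), sort -u keeps exactly one copy of each line but returns them in sorted order rather than the original order (the exact collation may vary slightly with your locale):
$ sort -u file1 file2
banana#fruit#yellow
cat#mammal#purrs a lot
dog#mammal#man's best friend
fish#fish#blub blub
human#mammal#human
hummingbird#bird#can fly
lion#mammal#scary animal
rockets#pewpew#fire
shark#fish#very scary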

Here's one way:
Use cat -n to concatenate input files and prepend line numbers
Use sort -u to remove duplicate data
Use sort -n to sort again by prepended number
Use cut to remove the line numbering
$ cat -n file1 file2 | sort -uk2 | sort -nk1 | cut -f2-
lion#mammal#scary animal
human#mammal#human
hummingbird#bird#can fly
dog#mammal#man's best friend
cat#mammal#purrs a lot
shark#fish#very scary
fish#fish#blub blub
rockets#pewpew#fire
banana#fruit#yellow
$
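If you need this more than once, the same pipeline can be wrapped in a small function; this is only a sketch, and the name merge_keep_order is mine, not anything standard:
merge_keep_order() {
    # keep the first occurrence of every line, preserving input order
    cat -n "$@" | sort -uk2 | sort -nk1 | cut -f2-
}
merge_keep_order file1 file2 > merged.csv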

Related

Sort duplicates in multiple files

I am kind of new to programming and just starting out, so please do not think I am just trying to get spoon-fed.
I am trying to do one assignment but I am stuck.
The idea is to sort duplicate strings among many files.
They are all in one folder,
named 1.txt, 2.txt, ..., n.txt
1.txt:
hello
hi
world
2.txt:
hi
there
hello
They all contain different strings.
I would like to sort them and get a result like:
2 hello
1 world
and so on
I tried this after a bit of searching, but no luck:
sort file1.txt | uniq -c
This does it for one file, but I would like to do it for all the files at once.
Thank you very much for your help!
I'd cat both files and then sort and count the values:
$ cat file1.txt file2.txt | sort | uniq -c
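Since the question's files are all in one folder and numbered, a glob picks them all up at once (assuming the folder holds only those .txt files); with the sample 1.txt and 2.txt above this gives something like the following (the count padding depends on your uniq implementation):
$ cat *.txt | sort | uniq -c
      2 hello
      2 hi
      1 there
      1 world
Append | sort -rn to list the most frequent words first.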

How to shuffle multiple files and save different files? [duplicate]

This question already has answers here:
Shuffle multiple files in same order
(3 answers)
Closed 4 years ago.
I have three files as:
file1 file2 file3
A  B  C
D  E  F
G  H  I
The lines in each file relate to each other.
Thus, I want to generate shuffled files as:
file1.shuf file2.shuf file3.shuf
G     H    I
D     E    F
A     B    C
I often face this kind of problem and I always write a small script in Ruby or Python, but I thought it could be solved by some simple shell commands.
Could you suggest any simple ways to do this by shell commands or a script?
Here’s a simple script that does what you want. Specify all the input
files on the command line. It assumes all of the files have the same
number of lines.
First it creates a list of numbers and shuffles it. Then it combines
those numbers with each input file, sorts that, and removes the numbers.
Thus, each input file is shuffled in the same order.
#!/bin/bash
# Temp file to hold shuffled order
shuffile=$(mktemp)
# Create shuffled order (zero-padded so the later lexical sort keeps numeric order)
lines=$(wc -l < "$1")
digits=$(printf "%d" "$lines" | wc -c)
fmt=$(printf "%%0%d.0f" "$digits")
seq -f "$fmt" "$lines" | shuf > "$shuffile"
# Shuffle each file in the same way
for fname in "$@"; do
    paste "$shuffile" "$fname" | sort | cut -f 2- > "$fname.shuf"
done
# Clean up
rm "$shuffile"
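A quick usage sketch, assuming the script above is saved as shuffle.sh (the name is arbitrary) and made executable; since the order is random, the paste check below shows just one possible outcome:
$ chmod +x shuffle.sh
$ ./shuffle.sh file1 file2 file3
$ paste file1.shuf file2.shuf file3.shuf
G       H       I
D       E       F
A       B       C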

How to get a string out of a plain text file [duplicate]

This question already has answers here:
How do I split a string on a delimiter in Bash?
(37 answers)
Closed 6 years ago.
I have a .txt file that has a list containing a hash and a password so it looks likes this:
00608cbd5037e18593f96995a080a15c:9e:hoboken
00b108d5d2b5baceeb9853b1ad9aa9e5:1c:wVPZ
Out of this txt file I need to extract only the passwords and add them in a new text file so that I have a list that would look like this:
hoboken
wVPZ
etc
etc
etc
etc
How to do this in bash, a scripting language or simply with a text editor?
Given your examples, the following use of cut would achieve what you want:
cut -f3 -d':' /folder/file >> /folder/result
The command above prints only the third :-separated field on each line, which works in your case, given your examples. The result is appended to /folder/result.
Edit: I edited this answer to make it simpler.
I suggest using awk to always get the last column from your file:
awk -F ':' '{print $NF}' file
Output:
hoboken
wVPZ
With sed, to remove everything up to and including the last colon:
sed 's/.*://' file
You could also use grep:
$ grep -o '[^:]*$' file
hoboken
wVPZ
-o prints only the matching part
[^:] matches any character except :
* repeats that zero or more times
$ anchors the match at the end of the line
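One caveat not raised in the question: all of the approaches above assume the password itself never contains a colon. If it could, you would want everything from the third field onwards rather than just the last field; with cut that would be (a sketch using the same paths as above):
$ cut -d':' -f3- /folder/file
hoboken
wVPZ
The trailing - in -f3- means 'from field 3 to the end of the line', so any embedded colons in the password are kept.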

Find same words in two text files

I have two text files and each contains more than 50,000 lines. I need to find the words that appear in both text files. I tried the comm command but I got the answer that "file 2 is not in sorted order". I tried to sort the file with the sort command but it doesn't work. I'm working on Windows. It doesn't have to be solved on the command line; it can be solved in some program or something else. Thank you for every idea.
If you want to sort the files you will have to use some sort of external sort (like merge sort) so that you have enough memory. Another way is to go through the first file, find all the words and store them in a hashtable, then go through the second file and check for repeated words. If the words are actual words and not gibberish, the second method will work and be easier. Since the files are so large you may not want to use a scripting language, but it might work.
If the words are not on their own line, then comm can not help you.
If you have a set of unix utilities handy, like Cygwin (you mentioned comm, so you may have others as well), you can do:
$ tr -cs "[:alpha:]" "\n" < firstFile | sort > firstFileWords
$ tr -cs "[:alpha:]" "\n" < secondFile | sort > secondFileWords
$ comm -12 firstFileWords secondFileWords > commonWords
The first two lines turn each file into a list with one word per line, and also sort it.
If you're only interested in individual words, you can change sort to sort -u to get the unique set.
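If you would rather not create the two intermediate word files, bash process substitution can feed both word lists straight into comm; this is a sketch of the same pipeline (it needs bash rather than plain sh):
$ comm -12 <(tr -cs "[:alpha:]" "\n" < firstFile | sort -u) \
           <(tr -cs "[:alpha:]" "\n" < secondFile | sort -u) > commonWords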

Remove duplicate lines without sorting [duplicate]

This question already has answers here:
How to delete duplicate lines in a file without sorting it in Unix
(9 answers)
Closed 4 years ago.
I have a utility script in Python:
#!/usr/bin/env python
import sys
unique_lines = []
duplicate_lines = []
for line in sys.stdin:
    if line in unique_lines:
        duplicate_lines.append(line)
    else:
        unique_lines.append(line)
        sys.stdout.write(line)
# optionally do something with duplicate_lines
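The script reads standard input and writes the de-duplicated lines to standard output, so a typical invocation looks like this (the file name uniq_stable.py is just a placeholder):
$ printf 'a\nb\na\nc\n' | python uniq_stable.py
a
b
c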
This simple functionality (uniq without needing to sort first, stable ordering) must be available as a simple UNIX utility, mustn't it? Maybe a combination of filters in a pipe?
Reason for asking: needing this functionality on a system on which I cannot execute Python from anywhere.
The UNIX Bash Scripting blog suggests:
awk '!x[$0]++'
This command is telling awk which lines to print. The variable $0 holds the entire contents of a line and square brackets are array access. So, for each line of the file, the node of the array x is incremented and the line is printed if the content of that node was not (!) previously set.
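Spelled out with an explicit if, the same logic looks like this (the array name seen is mine; x in the one-liner plays the same role):
awk '{
    if (!seen[$0]) {   # first time this exact line appears
        print $0
        seen[$0] = 1
    }
}' file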
A late answer - I just ran into a duplicate of this - but perhaps worth adding...
The principle behind #1_CR's answer can be written more concisely, using cat -n instead of awk to add line numbers:
cat -n file_name | sort -uk2 | sort -n | cut -f2-
Use cat -n to prepend line numbers
Use sort -u to remove duplicate data (-k2 says 'start at field 2 for the sort key')
Use sort -n to sort by prepended number
Use cut to remove the line numbering (-f2- says 'select from field 2 to the end')
To remove duplicates from 2 files:
awk '!a[$0]++' file1.csv file2.csv
Michael Hoffman's solution above is short and sweet. For larger files, a Schwartzian transform approach (adding an index field with awk, followed by multiple rounds of sort and uniq) involves less memory overhead. The following snippet works in bash:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
Now you can check out this small tool written in Rust: uq.
It performs uniqueness filtering without having to sort the input first, therefore can apply on continuous stream.
There are two advantages of this tool over the top-voted awk solution and other shell-based solutions:
uq remembers the occurrence of lines using their hash values, so it doesn't use as much memory when the lines are long.
uq can keep the memory usage constant by setting a limit on the number of entries to store (when the limit is reached, a flag controls whether to override or to die), while the awk solution could run into OOM when there are too many lines.
Thanks 1_CR! I needed a "uniq -u" (remove duplicates entirely) rather than uniq (leave 1 copy of duplicates). The awk and perl solutions can't really be modified to do this, but yours can! I may have also needed the lower memory use since I will be uniq'ing something like 100,000,000 lines 8-). Just in case anyone else needs it, I just put a "-u" in the uniq portion of the command:
awk '{print(NR"\t"$0)}' file_name | sort -t$'\t' -k2,2 | uniq -u --skip-fields 1 | sort -k1,1 -t$'\t' | cut -f2 -d$'\t'
I just wanted to remove duplicates on consecutive lines, not everywhere in the file. So I used:
awk '{
    if ($0 != PREVLINE) print $0;
    PREVLINE = $0;
}'
Even the plain uniq command works for that: http://man7.org/linux/man-pages/man1/uniq.1.html
