I have a text file of 7 tab-delimited columns. Each column has a different number of lines with values that could be duplicated. I want to remove the duplicates so that each column has only unique values for that specific column. As an example:
Input
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 111 333 333 222 333 666
222 111 444 111 333 555 555
333 444 555 222 444 666 444
444 666 555 777 555 666 333
444 777 777 555 666 888 333
777 888 999 666 888
999
Output
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
I figure I would need to use awk to print each column, run sort -u on each one separately, and then paste those outputs together. So, is there a way to make a loop that, for i number of columns in a text file, would print each column | sort -u, and then paste it all together?
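Something like this rough sketch is what I have in mind (file.txt and the col_ names are just placeholders, and I realize the header line and the blanks from the shorter columns would probably need extra handling):
for i in {1..7}; do
    cut -f"$i" file.txt | sort -u > "col_$i.txt"
done
paste col_*.txt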
Thanks in advance,
Carlos
Using perl instead for its support of true multidimensional arrays:
perl -lane '
    # remember each value the first time it appears in its column:
    # @cols holds, per column, the unique values in input order, and
    # @vals holds, per column, a hash used to test if a value was seen
    for my $n (0..$#F) {
        if (!exists ${$vals[$n]}{$F[$n]}) {
            push @{$cols[$n]}, $F[$n];
            ${$vals[$n]}{$F[$n]} = 1;
        }
    }
    END {
        # emit one output row per input row, shifting a value off the
        # front of each column list (exhausted columns yield empty cells)
        for (1..$.) {
            my @row;
            for my $n (0..$#cols) {
                push @row, shift @{$cols[$n]};
            }
            print join("\t", @row);
        }
    }' input.txt
Using any awk in any shell on every Unix box:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{
for (colNr=1; colNr<=NF; colNr++) {
val = $colNr
if ( !seen[colNr,val]++ ) {
rowNr = ++colRowNrs[colNr]
vals[rowNr,colNr] = val
numRows = (rowNr > numRows ? rowNr : numRows)
}
}
numCols = (NF > numCols ? NF : numCols)
}
END {
for (rowNr=1; rowNr<=numRows; rowNr++) {
for (colNr=1; colNr<=numCols; colNr++) {
val = vals[rowNr,colNr]
printf "%s%s", val, (colNr<numCols ? OFS : ORS)
}
}
}
$ awk -f tst.awk file
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
Assumptions
an (awk) array of the entire output result will fit in memory
variable number of columns and rows
One idea consists of a (sparse) 2-dimensional array of values, where the array structure would look like:
values[<column#>][<row#>]=<unique_cell_value>
The following uses a single awk invocation that a) requires only a single pass through the input file and b) does not require any transposing/pasting (in case anyone takes Cyrus' comment/suggestion seriously):
awk '
BEGIN { FS=OFS="\t" }
{ maxNF = (NF > maxNF ? NF : maxNF) # keep track of max number of columns
for (i=1; i<=NF; i++) {
if ( $i == "" ) # ignore empty cell
continue
for (j=1; j<=ndx[i]; j++) { # loop through values already seen for this column
if ( $i == vals[i][j] ) { # and if already seen then
$i = "" # clear the current cell and
break # break out of this for/testing loop
}
}
if ( $i != "" ) { # if we got this far and the cell is not empty then
vals[i][++ndx[i]] = $i # store the new value in our array
}
}
}
END { for (j=1; j<=NR; j++) { # loop through all possible rows
pfx = ""
for (i=1; i<=maxNF; i++) { # loop through all possible columns
printf "%s%s", pfx, vals[i][j] # non-existent array entries default to ""
pfx = OFS
}
printf "\n"
}
}
' input_file
NOTE: The array-of-arrays structure (arr[i][j]) requires GNU awk; for a non-GNU awk the code could be converted to a pseudo multidimensional array structure of arr[i,j].
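A sketch of that conversion, assuming nothing else changes: only the lines that index vals would need to become comma-index references, e.g.:
if ( $i == vals[i,j] )                # was: vals[i][j]
vals[i,++ndx[i]] = $i                 # was: vals[i][++ndx[i]]
printf "%s%s", pfx, vals[i,j]         # non-existent (i,j) entries still default to ""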
This generates:
C1 C2 C3 C4 C5 C6 C7
111 111 222 333 111 222 777
222 444 333 111 222 333 666
333 666 444 222 333 555 555
444 777 555 777 444 666 444
777 888 777 555 555 888 333
999 999 666 666
888
Related
I have a requirement to capture pod memory and other details in my setup.
I have written a small reproducible shell script, copied below.
LOGFILE='/root/User1/test/log'
Data=""
space=" "
e=34
f=12Mi
a=122
b=123
c=333
d=450
for i in {1..10}; do
Data+=$space
Data+=$a
Data+=$space
Data+=$b
Data+=$space
Data+=$c
Data+=$space
Data+=$d
Data+=$space
Data+=$e
Data+=$space
Data+=$f
printf "%s" "$Data" >> ${LOGFILE}
echo $'\n' >> ${LOGFILE}
$(unset ${Data})
done
The above script produces concatenated output.
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450 34 12Mi 122 123 333 450
34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450
34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450
34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450
34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122 123 333 450 34 12Mi 122
The output format what I am looking for is
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
34 12Mi 122 123 333 450
Can someone help me understand what mistake I am making here,
and what the possible solution could be?
When you do $(unset ${Data}), you are running unset ${Data} in a subshell, and then trying to run its output (the empty string) as a command. This is wrong on a few levels:
A subshell can't affect its parent environment, and you don't want to run the output as a command anyway
unset takes the name of the variable, not its expansion, as a parameter
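A quick illustration of the first point (the unset in the subshell does not affect the parent shell):
$ x=1; ( unset x ); echo "$x"
1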
The quick fix is to replace $(unset ${Data}) with unset Data.
A simpler overall approach could be to skip the intermediate variable entirely, and move the redirection out of the loop:
for i in {1..10}; do
printf '%s ' "$e" "$f" "$a" "$b" "$c"
printf '%s\n\n' "$d"
done >> "$LOGFILE"
This doesn't require $Data or $space any longer.
This prints the exact desired output you show, though that output doesn't quite correspond to your script, which has each line begin with a blank. To get that, the printf format strings would have to be ' %s' and ' %s\n\n', respectively.
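That variant would be:
for i in {1..10}; do
    printf ' %s' "$e" "$f" "$a" "$b" "$c"
    printf ' %s\n\n' "$d"
done >> "$LOGFILE"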
Hi everyone,
I am looking for a way to keep the records from txt file that meet the following condition:
This is the example of the data:
aa bb cc
11 22 33
44 55 66
77 88 99
aa bb cc
11 22 33 44 55 66 77
44 55 66 66
77 88 99
aa bb cc
11 22 33 44 55
44 55 66
77 88 99 77
...
Basically, it's a file where each record consists of 5 lines in total: 4 lines contain strings/numbers with tab delimiters, and the last is a newline (\n).
The first line of a record always has 3 elements, while the number of elements in the 2nd, 3rd and 4th lines can differ.
What I need to do is remove every record (5-line block) where the total number of elements in the second line is > 3 (I don't care about the number of elements in the rest of the lines). The output of the example should look like this:
aa bb cc
11 22 33
44 55 66
77 88 99
...
so only the records where the second line has 3 elements are kept and recorded in the new txt file.
I tried to do it with awk by modifying FS and RS values like this:
awk 'BEGIN {RS="\n\n"; FS="\n";}
{if(length($2)==3) print $2"\n\n"; }' test_filter.txt
but if(length($2)==3) is not correct, as I should count the number of entries in the 2nd field instead of counting its length, which I can't find how to do... any help would be much appreciated!
thanks in advance,
You can use the split() function to break a line/field/string into components; in this case:
n=split($2,arr," ")
Where:
we split field #2, using a single space (" ") as the delimiter ...
components are stored in array arr[] and ...
n is the number of elements in the array
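A quick way to see what split() returns:
$ echo "11 22 33 44 55 66 77" | awk '{print split($0, arr, " ")}'
7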
Pulling this into OP's current awk code, along with a couple small changes, we get:
awk 'BEGIN {ORS=RS="\n\n"; FS="\n"} {n=split($2,arr," "); if (n>=4) next}1' test_filter.txt
With an additional block added to our sample:
$ cat test_filter.txt
aa bb cc
11 22 33
44 55 66
77 88 99
aa bb cc
11 22 33 44 55 66 77
44 55 66 66
77 88 99
aa bb cc
111 222 333
444 555 665
777 888 999
aa bb cc
11 22 33 44 55
44 55 66
77 88 99 77
This awk solution generates:
aa bb cc
11 22 33
44 55 66
77 88 99
aa bb cc
111 222 333
444 555 665
777 888 999
# blank line here
I wish you all a very happy New Year.
I have a file that looks like this (example). There is no header, and the file has about 10000 such rows:
123 345 676 58 1
464 222 0 0 1
555 22 888 555 1
777 333 676 0 1
555 444 0 58 1
PROBLEM: I only want those rows where both field 3 and field 4 have a non-zero value, i.e. in the above example rows 1 and 3 should be included and the rest should be excluded. How can I do this?
The output should look like this:
123 345 676 58 1
555 22 888 555 1
Thanks.
awk is perfect for this kind of stuff:
awk '$3 && $4' input.txt
This will give you the output that you want.
$3 && $4 is a filter. $3 is the value of the 3rd field, $4 is the value of the fourth. Zero values are evaluated as false, anything else as true. If there can be negative values, then you need to be more precise:
awk '$3 > 0 && $4 > 0' input.txt
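For example, a row with a negative value in field 3 passes the first filter but not the second:
$ echo '555 22 -888 555 1' | awk '$3 && $4'
555 22 -888 555 1
$ echo '555 22 -888 555 1' | awk '$3 > 0 && $4 > 0'
$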
I am trying to find decrements in a column, and when one is found, print the last highest value.
For example:
From 111 to 445 there is a continuous increase in the column, but the final 333 is less than the number before it.
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss
333 aaa <<<<<< this is less than the number above it (445)
If any such scenario is found then print 445 sss
Like this, for example:
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
445 sss
What is it doing? It keeps the previous first-column value in the variable before; when the current value is smaller, it prints the stored previous line.
It also works when there is more than one such drop:
$ cat a
111 aaa
112 aaa
112 aaa
113 sdf
115 aaa <--- this
15 aaa
222 ddd
333 sss
333 sss
444 sss
445 sss <--- this
333 aaa
$ awk '{if (before>$1) {print before_line}} {before=$1; before_line=$0}' a
115 aaa
445 sss
Store each number in a variable called prevNumber, then when you come to read the next one, do a check, e.g. if (newNumber < prevNumber) print prevNumber;
I don't really know what language you are using.
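In awk, for instance, that idea could be sketched as follows (printing the saved previous line rather than just the number, to match the desired 445 sss output):
awk 'NR > 1 && $1 < prev { print prevLine } { prev = $1; prevLine = $0 }' file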
You can say:
awk '$1 > max {max=$1; maxline=$0}; END{ print maxline}' inputfile
For your input, it'd print:
445 sss
I have a file ff.txt that looks as follows
*ABNA.txt
356
24
36
112
*AC24.txt
457
458
321
2
ABNA.txt and AC24.txt are the files in the folder named foo1. Based on the numbers in the ff.txt file, I want to extract the lines from the corresponding files in the foo1 folder and create the new files with the existing file names in another folder foo2.
If the third or fourth column of the ABNA.txt file contains one of the numbers 356, 24, 36, 112, extract those lines and save them to another folder foo2 as ABNA.txt.
ABNA.txt file in the folder foo1 looks as follows
dfg qza 356 245
hjb hkg 455 24
ghf qza 12 123
dfg qza 36 55
AC24.txt file in the folder foo1 looks as follows
hjb hkg 457 167
ghf qza 2 165
sar sar 234 321
dfg qza 345 345
Output:
ABNA.txt file in the folder foo2
dfg qza 356 245
hjb hkg 455 24
dfg qza 36 55
AC24.txt file in the folder foo2
hjb hkg 457 167
ghf qza 2 165
sar sar 234 321
your help would be appreciated!
#!/bin/bash
mkdir -p foo2
awk '
# read foo1/<filename>, keep the lines whose 3rd or 4th field was listed
# under that filename in ff.txt, and write them to foo2/<filename>
function process_file(filename, values,    filein, fileout, line, f) {
    if (filename == "") return
    filein  = "./foo1/" filename
    fileout = "./foo2/" filename
    while ((getline line < filein) > 0) {
        split(line, f)
        if (f[3] in values || f[4] in values) {
            print line > fileout
        }
    }
    close(filein)
    close(fileout)
}
# a "*name" line starts a new block: flush the previous one first
/^\*/ {
    process_file(filename, values)
    filename = substr($0, 2)
    delete values
    next
}
# any other line is a number: referencing values[$1] creates that key
{ values[$1] }
END { process_file(filename, values) }
' ff.txt
This might work for you (GNU sed and Bash):
folder1=foo1
folder2=foo2
sed -r '/^\*/!{s/\s*//g;H;$!d};1{h;d};x;s/\n/ /;s/\n/|/g;s#\*(.*) (.*)#<'"$folder1"'/\1 sed -nr '\''/^(\\S+\\s+){2,3}\\b(\2)\\b/w '"$folder2"'/\1'\''#' ff.txt | sh
This turns the ff.txt file into a script which is piped into the sh command. The user must first set the bash variables $folder1 and $folder2 to the directories containing the source files and the output files respectively.
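For the sample ff.txt, the generated script would look something like this (one command per *-block; a reconstruction of the idea, not captured output):
<foo1/ABNA.txt sed -nr '/^(\S+\s+){2,3}\b(356|24|36|112)\b/w foo2/ABNA.txt'
<foo1/AC24.txt sed -nr '/^(\S+\s+){2,3}\b(457|458|321|2)\b/w foo2/AC24.txt'
Each generated sed reads one source file and writes the lines whose 3rd or 4th field matches one of the listed numbers to the same filename under foo2.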
You can try something like this -
awk '
BEGIN {
readpath=sprintf("%s", "/path/to/foo1")
writepath=sprintf("%s", "/path/to/foo2")
}
$0~/\*/ {
file = substr($1,2)
while ((getline var < (readpath"/"file)) > 0) {
split (var, a, " ")
ary[a[3]]=var
ary[a[4]]=var
}
}
($1 in ary) {
print ary[$1] > (writepath"/"file)
}' foo.txt
Explanation:
Set the read path and write path in the BEGIN statement.
For the lines in the foo.txt file that contain filenames:
Use substr to capture the filename in a variable called file.
Read each line of that file into a variable called var.
Split the variable var and use columns 3 and 4 as indexes into the array ary.
For the remaining lines of foo.txt, if the first column is present in the array as an index, write the stored line to the output file.
Test:
[jaypal:~/temp/test] ls
foo.txt foo1 foo2
[jaypal:~/temp/test] cat foo.txt
*ABNA.txt
356
24
36
112
*AC24.txt
457
458
321
2
[jaypal:~/temp/test] ls foo1/
ABNA.txt AC24.txt
[jaypal:~/temp/test] head foo1/*
==> foo1/ABNA.txt <==
dfg qza 356 245
hjb hkg 455 24
ghf qza 12 123
dfg qza 36 55
==> foo1/AC24.txt <==
hjb hkg 457 167
ghf qza 2 165
sar sar 234 321
dfg qza 345 345
[jaypal:~/temp/test] ls foo2/
[jaypal:~/temp/test]
[jaypal:~/temp/test] awk '
BEGIN {
readpath=sprintf("%s", "./foo1")
writepath=sprintf("%s", "./foo2")
}
$0~/\*/ {
file = substr($1,2)
while ((getline var < (readpath"/"file)) > 0) {
split (var, a, " ")
ary[a[3]]=var
ary[a[4]]=var
}
}
($1 in ary) {
print ary[$1] > (writepath"/"file)
}' foo.txt
[jaypal:~/temp/test] ls foo2/
ABNA.txt AC24.txt
[jaypal:~/temp/test] head foo2/*
==> foo2/ABNA.txt <==
dfg qza 356 245
hjb hkg 455 24
dfg qza 36 55
==> foo2/AC24.txt <==
hjb hkg 457 167
sar sar 234 321
ghf qza 2 165
UPDATED
This is a pure bash solution (grep was removed):
#!/bin/bash
file=
s=()
grp() { r="${s[@]}";r="\b("${r// /|}")\b";
while read w; do [[ $w =~ $r ]] && echo $w;done <foo1/$file >foo2/$file
}
while read a; do
if [[ $a =~ ^\* ]]; then
[ -n "$file" ] && grp
file=${a#\*}
s=()
else s=(${s[@]} $a)
fi
done < ff.txt
[ -n "$file" ] && grp
#See input and output files
for i in foo1/*;{ echo %% in $i; cat $i;}
for i in foo2/*;{ echo %% out $i; cat $i;}
Output
%% in foo1/ABNA.txt
dfg qza 356 245
hjb hkg 455 24
ghf qza 12 123
dfg qza 36 55
%% in foo1/AC24.txt
hjb hkg 457 167
ghf qza 2 165
sar sar 234 321
dfg qza 345 345
%% out foo2/ABNA.txt
dfg qza 356 245
hjb hkg 455 24
dfg qza 36 55
%% out foo2/AC24.txt
hjb hkg 457 167
ghf qza 2 165
sar sar 234 321
In the while-loop it parses the ff.txt file. If a line starts with *, the file variable is set. If it does not start with *, the line is a number and is added to the s array. When a new filename is found and an old one is already set, the grp function is called to do the real work.
The function grp creates a regex in \b(num1|num2...)\b format; for the first block, for example, r would be \b(356|24|36|112)\b. The \b makes it match only complete numbers, so \b24\b will not match 245. Its while-loop reads the file from foo1, matches each line against the regex, and writes the matching lines to a file with the same name in directory foo2. It does not check whether the foo2 directory exists.