I have a big file whose entries look like this.
Input:
1113
1113456
11134567
12345
1734
123
194567
From these entries, I need to find the minimum set of prefixes that can represent all of them.
Expected output:
1113
123
1734
194567
If we have 1113 then there is no need to keep 1113456 or 11134567.
Things I have tried:
I can use grep -v ^123, compare with the input file, and store the unique results in the output file. But if I use a while loop, I don't know how I can delete the entries from the input file itself.
I will assume that input file is:
790234
790835
795023
79788
7985904
7902713
791
7987
7988
709576
749576
7902712
790856
79780
798599
791453
791454
791455
791456
791457
791458
791459
791460
You can use
awk '!(prev && $0~prev){prev = "^" $0; print}' <(sort file)
Returns
709576
749576
790234
7902712
7902713
790835
790856
791
795023
79780
79788
7985904
798599
7987
7988
How does it work? First it sorts the file lexicographically (1 < 10 < 2). Then it keeps the current minimal prefix and checks whether the following lines match it. If they do, they are skipped. If a line doesn't, the minimal prefix is updated and the line is printed.
Let's say that input is
71
82
710
First it orders the lines, and the input becomes (lexicographic sort: 71 < 710 < 82):
71
710
82
The first line is printed because the awk variable prev is not set, so the condition !(prev && $0~prev) is true; prev becomes ^71. On the next row, 710 matches the regexp ^71, so the line is skipped and prev stays ^71. On the next row, 82 does not match ^71, the condition !(prev && $0~prev) is true again, the line is printed, and prev is set to ^82.
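That trace can be reproduced directly from the shell; here the three sample lines are fed in via printf instead of a file:

```shell
# Sort lexicographically, then keep each line that does not match the
# current prefix regex; matching lines are covered by the stored prefix.
printf '71\n82\n710\n' |
  sort |
  awk '!(prev && $0 ~ prev){prev = "^" $0; print}'
# prints:
# 71
# 82
```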
You may use this awk command:
awk '{
n = (n != "" && index($1, n) == 1 ? n : $1)
}
p != n {
print p = n
}' <(sort file)
Output:
1113
123
1734
194567
$ awk 'NR==1 || (index($0,n)!=1){n=$0; print}' <(sort file)
1113
123
1734
194567
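Because index($0, n) == 1 is a literal prefix test rather than a regex match, this variant is also safe for entries containing regex metacharacters. Running it on the original sample data piped in via printf:

```shell
# Sort the sample entries, then keep a line only when the stored
# value n is not a literal prefix of it.
printf '1113\n1113456\n11134567\n12345\n1734\n123\n194567\n' |
  sort |
  awk 'NR==1 || index($0, n)!=1 {n=$0; print}'
# prints:
# 1113
# 123
# 1734
# 194567
```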
So, I have a text file separated by tabs that looks like:
contig141 hit293 2939 71 293 alksjdflksdf
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
contig124 hit993 9239 55 274 laksjdfkls
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
contig124 hit205 4856 123 301 ksdjflksjdfskldjfeiedfdwe
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
For each contig, I want to subtract the 4th column from the 5th column, and compare the differences. For the contig with the largest difference, I would like to print the entire row to a new file. I'm thinking of a nested loop, but I can't figure out how to do it.
I'm thinking:
Loop over each row of the file.
set variable a = string in first column of the first row
while: the string in the first column of the next row is equal to a,
take the difference of the 4th and 5th column
compare the differences among all rows of that contig
output the row with the greatest difference to a new file
So you would compare the differences between the 4th and 5th column for contig141, and output the line with the greatest difference. Repeat for contig124, etc. etc.
The efficient way to handle this is awk, redirecting the output to a new file. In fact, any time you start thinking "I need to do X with a field...", your first thought should be awk (it is the Swiss Army knife of text processing). You can do what you need with:
awk '{
diff = $5-$4
if (diff > config[$1]) {
config[$1] = diff
rec[$1] = $0
}
}
END { for(i in config) print rec[i] }' file > newfile
Above, you simply calculate the difference between the 5th and 4th fields, saving it in diff. Then you check whether that difference is the maximum so far for the element of config indexed by the first field; if so, you update config[$1] = diff, saving the maximum difference for that contig, and save the record (line) as rec[$1], also indexed by the first field. Then, in the END rule, you simply output the max record for each contig.
Example Use/Output
Showing what would be redirected to the new file, you have
$ awk '{
> diff = $5-$4
> if (diff > config[$1]) {
> config[$1] = diff
> rec[$1] = $0
> }
> }
> END { for(i in config) print rec[i] }' file
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
You can pipe to sort if you need the contigs in sorted order before redirecting to the new file, e.g.
$ awk '{
diff = $5-$4
if (diff > config[$1]) {
config[$1] = diff
rec[$1] = $0
}
}
END { for(i in config) print rec[i] }' file | sort
contig124 hit101 9333 66 287 aslkdjfalkdjsfkjlsdf
contig132 hit003 2290 58 290 jsdfishfoisodncklsn
contig133 hit100 1889 21 107 sijhfdshfdjhsdjkdfjf
contig141 hit339 9393 71 302 kljdkfjsjdfksdf
I have a shell script that gives me a text file output in following format:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
58
FirmD
58
FirmE
58
This output is good, a YES, meaning my job completed as expected, since the value for all of the firms is 58.
So I used to count the occurrences of '58' in this text file to tell automatically, in a RESULT job, that everything worked out well.
Now there seems to be a bug due to which the output sometimes comes out like below:
OUTPUT.TXT
FirmA
58
FirmB
58
FirmC
61
FirmD
58
FirmE
61
which impacts my count (only 3 occurrences of 58 instead of the expected 5), and hence my RESULT job states that it FAILED, or a NO.
But actually the job has worked fine, as long as the value stays within 58 to 61 for each firm.
So how can I ensure that when the value is >=58 and <=61 for each of these five firms, it is treated as having worked as expected?
My simple one-liner to check the count in the OUTPUT.TXT file:
grep -cow 58 "OUTPUT.TXT"
Try Awk for simple jobs like this. You can learn enough in an hour to solve these problems yourself easily.
awk '(NR % 3 == 2) && ($1 < 58 || $1 > 61)' OUTPUT.TXT
This checks every third line, starting from the second, and prints any which are not in the range 58 to 61.
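A quick sanity check with made-up data, assuming the three-line record layout (firm name, value, blank line) that the NR % 3 arithmetic implies:

```shell
# 62 is outside 58..61, so only that value line is printed.
printf 'FirmA\n58\n\nFirmB\n62\n\n' |
  awk '(NR % 3 == 2) && ($1 < 58 || $1 > 61)'
# prints:
# 62
```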
It would not be hard to extend the script to remember the string from the previous line. In fact, let's do that.
awk '(NR % 3 == 1) { firm = $0; next }
(NR % 3 == 2) && ($1 < 58 || $1 > 61) { print NR ":" firm, $0 }' OUTPUT.TXT
You might also want to check how many you get of each. But let's just make a separate script for that.
awk '(NR % 3 == 2) { ++a[$1] }
END { for (k in a) print k, a[k] }' OUTPUT.TXT
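For example, counting the values in a small made-up sample (piped through sort, since for (k in a) returns keys in no particular order):

```shell
# Tally each distinct value found on the value lines (NR % 3 == 2).
printf 'FirmA\n58\n\nFirmB\n61\n\nFirmC\n58\n\n' |
  awk '(NR % 3 == 2) { ++a[$1] }
       END { for (k in a) print k, a[k] }' |
  sort
# prints:
# 58 2
# 61 1
```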
The Stack Overflow awk tag info page has links to learning materials etc.
I am working with a huge CSV file (filename.csv) that contains a single column. From column 1, I want to read the current row and compare it with the value of the previous row. If it is greater or equal, I keep comparing; if the current cell's value is smaller than the previous row's, I divide the current cell's value by the previous cell's value, print the result of the division, and exit. For example, given the data below, I want my bash script to divide 327 by 340, print 0.961765 to the console, and exit.
338
338
339
340
327
301
299
284
284
283
283
283
282
282
282
283
I tried it with the following awk and it works perfectly fine.
awk '$1 < val {print $1/val; exit} {val=$1}' filename.csv
However, since I want to include around 7 conditional statements (if-elses), I wanted to do it with a somewhat cleaner bash script; here is my approach. I am not that used to awk, to be honest, and that's why I prefer bash.
#!/bin/bash
FileName="filename.csv"
# Test when to stop looping
STOP=1
# to find the number of columns
NumCol=`sed 's/[^,]//g' $FileName | wc -c`; let "NumCol+=1"
# Loop until the current cell is less than the count+1
while [ "$STOP" -lt "$NumCol" ]; do
cat $FileName | cut -d, -f$STOP
let "STOP+=1"
done
How can we loop through the values and add conditional statements?
PS: the criteria for my if-else statements are: if the value ($1/val) is >=0.85 and <=0.9, print A; else if it is >=0.7 and <=0.8, print B; if it is >=0.5 and <=0.6, print C; otherwise print D.
Here's one in GNU awk using switch, because I haven't used it in a while:
awk '
$1<p {
s=sprintf("%.1f",$1/p)
switch(s) {
case "0.9": # to match the whole range [0.9, 1.0) use the regex /0.9/
print "A" # ... in which case (no pun) you don't need sprintf
break
case "0.8":
print "B"
break
case "0.7":
print "C"
break
default:
print "D"
}
exit
}
{ p=$1 }' file
D
Other awks using if:
awk '
$1<p {
# s=sprintf("%.1f",$1/p) # s is not rounded anymore
s=$1/p
# if(s==0.9) # if you want rounding,
# print "A" # uncomment and edit all ifs to resemble
if(s~/0.9/)
print "A"
else if(s~/0.8/)
print "B"
else if(s~/0.7/)
print "C"
else
print "D"
exit
}
{ p=$1 }' file
A
This is an alternative approach, based on the earlier description of the input data, which compares $1/val against the fixed values 0.9, 0.7 and 0.6.
This solution will not work with ranges like ($1/val) >= 0.85 and <= 0.9, as clarified later.
awk 'BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"} \
$1 < val{ss=substr($1/val,1,3);if(ss in crit) {print crit[ss]} else {print "D"};exit}{val=$1}' file
A
This technique is based on checking if rounded value $1/val belongs to a predefined array loaded with corresponding messages.
Let me expand the code for better understanding:
awk 'BEGIN{crit[0.9]="A";crit[0.7]="B";crit[0.6]="C"} #Define the criteria array. Your criteria values are used as keys and the values are the messages you want to print.
$1 < val{
ss=substr($1/val,1,3); #gets the first three chars of the result $1/val
if(ss in crit) { #checks if the first three chars is a key of the array crit declared in begin
print crit[ss] #if it is, print it's value
}
else {
print "D" #If it is not, print D
};
exit
}
{val=$1}' file
Using substr we get the first three chars of the result $1/val:
for $1/val = 0.961765 using substr($1/val,1,3) returns 0.9
If you want to make comparisons based on two decimals like 0.96 then change substr like substr($1/val,1,4).
In this case you need to accordingly provide the correct comparison entries in crit array i.e crit[0.96]="A"
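Note that substr truncates the string form of the number rather than rounding it; awk converts the number to a string using CONVFMT (%.6g by default). A minimal check:

```shell
# 327/340 prints as 0.961765 under awk's default %.6g formatting;
# taking the first three characters truncates that to 0.9.
awk 'BEGIN { v = 327/340; print v; print substr(v, 1, 3) }'
# prints:
# 0.961765
# 0.9
```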
I have a TSV file called myfile.tsv. I want to split this file based on the unique elements in the chr column, using awk/gawk/bash or any faster command-line tool, and get chr1.tsv (header + row 1), chr2.tsv (header + rows 2 and 3), chrX.tsv (header + row 4), chrY.tsv (header + rows 5 and 6) and chrM.tsv (header + the last row).
myfile.tsv
chr value region
chr1 223 554
chr2 433 444
chr2 443 454
chrX 445 444
chrY 445 443
chrY 435 243
chrM 543 544
Here's a little script that does what you're looking for:
NR == 1 {
header = $0
next
}
{
outfile = $1 ".tsv"
if (!seen[$1]++) {
print header > outfile
}
print > outfile
}
The first row is saved so it can be used later. Every other line is printed to the file that matches the value of its first field, and the header is added the first time a given value is seen.
NR is the record number, so NR == 1 is only true when the record number is one (i.e. the first line). In this block, the whole line $0 is saved to the variable header. next skips any other blocks and moves to the next line. This means that the second block (which would otherwise be run unconditionally on every record) is skipped.
For every other line in the file, the output filename is built using the value of the first field. The array seen keeps a track of values of $1. !seen[$1]++ is only true the first time a given value of $1 is seen, as the value of seen[$1] is incremented every time it is checked. If the value of $1 has not yet been seen, the header is written to the output file.
Every line is written to the output file.
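A quick end-to-end run, sketched with a three-row sample written to the current directory (note that the output name is built directly from $1, which already carries the chr prefix):

```shell
# Build a small sample, split it, then inspect one of the outputs.
printf 'chr\tvalue\tregion\nchr1\t223\t554\nchr2\t433\t444\nchr2\t443\t454\n' > myfile.tsv
awk 'NR == 1 { header = $0; next }
     { outfile = $1 ".tsv"
       if (!seen[$1]++) print header > outfile
       print > outfile }' myfile.tsv
cat chr2.tsv
# prints the header line followed by the two chr2 rows
```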
Hello, I use the following code to split a file:
BEGIN{body=0}
!body && /^\/\/$/ {body=1}
body && /^\[/ {print > "first_"FILENAME}
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The file looks like this:
header
//
SeqT: {"POS-s":174.683, "time":0.0130084}
SeqT: {"POS-s":431.49, "time":0.0221447}
[2.04545e+2]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
The problem is that in the fourth file, print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME, a number in scientific notation (with an e) does not get through; it works only when the number is written without it. How can I change the awk script so that it also captures numbers like 2.7e+7?
The problem is you're trying to match E notation when your regex is looking for integers only.
Instead of:
/^\[[0-9]+\]/
use something like:
/^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/
This will match positive integers, floats, and E notation wrapped in square brackets at the start of the line.
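A quick test of the extended pattern against the two bracketed lines from the sample, plus a non-matching one:

```shell
# Both the integer and the E-notation forms now match; the substr
# extraction then pulls out just the number between the brackets.
printf '[29]:rest\n[2.04545e+2]:rest\n[x]:rest\n' |
  awk '/^\[[0-9]+(\.[0-9]+(e[+-]?[0-9]+)?)?\]/ {
         print substr($0, 2, index($0, "]") - 2)
       }'
# prints:
# 29
# 2.04545e+2
```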