So I need to randomly sample 10 characters, without repetition, from every string in a particular column of a file (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Can anyone help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by clearing the N array after each row (so that it is reset before processing the next one):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note, however, that if the string in $1 is shorter than 10 characters, this will get stuck in an infinite loop, so make sure that all strings are at least as long as the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't have this issue.
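For reference, a minimal sketch of one way to guard against that is to clamp the sample size to the string length; the variable m is my addition for illustration and is not part of the code above:
awk 'BEGIN{srand()} {
    len = length($1)
    m = (len < 10 ? len : 10)        # clamp the sample size to the string length
    for (i = 1; i <= m; ) {
        k = int(rand() * len) + 1
        if (!(k in N)) { N[k]; printf "%s", substr($1, k, 1); i++ }
    }
    split("", N)                     # reset the seen positions for the next row
    print ""
}' file.txt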
I would harness GNU AWK for this task in the following way. Let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom Array Traversal Control function which randomly decides which element should be considered greater; -0.5 is used because rand() gives values from 0 to 1. For each line, the array arr is populated with the characters of the line, then traversed in random order to build the string s holding the shuffled characters, and substr is used to take its first 10 characters. You might elect to add a counter which terminates the for loop early if your lines are very long in comparison to the number of characters to select (a sketch of this follows the version note below).
(tested in GNU Awk 5.0.1)
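A possible sketch of that counter idea (the c counter and the early break are my additions and were not part of the tested code above; it still requires GNU Awk because of FPAT, patsplit and PROCINFO):
awk 'function comp_func(i1, v1, i2, v2){ return rand() - 0.5 }
BEGIN{ FPAT = "."; PROCINFO["sorted_in"] = "comp_func" }
{
    s = ""; c = 0
    patsplit($0, arr)            # one character per array element
    for (i in arr) {             # traversed in random order via comp_func
        s = s arr[i]
        if (++c >= 10) break     # stop once 10 characters have been collected
    }
    print s
}' file.txt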
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
    for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
        printf("%s", substr(a, rnd, 1))
        a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
        if(x >= size){break}}
    print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
For consecutive runs, remember to check that srand() works the way you expect in your version of awk; without an argument it is typically seeded from the current time, so runs within the same second may produce identical output. If in doubt, seed it from $RANDOM or, better, /dev/urandom.
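For example, explicit seeding could look like this (a minimal sketch; the od invocation, which reads four random bytes as an unsigned integer, is my suggestion rather than part of the answer above):
seed="$(od -An -N4 -tu4 /dev/urandom)"      # or, in bash: seed="$RANDOM"
awk -v seed="$seed" 'BEGIN{srand(seed); print rand()}'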
If you don't need to stay strictly within awk, then jot makes it super easy:
Say you want 20 random characters between "A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters:
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL
Good morning!
I have a file.csv with 140 lines and 26 columns. I need to sort the lines according to the values in column 23. Here is an example:
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller2,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-1088,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller3,EU,FR,URG,D,,0,0,0,0,0,0,0,#NAME?,0,#DIV/0!,#DIV/0!,#DIV/0!,1,,#N/A,0.00%,0.00%,#DIV/0!,NO STATS,-2159,,,,,,#N/A,#N/A,#N/A,#N/A,0,#N/A,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
To sort the lines according to the values of column 23, I do this:
awk -F "%*," '$23 > 4' myfile.csv
The result :
Controller1,NA,ASHEBORO,ASH,B,,3674,4572,1814,3674,4572,1814,1859,#NAME?,0,124.45%,49.39%,19%,1,,"Big Risk, No Spare disk",45.04%,4.35%,12.63%,160,464,,,,,,0,1,1,1,0,410,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
Controller4,NA,STARR,STA,D,,4430,6440,3736,4430,6440,3736,693,#NAME?,0,145.38%,84.35%,18%,1,,No more Data disk,65.17%,19.18%,-2.18%,849,-96,,,,,,0,2,1,2,2,547,65%,1.1,1.1,1.3,0.65,0.65,0.75,0.04,0.1,,,,,,,,,
In my example I use the value of 4% for column 23, the goal being to retrieve all the rows whose percentage in column 23 is significantly high. The problem is that I can't rely on that 4% threshold, because it is only representative of the current table, so I have to find another way to retrieve the rows that have a high value in column 23.
My idea is to sort the Controllers in descending order according to the percentage in column 23, and then process the first 10% of the sorted lines to make sure I get the controllers with a large percentage.
The goal is to be able to vary the percentage according to the number of lines in the table.
Do you have any tips for that?
Thanks! :)
I could have sworn that this question was a duplicate, but so far I couldn't find a similar question.
Whether your file is sorted or not does not really matter. From any file you can extract the first NUMBER lines with head -n NUMBER. There is no built-in way to specify that number as a percentage, but you can compute how many lines PERCENT% of your file corresponds to.
percentualHead() {
    percent="$1"
    file="$2"
    linesTotal="$(wc -l < "$file")"
    (( lines = linesTotal * percent / 100 ))
    head -n "$lines" "$file"
}
or shorter but less readable
percentualHead() {
    head -n "$(( "$(wc -l < "$2")" * "$1" / 100 ))" "$2"
}
Calling percentualHead 10 yourFile will print the first 10% of lines from yourFile to stdout.
Note that percentualHead only works with files because the file has to be read twice. It does not work with FIFOs, <(), or pipes.
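If the input really is a pipe, one workaround is to buffer the stream and cut it at the end; here is a sketch, assuming the stream fits in memory (some_command is just a placeholder):
some_command | awk -v p=10 '
    { buf[NR] = $0 }                       # buffer every line
    END { n = int(NR * p / 100)            # number of lines in the first p %
          for (i = 1; i <= n; i++) print buf[i] }'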
If you want to use standard tools, you'll need to read the file twice. But if you're content to use perl, you can simply do:
perl -e 'my @sorted = sort <>; print @sorted[0..$#sorted * .10]' input-file
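Note that this sorts whole lines lexically. If you want the numeric, descending sort on column 23 that the question describes, a variant might look like this (a sketch assuming a naive comma split, so the quoted comma in a field like "Big Risk, No Spare disk" would shift the index):
perl -e 'my @lines = <>;
         my @sorted = sort { (split /,/, $b)[22] <=> (split /,/, $a)[22] } @lines;
         print @sorted[0 .. $#sorted * .10]' input-file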
Here is one for GNU awk to get the top p% from the file, but the lines are output in their order of appearance:
$ awk -F, -v p=0.5 '                   # 50 % of top $23 records
NR==FNR {                              # first run
    a[NR]=$23                          # hash percentages to a, NR as key
    next
}
FNR==1 {                               # second run, at beginning
    n=asorti(a,a,"@val_num_desc")      # sort percentages into descending order
    for(i=1;i<=n*p;i++)                # get only the top p %
        b[a[i]]                        # hash their NRs to b
}
(FNR in b)                             # top p % BUT not in order
' file file | cut -d, -f 23            # file processed twice, cut 23rd for demo
45.04%
19.18%
The inline comments explain each step.
I have a CSV that was exported, and some lines have a linefeed (ASCII 012) in the middle of a record. I need to replace it with a space, but preserve the newline at the end of each record so that I can load the file.
Most of the lines are fine; however, a good few look like this:
Input:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :
Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
Output:
10 , ,"2007-07-30 13.26.21.598000" ,1922 ,0 , , , ,"Special Needs List Rows updated :Row 1 : Instruction: other :Comment: pump runs all of the water for the insd's home" ,10003 ,524 ,"cc:2023" , , ,2023 , , ,"CCR" ,"INSERT" ,"2011-12-03 01.25.39.759555" ,"2011-12-03 01.25.39.759555"
I have been looking into Awk but cannot really make sense of how to preserve the actual row.
Another Example:
Input:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working
Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
Output:
9~~"2007-08-01 16.14.45.099000"~2215~0~~~~"Exposure closed (Unnecessary) : Garage door working Claim Withdrawn"~~701~"cc:6007"~~564~6007~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
4~~"2007-08-01 16.14.49.333000"~1923~0~~~~"Assigned to user Leanne Hamshere in group GIO Home Processing (Team 3)"~~912~"cc:6008"~~~6008~~~"CCR"~"INSERT"~"2011-12-03 01.25.39.759555"~"2011-12-03 01.25.39.759555"
One way using GNU awk:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
    FS = "[,~]"
}

NF < 21 {
    line = (line ? line OFS : line) $0
    fields = fields + NF
}

fields >= 21 {
    print line
    line = ""
    fields = 0
}

NF == 21 {
    print
}
Alternatively, you can use this one-liner:
awk -F "[,~]" 'NF < 21 { line = (line ? line OFS : line) $0; fields = fields + NF } fields >= 21 { print line; line=""; fields=0 } NF == 21 { print }' file.txt
Explanation:
I made an observation about your expected output: it seems each line should contain exactly 21 fields. Therefore, if a line contains fewer than 21 fields, store it and keep a running total of its field count. When we read the next line, it is joined to the stored line with a space and its fields are added to the total. Once the total is greater than or equal to 21 (the two halves of a broken record add up to 22 fields, because the field containing the stray linefeed is counted in both halves), print the stored line. Otherwise, if a line already contains 21 fields (NF == 21), print it as-is. HTH.
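If other exports have a different fixed number of columns, the same idea can be parameterized; this is a sketch, and nf is a variable name I introduced rather than something from the answer above:
awk -F '[,~]' -v nf=21 '
    NF < nf      { line = (line ? line OFS : line) $0; fields = fields + NF }
    fields >= nf { print line; line = ""; fields = 0 }
    NF == nf     { print }' file.txt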
I think sed is your choice. I assume all complete records end with a double-quote character; thus, if a line does not end with a double quote, it is recognized as a broken record and the next line should be joined onto it.
Here is the code:
cat data | sed -e '/[^"]$/N' -e 's/\n//g'
The first expression, -e '/[^"]$/N', matches the abnormal case (a line not ending with a double quote) and reads in the next line without emptying the pattern space. Then -e 's/\n//g' removes the embedded newline character.
try this one-liner:
awk '{if(t){print;t=0;next;}x=$0;n=gsub(/"/,"",x);if(n%2){printf $0" ";t=1;}else print $0}' file
Idea:
Count the number of " characters in a line. If the count is odd, join the following line onto it; otherwise the current line is considered a complete record.