Compare two files and replace a value in one of the files - bash

I have two files: AU.swo and Compare.
AU.swo contains this data:
7844204020353125700125759G19
7855207010004191300200759119
7898211030001191500193359119
7898211030001212800212959G19
7898211030002212600212759G19
Compare contains this data:
7844204G1
785520712
7898211G1
789821112
The first seven characters in both files are an ID number, and position 8 in the Compare file is the same as position 26 in AU.swo. What I want to do is replace the 9 in the last position of the AU.swo file. It should look like this:
7844204020353125700125759G11
7855207010004191300200759112
7898211030001191500193359112
7898211030001212800212959G11
7898211030002212600212759G11
Which is better to use, an awk or a sed command? Could you give me some hints on how I can do this?
Thank you

You may use this awk:
awk 'NR == FNR {
  k[substr($0, 1, 7), substr($0, 8, 1)] = substr($0, 9, 1)
  next
}
(substr($0, 1, 7), substr($0, 26, 1)) in k {
  sub(/9$/, k[substr($0, 1, 7), substr($0, 26, 1)])
} 1' Compare AU.swo
7844204020353125700125759G11
7855207010004191300200759112
7898211030001191500193359112
7898211030001212800212959G11
7898211030002212600212759G11

(Edit: another, nicer solution in python:)
python -c 'for l in zip(open("AU.swo"), open("Compare")): print(l[0][:-2] + l[1][-2:-1])'
This one also doesn't depend on the exact length of the input strings.
My original solution, although quite ugly:
paste -d '' <(sed 's/\(.*\).$/\1/' AU.swo) <(cut -c 9- Compare)
How this works:
sed 's/\(.*\).$/\1/' AU.swo prints every line from AU.swo without the last character. We surround this with <( ... ) to use this as the first input for paste.
cut -c 9- Compare prints everything from the 9th character onwards of every line in Compare, which here is only the last character. Note that this assumes that each line is exactly 9 characters long.
paste -d '' takes each line of both inputs and prints them together on a single line.
Note that I tested this on Linux; on macOS this might not work.
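If you prefer to stay within awk for this approach too, the same idea (drop the last character of each AU.swo line and append the last character of the corresponding Compare line) could be sketched like this, assuming both files have the same number of lines:
awk 'NR == FNR { tail[FNR] = substr($0, length($0)); next }   # remember the last character of each Compare line
     { print substr($0, 1, length($0) - 1) tail[FNR] }        # swap it in for the last character of the AU.swo line
' Compare AU.swo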

Related

AWK: subset randomly and without replacement a string in every row of a file

So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Can anyone help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by clearing the N array at the end of each row (so that it is empty again before processing the next one):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
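For example, a guarded variant might cap the sample size at the string length, so that short strings cannot trigger that infinite loop (a sketch along the same lines, not a drop-in for your real multi-column files):
awk 'BEGIN{srand()} {
  len = length($1); lim = (len < 10 ? len : 10)                # never ask for more characters than exist
  for (i = 1; i <= lim; ) {
    k = int(rand() * len) + 1
    if (!(k in N)) { N[k]; printf "%s", substr($1, k, 1); i++ }
  }
  split("", N); print ""
}' file.txt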
I would harness GNU AWK for this task in the following way. Let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom Array Traversal Control function which randomly decides which element should be considered greater; -0.5 is used because rand() gives values from 0 to 1. For each line the array arr is populated with the characters of the line, then traversed in random order to build the string s (the shuffled characters), and finally substr is used to take the first 10 characters. You might elect to add a counter which terminates the for loop early if your lines are very long compared to the number of characters to select.
(tested in GNU Awk 5.0.1)
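For instance, with such a counter added so that the traversal stops after 10 characters (a sketch, GNU Awk only):
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}
BEGIN{FPAT="."; PROCINFO["sorted_in"]="comp_func"}
{s=""; n=0; patsplit($0, arr)
 for (i in arr) { s = s arr[i]; if (++n == 10) break }         # stop once 10 characters are collected
 print s}' file.txt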
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
In consecutive runs, remember to check whether srand works the way you expect in your version of awk (it is typically seeded from the current time, so runs within the same second may repeat). If in doubt, seed it from $RANDOM or, better, /dev/urandom.
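One way to do that is to pass an explicit seed in from the shell and hand it to srand(); for example (a sketch that just demonstrates the seeding, using od to read four bytes from /dev/urandom):
seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
awk -v seed="$seed" 'BEGIN{srand(seed); print rand()}'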
If you don't need to stay strictly within awk, then jot makes it super easy:
say you want 20 random characters between
"A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters.
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL

How to average the values of different files and save them in a new file

I have about 140 files with data which I would like to process with a script.
The files have two types of names:
sys-time-4-16-80-15-1-1.txt
known-ratio-4-16-80-15-1-1.txt
where the two last numbers vary. The penultimate number takes the values 1, 50, 100, 150, ..., 300, and the last number ranges from 1 to 10. A sample of these files is in this link.
I would like to write a new file with 3 columns as follows:
A 1st column with the penultimate number of the file, i.e., 1,25,50...
A 2nd column with the mean value of the second column in each sys-time-.. file.
A 3rd column with the mean value of the second column in each known-ratio-.. file.
The result might have a row for each pair of averaged 2nd columns of sys and known files:
1 mean-sys-1 mean-know-1
1 mean-sys-2 mean-know-2
.
.
1 mean-sys-10 mean-know-10
50 mean-sys-1 mean-know-1
50 mean-sys-2 mean-know-2
.
.
50 mean-sys-10 mean-know-10
100 mean-sys-1 mean-know-1
100 mean-sys-2 mean-know-2
.
.
100 mean-sys-10 mean-know-10
....
....
300 mean-sys-10 mean-know-10
where each row corresponds with the sys and known files with the same two last numbers.
Additionally, I would like the first column to contain the penultimate number from the file names.
I know how to compute the mean value of the second column of a file with awk:
awk '{ sum += $2; n++ } END { if (n > 0) print sum / n; }' sys-time-4-16-80-15-1-5.txt
but I do not know how to iterate on all the files and build a result file with the three columns as above.
Here's a shell script that uses GNU datamash to compute the averages (though you can easily swap in awk if desired; I prefer datamash for calculating stats):
#!/bin/sh
nums=$(mktemp)
sysmeans=$(mktemp)
knownmeans=$(mktemp)
for systime in sys-time-*.txt
do
    knownratio=$(echo -n "$systime" | sed -e 's/sys-time/known-ratio/')
    echo "$systime" | sed -E 's/.*-([0-9]+)-[0-9]+\.txt/\1/' >> "$nums"
    datamash -W mean 2 < "$systime" >> "$sysmeans"
    datamash -W mean 2 < "$knownratio" >> "$knownmeans"
done
paste "$nums" "$sysmeans" "$knownmeans"
rm -f "$nums" "$sysmeans" "$knownmeans"
It creates three temporary files, one per output column; after populating them with the data from each pair of input files (one pair per line), it uses paste to combine them and print the result to standard output.
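If datamash is not available, the two datamash lines can be swapped for the awk one-liner from the question, wrapped in a small helper (a sketch; mean is just an illustrative function name):
mean() { awk '{ sum += $2; n++ } END { if (n > 0) print sum / n }' "$1"; }
mean "$systime"    >> "$sysmeans"
mean "$knownratio" >> "$knownmeans"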
I've used GNU Awk for easy, per-file operations. This is untested; please let me know how it runs. You might want to look into printf() for pretty-printed output.
mapfile -t Files < <(find . -type f -name "*-4-16-80-15-*" |sort -t\- -k7,7 -k8,8) #1
gawk '
BEGINFILE {n=split(FILENAME, f, "-"); sub(/^\.\//, "", f[1]); type=f[1]; a[type]=0; c=0} #2
{a[type] = ($2 + a[type] * c++) / c} #3
ENDFILE {if (type == "sys") print f[n], a["sys"], a["known"]} #4
' "${Files[@]}"
Create a Bash array with matching files sorted by the last two "keys". We will feed this array to Awk later. Notice how we alternate between "sys" and "known" files in this sample:
./known-ratio-4-16-80-15-2-150
./sys-time-4-16-80-15-2-150
./known-ratio-4-16-80-15-3-1
./sys-time-4-16-80-15-3-1
./known-ratio-4-16-80-15-3-50
./sys-time-4-16-80-15-3-50
At the beginning of every file, clear any existing average value and save the type as either "sys" or "known".
On every line, calculate the Cumulative Moving Average (a small worked example follows this list).
At the end of every file, check the file type. If we just handled a "sys" file, print the last part of the filename followed by our averages.
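To see the cumulative moving average from step #3 in isolation, here is a toy run on the values 10, 20 and 30 (not your data, just an illustration of the update formula):
printf '10\n20\n30\n' | awk '{ a = ($1 + a * c++) / c; print NR ": avg = " a }'
1: avg = 10
2: avg = 15
3: avg = 20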

How to grep lines that have the pattern between start and end position in the line?

I have the below sample of file:
123456789000000000123456789000000000
123456789123456789123456789000000000
I want to grep for the lines that have 000000000 between the 10th and 18th positions of the line, in such a way that the second line will be skipped/ignored.
How can this be achieved?
You must use ^ for the start of line and . for any character.
You can start with
echo "123456789000000000123456789000000000
123456789123456789123456789000000000" | grep "^.........000000000"
For this command you need to count the characters in the pattern by hand.
The command can be simplified to
echo "123456789000000000123456789000000000
123456789123456789123456789000000000" | grep -E "^.{9}0{9}"
It is easier to do with awk using substr without using any regex:
awk 'substr($0, 10, 9) == "000000000"' file
123456789000000000123456789000000000
substr($0, 10, 9) extracts the substring of each record from the 10th through the 18th position.
You can also use:
awk 'substr($0, 10, 9) == sprintf("%09d", 0)' file
sprintf("%09d", 0) produces 0 printed nine times, i.e. "000000000".

Deleting specific data from a CSV

I have a long CSV file with 5 columns, but 3 lines have 6 columns. One begins with "tomasluck", another with "peterblack" and the last one with "susanpeeters". I need to delete, in these 3 lines, the fourth element (column) so that I get only 5 columns.
I'll put a short example; my file is long and is created automatically.
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, tl, 10, 10
anaperez, 14, 11-03-2015, 10, 11
and I need
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
Exactly. I was thinking of code that selects the lines that begin with tomasluck, peterblack and susanpeeters, and then deletes the 4th field or column.
The tricky thing about this is to keep the formatting intact. The simplest way, I think, is to treat the input as plain text and use sed:
sed '/^tomasluck,/ s/,[^,]*//3' file.csv
This removes, in a line that begins with tomasluck,, the third occurrence of a comma followed by a field (non-comma characters). The filter regex can be amended to include other first fields, such as
sed '/^\(tomasluck\|petergreat\|anaperez\),/ s/,[^,]*//3' file.csv
...but in your input data, those lines don't appear to have a sixth field.
Further ideas that may or may not pertain to your use case:
Removing the fourth field on the basis of the number of fields is a little trickier in sed, largely because sed does not have arithmetic functionality and identifying the lines is a bit tedious:
sed 'h; s/[^,]//g; /.\{5\}/ { x; s/,[^,]*//3; x; }; x' file.csv
That is:
h               # copy the line to the hold buffer
s/[^,]//g       # remove all non-comma characters
/.\{5\}/ {      # if five characters remain (i.e. the line has six or more fields)
  x             # exchange pattern space and hold buffer
  s/,[^,]*//3   # remove field
  x             # swap back again
}
x               # finally, swap in the actual data before printing
The x dance is typical of sed scripts that use the hold buffer; the goal is to make sure that, regardless of whether the substitution takes place, in the end the line (and not the isolated commas) is printed.
Mind you, if you want the selection condition to be that a line has six or more fields, it is worth considering awk, where the condition is easier to formulate but the replacement of the field is more tedious:
awk -F , 'BEGIN { OFS = FS } NF > 5 { for(i = 5; i <= NF; ++i) { $(i - 1) = $i }; --NF; $1 = $1 } 1' file.csv
That is: Split line at commas (-F ,), then
BEGIN { OFS = FS }             # output field separator is input FS
NF > 5 {                       # if there are more than five fields
  for(i = 5; i <= NF; ++i) {   # shift them back one, starting at the fifth
    $(i - 1) = $i
  }
  --NF                         # let awk know that there is one less field
  $1 = $1                      # for BSD awk: force rebuilding of the line
}
1                              # whether or not a transformation happened, print
This should work for most awks; I have tested it with gawk and mawk. However, because nothing is ever easy to do portably, I am told that there is at least one awk out there (on old Solaris, I believe) that doesn't understand the --NF trick. It would be possible to hack something together with sprintf for that, but it's enough of a corner case that I don't expect it to bite you.
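A rough sketch of that fallback, rebuilding the record by concatenation instead of decrementing NF (dropping field 4, as above):
awk -F', ' 'BEGIN { OFS = ", " }
NF > 5 { line = $1; for (i = 2; i <= NF; i++) if (i != 4) line = line OFS $i; $0 = line }
1' file.csv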
A more generic solution is to check whether we have 5 or 6 fields:
awk -F', ' '{if(NF==6) print $1", "$2", "$3", "$5", "$6; else print $0}' file.csv
You could do this through sed, using a capturing-group-based regex.
$ sed 's/^\(\(tomasluck\|peterblack\|susanpeeters\),[^,]*,[^,]*\),[^,]*/\1/' file
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
This captures all the characters up to the third column and matches the fourth column. Replacing the matched characters with the chars inside group 1 will give you the desired output.

Finding gaps in sequential numbers

I don’t do this stuff for a living so forgive me if it’s a simple question (or more complicated than I think). I’ve been digging through the archives and found a lot of tips that are close but being a novice I’m not sure how to tweak for my needs or they are way beyond my understanding.
I have some large data files that I can parse out to generate a list of coordinates that are mostly sequential:
5
6
7
8
15
16
17
25
26
27
What I want is a list of the gaps
1-4
9-14
18-24
I don’t know perl, SQL or anything fancy but thought I might be able to do something that would subtract one number from the next. I could then at least grep the output where the difference was not 1 or -1 and work with that to get the gaps.
With awk:
awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt
Explanation:
$1 is the first column of the current input line
p holds the value from the previous line
so ($1!=p+1) is a condition: if $1 differs from the previous value + 1, then:
this part is executed: {print p+1 "-" $1-1}: print the previous value + 1, the - character, and the first column - 1
{p=$1} is executed for each line: p is assigned the current first column
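A quick way to see how the edge cases behave (the leading gap before the first value, and a single-number gap) is to feed the one-liner a tiny sample:
printf '5\n6\n8\n' | awk '$1!=p+1{print p+1"-"$1-1}{p=$1}'
1-4
7-7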
Interesting question.
sputnick's awk one-liner is nice. I cannot write a simpler one than his. I'll just add another way using diff:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
the output with your example would be:
1,4
9,14
18,24
I know that there is a comma in it instead of -. You could replace the grep with sed to get - (grep cannot change the input text), but the idea is the same.
Hope it helps.
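For reference, that sed replacement might look something like this with GNU sed (a sketch), which for the sample above should print the gaps with hyphens:
seq $(tail -1 file) | diff - file | sed -n '/d/{s/d.*//;s/,/-/;p}'
1-4
9-14
18-24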
A Ruby Answer
Perhaps someone else can give you the Bash or Awk solution you asked for. However, I think any shell-based answer is likely to be extremely localized for your data set, and not very extendable. Solving the problem in Ruby is fairly simple, and provides you with flexible formatting and more options for manipulating the data set in other ways down the road. YMMV.
#!/usr/bin/env ruby
# You could read from a file if you prefer,
# but this is your provided corpus.
nums = [5, 6, 7, 8, 15, 16, 17, 25, 26, 27]
# Find gaps between zero and first digit.
nums.unshift 0
# Create array of arrays containing missing digits.
missing_nums = nums.each_cons(2).map do |array|
  (array.first.succ...array.last).to_a unless
    array.first.succ == array.last
end.compact
# => [[1, 2, 3, 4], [9, 10, 11, 12, 13, 14], [18, 19, 20, 21, 22, 23, 24]]
# Format the results any way you want.
puts missing_nums.map { |ary| "#{ary.first}-#{ary.last}" }
Given your current corpus, this yields the following on standard output:
1-4
9-14
18-24
Just remember the previous number and verify that the current one is the previous plus one:
#! /bin/bash
previous=0
while read n ; do
    if (( n != previous + 1 )) ; then
        echo $(( previous + 1 ))-$(( n - 1 ))
    fi
    previous=$n
done
You might need to add some checking to prevent lines like 28-28 for single number gaps.
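That check could look like this (a sketch; it prints a single number instead of an N-N range when exactly one value is missing):
#! /bin/bash
previous=0
while read n ; do
    if (( n != previous + 1 )) ; then
        if (( n - previous == 2 )) ; then
            echo $(( previous + 1 ))          # exactly one number missing
        else
            echo $(( previous + 1 ))-$(( n - 1 ))
        fi
    fi
    previous=$n
done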
Perl solution similar to awk solution from StardustOne:
perl -ane 'if ($F[0] != $p+1) {printf "%d-%d\n",$p+1,$F[0]-1}; $p=$F[0]' file.txt
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace. Fields are indexed starting with 0.
-e execute the perl code
Given input file, use the numinterval util and paste its output beside file, then munge it with tr, xargs, sed and printf:
gaps() { paste <(echo; numinterval "$1" | tr 1 '-' | tr -d '[02-9]') "$1" |
tr -d '[:blank:]' | xargs echo |
sed 's/ -/-/g;s/-[^ ]*-/-/g' | xargs printf "%s\n" ; }
Output of gaps file:
5-8
15-17
25-27
How it works. The output of paste <(echo; numinterval file) file looks like:
5
1 6
1 7
1 8
7 15
1 16
1 17
8 25
1 26
1 27
From there we mainly replace things in column #1, and tweak the spacing. The 1s are replaced with -s, and the higher numbers are blanked. Remove some blanks with tr. Replace runs of hyphens like "5-6-7-8" with a single hyphen "5-8", and that's the output.
This one lists the numbers that break the sequence in a list.
Idea taken from choroba but done with a for loop.
#! /bin/bash
previous=0
n=$( cat listaNums.txt )
for number in $n
do
    numListed=$(($number - 1))
    if [ $numListed != $previous ] && [ $number != 2147483647 ]; then
        echo $numListed
    fi
    previous=$number
done
