Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have the sample file below; the real file is about 50 MB. I want to match lines equal to "AAA 2A" and, for every match except the first, increment the number in the second column. The input below has 3 lines that contain "AAA 2A", so the first stays as it is, the second becomes "AAA 3A", and the third becomes "AAA 4A".
Input file
[root#localhost]# cat sample.txt
AAA 2A
BBB 4A
AAA 2A
BBB 1A
AAA 2A
AAA 3A
BBB 2A
AAA 4A
Expected Output
[root#localhost]# cat output.txt
AAA 2A
BBB 4A
AAA 3A
BBB 1A
AAA 4A
AAA 3A
BBB 2A
AAA 4A
awk 'BEGIN{counter=1}$0=="AAA 2A"{counter++;sub("2", counter)}1' <file>
counter starts at 1 and is incremented every time a line equals "AAA 2A". On those lines, the 2 is replaced with the value of counter (so on the first match the line does not change, because counter is already 2).
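Note that sub("2", counter) replaces the first 2 found anywhere on the line. A minimal sketch of a variant that rewrites only the second field is shown here, assuming space-separated fields; the helper name increment_matches and the temporary file are made up for the illustration.

```shell
# Recreate the sample input from the question in a temporary file.
sample=$(mktemp)
printf '%s\n' 'AAA 2A' 'BBB 4A' 'AAA 2A' 'BBB 1A' \
              'AAA 2A' 'AAA 3A' 'BBB 2A' 'AAA 4A' > "$sample"

# increment_matches is a hypothetical helper, not part of the original answer.
increment_matches() {
    awk '
    $1 == "AAA" && $2 == "2A" {
        n++                              # n = number of matches seen so far
        if (n > 1) $2 = (n + 1) "A"      # 2nd match becomes 3A, 3rd becomes 4A, ...
    }
    { print }' "$1"
}

increment_matches "$sample"
```

Assigning to $2 makes awk rebuild the line with single spaces between fields, which is harmless for this input but would normalize any wider spacing.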
I have tried to come up with an answer, but nothing I try works.
The code below is what I have come up with:
sort -k$field_number "$1".db > temp.txt && cp temp.txt "$1.db"
Shouldn't this line of code sort the .db file by ASCII value (I thought sort orders by ASCII by default)? In the code, field_number is the column I want to sort the lines of the file by. When I run my code on the file (sorting by column 2), I get the output below.
Textfile (the .db file) format:
a 5 5 5
Green 72 72 72
Smith 84 72 93
Jones 85 73 94
z 9 9 9
Ford 92 64 93
Miller 93 73 87
bobua che Apple Xor
Maybe your problem is with your collation. Please try this:
LC_COLLATE=C sort -n --ignore-case -k$field_number "$1".db > temp.txt && cp temp.txt "$1.db"
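To see how byte-wise and numeric ordering differ on the sample data, here is a small sketch. It uses LC_ALL=C rather than LC_COLLATE=C, because LC_ALL overrides LC_COLLATE when both are set, and it limits the key to a single field with -k2,2. The function names and the temporary file are illustrative only.

```shell
db=$(mktemp)
printf '%s\n' 'Green 72 72 72' 'a 5 5 5' 'Smith 84 72 93' 'z 9 9 9' \
              'Jones 85 73 94' 'bobua che Apple Xor' 'Ford 92 64 93' \
              'Miller 93 73 87' > "$db"

# Byte-wise (ASCII) order on field $2 of file $1: "72" sorts before "9".
sort_bytes() { LC_ALL=C sort -k"$2,$2" "$1"; }

# Numeric order on the same field: 9 sorts before 72.
# A non-numeric key such as "che" has no numeric prefix and counts as 0.
sort_numeric() { LC_ALL=C sort -n -k"$2,$2" "$1"; }

sort_bytes "$db" 2
echo ---
sort_numeric "$db" 2
```

The byte-wise call reproduces the ordering shown in the question, while the numeric call puts the "che" line first, followed by 5, 9, 72, and so on.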
I have 2 CSV files. The 1st one is my main CSV, which contains all the columns I need. The 2nd CSV contains 2 columns, where the 1st column is an identifier and the 2nd column is the replacement value. For example:
Main.csv
aaa 111 bbb 222 ccc 333
ddd 444 eee 555 fff 666
iii 777 jjj 888 kkk 999
lll 101 eee 201 nnn 301
replacement.csv
bbb abc
jjj def
eee ghi
I want the result to look like the following. The 3rd column of Main.csv is the identifier and is matched against the 1st column of replacement.csv; where they match, the 5th column of Main.csv should be replaced with the 2nd column of replacement.csv. Main.csv can contain repeated identifiers, so every occurrence should be changed to the appropriate replacement value.
aaa 111 bbb 222 abc 333
ddd 444 eee 555 ghi 666
iii 777 jjj 888 def 999
lll 101 eee 201 ghi 301
I tried code like this:
while read col1 col2 col3 col4 col5 col6
do
while read col7 col8
do
if[$col7==col3]
then
col5=col8
fi
done < RepCSV
done < MainCSV > MainCSV
But it did not work.
I'm quite new to bash, so any help will be appreciated. Thanks in advance.
Using awk:
$ awk '
NR==FNR {                    # process the first file
    a[$1]=$2                 # hash $2 to a, $1 as key
    next                     # next record
}
{                            # second file
    $5=($3 in a?a[$3]:$5)    # replace $5 based on $3
}1' replacement main
aaa 111 bbb 222 abc 333
ddd 444 eee 555 ghi 666
iii 777 jjj 888 def 999
lll 101 eee 201 ghi 301
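For comparison, the nested loop from the question can be made to work with a few fixes: spaces around [ and ], a $ in front of the variables being read, = for string comparison, and output written to a new file instead of the file being read. This is only a sketch, assuming space-separated files with exactly six and two columns; the temporary paths are illustrative.

```shell
dir=$(mktemp -d)
printf '%s\n' 'aaa 111 bbb 222 ccc 333' 'ddd 444 eee 555 fff 666' \
              'iii 777 jjj 888 kkk 999' 'lll 101 eee 201 nnn 301' > "$dir/main"
printf '%s\n' 'bbb abc' 'jjj def' 'eee ghi' > "$dir/replacement"

while read -r col1 col2 col3 col4 col5 col6; do
    # Scan the replacement file for an identifier equal to column 3.
    while read -r col7 col8; do
        if [ "$col7" = "$col3" ]; then
            col5=$col8
        fi
    done < "$dir/replacement"
    printf '%s %s %s %s %s %s\n' "$col1" "$col2" "$col3" "$col4" "$col5" "$col6"
done < "$dir/main" > "$dir/result"

cat "$dir/result"
```

The loop rereads the replacement file for every line of the main file, so the awk version remains the better choice for large inputs.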
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 5 years ago.
I have three files with following format:
$ cat a.bed
chr1 6 6 aa
chr1 8 8 bb
chr2 22 22 aa
chr3 24 24 bb
$ cat b.bed
chr1 12 12 cc
chr1 6 6 dd
chr5 14 14 cc
$ cat c.bed
chr1 8 8 ss
chr4 11 11 dd
chr1 6 6 aa
I want to compare these files using first two columns and print information for each row whether it is present in one file or multiple files, like:
chr1 6 6 aa 3 a.bed,b.bed,c.bed
chr1 8 8 bb 2 a.bed,c.bed
chr2 22 22 aa 1 a.bed
chr3 24 24 bb 1 a.bed
chr1 12 12 cc 1 b.bed
chr5 14 14 cc 1 b.bed
chr4 11 11 dd 1 c.bed
where the 5th column gives the number of files the row is present in and the 6th column gives the names of those files.
awk to the rescue!
$ awk '{a[$1,$2]=(($1,$2) in a?a[$1,$2]",":$0 OFS)FILENAME}
END{for(k in a) print a[k]}' {a,b,c}.bed
results won't be in the same order though.
Explanation
x=c?a:b is the ternary operator; it sets x to a or b depending on the value of c (similar to if-then-else). Here the map entry for key ($1,$2) is either extended with a comma and FILENAME (if the key already exists) or set to the current line followed by FILENAME. The END block just iterates over the map and prints the values.
Try these four lines of gawk (doesn't appear to work in awk):
gawk '{print $0, FILENAME}' a.bed > abc.bed
gawk '{print $0, FILENAME}' b.bed >> abc.bed
gawk '{print $0, FILENAME}' c.bed >> abc.bed
gawk '{f = $5;k=$1 " " $2 " " $3 " " $4;if(k in a){a[k] = a[k] "," f}else{a[k] = f};c[k]++};END{for(k in a){print k, c[k], a[k]}}' abc.bed
Single char variables for brevity:
f - file name,
k - key, i.e. the data,
a - an array of keys,
c - an array of key counts.
Er, if I am reading it right, your input and output data samples don't match, e.g. there are only 2 'chr1 6 6 aa' not 3.
I have two files. One contains a list of items, e.g.,
Allie
Bob
John
Laurie
Another file (file2) contains a different list of items in a different order, but some items might overlap with the items in file 1, e.g.,
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
I want to intersect these two files and extract only those lines from file 2 whose first field matches an item in file 1.
i.e., my output should be
Allie 45 56 67 23
Laurie 45 56 6 75
(preferably in this order, but it's OK if not)
grep -f file1 file2 doesn't do what I want.
I also need something efficient because the second file is HUGE.
I also tried this:
awk -F, 'FNR==NR {a[$1]=$0; next}; $1 in a {print a[$1]}' file2 file1
If order doesn't matter then
awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Explanation
FNR==NR{ arr[$1]; next } This block runs while the first file (file1) is being read. arr is an array whose index keys are the values of the first field, $1.
$1 in arr This runs while the second file (file2) is being read. If the array arr built from the first file has an index key equal to the first column of the current line ($1 in arr is true when the key exists), the current record/row/line of file2 is printed.
Test Results:
akshay#db-3325:/tmp$ cat file1
Allie
Bob
John
Laurie
akshay#db-3325:/tmp$ cat file2
Laurie 45 56 6 75
Moxipen 10 45 56 56
Allie 45 56 67 23
akshay#db-3325:/tmp$ awk 'FNR==NR{ arr[$1]; next }$1 in arr' file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
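If the output should follow the order of file1 instead, as the question prefers, one possible sketch keeps only the matching lines of file2 in memory and prints them in the END block. It assumes each name appears at most once in file2 (a later duplicate would overwrite an earlier one); the function name, array names, and temporary directory are illustrative.

```shell
dir=$(mktemp -d)
printf '%s\n' Allie Bob John Laurie > "$dir/file1"
printf '%s\n' 'Laurie 45 56 6 75' 'Moxipen 10 45 56 56' 'Allie 45 56 67 23' > "$dir/file2"

# filter_in_file1_order is a hypothetical helper name.
filter_in_file1_order() {
    awk '
    FNR == NR  { order[++n] = $1; want[$1]; next }   # file1: remember keys and their order
    $1 in want { hit[$1] = $0 }                      # file2: keep only matching lines
    END {
        for (i = 1; i <= n; i++)
            if (order[i] in hit) print hit[order[i]]
    }' "$1" "$2"
}

filter_in_file1_order "$dir/file1" "$dir/file2"
```

Memory use grows with the number of matching lines rather than with the size of file2, which matters when file2 is huge but file1 is short.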
No need for complex joins, it is a filtering function
$ grep -wFf file1 file2
Laurie 45 56 6 75
Allie 45 56 67 23
This has the benefit of keeping the order in file2 as well. The -w option requires whole-word matches, which eliminates false positives from substring matches. Of course, if your sample input is not representative and key-like entries can appear in other fields, this will not work without anchoring the match to the beginning of the line.
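One way to anchor the match to the beginning of the line is to turn each key into an anchored pattern first. This sketch assumes the keys contain no regular-expression metacharacters and that the key in file2 is followed by whitespace. The patterns file and temporary directory are illustrative, and the extra "Zed" line in file2 is invented to show the difference.

```shell
dir=$(mktemp -d)
printf '%s\n' Allie Bob John Laurie > "$dir/file1"
# The last line is invented: "Bob" appears in it, but not as the first field.
printf '%s\n' 'Laurie 45 56 6 75' 'Moxipen 10 45 56 56' \
              'Allie 45 56 67 23' 'Zed 1 Bob 2 3' > "$dir/file2"

# Turn each key K into the pattern ^K[[:space:]].
sed -e 's/^/^/' -e 's/$/[[:space:]]/' "$dir/file1" > "$dir/patterns"

grep -wFf "$dir/file1" "$dir/file2"    # also prints the invented "Zed" line
echo ---
grep -f "$dir/patterns" "$dir/file2"   # prints only the Laurie and Allie lines
```

Dropping -F is necessary here because the anchored patterns are regular expressions rather than fixed strings.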
This is the job that join is built for.
Providing a reproducer testable via copy-and-paste with shell functions (which you could replace with your actual input files):
cat_file1() {
printf '%s\n' Allie Bob John Laurie
}
cat_file2() {
printf '%s\n' 'Laurie 45 56 6 75' \
'Moxipen 10 45 56 56' \
'Allie 45 56 67 23'
}
join <(cat_file1 | sort) <(cat_file2 | sort)
...properly emits:
Allie 45 56 67 23
Laurie 45 56 6 75
Of course, don't cat file1 | sort -- run sort <file1 to provide a real handle for better efficiency, or (better!) store your inputs in sorted form in the first place.
I have two 2D-array files to read with bash.
What I want to do is extract the elements inside both files.
These two files contain different rows x columns such as:
file1.txt (nx7)
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
.
.
.
file2.txt (mx3)
DESC W S
AAA 100 100
CCC 135 135
EEE 789 789
.
.
.
Here is what I want to do:
Extract the element in DESC column of file2.txt then find the corresponding element in file1.txt.
Extract the W,S elements in such row of file2.txt then find the corresponding W,S elements in such row of file1.txt.
If [W1==W2 && S1==S2]; then echo "${DESC[colindex]} ok"; else echo "${DESC[colindex]} NG"
How can I read this kind of file as a 2D array with bash or is there any convenient way to do that?
bash does not support 2D arrays. You can simulate them by generating 1D array variables like array1, array2, and so on.
Assuming DESC is a key (i.e. has no duplicate values) and does not contain any spaces:
#!/bin/bash
# read data from file1
idx=0
while read -a data$idx; do
    let idx++
done <file1.txt
# process data from file2
while read desc w2 s2; do
    for ((i=0; i<idx; i++)); do
        v="data$i[1]"
        [ "$desc" = "${!v}" ] && {
            w1="data$i[4]"
            s1="data$i[5]"
            if [ "$w2" = "${!w1}" -a "$s2" = "${!s1}" ]; then
                echo "$desc ok"
            else
                echo "$desc NG"
            fi
            break
        }
    done
done <file2.txt
For brevity, optimizations such as taking advantage of sort order are left out.
If the files actually contain the header NO DESC ID TYPE ... then use tail -n +2 to discard it before processing.
A more elegant solution is also possible, one that avoids reading the entire file into memory. This should only be relevant for really large files, though.
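One way such a solution could look is the awk sketch below: it keeps only the smaller file2.txt in memory and streams file1.txt line by line. It assumes both files start with a header line and that DESC is unique and contains no spaces. compare_ws and the temporary directory are hypothetical names, and file2.txt here uses values that deliberately mismatch in places.

```shell
dir=$(mktemp -d)
printf '%s\n' 'NO DESC ID TYPE W S GRADE' '1 AAA 20 AD 100 100 E2' \
              '2 BBB C0 U 200 200 D' '3 CCC 9G R 135 135 U1' \
              '4 DDD 9H Z 246 246 T1' '5 EEE 9J R 789 789 U1' > "$dir/file1.txt"
printf '%s\n' 'DESC W S' 'AAA 000 100' 'CCC 135 135' \
              'EEE 789 000' 'FCK xxx 135' > "$dir/file2.txt"

# compare_ws FILE2 FILE1: FILE2 is held in memory, FILE1 is streamed.
compare_ws() {
    awk '
    FNR == 1  { next }                                 # skip the header line of each file
    NR == FNR { w[$1] = $2 ""; s[$1] = $3 ""; next }   # file2: store W and S per DESC as strings
    $2 in w {
        # Appending "" forces string comparison, matching the bash version.
        ok = (($5 "") == w[$2]) && (($6 "") == s[$2])
        print $2, (ok ? "ok" : "NG")
    }' "$1" "$2"
}

compare_ws "$dir/file2.txt" "$dir/file1.txt"
```

As in the bash script, a DESC that appears in only one of the two files produces no output.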
If the row order does not need to be preserved (i.e. the output can be sorted), maybe this is enough:
join -2 2 -o 1.1,1.2,1.3,2.5,2.6 <(tail -n +2 file2.txt | sort -k1,1) <(tail -n +2 file1.txt | sort -k2,2) |\
sed 's/^\([^ ]*\) \([^ ]*\) \([^ ]*\) \2 \3/\1 OK/' |\
sed '/ OK$/!s/\([^ ]*\) .*/\1 NG/'
For file1.txt
NO DESC ID TYPE W S GRADE
1 AAA 20 AD 100 100 E2
2 BBB C0 U 200 200 D
3 CCC 9G R 135 135 U1
4 DDD 9H Z 246 246 T1
5 EEE 9J R 789 789 U1
and file2.txt
DESC W S
AAA 000 100
CCC 135 135
EEE 789 000
FCK xxx 135
produces:
AAA NG
CCC OK
EEE NG
Explanation:
skip the header line in both files (tail -n +2)
sort both files
join the needed columns from both files into one table; only lines whose DESC field is common to both files appear in the result,
like this:
AAA 000 100 100 100
CCC 135 135 135 135
EEE 789 000 789 789
in lines where columns 2 and 4 match and columns 3 and 5 match, replace everything after the 1st column with OK
in the remaining lines, replace everything after the 1st column with NG