Translating a sed one-liner into awk - bash

I am parsing files containing lines of "key=value" pairs. An example could be this:
Normal line
Another normal line
[PREFIX] 1=Something 5=SomethingElse 26=42
Normal line again
I'd like to leave all lines not containing key=value pairs as they are, while transforming all lines containing key=value pairs as follows:
Normal line
Another normal line
[PREFIX]
AAA=Something
EEE=SomethingElse
ZZZ=42
Normal line again
Assume I have a valid dictionary for the translation.
What I do at the moment is pass the input to sed, turning spaces into newlines on the lines that match '^\['.
The output is then piped into this awk script:
BEGIN {
    dict[1] = "AAA"
    dict[5] = "EEE"
    dict[26] = "ZZZ"
    FS="="
}
{
    if (match($0, "[0-9]+=.+")) {
        key = ""
        if ($1 in dict) {
            key = dict[$1]
        }
        printf("%7s = %s\n", key, $2)
    }
    else {
        print
        next
    }
}
The overall command line then becomes:
cat input | sed '/^\(\[.*\)/s/ /\n/g' | awk -f script.awk
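For the sample input above, the intermediate stream that the awk script receives from sed (assuming a sed that understands \n in the replacement, e.g. GNU sed) looks like this:
Normal line
Another normal line
[PREFIX]
1=Something
5=SomethingElse
26=42
Normal line again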
My question is: is there any way I can include the sed operation in the middle so to get rid of that additional step?

$ cat tst.awk
BEGIN {
    split("1 AAA 5 EEE 26 ZZZ",tmp)
    for (i=1; i in tmp; i+=2) {
        dict[tmp[i]] = tmp[i+1]
    }
    FS="[ =]"
    OFS="="
}
$1 == "[PREFIX]" {
    print $1
    for (i=2; i<NF; i+=2) {
        print " " ($i in dict ? dict[$i] : $i), $(i+1)
    }
    next
}
{ print }
$ awk -f tst.awk file
Normal line
Another normal line
[PREFIX]
AAA=Something
EEE=SomethingElse
ZZZ=42
Normal line again

In fact I could not force awk to read the file twice (once for the sed step and once for your algorithm), so I had to modify your algorithm.
BEGIN {
    dict[1] = "AAA"
    dict[5] = "EEE"
    dict[26] = "ZZZ"
    # FS="="
}
$0 !~ /[0-9]+=.+/ { print }
/[0-9]+=.+/ {
    nb = split($0, arr1);
    for (i=1; i<=nb; i++) {
        nbb = split(arr1[i], keyVal, "=");
        if ( (nbb==2) && (keyVal[1] in dict) ) {
            printf("%7s = %s\n", dict[keyVal[1]], keyVal[2])
        }
        else
            print arr1[i];
    }
}
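With this modification the sed stage is no longer needed; assuming the script is still saved as script.awk, the whole pipeline reduces to:
awk -f script.awk input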

When you have a lot to convert, you can first migrate your dict file into a sed script file. When your dict file has a fixed format, you can convert it on the fly.
Suppose your dict file looks like
1=AAA
5=EEE
26=ZZZ
And your input file is
Normal line
Another normal line
[PREFIX] 1=Something 5=SomethingElse 26=42
Normal line again
You want to do something like
cat input | sed '/^\[/ s/ /\n/g' | sed 's/^1=/ AAA=/'
# Or eliminating the extra step with cat
sed '/^\[/ s/ /\n/g' input | sed 's/^1=/ AAA=/'
So your next step is converting your dict file into sed commands:
sed 's#\([^=]*\)=\(.*\)#s/^\1=/ \2=/#' dictfile
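Applied to the sample dict file above, this generates one substitution command per entry:
$ sed 's#\([^=]*\)=\(.*\)#s/^\1=/ \2=/#' dictfile
s/^1=/ AAA=/
s/^5=/ EEE=/
s/^26=/ ZZZ=/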
Now you can combine these with
sed '/^\[/ s/ /\n/g' input | sed -f <(
sed 's#\([^=]*\)=\(.*\)#s/^\1=/ \2=/#' dictfile
)
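For the sample input and dict files above, the combined command would then produce something like this (the leading space on the key=value lines comes from the generated substitutions):
Normal line
Another normal line
[PREFIX]
 AAA=Something
 EEE=SomethingElse
 ZZZ=42
Normal line again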

Related

bash - add columns to csv rewrite headers with prefix filename

I'd prefer a solution that uses bash rather than converting to a dataframe in Python, etc., as the files are quite big.
I have a folder of CSVs that I'd like to merge into one CSV. The CSVs all have the same header save a few exceptions so I need to rewrite the name of each added column with the filename as a prefix to keep track of which file the column came from.
head file1.csv file2.csv
==> file1.csv <==
id,max,mean,90
2870316.0,111.77777777777777
2870317.0,63.888888888888886
2870318.0,73.6
2870319.0,83.88888888888889
==> file2.csv <==
ogc_fid,id,_sum
"1","2870316",9.98795110916615
"2","2870317",12.3311055738527
"3","2870318",9.81535963468479
"4","2870319",7.77729743926775
The id column of each file might be in a different "datatype" but in every file the id matches the line number. For example, line 2 is always id 2870316.
Anticipated output:
file1_id,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
I'm not quite sure how to do this, but I think I'd use the paste command at some point. I'm surprised that I couldn't find a similar question on Stack Overflow, but I guess it's not that common to have CSVs with the same id on the same line number.
edit:
I figured out the first part.
paste -d , * > ../rasterjointest.txt achieves what I want but the header needs to be replaced
$ cat tst.awk
BEGIN { FS=OFS="," }
FNR==1 {
    fname = FILENAME
    sub(/\.[^.]+$/,"",fname)
    for (i=1; i<=NF; i++) {
        $i = fname "_" $i
    }
}
{ row[FNR] = (NR==FNR ? "" : row[FNR] OFS) $0 }
END {
    for (rowNr=1; rowNr<=FNR; rowNr++) {
        print row[rowNr]
    }
}
$ awk -f tst.awk file1.csv file2.csv
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775
To use minimal memory in awk:
$ cat tst.awk
BEGIN {
    FS=OFS=","
    for (fileNr=1; fileNr<ARGC; fileNr++) {
        filename = ARGV[fileNr]
        if ( (getline < filename) > 0 ) {
            fname = filename
            sub(/\.[^.]+$/,"",fname)
            for (i=1; i<=NF; i++) {
                $i = fname "_" $i
            }
        }
        row = (fileNr==1 ? "" : row OFS) $0
    }
    print row
    exit
}
$ awk -f tst.awk file1.csv file2.csv; paste -d, file1.csv file2.csv | tail -n +2
file1_id,file1_max,file1_mean,file1_90,file2_ogc_fid,file2_id,file2__sum
2870316.0,111.77777777777777,"1","2870316",9.98795110916615
2870317.0,63.888888888888886,"2","2870317",12.3311055738527
2870318.0,73.6,"3","2870318",9.81535963468479
2870319.0,83.88888888888889,"4","2870319",7.77729743926775

Counting palindromes in a text file

Having followed this thread, BASH Finding palindromes in a .txt file, I can't figure out what I am doing wrong with my script.
#!/bin/bash
search() {
    tr -d '[[:punct:][:digit:]#]' \
        | sed -E -e '/^(.)\1+$/d' \
        | tr -s '[[:space:]]' \
        | tr '[[:space:]]' '\n'
}
search "$1"
paste <(search <"$1") <(search < "$1" | rev) \
    | awk '$1 == $2 && (length($1) >=3) { print $1 }' \
    | sort | uniq -c
All I'm getting from this script is the output of the whole text file. I want to output only palindromes of length >= 3 and count them, such as
425 did
120 non
etc. My text file is called sample.txt, and every time I run the script with cat sample.txt | source palindrome I get the message 'bash: : No such file or directory'.
Using awk and sed
awk 'function palindrome(str) {len=length(str); for(k=1; k<=len/2+len%2; k++) { if(substr(str,k,1)!=substr(str,len+1-k,1)) return 0 } return 1 } {for(i=1; i<=NF; i++) {if(length($i)>=3){ gsub(/[^a-zA-Z]/,"",$i); if(length($i)>=3) {$i=tolower($i); if(palindrome($i)) arr[$i]++ }} } } END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
Tested on a 1.2GB file; execution time was ~4m 40s (i5-6440HQ @ 2.60GHz, 4 cores, 16GB).
Explanation :
awk '
function palindrome(str)                 # Function to check whether str is a palindrome
{
    len=length(str);
    for(k=1; k<=len/2+len%2; k++)
    {
        if(substr(str,k,1)!=substr(str,len+1-k,1))
            return 0
    }
    return 1
}
{
    for(i=1; i<=NF; i++)                 # For each field in a record
    {
        if(length($i)>=3)                # if length>=3
        {
            gsub(/[^a-zA-Z]/,"",$i);     # remove non-alpha characters from it
            if(length($i)>=3)            # Check length again after removal
            {
                $i=tolower($i);          # Convert to lowercase
                if(palindrome($i))       # Check if it is a palindrome
                    arr[$i]++            # and store it in the array
            }
        }
    }
}
END{for(i in arr) print arr[i],i}' file | sed -E '/^[0-9]+ (.)\1+$/d'
sed -E '/^[0-9]+ (.)\1+$/d' : From the final result, drop any entry whose word is composed of just one repeated character, like AAA, BBB, etc.
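For instance, feeding it a couple of made-up counted entries (not from the real data) shows the effect:
$ printf '3 dad\n5 aaa\n' | sed -E '/^[0-9]+ (.)\1+$/d'
3 dad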
Old Answer (Before EDIT)
You can try below steps if you want to :
Step 1 : Pre-processing
Remove all unnecessary chars and store the result in temp file
tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
tr -dc 'a-zA-Z\n\t ' This will remove all except letters,\n,\t, space
tr ' ' '\n' This will convert space to \n to separate each word in newlines
Step-2: Processing
grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
grep -wof temp <(rev temp) This will give you all palindromes
-w : Select only those lines containing matches that form whole words.
For example : level won't match with levelAAA
-o : Print only the matched group
-f : To use each string in temp file as pattern to search in <(rev temp)
sed -E -e '/^(.)\1+$/d': This will remove words formed of same letters like AAA, BBBBB
awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }' : This will filter words having length>=3 and counts their frequency and finally prints the result
Example :
Input File :
$ cat file
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
kayak nalayak bob dad , pikachu. meow !! bhow !! 121 545 ding dong AAA BBB done
Output:
$ tr -dc 'a-zA-Z\n\t ' <file | tr ' ' '\n' > temp
$ grep -wof temp <(rev temp) | sed -E -e '/^(.)\1+$/d' | awk 'length>=3 {a[$1]++} END{ for(i in a) print a[i],i; }'
3 dad
3 kayak
3 bob
Just a quick Perl alternative:
perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
END {say "$_ $a{$_}" for keys %a}'
in Perl, $_ should be read as "it".
for( /(\w{3,})/g ) ... for all relevant words (may need some work to reject false positives like "12a21")
if $_ eq reverse($_) ... if it is palindrome
END {say "$_ $a{$_}" for...} ... print each "it" along with its count
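Assuming the words are in sample.txt (the file name from the question), the one-liner would be invoked like this:
perl -0nE 'for( /(\w{3,})/g ){ $a{$_}++ if $_ eq reverse($_)}
           END {say "$_ $a{$_}" for keys %a}' sample.txt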
\thanks{sokowi,batMan}
Running the Script
The script expects that the file is given as an argument. The script does not read stdin.
Remove the line search "$1" in the middle of the script. It is not part of the linked answer.
Make the script executable using chmod u+x path/to/palindrome.
Call the script using path/to/palindrome path/to/sample.txt. If all the files are in the current working directory, then the command is
./palindrome sample.txt
Alternative Script
Sometimes the linked script works and sometimes it doesn't. I haven't found out why. However, I wrote an alternative script which does the same and is also a bit cleaner:
#! /bin/bash
grep -Po '\w{3,}' "$1" | grep -Evw '(.)\1*' | sort > tmp-words
grep -Fwf <(rev tmp-words) tmp-words | uniq -c
rm tmp-words
Save the script, make it executable, and call it with a file as its first argument.
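Assuming it is saved as palindrome next to the question's sample.txt, that amounts to:
chmod u+x palindrome
./palindrome sample.txt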

Copying the contents of a file into another file at a point relative to the end of the file

New to StackExchange, forgive me for errors.
I have an input file that needs to be copied into another file, before the last character.
inputfile.txt:
input {
"inputstuff"
}
filetowriteto.txt:
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
}
After running the script, the resulting file should now be:
filetowriteto.txt:
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
input {
"inputstuff"
}
}
Basically the script copies the input set of lines and pastes them just before the last right bracket in filetowriteto.txt.
The script can't rely on line counts, since filetowriteto.txt doesn't have a predictable number of foo or bar lines, and I don't know how to use sed or awk to do this.
Try:
$ awk 'FNR==NR{ if (NR>1) print last; last=$0; next} {print " " $0} END{print last}' filetowriteto.txt inputfile.txt
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
input {
"inputstuff"
}
}
To change the file in place:
awk 'FNR==NR{ if (NR>1) print last; last=$0; next} {print " " $0} END{print last}' filetowriteto.txt inputfile.txt >tmp && mv tmp filetowriteto.txt
How it works
FNR==NR{ if (NR>1) print last; last=$0; next}
While reading the first file, (a) if we are not on the first line, print the value of last, (b) assign the text of the current line to last, and (c) skip the rest of the commands and jump to the next line.
This uses a common awk trick. The condition FNR==NR is only true while we are reading the first file. This is because, in awk, NR is the number of lines that we have read so far while FNR is the number of lines that we have read so far from the current file. Thus, FNR==NR is only true when we are reading from the first file.
print " " $0
While reading the second file, print each line with some leading white space.
END{print last}
After we have finished printing the second file, print the last line of the first file.
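As an aside, the FNR==NR idiom described above is useful well beyond this question. A minimal sketch (with hypothetical files a.txt and b.txt, not from the question) that prints the lines of b.txt that also occur in a.txt:
awk 'FNR==NR { seen[$0]; next } $0 in seen' a.txt b.txt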
Given:
$ cat /tmp/f1.txt
input {
"inputstuff"
}
$ cat /tmp/f2.txt
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
}
You can use command grouping to achieve this:
$ ( sed '$d' f2.txt ; cat f1.txt ; tail -n1 f2.txt )
or (this version does not create a sub shell)
$ { sed '$d' f2.txt ; cat f1.txt ; tail -n1 f2.txt; }
How does it work?
sed '$d' f2.txt prints all but the last line of f2.txt
cat f1.txt prints f1.txt at that point
tail -n1 f2.txt print the last line of f2 now
If you want to indent f1.txt, use sed instead of cat. You can also use sed to print the last line:
$ { sed '$d' /tmp/f2.txt ; sed 's/^/ /' /tmp/f1.txt ; sed -n '$p' /tmp/f2.txt; }
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
input {
"inputstuff"
}
}
And then you can redirect the output of the grouping to a file if you wish with a > redirect.
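For example (a sketch; merged.txt is just a temporary name chosen here), write the result to a new file and then replace the original, which avoids truncating f2.txt while it is still being read:
# write to a temporary file first, then replace the original
{ sed '$d' f2.txt ; sed 's/^/  /' f1.txt ; sed -n '$p' f2.txt; } > merged.txt
mv merged.txt f2.txt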
$ cat tst.awk
NR==FNR {
    rec = (NR>1 ? rec ORS : "") $0
    next
}
FNR>1 {
    print prev
    if ( sub(/[^[:space:]].*/,"",prev) ) {
        indent = prev
    }
}
{ prev=$0 }
END {
    gsub(ORS,ORS indent,rec)
    print indent rec ORS prev
}
$ awk -f tst.awk inputfile.txt filetowriteto.txt
Stuff {
foo {
"foostuff"
}
bar {
"barstuff"
}
input {
"inputstuff"
}
}
The above uses the indentation from the last non-blank line before the last line of the file to be modified to set the indentation for the new file it's inserting.

Bash find/replace and run command on matching group

I'm trying to do a dynamic find/replace where a matching group from the find gets manipulated in the replace.
testfile:
…
other text
base64_encode_SOMEPATH_ something
other(stuff)
text base64_encode_SOMEOTHERPATH_
…
Something like this:
sed -i "" -e "s/(base64_encode_(.*)_)/cat MATCH | base64/g" testfile
Which would output something like:
…
other text
U09NRVNUUklORwo= something
other(stuff)
text U09NRU9USEVSU1RSSU5HCg==
…
Updated per your new requirement. Now using GNU awk for the 3rd arg to match() for convenience:
$ awk 'match($0,/(.*)base64_encode_([^_]+)_(.*)/,arr) {
    cmd = "base64 <<<" arr[2]
    if ( (cmd | getline rslt) > 0) {
        $0 = arr[1] rslt arr[3]
    }
    close(cmd)
} 1' file
…
other text
U09NRVNUUklORwo= something
other(stuff)
text U09NRU9USEVSU1RSSU5HCg==
…
Make sure you read and understand http://awk.info/?tip/getline if you're going to use getline.
If you can't install GNU awk (but you really, REALLY would benefit from having it so do try) then something like this would work with any modern awk:
$ awk 'match($0,/base64_encode_[^_]+_/) {
    arr[1] = substr($0,1,RSTART-1)
    arr[2] = arr[3] = substr($0,RSTART+length("base64_encode_"))
    sub(/_.*$/,"",arr[2])
    sub(/^[^_]+_/,"",arr[3])
    cmd = "base64 <<<" arr[2]
    if ( (cmd | getline rslt) > 0) {
        $0 = arr[1] rslt arr[3]
    }
    close(cmd)
} 1' file
I say "something like" because you might need to tweak the substr() and/or sub() args if they're slightly off, I haven't tested it.
awk '!/^base64_encode_/ { print } /^base64_encode_/ { fflush(); sub("^base64_encode_", ""); sub("_$", ""); cmd = "base64"; print $0 | cmd; close(cmd); }' testfile > testfile.out
This says to print non-matching lines unaltered.
Matching lines get altered with the awk function sub() to extract the string to be encoded, which is then piped to the base64 command, which prints the result to stdout.
The fflush call is needed so that all the previous output from awk has been flushed before the base64 output appears, ensuring lines aren't re-ordered.
Edit:
As pointed out in the comment, testing every line twice for matching a pattern and non-matching the same pattern isn't very good. This single action handles all lines:
{
    if ($0 !~ "base64_encode_")
    {
        print;
        next;
    }
    fflush();
    sub("^.*base64_encode_", "");
    sub("_$", "");
    cmd = "base64";
    print $0 | cmd;
    close(cmd);
}
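Assuming the action above is saved in a file, say encode.awk (a name chosen here, not from the answer), it is run the same way as the one-liner:
awk -f encode.awk testfile > testfile.out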

get Nth line in file after parsing another file

I have one of my large files, which looks like this:
foo:43:sdfasd:daasf
bar:51:werrwr:asdfa
qux:34:werdfs:asdfa
foo:234:dfasdf:dasf
qux:345:dsfasd:erwe
...............
Here the 1st column (foo, bar, qux, etc.) contains file names, and the 2nd column (43, 51, 34, etc.) contains line numbers. I want to print the Nth line (specified by the 2nd column) of each file (specified by the 1st column).
How can I automate this in a Unix shell?
Actually, the above file is generated while compiling, and I want to print the warning line in the code.
-Thanks,
while IFS=: read name line rest
do
    head -n $line $name | tail -1
done < input.txt
while IFS=: read file line message; do
    echo "$file:$line - $message:"
    sed -n "${line}p" "$file"
done <yourfilehere
awk 'NR==4 {print}' yourfilename
or
cat yourfilename | awk 'NR==4 {print}'
The above will print the 4th line of your file. You can change the number as per your requirement.
Just in awk, but probably worse performance than the answers by @kev or @MarkReed.
However, it does process each file just once. Requires GNU awk.
gawk -F: '
BEGIN {OFS=FS}
{
    files[$1] = 1
    lines[$1] = lines[$1] " " $2
    msgs[$1, $2] = $3
}
END {
    for (file in files) {
        split(lines[file], l, " ")
        n = asort(l)
        count = 0
        for (i=1; i<=n; i++) {
            # read forward until we reach the next wanted line number
            while (count < l[i]) {
                getline line < file
                count++
            }
            print file, l[i], msgs[file, l[i]]
            print line
        }
        close(file)
    }
}
'
This might work for you:
sed 's/^\([^:]*\):\([^:]*\).*/sed -n "\2p" \1/' file |
sort -k4,4 |
sed ':a;$!N;s/^\(.*\)\(".*\)\n.*"\(.*\)\2/\1;\3\2/;ta;P;D' |
sh
sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt
qux 34
-n: no output by default,
-r: extended regular expressions (simplifies using the parens),
on line 3, do {...;p} (print at the end),
s: substitute "name:number:rest" with "name number" (so "qux:34:werdfs:asdfa" becomes "qux 34").
So to work with the values:
fnUln=$(sed -nr '3{s/^([^:]*):([^:]*):.*$/\1 \2/;p}' namesNnumbers.txt)
fn=$(echo ${fnUln/ */})
ln=$(echo ${fnUln/* /})
sed -n "${ln}p" "$fn"
