File name substitution using awk and for loop - shell

Hi, I am trying to write dynamic filenames using variable substitution, and I am unable to figure out what I am missing here.
for i in `cat justPid.csv`
do
awk -v var="$i" -F"," '{if ($1==var) {print $0 }}' uniqPid.csv > "$i"file.txt
done
I have also tried the one below and many other combinations, but it won't write multiple files named after $i.
for i in `cat justPid.csv`
do
awk -v var="$i" -F"," '{if ($1==var) {print $0 }}' uniqPid.csv > ${i}_file.txt
done
Any suggestions?
Edit:
My original intent is to split a 27 GB file into manageable chunks based on PID (an identifier in the file) so that it can be loaded into RStudio for analysis. I am working on my laptop and not on a server, hence the need to break it into small files.
Also, I am using the ("new") Ubuntu Bash shell on Windows.
The smaller test files I am working on look like what Jithin has posted. I will try out the suggestions and will update this post!
$cat justPid.csv
aaaa
bbbb
cccc
$cat uniqPid.csv
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
bbbb,1234567890
cccc,1234567890
dddd,cccccccccc
ffff,1234567890

I am not quite sure this is what you are looking for.
input files
$cat justPid.csv
aaaa
bbbb
cccc
$cat uniqPid.csv
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
bbbb,1234567890
cccc,1234567890
dddd,cccccccccc
ffff,1234567890
script using for loop
for i in $(cat justPid.csv)
do
awk -v var="$i" -F, '$1==var' uniqPid.csv > "${i}_file.txt"
done
script using while loop
while IFS= read -r i
do
awk -v var="$i" -F, '$1==var' uniqPid.csv > "${i}_file.txt"
done < justPid.csv
Output
$ cat aaaa_file.txt
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
$ cat bbbb_file.txt
bbbb,1234567890
$ cat cccc_file.txt
cccc,1234567890
Note: reading lines with a for loop is not advised; use a while loop with the read command instead (see "Don't Read Lines With For").

Without sample input/output it's just an untested guess but I THINK all you need is either:
awk -F, '{print > ($1"_file.txt")}' uniqPid.csv
or maybe:
awk -F, 'NR==FNR{a[$1];next} $1 in a{print > ($1"_file.txt")}' justPid.csv uniqPid.csv
So far I don't see any reason for a loop at all. You might need to close the output files as you go but we can address that if/when you provide sample input/output and tell us whether or not you have GNU awk.
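For what "closing the output files as you go" could look like, here is a minimal sketch, assuming the sample data from the question and a scratch directory created just for the demo. It appends with >> because a PID's file may be reopened if the input is not grouped by PID, so it also assumes no stale *_file.txt files exist beforehand.

```shell
#!/usr/bin/env bash
# Work in a scratch directory so no real files are touched (demo assumption).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data copied from the question.
printf '%s\n' aaaa bbbb cccc > justPid.csv
printf '%s\n' aaaa,1234567890 aaaa,aaaaaaaaaa aaaa,bbbbbbbbbb \
              bbbb,1234567890 cccc,1234567890 dddd,cccccccccc \
              ffff,1234567890 > uniqPid.csv

# First file: remember the wanted PIDs.
# Second file: append each matching row to its PID's file and close the
# previous output file whenever the PID changes, so only one file is
# open at any time.
awk -F, '
    NR == FNR     { want[$1]; next }
    !($1 in want) { next }
    {
        out = $1 "_file.txt"
        if (out != prev) {
            if (prev != "") close(prev)
            prev = out
        }
        print >> out
    }
' justPid.csv uniqPid.csv

ls
```

If the big file is already sorted by PID, each output file is opened exactly once, which keeps the cost of the close() calls negligible.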

Related

bash variables from nested for loops in awk

I want to simply use the two for loop variables in my awk code but I can't. Please help or guide me in the right direction.
for i in {30,60,100};
do
for j in {7,8};
do
awk -v x=$i -v y=$j '{if ($NF <=x) print $0}' S_$i.txt > S_$i_$j.txt;
done;
done
This was the error I received.
awk: fatal: cannot open file S_.txt for reading (No such file or directory)
S_$i_$j.txt is trying to access a variable named $i_. Use S_${i}_${j}.txt instead but also always quote your shell variables so it should really be:
awk -v x="$i" -v y="$j" '{if ($NF <= x) print $0}' "S_${i}.txt" > "S_${i}_${j}.txt"
or more awkishly:
awk -v x="$i" -v y="$j" '$NF <= x' "S_${i}.txt" > "S_${i}_${j}.txt"
and note that you never use y inside your awk script so it could just be:
awk -v x="$i" '$NF <= x' "S_${i}.txt" > "S_${i}_${j}.txt"
but then it's not clear why you'd want to create 2 copies of your output with each inner loop.
Whatever you're doing, though, could almost certainly be done much faster with a single call to awk than calling it multiple times within shell loops!
The problem you asked about has absolutely nothing to do with "for loop variables in my awk code", by the way; it's all shell fundamentals.
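To see the shell parsing issue in isolation, without awk or any input files, here is a small sketch. The variable names wrong and right are only for illustration.

```shell
#!/usr/bin/env bash
i=30
j=7
unset i_   # make sure no variable named i_ exists for this demo

# "$i_" is parsed as the variable named "i_", which is unset,
# so it expands to nothing and the value of i is lost.
wrong="S_$i_$j.txt"

# Braces mark where each variable name ends.
right="S_${i}_${j}.txt"

echo "without braces: $wrong"
echo "with braces:    $right"
```

The first name comes out as S_7.txt and the second as S_30_7.txt, which is why the braces matter whenever a variable is followed directly by an underscore, letter, or digit.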
Thanks for your quick response.
However, I tried the following and it worked:
for i in {30,60,100};
do
for j in {7,8};
do
awk -v x=$i -v y=$j '{if ($NF <=x) print $0}' "S_"$j".txt" > "S_"$j"_"$i".txt";
done;
done;
Additionally, I realized that S_30.txt didn't exist. So when I changed it to "S_"$j".txt" it worked fine. My bad on that one.

splitting file with awk command

I was trying to split a file into a training data set and a test data set, but I got this error:
awk: can't open file -v source line number 1.
The command line was as follows:
awk -v lines=$(wc -l < data/yelp/yelp_review.v8.csv) -v fact=0.80 'NR <= lines * fact {print > "train.txt"; next} {print > "val.txt"}' data/yelp/yelp_review.v8.csv
Can anybody enlighten me as to why this was a problem on a MacBook?
Well .. miken32 has already identified what went wrong with your first attempt. I can't improve on his explanation of the problem.
My suggestion would be that rather than having wc provide your line count, you just do that job with awk itself. Something like this:
awk -v fact=0.8 'NR==FNR{lines++;next} FNR<=lines*fact{print>"train.txt";next} {print>"val.txt"}' "$file" "$file"
Though I'd probably write it more like this:
awk -v fact=0.8 'NR==FNR{lines++;next} {out="val.txt"} FNR<=lines*fact{out="train.txt"} {print > out}' "$file" "$file"
You can decide whether greater elegance is gained by brevity or avoidance of a next. :-)
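A quick way to try the two-pass idea on something small: the sketch below assumes a hypothetical 10-line file (data.csv, generated with seq) in a scratch directory, and uses a conditional expression as a third way of picking the output name.

```shell
#!/usr/bin/env bash
# Scratch directory and a 10-line stand-in for the real CSV (assumptions).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
file=data.csv
seq 1 10 > "$file"

# Pass 1 (NR == FNR) only counts lines.
# Pass 2 sends the first 80% of the lines to train.txt, the rest to val.txt.
awk -v fact=0.8 '
    NR == FNR { lines++; next }
    {
        out = (FNR <= lines * fact) ? "train.txt" : "val.txt"
        print > out
    }
' "$file" "$file"

wc -l train.txt val.txt
```

With 10 input lines and fact=0.8, lines 1 to 8 go to train.txt and lines 9 and 10 go to val.txt.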
What does the output from wc -l < data/yelp/yelp_review.v8.csv look like? Something like this perhaps?
74
So what's going to happen when you drop that into your command?
awk -v lines= 74 -v fact=0.80 ...
As you can see, this isn't going to parse well. Always quote any variable data you use:
awk -v lines="$(wc -l < data/yelp/yelp_review.v8.csv)" -v fact=0.80 ...
Awk is smart enough to trim the spaces from the number before using it.
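The effect can be reproduced on any system by simulating the padded count. This is only a sketch: the value in padded stands in for what wc -l printed, and set -- is used to count how many arguments the shell would hand to awk.

```shell
#!/usr/bin/env bash
# Simulated padded output of "wc -l < file" (assumption for the demo).
padded='      74'

# Unquoted: the shell splits the value off, so awk would receive
# "-v", "lines=" and "74" as three separate arguments.
set -- -v lines=$padded
unquoted_args=$#

# Quoted: the assignment stays in one piece, giving two arguments.
set -- -v lines="$padded"
quoted_args=$#

echo "unquoted: $unquoted_args args, quoted: $quoted_args args"

# With quotes, awk converts the padded string to a number when it is
# used in arithmetic.
result=$(awk -v lines="$padded" -v fact=0.80 'BEGIN { print lines * fact }')
echo "lines * fact = $result"
```

In the unquoted case awk sees "74" as the program text and the real program as a file name, which is where the confusing "can't open file" message comes from.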

Assign bash variables from one awk command?

Hoping someone can help me make my awk commands more efficient please!
Let's say my text file has around 30 lines of this type of thing:
ENTIRE:11.3.28.4.0
OSVER:Solaris11
VARFREE:3G
I'm assigning these to variables in a bash script like this:
ENTIRE=$(awk -F\: '$1 ~ /ENTIRE/ {print $2}' $HOSTFILE)
RELEASE=$(awk -F\: '$1 ~ /RELEASE/ {print $2}' $HOSTFILE)
OSVER=$(awk -F\: '$1 ~ /OSVER/ {print $2}' $HOSTFILE)
Because I have around 30 of these, it means awk is run 30 times, which is slow, and clearly not the best way.
Can anyone suggest how I can build these into one awk command please?
Thanks in advance!
You don't need awk at all. If modifying the original file isn't an option, use a while loop and the declare command to define each variable.
while IFS=: read -r name value; do
declare "$name=$value"
done < "$HOSTFILE"
An example:
$ IFS=: read name value <<< "foo:bar"
$ declare "$name=$value"
$ echo "$foo"
bar
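Here is a fuller sketch of the same loop run against a file, assuming Bash and a hypothetical host file built from the three sample lines in the question. The identifier check is an addition that is not in the answer above: declare acts on whatever text the file contains, so it seems prudent to skip lines whose key is not a plain variable name.

```shell
#!/usr/bin/env bash
# Hypothetical host file with the key:value layout from the question.
HOSTFILE=$(mktemp)
printf '%s\n' 'ENTIRE:11.3.28.4.0' 'OSVER:Solaris11' 'VARFREE:3G' > "$HOSTFILE"

# One pass over the file, no awk processes at all.
while IFS=: read -r name value; do
    # Skip keys that are not valid shell identifiers (added safety check).
    [[ $name =~ ^[A-Za-z_][A-Za-z0-9_]*$ ]] || continue
    declare "$name=$value"
done < "$HOSTFILE"

echo "ENTIRE=$ENTIRE OSVER=$OSVER VARFREE=$VARFREE"
rm -f "$HOSTFILE"
```

Note that declare inside a function creates local variables, so this loop is meant to run at the top level of the script.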

Extract string between two patterns (inclusive) while conserving the format

I have a file in the following format
cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK including these two strings, while maintaining the first column FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see I am getting close but not able to include TOM and JACK patterns in my output. Plus I am also losing the original FS. What am I doing wrong?
Any help will be appreciated.
You are changing a field ($2) which causes awk to reconstruct the record using the value of OFS as the field separator and so in this case changing the commas to spaces.
Never use _ as a variable name. A name with no meaning is only slightly better than a name with the wrong meaning, so pick one that means something; in this case that would be seen, though I don't know what you are trying to achieve with it in this context.
gsub() and sub() do not support capture groups so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match():
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and just to show one way of stopping at the first instead of the last JACK on the line:
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
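If GNU awk is not available, the same "first TOM through first JACK" result can probably be had with only match(), substr() and index(). This is a sketch, not part of the answer above; the variable names rest and pos are made up for the example, and the sample rows are the ones shown earlier.

```shell
#!/usr/bin/env bash
# Feed the three sample rows on stdin and capture the result.
result=$(printf '%s\n' \
    'id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK' \
    'id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar' \
    'id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar' |
awk 'BEGIN { FS = OFS = "," }
     match($2, /TOM/) {
         rest = substr($2, RSTART)    # text from the first TOM onward
         pos = index(rest, "JACK")    # start of the first JACK after it
         if (pos) $2 = substr(rest, 1, pos + 3)   # keep through JACK (4 chars)
     }
     1')
echo "$result"
```

index() does a plain string search, so it finds the leftmost JACK without any need for a non-greedy regexp.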
Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed:
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt > out.txt
Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ sed -e 's/,.*TOM/,TOM/' -e 's/JACK.*/JACK/' test.txt
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
$
This should work as long as the TOM and JACK do not repeat themselves.
sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK

How to increment number in a file

I have one file with data like the line below; let's say the file name is file1.txt:
2013-12-29,1
Here I have to increment the number by 1, so it should become 1+1=2, like this:
2013-12-29,2
I tried to use sed for the replacement, and it must be done with variables only.
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i 's\$oldnum\$newnum\g' file1.txt
But I get a syntax error from sed. Is there any way to do this? Thanks in advance.
Sed cannot use a backslash as the delimiter; use forward slashes instead. There are actually several interesting issues with your use of '\', but the quick fix is below (note the double quotes, which let the shell expand the variables):
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i "s/$oldnum\$/$newnum/g" file1.txt
However, I question whether sed is really the right tool for the job in this case. A more complete single tool ranging from awk to perl to python might work better in the long run.
Note that I used a $ end-of-line match to ensure you didn't replace 2012 with 2022, which I don't think you wanted.
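To see what the end-of-line anchor buys you, here is a small sketch that runs both forms of the substitution on the sample line without touching any file. The variable names unanchored and anchored are just for the demo.

```shell
#!/usr/bin/env bash
line='2013-12-29,1'
oldnum=${line##*,}        # text after the last comma
newnum=$((oldnum + 1))

# Unanchored: every "1" on the line is replaced, including the date.
unanchored=$(printf '%s\n' "$line" | sed "s/$oldnum/$newnum/g")

# Anchored with \$: only the number at the end of the line changes.
anchored=$(printf '%s\n' "$line" | sed "s/$oldnum\$/$newnum/")

echo "unanchored: $unanchored"
echo "anchored:   $anchored"
```

The unanchored form turns the line into 2023-22-29,2, while the anchored form gives the intended 2013-12-29,2.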
I usually use awk for jobs like this.
The following might work:
awk -F',' '{printf("%s,%d\n",$1,$2+1)}' file1.txt
Here is how to do it with awk
awk -F, '{$2=$2+1}1' OFS=, file1.txt
2013-12-29,2
or more simply (this will fail if the value is -1, because the result 0 is treated as false and the line is not printed)
awk -F, '$2=$2+1' OFS=, file1.txt
To write the change back to the file, save the output somewhere else (tmp in the example below) and then move it back to the original name:
awk -F, '{$2=$2+1}1' OFS=, file1.txt >tmp && mv tmp file1.txt
Or, using GNU awk 4.1 or later, you can edit in place and skip the temp file:
gawk -i inplace -F, '{$2=$2+1}1' OFS=, file1.txt
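The -1 caveat is easy to miss, so here is a sketch that runs both awk forms on two lines, one of which has -1 in the second field. The second sample line is made up for the demo.

```shell
#!/usr/bin/env bash
# First line is from the question; the second is a made-up edge case.
input=$(printf '%s\n' '2013-12-29,1' '2013-12-30,-1')

# Action plus the always-true pattern "1": every line is printed.
safe=$(printf '%s\n' "$input" | awk -F, '{$2=$2+1}1' OFS=,)

# Assignment used as the pattern: a result of 0 counts as false,
# so the line whose value was -1 is silently dropped.
terse=$(printf '%s\n' "$input" | awk -F, '$2=$2+1' OFS=,)

printf 'safe:\n%s\n' "$safe"
printf 'terse:\n%s\n' "$terse"
```

The longer form prints both lines (the second becomes 2013-12-30,0), while the terse form prints only the first.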
Another single-line way would be:
expr $(cat /tmp/file 2>/dev/null) + 1 >/tmp/file
This works if the file doesn't exist or if the file doesn't contain a valid number; in both cases the file is (re)created with a value of 1.
awk is the best tool for your problem, but you can also do the calculation in the shell.
In case you have more than one row, I am using a loop here:
#!/bin/bash
while IFS=, read -r DATE NUM
do
echo "$DATE,$((NUM+1))"
done < file1.txt
A Bash one-liner option using bc. Sample:
$ echo 3 > test
$ echo 1 + $(<test) | bc > test
$ cat test
4
Also works:
bc <<< "1 + $(<test)" > test
