splitting file with awk command - bash

I was trying to split a file into a training data set and a test data set, but I get this error:
awk: can't open file -v source line number 1.
The command line was as follows:
awk -v lines=$(wc -l < data/yelp/yelp_review.v8.csv) -v fact=0.80 'NR <= lines * fact {print > "train.txt"; next} {print > "val.txt"}' data/yelp/yelp_review.v8.csv
Can anybody enlighten me as to why this is a problem on a MacBook?

Well .. miken32 has already identified what went wrong with your first attempt. I can't improve on his explanation of the problem.
My suggestion would be that rather than having wc provide your line count, you just do that job with awk itself. Something like this:
awk -v fact=0.8 'NR==FNR{lines++;next} FNR<=lines*fact{print>"train.txt";next} {print>"val.txt"}' "$file" "$file"
Though I'd probably write it more like this:
awk -v fact=0.8 'NR==FNR{lines++;next} {out="val.txt"} FNR<=lines*fact{out="train.txt"} {print > out}' "$file" "$file"
You can decide whether greater elegance is gained by brevity or avoidance of a next. :-)
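As a rough check of the two-pass approach, here is a minimal sketch run against a made-up 10-line input in a scratch directory. The file name data.txt and its contents are assumptions for illustration only; the awk program is the second variant from above.

```shell
# Work in a scratch directory so train.txt/val.txt do not clobber real files.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$tmp/data.txt"   # hypothetical 10-line input

(
  cd "$tmp" || exit 1
  # Pass 1 (NR==FNR) only counts lines; pass 2 routes each line by position.
  awk -v fact=0.8 '
    NR==FNR { lines++; next }
    { out = "val.txt" }
    FNR <= lines*fact { out = "train.txt" }
    { print > out }
  ' data.txt data.txt
)

# grep -c "" counts lines without the padding that BSD wc adds.
train_count=$(grep -c '' "$tmp/train.txt")
val_count=$(grep -c '' "$tmp/val.txt")
echo "train=$train_count val=$val_count"   # prints: train=8 val=2
rm -rf "$tmp"
```

Because the file is listed twice, awk reads it twice; that costs an extra pass but removes the dependency on wc and its output format.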

What does the output from wc -l < data/yelp/yelp_review.v8.csv look like? Something like this perhaps?
      74
So what's going to happen when you drop that into your command?
awk -v lines= 74 -v fact=0.80 ...
As you can see, this isn't going to parse well. Always quote any variable data you use:
awk -v lines="$(wc -l < data/yelp/yelp_review.v8.csv)" -v fact=0.80 ...
Awk is smart enough to trim the spaces from the number before using it.
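The effect of the missing quotes can be seen without awk at all. The sketch below assumes a padded count like the one BSD/macOS wc prints (the value is simulated here, not taken from a real file) and counts how many arguments the shell produces in each case.

```shell
count="      74"            # simulated wc output with leading spaces

# Unquoted: word splitting breaks "lines=      74" into "lines=" and "74".
set -- -v lines=$count
unquoted_args=$#

# Quoted: the whole assignment stays one argument.
set -- -v lines="$count"
quoted_args=$#

echo "unquoted=$unquoted_args quoted=$quoted_args"   # prints: unquoted=3 quoted=2

# awk ignores the leading spaces when the value is used as a number.
as_number=$(awk -v lines="$count" 'BEGIN { print lines + 0 }')
echo "$as_number"   # prints: 74
```

In the unquoted case awk sees an empty lines variable, takes 74 as the program text, and then tries to open the next word, -v, as an input file, which matches the reported error.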

Related

bash variables from nested for loops in awk

I want to simply use the two for loop variables in my awk code but I can't. Please help or guide me in the right direction.
for i in {30,60,100};
do
for j in {7,8};
do
awk -v x=$i -v y=$j '{if ($NF <=x) print $0}' S_$i.txt > S_$i_$j.txt;
done;
done
This was the error I received:
awk: fatal: cannot open file S_.txt for reading (No such file or directory).
S_$i_$j.txt is trying to access a variable named $i_, because an underscore is a valid character in a variable name. Use S_${i}_${j}.txt instead, but also always quote your shell variables, so it should really be:
awk -v x="$i" -v y="$j" '{if ($NF <= x) print $0}' "S_${i}.txt" > "S_${i}_${j}.txt"
or more awkishly:
awk -v x="$i" -v y="$j" '$NF <= x' "S_${i}.txt" > "S_${i}_${j}.txt"
and note that you never use y inside your awk script so it could just be:
awk -v x="$i" '$NF <= x' "S_${i}.txt" > "S_${i}_${j}.txt"
but then it's not clear why you'd want to create 2 copies of your output with each inner loop.
Whatever you're doing, though, could almost certainly be done much faster with a single call to awk than calling it multiple times within shell loops!
The problem you asked about has absolutely nothing to do with for loop variables in my awk code btw, it's all shell fundamentals.
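A minimal sketch of the name-parsing issue, assuming no variable called i_ is set in the shell (the unset line enforces that assumption):

```shell
i=30
j=7
unset i_                    # make sure no variable named i_ happens to exist

wrong="S_$i_$j.txt"         # parsed as "S_" + ${i_} + ${j} + ".txt"
right="S_${i}_${j}.txt"     # braces end the variable name where intended

echo "$wrong"   # prints: S_7.txt
echo "$right"   # prints: S_30_7.txt
```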
Thanks for your quick response.
However, I tried the following and it worked:
for i in {30,60,100};
do
for j in {7,8};
do
awk -v x=$i -v y=$j '{if ($NF <=x) print $0}' "S_"$j".txt" > "S_"$j"_"$i".txt";
done;
done;
Additionally, I realized that S_30.txt didn't exist. So when I changed it to "S_"$j".txt" it worked fine. My bad on that one.

why is my while loop skipping over the first line of my file

I have a file with 64 lines in it. I want to extract the first and fifth word of each line to a new file, so I have a while loop running to do this. However, my output file has only 63 lines, and after checking I see that the first line is missing. This is the code I have:
tail -n +9 table.$1 > tab.$1
while read -r
do
awk -v OFS='\t' '{print $1, $5}' > rtable.$1
done < tab.$1
The tail at the beginning is to get the 64 lines I want from a larger file. However, it is not the issue: the tab.$1 file is fine, and it is the rtable.$1 file that is one line short.
The first line of your input file is consumed by read -r. All remaining lines are then processed by awk and written to rtable.$1.
The next iteration of your while loop then ends because read -r has nothing to read anymore. And what a good thing that is too, because otherwise awk > rtable.$1 would have run again and overwritten your output file.
Solution: Just remove the loop.
awk -v OFS='\t' '{print $1, $5}' > rtable.$1 < tab.$1
You could even get rid of tab.$1 entirely:
tail -n +9 table.$1 | awk -v OFS='\t' '{print $1, $5}' > rtable.$1
@melpomene beat me to it, but I'll add an explanation that is too long for a comment.
awk, by default, reads the full contents of standard input. That is the case here, since your script does not give awk an input filename.
The redirection of while ... done < tab.$1 connects tab.$1 to standard input for not only the while, but everything in the while. Therefore, between read and awk, only one of them can get each line of input.
As @melpomene said, read takes one line, and then awk pulls the rest (its default behaviour).
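The shared-stdin behaviour can be reproduced on a small scale. This is a sketch with a made-up three-line input; the temporary file stands in for tab.$1.

```shell
tmp=$(mktemp)
printf '%s\n' 'a 1 2 3 e1' 'b 1 2 3 e2' 'c 1 2 3 e3' > "$tmp"   # hypothetical input

# read takes the first line; awk then drains the rest of the same stdin.
looped=$(while read -r; do awk '{print $1, $5}'; done < "$tmp")

# Without the loop, awk sees every line.
direct=$(awk '{print $1, $5}' "$tmp")

looped_lines=$(printf '%s\n' "$looped" | grep -c '')
direct_lines=$(printf '%s\n' "$direct" | grep -c '')
echo "looped=$looped_lines direct=$direct_lines"   # prints: looped=2 direct=3
rm -f "$tmp"
```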

Assign bash variables from one awk command?

Hoping someone can help me make my awk commands more efficient please!
Let's say my text file has around 30 lines of this type of thing:
ENTIRE:11.3.28.4.0
OSVER:Solaris11
VARFREE:3G
I'm assigning these to variables in a bash script like this:
ENTIRE=$(awk -F\: '$1 ~ /ENTIRE/ {print $2}' $HOSTFILE)
RELEASE=$(awk -F\: '$1 ~ /RELEASE/ {print $2}' $HOSTFILE)
OSVER=$(awk -F\: '$1 ~ /OSVER/ {print $2}' $HOSTFILE)
Because I have around 30 of these, it means awk is run 30 times, which is slow, and clearly not the best way.
Can anyone suggest how I can build these into one awk command please?
Thanks in advance!
You don't need awk at all. If modifying the original file isn't an option, use a while loop and the declare command to define each variable.
while IFS=: read -r name value; do
    declare "$name=$value"
done < "$HOSTFILE"
An example:
$ IFS=: read name value <<< "foo:bar"
$ declare "$name=$value"
$ echo "$foo"
bar

File name substitution using awk and for loop

Hi, I am trying to write dynamic filenames using variable substitution, and I am unable to figure out what I am missing here.
for i in `cat justPid.csv`
do
awk -v var="$i" -F"," '{if ($1==var) {print $0 }}' uniqPid.csv > "$i"file.txt
done
I have also tried the one below and many other combinations, but it won't write multiple files named after $i.
for i in `cat justPid.csv`
do
awk -v var="$i" -F"," '{if ($1==var) {print $0 }}' uniqPid.csv > ${i}_file.txt
done
Any suggestions?
Edit:
My original intent is to split a 27 GB file into manageable chunks based on PID (an identifier in the file) so that it can be loaded into RStudio for analysis. I am working on my laptop and not on a server, hence the need to break it into small files.
Also I am using the ("new") ubuntu bash shell on windows.
The smaller test files I am working on look like what Jithin has posted. I will try out the suggestions and will update this post!
$cat justPid.csv
aaaa
bbbb
cccc
$cat uniqPid.csv
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
bbbb,1234567890
cccc,1234567890
dddd,cccccccccc
ffff,1234567890
I am not quite sure this is what you are looking for.
input files
$cat justPid.csv
aaaa
bbbb
cccc
$cat uniqPid.csv
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
bbbb,1234567890
cccc,1234567890
dddd,cccccccccc
ffff,1234567890
script using for loop
for i in $(cat justPid.csv)
do
awk -v var="$i" -F, '$1==var' uniqPid.csv > "${i}_file.txt"
done
script using while loop
while read -r i
do
awk -v var="$i" -F, '$1==var' uniqPid.csv > "${i}_file.txt"
done < justPid.csv
Output
$ cat aaaa_file.txt
aaaa,1234567890
aaaa,aaaaaaaaaa
aaaa,bbbbbbbbbb
$ cat bbbb_file.txt
bbbb,1234567890
$ cat cccc_file.txt
cccc,1234567890
Note: reading lines with a for loop is not advised; see "Use a while loop and the read command" and "Don't Read Lines With For".
Without sample input/output it's just an untested guess, but I THINK all you need is either:
awk -F, '{print > ($1"_file.txt")}' uniqPid.csv
or maybe:
awk -F, 'NR==FNR{a[$1];next} $1 in a{print > ($1"_file.txt")}' justPid.csv uniqPid.csv
So far I don't see any reason for a loop at all. You might need to close the output files as you go but we can address that if/when you provide sample input/output and tell us whether or not you have GNU awk.
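For reference, here is a sketch of the single-call approach run against the sample data posted in the question, with output files closed as the PID changes. The variable names want, f, and prev are illustrative, and the use of >> (append) is an assumption made so that a PID reappearing later in the input would not truncate its file; it also means stale output files from an earlier run should be removed first.

```shell
tmp=$(mktemp -d)
printf '%s\n' aaaa bbbb cccc > "$tmp/justPid.csv"
printf '%s\n' aaaa,1234567890 aaaa,aaaaaaaaaa aaaa,bbbbbbbbbb \
    bbbb,1234567890 cccc,1234567890 dddd,cccccccccc ffff,1234567890 \
    > "$tmp/uniqPid.csv"

(
  cd "$tmp" || exit 1
  awk -F, '
    NR==FNR { want[$1]; next }          # first file: remember the wanted PIDs
    $1 in want {
      f = $1 "_file.txt"
      if (f != prev) {                  # PID changed: release the old file handle
        if (prev != "") close(prev)
        prev = f
      }
      print >> f                        # append so a reopened file is not truncated
    }
  ' justPid.csv uniqPid.csv
)

aaaa_lines=$(grep -c '' "$tmp/aaaa_file.txt")
bbbb_lines=$(grep -c '' "$tmp/bbbb_file.txt")
if [ -e "$tmp/dddd_file.txt" ]; then dddd_exists=yes; else dddd_exists=no; fi
echo "aaaa=$aaaa_lines bbbb=$bbbb_lines dddd_file=$dddd_exists"
# prints: aaaa=3 bbbb=1 dddd_file=no
rm -rf "$tmp"
```

Closing files matters mainly when there are many distinct PIDs, since some awk implementations limit the number of simultaneously open output files.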

How to increment number in a file

I have a file with a line like the one below; let's say the file name is file1.txt:
2013-12-29,1
Here I have to increment the number by 1, so 1+1=2 and the line should become:
2013-12-29,2
I tried to use sed for the replacement, and it has to be done with variables only.
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i 's\$oldnum\$newnum\g' file1.txt
But I get a syntax error from sed. Is there any way to do this? Thanks in advance.
Sed needs forward slashes, not back slashes. There are multiple interesting issues with your use of '\'s actually, but the quick fix should be (use double quotes too, as you see below):
oldnum=`cut -d ',' -f2 file1.txt`
newnum=`expr $oldnum + 1`
sed -i "s/$oldnum\$/$newnum/g" file1.txt
However, I question whether sed is really the right tool for the job in this case. A more complete single tool ranging from awk to perl to python might work better in the long run.
Note that I used a $ end-of-line match to ensure you didn't replace 2012 with 2022, which I don't think you wanted.
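To illustrate why the end-of-line anchor matters, here is a small sketch that applies both forms of the substitution to the sample line through a pipe rather than editing a file in place:

```shell
line='2013-12-29,1'
oldnum=1
newnum=2

# Without the anchor, every 1 in the line is replaced, including those in the date.
unanchored=$(printf '%s\n' "$line" | sed "s/$oldnum/$newnum/g")

# With \$ (a literal $ once the shell has processed the double quotes),
# only a match at the end of the line is replaced.
anchored=$(printf '%s\n' "$line" | sed "s/$oldnum\$/$newnum/")

echo "$unanchored"   # prints: 2023-22-29,2
echo "$anchored"     # prints: 2013-12-29,2
```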
I usually use awk for jobs like this.
The following code might work:
awk -F',' '{printf("%s,%d\n",$1,$2+1)}' file1.txt
Here is how to do it with awk
awk -F, '{$2=$2+1}1' OFS=, file1.txt
2013-12-29,2
or more simply (this will fail if the value is -1, because the result 0 is treated as false and the line is not printed)
awk -F, '$2=$2+1' OFS=, file1.txt
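A quick sketch of that edge case, using a made-up input line whose second field is -1 and passing OFS with -v instead of as a trailing operand:

```shell
# Action followed by the always-true pattern 1: the line is printed whatever the result.
with_action=$(printf 'x,-1\n' | awk -F, -v OFS=, '{$2=$2+1}1')

# Assignment used as the pattern: the result 0 is false, so nothing is printed.
as_pattern=$(printf 'x,-1\n' | awk -F, -v OFS=, '$2=$2+1')

echo "with_action=[$with_action] as_pattern=[$as_pattern]"
# prints: with_action=[x,0] as_pattern=[]
```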
To write the change back to the file, save the output somewhere else (tmp in the example below) and then move it back to the original name:
awk -F, '{$2=$2+1}1' OFS=, file1.txt >tmp && mv tmp file1.txt
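The write-then-rename pattern can be tried safely in a scratch directory. This sketch uses a hypothetical file1.txt.new as the temporary name, placed next to the original so that mv stays on the same filesystem:

```shell
tmp=$(mktemp -d)
printf '2013-12-29,1\n' > "$tmp/file1.txt"

# Only replace the original if awk succeeded (&&), so a failure leaves it intact.
awk -F, -v OFS=, '{$2=$2+1}1' "$tmp/file1.txt" > "$tmp/file1.txt.new" &&
    mv "$tmp/file1.txt.new" "$tmp/file1.txt"

result=$(cat "$tmp/file1.txt")
echo "$result"   # prints: 2013-12-29,2
rm -rf "$tmp"
```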
Or, using GNU awk (4.1 or later), you can do this to skip the temp file:
awk -i inplace -F, '{$2=$2+1}1' OFS=, file1.txt
Another, single line, way would be
expr `cat /tmp/file 2>/dev/null` + 1 >/tmp/file
This works if the file doesn't exist or if the file doesn't contain a valid number; in both cases the file is (re)created with a value of 1.
awk is the best tool for your problem, but you can also do the calculation in the shell.
In case you have more than one row, I am using a loop here:
#!/bin/bash
# Set IFS only for read so the rest of the script is unaffected.
while IFS=, read -r DATE NUM
do
    echo "$DATE,$((NUM+1))"
done < file1.txt
A Bash one-liner option with bc. Sample:
$ echo 3 > test
$ echo 1 + $(<test) | bc > test
$ cat test
4
Also works:
bc <<< "1 + $(<test)" > test
