The for loop overwrites or duplicates entries - shell

Say I have 250 files from which I need to extract certain information and store it in a text file.
I have tried a for loop in the shell as follows,
text= 'home/path/tothe/textfiles'
for sam in $(find ${text} -name \*_PG.tsv);do
#echo ${sam}
awk '{if($2=="ID") print FILENAME"\t""yes""\t""SAP""\t""LUFTA"}' ${sam}
done >> ${text}/metadata.txt
With the > operator the output text file is overwritten, and with >> the entries are appended multiple times, creating duplicates.
I would like to know what I should change to get rid of these issues. Thanks for any suggestions!

I think you can do this with a single invocation of awk:
path=home/path/tothe/textfiles
awk -v OFS='\t' '$2 == "ID" {
print FILENAME, "yes", "SAP", "LUFTA"
}' "$path"/*_PG.tsv > "$path"/metadata.txt
careful with your variable assignments, there should be no spaces around the =
use the shell to expand the list of files, without find
pass the full list of files as arguments to awk, instead of looping one by one
set the Output Field Separator OFS instead of writing \t to separate your fields
redirect the output to the metadata file
I assume that your awk script is behaving as you expect - I removed the useless if since awk scripts are written like condition { action }. I guess you only want one line of output per file, so you can probably add a nextfile inside the block to avoid processing the rest of each file (a plain exit would stop the whole run now that a single awk invocation handles all the files).
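A minimal sketch of that variant (assuming GNU awk, or another awk that supports nextfile):
awk -v OFS='\t' '$2 == "ID" {
print FILENAME, "yes", "SAP", "LUFTA"
nextfile    # done with this file, move on to the next one
}' "$path"/*_PG.tsv > "$path"/metadata.txt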

Related

Using awk to try to find a variable in a CSV line

I am trying to go through two files: the first one line by line, while using awk to search the second file for a line containing a string pulled from the first file.
while IFS=, read col1 col2 col3
do
echo $(awk -F, -v var="$col2" '$2==var || $2=="www."var {print $0}' searchFile.csv)
# do stuff with data from awk
done < origFile.csv
I am trying to find domain names in this file, and the awk currently never returns matches. I have manually checked that some of the values which are not returning matches do exist in both files.
I have tried using a nested loop, but bash would not open and read the second file. I also tried using grep, but the files are too large and grep would run out of memory.
Sample input for searchFile.csv:
4915,google.com,oct
3532,domain.ca,nov
33451,yahoo.ca,nov
I have ensured there are no spaces in the data being input, and have verified that $col2 from origFile.csv matches data in the searchFile.csv
Does your data file have spaces? If so, your fields ($2) will have spaces and may not match (because you are using awk -F,). Try matching with ~ instead of ==.
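A rough sketch of that change, reusing var from the while-read loop above (untested; note that var is treated as a regular expression here, so stray spaces around the field no longer break the comparison):
awk -F, -v var="$col2" '$2 ~ var || $2 ~ ("www\\." var) {print $0}' searchFile.csv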

Remove a header from a file during parsing

My script gets every .csv file in a dir and writes them together into a new file. It also edits the files such that certain information is written into every row for all of a file's entries. For instance this file called "trap10c_7C000000395C1641_160110.csv":
"",1/10/2016
"Timezone",-6
"Serial No.","7C000000395C1641"
"Location:","LS_trap_10c"
"High temperature limit (�C)",20.04
"Low temperature limit (�C)",-0.02
"Date - Time","Temperature (�C)"
"8/10/2015 16:00",30.0
"8/10/2015 18:00",26.0
"8/10/2015 20:00",24.5
"8/10/2015 22:00",24.0
Is converted into this format
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Location:,LS_trap_10c
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,High,temperature,limit,(°C),20.04
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Low,temperature,limit,(°C),-0.02
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,Date,-,Time,Temperature,(°C)
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,16:00,30.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,18:00,26.0
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,20:00,24.5
LS_trap_10c,7C000000395C1641,trap10c_7C000000395C1641_160110.csv,8/10/2015,22:00,24.0
I use this script to do this:
dos2unix *.csv
gawk '{print FILENAME, $0}' *.csv>>all_master.erin
sed -i 's/Serial No./SerialNo./g' all_master.erin
sed -i 's/ /,/g' all_master.erin
gawk -F, '/"SerialNo."/ {sn = $3}
/"Location:"/ {loc = $3}
/"([0-9]{1,2}\/){2}[0-9]{4} [0-9]{2}:[0-9]{2}"/ {lin = $0}
{$0 =loc FS sn FS $0}1' all_master.erin > formatted_log.csv
sed -i 's/\"//g' formatted_log.csv
sed -i '/^,/ d' formatted_log.csv
rm all_master.erin
printf "\nDone\n"
I want to remove the messy header from the formatted_log.csv file. I've tried and failed to use a sed, as it seems to remove things that I don't want to remove. Is sed the best way to approach this problem? The current sed fixes some problems with the header, but I want the header gone entirely. Any lines that say "serial no." and "location" are important and require information. The other lines can be removed entirely.
I suppose you edited your script before posting; as it stands, it will not produce the posted output (all_master.erin should be $(<all_master.erin) except in the first occurrence).
You don’t specify many vital details of the format of your input files, so we must guess them. Here are my guesses:
You ignore the first two lines and the subsequent empty third line.
The 4th and 5th lines are useful, since they provide the serial number and location you want to use in all lines of that file
The 6th, 7th and 8th lines are useless.
For each file, you want to discard the first four lines of the posted output.
With these assumptions, this is how I would modify your script:
#!/bin/bash
dos2unix *.csv
awk -vFS=, -vOFS=, \
'{gsub("\"","")}
FNR==4{s=$2}
FNR==5{l=$2}
FNR>8{gsub(" ",OFS);print l,s,FILENAME,$0}' \
*.csv > formatted_log.CSV
printf "\nDone\n"
Explanation of the awk script:
First we delete all double quotes with gsub("\"",""). Then, if the line number is 4, we set the variable s to the second field, which is the serial number. If the line number is 5, we set the variable l to the second field, which is the location. If the line number is greater than 8, we do two things. First, we execute gsub(" ",OFS) to replace all spaces with the value of the output field separator: this is needed because the intended output makes two separate fields of date and time, which were only one field in the input. Second, we print the line preceded by the values of l, s and FILENAME as requested.
Note that I’m using the (questionable) Unix trick of naming the output file with an all-caps extension .CSV to avoid it being wrongly matched by a subsequent *.csv. A better solution would be to put it in another directory, but I don’t know anything about your directory tree so I suggest you modify the output file name yourself.
You could use awk to remove anything with fewer than 3 columns in your final file:
awk 'NF>=3' file

How to use awk to split a file and store each filename in a Bash array

Input
A file called input_file.csv, which has 7 columns, and n rows.
Example header and row:
Date Location Team1 Team2 Time Prize_$ Sport
2016 NY Raptors Gators 12pm $500 Soccer
Output
n files, where the rows in each new file are grouped based on their values in column 7 of the original file. Each file is named after that shared value from column 7. Note: each file will have the same header. (The script currently does this.)
Example: if 2 rows in the original file had golf as their value for column 7, they would be grouped together in a file called golf.csv. If 3 other rows shared soccer as their value for column 7, they would be found in soccer.csv.
An array that has the name of each generated file in it. This array lives outside of the scope of awk. (This is what I need help with.)
Example: Array = [golf.csv, soccer.csv]
Situation
The following script produces the desired output. However, I want to run another script on each of the newly generated files and I don't know how.
Question:
My idea is to store the names of each new file in an array. That way, I can loop through the array and do what I want to each file. The code below passes a variable called array into awk, but I don't know how to add the name of each file to the array.
#!/bin/bash
ARRAY=()
awk -v myarray="$ARRAY" -F"\",\"" 'NR==1 {header=$0}; NF>1 && NR>1 {if(! files[$7]) {print header >> ("" $7 ".csv"); files[$7]=1}; print $0 >> ("" $7 ".csv"); close("" $7 ".csv");}' input_file.csv
for i in "${ARRAY[#]}"
do
:
echo $i
done
Rather than struggling to get awk to fill your shell array variable, why not:
make sure that the *.csv files are created in a clean directory
use globbing to loop over all *.csv files in that directory?
awk -F'","' ... # your original Awk command
for i in *.csv # use globbing to loop over resulting *.csv files
do
:
echo $i
done
Just off the top of my head, untested because you haven't supplied very much sample data, what about this?
#!/usr/bin/awk -f
FNR==1 {
header=$0
next
}
!($7 in files) {
files[$7]=sprintf("sport-%s.csv", $7)
print header > files[$7]
}
{
files[$7]=sprintf("sport-%s.csv", $7)
}
{
print > files[$7]
}
END {
printf("declare -a sportlist=( ")
for (sport in files) {
printf("\"%s\"", sport)
}
printf(" )\n");
}
The idea here is that we use the sport names as the keys of the array files[], and store the filename built for each sport as the value. (You can format the filename inside sprintf() as you see fit.) We step through the file, writing a header line whenever we meet a new sport with no recorded filename. Then for non-header lines, we print to the file that belongs to the line's sport.
For your second issue, exporting the array back to something outside of awk, the END block here will output a declare line which can be interpreted by bash. If you feel lucky, you can eval this awk script inside command substitution, and the declare command will effectively be interpreted by your shell:
eval $(/path/to/awkscript inputfile.csv)
Or, if you subscribe to the school of thought that considers eval to be evil, you can redirect the awk script's standard output to a temporary file which you source:
/path/to/awkscript inputfile.csv > /tmp/yadda.$$
. /tmp/yadda.$$
(Don't use this temp file, make a real one with mktemp or the like.)
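For example, a safer variant of the temp-file approach could look like this (just a sketch):
tmp=$(mktemp)
/path/to/awkscript inputfile.csv > "$tmp"
. "$tmp"
rm -f "$tmp"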
There's no way for any program to modify the environment of the parent shell. Just have the awk script output the names of the files as standard output, and use command substitution to put them in an array.
filesArray=($(awk ... ))
If the files might have spaces in them, you need a different solution; assuming you're on bash 4, you can just be sure to print each file on a separate line and use readarray:
readarray filesArray < <( awk ... )
if the files might have newlines in them, too, then things get tricky...
if your file is not large, you can run another script to get the unique $7 elements, for example
$ awk 'NR>1&&!a[$7]++{print $7}' sports
will print the values, you can change it to your file name format as well, such as
$ awk 'NR>1&&!a[$7]++{print tolower($7)".csv"}' sports
this then can be piped to your other process, here for example to wc
$ awk ... sports | xargs wc
This will do what I THINK you want:
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"; print > out} !seen[out]++{print out}' input_file.csv) )
IFS="$oIFS"
If your input file really is comma-separated instead of space-separated as you show in the sample input in your question then adjust the awk script to suit (You might want to look at GNU awk and FPAT).
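For instance, a GNU-awk sketch of that FPAT variant (assuming the same input_file.csv, and using the gawk-manual style field pattern so quoted fields containing commas stay intact), which slots into the same array=( $(...) ) wrapper above:
gawk 'BEGIN{FPAT = "([^,]+)|(\"[^\"]+\")"}   # a field is either non-comma chars or a quoted string
{f = $7; gsub(/"/, "", f); out = f ".csv"; print > out}
!seen[out]++{print out}' input_file.csv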
If you don't have GNU awk then you'll need to add a bit more code to close the open output files as you go.
The above will fail if you have file names that contain newlines but will be fine for blank chars or other white space.
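And as noted above, without GNU awk you need to close the output files as you go; one way that extra bookkeeping might look (a sketch of the same idea in plain POSIX awk):
oIFS="$IFS"; IFS=$'\n'
array=( $(awk '{out=$7".csv"}
!seen[out]++ {print out; printf "" > out; close(out)}   # first time we see this sport: remember the name, truncate the file
{print >> out; close(out)}                              # append the row, then close so only one file is open at a time
' input_file.csv) )
IFS="$oIFS"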

Edit files in Bash

I have a few files that contain IP addresses. I'm creating a script and have to figure out how to create a new user file with an IP address that is based off the file created before it. If the last file contains an IP of A.B.C.D the new file needs to be A.B.C.(D+4).
I think I need to use the 'sed' and 'awk' commands, but haven't been able to get anything working. How would I go about writing this part of the script?
Here's something to get you started: suppose there is a file called input that looks like this:
Input: contents of input
127.0.0.1
127.0.0.2
127.0.0.3
127.0.0.200
You can do on the cmdline:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input > output
Explanation on what awk is doing here:
awk '...' - invoke awk, a tool used primarily for line-by-line manipulation of files, the stuff enclosed by single quotes are instructions to awk.
BEGIN{FS=OFS="."} - tell awk to use . as the delimiter for both input and output. FS stands for "Field Separator"
{$4=$4+4; print} - $4 means the 4th field. Since . is the delimiter, D corresponds to the 4th field and we add the integer value 4 to the 4th field. The print here is just short hand for printing the entire line.
input - name the input file as argument to awk; save a cat
> output - redirect the output to a file so you can inspect them for any issues before making the user files based on it.
Output: contents of output
127.0.0.5
127.0.0.6
127.0.0.7
127.0.0.204
And then you can read output one line at a time to create new user files as needed, maybe another script with something along the lines of:
while read line
do
echo "this is a user file" > "$line"
done < output
(and adjust it to your needs)
Finally, as long as you understand what's going on in the above, you can skip the output file altogether and just do this all in a one-liner:
awk 'BEGIN{FS=OFS="."} {$4=$4+4; print}' input | while read line; do echo "hello world" > "$line"; done

sed/awk - print text between patterns spanned across multiple lines

I am new to scripting and am trying to learn how to extract any text that exists between two different patterns. However, I still cannot figure out how to do it in the following scenario:
If I have my input file reading:
Hi I would like
to print text
between these
patterns
and my expected output is like:
I would like
to print text
between these
i.e. my first search pattern is "Hi" and I want to skip this pattern itself, but print everything that exists on the same line after the match. My second search pattern is "patterns" and I would like to completely avoid printing this line or any lines beyond it.
I tried the following:
sed -n '/Hi/,/patterns/p' test.txt
[output]
Hi I would like
to print text
between these
patterns
Next, I tried:
`awk ' /'"Hi"'/ {flag=1;next} /'"pattern"'/{flag=0} flag { print }'` test.txt
[output]
to print text
between these
Can someone help me out in identifying how to achieve this?
Thanks in advance
You have the right idea, a mini state machine in awk, but you need some slight mods as per the following transcript:
pax> echo 'Hi I would like
to print text
between these
patterns ' | awk '
/patterns/ { echo = 0 }
/Hi / { gsub("^.*Hi ", "", $0); echo = 1 }
{ if (echo == 1) { print } }'
Or, in compressed form:
awk '/patterns/{e=0}/Hi /{gsub("^.*Hi ","",$0);e=1}{if(e==1){print}}'
The output of that is:
I would like
to print text
between these
as requested.
The way this works is as follows. The echo variable is initially 0 meaning that no echoing will take place.
Each line is checked in turn. If it contains patterns, echoing is disabled.
If it contains Hi followed by a space, echoing is turned on and gsub is used to modify the line to get rid of everything up to the Hi.
Then, regardless, the line (possibly modified) is echoed when the echo flag is on.
Now, there's going to be edge cases such as:
lines containing two occurrences of Hi; or
lines containing something before the patterns.
You haven't specified how they should be handled so I didn't bother, but the basic concept should be the same.
Updated the solution to remove the "patterns" line:
$ sed -n '/^Hi/,/patterns/{s/^Hi //;/^patterns/d;p;}' file
I would like
to print text
between these
This might work for you (GNU sed):
sed '/Hi /!d;s//\n/;s/.*\n//;ta;:a;s/patterns.*$//;tb;$!{n;ba};:b;/^$/d' file
Just set a flag (f) when you find+replace Hi at the start of a line, clear it when you find patterns, then invoke the default print when the flag is set:
$ awk 'sub(/^Hi /,""){f=1} /patterns/{f=0} f' file
I would like
to print text
between these
