I am writing a shell script program in which I am internally calling an awk script. Here is my script below.
for FILE in `eval echo{0..$fileIterator}`
{
if(FILE == $fileIterator)
{
printindicator =1;
}
grep RECORD FILEARRAY[FILE]| awk 'for(i=1;i<=NF;i++) {if($i ~ XXXX) {XARRAY[$i]++}} END {if(printIndicator==1){for(element in XARRAY){print element >> FILE B}}'
I hope I am clear with my code. Please let me know if you need any other details.
ISSUE
My motivation in this program is to traverse through all the files and get the lines that have "XXXX" in all the files and store them in an array. That is what I am doing here. Finally I need to store the contents of the array variable into a file. I could store the contents at each and every step, like the below
{if($i ~ XXXX) {XARRAY[$i]++; print XARRAY[$i] >> FILE B}}
But the reason I am not going with this approach is that each write requires an I/O operation, and the time taken for that is too much; that is why I accumulate everything in memory each time and only at the end dump the in-memory array (XARRAY) into the file.
The problem I am facing here is this: the shell script calls awk on every iteration, and the data gets stored in the array (XARRAY), but on the next iteration the previous contents of XARRAY are deleted and the new contents are stored as if it were a brand-new array. Hence, when I print the contents at the end, it prints only the most recently updated XARRAY and not all the data I expect.
SUGGESTIONS EXPECTED
1) How can I make the awk script realize that XARRAY is the same old array, and not a new one, each time awk is called in the loop?
2) One alternative is to do an I/O operation every time, but I am not interested in that. Is there any other alternative? Thank you.
This post involves combining shell script and awk script to solve a problem. This is very often a useful approach, as it can leverage the strengths of each, and potentially keep the code from getting ugly in either!
You can indeed "preserve state" with awk, with a simple trick: leveraging a coprocess from the shell script (bash, ksh, etc. support coprocesses).
Such a shell script launches one instance of awk as a coprocess. This awk instance runs your awk code, which continuously processes its lines of input, and accumulates stateful information as desired.
The shell script continues on, gathering up data as needed, and passes data to the awk coprocess whenever ready. This can run in a loop, potentially blocking or sleeping, potentially acting as a long-running background daemon. Highly versatile!
In your awk script, you need a strategy for triggering the output of the stateful data it has been accumulating. The simplest is to have an END{} action, which triggers when awk's stdin closes. If you need output data sooner than that, the awk code has a chance to output its data at each line of input.
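A minimal sketch of the idea, assuming bash 4's coproc; the names here (AWKPROC, RECORD, XXXX, the file names) are placeholders, not anything from the original script:

#!/bin/bash
# Hypothetical sketch: one long-lived awk instance accumulates counts in
# xarray until its stdin closes, then its END block prints everything.
coproc AWKPROC {
    awk '{ for (i = 1; i <= NF; i++) if ($i ~ /XXXX/) xarray[$i]++ }
         END { for (element in xarray) print element, xarray[element] }'
}
awk_in=${AWKPROC[1]}    # write end: data we send to awk
awk_out=${AWKPROC[0]}   # read end: awk's eventual output

# The shell keeps looping, feeding the same awk process on each iteration.
for FILE in file0.txt file1.txt file2.txt; do
    grep RECORD "$FILE" >&$awk_in
done

# Closing awk's stdin fires its END block; then collect the accumulated result.
exec {awk_in}>&-
cat <&$awk_out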
I have successfully used this approach many times.
Ouch, can't tell if it is meant to be real or pseudocode!
You can't make awk preserve state. You would either have to save it to a temporary file or store it in a shell variable, the contents of which you'd pass to later invocations. But this is all too much hassle for what I understand you want to achieve.
I suggest you omit the loop, which will allow you to call awk only once with just some reordering. I assume FILE A is the FILE in the loop and FILE B is something external. The reordering would end up something very roughly like:
grep RECORD "${FILEARRAY[@]:0:$fileIterator}" | awk '{for(i=1;i<=NF;i++) {if($i ~ /XXXX/) {XARRAY[$i]++}}} END {for(element in XARRAY){print element >> "FILEB"}}'
I moved the filename expansion to the grep call and removed the whole printIndicator check.
It could all be done even more efficiently (the obvious one being removal of grep), but you provided too little detail to make early optimisation sensible.
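Just to illustrate what removing grep would look like (a rough, untested sketch; RECORD, XXXX and FILEB are placeholders as before):

awk '/RECORD/ {for(i=1;i<=NF;i++) if($i ~ /XXXX/) XARRAY[$i]++}
     END {for(element in XARRAY) print element >> "FILEB"}' "${FILEARRAY[@]:0:$fileIterator}"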
EDIT: fixed the loop iteration with the info from the update. Here's a loopy solution, which is immune to new whitespace issues and too long command lines:
for FILE in $(seq 0 $fileIterator); do
grep RECORD "${FILEARRAY[$FILE]}"
done |
awk '{for(i=1;i<=NF;i++) {if($i ~ /XXXX/) {XARRAY[$i]++}}} END {for(element in XARRAY){print element >> "FILEB"}}'
It still runs awk only once, constantly feeding it data from the loop.
If you want to load the results into an array UGUGU, do the following as well (requires bash 4):
mapfile UGUGU < FILEB
results=( $(for loop | awk '{for(element in XARRAY) print element}') )
I declared results as an array, so for every "element" that is printed it should be stored in results[1], results[2].
But instead of this, it is performing as below.
Let's assume
element = "I am fine" (first iteration of the for loop),
element = "How are you" (second iteration of the for loop).
My expected result in accordance with this is
results[1] = "I am fine" and results[2] = "How are you",
but the output I am getting is results[1]="I" and results[2]="am". I don't know why it is splitting on spaces. Any suggestions regarding this?
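The splitting comes from the unquoted command substitution in the array assignment: the output is split on whitespace, not on newlines. A small hedged illustration, using printf to stand in for the real for-loop-and-awk pipeline (note that bash arrays are 0-indexed, so the first line lands in results[0]):

results=( $(printf '%s\n' "I am fine" "How are you") )
echo "${#results[@]}"     # 6 -- every word became its own element

mapfile -t results < <(printf '%s\n' "I am fine" "How are you")
echo "${#results[@]}"     # 2 -- one element per line (requires bash 4)
echo "${results[0]}"      # I am fine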
Related
I have a school project that gives me several lines of text in a file like this:
team1-team2:2-1
team3-team1:2-2
etc
It wants me to determine which team won (or drew) and then make a league table, awarding points for wins/draws.
This is my first time using bash. What I did was save the team1/team2 names in a variable and then do the same for the goals. How should I make the table? I managed to make my script create a new file that saves all the team names (and checks for duplicates), but I don't know how to continue. Should I make an array for each team, saving their results in it? And then how do I implement the rankings, for example:
team1 3p
team2 1p
etc.
I'm not asking for actual code, just a guide as to how I should implement it. Is making a new file the right move? Should I try making a new array with the teams instead? Or something else?
The problem can be divided into 3 parts:
Read the input data into memory in a format that can be manipulated easily.
Manipulate the data in memory
Output the results in the desired format.
When reading the data into memory, you might decide to read all the data in one go before manipulating it. Or you might decide to read the input data one line at a time and manipulate each line as it is read. When using shell scripting languages, like bash, the second option usually results in simpler code.
The most important decision to make here is how you want to structure the data in memory. You normally want to avoid duplication of data, and you usually want a data structure that is easy to transform into your desired output. In this case, the most logical data structure is an associative array, using the team name as the key.
Assuming that you have to use bash, here is a framework for you to build upon:
#!/bin/bash
declare -A results
while IFS=':-' read team1 team2 score1 score2; do
if [ ${score1} -gt ${score2} ]; then
((results[${team1}]+=2))
elif [ ...next test... ]; then
...
else
...
fi
done < scores.txt
# Now you have an associative array containing the points for each team.
# You can either output it as it stands, or sort it by piping through the
# 'sort' command.
for key in "${!results[@]}"; do
echo ...
done
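Purely as an illustration of the "pipe through sort" idea from the comment above (assuming the points were accumulated as sketched), the output step could look something like:

for key in "${!results[@]}"; do
    printf '%s %d\n' "$key" "${results[$key]}"
done | sort -k2,2nr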
I would use awk for this
AWK is an interpreted programming language (AWK stands for Aho, Weinberger, Kernighan) designed for text processing and typically used as a data extraction and reporting tool. AWK is used largely with Unix systems.
Using pure bash scripting is often messy for this kind of job.
Let me show you how easy it can be using awk
Input file : scores.txt
team1-team2:2-1
team3-team1:2-2
Code :
awk -F'[:-]' ' # set delimiters to ':' or '-'
{
if($3>$4){teams[$1] += 3} # first team gets 3 points
else if ($3<$4){teams[$2] += 3} # second team gets 3 points
else {teams[$1]+=1; teams[$2]+=1} # both teams get 1 point
}
END{ # after scanning input file
for(team in teams){
print(team OFS teams[team]) # print total points per team
}
}' scores.txt | sort -rnk 2 > ranking.txt # sort by nb of points
Output (ranking.txt):
team1 4
team3 1
I am trying to work with some oddly created 'dumps' of some tables in postgres. Due to the tables containing specific data I will have to refrain from posting the exact information but I can give an example.
To give a bit more information, someone thought that this exact command was a good way to back up a table.
echo 'select * from test1'|psql > test1.date.txt
However, in this one example that gives a lot of information that no one needs. To make things even more fun, the person saw fit to remove the | that is normally seen with the data.
So what I end up with is something like this.
rowid test1
-------+----------------------
1 hi
2 no
(2 rows)
Also of note, for this customer there are multiple tables here. My thought was to use some simple python to figure out where in each line the + was, mark those points, and then apply those points to each line throughout the file.
I was able to make this work for one set of files, but for some reason the next set of files just doesn't work. What happens instead is that on most lines a pipe gets thrown into the middle of the data.
Maybe there is something I am missing here, but does anyone see an easy way to put something like the above back into a normally delimited file that I could then just load into the database?
Any python or bash related suggestions would also work in this case. Thank you.
As mentioned above, without a real example of where the '|' are that are causing problems, or a real example of where you are having problems, it is hard to know whether we are addressing your actual issue. That said, your two primary swiss-army knives for text processing are sed and awk. If you have data similar to your example, with pipes between data fields you need to discard, then awk provides a fairly easy solution.
Take for example your short example and add a pipe in the middle that needs to be discarded, e.g.
$ cat dat/pgsql2.txt
rowid test1
-------+----------------------
1 | hi
2 | no
To process the file in awk discarding the '|' and outputting the remaining records in comma-separated-value format, you could do something like the following:
awk '{
    if (NR > 2) {
        for (i = 1; i <= NF; i++) {
            if ($i != "|") {
                if (i == 1)
                    printf "%s", $i
                else
                    printf ",%s", $i
            }
        }
        printf "\n"
    }
}' inputfile
This simply reads from inputfile (last line) and, for each record after the first two (NR > 2, to skip the heading), loops over the fields (NF, 3 in this case). If field $i is not "|" it is output: the first field without a comma, every other field with a preceding comma. A newline is printed after each record.
Example Output
1,hi
2,no
awk is a bit awkward at first, but as far as text processing goes, there isn't much that will top it.
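For what it's worth, the column-position idea mentioned in the question could also be sketched in awk instead of python. This is a hypothetical, untested sketch: it assumes the removed pipes were replaced by spaces (so the data still lines up with the -------+------- separator on the second line), it skips the header and the "(N rows)" footer, and it does no whitespace trimming:

awk '
NR == 2 {                               # the -------+------- separator line
    n = split($0, parts, "[+]")
    pos = 0
    for (i = 1; i < n; i++) {           # remember the column of each "+"
        pos += length(parts[i]) + 1
        cut[i] = pos
    }
    ncuts = n - 1
    next
}
NR > 2 && !/^\([0-9]+ rows\)/ {         # data lines; skip the "(N rows)" footer
    prev = 1
    out = ""
    for (i = 1; i <= ncuts; i++) {      # re-insert a delimiter at each "+" column
        out = out substr($0, prev, cut[i] - prev) "|"
        prev = cut[i] + 1
    }
    print out substr($0, prev)
}' test1.date.txt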
After trying multiple methods the only way I could make this work sadly was to just use the import feature for Excel and then play with that to get the columns I needed.
I need to provide a listing of a website's pages. The only thing to change per line is the page number at the end of the line. So for example, I need to take:
mywebsite.com/things/stuff/?q=content&page=1
And from that generate a sequential listing of pages:
mywebsite.com/things/stuff/?q=content&page=1
mywebsite.com/things/stuff/?q=content&page=2
mywebsite.com/things/stuff/?q=content&page=3
I need to list all pages between 1 - 120.
I have been using bash but any shell that gets the job done is fine. I don't have any code to show because I simply just don't know how to begin. It sounds simple enough but so far I'm completely at a loss as to how I can accomplish this.
With GNU bash 4:
printf '%s\n' 'mywebsite.com/things/stuff/?q=content&page='{1..120}
You can simply use:
for i in $(seq 120); do echo 'mywebsite.com/things/stuff/?q=content&page='"$i"; done > list.txt
I am currently working on a script that processes and combines several different files, and for one part it is necessary that I find the difference between two times in order to determine a "total" amount of time that someone has worked. The times themselves are in the following format:
34:18:00,40:26:00,06:08:00
with the first one being the start time, the second the end time, and the third the total time. Although this one is displayed correctly, there are some entries that need to be double-checked and corrected (the total time is not correct based on the start/end time). I have found several different solutions in other posts, but most of them also include dates (and most of them use awk); I am not experienced with awk, so I am not sure how to go about removing the date portion from those examples. I have also heard that I could convert the times to Unix epoch time, but I was curious if there were any other ways to accomplish this. Thanks!
Something like this might help you:
#!/bin/bash
time2seconds() {
a=( ${1//:/ } )
echo $((10#${a[0]}*3600+10#${a[1]}*60+10#${a[2]})) # force base 10 so values like "08" are not read as octal
}
seconds2time() {
printf "%.2d:%.2d:%.2d" $(($1/3600)) $((($1/60)%60)) $(($1%60))
}
IFS=, read start stop difference <<< "34:18:00,40:26:00,06:08:00"
printf "Start=%s\n" "$start"
printf "Stop=%s\n" "$stop"
printf "Difference=%s (given in file: %s)\n" $(seconds2time $(($(time2seconds $stop)-$(time2seconds $start)))) "$difference"
Output is:
Start=34:18:00
Stop=40:26:00
Difference=06:08:00 (given in file: 06:08:00)
Note: there's nothing that checks if the times are in a valid format; I don't know how reliable your data are.
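For example, a hedged sanity check (assuming the hour field may exceed 24 but minutes and seconds must be 00-59) could look like:

if [[ ! $start =~ ^[0-9]+:[0-5][0-9]:[0-5][0-9]$ ]]; then
    echo "unexpected time format: $start" >&2
fi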
I'm writing a perl script that reads a file into an array. I wrote the program on Windows, using Perl 5.16 (it also works on 5.14), and the script failed using a Mac with Perl 5.12.
The part that failed is this: my @array = <$file>. On the Mac, the array came back the correct size (same as number of lines in the file), but every element except the last one was empty. The code worked correctly when I switched to this instead:
my @array;
while(<$file>){
push @array, $_;
}
I'm not sure if it would have made a difference if I switched the line endings to be LF instead of CRLF (Windows style). Though the problem is fixed, it leaves me puzzled. I thought those two code snippets I listed were exactly the same thing. What is the difference in them that produces different results here?
The answer is that the two methods are exactly equivalent, as you suspected. Example:
my $start = tell DATA; #store beginning filehandle position
my @array1 = <DATA>;
seek DATA,$start,0; #reset filehandle position
my @array2;
while(<DATA>){
push @array2,$_;
}
print "List assignment:\n @array1\n";
print "Looping through:\n @array2\n";
__DATA__
1
2
foo
bar
Your previous failure was likely something else. Perhaps some sort of problem with Perl on Mac or Mac's file IO was involved, but more likely it was some other part of your code (by this I mean nothing personal: I would make the same assumption about my own code).