Bash script working with second column from txt but keep first column in result as relevant - bash

I am trying to write a bash script to ease a process of IP information gathering.
Right now I have a script which runs through one column of IP addresses in multiple files, looks up geo and host information and stores it in a new file.
What would also be nice is a script that generates a result from files with 3 columns - date, time, IP address. The separator is a space.
I tried this and that, but no luck. I am a total newbie :)
This is my original script:
#!/usr/bin/env bash
find *.txt -print0 | while read -d $'\0' file
do
    for i in $(cat "$file")
    do
        echo -e "$i,"$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" $i | cut -d' ' -f6,8-9)" "$(nslookup $i | grep name | awk '{print $4}')"" >> "res/res-"$file".txt"
    done
done
Input file example
2014-03-06 12:13:27 213.102.145.172
2014-03-06 12:18:24 83.177.253.118
2014-03-25 15:42:01 213.102.155.173
2014-03-25 15:55:47 213.101.185.223
2014-03-26 15:21:43 90.130.182.2
Can you please help me on this?

It's not entirely clear what the current code is attempting to do, but here is a hopefully useful refactoring which could be at least a starting point.
#!/usr/bin/env bash
find *.txt -print0 | while read -d $'\0' file
do
    while read date time ip; do
        geo=$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" "$ip" |
            cut -d' ' -f6,8-9)
        addr=$(nslookup "$ip" | awk '/name/ {print $4}')
        #addr=$(dig +short -x "$ip")
        echo "$ip $geo $addr"
    done <"$file" >"res/res-$file.txt"
done
My copy of nslookup does not output four fields, but I assume that part of your script is correct. The output from dig +short is better suited for machine processing, so maybe switch to that instead. Perhaps geoiplookup also offers an option to output machine-readable results, or maybe there is an alternative interface which does.
I assume it was a mistake that your script would output partially comma-separated, partially whitespace-separated results, so I changed that, too. Maybe you should use CSV or JSON instead if you intend for other tools to be able to read this output.
Trying to generate a file named res/res-$file.txt will only work if file is not in any subdirectory, so I'm guessing you will want to fix that with basename; or perhaps the find loop should be replaced with a simple for file in *.txt instead.
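Putting those suggestions together, here is a hedged sketch (untested) that keeps the find loop but names the output with basename, uses dig +short for the reverse lookup, and carries the date and time columns through to the result, which is what the question's title seems to ask for:
#!/usr/bin/env bash
# Sketch only: paths and the GeoIP database location are taken from the
# question; adjust as needed.
find . -name '*.txt' -print0 | while read -r -d '' file; do
    out="res/res-$(basename "$file" .txt).txt"
    while read -r date time ip; do
        geo=$(geoiplookup -f /usr/share/GeoIP/GeoLiteCity.dat "$ip" |
            cut -d' ' -f6,8-9)
        addr=$(dig +short -x "$ip")
        echo "$date $time $ip $geo $addr"
    done <"$file" >"$out"
done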

Related

How to parse the output of `ls -l` into multiple variables in bash?

There are a few answers on this topic already, but pretty much all of them say that it's bad to parse the output of ls -l, and therefore suggest other methods.
However, I'm using ncftpls -l, and so I can't use things like shell globs or find – I think I have a genuine need to actually parse the ls -l output. Don't worry if you're not familiar with ncftpls, the output returns in exactly the same format as if you were just using ls -l.
There is a list of files at a public remote ftp directory, and I don't want to burden the remote server by re-downloading each of the desired files every time my cronjob fires. I want to check, for each one of a subset of files within the ftp directory, whether the file exists locally; if not, download it.
That's easy enough, I just use
tdy=`date -u '+%Y%m%d'`_
# Today's files
for i in $(ncftpls 'ftp://theftpserver/path/to/files' | grep ${tdy}); do
    if [ ! -f $i ]; then
        ncftpget "ftp://theftpserver/path/to/files/${i}"
    fi
done
But I came upon the issue that sometimes the cron job will download a file that hasn't finished uploading, and so when it fires next, it skips the partially downloaded file.
So I wanted to add a check to make sure that for each file that I already have, the local file size matches the size of the same file on the remote server.
I was thinking along the lines of parsing the output of ncftpls -l and using awk, something like
for i in $(ncftpls -l 'ftp://theftpserver/path/to/files' | awk '{print $9, $5}'); do
    ...
    x=filesize # somehow get the file size and the filename
    y=filename # from $i on each iteration and store in variables
    ...
done
but I can't seem to get both the filename and the filesize from the server into local variables on the same iteration of the loop; $i alternates between $9 and $5 in the awk string with each iteration.
If I could manage to get the filename and filesize into separate variables with each iteration, I could simply use stat -c "%s" $i to get the local size and compare it with the remote size. Then it's a simple ncftpget on each remote file that I don't already have. I tinkered with syncing programs like lftp too, but didn't have much luck and would rather do it this way.
Any help is appreciated!
A for loop splits on any whitespace it sees (space, tab, or newline), so IFS needs to be set before the loop (there are a lot of existing questions about this):
IFS=$'\n' && for i in $(ncftpls -l 'ftp://theftpserver/path/to/files' | awk '{print $9, $5}'); do
    echo $i | awk '{print $NF}'   # filesize
    echo $i | awk '{NF--; print}' # filename
    # you may have spaces in filenames, so it is better to keep the size in the last column for awk
done
The better way, I think, is to use while instead of for, so:
ls -l | while read i
do
    echo $i | awk '{print $9, $5}'
    # split them if you want
    x=$(echo $i | awk '{print $5}')
    y=$(echo $i | awk '{print $9}')
done
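Applied to the original ncftpls problem, a hedged sketch (assuming the listing uses the usual nine ls -l columns and that GNU stat is available for -c %s) reads the size and name straight into variables and re-fetches anything missing or truncated:
ncftpls -l 'ftp://theftpserver/path/to/files' |
while read -r _ _ _ _ size _ _ _ name; do
    [ -z "$name" ] && continue   # skip blank lines and the "total" line
    if [ ! -f "$name" ] || [ "$(stat -c %s "$name")" -ne "$size" ]; then
        ncftpget "ftp://theftpserver/path/to/files/$name"
    fi
done
Because name is the last variable, read assigns it the rest of the line, so filenames containing spaces still come through intact.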

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of URLs to aggregate the tabular data on them from a set of 7000+ URLs. The URLs are in a .txt file. My goal was to cURL each line and save them to a local folder after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file, duplicates exist (example.com/State/City.html). When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of the URL and pipe it to sed to remove the / and substitute a hyphen to use with curl -O.
Here's a sample of what I tried:
while read line
do
    FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
    curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I think you don't need multiple seds and a grep; just one sed should suffice:
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
    FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
    if [[ ! -f "$FN" ]]
    then
        touch "$FN"
        echo "$FN"
    fi
done
This script should work and also takes care of not downloading the same file multiple times; just replace the touch command with your curl one.
First: you didn't pass the url info to grep.
Second: try this line instead:
FILENAME=$(echo $line | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
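For completeness, a hedged sketch of the whole corrected loop: $line is echoed into the extraction, $FILENAME is in double quotes so it actually expands, and an extra -f test skips duplicates that were already fetched:
while read -r line; do
    FILENAME=$(echo "$line" | grep -o -E '/[^/]+/[^/]+\.html?' |
        sed 's|^/||; s|/|-|')
    [ -f "$FILENAME" ] && continue   # duplicate URL, already downloaded
    curl "$line" -o "$FILENAME"
done < source-url-file.txt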

How Do I Convert A Cut Command In Bash Into Grep With Given Code?

I've written a template engine script that uses cut to extract certain elements from a file, but I want to use grep in place of the cut. Here is the code I have written:
#!/bin/bash
IFS=# # makes # a delimiter
while read line
do
    dataspace=`echo ${line} | cut -d'=' -f1`
    value=`echo ${line} | cut -d"=" -f2`
    printf -v $dataspace "$value" # make the value stored in value into the name of a dataspace
done < 'template.vars' # read template.vars for standard input
skipflag=false # initialize the skipflag to false
while read line # while it is reading standard input one line at a time
Just came to the conclusion that the code block system here does not support bash.
Anyway, since Stack Overflow isn't letting me put Bash into code blocks, I am not posting the entire script since it would look nasty. Based on what is currently highlighted, how would I go about changing the part using the cut command into a line using the grep command?
As has been noted, you should give more information for a better answer. Going with what you have, I would say that awk is a better option than grep:
dataspace=$(awk '$0=$1' FS== <<< "$line")
value=$(awk '$0=$2' FS== <<< "$line")
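If the goal is simply to split each line on the first '=', a hedged pure-bash alternative (no external command at all) is parameter expansion:
dataspace=${line%%=*}   # everything before the first '='
value=${line#*=}        # everything after the first '='
That also avoids forking two processes per input line, which adds up quickly inside a read loop.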

Faster grep in many files from several strings in a file

I have the following working script that greps, in a directory of many files, for some specific strings previously saved into a file.
I use the file extension to grep all the files, since their names are random; note that every string from my previously saved file should be searched for in all the files.
Also, I cut the grep output, as it returns 2 or 3 lines per matched file and I only want the specific part that shows the filename.
I might be using something redundant; how could it be faster?
#!/bin/bash
# working but slow
cd /var/FILES_DIRECTORY
while read line
do
    LC_ALL=C fgrep "$line" *.cps | cut -c1-27 >> /var/tmp/test_OUT.txt
done < "/var/tmp/test_STRINGS.txt"
grep -F -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27
Isn't this what you're looking for?
This should speed up your script:
#!/bin/bash
#working fast
cd /var/FILES_DIRECTORY
export LC_ALL=C
grep -f /var/tmp/test_STRINGS.txt *.cps | cut -c1-27 > /var/tmp/test_OUT.txt

Is there any write buffer in bash programming?

Is there any write-to-file buffer in bash programming? And if there is, is it possible to change its size?
Here is the problem
I have a bash script which reads a file line by line, then manipulates the data it read, and then writes the result to another file. Something like this:
while read line
    some grep, cut and sed
    echo and append to another file
The input data is really huge (nearly a 20GB text file). The process is slow, so the question arises: if the default behavior of bash is to write the result to the output file for each line read, then the process will be slow.
So I want to know: is there any mechanism to buffer some of the output and then write that chunk to the file? I searched the internet about this issue but didn't find any useful information...
Is this an OS-related question or a bash one? The OS is CentOS release 6.
The script is:
#!/bin/bash
BENCH=$1
grep "CPU 0" $BENCH > `pwd`/$BENCH.cpu0
grep -oP '(?<=<[vp]:0x)[0-9a-z]+' `pwd`/$BENCH.cpu0 | sed 'N;s/\n/ /' | tr '[:lower:]' '[:upper:]' > `pwd`/$BENCH.cpu0.data.VP
echo "grep done"
while read line ; do
    w1=`echo $line | cut -d ' ' -f1`
    w11=`echo "ibase=16; $w1" | bc`
    w2=`echo $line | cut -d ' ' -f2`
    w22=`echo "ibase=16; $w2" | bc`
    echo $w11 $w22 >> `pwd`/$BENCH.cpu0.data.VP.decimal
done <"`pwd`/$BENCH.cpu0.data.VP"
echo "convertion done"
Each echo-and-append in your loop opens and closes the output file, which may have a negative impact on performance.
A likely better approach (and you should profile) is simply:
grep 'foo' <$input_file | sed 's/bar/baz/' | [any other stream operations] >$output_file
If you must keep the existing structure then an alternative approach would be to create a named pipe:
mkfifo buffer
Then create 2 processes: one which writes into the pipe, and one which reads from the pipe.
#proc1
while read -r line; do
    echo "$line" | grep foo | sed 's/bar/baz/'
done <"$input_file" >buffer

#proc2
while read -r line; do
    echo "$line" >>"$output_file"
done <buffer
In reality I would expect the bottleneck to be entirely file IO, but this does create an independence between the reading and writing, which may be desirable.
If you have 20GB of RAM lying around, it may improve performance to use a memory mapped temporary file instead of a named pipe.
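For concreteness, a hedged sketch of wiring the two halves to the fifo (reusing the placeholder names $input_file and $output_file from above): the transforming side writes into the pipe in the background while the other side drains it into the output file.
mkfifo buffer
# transformer: writes into the pipe in the background
grep foo <"$input_file" | sed 's/bar/baz/' >buffer &
# writer: drains the pipe into the output file
cat buffer >>"$output_file"
wait
rm buffer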
Just to see what the differences were, I created a file containing a bunch of lines like
a somewhat long string followed by a number: 0000001
with 10,000 lines in total (about 50MiB), and then ran it through a shell read loop
while read line ; do
    echo $line | grep '00$' | cut -d " " -f9 | sed 's/^00*//'
done < data > data.out
Which took almost 6 minutes. Compared with the equivalent
grep '00$' data | cut -d " " -f9 | sed 's/^00*//' > data.fast
which took 0.2 seconds. To remove the cost of the forking, I tested
while read line ; do
    :
done < data > data.null
where : is a shell built-in which does nothing at all. As expected data.null had no contents and the loop still took 21 seconds to run through my small file. I wanted to test against a 20GB input file, but I'm not that patient.
Conclusion: learn how to use awk or perl because you will wait forever if you try to use the script you posted while I was writing this.
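To make that concrete for the script in the question, a hedged sketch (assuming GNU awk, whose strtonum understands a 0x prefix) converts both hex columns in a single pass, with no per-line forks of echo, cut or bc:
awk '{ printf "%d %d\n", strtonum("0x" $1), strtonum("0x" $2) }' \
    "$BENCH.cpu0.data.VP" > "$BENCH.cpu0.data.VP.decimal"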
