Serialize data in file - bash

I need to add a portion to the beginning of every line in a file.
if my data is
Wombat
muskrat
gobbledegook
...
tomatopaste
I need to make it
000000001 Wombat
000000002 muskrat
000000003 gobbledegook
...
000009982 tomatopaste
to work with a specific machine. I can do it with Excel, but a command-line solution would be preferred so I can integrate it into a script. Any suggestions? Also, this is a smaller file; I would also need it to work with files upwards of 1 million lines, which Excel handles extremely poorly at that point (hit go and go to lunch poorly).

If you need the numbers zero-padded as you have them, awk is probably the best solution:
awk '{printf "%09d ", NR}1' file.txt
Awk is a series of condition{action} pairs. For each line in the input file, the conditions are evaluated in order and if any is true, the corresponding action is evaluated. One or the other may be omitted: if a condition is omitted, the action is performed for every line; if an action is omitted, the current line is printed.
Our first pair here is {printf...}, which is just an action; since there is no condition, it is evaluated for every line. printf is a pretty universal function, printing the subsequent arguments according to the format string passed. We pass %09d here, which says the argument should be an integer (d), printed zero-padded (0) to nine digits (9); % starts a format specifier. The argument given, NR, is the Number of Records, i.e. the number of lines read so far from the input. Finally, we have 1, which is the next condition, always true. As mentioned earlier, lacking an explicit action, it prints the current line.
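To make the two condition{action} pairs explicit, the one-liner is equivalent to the following sketch, which spells out the implied print (same file.txt as above):
awk '{printf "%09d ", NR} {print $0}' file.txt    # 000000001 Wombat, 000000002 muskrat, ...
Because the printf format string has no trailing newline, the zero-padded number and the original line end up on the same output line.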

Bash-only solution:
( while read -r; do printf "%09u %s\n" $((++i)) "$REPLY" ; done ) < file.txt
Edit: The above relies on the fact that i is either uninitialized or has a value of 0. This is a bit safer (and obviates the need for a () subshell):
i=0; while read -r; do printf "%09u %s\n" $((++i)) "$REPLY" ; done < file.txt

Related

How to sequentially copy multiple rows (e.g. 1-10, then 11-20, and so on) from one big file into multiple other files using a FOR loop (maybe sed or awk)?

I'm new to Bash and Linux. I'm trying to make it happen in Bash. So far my code looks something like this:
The problem is with the copying using awk or sed, I suppose, as I am somehow writing the sum of a constant and a variable incorrectly. Brackets or different quotes do not make any difference.
for i in {0..10}
do
    touch /path/file$i
    awk -v "NR>=1+$i*9 && NR<=$i*9" /path/BigFile > /path/file$i
    (or with sed) sed -n "1+$i*9,$i*9" /path/BigFile > /path/file$i
done
Thank you in advance
Instead of reinventing this wheel, you can use the split utility. split -l 10 will tell it to split into chunks of 10 lines (or maybe less for the last one), and there are some options you can use to control the output filenames -- you probably want -d to get numeric suffixes (the default is alphabetical).
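For example, a minimal sketch (assuming GNU split for the -d option, and a hypothetical output prefix of /path/file):
split -l 10 -d /path/BigFile /path/file    # writes /path/file00, /path/file01, ...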
hobbs' answer (use split) is the right way to solve this. Aside from being simpler, it's also linear (meaning that if you double the input size, it only takes twice as long) while these for loops are quadratic (doubling the input quadruples the time it takes), plus the loop requires that you know in advance how many lines there are in the file. But for completeness let me explain what went wrong in the original attempts.
The primary problem is that the math is wrong. NR>=1+$i*9 && NR<=$i*9 will never be true. For example, in the first iteration, $i is 0, so this is equivalent to NR>=1 && NR<=0, which requires that the record number be at least 1 but no more than 0. Similarly, when $i is 1, it becomes NR>=10 && NR<=9... same basic problem. What you want is something like NR>=1+$i*9 && NR<=($i+1)*9, which matches for lines 1-9, then 10-18, etc.
The second problem is that you're using awk's -v option without supplying a variable name & value. Either remove the -v, or (the cleaner option) use -v to convert the shell variable i into an awk variable (and then put the awk program in single-quotes and don't use $ to get the variable's value -- that's how you get shell variables' values, not awk variables). Something like this:
awk -v i="$i" 'NR>=1+i*9 && NR<=(i+1)*9' /path/BigFile > "/path/file$i"
(Note that I double-quoted everything involving a shell variable reference -- not strictly necessary here, but a general good scripting habit.)
The sed version also has a couple of problems of its own. First, unlike awk, sed doesn't do math; if you want to use an expression as a line number, you need to have the shell do the math with $(( )) and pass the result to sed as a simple number. Second, your sed command specifies a line range, but doesn't say what to do with it; you need to add a p command to print those lines. Something like this:
sed -n "$((1+i*9)),$(((i+1)*9)) p" /path/BigFile > "/path/file$i"
Or, equivalently, omit -n and tell it to delete all but those lines:
sed "$((1+i*9)),$(((i+1)*9)) ! d" /path/BigFile > "/path/file$i"
But again, don't actually do any of these; use split instead.

I want to compare one line to the next line, but only in the third column, from a file using bash

So, what I'm trying to do is read in a file, loop through it comparing it line by line, but only in the third column. Sorry if this doesn't make sense, but maybe this will help. I have a file of names:
JOHN SMITH SMITH
JIM JOHNSON JOHNSON
JIM SMITH SMITH
I want to see if the first SMITH (col 3) is equal to JOHNSON; if not, move on to the next name. If the first SMITH (col 3) is equal to the second SMITH (col 3), then I'll do something with that.
Again, I'm sorry if this doesn't make much sense, but I tried to explain it as best as I could.
I was attempting to see if they were equal, but obviously that didn't work. Here is what I have so far, but I got stuck:
while read -a line
do
    if [ ${line[2]} == ${line[2]} ]
    then
        echo -e "${line[2]}" >> names5.txt
    else
        echo "Not equal."
    fi
done < names4.txt
Store your immediately prior line in a separate variable, so you can compare against it:
#!/usr/bin/env bash
old_line=( )
while read -r -a line
do
    if [ "${line[2]}" = "${old_line[2]}" ]; then
        printf '%s\n' "${line[2]}"
    else
        echo "Not equal." >&2
    fi
    old_line=( "${line[@]}" )
done <names4.txt >>names5.txt
Some other changes of note:
Instead of re-opening names5.txt every time you want to write a single line to it, we're opening it just once, for the whole loop. (You could make this >names5.txt if you want to clear it at the top of the loop and append from there, which is likely to be desirable behavior).
We're avoiding echo -e. See the APPLICATION USE and RATIONALE sections of the POSIX standard for echo for background on why echo use is not recommended for new development when contents are not tightly constrained (known not to contain any backslashes, for example).
We're quoting both sides of the test operation. This is mandatory with [ ] to ensure correct operation if words can be expanded as globs (i.e. if you have a word *, you don't want it replaced with a list of files in your current directory in the final command) or if they can contain spaces (not so much a concern here, since you're using the same IFS value for the read -a as for the unquoted expansion). Even if using [[ ]], you want to quote the right-hand side so it's treated as a literal string and not a pattern.
We're passing -r to read, which ensures that backslashes are not silently removed (changing \t in the input to just t, for example).
When you want to compare each third field with all previous third fields, you need to store the old third fields in an array. You can use awk for this.
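For instance, a minimal awk sketch of that idea (not from the original post; it assumes the whitespace-separated columns of names4.txt):
awk 'seen[$3]++' names4.txt    # prints every line whose third field already appeared on an earlier line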
When you only want to see the repeated third fields, you can use other tools:
cut -d" " -f3 names4.txt | sort | uniq -d
EDIT:
When you only want to print duplicates from 2 consecutive lines, it is even easier:
cut -d" " -f3 names4.txt | uniq -d

How to read a space in bash - read will not

Many people have shown how to keep spaces when reading a line in bash. But I have a character-based algorithm which needs to process each and every character separately, spaces included. Unfortunately I am unable to get bash's read to read a single space character from input.
while read -r -n 1 c; do
    printf "[%c]" "$c"
done <<< "mark spitz"
printf "[ ]\n"
yields
[m][a][r][k][][s][p][i][t][z][][ ]
I've hacked my way around this, but it would be nice to figure out how to read any single character.
Yep, tried setting IFS, etc.
Just set the input field separator(a) so that it doesn't treat space (or any character) as a delimiter; that works just fine:
printf 'mark spitz' | while IFS="" read -r -n 1 c; do
    printf "[%c]" "$c"
done
echo
That gives you:
[m][a][r][k][ ][s][p][i][t][z]
You'll notice I've also slightly changed how you're getting the input there; <<< appears to provide an extraneous character at the end and, while it's not important to the input method itself, I thought it best to change that to avoid any confusion.
(a) Yes, I'm aware that you said you've tried setting IFS but, since you didn't actually show how you'd tried this, and it appears to work fine the way I do it, I have to assume you may have just done something wrong.
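As an aside, the "extraneous character" a here-string provides is the trailing newline that <<< appends to its argument; a quick way to see it (not part of the original answer) is to dump the bytes both ways:
printf 'mark spitz' | od -c    # ends with ... t z
od -c <<< 'mark spitz'         # same bytes plus a final \n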

Manipulating data text file with bash command?

I was given this text file, called stock.txt; the content of the text file is:
pepsi;drinks;3
fries;snacks;6
apple;fruits;9
baron;drinks;7
orange;fruits;2
chips;snacks;8
I will need to use a bash script to come up with this output:
Total amount for drinks: 10
Total amount for snacks: 14
Total amount for fruits: 11
Total of everything: 35
My gut tells me I will need to use sed, group, grep and something else.
Where should I start?
I would break the exercise down into steps
Step 1: Read the file one line at a time
while read -r line
do
    # do something with $line
done
Step 2: Pattern match (drinks, snacks, fruits) and do some simple arithmetic. This step requires that you tokenize each line, which I'll leave as an exercise for you to figure out.
if [[ "$line" =~ "drinks" ]]
then
echo "matched drinks"
.
.
.
fi
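For the tokenizing step, one minimal sketch (assuming the ;-separated layout of stock.txt shown in the question) is to let read do the splitting:
while IFS=';' read -r name category amount
do
    # $name, $category and $amount now hold the three fields of the current line
    if [[ "$category" == "drinks" ]]
    then
        echo "matched drinks ($amount)"
    fi
done < stock.txt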
Pure Bash. A nice application for an associative array:
declare -A category    # associative array
IFS=';'
while read name cate price ; do
    ((category[$cate]+=price))
done < stock.txt
sum=0
for cate in "${!category[@]}"; do    # loop over the indices
    printf "Total amount of %s: %d\n" "$cate" "${category[$cate]}"
    ((sum+=${category[$cate]}))
done
printf "Total amount of everything: %d\n" "$sum"
There is a short description of processing comma-separated files in bash here:
http://www.cyberciti.biz/faq/unix-linux-bash-read-comma-separated-cvsfile/
You could do something similar. Just change IFS from comma to semicolon.
Oh yeah, and a general hint for learning bash: man is your friend. Use this command to see manual pages for all (or most) commands and utilities.
Example: man read shows the manual page for the read command. On most systems it will be opened in less, so you exit the manual by pressing q (it may sound funny, but it took me a while to figure that out).
The easy way to do this is using a hash table, which is supported directly by bash 4.x and of course can be found in awk and perl. If you don't have a hash table then you need to loop twice: once to collect the unique values of the second column, once to total.
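If you do take the awk route, the whole exercise collapses into a single hash-table pass; here is a sketch (not the OP's code, assuming the stock.txt layout from the question):
awk -F';' '{ sum[$2] += $3; total += $3 }
     END  { for (c in sum) printf "Total amount for %s: %d\n", c, sum[c]
            print "Total of everything: " total }' stock.txt
Note that awk's for-in iteration order is unspecified, so the category lines may not come out in the same order as the sample output.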
There are many ways to do this. Here's a fun one which doesn't use awk, sed or perl. The only external utilities I've used here are cut, sort and uniq. You could even replace cut with a little more effort. In fact lines 5-9 of the snippet (the process substitution) could have been written more easily with grep (grep $kind stock.txt), but I avoided that to show off the power of bash.
for kind in $(cut -d\; -f 2 stock.txt | sort | uniq) ; do
    total=0
    while read d ; do
        total=$(( total+d ))
    done < <(
        while read line ; do
            [[ $line =~ $kind ]] && echo $line
        done < stock.txt | cut -d\; -f3
    )
    echo "Total amount for $kind: $total"
done
We lose the strict ordering of your original output here. An exercise for you might be to find a way not to do that.
Discussion:
The first line describes a sub-shell with a simple pipeline using cut. We read the second field from the stock.txt file, with fields delimited by ;, written \; here so the shell does not interpret it. The result is a newline-separated list of values from stock.txt. This is piped to sort, then uniq. This performs our "grouping" step, since the pipeline will output an alphabetic list of items from the second column but will only list each item once no matter how many times it appeared in the input file.
Also on the first line is a typical for loop: For each item resulting from the sub-shell we loop once, storing the value of the item in the variable kind. This is the other half of the grouping step, making sure that each "Total" output line occurs once.
On the second line total is initialized to zero so that it always resets whenever a new group is started.
The third line begins the 'totaling' loop, in which for the current kind we find the sum of its occurrences. Here we declare that we will read the variable d in from stdin on each iteration of the loop.
On the fourth line the totaling actually occurs: using shell arithmetic we add the value in d to the value in total.
Line five ends the while loop and then describes its input. We use shell input redirection via < to specify that the input to the loop, and thus to the read command, comes from a file. We then use process substitution to specify that the file will actually be the results of a command.
On the sixth line the command that will feed the while-read loop begins. It is itself another while-read loop, this time reading into the variable line. On the seventh line the test is performed via a conditional construct. Here we use [[ for its =~ operator, which is a pattern matching operator. We are testing to see whether $line matches our current $kind.
On the eighth line we end the inner while-read loop and specify that its input comes from the stock.txt file, then we pipe the output of the entire loop, which by now is simply all lines matching $kind, to cut and instruct it to show only the third field, which is the numeric field. On line nine we then end the process substitution command, the output of which is a newline-delimited list of numbers from lines which were of the group specified by kind.
Given that the total is now known and the kind is known it is a simple matter to print the results to the screen.
The answer below is the OP's. As it was edited into the question itself and the OP hasn't come back for 6 years, I am editing the answer out of the question and posting it as a wiki here.
My answer: to get the total price, I use this:
...
PRICE=0
IFS=";" # new field separator, the end of line
while read name cate price
do
    let PRICE=PRICE+$price
done < stock.txt
echo $PRICE
When I echo it, it's 35, which is correct. Now I will move on to using awk to get the sub-category results.
Whole Solution:
Thanks guys, I manage to do it myself. Here is my code:
#!/bin/bash
INPUT=stock.txt
PRICE=0
DRINKS=0
SNACKS=0
FRUITS=0
old_IFS=$IFS   # save the field separator
IFS=";"        # new field separator, the end of line
while read name cate price
do
    if [ $cate = "drinks" ]; then
        let DRINKS=DRINKS+$price
    fi
    if [ $cate = "snacks" ]; then
        let SNACKS=SNACKS+$price
    fi
    if [ $cate = "fruits" ]; then
        let FRUITS=FRUITS+$price
    fi
    # Total
    let PRICE=PRICE+$price
done < $INPUT
echo -e "Drinks: " $DRINKS
echo -e "Snacks: " $SNACKS
echo -e "Fruits: " $FRUITS
echo -e "Price " $PRICE
IFS=$old_IFS

How can I use bash (grep/sed/etc) to grab a section of a logfile between 2 timestamps?

I have a set of mail logs: mail.log mail.log.0 mail.log.1.gz mail.log.2.gz
each of these files contain chronologically sorted lines that begin with timestamps like:
May 3 13:21:12 ...
How can I easily grab every log entry after a certain date/time and before another date/time using bash (and related command line tools) without comparing every single line? Keep in mind that my before and after dates may not exactly match any entries in the logfiles.
It seems to me that I need to determine the offset of the first line greater than the starting timestamp, and the offset of the last line less than the ending timestamp, and cut that section out somehow.
Convert your min/max dates into "seconds since epoch",
MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`
Convert the first n words in each log line to the same,
L_DATE=`echo $LINE | awk '{print $1 $2 ... $n}'`
L_DATE=`date --date="$L_DATE" +%s`
Compare and throw away lines until you reach MIN,
if (( $MIN > $L_DATE )) ; then continue ; fi
Compare and print lines until you reach MAX,
if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi
Exit when you exceed MAX.
if (( $L_DATE > $MAX )) ; then exit 0 ; fi
The whole script minmaxlog.sh looks like this,
#!/usr/bin/env bash
MIN=`date --date="$1" +%s`
MAX=`date --date="$2" +%s`
while true ; do
    read LINE
    if [ "$LINE" = "" ] ; then break ; fi
    L_DATE=`echo $LINE | awk '{print $1 " " $2 " " $3 " " $4}'`
    L_DATE=`date --date="$L_DATE" +%s`
    if (( $MIN > $L_DATE )) ; then continue ; fi
    if (( $L_DATE <= $MAX )) ; then echo $LINE ; fi
    if (( $L_DATE > $MAX )) ; then break ; fi
done
I ran it on this file minmaxlog.input,
May 5 12:23:45 2009 first line
May 6 12:23:45 2009 second line
May 7 12:23:45 2009 third line
May 9 12:23:45 2009 fourth line
June 1 12:23:45 2009 fifth line
June 3 12:23:45 2009 sixth line
like this,
./minmaxlog.sh "May 6" "May 8" < minmaxlog.input
Here is one basic idea of how to do it:
Examine the datestamp on the file to see if it is irrelevant.
If it could be relevant, unzip if necessary and examine the first and last lines of the file to see if it contains the start or finish time.
If it does, use a recursive function to determine if it contains the start time in the first or second half of the file. Using a recursive function I think you could find any date in a million-line logfile with around 20 comparisons.
Echo the logfile(s) in order from the offset of the first entry to the offset of the last entry (no more comparisons).
What I don't know is: how to best read the nth line of a file (how efficient is it to use tail -n +N | head -1?)
Any help?
You have to look at every single line in the range you want (to tell if it's in the range you want) so I'm guessing you mean not every line in the file. At a bare minimum, you will have to look at every line in the file up to and including the first one outside your range (I'm assuming the lines are in date/time order).
This is a fairly simple pattern:
state = preprint
for every line in file:
    if line.date >= startdate:
        state = print
    if line.date > enddate:
        exit for loop
    if state == print:
        print line
You can write this in awk, Perl, Python, even COBOL if you must but the logic is always the same.
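For instance, a direct awk rendering of that pattern might look like this. It is only a sketch: it assumes the timestamps have already been normalized into a lexically sortable form in the first field (which the raw "May 3 13:21:12" syslog format is not), and START and END are hypothetical shell variables in the same format:
awk -v start="$START" -v end="$END" '
    $1 >= start { state = "print" }    # reached the start of the range
    $1 > end    { exit }               # past the end of the range: stop reading
    state == "print"                   # in range: print the line
' logfile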
Locating the line numbers first (with, say, grep) and then just blindly printing out that line range won't help, since grep also has to look at all the lines (all of them, not just up to the first outside the range, and most likely twice: once for the first line and once for the last).
If this is something you're going to do quite often, you may want to consider shifting the effort from 'every time you do it' to 'once, when the file is stabilized'. An example would be to load up the log file lines into a database, indexed by the date/time.
That takes a while to get set up but will result in your queries becoming a lot faster. I'm not necessarily advocating a database - you could probably achieve the same effect by splitting the log files into hourly logs thus:
2009/
    01/
        01/
            0000.log
            0100.log
            :
            2300.log
        02/
            :
Then for a given time, you know exactly where to start and stop looking. The range 2009/01/01-15:22 through 2009/01/05-09:07 would result in:
some (the last bit) of the file 2009/01/01/1500.log.
all of the files 2009/01/01/1[6-9]*.log.
all of the files 2009/01/01/2*.log.
all of the files 2009/01/0[2-4]/*.log.
all of the files 2009/01/05/0[0-8]*.log.
some (the first bit) of the file 2009/01/05/0900.log.
Of course, I'd write a script to return those lines rather than trying to do it manually each time.
Maybe you can try this:
sed -n "/BEGIN_DATE/,/END_DATE/p" logfile
It may be possible in a Bash environment, but you should really take advantage of tools that have more built-in support for working with strings and dates. For instance, Ruby seems to have the built-in ability to parse your date format. It can then convert it to an easily comparable Unix timestamp (a positive integer representing the seconds since the epoch).
irb> require 'time'
# => true
irb> Time.parse("May 3 13:21:12").to_i
# => 1241371272
You can then easily write a Ruby script:
Provide a start and end date, and convert those to Unix timestamp numbers.
Scan the log files line by line, converting the Date into its Unix Timestamp and check if that is in the range of the start and end dates.
Note: Converting to a Unix Timestamp integer first is nice because comparing integers is very easy and efficient to do.
You mentioned "without comparing every single line." It's going to be hard to "guess" at where in the log file the entries start being too old or too new without checking all the values in between. However, if there is indeed a monotonically increasing trend, then you know immediately when to stop parsing lines, because as soon as the next entry is too new (or old, depending on the layout of the data) you know you can stop searching. Still, there is the problem of finding the first line in your desired range.
I just noticed your edit. Here is what I would say:
If you are really worried about efficiently finding that start and end entry, then you could do a binary search for each. Or, if that seems like overkill or too difficult with bash tools, you could use a heuristic of reading only 5% of the lines (1 in every 20) to quickly get a close-to-exact answer, and then refine that if desired. These are just some suggestions for performance improvements.
I know this thread is old, but I just stumbled upon it after recently finding a one line solution for my needs:
awk -v ts_start="2018-11-01" -v ts_end="2018-11-15" -F, '$1>=ts_start && $1<ts_end' myfile
In this case, my file has records with comma-separated values and the timestamp in the first field. You can use any valid timestamp format for the start and end timestamps, and replace these with shell variables if desired.
If you want to write to a new file, just use normal output redirection (> newfile) appended to the end of the above.
