UNIX shell-scripting: Split a textfile by its entries - shell

I'm trying to analyze an enormous text file (1.6GB), whose data lines look like this:
20090118025859 -2.400000 78.100000 1023.200000 0.000000
20090118025900 -2.500000 78.100000 1023.200000 0.000000
20090118025901 -2.400000 78.100000 1023.200000 0.000000
I don't even know how many lines there are. But I'm trying to split the file by date. The left number is a time stamp (these lines for example are from 2009, january 18th).
How can I split this file into pieces according to the date?
The number of entries per date differs, so using split with a constant number won't work.
Everything I know would be to grep file '20090118*' > data20090118.dat , but there sure is a way to do all the dates at once, right?
Thanks in advance,
Alex

Using awk:
awk '{print > "data"substr($1,0,8)".dat"}' myfile

This should work if the items are in date sequence:
date=20090101 # Change to the earliest date
while IFS= read -rd $'\n' line
do
if [ "$(echo "$line" | cut -d ' ' -f 1 | cut -c 1-8)" -eq $date ]
then
echo "$line" >> "$date.dat"
else
let date++
fi
done < log.dat

With the caveats that each day needs to have more than 1 record,
and that the output file will have blank lines:
uniq --all-repeated=separate -w8 file | csplit -s - '/^$/' '{*}'
We really should have an option to uniq to output even uniq records.
Also csplit should have an option to suppress the matched line.

Related

Append number of days since the date in the line to each line in the file using Bash

I have a file that consists of the following...
false|aaa|user|aaa001|2014-12-11|
false|bbb|user|bbb||
false|ccc|user|ccc|2021-10-19|
false|ddd|user|ddd|2018-11-16|
false|eee|user|eee|2020-06-02|
I want to use the date in the 5th column to calculate the number of days from the current date and append it to each line in the file.
The end result would be a file that looks like the following, assuming the current date is 1/13/2022...
false|aaa|user|aaa001|2014-12-11|2590
false|bbb|user|bbb||
false|ccc|user|ccc|2021-10-19|86
false|ddd|user|ddd|2018-11-16|1154
false|eee|user|eee|2020-06-02|590
Some lines in the file will not contain a date value (which is expected). I need a solution for a Bash script on Linux.
I am able to submit a command using echo for a single line and then calculate the number of days from the current date by using cut on the 5th field (see below)...
echo "false|aaa|user|aaa001|2014-12-11" | echo $(( ($(date --date=date +"%Y-%m-%d" +%s) - $(date --date=cut -d'|' -f5 +%s) )/(60*60*24) ))
2590
I don't know how to do this one line at a time, capture the 'number of days' value and then append it to each line in the file.
Here's an approach using
paste to append the outputs
sed to arrange the empty lines and
awk to calculate the desired days.
This works with GNU date. BSD date has to use something like date -jf x +%s.
EDIT: Updated the date to compare with to current day.
% current=$(date +%m/%d/%Y)
% paste -d"\0" file <(cut -d"|" -f5 file |
sed 's/^$/#/' |
xargs -Ix date -d x +%s 2>&1 |
awk -v cur="$(date -d "$current" +%s)" '/invalid/{print 0; next}
{print int((cur-$1)/3600/24)}')
false|aaa|user|aaa001|2014-12-11|2590
false|bbb|user|bbb||0
false|ccc|user|ccc|2021-10-19|86
false|ddd|user|ddd|2018-11-16|1154
false|eee|user|eee|2020-06-02|590
Also date returns date: invalid date ‘#’ in the empty case. If any other implementation behaves differently the awk regex has to be adjusted accordingly.
Data
% cat file
false|aaa|user|aaa001|2014-12-11|
false|bbb|user|bbb||
false|ccc|user|ccc|2021-10-19|
false|ddd|user|ddd|2018-11-16|
false|eee|user|eee|2020-06-02|

Grep dates from file and format them

I have a file "tmp.txt" looking like that:
random text random text 25/06/2021 15:15:15
random text random text 26/06/2021 15:15:15
random text random text 26/06/2021 15:15:15
and I would like to:
extract all datetimes
add 4 hours
display them as timestamp
I didn't figured out yet how to add hour as I,m facing an issue with the date format not being recognized by the date function.
(I would like to be able to do it with a single line command if possible)
Here is my current command:
egrep -o "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}" tmp.txt | while read -r line ; do echo $(date -d "$line" +%s);done
Help appreciated!
Tried and Tested, Minimal Solution
You can use the below command line to get the desired result. I have tested it with your example and it worked as expected on my Linux machine.
egrep -o "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}" tmp.txt | while read -r line; do dd=${line:0:2}; mm=${line:3:2}; yyyy=${line:6:4}; time=${line:11:8}; date -d "${yyyy}-${mm}-${dd} ${time} 4 hours" +'%Y-%m-%d %H:%M:%S'; done
I'll break it down into multiple lines so it's easy to understand:
egrep -o "[0-9]{2}/[0-9]{2}/[0-9]{4} [0-9]{2}:[0-9]{2}:[0-9]{2}" tmp.txt \
| \
while read -r line; do
# Reading date and time into separate variables
dd=${line:0:2};
mm=${line:3:2};
yyyy=${line:6:4};
time=${line:11:8};
# Adding 4 hours and displaying datetime in desired format
date -d "${yyyy}-${mm}-${dd} ${time} 4 hours" +'%Y-%m-%d %H:%M:%S';
done
To add 4 hours, you can just mention it after the datetime in -d option as shown above, I tried with hours, minute and days and it worked as expected
For your input file tmp.txt:
random text random text 25/06/2021 15:15:15
random text random text 26/06/2021 15:15:15
random text random text 26/06/2021 15:15:15
On running my command, the output was:
2021-06-25 19:15:15
2021-06-26 19:15:15
2021-06-26 19:15:15
I tested it with edge cases like close to midnight time, leap years etc and it worked fine
Let me adjust the timestamps to make the output more interesting:
$ cat tmp.txt
random text random text 25/06/2021 15:15:15
random text random text 26/06/2021 20:15:15
random text random text 26/06/2021 23:15:15
#jhnc has the right idea: use a language that's both good at text manipulation and can do date arithmetic. I'd use Time::Piece
perl -MTime::Piece -lne '
m{(\d\d/\d\d/\d\d\d\d \d\d:\d\d:\d\d)} or continue;
$t = Time::Piece->strptime($1, "%d/%m/%Y %T");
$t += 4 * 3600;
print $t->strftime("%F %T")
' tmp.txt
2021-06-25 19:15:15
2021-06-27 00:15:15
2021-06-27 03:15:15
Or, here's perl piping into xargs for the date stuff
perl -pe 's{.*(\d{2})/(\d{2})/(\d{4}) (\d{2}:\d{2}:\d{2}).*}
{$2/$1/$3 +4 hours $4}
' tmp.txt | xargs -I DT date -d DT '+%F %T'
2021-06-25 19:15:15
2021-06-27 00:15:15
2021-06-27 03:15:15

Get a percentage of randomly chosen lines from a text file

I have a text file (bigfile.txt) with thousands of rows. I want to make a smaller text file with 1 % of the rows which are randomly chosen. I tried the following
output=$(wc -l bigfile.txt)
ds1=$(0.01*output)
sort -r bigfile.txt|shuf|head -n ds1
It give the following error:
head: invalid number of lines: ‘ds1’
I don't know what is wrong.
Even after you fix your issues with your bash script, it cannot do floating point arithmetic. You need external tools like Awk which I would use as
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' bigfile.txt)
(( randomCount )) && sort -r file | shuf | head -n "$randomCount"
E.g. Writing a file with with 221 lines using the below loop and trying to get random lines,
tmpfile=$(mktemp /tmp/abc-script.XXXXXX)
for i in {1..221}; do echo $i; done >> "$tmpfile"
randomCount=$(awk 'END{print int((NR==0)?0:(NR/100))}' "$tmpfile")
If I print the count, it would return me a integer number 2 and using that on the next command,
sort -r "$tmpfile" | shuf | head -n "$randomCount"
86
126
Roll a die (with rand()) for each line of the file and get a number between 0 and 1. Print the line if the die shows less than 0.01:
awk 'rand()<0.01' bigFile
Quick test - generate 100,000,000 lines and count how many get through:
seq 1 100000000 | awk 'rand()<0.01' | wc -l
999308
Pretty close to 1%.
If you want the order random as well as the selection, you can pass this through shuf afterwards:
seq 1 100000000 | awk 'rand()<0.01' | shuf
On the subject of efficiency which came up in the comments, this solution takes 24s on my iMac with 100,000,000 lines:
time { seq 1 100000000 | awk 'rand()<0.01' > /dev/null; }
real 0m23.738s
user 0m31.787s
sys 0m0.490s
The only other solution that works here, heavily based on OP's original code, takes 13 minutes 19s.

Sorting and printing a file in bash UNIX

I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here it what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (with the highest number, since you're doing a reverse numeric sort immediately prior), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
read firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
while read -r num name; do
[[ $num = $firstnum ]] || break
printf '%s\t%s\n' "$num" "$name"
done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp

Using sed to find, convert and replace lines

I don't know too much of bash scripting and I'm trying to develop a bash script to do this operations:
I have a lot of .txt files in the same directory.
Every .txt file follows this structure:
file1.txt:
<name>first operation</name>
<operation>21</operation>
<StartTime>1292435633</StartTime>
<EndTime>1292435640</EndTime>
<name>second operation</name>
<operation>21</operation>
<StartTime>1292435646</StartTime>
<EndTime>1292435650</EndTime>
I want to search every <StartTime> line and convert it to standard date/time format (not unix timestamp) but preserving the structure <StartTime>2010-12-15 22:52</StartTime>, for example. This could be a function of search/replace, using sed? I think I could use these function that I found: date --utc --date "1970-01-01 $1 sec" "+%Y-%m-%d %T"
I want to to do the same with <EndTime> tag.
I should do this for all *.txt files in a directory.
I tried using sed but with not wanted results. As I said I don't know so much of bash scripting so any help would be appreciated.
Thank you for your help!
Regards
sed is incapable of doing date conversions; instead I would reccomend you to use a more appropriate tool like awk:
echo '<StartTime>1292435633</StartTime>' | awk '{
match($0,/[0-9]+/);
t = strftime("%F %T",substr($0,RSTART,RLENGTH),1);
sub(/[0-9]+/,t)
}
{print}'
If your input files have one tag per line, as in your structure example, it should work flawlessly.
If you need to repeat the operation for every .txt file just use a shell for:
for file in *.txt; do
awk '/^<[^>]*Time>/{
match($0,/[0-9]+/);
t = strftime("%F %T",substr($0,RSTART,RLENGTH),1);
sub(/[0-9]+/,t)
} 1' "$file" >"$file.new"
# mv "$file.new" "$file"
done
In comparison to the previous code, I have done two minor changes:
added condition /^<[^>]*Time>/ that checks if the current line starts with or
converted {print} to the shorter '1'
If the files ending with .new contain the result you were expecting, you can uncomment the line containing mv.
Using grep:
while read line;do
if [[ $line == *"<StartTime>"* || $line == *"<EndTime>"* ]];then
n=$(echo $line | grep -Po '(?<=(>)).*(?=<)')
line=${line/$n/$(date -d #$n)}
fi
echo $line >> file1.new.txt
done < file1.txt
$ cat file1.new.txt
<name>first operation</name>
<operation>21</operation>
<StartTime>Wed Dec 15 18:53:53 CET 2010</StartTime>
<EndTime>Wed Dec 15 18:54:00 CET 2010</EndTime>
<name>second operation</name>
<operation>21</operation>
<StartTime>Wed Dec 15 18:54:06 CET 2010</StartTime>
<EndTime>Wed Dec 15 18:54:10 CET 2010</EndTime>

Resources