How to process a log file using MapReduce - hadoop

I would like to understand how to process a log file using MapReduce.
For example, suppose I have a file transfer log like this:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:33:00.000, FileB,SourceB, HostB,2Gb, 2GB
2012-11-18 T 17:07:00.000, FileC,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:19:00.000, FileB,SourceC, HostA,1Gb, 500Mb
2012-11-18 T 17:23:00.000, FileA,SourceC, HostC,1Gb, 500Mb
and I want to aggregate it and produce output like this:
Start_Datestamp,file_name,source, Total_transfered_size
2012-11-18 T 16:00, FileA,SourceA, 1000Mb
2012-11-18 T 16:30, FileB,SourceB, 2GB
2012-11-18 T 17:00, FileC,SourceC,500Mb
2012-11-18 T 17:00, FileB,SourceC, 500Mb
2012-11-18 T 17:00, FileA,SourceC, 500Mb
It should aggregate file transfers by 30-minute interval, as shown above.
I managed to implement the 30-minute interval aggregation using the tutorial below:
http://www.informit.com/articles/article.aspx?p=2017061
But its output is very simple:
Start_Datestamp,count
2012-11-18 T 16:00, 2
2012-11-18 T 16:30, 1
2012-11-18 T 17:00,3
But I am not sure how to use the other fields. I tried using WritableComparable to create a composite key made up of Start_Datestamp, file_name and source, but it is not working correctly. Could someone point me in the right direction?
UPDATE
I have now managed to print multiple fields using Sudarshan's advice. However, I have encountered another issue.
For example, let's take a look at this sample data, based on the table above:
Start_Datestamp,file_name,source, host,file_size,transfered_size
2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb
2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb
2012-11-18 T 16:28:00.000, FileA,SourceB, HostB,1Gb, 500Mb
What I would like to do is group the data by timestamp (in 30-minute intervals) and by source, and compute sum(transfered_size),
so it would look like this:
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== Note that those two records are now merged into the '16:00' timestamp.
2012-11-18 T 16:00,SourceB, 500Mb <<== This record should not be merged because it has a different source, even though its timestamp is within the '16:00' frame.
But what is happening in my case is that only the first record for each interval is being printed,
e.g.
Start_Datestamp,source, Total_transfered_size
2012-11-18 T 16:00,SourceA, 1000Mb <<== Only this record is getting printed. The other one is not.
In my Map class, I've added the following snippets:
out = "," + src_loc + "," + dst_loc + "," + remote + ","
+ transfer + " " + activity + "," + read_bytes+ ","
+ write_bytes + "," + file_name + " "
+ total_time + "," + finished;
date.setDate(calendar.getTime());
output.collect(date, new Text(out));
Then in reducer:
String newline = System.getProperty("line.separator");
while (values.hasNext()) {
out += values.next().toString() + newline;
}
output.collect(key, new Text(out));
I think the problem is with the iteration in the reducer.
I tried moving the code below inside the while loop, which appears to print all records. But I'm not entirely sure whether this is the correct approach. Any advice would be much appreciated.
output.collect(key, new Text(out));

You are on the right path. Currently you pass 1 as the value, where custom_key is the time rounded to 30-minute intervals:
output.collect(custom_key, one);
Instead, you can pass the entire log text as the value:
output.collect(custom_key, log_text);
In the reducer you will then receive the entire log text of each record in your iterable. Parse it in the reducer and use the relevant fields:
map<source, data_transferred>
for each line in iterable
    parse the log text line
    extract file_name, source, transfered_size
    add transfered_size to the sum stored in the map for that source
end loop
for each entry in map
    output time, source, and the sum calculated in the step above
end loop
This answer makes a couple of assumptions:
You don't mind multiple output files.
You are not concerned with the ordering of the output.
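The pseudocode above can be sketched outside Hadoop to check the grouping logic. This is only a minimal in-memory simulation in Python, not Hadoop code: the function names (bucket_30min, to_mb, map_phase, shuffle, reduce_phase, run) are made up for illustration, and treating 1 Gb as 1000 Mb is an assumption taken from the way the question sums two 500Mb transfers to 1000Mb.

```python
from collections import defaultdict
from datetime import datetime


def bucket_30min(stamp: str) -> str:
    """Round a stamp like '2012-11-18 T 16:05:00.000' down to its 30-minute slot."""
    ts = datetime.strptime(stamp.strip(), "%Y-%m-%d T %H:%M:%S.%f")
    ts = ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)
    return ts.strftime("%Y-%m-%d T %H:%M")


def to_mb(size: str) -> int:
    """Convert '500Mb' or '2GB' to megabytes (assumption: 1 Gb = 1000 Mb)."""
    size = size.strip().lower()
    if size.endswith("gb"):
        return int(size[:-2]) * 1000
    if size.endswith("mb"):
        return int(size[:-2])
    raise ValueError(f"unknown size unit: {size!r}")


def map_phase(lines):
    """Mapper: emit (30-minute bucket, whole log line)."""
    for line in lines:
        stamp = line.split(",")[0]
        yield bucket_30min(stamp), line


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(bucket, values):
    """Reducer: sum transferred size per source, then emit one line per source."""
    totals = defaultdict(int)
    for line in values:
        fields = [f.strip() for f in line.split(",")]
        source, transferred = fields[2], fields[5]
        totals[source] += to_mb(transferred)
    for source, total in totals.items():
        yield f"{bucket},{source},{total}Mb"


def run(lines):
    """Drive map, shuffle and reduce, with buckets in sorted order."""
    out = []
    for bucket, values in sorted(shuffle(map_phase(lines)).items()):
        out.extend(reduce_phase(bucket, values))
    return out


LOG = [
    "2012-11-18 T 16:05:00.000, FileA,SourceA, HostA,1Gb, 500Mb",
    "2012-11-18 T 16:25:00.000, FileA,SourceA, HostB,1Gb, 500Mb",
    "2012-11-18 T 16:28:00.000, FileA,SourceB, HostB,1Gb, 500Mb",
]
```

The point to notice is that the reducer emits once per source after the loop over the values has finished, so both SourceA and SourceB appear for the 16:00 bucket.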

Dynamic number system in Qlik Sense

My data consists of large numbers. I have a column, say 'amount', and when I use it in charts (sum of amount on the Y axis) it shows something like 1.4G. I want values in the billions to be shown as e.g. 2.8B, values in the millions as 80M, and values in the thousands (14,000) simply as 14k.
I have used if(sum(amount)/1000000000 > 1, Num(sum(amount)/1000000000, '#,###B'), Num(sum(amount)/1000000, '#,###M')), but it does not show the M or B at the end of the figure. Also, how can I include thousands in the same expression?
EDIT: Updated to include the dual() function.
This worked for me:
=dual(
if(sum(amount) < 1, Num(sum(amount), '#,##0.00'),
if(sum(amount) < 1000, Num(sum(amount), '#,##0'),
if(sum(amount) < 1000000, Num(sum(amount)/1000, '#,##0k'),
if(sum(amount) < 1000000000, Num(sum(amount)/1000000, '#,##0M'),
Num(sum(amount)/1000000000, '#,##0B')
))))
, sum(amount)
)
Here are some example outputs using this script to format it:
=sum(amount)     Formatted
2,526,163,764    3B
79,342,364       79M
5,589,255        5M
947,470          947k
583              583
0.6434           0.64
To get more decimals for any of those, like 2.53B instead of 3B, add decimal places to the format pattern, e.g. '#,##0.00B'.
Also make sure that the Number Formatting property is set to Auto or Measure expression.
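For readers who want to check the threshold logic outside Qlik, here is a rough Python equivalent of the nested if() expression. It is only a sketch: the function name abbreviate and its decimals parameter are made up for illustration, and Python's rounding of the scaled value may not match Qlik's Num() in every edge case.

```python
def abbreviate(value: float, decimals: int = 0) -> str:
    """Format a number with a k/M/B suffix, using the same thresholds as the if() chain."""
    if value < 1:
        return f"{value:,.2f}"          # small values keep two decimals
    if value < 1_000:
        return f"{value:,.0f}"          # below one thousand: no suffix
    for limit, divisor, suffix in (
        (1_000_000, 1_000, "k"),
        (1_000_000_000, 1_000_000, "M"),
    ):
        if value < limit:
            return f"{value / divisor:,.{decimals}f}{suffix}"
    return f"{value / 1_000_000_000:,.{decimals}f}B"
```

Passing decimals=2 corresponds to switching the pattern to '#,##0.00B'.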

tmssoftware TTMSFNCGrid slow data loading

Delphi 10.4.2, TTMSFNCGrid ver. 1.0.5.16
I am downloading about 30,000 records from the database into a JSON object. This takes about 1 minute.
I then load the data, in a for loop, into a TTMSFNCGrid, which ends up with about 30,000 rows and 16 columns. Loading the data takes 20 minutes! That is how long it takes to render and populate the grid. How can I speed up this process?
I use something like this:
for _i := 0 to JSON_ARRAY_DANE.Count - 1 do
begin
  _row := JSON_ARRAY_DANE.Items[_i] as TJSONObject;
  _grid.Cells[0, _i + 1] := _row.GetValue('c1').Value;
  _grid.Cells[1, _i + 1] := _row.GetValue('c2').Value;
  _grid.Cells[2, _i + 1] := _row.GetValue('c3').Value;
  .
  .
  _grid.Cells[15, _i + 1] := _row.GetValue('c16').Value;
end;
Resolved.
I needed to wrap the loop in _grid.BeginUpdate and _grid.EndUpdate calls:
_grid.BeginUpdate;  // suspend repainting while the cells are filled
try
  for _i := 0 to JSON_ARRAY_DANE.Count - 1 do
  begin
    _row := JSON_ARRAY_DANE.Items[_i] as TJSONObject;
    _grid.Cells[0, _i + 1] := _row.GetValue('c1').Value;
    _grid.Cells[1, _i + 1] := _row.GetValue('c2').Value;
    _grid.Cells[2, _i + 1] := _row.GetValue('c3').Value;
    .
    .
    _grid.Cells[15, _i + 1] := _row.GetValue('c16').Value;
  end;
finally
  _grid.EndUpdate;  // repaint once, after all cells are set
end;
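Why this helps can be illustrated with a toy model. The Python sketch below is not the TMS implementation; ToyGrid and fill are hypothetical names, and the assumption is simply that a grid repaints after every cell change unless an update block is open, which is the usual contract of BeginUpdate/EndUpdate pairs.

```python
class ToyGrid:
    """Hypothetical stand-in for a UI grid; it counts repaints instead of drawing."""

    def __init__(self):
        self.cells = {}
        self.repaints = 0
        self._update_depth = 0

    def begin_update(self):
        # Nested begin/end pairs are allowed; only the outermost end repaints.
        self._update_depth += 1

    def end_update(self):
        self._update_depth -= 1
        if self._update_depth == 0:
            self._repaint()

    def set_cell(self, col, row, value):
        self.cells[(col, row)] = value
        if self._update_depth == 0:
            self._repaint()  # no update block open: repaint immediately

    def _repaint(self):
        self.repaints += 1


def fill(grid, rows, batched):
    """Fill the grid row by row and return how many repaints were triggered."""
    if batched:
        grid.begin_update()
    try:
        for r, row in enumerate(rows, start=1):
            for c, value in enumerate(row):
                grid.set_cell(c, r, value)
    finally:
        if batched:
            grid.end_update()
    return grid.repaints


ROWS = [[f"r{r}c{c}" for c in range(16)] for r in range(100)]
```

In this model, filling 100 rows of 16 columns triggers one repaint per cell without the update block and a single repaint with it.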

Tibco Spotfire - time in seconds & milliseconds in Real, convert to a time of day

I have a list of time in a decimal format of seconds, and I know what time the series started. I would like to convert it to a time of day with the offset of the start time applied. There must be a simple way to do this that I am really missing!
Sample source data:
\Name of source file : 260521-11_58
\Recording from 26.05.2021 11:58
\Channels : 1
\Scan rate : 101 ms = 0.101 sec
\Variable 1: n1(rpm)
\Internal identifier: 63
\Information1:
\Information2:
\Information3:
\Information4:
0.00000 3722.35645
0.10100 3751.06445
0.20200 1868.33350
0.30300 1868.36487
0.40400 3722.39355
0.50500 3722.51831
0.60600 3722.50464
0.70700 3722.32446
0.80800 3722.34277
0.90900 3722.47729
1.01000 3722.74048
1.11100 3722.66650
1.21200 3722.39355
1.31300 3751.02710
1.41400 1868.27539
1.51500 3722.49097
1.61600 3750.93286
1.71700 1868.30334
1.81800 3722.29224
The start date and time is 26.05.2021 11:58, and the left-hand column is the elapsed time in seconds, with the column name [Time]. So I just want to convert the decimal (Real) value to a time or timespan and add the start time to it.
I have tried lots of approaches that are really hacky and ultimately flawed. The one below works, but it just ignores the milliseconds.
TimeSpan(0,0,0,Integer(Floor([Time])),[Time] - Integer(Floor([Time])))
The last part works on its own to get just the milliseconds/microseconds, but not as part of the above.
Your formula isn't really ignoring the milliseconds. You are passing the fractional part of your time (in seconds) as the milliseconds argument, so the value being returned is too small to show up in the format mask.
You need to convert the fractional seconds to milliseconds, so something like this should work:
TimeSpan(0,0,0,Integer(Floor([Time])),([Time] - Integer(Floor([Time]))) * 1000)
To add it to the start date, this would work:
DateAdd(Date("26-May-2021"),TimeSpan(0,0,0,Integer([Time]),([Time] - Integer([Time])) * 1000))
You will need to set the column format to:
dd-MMM-yyyy HH:mm:ss:fff
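The same arithmetic can be checked outside Spotfire. The sketch below is Python, not a Spotfire expression; the name to_time_of_day is made up for illustration, and the start value assumes the recording began at exactly 11:58:00 on 26.05.2021, as stated in the file header.

```python
from datetime import datetime, timedelta

# Assumption: the series starts at 26.05.2021 11:58:00.000.
START = datetime(2021, 5, 26, 11, 58)


def to_time_of_day(elapsed_seconds: float, start: datetime = START) -> datetime:
    """Split elapsed seconds into whole seconds and milliseconds, then add to start."""
    whole = int(elapsed_seconds)
    # Multiply the fractional part by 1000 to get milliseconds, as in the corrected formula.
    millis = round((elapsed_seconds - whole) * 1000)
    return start + timedelta(seconds=whole, milliseconds=millis)
```

For example, to_time_of_day(1.818).isoformat(timespec="milliseconds") gives 2021-05-26T11:58:01.818.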

Spark SQL performance too slow if the number of columns is 100000

I'm testing Spark performance with a table that has a very large number of columns.
What I did is very simple.
Prepare a CSV file which has many columns and only 2 data records.
E.g., the CSV file looks like this:
col000001,col000002,,,,,,,col100000
dtA000001,dtA000002,,,,,,,,dtA100000
dtB000001,dtB000002,,,,,,,,dtB100000
dfdata100000 = sqlContext.read.csv('../datasets/100000c.csv', header='true')
dfdata100000.registerTempTable("tbl100000")
result = sqlContext.sql("select col000001,col100000 from tbl100000")
Then get 1 row with show(1):
%%time
result.show(1)
File sizes are as follows (very small).
The file name shows the number of columns:
$ du -m *c.csv
3 100000c.csv
1 10000c.csv
1 1000c.csv
1 100c.csv
1 20479c.csv
2 40000c.csv
2 60000c.csv
3 80000c.csv
The results are as follows.
As you can see, the execution time increases exponentially.
Example result:
+---------+---------+
|col000001|col100000|
+---------+---------+
|dtA000001|dtA100000|
+---------+---------+
only showing top 1 row
CPU times: user 218 ms, sys: 509 ms, total: 727 ms
Wall time: 53min 22s
Question 1: Is this an acceptable result? Why does the execution time increase exponentially?
Question 2: Is there any other method that would be faster?

HyperTable: Loading data using Mutators Vs. LOAD DATA INFILE

I am starting a discussion which, I hope, will become the one place to discuss loading data using mutators vs. loading from a flat file via 'LOAD DATA INFILE'.
I have been struggling to get good performance using mutators (with batch size = 1000, 10000, 100K, et cetera).
My project involves loading close to 400 million rows of social media data into HyperTable to be used for real-time analytics. It took me close to 3 days just to load 1 million rows of data (code sample below). Each row is approximately 32 bytes. So, to avoid taking 2-3 weeks to load this much data, I prepared a flat file with the rows and used the LOAD DATA INFILE method. The performance gain was amazing. Using this method, the loading rate was 368336 cells/sec.
See below for an actual snapshot of the loads:
hypertable> LOAD DATA INFILE "/data/tmp/users.dat" INTO TABLE users;
Loading 7,113,154,337 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 508.07 s
Avg key size: 8.92 bytes
Total cells: 218976067
Throughput: 430998.80 cells/s
Resends: 2210404
hypertable> LOAD DATA INFILE "/data/tmp/graph.dat" INTO TABLE graph;
Loading 12,693,476,187 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 1189.71 s
Avg key size: 17.48 bytes
Total cells: 437952134
Throughput: 368118.13 cells/s
Resends: 1483209
Why is the performance difference between the 2 methods so vast? What's the best way to improve mutator performance? Sample mutator code is below:
my $batch_size = 1000000; # or 1000 or 10000 make no substantial difference
my $ignore_unknown_cfs = 2;
my $ht = new Hypertable::ThriftClient($master, $port);
my $ns = $ht->namespace_open($namespace);
my $users_mutator = $ht->mutator_open($ns, 'users', $ignore_unknown_cfs, 10);
my $graph_mutator = $ht->mutator_open($ns, 'graph', $ignore_unknown_cfs, 10);
my $keys = new Hypertable::ThriftGen::Key({ row => $row, column_family => $cf, column_qualifier => $cq });
my $cell = new Hypertable::ThriftGen::Cell({key => $keys, value => $val});
$ht->mutator_set_cell($mutator, $cell);
$ht->mutator_flush($mutator);
I would appreciate any input on this. I don't have a tremendous amount of HyperTable experience.
Thanks.
If it's taking three days to load one million rows, then you're probably calling mutator_flush() after every row insert, which is not the right thing to do. Before I describe how to fix that: your mutator_open() arguments aren't quite right. You don't need to specify ignore_unknown_cfs, and you should supply 0 for the flush_interval, something like this:
my $users_mutator = $ht->mutator_open($ns, 'users', 0, 0);
my $graph_mutator = $ht->mutator_open($ns, 'graph', 0, 0);
You should only call mutator_flush() if you would like to checkpoint how much of the input data has been consumed. A successful call to mutator_flush() means that all data that has been inserted on that mutator has durably made it into the database. If you're not checkpointing how much of the input data has been consumed, then there is no need to call mutator_flush(), since it will get flushed automatically when you close the mutator.
The next performance problem I see in your code is that you're using mutator_set_cell(). Each call to it is a round-trip to the ThriftBroker, which is expensive, so you should use either mutator_set_cells() or mutator_set_cells_as_arrays() instead. With the mutator_set_cells_* methods, you amortize that round-trip over many cells. The mutator_set_cells_as_arrays() method can be more efficient for languages where object construction overhead is large in comparison to native datatypes (e.g. string). I'm not sure about Perl, but you might want to give that a try to see if it boosts performance.
Also, be sure to call mutator_close() when you're finished with the mutator.
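The effect of amortizing the round-trip can be shown with a toy model. The Python sketch below does not use the Hypertable Thrift API: FakeBroker, load_one_by_one and load_in_batches are hypothetical names, and the only thing modelled is that every method call on the broker costs one round-trip regardless of how many cells it carries.

```python
class FakeBroker:
    """Hypothetical stand-in for a remote broker; it only counts round-trips."""

    def __init__(self):
        self.round_trips = 0
        self.stored = []

    def set_cell(self, cell):
        self.round_trips += 1      # one network round-trip per cell
        self.stored.append(cell)

    def set_cells(self, cells):
        self.round_trips += 1      # one network round-trip per batch
        self.stored.extend(cells)


def load_one_by_one(broker, cells):
    """Send every cell in its own call."""
    for cell in cells:
        broker.set_cell(cell)


def load_in_batches(broker, cells, batch_size=1000):
    """Buffer cells locally and send them batch_size at a time."""
    batch = []
    for cell in cells:
        batch.append(cell)
        if len(batch) == batch_size:
            broker.set_cells(batch)
            batch = []
    if batch:                      # send the final, partially filled batch
        broker.set_cells(batch)


CELLS = [(f"row{i}", "cf", "cq", f"val{i}") for i in range(10_500)]
```

With 10,500 cells and a batch size of 1000, the batched loader makes 11 calls where the one-by-one loader makes 10,500, and both store the same data.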
