I would like to profile a complex web application from the server PoV.
According to the wikipedia link above, and the Stack Overflow profiling tag description, profiling (in one of its forms) means getting a list (or a graphical representation) of APIs/components of the application, each with the number of calls and time spent in it during run-time.
Note that unlike a traditional one-program/one-language application, a web server application may be:
Distributed over multiple machines
Different components may be written in different languages
Different components may be running on top of different OSes, etc.
So the traditional "Just use a profiler" answer is not easily applicable to this problem.
I'm not looking for:
Coarse performance stats like the ones provided by various log-analysis tools (e.g. analog), nor for
client-side, per-page performance stats like the ones presented by tools like Google's PageSpeed or Yahoo!'s YSlow (waterfall diagrams and browser component load times).
Instead, I'm looking for a classic profiler-style report:
number of calls
call durations
by function/API/component-name, on the server-side of the web application.
Bottom line, the question is:
How can one profile a multi-tiered, multi-platform, distributed web application?
A free-software based solution is much preferred.
I have been searching the web for a solution for a while and couldn't find anything satisfactory to fit my needs except some pretty expensive commercial offerings. In the end, I bit the bullet, thought about the problem, and wrote my own solution which I wanted to freely share.
I'm posting my own solution since this practice is encouraged on SO.
This solution is far from perfect; for example, it works at a very high level (individual URLs), which may not be suitable for all use-cases. Nevertheless, it has helped me immensely in understanding where my web-app spends its time.
In the spirit of open source and knowledge sharing, I welcome any other, especially superior, approaches and solutions from others.
Thinking of how traditional profilers work, it should be straightforward to come up with a general free-software solution to this challenge.
Let's break the problem into two parts:
Collecting the data
Presenting the data
Collecting the data
Assume we can break our web application into its individual
constituent parts (API, functions) and measure the time it takes
each of these parts to complete. Each part is called thousands of
times a day, so we could collect this data over a full day or so on
multiple hosts. When the day is over we would have a pretty big and
relevant data-set.
Epiphany #1: substitute 'function' with 'URL', and our existing
web-logs are "it". The data we need is already there:
Each part of a web API is defined by the request URL (possibly
with some parameters)
The round-trip times (often in microseconds) appear on each line
We have a day, (week, month) worth of lines with this data handy
So if we have access to standard web-logs for all the distributed
parts of our web application, part one of our problem (collecting
the data) is solved.
Presenting the data
Now we have a big data-set, but still no real insight.
How can we gain insight?
Epiphany #2: visualize our (multiple) web-server logs directly.
A picture is worth a thousand words. Which picture can we use?
We need to condense hundreds of thousands or millions of lines of multiple
web-server logs into a short summary which would tell most of the
story about our performance. In other words: the goal is to generate
a profiler-like report, or even better: a graphical profiler report,
directly from our web logs.
Imagine we could map:
Call-latencies to one dimension
Number of calls to another dimension, and
The function identities to a color (essentially a 3rd dimension)
One such picture: a stacked-density chart of latencies by API
appears below (function names were made up for illustrative purposes).
The Chart:
Some observations from this example
We have a tri-modal distribution representing 3 radically
different 'worlds' in our application:
The fastest responses are centered around ~300 microseconds
of latency. These responses are coming from our Varnish cache
The second fastest, taking a bit less than 0.01 seconds on
average, are coming from various APIs served by our middle-layer
web application (Apache/Tomcat)
The slowest responses, centered around 0.1 seconds and
sometimes taking several seconds, involve round-trips
to our SQL database.
We can see how dramatic caching effects can be on an application
(note that the x-axis is on a log10 scale)
We can specifically see which APIs tend to be fast vs slow, so
we know what to focus on.
We can see which APIs are most often called each day.
We can also see that some of them are so rarely called, it is hard to even see their color on the chart.
How to do it?
The first step is to pre-process and extract the needed subset of data
from the logs. A trivial utility like Unix 'cut' on multiple logs
may be sufficient here. You may also need to collapse multiple
similar URLs into shorter strings describing the function/API like
'registration', or 'purchase'. If you have a multi-host unified log
view generated by a load-balancer, this task may be easier. We
extract only the names of the APIs (URLs) and their latencies, so we
end up with one big file with a pair of columns, separated by TABs
*API_Name Latency_in_microSecs*
func_01 32734
func_01 32851
func_06 598452
...
func_11 232734
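If 'cut' isn't flexible enough (for example when collapsing many similar URLs into one API name), here is a minimal Python sketch of this preprocessing step. It assumes a common access-log layout where the request path is the 7th whitespace-separated field and the latency in microseconds is the last field, and the URL-to-API mapping is made up; adjust both to your own logs:

#!/usr/bin/env python
# Turn raw access-log lines into "API<TAB>latency_in_microsecs" pairs.
import fileinput
import re

# Hypothetical mapping from URL patterns to short API names
API_PATTERNS = [
    (re.compile(r'^/register'), 'registration'),
    (re.compile(r'^/checkout'), 'purchase'),
]

def api_name(url):
    path = url.split('?')[0]
    for pattern, name in API_PATTERNS:
        if pattern.search(path):
            return name
    return path                      # fall back to the raw path

for line in fileinput.input():
    fields = line.split()
    if len(fields) < 7:
        continue
    url, latency = fields[6], fields[-1]
    if latency.isdigit():            # skip lines without a latency field
        print('%s\t%s' % (api_name(url), latency))

Concatenating its output over all log files yields the two-column TSV format shown above.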
Now we run the R script below on the resulting data pairs to produce
the wanted chart (using Hadley Wickham's wonderful ggplot2 library).
Voilà!
The code to generate the chart
Finally, here's the code to produce the chart from the API+Latency TSV data file:
#!/usr/bin/Rscript --vanilla
#
# Generate stacked chart of API latencies by API from a TSV data-set
#
# ariel faigon - Dec 2012
#
.libPaths(c('~/local/lib/R',
            '/usr/lib/R/library',
            '/usr/lib/R/site-library'
))
suppressPackageStartupMessages(library(ggplot2))
# grid lib needed for 'unit()':
suppressPackageStartupMessages(library(grid))
#
# Constants: width, height, resolution, font-colors and styles
# Adapt to taste
#
wh.ratio = 2
WIDTH = 8
HEIGHT = WIDTH / wh.ratio
DPI = 200
FONTSIZE = 11
MyGray = gray(0.5)
title.theme = element_text(family="FreeSans", face="bold.italic",
                           size=FONTSIZE)
x.label.theme = element_text(family="FreeSans", face="bold.italic",
                             size=FONTSIZE-1, vjust=-0.1)
y.label.theme = element_text(family="FreeSans", face="bold.italic",
                             size=FONTSIZE-1, angle=90, vjust=0.2)
x.axis.theme = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)
y.axis.theme = element_text(family="FreeSans", face="bold",
                            size=FONTSIZE-1, colour=MyGray)
#
# Function generating well-spaced & well-labeled y-axis (count) breaks
#
yscale_breaks <- function(from.to) {
    from <- 0
    to <- from.to[2]
    # round to 10 ceiling
    to <- ceiling(to / 10.0) * 10
    # Count major breaks on 10^N boundaries, include the 0
    n.maj = 1 + ceiling(log(to) / log(10))
    # if major breaks are too few, add minor-breaks half-way between them
    n.breaks <- ifelse(n.maj < 5, max(5, n.maj*2+1), n.maj)
    breaks <- as.integer(seq(from, to, length.out=n.breaks))
    breaks
}
#
# -- main
#
# -- process the command line args: [tsv_file [png_file]]
# (use defaults if they aren't provided)
#
argv <- commandArgs(trailingOnly = TRUE)
if (is.null(argv) || (length(argv) < 1)) {
    argv <- c(Sys.glob('*api-lat.tsv')[1])
}
tsvfile <- argv[1]
stopifnot(! is.na(tsvfile))
pngfile <- ifelse(is.na(argv[2]), paste(tsvfile, '.png', sep=''), argv[2])
# -- Read the data from the TSV file into an internal data.frame d
d <- read.csv(tsvfile, sep='\t', head=F)
# -- Give each data column a human readable name
names(d) <- c('API', 'Latency')
#
# -- Convert microseconds Latency (our weblog resolution) to seconds
#
d <- transform(d, Latency=Latency/1e6)
#
# -- Trim the latency axis:
# Drop the few 0.001% extreme-slowest outliers on the right
# to prevent them from pushing the bulk of the data to the left
Max.Lat <- quantile(d$Latency, probs=0.99999)
d <- subset(d, Latency < Max.Lat)
#
# -- API factor pruning
# Drop rows where the APIs is less than small % of total calls
#
Rare.APIs.pct <- 0.001
if (Rare.APIs.pct > 0.0) {
    d.N <- nrow(d)
    API.counts <- table(d$API)
    d <- transform(d, CallPct=100.0*API.counts[d$API]/d.N)
    d <- d[d$CallPct > Rare.APIs.pct, ]
    d.N.new <- nrow(d)
}
#
# -- Adjust legend item-height & font-size
# to the number of distinct APIs we have
#
API.count <- nlevels(as.factor(d$API))
Legend.LineSize <- ifelse(API.count < 20, 1.0, 20.0/API.count)
Legend.FontSize <- max(6, as.integer(Legend.LineSize * (FONTSIZE - 1)))
legend.theme = element_text(family="FreeSans", face="bold.italic",
                            size=Legend.FontSize,
                            colour=gray(0.3))
# -- set latency (X-axis) breaks and labels (s.b made more generic)
lat.breaks <- c(0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10)
lat.labels <- sprintf("%g", lat.breaks)
#
# -- Generate the chart using ggplot
#
p <- ggplot(data=d, aes(x=Latency, y=..count../1000.0, group=API, fill=API)) +
    geom_bar(binwidth=0.01) +
    scale_x_log10(breaks=lat.breaks, labels=lat.labels) +
    scale_y_continuous(breaks=yscale_breaks) +
    ggtitle('APIs Calls & Latency Distribution') +
    xlab('Latency in seconds - log(10) scale') +
    ylab('Call count (in 1000s)') +
    theme(
        plot.title=title.theme,
        axis.title.y=y.label.theme,
        axis.title.x=x.label.theme,
        axis.text.x=x.axis.theme,
        axis.text.y=y.axis.theme,
        legend.text=legend.theme,
        legend.key.height=unit(Legend.LineSize, "line")
    )
#
# -- Save the plot into the png file
#
ggsave(p, file=pngfile, width=WIDTH, height=HEIGHT, dpi=DPI)
Your discussion of "back in the day" profiling practice is true.
There's just one little problem it always had:
In non-toy software, it may find something, but it doesn't find much, for a bunch of reasons.
The thing about opportunities for higher performance is, if you don't find them, the software doesn't break, so you can just pretend they don't exist.
That is, until a different method is tried, and they are found.
In statistics, this is called a type 2 error - a false negative.
An opportunity is there, but you didn't find it.
What it means is if somebody does know how to find it, they're going to win, big time.
Here's probably more than you ever wanted to know about that.
So if you're looking at the same kind of stuff in a web app - invocation counts, time measurements - you're not liable to do better than the same kind of non-results.
I'm not into web apps, but I did a fair amount of performance tuning in a protocol-based factory automation app many years ago.
I used a logging technique.
I won't say it was easy, but it did work.
The people I see doing something similar are here, where they use what they call a waterfall chart.
The basic idea is rather than casting a wide net and getting a lot of measurements, you trace through a single logical thread of transactions, analyzing where delays are occurring that don't have to.
So if results are what you're after, I'd look down that line of thinking.
Related
I'm trying to find a more efficient and speedier way (if possible) to pull subsets of observations that meet certain criteria from multiple hospital claims datasets in SAS. A simplified but common type of data pull would look like this:
data out.qualifying_patients;
  set in.state1_2017
      in.state1_2018
      in.state1_2019
      in.state1_2020
      in.state2_2017
      in.state2_2018
      in.state2_2019
      in.state2_2020;
  array prcode{*} I10_PR1-I10_PR25;
  do i=1 to 25;
    if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
  end;
  if cohort=1 then output;
run;
Now imagine that instead of 2 states and 4 years we have 18 states and 9 years -- each about 1GB in size. The code above works fine but it takes FOREVER to run on our non-optimized server setup. So I'm looking for alternate methods to perform the same task but hopefully at a faster clip.
I've tried including (KEEP=) or (DROP=) statements for each dataset included in the SET statement to limit the variables being scanned, but this really didn't have much of an impact on speed -- and, for non-coding-related reasons, we pretty much need to pull all the variables.
I've also experimented a bit with hash tables but it's too much to store in memory so that didn't seem to solve the issue. This also isn't a MERGE issue which seems to be what hash tables excel at.
Any thoughts on other approaches that might help? Every data pull we do contains customized criteria for a given project, but we do these pulls a lot and it seems really inefficient to constantly be processing through the same datasets over and over without benefiting from that. Thanks for any help!
I happened to have a 1GB dataset on my computer, and I tried several times; it takes SAS no more than 25 seconds to SET the dataset 8 times. I think the SET statement is too simple and basic to improve its efficiency.
I think the issue may be located in the DO loop. Your program runs the DO loop 25 times for each record and may assign to cohort more than once, which is not necessary. You can change it like:
do i=1 to 25 until(cohort=1);
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then cohort=1;
end;
This can save a lot of loop iterations.
First, parallelization will help immensely here. Instead of running 1 job, 1 dataset after the next, run one job per state, or one job per year, or whatever makes sense for your dataset size and CPU count (you don't want more than 1 job per CPU). If your server has 32 cores, then you can easily run all the jobs you need here - 1 per state, say - and then after that's done, combine the results together.
Look up SAS MP Connect for one way to do multiprocessing, which basically uses rsubmits to submit code to your own machine. You can also do this by using xcmd to literally launch SAS sessions - add a state parameter to the SAS program, then run 18 of them, have them output their results to a known location with the state name or number, and then have your program collect them.
Second, you can optimize the DO loop more - in addition to the suggestions above, you may be able to optimize using pointers. SAS stores character array variables in memory in adjacent spots (assuming they all come from the same place) - see From Obscurity to Utility:
ADDR, PEEK, POKE as DATA Step Programming Tools by Paul Dorfman for more details. On page 10, he shows the method I describe here; you use PEEKC to get the concatenated values and then use INDEXW to find the thing you want.
data want;
  set have;
  array prcode{*} $8 I10_PR1-I10_PR25;
  found = (^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ0ZZ')) or
          (^^ indexw (peekc (addr(prcode[1]), 200 ), '0DTJ4ZZ'))
  ;
run;
Something like that should work. It avoids the loop.
You also could, if you want to keep the loop, exit the loop once you run into an empty procedure code. Usually these things don't go all the way to 25, at least in my experience - they're left-filled, so I10_PR1 is always filled, then some of them - say, 5 or 10 - are filled, and then I10_PR11 and on are empty; if you hit an empty one, you're all done for that row. So leaving not just when you hit what you are looking for, but also when you hit an empty value, saves you a lot of processing time.
You probably should consider a hardware upgrade or find someone who can tune your server. This paper suggests tips to improve the processing of large datasets.
Your code is pretty straightforward. The only suggestion is to exit the loop as soon as the criterion is met, to avoid wasting resources unnecessarily.
do i=1 to 25;
  if prcode{i} in ("0DTJ0ZZ","0DTJ4ZZ") then do;
    output; * cohort criteria met so output the row;
    leave;  * exit the loop immediately;
  end;
end;
I couldn't work out if Stack Overflow is an appropriate site to ask this and if not then what the appropriate site would be! There are so many stack exchange sites now and I didn't want to go through 200 sites :S
When I try to test whether my functions run within X seconds using behave (i.e. Gherkin feature files and behave test steps), the code takes longer to run under behave than it would on its own - especially at the beginning of the test, but also in other parts.
Has anybody tested time constraints with behave before and know a workaround to adjust for the extra time that behave adds?
Is this even possible?
EDIT: To show how I'm timing the tests:
#when("the Python script provides images to the model")
def step_impl(context):
context.x_second_requirement = 5
# TODO: Investigate why this takes so long, when I'm not using behave I can use a 0.8 second timing constraint
context.start_time = time.time()
context.car_brain.tick()
context.end_time = time.time()
#then("the model must not take more than X seconds to produce output")
def step_impl(context):
assert context.end_time - context.start_time < context.x_second_requirement
Cheers,
Milan
I am trying to run a CNN on the cloud (Google Cloud ML) because my laptop does not have a GPU card.
So I uploaded my data on Google Cloud Storage. A .csv file with 1500 entries, like so:
| label | img_path |
| label_1| /img_1.jpg |
| label_2| /img_2.jpg |
and the corresponding 1500 jpgs.
My input_fn looks like so:
def input_fn(filename,
             batch_size,
             num_epochs=None,
             skip_header_lines=1,
             shuffle=False):
    filename_queue = tf.train.string_input_producer(filename, num_epochs=num_epochs)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, row = reader.read(filename_queue)
    row = parse_csv(row)
    pt = row.pop(-1)
    pth = filename.rpartition('/')[0] + pt
    img = tf.image.decode_jpeg(tf.read_file(tf.squeeze(pth)), 1)
    img = tf.to_float(img) / 255.
    img = tf.reshape(img, [IMG_SIZE, IMG_SIZE, 1])
    row = tf.concat(row, 0)
    if shuffle:
        return tf.train.shuffle_batch(
            [img, row],
            batch_size,
            capacity=2000,
            min_after_dequeue=2 * batch_size + 1,
            num_threads=multiprocessing.cpu_count(),
        )
    else:
        return tf.train.batch([img, row],
                              batch_size,
                              allow_smaller_final_batch=True,
                              num_threads=multiprocessing.cpu_count())
Here is what the full graph looks like (very simple CNN indeed):
Running the training with a batch size of 200, then most of the compute time on my laptop (on my laptop, the data is stored locally) is spent on the gradients node which is what I would expect. The batch node has a compute time of ~12ms.
When I run it on the cloud (scale-tier is BASIC), the batch node takes more than 20s. And the bottleneck seems to be coming from the QueueDequeueUpToV2 subnode according to tensorboard:
Does anyone have a clue why this happens? I am pretty sure I am getting something wrong here, so I'd be happy to learn.
A few remarks:
-Changing between batch/shuffle_batch with different min_after_dequeue values has no effect.
-When using BASIC_GPU, the batch node is also on the CPU which is normal according to what I read and it takes roughly 13s.
-Adding a time.sleep after queues are started to ensure no starvation also has no effect.
-Compute time is indeed linear in batch_size, so with a batch_size of 50, the compute time would be 4 times smaller than with a batch_size of 200.
Thanks for reading and would be happy to give more details if anyone needs.
Best,
Al
Update:
-Cloud ML instance and Buckets were not in the same region; making them in the same region improved results 4x.
-Creating a .tfrecords file made the batching take 70ms which seems to be acceptable. I used this blog post as a starting point to learn about it, I recommend it.
I hope this will help others to create a fast data input pipeline!
Try converting your images to tfrecord format and read them directly from the graph. The way you are doing it, there is no possibility of caching, and if your images are small, you are not taking advantage of the high sustained reads from cloud storage. Saving all your jpg images into a tfrecord file or a small number of files will help.
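For illustration, a rough sketch of that conversion using the TF 1.x API family the question already uses; the (label, path) iteration and the integer label encoding are assumptions to adapt to your own CSV:

import tensorflow as tf

# Pack (label, jpeg) pairs into a single TFRecord file so training reads one
# large sequential file instead of ~1500 small objects from Cloud Storage.
def write_tfrecord(rows, out_path):
    # rows: iterable of (int_label, local_jpg_path) pairs -- an assumption
    with tf.python_io.TFRecordWriter(out_path) as writer:
        for label, img_path in rows:
            with open(img_path, 'rb') as f:
                img_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img_bytes])),
            }))
            writer.write(example.SerializeToString())

On the reading side, a tf.TFRecordReader plus tf.parse_single_example and tf.image.decode_jpeg would replace the per-file reads in the question's input_fn.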
Also, make sure your bucket is a single-region bucket in a region that has GPUs and that you are submitting to Cloud ML in that region.
I've had a similar problem before. I solved it by changing tf.train.batch() to tf.train.batch_join(). In my experiment, with a batch size of 64 and 4 GPUs, it took 22 mins using tf.train.batch() whilst it only took 2 mins using tf.train.batch_join().
In the TensorFlow docs:
If you need more parallelism or shuffling of examples between files, use multiple reader instances using the tf.train.shuffle_batch_join
https://www.tensorflow.org/programmers_guide/reading_data
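As a rough sketch of how batch_join fits the question's CSV setup (one reader per parallel leg, all pulling from the same filename queue; the number of readers and the filenames list are assumptions):

import tensorflow as tf

# Several TextLineReaders feed tf.train.batch_join, so rows are dequeued in
# parallel instead of through a single reader + tf.train.batch.
filename_queue = tf.train.string_input_producer(filenames)  # filenames: assumed list of CSV shards
example_list = []
for _ in range(4):                      # 4 parallel readers, tune to taste
    reader = tf.TextLineReader(skip_header_lines=1)
    _, row = reader.read(filename_queue)
    example_list.append([row])
batch = tf.train.batch_join(example_list, batch_size=200, capacity=2000)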
Not sure if this belongs on Stack Overflow or Code Review, as I'm open to completely different approaches to the problem and, though I've started with PowerShell, am not wedded to a particular language or style.
I'm currently working with a web server on which we aren't authorised to access the back end.
It returns a list of generated certificates based on a left-justified filter, e.g. if you type 100 in the search box and click submit it will search for all certificates beginning with 100*, or the range 10000000 - 10099999
All of our certs are eight digit numbers giving a sample space of 00000000-99999999. I'm attempting to find which certificates, in this sample space, actually exist, given the certificate names must be unique.
The major caveat is that the server will only return the first 100 results; if your query matches more than 100 extant certificates in that range, the extras are discarded.
My first approach was to just use wget (technically PowerShell's Invoke-WebRequest) and iterate through the range of queries 000000 to 999999 (100 at a time), which was working & I was on track for a mid-September finish.
Unfortunately there are people that want this data sooner, so I've had to write a recursive function that (with my default input) queries a sample space of ten million certs at once and searches progressively smaller subspaces until fewer than 99 certs are returned for each subspace, then moves on to the next ten million.
The data isn't evenly distributed or very predictable, 'most' (~90%?) certs cluster around 10000000-19999999 and 30000000-39999999 but I need them all.
Here's the function I'm currently using. It seems to be working (results are being written to file, and faster than before), but it is still ongoing. Are there any:
Glaring errors with the function
Better choices of inputs (for better efficiency)
Completely different approaches that would be better
The variable '$certsession' is established outside this snippet and represents the web server session (login information, cookies etc.)
function RecurseCerts ($min,$max,$step,$level) {
    for ($certSpace = $min; $certSpace -le $max; $certSpace += $step) {
        $levelMultiplier = "0" * $level
        #Assuming a level of 3, these ToString arguments would turn a '5' into 005, a '50' into 050, and so on. Three or more digit numbers are unchanged.
        $query = ($certSpace).ToString($levelMultiplier)
        $resultsArray = New-Object System.Collections.ArrayList
        "Query is $query"
        #Get webpage, split content by newline, search for lines with a certificate common name and add them to the results array
        Invoke-WebRequest -uri "https://webserver.com/app?service=direct%2F1%2FSearchPage%2F%24Form&sp=S0&Form0=%24TextField%2C%24Submit&%24TextField=$query&%24Submit=Search" -websession $certsession | %{$_.content -split "`n" | %{if ($_ -match "cn=(.*?),ou") {$resultsArray = $resultsArray + $matches[1]}}}
        #If we got more than 98 results for our query, make the search more specific, until we don't get more than 98 (else condition).
        if ($resultsArray.count -gt 98) {"Recursing at $certSpace"; $subLevel = $level + 1; $subSpace = $certSpace * 10; RecurseCerts -min $subSpace -max ($subSpace + 9) -step 1 -level $subLevel}
        #This is the most specific 0-98 for this range, write it out to the file
        else {"Completed range $certspace"; $resultsArray | out-file c:\temp\certlist.txt -encoding utf8 -append}
    }
}
#Level 3 means include rightmost 3 digits eg. search 101 for range 10100000 - 10199999
#Level 4 would be the subspace 1010-1019 (so a search for 1015 returns 10150000 - 10159999)
RecurseCerts -min 0 -max 9 -step 1 -level 1
Since I've added 'language agnostic', feel free to ask for any needed PowerShell clarifications. I could also attempt to re-write it in pseudo-code if desired.
I think the fact that ranges are already iterated should prevent duplication when it is done with a subspace and jumps back to the higher level (re-capturing things it already captured at a lower level should be prevented), but I'd be lying if I said I fully understood the program flow here.
If it turns out there is duplication I can just filter the text file for duplicates. However, I'd still be interested in approaches that eliminate this problem if it exists.
*I've updated the code to display an indicator of progress to the console, and based on suggestions also changed the array type used to arraylist. The server is pretty fragile so I've avoided multi-threading for now, but it would normally be a useful feature of tasks like this - here's a summary of some ways to do this in PowerShell.
Here's an example of the behaviour currently. Notably the entire ten-million range 00000000 - 09999999 had less than 98 certificates and was thus processed without needing a recursion.
RecurseCerts behaviour
Moving my comments to an answer:
First suggestion: become authorised to access the back end.
The big opportunity for a performance increase is threading / splitting the work over multiple clients. Since it's just a big space of numbers you could easily:
have two PowerShell processes running, searching 00000000-49999999 in one and 50000000-99999999 in the other (or as many processes as you want).
have two computers doing it, if you have others to access
use PowerShell multiprocessing (threading, jobs, workflows) although those are more complex to use.
Since this is mostly going to be server/network-bound slowness, it's probably not worth the more difficult techniques, but a script that takes start/end numbers and is run twice would be quite easy.
The code $resultsArray = $resultsArray + $matches[1] is very slow; arrays are immutable (fixed size) so this causes PowerShell to make a new array and copy the old one into it. In a loop, adding many thousands of things, it will have a lot of overhead. Use $a = [System.Collections.ArrayList]@() and $a.Add($thing) instead.
How fast can the server respond (is it on the LAN or Internet)? If it's over a WAN connection there's a latency limit to how fast you can go, but if it's searching a big database and takes a while to return a page, that puts a bigger limit on what you can speed up from the client side.
How big is the response page? Invoke-WebRequest parses the HTML into a full DOM and it's very slow, and you're not using the DOM so you don't need that. You can use [System.Net.WebClient] to download the content as a string:
e.g.
$web = New-Object System.Net.WebClient
$web.DownloadString($url)
In terms of design, how many certificates are you expecting in the 100M search space? 10k? 50M? Your recursive function risks searching and pulling and ignoring the same certificates over and over trying to get below 100. Depending on distribution, I'd be tempted to look for and block-out the biggest chunks with 0 certificates. If you can rule out a range of 1M in one request that's enormously useful. Searching 1M, finding too many certs, searching 500K, too many, [...] searching 10K finding too many, seems wasteful and slow.
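Since the question is explicitly open to other languages, here is a rough Python sketch of that narrowing strategy; search() is a stand-in for the HTTP query and is assumed to return the list of certificate names matching prefix* (truncated to 100 by the server):

def enumerate_certs(search, prefix="", found=None):
    # search: assumed callable(prefix) -> list of matching cert names (max 100)
    found = found if found is not None else set()
    results = search(prefix)
    if len(results) < 100 or len(prefix) == 8:
        # The server returned everything under this prefix (the question uses
        # 98 as a safety margin), so record it and stop recursing.
        found.update(results)
        return found
    # Too many hits: split the range into ten narrower prefixes.
    for digit in "0123456789":
        enumerate_certs(search, prefix + digit, found)
    return found

Because results are only recorded at prefixes that return fewer than the cap, no range is written out twice, which also addresses the duplication concern in the question.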
There are two large text files (Millions of lines) that my program uses. These files are parsed and loaded into hashes so that the data can be accessed quickly. The problem I face is that, currently, the parsing and loading is the slowest part of the program. Below is the code where this is done.
database = extractDatabase(@type).chomp("fasta") + "yml"
revDatabase = extractDatabase(@type + "-r").chomp("fasta.reverse") + "yml"
@proteins = Hash.new
@decoyProteins = Hash.new
File.open(database, "r").each_line do |line|
  parts = line.split(": ")
  @proteins[parts[0]] = parts[1]
end
File.open(revDatabase, "r").each_line do |line|
  parts = line.split(": ")
  @decoyProteins[parts[0]] = parts[1]
end
And the files look like the example below. It started off as a YAML file, but the format was modified to increase parsing speed.
MTMDK: P31946 Q14624 Q14624-2 B5BU24 B7ZKJ8 B7Z545 Q4VY19 B2RMS9 B7Z544 Q4VY20
MTMDKSELVQK: P31946 B5BU24 Q4VY19 Q4VY20
....
I've messed around with different ways of setting up the file and parsing them, and so far this is the fastest way, but it's still awfully slow.
Is there a way to improve the speed of this, or is there a whole other approach I can take?
List of things that don't work:
YAML.
Standard Ruby threads.
Forking off processes and then retrieving the hash through a pipe.
In my usage, reading all or part of the file into memory before parsing usually goes faster. If the database sizes are small enough, this could be as simple as
buffer = File.readlines(database)
buffer.each do |line|
...
end
If they're too big to fit into memory, it gets more complicated: you have to set up block reads of data followed by parsing, or use separate read and parse threads.
Why not use the solution devised through decades of experience: a database, say SQLite3?
(To be different, though, I'd first recommend looking at (Ruby) BDB and other "NoSQL" backend engines, if they fit your need.)
If fixed-sized records with a deterministic index are used then you can perform a lazy-load of each item through a proxy object. This would be a suitable candidate for a mmap. However, this will not speed up the total access time, but will merely amortize the loading throughout the life-cycle of the program (at least until first use and if some data is never used then you get the benefit of never loading it). Without fixed-sized records or deterministic index values this problem is more complex and starts to look more like a traditional "index" store (eg. a B-tree in an SQL back-end or whatever BDB uses :-).
The general problems with threading here are:
The IO will likely be your bottleneck with Ruby "green" threads
You still need all the data before use
You may be interested in the Widefinder Project, just in general "trying to get faster IO processing".
I don't know too much about Ruby but I have had to deal with this problem before. I found the best way was to split the file up into chunks or separate files, then spawn threads to read each chunk in parallel. Once the partitioned files are in memory, combining the results should be fast. Here is some information on threads in Ruby:
http://rubylearning.com/satishtalim/ruby_threads.html
Hope that helps.