The Scenario: A Script That Never Finishes
Every seasoned developer has faced this problem: a script that works perfectly on your development machine with a few megabytes of data, but when pointed at a production-scale dataset—hundreds of megabytes, millions of lines—it runs for hours and never seems to finish. This is the story of one such script and the systematic process used to transform it from a performance disaster into a highly efficient data processing tool.
The initial challenge was simple: process a 100MB+ text file. The problem? The existing Ruby script was too slow to be usable. The goal: make it process the entire file in under 30 seconds.
This isn’t a story about magic tricks. It’s a guide to a disciplined, repeatable engineering process that beats guesswork every time.
Step 1: The Foundation - A Fast Feedback Loop
Before writing a single line of optimized code, the first and most critical step is to create an efficient feedback loop. Trying to profile or test against a file that takes hours to run is a recipe for frustration and failure.
1. Shrink the Dataset: We created a smaller, representative sample of the data. A simple shell command is perfect for this:

    # Take the first 4000 lines from the large file to create a test file
    head -n 4000 data_large.txt > data_4000.txt

2. Establish a Baseline Metric: With the smaller file, we could now run the script in a reasonable time (e.g., 5-10 seconds). This became our baseline. The goal is to drive this number down with each iteration.
3. Guarantee Correctness: The script came with a test suite. By running these tests after every change, we ensured our optimizations didn’t alter the program’s logic. Never optimize without a safety net of tests.
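The baseline metric from item 2 can be captured with Ruby's standard Benchmark module. What follows is a minimal sketch, not the original script: process_lines and baseline_seconds are hypothetical names, and the sample lines are made up for illustration.

```ruby
require 'benchmark'

# Hypothetical stand-in for the real script's entry point (the article does
# not show it). It only counts session lines so the timer has work to measure.
def process_lines(lines)
  lines.count { |line| line.start_with?('session') }
end

# Run the block several times and return the fastest wall-clock time in
# seconds. Taking the minimum reduces noise from other processes.
def baseline_seconds(runs: 3, &block)
  Array.new(runs) { Benchmark.realtime(&block) }.min
end

# Made-up data shaped like a 4000-line sample file.
sample = Array.new(4000) { |i| i.even? ? "session,#{i}" : "user,#{i}" }

elapsed = baseline_seconds { process_lines(sample) }
puts format('baseline: %.4f s', elapsed)
```

Recording the same measurement after every change gives a number to compare against, which is all the feedback loop needs.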
Step 2: Profiling - Let the Data Show You the Bottleneck
Do not guess where the code is slow. You will almost certainly be wrong. Profiling tools analyze a program as it runs and tell you exactly where it spends its time. For this task, we used tools like stackprof and ruby-prof.
A profiler records where the program spends its time, and that output can be rendered as a flame graph: a visual representation of your application’s sampled call stacks. The widest frames in the graph are the “hot spots”, the methods that account for the most time. This is where you must focus your efforts.
Our initial flame graph pointed unequivocally to our first major bottleneck.
Step 3: Iterative Optimization - Attack the Hot Spots
With a feedback loop and a profiler, we began a cycle: Profile -> Hypothesize -> Change -> Measure.
Optimization #1: Killing the O(n²) Search
Problem: The profiler showed that for each user, we were scanning the entire array of sessions to find the ones belonging to that user.
Analysis: This is a classic O(n²) complexity issue. With U users and S sessions, the loop performs U × S comparisons, so the execution time explodes as both numbers grow.
Before: The code iterated through each user and then performed a slow select on the massive sessions array:

    users.each do |user|
      user_sessions = sessions.select { |session| session['user_id'] == user['id'] }
      # ... process user_sessions ...
    end

Solution: We changed the data structure. Instead of a flat array of sessions, we now create a Hash where keys are user_ids. This allows for an instantaneous O(1) lookup.
After: We process the sessions file once, grouping sessions by user in a hash. The subsequent lookup is incredibly fast.

    # First, parse all sessions into a hash
    sessions = {}
    file_lines.each do |line|
      cols = line.split(',')
      if cols[0] == 'session'
        session = parse_session(cols)
        sessions[session['user_id']] ||= []
        sessions[session['user_id']] << session
      end
    end

    # Later, lookup is instant
    users.each do |user|
      user_sessions = sessions[user['id']] || []
      # ... process user_sessions ...
    end

Result: A 4x performance improvement. The select call vanished from the profiler.
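To see that the two approaches return the same sessions per user, here is a small self-contained sketch. The sample users and sessions are invented, and it uses Enumerable#group_by as a compact way to build the same kind of index as the explicit loop in the article.

```ruby
# Made-up sample data in the shape the article's snippets imply.
users = [{ 'id' => '1' }, { 'id' => '2' }, { 'id' => '3' }]
sessions = [
  { 'user_id' => '1', 'browser' => 'Chrome' },
  { 'user_id' => '2', 'browser' => 'Firefox' },
  { 'user_id' => '1', 'browser' => 'Safari' }
]

# Slow: one full scan of the sessions array for every user.
by_scan = users.map do |user|
  sessions.select { |session| session['user_id'] == user['id'] }
end

# Fast: a single pass builds the index, then each lookup is a hash access.
index = sessions.group_by { |session| session['user_id'] }
by_index = users.map { |user| index.fetch(user['id'], []) }

puts by_scan == by_index
```

The fetch call with an empty-array default plays the same role as the `|| []` in the article's version: users without sessions get an empty list instead of nil.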
Optimization #2: Eliminating Expensive Parsing
Problem: The new flame graph showed that Date.parse was a major bottleneck. It was being called for every single session.
Analysis: Parsing strings into Date objects is a very heavy operation. We only needed to sort the dates, which can be done on the date strings themselves (if they are in YYYY-MM-DD format).
Before: The code mapped over every session for a user just to parse the date.

    user.sessions.map { |s| s['date'] }.map { |d| Date.parse(d) }.sort.reverse.map(&:iso8601)

Solution: We removed the Date.parse call entirely. The dates are kept as strings and sorted directly. This is significantly faster.
After:

    user_sessions.map { |s| s['date'] }.sort.reverse

Result: A further 40% performance boost. High-level, convenient methods can hide significant performance costs at scale.
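The string sort is only safe because zero-padded YYYY-MM-DD strings order lexicographically the same way they order chronologically. The sketch below checks that on invented dates and shows a format where the shortcut would give the wrong order; the US-style dates are an assumption added for contrast, not something from the original data.

```ruby
require 'date'

# Made-up ISO 8601 dates (zero-padded YYYY-MM-DD).
dates = ['2017-09-21', '2016-12-01', '2017-01-05', '2016-02-29']

# Original approach: parse every string, sort, then format back.
parsed = dates.map { |d| Date.parse(d) }.sort.reverse.map(&:iso8601)

# Optimized approach: sort the strings directly.
as_strings = dates.sort.reverse

puts parsed == as_strings

# Counter-example: MM/DD/YYYY strings do not sort chronologically.
us_style = ['12/01/2016', '01/05/2017']
lexical = us_style.sort
chronological = us_style.sort_by { |d| Date.strptime(d, '%m/%d/%Y') }

puts lexical == chronological
```

If the input format is not guaranteed, a one-time validation of the date strings is still far cheaper than parsing each one into a Date object just to sort.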
Optimization #3: Efficient Uniqueness Calculation
Problem: The original code for finding unique browsers was also a hidden O(n²) operation.
Analysis: For every session, it iterated through the entire array of already-found unique browsers to see if the new one was already present.
Before: This all? check inside a loop is very inefficient.

    uniqueBrowsers = []
    sessions.each do |session|
      browser = session['browser']
      uniqueBrowsers += [browser] if uniqueBrowsers.all? { |b| b != browser }
    end

Solution: The idiomatic Ruby way is to collect all items into an array first, and then call .uniq on it once.
After:

    browsers = []
    file_lines.each do |line|
      # ...
      browsers << session['browser'] if is_session
    end
    unique_count = browsers.uniq.count

Result: A respectable 10% improvement and much cleaner code.
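The sketch below compares the quadratic check with the uniq call on invented browser names. It also shows a Set as a possible alternative; that variant is not what the article used, but it avoids storing duplicates in the first place, which may matter when the input is very large.

```ruby
require 'set'

# Made-up browser names with duplicates.
browsers = ['Chrome 6', 'Firefox 12', 'Chrome 6', 'Safari 29', 'Firefox 12']

# Quadratic: rescans the accumulated list for every element.
quadratic = []
browsers.each do |browser|
  quadratic += [browser] if quadratic.all? { |seen| seen != browser }
end

# The article's fix: collect everything, deduplicate once.
via_uniq = browsers.uniq

# Alternative (not from the article): a Set rejects duplicates on insert.
via_set = Set.new
browsers.each { |browser| via_set << browser }

puts quadratic == via_uniq
puts via_set.size
```

Both uniq and Set rely on hashing, so each membership check takes roughly constant time instead of a scan over everything seen so far.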
The Final Result: Success and Protection
After several such iterations, we ran the script against the full data_large.txt file.
It finished in under 30 seconds.
We had achieved our goal. The total measured performance increase on our test files was over 33x.
To protect this hard-won progress, we added a new test to our suite: a performance test that runs the script on a sample dataset and fails the build if the execution time exceeds a certain threshold. This ensures that a future code change doesn’t accidentally re-introduce a performance regression.
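A guard of this kind can be very small. The following is a sketch under stated assumptions rather than the test from the original suite: process_sample and assert_within_budget are hypothetical names, the workload is a placeholder, and the 1-second budget is an arbitrary example value that would need tuning for a real CI machine.

```ruby
require 'benchmark'

# Hypothetical placeholder for the real script: counts lines per record type.
def process_sample(lines)
  lines.group_by { |line| line.split(',').first }.transform_values(&:size)
end

# Runs the block, raises if it takes longer than the budget, and returns
# the elapsed wall-clock time in seconds otherwise.
def assert_within_budget(budget_seconds, &block)
  elapsed = Benchmark.realtime(&block)
  if elapsed > budget_seconds
    raise format('performance regression: %.3f s exceeds budget of %.3f s',
                 elapsed, budget_seconds)
  end
  elapsed
end

# Made-up sample input shaped like session lines.
sample = Array.new(4000) { |i| "session,#{i % 50},Chrome" }

elapsed = assert_within_budget(1.0) { process_sample(sample) }
puts format('ok: %.4f s', elapsed)
```

Wall-clock thresholds vary between machines, so it is usually safer to leave generous headroom above the measured time than to set the budget right at it.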
Conclusion: The Optimization Framework
This case study illustrates that performance optimization is not an art; it’s a science. The framework is simple and powerful:
- Isolate and Measure: Create a fast feedback loop with a small dataset and a clear metric.
- Profile, Don’t Guess: Use a profiler to find the real, data-backed bottlenecks.
- Iterate: Change one thing at a time, measure the impact, and verify correctness.
- Protect: Guard your gains with automated performance tests.
By following this process, you can systematically and confidently solve even the most daunting performance problems.