The Challenge: The Script That Ate All the Memory
It’s a common scenario: a data processing script works perfectly on a small test file, but when deployed to production with a multi-gigabyte dataset, it consumes all available RAM, slows to a crawl, and eventually crashes. We faced this exact problem with a Ruby script designed to parse a large CSV file. The script’s memory usage would balloon to hundreds of megabytes, far exceeding our target limit of 70MB.
This post is a case study in how we systematically diagnosed and fixed the memory bloat, transforming the script into a lean, efficient, and scalable tool. The key takeaway? Stop thinking of files as single objects and start treating them as streams.
Step 1: Profiling - Finding the Memory Hog
Before you can fix a memory issue, you have to find its source. Guesswork is futile. We used the memory_profiler gem to analyze our script. This tool is invaluable for pinpointing which objects are being allocated and, more importantly, which are being retained in memory.
The initial profiler report was clear: millions of String and Array objects were being allocated and retained, all originating from a single line of code.
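For readers who have not used the gem: its documented entry point is MemoryProfiler.report, which wraps a block and returns a report you can print with pretty_print. The sketch below shows that call in a comment and, so that it runs without installing anything, pairs it with a much cruder stdlib stand-in built on GC.stat. The helper name allocations_during, the file name data.csv, and the sample rows are assumptions made for illustration; they are not taken from the original script.

```ruby
# With the memory_profiler gem installed, a typical invocation looks like:
#
#   require 'memory_profiler'
#   report = MemoryProfiler.report { work('data.csv') }  # 'data.csv' is a placeholder
#   report.pretty_print
#
# Stdlib-only stand-in (hypothetical helper): counts how many objects were
# allocated while the block ran. It reports allocations only, not retention.
def allocations_during
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Illustrative data: 1,000 two-column rows built in memory.
sample = Array.new(1_000) { |i| "user,#{i}" }.join("\n")

# Parsing every row allocates a line, an array, and two field strings per row.
parse_cost = allocations_during do
  sample.each_line.map { |line| line.chomp.split(',') }
end
noop_cost = allocations_during { nil }

puts "parsing allocated roughly #{parse_cost} objects"
puts "an empty block allocated roughly #{noop_cost} objects"
```

A counter like this can tell you that a block is allocation-heavy, but not which line is responsible or which objects are retained. That attribution is what the memory_profiler report provides.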
Step 2: The Optimizations - A Three-Pronged Attack
Our analysis revealed three primary areas for improvement. We tackled them one by one.
Optimization #1: Stop Reading the Whole File (Embrace Streaming)
The Problem: The profiler pointed directly at CSV.read. This method is convenient, but it parses the entire file into a single, massive array of arrays in memory. For a large file, this is a guaranteed memory killer.

Before: The entire file was loaded into memory at once.

    # task-2.rb (Original)
    require 'csv'

    def work(file_path)
      # CSV.read parses every row up front and returns them all as one array
      lines = CSV.read(file_path, col_sep: ',')
      users = []
      lines.each do |line|
        # ... process and accumulate users
      end
    end

The Solution: Instead of reading the whole file, we switched to processing it as a stream. Called without a block, File.foreach returns an enumerator that yields one line at a time, so memory usage during the read stays flat regardless of the file size.

After: We process the file as a stream, never holding more than one batch of lines in memory during the read phase.

    # task-2.rb (Optimized)
    BATCH_SIZE = 1000 # lines handed to the block at a time

    def work(file_path)
      File.foreach(file_path).lazy.each_slice(BATCH_SIZE) do |batch|
        # ... process a small batch of lines
      end
    end
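To make the difference concrete, here is a minimal runnable sketch that counts rows of a given type both ways. The function names, the sample rows, and the naive comma-prefix check are assumptions for illustration. If your data contains quoted fields, the stdlib's CSV.foreach streams rows with full CSV parsing.

```ruby
require 'tempfile'

# Slurping: File.readlines builds an array holding every line before the
# first one is examined, so peak memory grows with the file size.
def count_rows_slurped(path, type)
  prefix = "#{type},"
  File.readlines(path).count { |line| line.start_with?(prefix) }
end

# Streaming: File.foreach with a block yields one line at a time, and each
# line becomes collectable as soon as the block returns.
def count_rows_streamed(path, type)
  prefix = "#{type},"
  count = 0
  File.foreach(path) do |line|
    count += 1 if line.start_with?(prefix)
  end
  count
end

# Illustrative input (made up for this sketch).
Tempfile.create(['sample', '.csv']) do |file|
  file.write("user,1,Ann\nsession,1,10\nsession,1,25\nuser,2,Bob\n")
  file.flush
  puts count_rows_slurped(file.path, 'user')
  puts count_rows_streamed(file.path, 'user')
end
```

Both functions return the same answer. The only difference is how many lines are alive at once, which is the quantity the profiler was flagging.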
Optimization #2: Stop Hoarding Objects (Process in Batches)
The Problem: Streaming the file was a huge step, but the profiler still showed memory growing over time. Why? Because we were still accumulating all the parsed user data into a single users array for later processing.

The Solution: We combined our streaming approach with batch processing. By using each_slice(BATCH_SIZE), we process the data in manageable chunks. After each chunk is processed, we clear the users array, so the objects parsed for that batch are no longer referenced and Ruby’s Garbage Collector (GC) can reclaim the memory.

After: The full implementation processes data in chunks of 1000, ensuring the users array never holds more than one batch of parsed rows.

    # task-2.rb (Optimized)
    BATCH_SIZE = 1000

    def work(file_path)
      users = []
      File.foreach(file_path).lazy.each_slice(BATCH_SIZE) do |batch|
        batch.each do |line|
          # ... parse line and add to users array
        end

        # Process the batch of users here
        # ...

        # Clear the array to free memory for the next batch
        users.clear
      end
    end
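A minimal runnable sketch of the same idea follows, assuming a hypothetical row layout of type, user id, and session time, and a made-up function name. Instead of clearing a shared array, it keeps each batch's rows in a block-local variable and carries only a small running total between batches, which is one way to make "process, then let go" explicit.

```ruby
require 'tempfile'

# Streams the file in batches and keeps only running totals between batches.
# The column layout (type, user_id, time) is an assumption for this sketch.
def total_session_time(path, batch_size: 1000)
  totals = Hash.new(0) # user id => summed session time

  File.foreach(path).each_slice(batch_size) do |batch|
    # batch_rows is local to the block, so the parsed rows for this batch
    # become unreachable once the block returns.
    batch_rows = batch.map { |line| line.chomp.split(',') }
    batch_rows.each do |type, user_id, time|
      totals[user_id] += time.to_i if type == 'session'
    end
  end

  totals
end

# Illustrative input (made up); a tiny batch size forces several batches.
Tempfile.create(['sessions', '.csv']) do |file|
  file.write("user,1,Ann\nsession,1,10\nsession,1,25\nuser,2,Bob\nsession,2,5\n")
  file.flush
  p total_session_time(file.path, batch_size: 2)
end
```

The result does not depend on the batch size. The batch size only controls how many parsed rows exist at the same time, so it is the knob to turn when trading memory for per-batch overhead.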
Optimization #3: Reduce String Allocations
The Problem: The memory profiler revealed a high number of allocated strings. Each one is small, but millions of tiny string allocations add up and put pressure on the GC.
The Solution: We made two changes. First, we moved string literals that were used inside loops into frozen constants (using .freeze), so each one is allocated only once. Second, we used symbols (:user, :session) as hash keys instead of strings, as symbols are unique and immutable, saving memory.

After: A small but impactful change to reduce allocations.

    # task-2.rb (Optimized)
    # Using a frozen string as a constant
    USER_KEY = 'user'.freeze

    # ... inside a loop
    if cols[0] == USER_KEY
      # ...
    end
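The sketch below puts both changes together in a small row parser and then compares object identity for a literal, a frozen constant, and a symbol. The constant names, the parse_row helper, and the column layout (type, id, value) are hypothetical, chosen only to illustrate the technique.

```ruby
USER_TYPE    = 'user'.freeze    # hypothetical constant names
SESSION_TYPE = 'session'.freeze

# Parses one line into a hash with symbol keys. Comparing against the frozen
# constants avoids creating a fresh 'user' or 'session' string per row.
def parse_row(line)
  cols = line.chomp.split(',')
  case cols[0]
  when USER_TYPE    then { type: :user,    id: cols[1], value: cols[2] }
  when SESSION_TYPE then { type: :session, id: cols[1], value: cols[2] }
  end
end

def literal_type
  'user' # evaluated again on every call
end

def constant_type
  USER_TYPE # always the one object created when the constant was defined
end

p parse_row("user,1,Ann\n")
puts literal_type.equal?(literal_type)   # identity of two literal evaluations
puts constant_type.equal?(constant_type) # identity of two constant reads
puts :user.equal?(:user)                 # identity of two symbol references
```

On a default Ruby setup the literal comparison should print false, because each evaluation of an unfrozen literal creates a new String, while the constant and symbol comparisons print true. Putting the magic comment # frozen_string_literal: true at the top of a file has a similar effect for every literal in that file, without needing per-string constants.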
The Result: From 200MB+ to a Stable 65MB
The combination of these three optimizations had a dramatic effect. The script’s memory usage, which previously shot up past 200MB and continued to climb, now stabilizes at a lean ~65MB, well within our 70MB target. It can now process files of any size with a consistent, predictable, and low memory footprint.
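The post does not say how these figures were measured. One common way to check a script against a memory budget is to read the process's resident set size (RSS) from inside the script, as in the hedged sketch below. The helper names are hypothetical, the /proc path is Linux-specific, and the ps fallback assumes a ps that accepts -o rss= and -p (as on Linux and macOS).

```ruby
# Returns this process's resident set size in megabytes, or nil when it
# cannot be determined on the current platform.
def rss_mb
  if File.readable?('/proc/self/status')
    # Linux: the line looks like "VmRSS:     12345 kB"
    line = File.foreach('/proc/self/status').find { |l| l.start_with?('VmRSS:') }
    return line.split[1].to_i / 1024.0 if line
  end

  # Fallback: ask ps for the RSS in kilobytes.
  kb = `ps -o rss= -p #{Process.pid}`.to_i
  kb.positive? ? kb / 1024.0 : nil
rescue SystemCallError
  nil # ps is not available
end

# True when usage is known and does not exceed the limit.
def within_budget?(usage_mb, limit_mb)
  !usage_mb.nil? && usage_mb <= limit_mb
end

limit_mb = 70 # the target from this post
usage = rss_mb

if usage.nil?
  puts 'RSS is not available on this platform'
else
  puts format('RSS: %.1f MB (limit %d MB, within budget: %s)',
              usage, limit_mb, within_budget?(usage, limit_mb))
end
```

Calling a check like this after every batch, rather than once at the end, is what shows whether usage is stable or still climbing.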
Conclusion: Key Principles for Memory-Efficient Ruby
This case study highlights three critical principles for writing memory-efficient Ruby code, especially for data processing:
- Stream, Don’t Read: Never load a large file into memory entirely if you can process it line-by-line.
- Process in Batches, Don’t Accumulate: Don’t hold onto parsed objects for longer than you need to. Process them in manageable chunks and let the GC do its job.
- Be Mindful of Allocations: In hot loops, small changes to reduce object creation (especially for strings) can have a large cumulative impact.
By applying these principles, you can build robust and scalable scripts that handle massive datasets with ease.