Taming Ruby's Memory Bloat: A Practical Guide to Processing Large Files

The Challenge: The Script That Ate All the Memory

It’s a common scenario: a data processing script works perfectly on a small test file, but when deployed to production with a multi-gigabyte dataset, it consumes all available RAM, slows to a crawl, and eventually crashes. We faced this exact problem with a Ruby script designed to parse a large CSV file. The script’s memory usage would balloon to hundreds of megabytes, far exceeding our target limit of 70MB.

This post is a case study in how we systematically diagnosed and fixed the memory bloat, transforming the script into a lean, efficient, and scalable tool. The key takeaway? Stop thinking of files as single objects and start treating them as streams.

Step 1: Profiling - Finding the Memory Hog

Before you can fix a memory issue, you have to find its source. Guesswork is futile. We used the memory_profiler gem to analyze our script. This tool is invaluable for pinpointing which objects are being allocated and, more importantly, which are being retained in memory.

The initial profiler report was clear: millions of String and Array objects were being allocated and retained, all originating from a single line of code.
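
As a rough illustration of the kind of measurement involved, the sketch below counts object allocations with Ruby's built-in GC.stat. It is a stdlib-only stand-in, not the memory_profiler report itself, which additionally shows the allocating file and line and which objects are retained. The helper name count_allocations is hypothetical.

```ruby
# Hypothetical helper: returns how many objects the block allocated,
# based on the interpreter's own running allocation counter.
def count_allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

# Build 10,000 strings and keep them all in one array, similar in spirit
# to loading every row of a file at once.
allocated = count_allocations do
  Array.new(10_000) { |i| "row #{i}" }
end
puts "objects allocated: #{allocated}"
```

A counter like this only says how much was allocated, not where or why, which is why a dedicated profiler is the better tool for the actual diagnosis.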

Step 2: The Optimizations - A Three-Pronged Attack

Our analysis revealed three primary areas for improvement. We tackled them one by one.

Optimization #1: Stop Reading the Whole File (Embrace Streaming)

The Problem: The profiler pointed directly at CSV.read. This method is convenient, but it reads the entire file into a single, massive array of arrays in memory. For a large file, this is a guaranteed memory killer.

Before: The entire file was loaded into memory at once.

# task-2.rb (Original)
require 'csv'

def work(file_path)
  lines = CSV.read(file_path, col_sep: ',') # options are keyword arguments
  users = []
  lines.each do |line|
    # ... process and accumulate users
  end
end

The Solution: Instead of reading the whole file, we switched to streaming it line by line. Called without a block, File.foreach returns an enumerator that reads one line at a time, so memory usage during the read phase stays roughly constant regardless of the file size.

After: We process the file as a stream, never holding more than one batch of lines in memory at a time during the read phase.

# task-2.rb (Optimized)
def work(file_path)
  File.foreach(file_path).lazy.each_slice(BATCH_SIZE) do |batch|
    # ... process a small batch of lines
  end
end
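
To make the streaming behaviour concrete, here is a minimal, self-contained sketch. It writes a small temporary file and records the size of each batch it receives. The helper name batch_sizes and the sample row layout are assumptions made for illustration only.

```ruby
require 'tempfile'

# Hypothetical helper: streams a file and returns the size of each batch,
# to show that lines arrive in fixed-size slices.
def batch_sizes(file_path, batch_size)
  sizes = []
  # Without a block, File.foreach returns an Enumerator; each_slice pulls
  # lines from it on demand, so only one batch of lines exists at a time.
  File.foreach(file_path).each_slice(batch_size) do |batch|
    sizes << batch.size
  end
  sizes
end

Tempfile.create('rows') do |file|
  2_500.times { |i| file.puts("user,#{i},name#{i}") }
  file.flush
  p batch_sizes(file.path, 1_000) # prints [1000, 1000, 500]
end
```

The last batch is simply shorter than the others, so the processing code should not assume every batch has exactly batch_size lines.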

Optimization #2: Stop Hoarding Objects (Process in Batches)

The Problem: Streaming the file was a huge step, but the profiler still showed memory growing over time. Why? Because we were still accumulating all the parsed user data into a single users array for later processing.

The Solution: We combined our streaming approach with batch processing. By using each_slice(BATCH_SIZE), we process the data in manageable chunks. After each chunk is processed, we clear the users array so that nothing references that batch’s objects any longer, allowing Ruby’s Garbage Collector (GC) to reclaim the memory.

After: The full implementation processes data in chunks of 1000, ensuring the users array never becomes excessively large.

# task-2.rb (Optimized)
BATCH_SIZE = 1000

def work(file_path)
  users = []
  File.foreach(file_path).lazy.each_slice(BATCH_SIZE) do |batch|
    batch.each do |line|
      # ... parse line and add to users array
    end

    # Process the batch of users here
    # ...

    # Clear the array to free memory for the next batch
    users.clear
  end
end
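
The elided steps above depend on the real data, so the following is only a sketch under stated assumptions: rows look like "user,<id>,<name>" or "session,<user_id>,<minutes>", and "processing" a batch means folding it into a small running summary. The method name summarize, the row format, and the totals it keeps are all hypothetical.

```ruby
require 'tempfile'

# Hypothetical row format for this sketch: "user,<id>,<name>" or
# "session,<user_id>,<minutes>". The real script's columns may differ.
def summarize(file_path, batch_size: 1_000)
  totals = Hash.new(0) # small running summary that outlives every batch
  users = []           # reused buffer, emptied after each batch

  File.foreach(file_path).each_slice(batch_size) do |batch|
    batch.each do |line|
      cols = line.chomp.split(',')
      case cols[0]
      when 'user'    then users << cols
      when 'session' then totals[:session_minutes] += cols[2].to_i
      end
    end

    totals[:users] += users.size # "process" the batch: keep only a count
    users.clear                  # drop the parsed rows before the next batch
  end

  totals
end

Tempfile.create('rows') do |file|
  file.puts('user,1,Ann', 'session,1,30', 'user,2,Bob', 'session,2,45')
  file.flush
  result = summarize(file.path, batch_size: 2)
  puts "users: #{result[:users]}, minutes: #{result[:session_minutes]}"
end
```

The point of the design is that only the summary survives across batches. If the final output needs every parsed row, it has to be written out incrementally (to a file or database) rather than collected in memory.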

Optimization #3: Reduce String Allocations

The Problem: The memory profiler revealed a high number of allocated strings. Each one is small, but millions of tiny string allocations add up and put pressure on the GC.

The Solution: We made two changes. First, we called .freeze on string literals used inside loops so that each one is allocated only once. Second, we used symbols (:user, :session) as hash keys instead of strings: a symbol is immutable and stored only once, so reusing it allocates nothing new.

After: A small but impactful change to reduce allocations.

# task-2.rb (Optimized)
# Using a frozen string as a constant
USER_KEY = 'user'.freeze

# ... inside a loop
if cols[0] == USER_KEY
  # ...
end
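
A small sketch of why these two changes help, using object identity to show that no new objects are created. The parse_row helper and its column layout are hypothetical and only serve the illustration.

```ruby
USER_KEY = 'user'.freeze

# A frozen literal is deduplicated: every call returns the same object
# instead of allocating a fresh String.
def frozen_literal
  'user'.freeze
end

# Hypothetical parsing step: symbol keys are interned once, so building
# many hashes does not allocate new key objects.
def parse_row(line)
  cols = line.chomp.split(',')
  { type: cols[0], id: cols[1].to_i }
end

puts frozen_literal.equal?(frozen_literal)          # prints true
puts :user.equal?(:user)                            # prints true
puts(parse_row("user,7,Ann\n")[:type] == USER_KEY)  # prints true
```

A related option is the "# frozen_string_literal: true" magic comment at the top of a file, which freezes every string literal in that file without individual .freeze calls.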

The Result: From 200MB+ to a Stable 65MB

The combination of these three optimizations had a dramatic effect. The script’s memory usage, which previously shot up past 200MB and continued to climb, now stabilizes at a lean ~65MB, well within our 70MB target. It can now process files of any size with a consistent, predictable, and low memory footprint.
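
A minimal sketch of checking a footprint like this from inside a script, assuming a Unix-like system where the ps command is available. The helper name rss_mb is hypothetical, and this is one common approach rather than a description of how the figures above were collected.

```ruby
# Hypothetical helper: returns the current process's resident set size in
# megabytes, or nil if it cannot be determined (for example, no ps command).
def rss_mb
  kilobytes = `ps -o rss= -p #{Process.pid}`.to_i
  kilobytes.positive? ? (kilobytes / 1024.0).round(1) : nil
rescue SystemCallError
  nil
end

current = rss_mb
puts(current ? "resident memory: #{current} MB" : 'resident memory: unavailable')
```

Calling a helper like this once per batch and logging the value is a simple way to confirm that memory stays flat instead of climbing as the file is read.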

Conclusion: Key Principles for Memory-Efficient Ruby

This case study highlights three critical principles for writing memory-efficient Ruby code, especially for data processing:

Stream, Don’t Read: Never load a large file into memory entirely if you can process it line-by-line.
Process in Batches, Don’t Accumulate: Don’t hold onto parsed objects for longer than you need to. Process them in manageable chunks and let the GC do its job.
Be Mindful of Allocations: In hot loops, small changes to reduce object creation (especially for strings) can have a large cumulative impact.

By applying these principles, you can build robust and scalable scripts that handle massive datasets with ease.
