Thursday, May 19, 2011

Proving Rails memory leak with gdb

This is a follow up to Rails3 leaks (more) memory?.

I am no gdb expert by any means, but I thought I'd give it a try to see if I can identify the cause behind the controller instances. The basic approach is to catch the GC in the act of marking a controller instance and follow the chain to try to identify why it is being marked (as marked stuff is not swept).

Couple setup steps. First, I re-compiled ruby with debugging symbols which turned out to be trivial thanks to rvm. Second, I removed all memprof stuff from my simple test app.

With that out of the way, I can start the app with rails server -e production and get it in the right state with,
curl http://localhost:3000/authors/new curl http://localhost:3000/authors/new curl http://localhost:3000/authors/new curl http://localhost:3000/final_gc Finally I can attach to this running process with gdb.

I know there is an instance of the AuthorsController somewhere on the heap. So, I can look up the class that is the AuthorsController with,
(gdb) call rb_const_get(rb_cObject, rb_intern("AuthorsController")) $14 = 4315109640 and set a conditional breakpoint in rc_gc_mark function in gc.c to look for objects of the class i'm interested in,
(gdb) break gc.c:1361 (gdb) cond 1 (rb_type(ptr) == 2 && rb_class_real(rb_class_of(obj)) == 4315109640) (gdb) c Now I hit the server with curl http://localhost:3000/final_gc and wait ... the condition on the breakpoint makes the gc process much slower, but eventually it stopped,
Breakpoint 1, rb_gc_mark (ptr=4377820520) at gc.c:1361 warning: Source file is more recent than executable. /Users/pkmiec/.rvm/src/ree-1.8.7-2010.02/source/gc.c:1361:30666:beg:0x100043cb4 (gdb) bt #0 rb_gc_mark (ptr=4377820520) at gc.c:1361 #1 0x000000010003064b in blk_mark (data=0x102a50b00) at eval.c:8596 #2 0x00000001000443e4 in gc_mark_children (ptr=4377672960) at gc.c:1570 #3 0x0000000100043ce7 in rb_gc_mark (ptr=4377673120) at gc.c:1366 #4 0x0000000100043b6f in mark_entry (key=66906, value=4377673120) at gc.c:1293 #5 0x00000001000a959c in st_foreach (table=0x10250df70, func=0x100043b56 , arg=0) at st.c:487 #6 0x0000000100043b9e in rb_mark_tbl (tbl=0x10250df70) at gc.c:1302 #7 0x00000001000442ce in gc_mark_children (ptr=4315109640) at gc.c:1538 #8 0x0000000100043ce7 in rb_gc_mark (ptr=4315109640) at gc.c:1366 #9 0x000000010004434d in gc_mark_children (ptr=4315134280) at gc.c:1551 ... First, I verified the breakpoint condition,
(gdb) call ((struct RString *)classname(rb_class_real(4315109640)))->ptr $5 = 0x10250df50 "AuthorsController" What caught my eye next is line #8 ... rb_gc_mark of 4315109640 ... but wait 4315109640 is the class. So luckily the "distance" between something known and the instance is very short! Then I tried to decipher what line #4 means,
(gdb) call rb_id2name(66906) $8 = 0x102731700 "@_config" (gdb) call rb_type(4377673120) $2 = 11 # <- Hash (gdb) p ((RVALUE *)4377673120)->as.hash $15 = { basic = { flags = 16971, klass = 4315911160 }, tbl = 0x10047aa40, iter_lev = 0, ifnone = 4377672960 } So, the class has an instance variable called @_config and it's value is a hash. You can see this is in fact true by looking in active_support (3.0.7) at configurable.rb line 12. Line #3 says this hash is being marked and line #2 says 4377672960 is marked. Notice 4377672960 is the ifnone of the hash ... so in ruby this is the block passed to Hash.new. This is exactly how ActiveSupport::InheritableOptions.new works (which is what is assigned to @_config).
(gdb) call rb_type(4377672960) $7 = 34 (gdb) p ((RVALUE *)4377672960)->as.data $11 = { basic = { flags = 98, klass = 4331614640 }, dmark = 0x1000305f1 , dfree = 0x1000306ea , data = 0x102a94c90 } I didn't know what data is :(, so I looked at blk_mark function at eval.c:8596,
static void blk_mark(data) struct BLOCK *data; { while (data) { rb_gc_mark_frame(&data->frame); rb_gc_mark((VALUE)data->scope); rb_gc_mark((VALUE)data->var); rb_gc_mark((VALUE)data->body); rb_gc_mark((VALUE)data->self); <---------- line 8596 !!! rb_gc_mark((VALUE)data->dyna_vars); rb_gc_mark((VALUE)data->cref); rb_gc_mark(data->wrapper); rb_gc_mark(data->block_obj); data = data->prev; } } Bingo! A request comes to /authors/new and is routed to an instance of AuthorsController. That instance renders a view and that view asks the controller instance for it's config. The controller instance asks the controller class for its config. The AuthorsController hasn't initialized its config yet so it creates an instance of ActiveSupport::InheritableOptions (which extends OrderedOptions which extends OrderedHash). ActiveSupport::InheritableOptions.new is implemented as super() { |h,k| parent[k] }. That block creates a closure which captures self which is the controller instance. Here is where I smell a ruby bug as I don't understand why self at this point is the controller instance.

This makes sense. The controller instances left in memory are the ones that perform lazy initialization and that lazy initialization creates scopes which are assigned to global or class level variables. This is why eager loading at start-up works.

Tuesday, May 17, 2011

Rails3 leaks (more) memory?

I'm involved in upgrading a rather large Rails app from Rails 2.3.4 to Rails 3.0.7. I say rather large because this app has about 150 active record models and another 200 classes that use the active model functionality. It also uses 90ish gems and plugins. The app is deployed with mongrels and REE 1.8.7 (2010.02 .. 32bit) with GC tuned.

With Rails2, a mongrel initially uses about 210MB of memory and eventually grows to high 200's as requests force additional parts of the app to be loaded. Monit is set to restart anything that goes beyond 300MB, which happens periodically but at a manageable rate (i.e. once per hour).

With Rails3, a mongrel initially uses about 190MB of memory and is restarted by Monit after few requests. Turning off Monit, the mongrel process quickly grows to 400s with initial few requests to few different controllers and then to high 500s without even hitting the more obscure parts of the app.

Something is wrong!

Additional experiments revealed that it's not a leak in the traditional sense. Meaning hitting the same url over and over does not grow the memory. Only the first and sometimes also the second requests cause memory to jump. This points to the memory growth being related to Rails' lazy loading. Rails3 (much more so than Rails2) defers lots of initialization until "later" to help achieve faster start up time. That "later" is usually when a request is being handled at which point Rails initializes whatever is needed to complete the request. For example, even though the User model is required during start time, the columns are not obtained from the db and the attribute methods are not created until they are needed.

I configured the app such that I can run in in production mode on my MacBook. I added memprof to the Gemfile. I created the following dead simple selenium script,
login # /sessions/new, fill out form, redirected to /dashboards open_and_wait("/owners/new") open_and_wait("/owners/new") open_and_wait("/owners/new") open_and_wait("/owners/new") open_and_wait("/final_gc") This scripts hits the SessionsController, DashboardsController, and OwnersController. The final_gc is a middleware right after Rails::Lock which just GCs and returns 200. Afterwards, the heap is dumped via kill -URG <pid> as provided by memprof gem.

I was surprised to find instances of the controllers on the heap (and therefore everything that controllers reference: env, templates, etc). Let's focus on the OwnersController. That controller was hit 4 times and two instances were hanging around. In fact, it was an instance that handled the first request and another instance that handled the second request. The third and fourth requests did not leave controller instances. This behavior is very consistent.

At first, I was convinced the complexity of the app was causing the problem (i.e. some monkey patch or custom functionality). But the more I poked around the less it seemed related to the app itself.

So, I created a brand new Rails 3.0.7 application with two models: Post and Author. I've modified the selenium script to do the following,
open_and_wait("/authors/new") open_and_wait("/authors/new") open_and_wait("/authors/new") open_and_wait("/authors/new") open_and_wait("/authors/new") open_and_wait("/final_gc") where /authors/new is just,
def new @author = Author.new end Amazingly, this simple app has the same problem in that an instance of AuthorsController, which handled the first request, is hanging around (metal.rb line 178 is where controller instances are created),
{"_id":"0x80d2ab0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_controller/metal.rb", "line":178,"time":1305667260896820,"type":"object", "class":"0x1d3f6b0", "class_name":"AuthorsController", "ivars":{"@_status":200,"@_response_body":"0x809d928", "@_env":"0x80f69d8", "@_view_runtime":"0x8091830", "@_request":"0x80d29e8", "@lookup_context":"0x80bfaa0", "@_headers":"0x80d2998", "@author":"0x80ae3e0", "@action_has_layout":true,"@_action_name":"0x80d2e70", "@_config":"0x80aea98", "@view_context_class":null,"@_db_runtime":0,"@_response":"0x80d2920"},"code":2} so following my understanding of GC, the controller instance is hanging around because something in the global scope is directly or indirectly referencing this instance. So, I ran a script to produce a everything that directly and indirectly reference this instance,
{"_id":"0x80afbc8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/callbacks.rb", "line":93, ...} {"_id":"0x80d2920", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_controller/metal/rack_delegation.rb", ...} {"_id":"0x80d29e8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/http/request.rb", "line":42, ...} {"_id":"0x80d2ab0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_controller/metal.rb", "line":178, ...} {"_id":"0x80f59e8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/middleware/remote_ip.rb", "line":47, ...} {"_id":"0x80f69d8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/rack-1.2.2/lib/rack/handler/mongrel.rb", "line":46, ...} {"_id":"0x82964c8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/http/response.rb", "line":149, ...} It is a limited amount of stuff and nothing stands out as to why GC would no collect this graph (just to be sure, I uploaded this heap dump to memprof.com and confirmed this graph is correct). I noticed that the heap data contains the thread responsible for creating that thing on the heap. Using this, I can find everything that was created (and is still around) during the request that caused the controller instance to leak,
{"_id":"0x809da40", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template.rb", "line":105,"time":1305667260916474,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a09c0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template.rb", "line":105,"time":1305667260916068,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a2e28", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":10,"time":1305667260915127,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a2f18", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":10,"time":1305667260915111,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a4958", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":10,"time":1305667260912128,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a5cb8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/core_ext/class/attribute.rb", "line":73,"time":1305667260911863,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a6e38", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/core_ext/class/attribute.rb", "line":73,"time":1305667260911852,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a70b8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/routing/route_set.rb", "line":283,"time":1305667260911649,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a7130", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/routing/route_set.rb", "line":283,"time":1305667260911638,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a9908", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":10,"time":1305667260905821,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80a99f8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":10,"time":1305667260905808,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80aa8d0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260905562,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80aa8f8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260905553,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80ab6e0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260905452,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80ab708", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260905443,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80ad850", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260904746,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80ad878", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activemodel-3.0.7/lib/active_model/attribute_methods.rb", "line":110,"time":1305667260904737,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80ae980", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/ordered_hash.rb", "line":37,"time":1305667260904385,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80aea48", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/ordered_hash.rb", "line":37,"time":1305667260904366,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80afbc8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/activesupport-3.0.7/lib/active_support/callbacks.rb", "line":93,"time":1305667260904191,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80b0e38", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/http/parameter_filter.rb", "line":27,"time":1305667260903865,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c1058", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/lookup_context.rb", "line":25,"time":1305667260899722,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c1080", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/lookup_context.rb", "line":25,"time":1305667260899715,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c1c88", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/lookup_context.rb", "line":25,"time":1305667260899620,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c1cb0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/lookup_context.rb", "line":25,"time":1305667260899614,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c3da8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":9,"time":1305667260899365,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x80c3ec0", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_view/template/resolver.rb", "line":9,"time":1305667260899331,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} {"_id":"0x82964c8", "file":"/Users/pkmiec/.rvm/gems/ree-1.8.7-2010.02/gems/actionpack-3.0.7/lib/action_dispatch/http/response.rb", "line":149,"time":1305667261167737,"type":"data", "class":"0x22f31b0", "class_name":"Proc", "thread":"0x80f7310" ...} As expected, you see some of the Rails lazy loading stuff (i.e. attribute_methods for active model and resolver / template stuff for partial). I began thinking that lazy loading is related to the memory leak because that's the only difference between the first request and consequent requests. This test this theory, I added the following code to production.rb which pre-loads model and controller bits,
class Paul def self.init_controllers(controllers) controllers.each do |c| puts "init_controller #{c.name}" begin c.view_context_class c.config rescue => ignored end init_controllers(c.subclasses) end end def self.init_models(models) models.each do |m| puts "init_model #{m.name}" begin m.define_attribute_methods m.new({}) rescue => ignored end init_models(m.subclasses) end end end config.after_initialize do Paul.init_models(ActiveRecord::Base.subclasses) Paul.init_controllers(ApplicationController.subclasses) end Eureka ... no more controller instances hanging around! This same trick avoids controller instances / memory leak on the rather large app which I started with.

But the question of WHY still remains. I believe it has to do with some of the craziness of lazy-initialization (i.e. eigen / singleton classes created by attribute_methods, scope capturing in resolver.rb with Hash.new { }, etc). But the heap dump does not show it ... which leads to be believe the problem is somehow fundamental to ruby vm.

I've tried both mysql and sqlite3. I've tried mongrel and webrick. I've tried ruby-1.8.7-p334, ree-1.8.7-2010.02, and ree-1.8.7-2011.03 (memprof does not work with 1.9 yet). All show the same behavior.

Perhaps someone with more memory / GC /gdb experience can shed a light ...