<<O>>  Difference Topic NotifyVsNetworkBufferOverflow (r1.1 - 25 Dec 2005 - GaranceDrosehn)
Line: 1 to 1
Added:
>
>
META TOPICPARENT BrainStorming
Coke recently ran into the 'Network buffer overflow' (NBO) problem while doing some work with the CJ bot:
   -> From Coke, to lily-dev:
   - CJ just got a traceback and a
     >> Network buffer overflow: 1 line of output to you has been lost <<            

   - I send cj a message. he sends the server "/what cj-admin", parses the 
   - result, then once he gets the result of that and parses, turns around
   - and sends the server a second command. the results of *that* should be
   - grabbed and sent back to me.
Over the past two years, we have made a lot of progress in the lilycoreDB code such that network buffer overflows are much much less common. The thing is, it seems odd that they would ever happen, at least at this point. The way the lily server is configured right now, the server reserves 64KB of buffer space for each logged-in user for buffering messages sent from the server to that user's client. And most operations on lily will only send a few 80-byte lines at a time. AND there are many different places in the code which are checking buffered_output_length() to make sure that code is not adding more text to those buffers if the buffers are filling up. So, how is it that we still (occasionally) see the problem?

Well, for one thing, much of the checking for buffer-overflows are done in the notify-related verbs that we have added to the $player object (#6). And there are still a number of places which write out a line by calling 'notify(player, ...)' instead of 'player:notify(...)'. I have been fixing those up as I notice them, but I know there are still places which make direct calls to the system's notify() routine, instead of going thru the notify-related verbs where we have added the extra checking. Note that in a few places in the code, we do need to call the system's notify() verb directly, so we can't just blindly change all calls from one to the other.

We also have four different notify-related verbs on $player, and until recently they all had different ideas of how to avoid NBO's. We have notify(), notifyLoop(), notify_list(), and notifyListLoop(). Originally there was just notify(), and its attempt would "just fail" if the network-buffers for that user were full. By "just fail", I mean the line sent to the user via player:notify() would be lost. We then noticed that if we did a 'suspend(0)' after a failure and tried again, then at least some lines would succeed on the second attempt to send them.

(aside: we first tried to avoid these overflows by changing the buffer size from 16KB/user to 32KB/user, and then again to 64KB/user. But I started balking when people wanted to increase the buffer size yet again to 128KB/user. On a historical note, the main LambdaMOO project also increased from 16KB/user to 64KB/user in 1999, while I'm pretty sure we did it a few years sooner than that.)

When SLCP support showed up, it became much more important that we didn't loose lines. For one thing, there were some SLCP operations (such as "%sync") which produced a lot of output all at once. Thus, it was much more likely we'd fill up the buffers. Not only that, but correct operation of the SLCP clients depended on them getting all the information which was getting sent. Now with SLCP, a lost line might mean that it was impossible for a user to send to some existing discussion, because their client wouldn't know the discussion even existed. So, if one 'suspend(0)' and retry worked in some cases, then an infinite loop of doing 'suspend(0)' and a retry would have to work "even better!". I have vague memories that we initially tried to do this in the original 'player:notify()' verb, only to run into some serious and mysterious server lock-ups when we did that. So, we created the new 'player:notifyLoop()' verb, which would take a single line, and keep looping around trying to send it to the user. We then use 'notifyLoop()' instead of 'notify()' for those messages which seemed the most critical for the correct operation of an SLCP client. In one or two cases this also caused serious lockups on the server, so we changed a few of them back to 'notify()', without really doing a full analysis of what the problem was.

The notifyLoop() and notifyListLoop() verbs are just versions of notify() and notify_list() which take a list of strings, and will write all of them to the player's network connection. This avoids some overhead, and allows a convenient way to supply a prefix to add to all the strings in that list. The only reason I make special notice of them in this context is that we generally forgot about these verbs, and as we made improvements to notify() and/or notify_list(), we didn't always remember to make similar improvements to notifyLoop() and notifyListLoop(). So the chance of running into NBO's would depend on which of the four verbs was being used to print the message -- or if the system's 'notify(player, ...)' routine was called directly for the message.

In 2004 I came upon the system routine called buffered_output_length(connection), and realized that this could help us out more than retrying failures from notify(). This tells us how many bytes are already buffered up on a network connection (where the first parameter can be some player-object, in which case it references the network connection for that player's current session). One shortcoming of this system routine is that it doesn't tell us how much room remains in the buffers, it only tells us how much is presently used. But we can write code which assumes the maximum is 64KB/user, and probably be safe enough. The nice thing about using this routine is that you can delay the addition of more lines to the network-buffer well before the buffers run out of space. I have used this in some verbs, and with very good results.

So I tried adding calls to check buffered_output_length(this) in player:notify() (in which case, 'this' is the value of the player object which is being written to), and immediately users reported some errors. At the time I was only testing the extra check, so I immediately removed it when errors were reported. Later on, I realized that the problem is that player:notify() is often called across player objects. E.G.: "userJoe" is sending a message to "userJane", so a process owned by "userJoe" is calling "userJane:notify()". Apparently that runs into a permissions-error (but I have not investigated completely to say that for sure -- maybe that is just due to the way we have our permissions set up). In any case, remember this point: "the object for userJoe can be calling userJane:notify()".

I'll write up more thoughts on this issue at a later time...

-- GaranceDrosehn - 24 Dec 2005

Revision -
Revision r1.1 - 25 Dec 2005 - 08:29 - GaranceDrosehn