SpiderMonkey ESR31 upgrade



I just wondered: do the memory graphs differ when the public folder is packaged (as is done in releases)?

These graphs definitely wouldn't differ, because they only show SpiderMonkey memory usage, and SpiderMonkey needs to read the scripts in uncompressed form either way.

I'm not quite sure how the VFS accesses the rest of the data in the public folder or the ZIP file. It could only make a difference if it loaded the whole ZIP or folder into memory and extracted the data from there on demand. Jan, a former 0 A.D. developer, put a lot of work into optimizing the VFS, so I doubt there are any quick wins for performance or memory usage there.


What do you mean by "kind of a memory leak"? Wouldn't it make sense to have all JavaScript code compiled, as it doesn't change?

There are two kinds of concerns.

The first is that some of our code gets used rarely (like random map code for a specific map or a specific trigger script) and it could cause unnecessary bloat to keep the compiled versions in memory. The other concern is that this setting is meant just for testing and it could have unwanted side-effects like keeping code that isn't accessible anymore. Running one non-visual replay worked quite well though. The memory usage was somewhere between the two graphs of v24 and v31+GGC.

I should probably investigate a bit further and ask the SpiderMonkey developers. My concern is that they design the heuristics for keeping JIT code just for the use case of Firefox and not for a case like ours where most of the code should be kept.


SpiderMonkey even ion-compiles and inlines functions created via the Function() constructor (Google's Chrome doesn't). These shouldn't get garbage collected, because they are very useful for creating hot functions on the fly (and avoid lots of boilerplate code).


I've asked in the newsgroup about the problem with JIT code GC.

Whether and how the different types of functions (loop bodies are another example) are protected from GC is something the SpiderMonkey developers have to think about. I'm trying to avoid diving too deep into SpiderMonkey internals. I'm just one person and have enough to do with the interface to 0 A.D. ;)


> Running one non-visual replay worked quite well though. The memory usage was somewhere between the two graphs of v24 and v31+GGC.

If the memory consumption was between the old and new version, that would be of no concern, would it? Even v24 was well below 100 MB most of the time, if I'm not misreading your graph. Or are there many complaints about 0 A.D. being memory hungry?


It's not too bad in this specific test; I'm more concerned about upstream support from SpiderMonkey.

If there's no solution that is meant for our case, it could suddenly break without an alternative.

So far I know about the following ways to prevent GC of JIT code:

  1. Set the flag "alwaysPreserveCode" in vm/Runtime.cpp (tested here)
  2. Use the testing functions to enable gcPreserveCode
  3. Keep calling js::NotifyAnimationActivity with a delay of less than a second

The first approach has the problem that it completely disables JIT code GC for the whole engine and requires changing the library's source code, so it doesn't work with SpiderMonkey versions provided by Linux distributions, for example.

The second approach did not work in my most recent tests, but it could probably be made to work if I investigated further. It would not require changes to the library's source code, but otherwise has the same problems, and it is explicitly declared as a testing function that shouldn't be used in production code. The third option actually seems to be used by some parts of Firefox, but it looks very fragile: GC of JIT code could happen again after a single lag spike of more than a second, or in any other case where our calls don't get executed less than a second apart.
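
To illustrate the third option, here's a minimal sketch of such a heartbeat, assuming the v31-era jsfriendapi.h signature js::NotifyAnimationActivity(JSObject*) taking a compartment's global object (the frame hook and the list of globals are hypothetical):

```cpp
#include <vector>

#include "jsfriendapi.h"

// Hypothetical per-frame hook: poke every compartment whose JIT code
// should survive GCs. This must run at least once per second; if the
// "recent animation activity" window expires (e.g. during a lag
// spike), the JIT code becomes collectable again.
void KeepJitCodeAlive(const std::vector<JSObject*>& globals)
{
    for (JSObject* global : globals)
        js::NotifyAnimationActivity(global);
}
```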

I really hope the SpiderMonkey devs have a better alternative to offer.


> GC of JIT code could happen again if there's only a single lag spike of more than a second

Does this imply GC of all code and possibly full re-compilation of the source tree (GUI, bots, simulation, ...) or just parts? And if re-compilation takes longer than a second, does it trigger another GC and enter a loop?


> Does this imply GC of all code and possibly full re-compilation of the source tree (GUI, bots, simulation, ...) or just parts? And if re-compilation takes longer than a second, does it trigger another GC and enter a loop?

I think it means all code in one compartment (simulation OR gui OR AI OR ...) because js::NotifyAnimationActivity works on the compartment level.

It will not enter a loop, and as far as I know, JIT code GC only happens as part of a normal full GC. We define rules for triggering full GCs that depend on how much the memory usage has increased since the last full GC. However, there are other conditions that could trigger a full GC, so it would not be enough to add a js::NotifyAnimationActivity call right before we call the full GC ourselves. I assume JIT code GC will only happen if the last call to js::NotifyAnimationActivity was longer than a second ago at the time the full GC runs.
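
As a sketch of the kind of trigger rule meant here (hypothetical names and threshold; the real rule lives in our engine code): run a full GC once the JS heap has grown by some amount since the last one.

```cpp
#include "jsapi.h"

static uint32_t g_GCBytesAtLastFullGC = 0;

// Called periodically, e.g. once per simulation turn.
void MaybeFullGC(JSRuntime* rt)
{
    // JSGC_BYTES reports the current GC heap size in bytes.
    uint32_t bytes = JS_GetGCParameter(rt, JSGC_BYTES);
    const uint32_t growthThreshold = 20 * 1024 * 1024; // assumption: 20 MB

    if (bytes > g_GCBytesAtLastFullGC + growthThreshold)
    {
        JS_GC(rt); // full, non-incremental GC
        g_GCBytesAtLastFullGC = JS_GetGCParameter(rt, JSGC_BYTES);
    }
}
```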

Anyway, it seems to be a hack to use js::NotifyAnimationActivity to prevent JIT code GC over such a long period of time.


> Anyway, it seems to be a hack to use js::NotifyAnimationActivity to prevent JIT code GC over such a long period of time.

Definitely, especially while more and more full apps move into the browser. However, if it even allows running Unreal Engine 3 in Firefox...

Is that what Artillery, the gaming company, uses? https://www.artillery.com


1 month later...

I've spent quite some time analyzing the effect of gcPreserveCode on memory usage and performance.
In the previous measurements, we saw a 13.5% performance improvement in the non-visual replay when gcPreserveCode is enabled. I also checked memory usage there, but only the size of the JS heap reported by SpiderMonkey. If SpiderMonkey needs more memory with gcPreserveCode that is not part of the JS heap, we wouldn't have seen it there.
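
For context, the "JS heap reported by SpiderMonkey" is roughly the GC heap byte count the API exposes; JIT code lives in separate executable allocations that this number doesn't cover. A minimal sketch of querying it:

```cpp
#include "jsapi.h"

// Returns the current GC heap size in bytes. JIT code and other
// malloc'd engine buffers are allocated outside the GC heap, so
// extra memory kept alive by gcPreserveCode would not show up here.
uint32_t GetJSHeapBytes(JSRuntime* rt)
{
    return JS_GetGCParameter(rt, JSGC_BYTES);
}
```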

So one of the first things to do is to compare total memory usage:
[Graph: total memory usage during the non-visual replay]

Here you see a graph of the total memory usage during a non-visual replay of 15000 turns, with and without preserving JIT code (both v31). The measurement in blue will be explained later.
There's one sample every 5 seconds. As expected, the keepjit graph (green) is a bit shorter because it runs faster. At the end of the replay, keeping the JIT code needs 76 MB, or 24-25%, more memory. This is just the non-visual replay, so the percentage would look less scary in a real game, where much more memory is needed for loading models, textures, sound etc.
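
For reference, the total memory usage of the process can be sampled like this on Linux (my sketch of such a measurement, not necessarily how the graph above was produced):

```cpp
#include <cstdio>
#include <unistd.h>

// Reads the process's resident set size from /proc/self/statm.
// Returns -1 on failure.
long ResidentSetBytes()
{
    long pagesTotal = 0;
    long pagesResident = 0;
    std::FILE* f = std::fopen("/proc/self/statm", "r");
    if (!f)
        return -1;
    int matched = std::fscanf(f, "%ld %ld", &pagesTotal, &pagesResident);
    std::fclose(f);
    return matched == 2 ? pagesResident * sysconf(_SC_PAGESIZE) : -1;
}
```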

It's not only important what effect the gcPreserveCode setting has during a game, but also what effect it has when playing several games in a row. The JIT code is kept on the compartment level (it might partially be the zone level, but in practice that's the same for this case). We create a new compartment when we start a new game and destroy it when the game ends, which means we don't keep simulation JIT code for more than one game. The same is true for the NetServer in multiplayer games, for random maps and for GUI pages. Also, we can run a shrinking GC at any time, which throws away all the JIT code and associated information.
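
As a sketch of that lifecycle (hedged: the exact JS_NewGlobalObject signature and JSClass layout differ between SpiderMonkey versions; this is roughly the v31 shape):

```cpp
#include "jsapi.h"

static JSClass global_class = {
    "global", JSCLASS_GLOBAL_FLAGS,
    JS_PropertyStub, JS_DeletePropertyStub, JS_PropertyStub, JS_StrictPropertyStub,
    JS_EnumerateStub, JS_ResolveStub, JS_ConvertStub
};

// Each game gets its own global object and therefore its own
// compartment; all JIT code compiled for that game lives there.
JSObject* CreateGameGlobal(JSContext* cx)
{
    JS::CompartmentOptions options;
    return JS_NewGlobalObject(cx, &global_class, nullptr,
                              JS::FireOnNewGlobalHook, options);
}

// Ending the game means dropping all references to this global; the
// next full GC can then collect the compartment and its JIT code.
```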

The third graph (in blue) shows an additional memory measurement where a shrinking GC is called every 500 turns and gcPreserveCode is enabled. It confirms that calling a shrinking GC really frees the additional memory used for JIT code and associated data. I would have expected it to have more or less the same negative effect on performance as the default settings, but it actually looks like the performance is as good as when keeping the JIT code for the whole game. It's a bit dangerous to draw conclusions from this single measurement though: I haven't been especially careful about other programs running on the system, and it's just one measurement.
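
A sketch of triggering such a shrinking GC from the engine side, modeled on how our engine code does it (the turn hook is illustrative, and the JS::ShrinkingGC call and GC mode parameters are the v31-era API as I remember it):

```cpp
#include "jsapi.h"

static const int SHRINKING_GC_INTERVAL = 500; // turns, as in the test above

void OnSimulationTurn(JSRuntime* rt, int turn)
{
    if (turn % SHRINKING_GC_INTERVAL != 0)
        return;

    // A shrinking GC is a full GC that additionally discards JIT code
    // (even with gcPreserveCode enabled) and returns free chunks to the OS.
    JS_SetGCParameter(rt, JSGC_MODE, JSGC_MODE_COMPARTMENT);
    JS::PrepareForFullGC(rt);
    JS::ShrinkingGC(rt, JS::gcreason::REFRESH_FRAME);
    JS_SetGCParameter(rt, JSGC_MODE, JSGC_MODE_INCREMENTAL);
}
```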

I've also worked on fixing some problems with the new tracelogger in v31. I've shown some tracelogging graphs in the past already. It not only gives information about how long JS functions run and how often they are called, but also engine-internal information like which optimization mode they run in and how often they had to be recompiled. Among other things, you can also see how long garbage collections and minor GCs took.
The tracelogger hadn't really been used for programs as large as 0 A.D. so far, so it had a few bugs in that regard (correct flushing of log files to disk, for example). Some of them are fixed in newer versions or are scattered across different Bugzilla bugs, waiting to be completed, reviewed and committed. Together with h4writer, who has helped us in the past already and who is the main developer of the tracelogger, I've managed to backport the most important fixes to v31. In addition, we discovered a limitation with large output files, and he has made a new reduce script that can be used to reduce their size.

Now that the new tracelogger is functional, it can be used to further analyze the effect of gcPreserveCode. The following diagrams are made from Tracelogger data.
They show how long different code runs in the different levels of optimization. Code starts running in the interpreter, which is the slowest mode. After a function has run a few times in the interpreter, it gets compiled to baseline code (BaselineCompilation) and then runs in Baseline mode. The highest level of optimization is IonMonkey. TL stands for Tracelogger and can be ignored because it's only active when the Tracelogger is used. So basically: the higher the IonMonkey share, the better, and the interpreter share should be close to 0%. I've included everything >=1% in the diagrams and left the rest out.

v31 (no gcPreserveCode, no shrinking GCs)

[Pie chart: time spent per optimization level]

v31 with gcPreserveCode
[Pie chart: time spent per optimization level]

v31 with gcPreserveCode and shrinking GCs every 500 turns

[Pie chart: time spent per optimization level]

The Tracelogger data matches the results from the performance measurements. From the Tracelogger's perspective, using gcPreserveCode without shrinking GCs obviously gives better results than with them. The difference is quite small though, and probably doesn't justify the increased memory usage. The memory graphs also indicate that there's probably not a big performance difference between these two modes (but this would need to be confirmed with additional performance measurements).

The Tracelogger results from the default mode without gcPreserveCode are much worse (as expected). The code runs much less in IonMonkey, more in Baseline, and even quite a bit in interpreter mode. The code gets garbage collected and thrown away far too often and thus has to be recompiled more often, resulting in more compilation overhead and more time running in less optimized form.

Conclusion

My conclusion based on these results is that using gcPreserveCode together with regular shrinking GCs seems to be a good compromise between memory usage and performance. It might be worth confirming with additional measurements that the shrinking GCs really don't have a significant negative effect on performance.



Thanks Yves, that's a deep investigation. Probably I'm missing something, so let me ask: the pie charts express a percentage; is this the time spent running, or where the tracer got caught? If it is the latter, a difference of ~6% re IonMonkey _might_ still indicate a pretty big overall difference, given the huge optimization potential of native versus baseline code. Also, is it a game with bots or only simulation?


> The pie charts express a percentage, is this the time spent running or where the tracer got caught? [...] Also, is it a game with bots or only simulation?

The Tracelogger shows the time spent running. The game is a 2vs2 AI game on the map Greek Acropolis (4) this time. The difficulty is really that running more in IonMonkey might have a bigger effect on other maps or when the game plays out differently. Part of the decision is still guesswork.


Hi Yves, once again great work on these investigations. Judging from the profiling results showing how much time is actually spent in JavaScript, I'd say this would be extremely worthwhile to achieve. To be honest, I'd even say that 75 MB more memory consumption is negligible; I'm curious how the discussion about that will proceed.


> a massive speedboost

While the blog post is quite excellent, the boost is only about generator functions. They are tricky because they maintain state after exiting via yield, and there is no simple way to get this state out of a generator function to e.g. save it somewhere. And if you don't use yield, well, then normal functions do the same thing much faster.

The other thing is that, while it is elegant and memory-preserving to implement, say, a generator which provides a million Fibonacci numbers or primes, you need another iterator function to get the iterator protocol running, and the whole thing becomes really slow.

So, you gain readable code and small memory footprint at the cost of speed...

