
Is there MT in zero-k?

57 posts, 1862 views

2 years ago
I only found very old threads about multithreading in the forum. What is the status today? Is there any? If so, do I need to turn it on via some setting?
+1 / -0
Disclaimer: I'm not an engine Dev.

Zero-K doesn't need multi threading. All the performance you need could be gained from more efficient Lua/less Lua.

If you're not talking about performance but about the GUI, there's LoadingMT, which makes your load screen responsive at the cost of occasionally crashing. I'm not sure if it's still supported at all. In general I don't think there's support for asynchronous rendering, i.e. the frame rate is locked to the simulation rate when the game is laggy. On the other hand it can go above the simulation rate when the game is running smoothly, so I'm not really sure what's happening.
+2 / -0

2 years ago
I did not have rendering in mind - the CPU appears to be the bottleneck even on a quite performant platform. This cannot be rendering alone, as I mostly play zoomed out with units as icons (where there is hopefully no rendering of unit models) and I still lag.

I suspect Lua as the culprit for the performance issues. I do not know any specifics about the Lua engine in Zero-K, but in a setting where individual units are controlled by a default AI, I could imagine a theoretical possibility of performance gains via MT: each unit's AI forms an isolated subtask whose output is a unit action, so the calculation of that output could in theory be delegated to a worker thread. Note that I write "in theory". If the Lua engine is not designed for that and instead performs other optimisations that collide with a parallel design, the whole idea is pretty much dead.

I am well aware that retrofitting MT into a mature single-threaded program as complex as the Spring engine is most likely harder than a complete rewrite. On the other hand it is extremely frustrating to see an upper-mid-tier hexacore i7 brought to its knees by an engine that predates the CPU by 15 years ("only" ten years if we count ZK as a mod).
+1 / -0
2 years ago
The engine itself is technically multithreaded. You'll see a main thread, an audio thread, OpenAL threads, possibly threads from your audio library, and some number of worker threads, which is controlled by the springsettings.cfg setting "WorkerThreadCount".
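For reference, that setting lives in springsettings.cfg alongside the other engine settings; assuming the usual key = value format of that file, an entry would look something like this (the value 4 is just an illustration, pick something suited to your core count):

WorkerThreadCount = 4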

In practice, this doesn't matter much, as the bottleneck will always be lua. While it's theoretically possible to change the API exposed to lua so that lua could make use of multiple threads, and there was some effort to do so a long time ago, this would mean impractically large changes to the code of every game that uses spring. There is also a good chance that this would mean lua code would need to be written to avoid logic that could desync, rather than lua being given a safe API that should never cause any desyncs, which historically has not worked out well.
+3 / -0
I feel that it'd make more sense to either improve the lua engine with some kind of jit, or move expensive game logic into C++. The former would not require any work from game devs, whereas game logic in C++ would give us a nicer language to work with. It would probably cost us some devs though due to the increased language complexity.

Multithreaded Lua would both make the work of game devs more complex, since they would now have to think about synchronization, and require a good bit of work from engine devs, so it feels like the worst of both worlds.

As an experiment, try playing spring games with minimal Lua. There, performance tends to be through the roof.
+1 / -0
2 years ago
Since I got a new computer (i7-8700, 32GB RAM) I do not remember getting lag in casual team games (except network-based lag, of course).

As such I am not sure why you ask the question. Are you interested in performance? Then maybe you should specify the computer you use and the types of games you play. Chicken games probably have different performance requirements than team games.

I have occasionally seen projects that people tried to optimize more or less for the sake of the technology, without real impact. (e.g. "let's use a GPU to do these computations!" They ended up paying 2x the cost of the server for 0.5x the computation time, while just buying a similar second server would have taken much less effort, since the code could easily run distributed.)

+1 / -0

2 years ago
I am using a core i7 8750H with 64GB of ram. It does have some heat problems as it is a notebook. Let me point out however that I use the same machine for algorithm development in computational mathematics and machine learning, where it does not display any performance problems.

In Zero-K I get a red CPU indicator in games with many players after a while. That is mostly, but not consistently, related to the number of entities on the map; sometimes it appears quite early, so I suspect the type and behaviour of the units present also plays a major role.

I had a quick glance at the dependencies of Spring on Linux. It does not appear to depend on any Lua-related libraries, so I assume the Lua engine is completely homegrown? If so, I believe there is simply no manpower to optimise Lua. Stuff like JIT is really hard to do right, and the expected performance gain is very likely not worth the pain.

The difficulty of implementing MT will vary greatly depending on what kind of code should run in parallel.
My example of a parallel unit AI would be a comparatively easy case. The reason it is easy (in comparison) is that the results calculated by - say - a skirmish AI need not be ready and synced with the main program at any fixed point.
The skirmish script would simply read the positions of units, calculate a movement vector and issue the move command. The only result relevant to the main engine is the unit's movement command, which can simply be stored in a list for the main loop to process. If the command list is written and read in separate chunks, this can be done with very few mutexes (read: little overhead).
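To make that concrete, here is a minimal C++ sketch of the kind of command list I have in mind (UnitCommand and all the names are invented for illustration, not actual Spring code): workers append finished commands under a single mutex, and the main loop swaps the whole list out once per frame, so the lock is only ever held for a moment.

#include <mutex>
#include <vector>

// Hypothetical command produced by a unit-AI script.
struct UnitCommand {
    int unitId;
    float moveX, moveZ;
};

class CommandQueue {
public:
    // Called by AI worker threads: append a finished command.
    void push(const UnitCommand& cmd) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(cmd);
    }

    // Called once per sim frame by the main loop: take everything
    // accumulated so far in one swap, so the lock is held only briefly.
    std::vector<UnitCommand> drain() {
        std::vector<UnitCommand> out;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            out.swap(pending_);
        }
        return out;
    }

private:
    std::mutex mutex_;
    std::vector<UnitCommand> pending_;
};

The drained commands would then be fed into the normal command pipeline, exactly as if a player had issued them.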

The main problem with parallelizing very many small scripts will be access to unit state. A naive implementation would require a mutex for each parameter read by each instance of the unit script. Workarounds either come with a memory overhead or would require atomic reads/writes, introducing the concept of volatile data into Lua (defeating the purpose of a scripting language).
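The memory-overhead workaround could, for example, be a per-frame read-only snapshot of unit state: the engine copies the relevant fields into a buffer before the workers start, the scripts only ever read that buffer, and no mutex or atomic is needed during the parallel phase. A rough C++ sketch (UnitState and the buffering scheme are my own invention):

#include <vector>

// Minimal invented unit state the AI scripts are allowed to read.
struct UnitState {
    int unitId;
    float posX, posZ;
    float health;
};

// Double buffer: the main thread fills 'back' while workers read
// 'front'; the buffers are swapped between sim frames, when no worker
// is running, so readers never need a lock.
class UnitStateSnapshot {
public:
    // Main thread, between frames: publish the freshly copied state.
    void publish(std::vector<UnitState> latest) {
        back_ = std::move(latest);
        front_.swap(back_);
    }

    // Worker threads, during the parallel phase: read-only access.
    const std::vector<UnitState>& read() const { return front_; }

private:
    std::vector<UnitState> front_;
    std::vector<UnitState> back_;
};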

On the other hand (@esainane) I can see MT Lua being done without needing to rework the mod scripts: anything threading-related should be hidden from the Lua interface by definition (you don't want modders to have to worry about MT). Restricting parallel execution to distributing entire scripts across worker threads (as opposed to breaking a script itself into parallel parts) would make MT a lot easier and eliminate any need to expose thread-related stuff to the modders.

Desyncs can only occur with code that interferes with the calculation of the main simulation. So far, everything I suggested would have no effect on the main simulation: if a Lua script calculates the behaviour of unit AIs, the main loop only needs to be notified about the resulting unit actions at the end. Those actions are technically equivalent to manual player commands, so no new sync-related issues would be introduced here.
+0 / -0
2 years ago
quote:
I am using a core i7 8750H with 64GB of ram. It does have some heat problems as it is a notebook. Let me point out however that I use the same machine for algorithm development in computational mathematics and machine learning, where it does not display any performance problems.
Mine is a desktop; I never tried to track heat. Your machine looks powerful, so it seems strange that you get lag. In the last year I have mostly run on Linux, but I don't remember lag on Windows either (same desktop).

quote:
So far, everything I suggested would have no effect on the main simulation: if a Lua script calculates the behaviour of unit AIs, the main loop only needs to be notified about the resulting unit actions at the end
Maybe someone could comment on the status of profiling, so we can identify what is slow. If I remember correctly, at some point performance was bad due to something not directly related to the code/structure/etc. (garbage collection in Lua? not sure...), which needed a configuration better adapted to the game.
+0 / -0

2 years ago
Also, I could actually try playing on Linux. Performance is often better there, except for GPU stuff.
I had tried playing with the Lua GC settings, but the results were inconclusive. There is also no description of which setting has which effect on the GC. "Fast" and "slow" can mean anything. Does fast/slow refer to the GC frequency, meaning more/fewer runs, or what does it mean? I suppose many runs have a per-run overhead, while few runs may take longer per run and thus block the engine.
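For what it's worth, my guess is that those settings simply map onto Lua's two standard GC knobs, the pause and the step multiplier, set through the Lua C API. Something like the following; the mapping from "fast"/"slow" to concrete numbers is purely my assumption, not documented engine behaviour:

#include <lua.hpp>

// Hypothetical translation of a "fast"/"slow" GC setting into Lua's
// standard knobs. A lower pause means collection cycles start sooner
// (more, smaller runs); a higher step multiplier makes each incremental
// step do more work (fewer, longer interruptions).
void applyGcSetting(lua_State* L, bool fast) {
    if (fast) {
        lua_gc(L, LUA_GCSETPAUSE, 110);   // restart the GC almost immediately
        lua_gc(L, LUA_GCSETSTEPMUL, 100); // small incremental steps
    } else {
        lua_gc(L, LUA_GCSETPAUSE, 200);   // wait until memory use doubles
        lua_gc(L, LUA_GCSETSTEPMUL, 400); // do more work per step
    }
}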

Also, the mere fact that GC is being used points to how dated the engine is. The overhead of smart pointers is usually minuscule if the data structures are designed well. Modern systems usually have several times more RAM than necessary while having very limited single-thread performance. Combining a single-threaded engine with garbage collection shifts the CPU/RAM tradeoff even further in a CPU-heavy direction. This is really not a good design decision if Spring is to be around in 10 years.
+0 / -0
2 years ago
I disagree. Garbage collection is an essential mechanism to enable easy change of complex code bases (thinking of Python, Lua, etc.). The problem is not how to write good code if you are a good programmer who understands the whole project, but rather how to prevent novice programmers from breaking the application without a nice and clear error message. Using a scripting language (with garbage collection) helps a lot with that.

I once had a laptop that was not going to full frequency because the fan was full of dirt. What I observed was performance degrading over time without an apparent reason. Once I realized the frequency was staying low, I opened it up and cleaned it; after the next reboot the computer was working ~50% faster... I noticed that because of Spring too, btw, although ZK was probably not around at that time :-). The idea is: troubleshooting problems can sometimes lead to unexpected conclusions.
+2 / -0
I believe this strategy has been discussed before.

The main problems with multithreaded unit scripts are
1. you would need a lua context for every unit. You may have 64GB of RAM, but most people don't. The Spring engine is already a massive memory hog
2. you would need to communicate with the global game state via message passing, incurring even bigger overhead than Lua->C interop
3. you would need to execute in sync over a network, meaning 'messages' to SyncedRead/SyncedCtrl will need to be processed in the same order for every client

1 can be somewhat alleviated via some artificial environment and 'pinning' said environment to a specific thread context, but 2 can't. 3 is going to be a big problem.
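On point 3, the usual trick for keeping such a scheme in sync is to never apply worker output directly, but to collect it per frame and sort it by a key every client agrees on before it touches synced state. A rough C++ sketch (the ScriptMessage shape is invented for illustration):

#include <algorithm>
#include <vector>

// Invented shape of a message a threaded script would send towards
// SyncedCtrl-style state changes.
struct ScriptMessage {
    int unitId;     // which unit the script belongs to
    int sequence;   // per-unit counter, identical on every client
    int command;    // what the script wants to do
};

// Worker threads may finish in any order, so the raw message list
// differs between clients. Sorting by a key derived only from game
// state (never from thread timing) restores one canonical order.
void canonicalize(std::vector<ScriptMessage>& frameMessages) {
    std::sort(frameMessages.begin(), frameMessages.end(),
              [](const ScriptMessage& a, const ScriptMessage& b) {
                  if (a.unitId != b.unitId) return a.unitId < b.unitId;
                  return a.sequence < b.sequence;
              });
}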

Besides, the other main performance issue with Zero-K is related to CPU-side rendering. There are a lot of draw calls, and maybe some things could be instanced, but I personally haven't touched the engine in a long time. If Spring's Lua-side performance got too slow, then LuaJIT could be considered, given that it can be made to sync.

GC is not a dated concept. It hasn't been phased out; many languages still rely on GC instead of manual memory management, many with great performance. Regarding mark-and-sweep GC (what Lua uses): how else will you deal with cyclic tables? Lua is a scripting language, and if I have to start worrying about cyclic tables, then that scripting language isn't very good. The performance of GC vs reference counting (smart pointers) is debatable. Smart pointers are quite slow.
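For completeness, the non-GC answer to cycles in C++ is to break them by hand with weak references, which is exactly the kind of bookkeeping a scripting language is supposed to spare you. A tiny sketch:

#include <memory>

struct Node {
    std::shared_ptr<Node> child;
    // A shared_ptr back to the parent would keep the pair alive forever;
    // a weak_ptr observes the parent without owning it, so the cycle
    // is broken and both nodes are freed when the last owner goes away.
    std::weak_ptr<Node> parent;
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->child = b;
    b->parent = a; // no ownership cycle, no leak
    return 0;
}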

Fun fact: there exists a garbage collector for C/C++: https://www.hboehm.info/gc/
+1 / -0

2 years ago
quote:
Garbage collection is an essential mechanism to enable easy change of complex code bases. ...
Using a scripting language (with garbage collection) helps a lot with that.


I was not talking about GC versus manual memory management, but GC versus making all internal pointers smart. What do I mean by "internal pointers"? I assume we are both talking about pointer-free (scripting) languages, where the programmer has no way of using pointers and references explicitly. Lua, for example, is entirely reference based, and I am quite sure that pointers are used "under the hood". Any language that uses GC needs some mechanism to keep track of which data has gone out of scope, as a memory location can be reclaimed/freed if and only if there is no longer any reference (or pointer) to that data in any live scope.

So smart pointers (plus a free/reclaim method) can actually be interpreted as a special case of GC: keeping track is done by a counter, and collection can happen immediately. Even more so, delaying the reclaim/free and doing it in a dedicated memory management cycle, possibly in a separate thread, is rather trivial. We can get very close to GC behaviour with smart pointers.
This is a classical case of a memory/CPU tradeoff, smart pointers being very slightly heavier on the memory side.
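To illustrate the "delayed reclaim" idea: a reference-counted object does not have to be freed on the spot; its deleter can park it on a list that a dedicated cleanup pass (or thread) empties later, which is already most of the way to GC-like behaviour. A rough C++ sketch, all names invented:

#include <memory>
#include <mutex>
#include <vector>

// Objects whose reference count hit zero are not freed immediately;
// they are parked here and destroyed in a dedicated cleanup pass.
class DeferredReclaimer {
public:
    void park(void* p, void (*destroy)(void*)) {
        std::lock_guard<std::mutex> lock(mutex_);
        graveyard_.push_back({p, destroy});
    }

    // Called at a convenient point (end of frame, or from a cleanup
    // thread): actually run the deferred destructions.
    void collect() {
        std::vector<Dead> dead;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            dead.swap(graveyard_);
        }
        for (auto& d : dead) d.destroy(d.ptr);
    }

private:
    struct Dead { void* ptr; void (*destroy)(void*); };
    std::mutex mutex_;
    std::vector<Dead> graveyard_;
};

// Hand out shared_ptrs whose "deleter" only parks the object.
// (The reclaimer must outlive every pointer created this way.)
template <typename T>
std::shared_ptr<T> makeDeferred(DeferredReclaimer& r, T* obj) {
    return std::shared_ptr<T>(obj, [&r](T* p) {
        r.park(p, [](void* q) { delete static_cast<T*>(q); });
    });
}

I am not claiming this particular scheme is a good idea; the point is only that once you defer destruction like this, the line between smart pointers and a GC gets blurry.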

So if a reference-counting implementation is to end up very similar to GC anyway, why not just use GC? There is nothing wrong with GC in general, but when a huge number of objects (units!) are continuously being created and going out of scope (to be collected), it is very easy to run into trouble with GC. I am not saying it must always be a problem, but without a design that is carefully tailored to the application's parameters, performance can be abysmal. This should come as no surprise, as memory management has always been a central part of performant programming.
Having the GC sweep through a graph with thousands of nodes on each run already does not sound like a brilliant solution. But surely those node entries are in an array, i.e. a contiguous chunk of memory, so that access is hardware-friendly...? Nope, sorry, they are not. They are in fact a linked list...

The whole GC/smart pointers debate has sadly been dominated by "researchers" who had a solution looking for a problem and knew their results a priori. You have "research" papers claiming that their specific pet GC "can be" [insert high number] times as fast as reference-counting-based memory management. The problem is that nobody cares how fast a GC showcase benchmark runs against the poorest amateur C++ code some research assistant could come up with.
Programmers have been tweaking their memory management for decades for a reason. Now GC comes along as the single answer to all memory questions; always being just a little wrong appears to be mostly good enough. The whole thing reminds me very much of the Java JIT advertising in the late 90s. The JIT proponents were so vocal that, going by them, every number-crunching app should by now be written in a JIT-compiled language. I can't see that.

Among the many proposed advantages of GC I can only think of two that are actually plausible: smart pointers leaking cyclic references, and the immediate same-thread execution of destructors. Cyclic references are actually a non-issue, as anyone who creates them should re-take that course on software architecture (actually, memory leaks are a mild and fair punishment for writing cyclic code)! Execution timing and thread assignment of destructor code, however, is a major issue that can only be worked around by implementing task-specific memory management routines that might even look GC-ish.
(I still have a good laugh, though, when thinking of that GC guy who caused a stack overflow through recursive destructor calls in order to demonstrate how superior GC is... c'mon, seriously?)
+0 / -0

2 years ago
FRrankmalric: I know there is likely a heatsink problem with my laptop. I suspect there is a thin dust layer that prevents effective heat dissipation. However, that still does not explain the immense CPU load; even at 50% that machine should be sufficient.
+0 / -0

2 years ago
Interesting observation: when the engine exits, CPU load on core 6 increases for a few seconds, but only by a few percent. That core is being used by spring.exe, even when I am only in the lobby. (I confirmed this by switching Spring's affinity, which shifts the load curve to another core.)
Core 6 is reported at 45%-48% whenever the engine OR lobby is running. The game-end peak raises this very close to 50%. As this is a hyperthreading CPU, that number actually means the core is running at close to 100% all the time without using hyperthreading (which makes sense, as HT requires threads and the system will not balance another application's thread onto a core that is already at its limit).

The significant strain on that core when the game ends might indicate that the high CPU load is related to memory management, the end-game peak being generated by the "destructors" (or whatever you want to call Lua's on-GC routines).

Another interesting observation is that the loads on the GPU core and GPU framebuffer are really high at the beginning of each game and then fall off exponentially towards a constant limit. This is probably due to the GPU reusing data later in the game that was computed during the early phase, which is fine.
+0 / -0


2 years ago
As a small addon to the discussion, I'm using an i9-9900K cpu and a 2080TI gpu - ZK is one of very few games that makes my fan louder than a motherfucker instantly. Unbearably loud. And I've tweaked my fan curves.

ZK mining crypto on my hardware?
+2 / -0
GBrankPRO_Dregs: compare laggy mid-game big team games with a 1v1 before the game has even started. I bet it's much worse before the start of the 1v1.

[Spoiler]
+6 / -0


2 years ago
Thanks Dein <3
+1 / -0

2 years ago
CHrankAdminDeinFreund: I have the FPS limited globally, and the games I played were all 1v1, which does not create perceivable CPU lag as opposed to team games. So the reduction in GPU load cannot simply be due to the GPU overworking at game start. (Jeez, I had not even thought about that! That PC is already heavy on my electricity bill when it is doing useful stuff...)
I would guess the initial high load might be due to shaders not being cached or something similar, but on the other hand I am using minimal shaders and they might even be cached in software.
+0 / -0

2 years ago
https://zero-k.info/Forum/Post/238769#238769

This is a really old issue, like years old. The lobby just idling maxes out a core. Not sure what it is doing.
+1 / -0

2 years ago
My first guess would be some brute-force polling going on. This could be a UI loop missing a wait between iterations while processing the (empty) event queue, or even a GC running wild (just kidding, since the subject came up above)...
+0 / -0