The stack is indeed a much better way to allocate memory in terms of cache usage and runtime overhead: allocation is just a pointer adjustment, the working set stays hot in cache, fragmentation is impossible, and there are no runtime mechanisms to keep track of what's allocated beyond a couple of registers that are heavily optimized within the processor. However, when you can't use the stack for whatever reason (dynamic size, data whose lifetime doesn't follow the stack discipline, a small stack, etc.), you need some kind of dynamic allocation.
Using malloc/free for each object is the least efficient approach: each object can be a different size, each carries a few bytes of bookkeeping for its size and usage, and eventually you end up with a fragmented heap without much contiguous free memory. Allocating larger chunks for memory pools and free lists helps alleviate this problem and is often a good idea. However, you can still get poor cache behavior when objects come from different pools or sit far apart within their pools.
You can manually place objects together in memory by carefully tuning allocation, but (compacting or generational) garbage collection is sometimes a better way to do this, because it automatically puts things allocated close together in the same chunk of memory. It can also rearrange the heap, so that even when the original assumption - that objects allocated together are used together - turns out to be wrong, objects that stay live and do get accessed together end up close together after a collection or two.
The tradeoffs between GC and manual memory management should always be taken into consideration - besides the end result of performance, which is complicated enough, there's also the cost in programmer time. In the real world it's often worth accepting somewhat sub-optimal performance rather than paying programmers more than the extra performance would earn you in sales. The same applies to scripting languages: people are willing to trade runtime performance for ease of use because overall the task gets done much more quickly.
---
LLVM supports more than a shadow stack - that's just the simplest way to do it because it's already 90% set up. Their garbage collector documentation explains what's possible beyond that. The Boehm collector has to work around a lot of limitations in C++'s semantics, but some of those limitations could be removed with compiler support. While it's impossible to find all the pointers in a C++ program because of its weak typing, you can be completely sure about pointers that stay within the type system.
Finally, JIT has nothing to do with bounds checking. The issue is that people conflate JIT itself with the type checking that most JIT-compiled languages include - you could easily write a JIT for C that improves performance over statically compiled programs. Also, startup time is only a problem for some kinds of applications. Servers designed to stay running indefinitely may start out slower, but they just keep getting better and better as they run, and better system designs can avoid repeating startup delays for ordinary desktop applications as well. By itself, JIT really can improve performance, not just "keep up."