Best of All Worlds Architecture (BoAWA)
1. Core
A. Standard RL training. Select an RL algorithm (e.g., PPO or DQN) and train it on an environment to maximize the reward signal.
B. Hyper-parameter tuning. Apply a coding agent to automate hyper-parameter tuning (rather than adjusting by hand), seeking a higher maximum reward than the result of step 1A.
C. Algorithm selection. Apply a second coding agent that swaps the outer-loop algorithm for alternatives from a pre-defined catalogue of RL implementations, seeking a higher maximum reward than steps 1A and 1B achieved.
D. Store the best-performing RL algorithm, together with its tuned hyper-parameters, in a cache.
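Steps 1A–1D can be sketched as a nested search: an inner training run, a hyper-parameter search around it, and an outer loop over the catalogue, with the winner cached. The catalogue, parameter grids, and `train` function below are toy stand-ins (no real RL library is invoked) so the shape of the loop is visible.

```python
import random
from itertools import product

# Hypothetical catalogue for step 1C; algorithms and grids are illustrative.
CATALOGUE = {
    "PPO": {"lr": [1e-4, 3e-4, 1e-3], "clip": [0.1, 0.2, 0.3]},
    "DQN": {"lr": [1e-4, 5e-4], "epsilon": [0.05, 0.1]},
}

def train(algo, params, env_seed):
    """1A stand-in: a deterministic toy 'training run' returning mean reward."""
    rng = random.Random(hash((algo, tuple(sorted(params.items())), env_seed)))
    return rng.uniform(0.0, 1.0)

def tune(algo, env_seed):
    """1B: exhaustive grid search; a coding agent could drive a smarter search."""
    keys = list(CATALOGUE[algo])
    best = None
    for values in product(*(CATALOGUE[algo][k] for k in keys)):
        params = dict(zip(keys, values))
        reward = train(algo, params, env_seed)
        if best is None or reward > best[0]:
            best = (reward, params)
    return best

def select_algorithm(env_seed):
    """1C: outer loop over the catalogue; returns the cacheable winner (1D)."""
    results = {a: tune(a, env_seed) for a in CATALOGUE}
    algo = max(results, key=lambda a: results[a][0])
    reward, params = results[algo]
    return {"algo": algo, "params": params, "reward": reward}

cache = {}                                # 1D: keyed by environment id
cache["env-0"] = select_algorithm(env_seed=0)
```

In a real system the grid search would be replaced by the agent-driven tuning described in 1B, and `train` would launch actual RL runs.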
2. Solution Library
A. When a new sandbox environment is presented, an outer model analyzes its key characteristics to produce an "environment fingerprint," which is checked against the cache from step 1D.
B. A similarity score is computed between the new environment's fingerprint and each fingerprint in the library.
C. Solutions are "warm started": instead of running 1A–1D from scratch, the highest-scoring matches determine which cached algorithms and hyper-parameters from similar environments to start from.
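The fingerprint-matching step (2A–2C) can be sketched as a nearest-neighbor lookup. The feature vector here (observation size, action count, reward density) and the cosine-similarity threshold are assumptions for illustration; the source does not specify how fingerprints are encoded or scored.

```python
import math

# Hypothetical library entries: fingerprint vector plus the cached solution.
library = {
    "cartpole-like": {"fingerprint": [4, 2, 0.9],
                      "algo": "PPO", "params": {"lr": 3e-4}},
    "atari-like":    {"fingerprint": [84, 18, 0.1],
                      "algo": "DQN", "params": {"lr": 1e-4}},
}

def cosine(a, b):
    """Cosine similarity between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def warm_start(new_fingerprint, threshold=0.99):
    """2B/2C: score the new fingerprint against the library; return the
    cached solution of the closest match, or None to fall back to 1A-1D."""
    score, name = max((cosine(new_fingerprint, e["fingerprint"]), n)
                      for n, e in library.items())
    if score >= threshold:
        return library[name]      # warm start from this cached solution
    return None                   # no close match: run the full Core loop

match = warm_start([5, 2, 0.85])  # closest to the cartpole-like entry
```

The threshold controls the trade-off the section describes: a miss falls back to the full Core (1A–1D), while a hit skips straight to a known-good configuration.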
3. MoE Librarians
A. Powering the Solution Library is a tiered hybrid library with a Mixture of Experts.
B. A Quick Reference section of common optimal Cores, accessed first for "quick thinking."
C. Categorized Shelves: a hierarchical database containing data from all RL runs, organized by fingerprint characteristics.
D. A Deep Archive of sub-optimal logs and intermediate results for "deep thinking."
E. Hierarchical Mixture-of-Experts models, each trained on particular fingerprints, are invoked by a gating network whenever the Categorized Shelves or Deep Archive must be consulted.
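The tiered lookup (3B–3E) can be sketched as a cascade: check the cheap tier first, and only route to deeper tiers when it misses. The tier contents and the rule-based gate below are assumptions; the source describes a learned gating network, which a hard-coded cascade only approximates.

```python
# Hypothetical tiers, keyed by a coarse fingerprint tuple.
quick_reference = {("small-obs", "discrete"): "PPO"}            # 3B
shelves = {("large-obs", "discrete"): ["DQN run records"]}      # 3C
deep_archive = {("large-obs", "continuous"): ["sub-optimal logs"]}  # 3D

def gate(fingerprint_key):
    """3E sketch: decide which tier (and thus which 'expert') answers.
    A real gating network would learn this routing from fingerprints."""
    if fingerprint_key in quick_reference:       # "quick thinking" hit
        return "quick", quick_reference[fingerprint_key]
    if fingerprint_key in shelves:               # consult a shelf expert
        return "shelves", shelves[fingerprint_key]
    # "deep thinking": fall through to the archive (may still miss)
    return "archive", deep_archive.get(fingerprint_key)

tier, result = gate(("small-obs", "discrete"))
```

The cascade order mirrors the cost argument in the section: most queries should terminate at the Quick Reference, so the expensive tiers and their experts are touched only when the gate sends the query deeper.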
4. Emergent Qualia
A. The system gets faster, smarter, and more adaptable with each new environment exposure.
B. It combines meta-training and meta-reasoning/meta-cognition in a single architecture.
C. Knowing when to "think fast" versus "think deep" becomes an emergent qualia.
D. Generalization to new tasks becomes more likely over time as the Solution Library grows.