Training
Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch
DoorDash has deployed a reinforcement learning system that adapts dispatch objective weights in a three-sided marketplace using delayed operational feedback such as delivery speed and courier utilization. A store-level policy selects a discrete multiplier that sets the tradeoff between delivery quality and batching efficiency. The policy is trained on centralized offline data with Double Q-learning and a conservative regularizer that mitigates value overestimation. The approach demonstrates the potential for safely adapting decision policies in real time using feedback from complex economic and logistics environments, improving batching and reducing costs without compromising delivery quality.
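To make the training objective concrete, here is a minimal NumPy sketch of a Double Q-learning target combined with a conservative penalty over a discrete action set. It is an illustration under stated assumptions, not the deployed implementation: the multiplier values, the function names, the CQL-style log-sum-exp form of the regularizer, and the weight `alpha` are all hypothetical, since the summary only says that a conservative regularizer is used.

```python
import numpy as np

# Hypothetical discrete multipliers a store-level policy could choose from.
MULTIPLIERS = np.array([0.8, 0.9, 1.0, 1.1, 1.2])


def double_q_targets(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double Q-learning target.

    The online estimate selects the next action and the target estimate
    evaluates it, which reduces the upward bias of a plain max operator.
    """
    rows = np.arange(len(rewards))
    best_actions = np.argmax(q_online_next, axis=1)
    next_values = q_target_next[rows, best_actions]
    return rewards + gamma * (1.0 - dones) * next_values


def conservative_penalty(q_values, actions):
    """Assumed CQL-style regularizer: logsumexp_a Q(s, a) - Q(s, a_logged).

    It pushes down Q-values of actions that are rare in the offline data.
    """
    rows = np.arange(len(actions))
    row_max = q_values.max(axis=1, keepdims=True)  # for numerical stability
    lse = row_max[:, 0] + np.log(np.exp(q_values - row_max).sum(axis=1))
    return float(np.mean(lse - q_values[rows, actions]))


def offline_loss(q_values, actions, targets, alpha=1.0):
    """Squared TD error on logged actions plus the weighted conservative penalty."""
    rows = np.arange(len(actions))
    td_error = q_values[rows, actions] - targets
    return float(np.mean(td_error ** 2) + alpha * conservative_penalty(q_values, actions))
```

In this sketch each row corresponds to one logged store-level transition, and `actions` holds the index into `MULTIPLIERS` that was actually applied. At serving time a greedy policy would pick `MULTIPLIERS[np.argmax(q_values, axis=1)]`.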
reinforcement-learning, marketplace, feedback