ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: logs exceed any LLM's context window by orders of magnitude; most users are long-tail; and surface-fluent profiles do not necessarily improve downstream prediction utility.

We present ProfiLLM, an agentic LLM data pipeline with two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive clustering rules, and regional supply–demand priors. (2) Utility-Aligned Profile Exploration generates candidate profiles per cluster, evaluates them via a lightweight downstream-utility proxy, iteratively refines the best candidate, and constructs preference pairs for DPO fine-tuning. A strict offline–online contract keeps all LLM reasoning offline; online serving reduces to a cached cluster-embedding lookup with sub-millisecond overhead and zero online LLM inference. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% AUC and up to +4.35% GMV in dispatching simulation, and a 14-day online A/B test shows +0.47% GMV, +0.33% CR, and −0.82% Cancel-Before-Accept.

Highlights

Behavioral signals that structured features miss are decisive for dispatch, yet an LLM cannot run inside a 2-second matching loop. ProfiLLM mines logs with an agentic toolkit, DPO-aligns cluster profiles to a downstream-utility proxy, and keeps all LLM reasoning offline — so the online path is a cached lookup that lifts prediction and matching quality at sub-millisecond cost.

Prediction quality · offline
+6.14%: outcome-prediction AUC
+4.35%: GMV in dispatching simulation

Live 14-day A/B · DiDi production
+0.47% GMV, +0.33% CR, −0.82% Cancel-Before-Accept

Serving cost
<0.01 ms: added latency per OD pair
10.6×: cheaper offline refresh

First deployed LLM profiler for a production ride-hailing dispatcher (2-s cycle, ~200 ms budget).
Agentic knowledge mining — an LLM agent chains 27 tools under Explore–Deepen–Validate–Synthesize.
Utility-aligned profiles — selected and DPO-aligned to a downstream-utility proxy.
Strict offline–online contract — batch-offline reasoning; serving is a cached lookup with zero online LLM inference.

Method Overview

ProfiLLM materializes a strict offline–online decoupling as a three-layer pipeline (see the overview figure above): all LLM reasoning runs in offline batch jobs, and the latency-critical dispatcher consumes only pre-computed artifacts.

Layer 1 · Offline

Tool-Augmented Knowledge Mining

An LLM agent equipped with 27 analytical tools mines platform-scale logs under an Explore–Deepen–Validate–Synthesize paradigm, producing global knowledge \(\mathcal{K}\), an interpretable clustering rule set \(\mathcal{A}\), and regional supply–demand priors \(\mathcal{R}\).
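
The four-phase paradigm can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy tools stand in for the 27-tool catalog, the `tool` decorator, `mine` function, and log fields are hypothetical names, and fixed control flow replaces the LLM agent's tool selection. It is not the deployed agent.

```python
from typing import Callable, Dict, List

Log = Dict[str, object]
# Hypothetical tool catalog; the paper's agent has 27 tools, two toy ones are shown.
TOOLS: Dict[str, Callable[[List[Log]], Dict[str, float]]] = {}

def tool(name: str):
    """Register a function in the (hypothetical) tool catalog."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("cancel_rate_by_region")
def cancel_rate_by_region(logs):
    totals, cancels = {}, {}
    for row in logs:
        region = row["region"]
        totals[region] = totals.get(region, 0) + 1
        cancels[region] = cancels.get(region, 0) + int(row["cancelled"])
    return {r: cancels[r] / totals[r] for r in totals}

@tool("order_count_by_region")
def order_count_by_region(logs):
    counts = {}
    for row in logs:
        counts[row["region"]] = counts.get(row["region"], 0) + 1
    return counts

def mine(logs, min_support=2):
    # Explore: run every registered tool once for a broad overview.
    overview = {name: fn(logs) for name, fn in TOOLS.items()}
    # Deepen: focus on the region with the highest cancel rate.
    rates = overview["cancel_rate_by_region"]
    focus = max(rates, key=rates.get)
    # Validate: keep the finding only if enough orders support it.
    if overview["order_count_by_region"][focus] < min_support:
        return []
    # Synthesize: emit a reusable knowledge entry.
    return [f"Region {focus} has a high cancel rate ({rates[focus]:.2f})."]

# Synthetic mock logs, not real platform data.
logs = [
    {"region": "north", "cancelled": True},
    {"region": "north", "cancelled": True},
    {"region": "north", "cancelled": False},
    {"region": "south", "cancelled": False},
    {"region": "south", "cancelled": False},
]
knowledge = mine(logs)
```

In an agentic setting the Deepen and Validate choices would be made by the LLM from intermediate tool outputs; the hard-coded logic above only shows how the phases hand results to one another.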

Layer 2 · Offline

Utility-Aligned Profile Exploration

For each cluster, candidate profiles are generated, scored by a lightweight LOGIC-rule utility proxy, iteratively refined on prediction-error feedback, and distilled via DPO into a single-pass generator; each profile is encoded once into a \(d\)-dimensional embedding.
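
The generate, score, refine, and pair-construction loop described above can be sketched as follows. This is a toy under stated assumptions: `mock_generate` and `mock_refine` stand in for LLM calls, the keyword-overlap `utility_proxy` is a placeholder rather than the paper's LOGIC-rule proxy, the list of missed signals stands in for prediction-error feedback, and the `margin` threshold for preference pairs is hypothetical.

```python
from itertools import combinations

def utility_proxy(profile, signals):
    """Toy proxy: fraction of known predictive signals the profile mentions."""
    return sum(s in profile for s in signals) / len(signals)

def mock_generate(cluster, k):
    """Stand-in for LLM sampling: returns k canned candidate profiles."""
    pool = [
        "Prefers short trips.",
        "Prefers short trips; avoids airport pickups.",
        "Friendly and reliable driver.",
    ]
    return pool[:k]

def mock_refine(profile, signals):
    """Stand-in for LLM refinement: add the first signal the profile misses."""
    for s in signals:
        if s not in profile:
            return f"{profile} Note: {s}."
    return profile

def explore(cluster, signals, k=3, rounds=2, margin=0.2):
    # Generate candidates and score each with the utility proxy.
    scored = [(utility_proxy(p, signals), p) for p in mock_generate(cluster, k)]
    # Iteratively refine the current best candidate.
    for _ in range(rounds):
        _, best = max(scored)
        refined = mock_refine(best, signals)
        if refined == best:  # nothing left to improve
            break
        scored.append((utility_proxy(refined, signals), refined))
    # Build (chosen, rejected) pairs whose utility gap is at least `margin`.
    ranked = sorted(scored, reverse=True)
    pairs = [
        {"prompt": cluster, "chosen": hi, "rejected": lo}
        for (s_hi, hi), (s_lo, lo) in combinations(ranked, 2)
        if s_hi - s_lo >= margin
    ]
    return ranked[0][1], pairs

signals = ["short trips", "airport", "night"]
best, pairs = explore("cluster_07", signals)
```

The resulting `pairs` have the prompt/chosen/rejected shape commonly used as DPO training input; the fine-tuning step itself is outside this sketch.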

Layer 3 · Online

Outcome Prediction & Matching

Per OD pair, serving performs only a deterministic cluster-rule lookup and a cached embedding fetch, concatenated with structured features for the production multi-task predictor. Zero online LLM inference; under 0.01 ms added per pair.

The only artifacts crossing the offline–online boundary are the rule set \(\mathcal{A}\) and the cluster-embedding table \(\{\mathbf{e}_a\}\) — the structural reason ProfiLLM fits within DiDi's 200 ms dispatch budget without modifying the matching stack.
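
A minimal sketch of the online path under this contract is given below. The rule predicates, cluster names, feature fields, the embedding dimension of 3, and the choice to attach one embedding for each side of the pair are all hypothetical; only the overall shape (deterministic rule lookup, cached embedding fetch, concatenation with structured features) follows the description above.

```python
# Hypothetical rule set A: ordered (predicate, cluster_id) pairs, first match wins.
RULES = [
    (lambda u: u["orders_30d"] <= 3, "cold_start"),
    (lambda u: u["night_share"] >= 0.5, "night_rider"),
]
DEFAULT_CLUSTER = "general"

# Hypothetical cluster-embedding table {e_a}, precomputed offline (d = 3 here).
EMBEDDINGS = {
    "cold_start": [0.1, 0.0, 0.3],
    "night_rider": [0.7, 0.2, 0.0],
    "general": [0.0, 0.0, 0.0],
}

def assign_cluster(user):
    """Deterministic rule lookup: no model call, just predicate checks."""
    for predicate, cluster_id in RULES:
        if predicate(user):
            return cluster_id
    return DEFAULT_CLUSTER

def build_features(structured, passenger, driver):
    """Concatenate structured features with the two cached cluster embeddings."""
    p_emb = EMBEDDINGS[assign_cluster(passenger)]
    d_emb = EMBEDDINGS[assign_cluster(driver)]
    return structured + p_emb + d_emb

passenger = {"orders_30d": 2, "night_share": 0.0}
driver = {"orders_30d": 40, "night_share": 0.8}
features = build_features([1.2, 0.5], passenger, driver)
```

Because both steps are dictionary and predicate operations over precomputed artifacts, refreshing profiles offline only swaps the two tables and leaves the serving code unchanged.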

Key Results

ProfiLLM consistently outperforms traditional and naive-LLM baselines across three cities. Below are the full result tables (transcribed for on-page reading) and the headline figures. The last two rows of each table (ProfiLLM-DPO and ProfiLLM) are our methods.

Table 1 — Dispatching simulation: relative improvement (%) over pickup-distance KM (Overall)

Static transcription of Table 1 (Overall columns). Morning/Noon/Evening breakdowns are in the full appendix.
Method	City A GMV	City A CR	City B GMV	City B CR	City C GMV	City C CR
TVal	+2.24	+2.14	+1.87	+1.63	+2.56	+2.48
GRC	+0.73	−3.42	+1.15	−2.18	+0.41	−1.87
Llama-3.3-70B	+2.34	+2.76	+1.92	+2.31	+2.68	+3.12
Qwen3-Next-80B	+2.41	+2.54	+2.08	+2.12	+2.75	+2.89
DeepSeek-R1	+2.53	+4.57	+2.17	+3.89	+2.91	+4.93
Kimi-K2	+1.96	+4.77	+1.63	+4.05	+2.24	+5.18
GPT-OSS-120B	+2.44	+5.75	+2.06	+5.12	+2.79	+6.08
Gemini-3-Flash	+1.41	+4.62	+1.08	+3.94	+1.72	+4.95
Gemini-3-Pro	+2.95	+5.48	+2.51	+4.83	+3.28	+5.81
ProfiLLM-DPO	+4.02	+6.03	+3.58	+5.47	+4.35	+6.41
ProfiLLM	+3.52	+7.10	+3.14	+6.52	+3.87	+7.53

Table 2 — Multi-task prediction AUC improvement (%) over Structured-Only

Static transcription of Table 2. The headline +6.14% is ProfiLLM on P-Cancel (P-Can column), City A. Note that several off-the-shelf backbones degrade prediction (e.g., Qwen3-Next-80B −7.57% on Success (Succ column), City B; Kimi-K2 −6.33% on P-Cancel, City B): profile fluency is not a reliable proxy for utility.
Method	City A Acc	City A D-Can	City A P-Can	City A Succ	City B Acc	City B D-Can	City B P-Can	City B Succ	City C Acc	City C D-Can	City C P-Can	City C Succ
Llama-3.3-70B	−1.10	−0.71	+0.19	−1.14	−0.64	+0.38	−0.38	−0.45	−0.01	−0.34	+0.25	+0.01
Qwen3-Next-80B	−0.22	−0.38	+1.65	+0.02	−0.52	−0.40	−5.71	−7.57	−0.03	−0.16	+0.27	−0.06
DeepSeek-R1	+0.06	+0.23	+2.05	+0.25	+0.31	+1.85	+1.06	+0.48	+0.21	−0.13	+0.04	+0.14
Kimi-K2	−0.17	+0.82	+2.11	−0.07	−2.44	−0.44	−6.33	−1.91	+0.50	−0.11	+0.40	+0.45
Gemini-3-Flash	+0.10	+0.53	+1.83	+0.42	+0.24	+1.76	−0.11	+0.38	+0.03	−0.26	+0.37	+0.04
Gemini-3-Pro	−0.08	−0.68	+2.37	+0.56	−0.44	+0.50	+0.24	−0.31	+0.02	−0.03	+0.10	+0.05
GPT-OSS-120B	−0.02	+0.14	+1.83	+0.17	+0.11	+1.64	+0.63	+0.29	−0.09	−0.02	+0.44	−0.06
ProfiLLM-DPO	+1.51	+2.76	+6.02	+1.72	+2.25	+4.98	+5.55	+2.58	+0.65	+5.93	+5.30	+2.37
ProfiLLM	+1.56	+3.88	+6.14	+1.80	+2.26	+4.98	+6.00	+2.60	+0.84	+5.95	+5.65	+2.48

Figure: 14-day online A/B (City A), bar chart. GMV and CR rise; CBA, PCR, DCR, and BER fall, so every realized rate moves in the desirable direction.

Figure: Ablation (City A), bar chart. Removing any component degrades metrics; Global Knowledge and User Clustering give the largest gains.

Figure: Cost–quality across nine LLM backbones, scatter plot. ProfiLLM-DPO lies on the Pareto frontier.

Figure: Long tail (City A), histogram of per-user order counts. 96% of passengers appear in ≤30 orders.

Claim → evidence map

Headline claim	Value	Where to verify
Outcome-prediction AUC	+6.14%	Table 2 above (P-Cancel, City A)
Dispatching simulation GMV	+4.35%	Table 1 above (City C, ProfiLLM-DPO)
Online A/B GMV / CBA	+0.47% / −0.82%	Appendix O — Extended 14-day A/B
Added online latency	<0.01 ms/pair	Appendix N — Complexity analysis
Offline refresh cost	10.6× cheaper	Appendix M — Offline system cost
Cluster coverage	96 / 348,464	Appendix M — Offline system cost (≈3,630×)
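
As a quick arithmetic check of the last row, the quoted factor is simply the ratio of the two counts. Reading 96 as the number of cluster-level profiles and 348,464 as the number of individual entities they stand in for is an assumption here; Appendix M gives the authoritative definition.

\[
\frac{348{,}464}{96} \approx 3{,}629.8 \approx 3{,}630
\]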

Code & Reproducibility

A sanitized, symbolic reference implementation of the ProfiLLM pipeline is released at github.com/ProfiLLM/ProfiLLM. It runs end-to-end on synthetic mock data with a local mock LLM (CPU-only, no build pipeline), and mirrors the paper's components: the 27-tool catalog, the Explore–Deepen–Validate–Synthesize mining agent, the profile-exploration loop with the LOGIC-rule utility proxy and DPO preference-pair construction, the prompt templates, and a replay-simulator interface.

Data & IP notice. To comply with enterprise data-governance and privacy policy, the public repository conveys the method and interfaces only. It contains no real ride-hailing data, user/driver/order identifiers, GPS traces, proprietary feature schemas, trained model weights, internal service endpoints, or deployment scripts. The published experimental results were produced on internal infrastructure (1× NVIDIA L20 GPU; Gemini-class analyst LLM) that is not part of this release.