Motivated by the tradeoff between exploitation and exploration in reinforcement learning, we study a continuous-time entropy-regularized mean-variance portfolio selection problem in the presence of jumps. A first key step is to derive a suitable formulation of the continuous-time problem. In the existing literature for the diffusion case (e.g., Wang, Zariphopoulou and Zhou, J. Mach. Learn. Res. 2020), the conditional mean and the conditional covariance of the controlled dynamics are derived heuristically by a law of large numbers argument. In order to capture the influence of jumps, we first explicitly model distributional controls on discrete-time partitions and identify a family of discrete-time integrators which incorporate the additional exploration noise. Refining the time grid, we prove convergence in distribution of the discrete-time integrators to a multi-dimensional Lévy process. This limit theorem gives rise to a natural continuous-time formulation of the exploratory control problem with entropy regularization. We solve this problem by adapting the classical Hamilton-Jacobi-Bellman approach. It turns out that the optimal feedback control distribution is Gaussian and that the optimal portfolio wealth process follows a linear stochastic differential equation, whose coefficients can be expressed explicitly in terms of the solution of a nonlinear partial integro-differential equation. We also provide a detailed comparison with the results derived by Wang and Zhou (Math. Finance, 2020) for the exploratory portfolio selection problem in the Black-Scholes model. The talk is based on joint work with Thuan Nguyen (Saarbrücken).