Abstract: Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically train ...