Abstract: Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and behave following ...