F³Set: Towards Analyzing Fast, Frequent, and Fine-grained Events from Videos

Analyzing Fast, Frequent, and Fine-grained (F³) events presents a significant challenge in video analytics and multi-modal LLMs. Current methods struggle to identify events that satisfy all three F³ criteria with high accuracy because of challenges such as motion blur and subtle visual discrepancies. To advance research in video understanding, we introduce F³Set, a benchmark that consists of video datasets for precise F³ event detection. Datasets in F³Set are characterized by their extensive scale and comprehensive detail, usually encompassing over 1,000 event types with precise timestamps and supporting multi-level granularity. F³Set currently contains several sports datasets, and the framework may be extended to other applications. We evaluated popular temporal action understanding methods on F³Set, revealing substantial challenges for existing techniques. Additionally, we propose a new method, F³ED, for F³ event detection, which achieves superior performance.


Example of detecting fast, frequent, and fine-grained events at precise moments.
