Chinese Researchers Overcome Long Video Processing Challenges with HiCo and VideoChat-Flash Innovations

2025-01-20

Long-context video modeling is among the most advanced capabilities that multi-modal large language models (MLLMs) have demonstrated in video processing: it enables models to handle content spanning several hours, including movies, documentaries, and live streams. Despite significant progress by large language models (LLMs) on video-understanding tasks such as subtitle generation and question answering, processing ultra-long videos remains challenging; the core difficulty is effectively comprehending the complex context within such lengthy footage.

Extensive research has explored this area, from training on large-scale text and frame corpora to building efficient training systems that support long-context parallelism and data packing. Even so, extended multi-modal contexts sharply reduce training and inference efficiency, and frame redundancy further complicates learning. Video token compression is an intriguing direction with great potential, but it requires careful handling to preserve detailed representations.

This article presents a recent study on compression methods for multi-modal long-context modeling. Researchers from the Shenzhen Institute of Advanced Technology introduced a hierarchical video token compression method called HiCo, along with a practical context modeling system named VideoChat-Flash, designed specifically for long-context video. By compressing tokens from the segment level up to the video level, HiCo effectively expands the context window while minimizing computational load and retaining essential information. VideoChat-Flash employs a multi-stage short-to-long learning scheme, a rich dataset of real-world long videos, and a training infrastructure for long-video MLLMs that supports a high degree of sequence parallelism.
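To make the scale of the problem concrete, the rough arithmetic below illustrates why per-frame token compression is what makes multi-hour video fit into an LLM context window. The per-frame token counts are hypothetical placeholders, not figures from the paper.

```python
# Illustrative back-of-the-envelope arithmetic (hypothetical numbers,
# not taken from the paper): why token compression matters for long video.

frames_sampled = 10_000            # e.g. a multi-hour video sampled at ~1 fps
tokens_per_frame_raw = 196         # typical ViT patch count without compression
tokens_per_frame_compressed = 16   # hypothetical post-compression budget

raw_context = frames_sampled * tokens_per_frame_raw
compressed_context = frames_sampled * tokens_per_frame_compressed

print(f"uncompressed visual tokens: {raw_context:,}")         # 1,960,000
print(f"compressed visual tokens:   {compressed_context:,}")  # 160,000
print(f"reduction factor:           {raw_context / compressed_context:.1f}x")
```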

HiCo achieves dense token representations and enlarges the context window through hierarchical compression. The researchers segment long videos into shorter clips for MLLM processing, and compression exploits the spatiotemporal redundancy within each clip. HiCo then combines the compressed tokens with the user query, leveraging semantic correlations between segments and with real-world knowledge to further reduce the number of tokens.
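A minimal sketch of this two-level idea follows; it uses average pooling as a stand-in for the paper's clip-level compression and cosine similarity to the query embedding as a stand-in for its query-aware token selection, so every operator here is an illustrative assumption rather than HiCo's actual implementation.

```python
import torch
import torch.nn.functional as F

def compress_clip(clip_tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Clip-level compression: merge spatiotemporally redundant tokens.

    Average pooling over fixed-size groups is used here as a placeholder
    for whatever learned merging the real system applies.
    clip_tokens: (num_tokens, dim)
    """
    n, d = clip_tokens.shape
    group = max(1, n // keep)
    usable = (n // group) * group
    return clip_tokens[:usable].view(-1, group, d).mean(dim=1)

def compress_video(clips: list[torch.Tensor], query_emb: torch.Tensor,
                   keep_per_clip: int, keep_total: int) -> torch.Tensor:
    """Video-level compression: pool each clip, then keep only the tokens
    most semantically related to the user query."""
    pooled = torch.cat([compress_clip(c, keep_per_clip) for c in clips], dim=0)
    # Cosine similarity between each pooled token and the query embedding.
    sims = F.cosine_similarity(pooled, query_emb.unsqueeze(0), dim=-1)
    top = sims.topk(min(keep_total, pooled.shape[0])).indices
    return pooled[top.sort().values]  # preserve temporal order

# Toy usage: 8 clips of 1,568 raw visual tokens each, 4096-dim features.
clips = [torch.randn(1568, 4096) for _ in range(8)]
query = torch.randn(4096)
video_tokens = compress_video(clips, query, keep_per_clip=128, keep_total=512)
print(video_tokens.shape)  # torch.Size([512, 4096])
```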

In the VideoChat-Flash system, the researchers adopted a multi-stage short-to-long learning approach: supervised fine-tuning starts with short videos and their captions and question-answer pairs, gradually moves to longer videos, and ultimately trains on mixed-length corpora. Short videos prove highly effective for strengthening basic visual perception and for representing long videos concisely. To support this, the researchers assembled an extensive fine-tuning dataset comprising 300,000 hours of video and 2 billion words of annotations.
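The staged curriculum can be pictured as a simple training schedule. The stage names, frame budgets, and data mixes below are illustrative placeholders, not the paper's exact recipe, and `train_stage` is a hypothetical callback standing in for one fine-tuning pass.

```python
# Hypothetical sketch of a short-to-long curriculum; stage names, frame
# budgets, and data mixes are placeholders, not the paper's exact recipe.
CURRICULUM = [
    {"stage": "short_sft", "max_frames": 64,   "data": ["short_captions", "short_qa"]},
    {"stage": "long_sft",  "max_frames": 512,  "data": ["long_captions", "long_qa"]},
    {"stage": "mixed_sft", "max_frames": 1024, "data": ["short_qa", "long_qa", "temporal_grounding"]},
]

def run_curriculum(train_stage):
    """train_stage(stage_cfg) is assumed to run one supervised fine-tuning
    pass with the given frame budget and data mixture."""
    for cfg in CURRICULUM:
        print(f"stage={cfg['stage']} max_frames={cfg['max_frames']}")
        train_stage(cfg)
```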

The paper also introduces a multi-hop variant of the "needle in a haystack" (NIAH) task for video. Traditional NIAH evaluations ask a model to locate a specific inserted image, find a target word, or answer a question within a video, but such probes often reward visual distinctiveness rather than contextual understanding. To address this limitation, the researchers propose a new benchmark, "multi-hop needle in a video haystack," which requires the model to locate a series of interrelated indicative images, where the subsequent images can only be found using clues from the first image.
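A rough sketch of how such a multi-hop probe could be constructed and scored is shown below; the insertion and scoring logic is a guess at the general shape of the task, not the benchmark's actual code.

```python
import random

def build_multihop_needles(num_frames: int, hops: int, seed: int = 0):
    """Insert 'needle' frames into a long video: the first needle carries the
    clues (represented here simply as frame indices) needed to find the rest."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(num_frames), hops))
    first, later = positions[0], positions[1:]
    return first, later

def score(predicted: list[int], first: int, later: list[int]) -> bool:
    """Credit only a full chain: the model must report the first needle and
    every subsequent one, since the later needles are unreachable without
    the clues contained in the first."""
    return predicted == [first] + later

first, later = build_multihop_needles(num_frames=10_000, hops=4)
print(first, later)
print(score([first] + later, first, later))  # True
```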

Experimental results show that the approach reduces computation by up to two orders of magnitude. On mainstream short- and long-video benchmarks, VideoChat-Flash performed exceptionally well at both the 2B and 7B scales. At the 7B scale, it surpassed all competing methods, setting a new frontier in short-video understanding. On long-video understanding, it also outperformed previous open-source MLLMs across multiple benchmarks, achieving state-of-the-art (SOTA) results. The model further demonstrated strong temporal localization, with zero-shot performance exceeding that of many well-known MLLMs. In an NIAH test spanning more than 10,000 frames, VideoChat-Flash reached an accuracy of 99.1%.

In summary, the researchers presented HiCo, a hierarchical compression technique, and VideoChat-Flash, an MLLM trained with an innovative multi-stage scheme. The approach advances compression technology to reduce the computational demands of long-context video while surpassing current SOTA models in accuracy.