VisKnow: Constructing Visual Knowledge Base for Object Understanding

Key Laboratory of AI Safety of CAS, Institute of Computing Technology,
Chinese Academy of Sciences (CAS), Beijing, China

This project page is being actively updated.

PDF Snapshot

The multi-modal AnimalKB constructed by the proposed VisKnow framework can be applied in various ways, including enhancing knowledge-related visual tasks, providing the annotations required for constructing benchmarks, and assisting downstream scenarios.

Abstract

Understanding objects is fundamental to computer vision. Beyond object recognition, which typically outputs only a category label, in-depth object understanding entails a comprehensive perception of an object category, covering its components, appearance characteristics, inter-category relationships, contextual background knowledge, and more. Developing such capability requires sufficient multi-modal data, including visual annotations such as parts, attributes, and co-occurrences for specific tasks, as well as textual knowledge to support high-level tasks like reasoning and question answering. However, such data are generally task-oriented and not organized systematically enough to achieve the expected understanding of object categories. In response, we propose the Visual Knowledge Base, which structures multi-modal object knowledge as graphs, and present VisKnow, a construction framework that extracts multi-modal, object-level knowledge for object understanding. The framework integrates enriched, aligned text- and image-source knowledge with region annotations at both object and part levels through a combination of expert design and large-scale model application. As a case study, we construct AnimalKB, a structured animal knowledge base covering 406 animal categories, which contains 22K textual knowledge triplets extracted from encyclopedic documents, 420K images, and corresponding region annotations. A series of experiments showcases how AnimalKB enhances object-level visual tasks such as zero-shot recognition and fine-grained VQA, and serves as challenging benchmarks for knowledge graph completion and part segmentation. Our findings highlight the potential of automatically constructing visual knowledge bases to advance visual understanding and its practical applications.
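To make the structure concrete, here is what a single AnimalKB record could look like, with textual knowledge stored as (head, relation, tail) triplets and each image carrying object- and part-level region annotations. The field names and values below are illustrative assumptions, not the released schema.

```python
# A hypothetical AnimalKB record for one category (illustrative only; not the released schema).
# Textual knowledge is kept as (head, relation, tail) triplets, and each image carries
# object- and part-level region annotations.
animalkb_entry = {
    "category": "red fox",
    "triplets": [
        ("red fox", "has_part", "bushy tail"),
        ("red fox", "has_attribute", "reddish-orange fur"),
        ("red fox", "habitat", "forests and grasslands"),
    ],
    "images": [
        {
            "file": "red_fox/000001.jpg",                      # placeholder path
            "object_box": [34, 12, 420, 310],                  # xyxy pixel coordinates
            "part_masks": {"head": "<rle>", "torso": "<rle>", "leg": "<rle>"},  # placeholder masks
        }
    ],
}
```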

Automatic Construction Pipeline

Construction Pipeline Diagram
Overview of the VisKnow construction pipeline. The process integrates multi-modal knowledge from both textual and visual sources to build a comprehensive visual knowledge base.
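As a rough illustration of that flow, the sketch below walks a batch of categories through the stages implied above: collect encyclopedic text, extract knowledge triplets, collect images, and attach object- and part-level region annotations. Every helper function is a placeholder standing in for the framework's actual components, not the released pipeline code.

```python
# Rough sketch of the construction flow. Every helper below is a stand-in for the
# framework's actual components (LLM-based triplet extraction, image collection,
# region annotation) and returns dummy values.
def collect_encyclopedic_documents(category):
    return [f"Encyclopedic article about the {category}."]      # placeholder text source

def extract_triplets_with_llm(docs):
    return [("<head>", "<relation>", "<tail>")]                  # placeholder triplets

def collect_images(category):
    return [f"{category}/000001.jpg"]                            # placeholder image paths

def annotate_regions(images, parts):
    return {img: {p: None for p in parts} for img in images}     # placeholder part masks

def build_visual_knowledge_base(categories):
    kb = {}
    for category in categories:
        docs = collect_encyclopedic_documents(category)          # textual source
        triplets = extract_triplets_with_llm(docs)               # structured knowledge
        images = collect_images(category)                        # visual source
        regions = annotate_regions(images, parts=["head", "torso", "leg"])
        kb[category] = {"triplets": triplets, "images": images, "regions": regions}
    return kb

print(build_visual_knowledge_base(["red fox"]))
```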

AnimalKB Knowledge Graph

AnimalKB Visual Annotation

Data Preview
Examples of the animals and parts in AnimalKB.

KB-assisted Downstream Tasks

Table 2. Zero-shot performance of CLIP with KB

Setting ViT-B/16 ViT-L/14
CLIP Baseline 67.58 74.26
CLIP Subclass 69.38 76.13
CLIP Subclass + Visual Knowledge 69.92 77.17
CLIP Subclass + Non-visual Knowledge 70.33 77.32
CLIP Subclass + All Knowledge 70.49 77.55
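
A common way to inject such knowledge into CLIP's zero-shot classifier is prompt ensembling: each category gets several text prompts (its name plus knowledge-derived sentences), and their embeddings are averaged into a single classifier weight. The sketch below illustrates this with the openai/CLIP package; the knowledge sentences, image path, and prompt wording are illustrative assumptions and may differ from the paper's exact setup.

```python
# Minimal sketch of knowledge-augmented zero-shot classification with CLIP.
# Prompts, categories, and the image path are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Per-category prompts: the plain name plus knowledge-derived sentences.
class_prompts = {
    "red fox": [
        "a photo of a red fox",
        "a photo of a red fox, an animal with reddish-orange fur and a bushy tail",   # visual knowledge
        "a photo of a red fox, an animal that lives in forests and grasslands",        # non-visual knowledge
    ],
    "gray wolf": [
        "a photo of a gray wolf",
        "a photo of a gray wolf, an animal with thick gray fur and a long muzzle",
        "a photo of a gray wolf, an animal that hunts in packs",
    ],
}

with torch.no_grad():
    # Average each category's prompt embeddings into one classifier weight.
    class_weights = []
    for prompts in class_prompts.values():
        tokens = clip.tokenize(prompts).to(device)
        emb = model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
        class_weights.append(emb.mean(dim=0))
    class_weights = torch.stack(class_weights)
    class_weights = class_weights / class_weights.norm(dim=-1, keepdim=True)

    # Classify an image by cosine similarity to the averaged class embeddings.
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)   # placeholder path
    image_emb = model.encode_image(image).float()
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ class_weights.T).softmax(dim=-1)

print(dict(zip(class_prompts.keys(), probs[0].tolist())))
```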

Table 3. Performance of closed-source MLLMs with and without KB

Model Visual Non-visual All
without KB
gpt-4o (1120) 78.95 76.81 77.88
gpt-4o (0806) 79.36 77.13 78.25
gpt-4o-mini 69.78 66.86 68.32
gemini-2.0-flash 84.95 81.96 83.46
gemini-1.5-pro 78.68 78.68 79.71
claude-3.5-sonnet 81.05 77.82 79.44
claude-3-haiku 63.82 59.14 61.48
with KB
gpt-4o (1120) 98.26 97.87 98.06
gpt-4o (0806) 98.68 97.79 98.24
gpt-4o-mini 94.04 95.07 94.56
gemini-2.0-flash 98.16 97.65 97.90
gemini-1.5-pro 97.01 96.54 96.78
claude-3.5-sonnet 97.43 96.72 97.07
claude-3-haiku 92.06 92.21 92.13

Table 4. Performance of open-source MLLMs with and without KB

Model Size Visual Non-visual All
without KB
DeepSeek-VL2 27.5 B 73.75 75.69 74.72
DeepSeek-VL2-Small 16.1 B 66.13 68.70 67.41
Qwen2.5-VL-7B 8.3 B 74.44 71.25 72.84
Qwen2-VL-7B 8.3 B 74.56 72.84 73.70
InternVL2.5-8B 8.1 B 71.27 69.09 70.18
InternVL2.5-26B 25.5 B 76.52 72.55 74.53
LLaVA-v1.6-Vicuna-7B 7.1 B 65.76 60.91 63.33
LLaVA-Next-Llama3-8B 8.4 B 65.47 63.75 64.61
LLaVA-OV-Qwen2-7B 8.0 B 71.94 65.27 68.60
with KB
DeepSeek-VL2 27.5 B 84.66 85.10 84.88
DeepSeek-VL2-Small 16.1 B 85.39 84.73 85.06
Qwen2.5-VL-7B 8.3 B 87.06 85.42 86.24
Qwen2-VL-7B 8.3 B 87.67 86.91 87.29
InternVL2.5-8B 8.1 B 87.23 86.47 86.85
InternVL2.5-26B 25.5 B 87.99 88.04 88.01
LLaVA-v1.6-Vicuna-7B 7.1 B 80.71 80.86 80.78
LLaVA-Next-Llama3-8B 8.4 B 83.41 84.19 83.80
LLaVA-OV-Qwen2-7B 8.0 B 83.28 81.81 82.55
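
For the MLLM results, KB knowledge can simply be retrieved and prepended to the question in the prompt. The sketch below shows one way to do this with the OpenAI Python SDK; the verbalized knowledge, question, image path, and prompt format are illustrative assumptions rather than the evaluation protocol used in the paper.

```python
# Minimal sketch of prepending KB knowledge to an MLLM prompt for fine-grained VQA.
# The knowledge text, question, and image path are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("example.jpg", "rb") as f:               # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# Knowledge retrieved from the KB for the pictured category (illustrative triplet verbalization).
kb_context = (
    "Knowledge about the red fox: it has a bushy tail; its fur is reddish-orange; "
    "it lives in forests and grasslands."
)
question = "What characteristic of this animal's tail helps distinguish it from similar species?"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": f"{kb_context}\n\nQuestion: {question}"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```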

KB as Benchmark

Table 5. Performance of KGC models on the link prediction task

Model MRR HITS@1 HITS@3 HITS@10
Embedding-based methods
TransE 18.1 11.5 22.0 29.8
ComplEx 12.0 10.7 12.2 14.7
DistMult 11.0 9.2 11.3 14.1
RotatE 19.6 15.8 21.4 26.5
Text-based methods
KG-BERT 21.1 12.3 23.9 38.2
StAR 30.9 23.6 33.7 44.6
SimKGC 38.6 32.4 40.7 50.2
LLM-based methods
gpt-4o (0806) 42.9 34.3 46.0 60.8
gemini-1.5-pro 44.2 28.8 52.3 81.1
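
For reference, the link-prediction metrics above are the standard ones: each test triplet has one entity masked, candidate entities are ranked, and MRR / Hits@k are computed from the rank of the gold entity. A generic sketch of that computation (not the benchmark's evaluation code):

```python
# Compute MRR and Hits@k from the 1-based rank of the gold entity per test query.
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

# Example: three queries where the gold entity was ranked 1st, 4th, and 12th.
mrr, hits = mrr_and_hits([1, 4, 12])
print(f"MRR={mrr:.3f}", {f"Hits@{k}": round(v, 3) for k, v in hits.items()})
```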

Table 6. Performance comparison of models on the instance segmentation task (AP, AP50, AP75: totals over all parts; head, torso, leg: per-part AP)

Model Finetune AP AP50 AP75 head torso leg
VLPart – 5.09 8.86 5.23 18.12 4.33 4.79
VLPart ✔️ 16.07 27.38 16.70 27.39 24.32 5.84
PartGLEE – 14.50 25.96 14.28 43.16 17.91 19.33
PartGLEE ✔️ 31.98 50.63 32.78 55.21 41.36 31.01

Table 7. Performance comparison of models on the semantic segmentation task (mIoU, fwIoU, and per-part IoU for head, torso, and leg)

Model Finetune mIoU fwIoU head torso leg
CLIPSeg – 7.47 16.16 35.38 0.26 40.72
CLIPSeg ✔️ 14.40 37.61 55.87 52.74 42.41
PartGLEE – 24.24 37.97 55.84 28.78 56.44
PartGLEE ✔️ 36.87 44.96 48.62 51.91 62.29
VLPart-sem – 7.43 9.19 15.77 4.35 7.50
VLPart-sem ✔️ 25.77 42.49 45.98 58.46 53.53
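
For reference, mIoU and fwIoU in Table 7 are the standard semantic-segmentation metrics derived from a confusion matrix: the mean of per-class IoUs and their pixel-frequency-weighted mean, respectively. A generic sketch (not the benchmark's evaluation code), using a toy 3-class confusion matrix:

```python
# Per-class IoU, mIoU, and fwIoU from a confusion matrix.
import numpy as np

def segmentation_metrics(conf):
    """conf[i, j] = number of pixels with ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    per_class_iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    miou = np.nanmean(per_class_iou)                      # unweighted mean over classes
    freq = conf.sum(axis=1) / conf.sum()                  # pixel frequency of each class
    fwiou = np.nansum(freq * per_class_iou)               # frequency-weighted mean
    return per_class_iou, miou, fwiou

# Toy confusion matrix over three part classes (head / torso / leg).
conf = np.array([[80, 10, 10],
                 [ 5, 70, 25],
                 [ 8, 12, 80]])
iou, miou, fwiou = segmentation_metrics(conf)
print("per-class IoU:", iou.round(3), "mIoU:", round(miou, 3), "fwIoU:", round(fwiou, 3))
```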

BibTeX


@article{yao2025visknow,
  title = {VisKnow: Constructing Visual Knowledge Base for Object Understanding},
  author = {Yao, Ziwei and Wan, Qiyang and Wang, Ruiping and Chen, Xilin},
  journal = {arXiv preprint arXiv:2512.08221},
  year = {2025}
}