2023.12 - Present
Researcher, Apple
Multi-Modal Generative Modeling
Cupertino, CA
2023.08 - 2023.12
Senior Researcher, Tencent
Graphics, Image-to-3D
Palo Alto, CA
2022.06 - 2023.07
Research Scientist, FAIR Labs
Meta AI
Menlo Park, CA
2021.06 - 2022.06
Applied Scientist II, Amazon
M5 Search Science & AI
Visual Search & AR
Palo Alto, CA
2017.09 - 2021.05
University of Southern California
ECE PhD
2020.05 - 2021.01
Amazon Applied Science Intern
Semi/Self-Supervised Learning
2019.06 - 2019.09
Amazon A9 Internship
Graph-Convolution based Recommendation


Self/Semi-Supervised Metric Learning
NeRF & SDF
Developments in Detection

Personal Introduction

I am a researcher with a strong track record in Computer Vision and Artificial Intelligence. I am particularly interested in multi-modality learning, foundation models, and generative AI. I study how these models fundamentally work, and I am equally hands-on with production stacks built to leading industry standards, marrying technology with products. My latest research and engineering focus is multi-modal generative modeling, covering the generation and understanding of language, images, and videos.

Academic Service

Associate Editor for APSIPA Transactions on Signal and Information Processing

Reviewer: CVPR, ECCV, ICCV, ICML, NeurIPS, ACL, EMNLP, TOMM, RA-L, ICIP

Recent News

One paper to appear in NeurIPS 2022!
One paper to appear in ICPR 2022!
One paper to appear in ICMR 2022 as an Oral Presentation!
Two papers accepted to CVPR 2022!

Selected Publications




Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective
Contrastive learning (CL) has been the de facto technique for self-supervised representation learning (SSL), with impressive empirical success such as in multi-modal representation learning. However, traditional CL loss only considers negative samples from a minibatch, which could cause biased gradients due to the non-decomposability of the loss. For the first time, we consider optimizing a more generalized contrastive loss, where each data sample is associated with an infinite number of negative samples. We show that directly using minibatch stochastic optimization could lead to gradient bias. To remedy this, we propose an efficient Bayesian data augmentation technique to augment the contrastive loss into a decomposable one, where standard stochastic optimization can be directly applied without gradient bias. Specifically, our augmented loss defines a joint distribution over the model parameters and the augmented parameters, which can be conveniently optimized by a proposed stochastic expectation-maximization algorithm.
Changyou Chen, Jianyi Zhang, Yi Xu, Liqun Chen, Jiali Duan, Yiran Chen, Son Tran, Belinda Zeng, Trishul Chilimbi
NeurIPS, 2022
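
For context, the sketch below shows the standard in-batch (InfoNCE-style) contrastive loss that this analysis targets, where negatives are drawn only from the current minibatch; the paper's Bayesian augmentation and stochastic EM procedure are not reproduced here, and all tensor names are illustrative.

# Minimal sketch of an in-batch contrastive (InfoNCE-style) loss.
# Negatives come only from the current minibatch, which is the source
# of the gradient bias analyzed in the paper. Names are illustrative.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) embeddings from the two modalities
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Each image is matched to its paired text; all other texts in the
    # minibatch serve as negatives (and symmetrically for text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)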



Representation Codebook for Multi-Modal Alignment
Aligning signals from different modalities is an important step in vision-language representation learning, as it affects the performance of later stages such as cross-modality fusion. Since image and text typically reside in different regions of the feature space, directly aligning them at the instance level is challenging, especially when features are still evolving during training. In this paper, we propose to align at a higher and more stable level using cluster representations. Specifically, we treat image and text as two "views" of the same entity, and encode them into a joint vision-language coding space spanned by a dictionary of cluster centers (codebook). We contrast positive and negative samples via their cluster assignments while simultaneously optimizing the cluster centers. To further smooth out the learning process, we adopt a teacher-student distillation paradigm, where the momentum teacher of one view guides the student learning of the other. We evaluate our approach on common vision-language benchmarks and obtain a new SoTA on zero-shot cross-modality retrieval while remaining competitive on various other transfer tasks.
Jiali Duan*, Liqun Chen*, Son Tran, Jinyu Yang, Yi Xu, Belinda Zeng, Trishul Chilimbi
CVPR, 2022
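
As an illustration of the general idea only (not the authors' released implementation), the sketch below assigns image and text features to a shared codebook of cluster centers and encourages the two views to agree on their assignments; the codebook size, momentum teacher, and distillation details are assumptions and are omitted or simplified.

# Illustrative sketch: align image and text via soft assignments to a
# shared codebook of cluster centers, rather than at the instance level.
# Hyperparameters and the teacher-student distillation are omitted.
import torch
import torch.nn.functional as F

def codebook_alignment_loss(img_emb, txt_emb, codebook, temperature=0.1):
    # img_emb, txt_emb: (batch, dim); codebook: (num_codes, dim)
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    codes = F.normalize(codebook, dim=-1)
    # Soft cluster assignments over the shared codebook for each view.
    p_img = F.softmax(img_emb @ codes.t() / temperature, dim=-1)
    p_txt = F.softmax(txt_emb @ codes.t() / temperature, dim=-1)
    # Encourage the two views of the same entity to agree on their
    # cluster assignments (cross-entropy in both directions).
    loss = -0.5 * ((p_txt.detach() * torch.log(p_img + 1e-8)).sum(-1).mean()
                   + (p_img.detach() * torch.log(p_txt + 1e-8)).sum(-1).mean())
    return loss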



Vision-Language Pre-Training with Triple Contrastive Learning
In this paper, we propose triple contrastive learning (TCL) for vision-language pre-training by leveraging both cross-modal and intra-modal self-supervision. Besides cross-modal alignment (CMA), TCL introduces an intra-modal contrastive objective to provide complementary benefits in representation learning. To take advantage of localized and structural information from image and text input, TCL further maximizes the average mutual information (MI) between local regions of image/text and their global summary. To the best of our knowledge, ours is the first work that takes into account local structure information for multi-modality representation learning. Experimental evaluations show that our approach is competitive and achieves a new state of the art on common downstream vision-language tasks such as image-text retrieval and visual question answering.
Jinyu Yang, Jiali Duan, Son Tran, Liqun Chen, Yi Xu, Belinda Zeng, Trishul Chilimbi
CVPR, 2022



Augmenting Vision Language Pretraining by Learning Codebook with Visual Semantics
The language modality within the vision-language pre-training framework is innately discretized, endowing each word in the vocabulary with a semantic meaning. In contrast, the visual modality is inherently continuous and high-dimensional, which potentially prohibits the alignment as well as fusion between the vision and language modalities. We therefore propose to "discretize" the visual representation by jointly learning a codebook that imbues each visual token with a semantic meaning. We then utilize these discretized visual semantics as self-supervised ground truths for building our Masked Image Modeling objective, a counterpart of Masked Language Modeling, which has proven successful for language models. To optimize the codebook, we extend the formulation of VQ-VAE, which gives a theoretical guarantee. Experiments validate the effectiveness of our approach across common vision-language benchmarks.
Xiaoyuan Guo*, Jiali Duan*, C.-C. Jay Kuo, Judy Wawira Gichoya, and Imon Banerjee
ICPR, 2022



SLADE: A Self-Training Framework For Distance Metric Learning
Most existing distance metric learning approaches use fully labeled data to learn the sample similarities in an embedding space. We present a self-training framework, SLADE, to improve retrieval performance by leveraging additional unlabeled data. We first train a teacher model on the labeled data and use it to generate pseudo labels for the unlabeled data. We then train a student model on both labels and pseudo labels to generate final feature embeddings. We use self-supervised representation learning to initialize the teacher model. To better deal with noisy pseudo labels generated by the teacher network, we design a new feature basis learning component for the student network, which learns basis functions of feature representations for unlabeled data. The learned basis vectors better measure the pairwise similarity and are used to select high-confidence samples for training the student network. We evaluate our method on standard retrieval benchmarks: CUB-200, Cars-196 and In-shop. Experimental results demonstrate that our approach significantly improves the performance over the state-of-the-art methods.
Jiali Duan, Yen-Liang Lin, Son Tran, Larry Davis and C.-C. Jay Kuo
CVPR, 2021
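
A highly simplified sketch of the self-training step described above, assuming a generic teacher embedding function and a clustering-based pseudo-labeler; the feature basis learning and high-confidence sample selection from the paper are not shown.

# Simplified self-training sketch: a teacher trained on labeled data
# pseudo-labels unlabeled data (here via k-means on its embeddings),
# and a student is then trained on both. Feature-basis learning and
# confidence-based sample selection from the paper are omitted.
import numpy as np
from sklearn.cluster import KMeans

def generate_pseudo_labels(teacher_embed_fn, unlabeled_images, num_clusters=100):
    # teacher_embed_fn: maps a batch of images to an (n, dim) embedding array
    feats = teacher_embed_fn(unlabeled_images)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(feats)
    return kmeans.labels_        # pseudo class labels for the unlabeled images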



PortraitGAN for Flexible Portrait Manipulation
Previous methods have dealt with discrete manipulation of facial attributes such as smiling, sadness, anger, and surprise drawn from canonical expressions; they are not scalable and operate in a single modality. In this paper, we propose a novel framework that supports continuous edits and multi-modality portrait manipulation using adversarial learning. Specifically, we adapt cycle-consistency to the conditional setting by leveraging additional facial-landmark information. This has two effects: first, cycle mapping induces bidirectional manipulation and identity preservation; second, paired samples from different modalities can thus be utilized. To ensure high-quality synthesis, we adopt a texture loss that enforces texture consistency and multi-level adversarial supervision that facilitates gradient flow. Quantitative and qualitative experiments show the effectiveness of our framework in performing flexible, multi-modality portrait manipulation with photo-realistic effects.
Jiali Duan, Xiaoyuan Guo, and C.-C. Jay Kuo
APSIPA, 2020



Robot Learning via Human Adversarial Games
Much work in robotics has focused on “human-in-the-loop” learning techniques that improve the efficiency of the learning process. However, these algorithms make the strong assumption of a cooperative human supervisor who assists the robot; in reality, human observers tend to also act in an adversarial manner toward deployed robotic systems. We show that this can in fact improve the robustness of the learned models, and propose a physical framework that leverages perturbations applied by a human adversary to guide the robot toward more robust models. In a manipulation task, we show that grasping success improves significantly when the robot trains with a human adversary as compared to training in a self-supervised manner.
Jiali Duan, Qian Wang, Lerrel Pinto, C.-C. Jay Kuo, and Stefanos Nikolaidis
IROS, 2019 (Best Paper Finalist), USC Media



Interpretable Convolutional Neural Networks via Feedforward Design
The model parameters of convolutional neural networks (CNNs) are determined by backpropagation (BP). In this work, we propose an interpretable feedforward (FF) design without any BP as a reference. The FF design adopts a data-centric approach. It derives network parameters of the current layer based on data statistics from the output of the previous layer in a one-pass manner. To construct convolutional layers, we develop a new signal transform, called the Saab (Subspace approximation with adjusted bias) transform. It is a variant of the principal component analysis (PCA) with an added bias vector to annihilate activation’s nonlinearity. Multiple Saab transforms in cascade yield multiple convolutional layers. As to fully-connected (FC) layers, we construct them using a cascade of multi-stage linear least squared regressors (LSRs). The classification and robustness (against adversarial attacks) performances of BP- and FF-designed CNNs applied to the MNIST and the CIFAR-10 datasets are compared. Finally, we comment on the relationship between BP and FF designs.
C.-C. Jay Kuo, Min Zhang, Siyang Li, Jiali Duan and Yueru Chen
JVCI, 2019 (Best Paper Award)
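
To make the feedforward idea concrete, here is a rough single-stage sketch in the spirit of a PCA-based (Saab-like) transform: convolutional filters are derived from the statistics of image patches rather than learned by backpropagation. The exact bias construction of the Saab transform is simplified away, and all names and parameters are illustrative.

# Rough sketch of deriving convolutional filters from data statistics
# (PCA over image patches) instead of backpropagation, in the spirit of
# the Saab transform. The Saab bias term is deliberately simplified out.
import numpy as np

def pca_filters_from_patches(images, patch_size=5, num_filters=6):
    # images: (n, H, W) grayscale array; collect all patch_size x patch_size patches
    patches = []
    for img in images:
        for i in range(img.shape[0] - patch_size + 1):
            for j in range(img.shape[1] - patch_size + 1):
                patches.append(img[i:i + patch_size, j:j + patch_size].ravel())
    patches = np.asarray(patches, dtype=np.float64)
    patches -= patches.mean(axis=0)                      # remove the mean (DC) component
    # Principal components of the patch covariance serve as filter kernels.
    cov = patches.T @ patches / len(patches)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:num_filters]]
    return top.T.reshape(num_filters, patch_size, patch_size)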



A Unified Framework for Multi-Modal Isolated Gesture Recognition
In this paper, we focus on isolated gesture recognition and explore different modalities by involving RGB stream, depth stream and saliency stream for inspection. Our goal is to push the boundary of this realm even further by proposing a unified framework which exploits the advantages of multi-modality fusion. Specifically, a spatial-temporal network architecture based on consensus-voting has been proposed to explicitly model the long term structure of the video sequence and to reduce estimation variance when confronted with comprehensive inter-class variations. In addition, a 3D depth-saliency convolutional network is aggregated in parallel to capture subtle motion characteristics.
Jiali Duan, Jun Wan, Shuai Zhou, Xiaoyuan Guo, and Stan Z. Li
ACM-TOMM, 2017


Multi-Modality Fusion based on Consensus-Voting and 3D Convolution for Isolated Gesture Recognition
We propose a convolutional two-stream consensus voting network (2SCVN) which explicitly models both the short-term and long-term structure of the RGB sequences. To alleviate distractions from background, a 3D depth-saliency ConvNet stream (3DDSN) is aggregated in parallel to identify subtle motion characteristics. These two components in a unified framework significantly improve the recognition accuracy. On the challenging ChaLearn IsoGD benchmark, our proposed method outperforms the first place on the leaderboard by a large margin (10.29%) while also achieving the best result on the RGBD-HuDaAct dataset (96.74%).
Jiali Duan, Shuai Zhou, Jun Wan, Xiaoyuan Guo, and Stan Z. Li
Technical Report, 2016


Face Classification: A Specialized Benchmark Study
We conduct a specialized benchmark study in this paper, which focuses on face classification. We start with face proposals, and build a benchmark dataset with about 3.5 million patches for two-class face/non-face classification. Results with several baseline algorithms show that, without the help of post-processing, the performance of face classification itself is still not very satisfactory, even with a powerful CNN method. We will release this benchmark to help assess the performance of face classification alone and to ease the participation of other researchers.
Jiali Duan, Shengcai Liao, Shuai Zhou, and Stan Z. Li
CCBR, 2016 (Best Student Paper)


Face Detection by Aggregating Visible Components
In this paper, we propose a novel face detection method called Aggregating Visible Components (AVC), which addresses pose variations and occlusions simultaneously in a single framework with low complexity. The main contributions of this paper are: (1) by aggregating visible components, which have inherent advantages under occlusion, the proposed method achieves state-of-the-art performance using only hand-crafted features; (2) mapped from the mean shape through component-invariant mapping, the proposed component detector is more robust to pose variations; (3) a local-to-global aggregation strategy that involves region competition helps alleviate false alarms while enhancing localization accuracy.
Jiali Duan, Shengcai Liao, Xiaoyuan Guo, and Stan Z. Li
ACCV Workshop, 2016 (Oral)


Activity

November 5, 2014: Awarded second place in the English Speaking Competition held by the University of Chinese Academy of Sciences
December 2012: Reached the Shanghai Final on behalf of ECUST in the 21st Century Coca-Cola Cup National English Speaking Competition
2012/2013: Awarded Second Prize in the National Mathematical Contest in Modeling and an Honorable Mention in MCM/ICM, respectively