Distillation of CLIP model and other experiments

Introduction

CLIP is a model released by OpenAI earlier this year. It was trained to learn "visual concepts from natural language supervision" on more than 400 million image-text pairs, using an impressive amount of compute (256 GPUs for 2 weeks).

At PicCollage we have been researching ways to combine text and images, so CLIP came in handy and we tested its performance on some of our content. It was VERY impressive - better than anything we had earlier. However, we soon began to notice a quirk of the model: for a search query it seemed to prioritize textual similarity over semantic similarity. Given how powerful the model was, we also wanted to reduce its size and explore the possibility of deploying it on edge. Given the magnitude of the dataset and compute required, that seemed like a daunting task, but we wanted to give it a shot anyway.

This article expands on two experiments with CLIP performed at PicCollage: i) how to reduce the emphasis on textual similarity in order to get more relevant search results, and ii) how to reduce the size of CLIP by using model distillation. In short, we dealt with the issue of over-emphasis on text (described below) when using CLIP for search, and we also performed model distillation of CLIP and ran the distilled models on iOS devices.
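
For reference, ranking images against a text query with CLIP looks roughly like the following minimal sketch, using the open-source clip package; the model name, file names and query here are only illustrative, not our production setup.

    # Sketch: score a set of images against a text query with CLIP.
    # Assumes PyTorch and `pip install git+https://github.com/openai/CLIP.git`.
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    image_paths = ["cat_photo.jpg", "cat_text_sticker.jpg"]  # hypothetical files
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    text = clip.tokenize(["cat"]).to(device)

    with torch.no_grad():
        image_vectors = model.encode_image(images)  # (N, 512) for ViT-B/32
        text_vector = model.encode_text(text)       # (1, 512)

    # Cosine similarity = dot product of L2-normalized vectors.
    image_vectors = image_vectors / image_vectors.norm(dim=-1, keepdim=True)
    text_vector = text_vector / text_vector.norm(dim=-1, keepdim=True)
    scores = (text_vector @ image_vectors.T).squeeze(0)  # one score per image
    print(list(zip(image_paths, scores.tolist())))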


Over-emphasis on text in images

Searching for "cat" using CLIP returns roughly two kinds of results: i) images containing the text "cat" or something similar, and ii) images containing an actual cat. CLIP tends to have a higher score for the first type. More generally, the similarity between a search term and an image can be of two kinds: i) the image contains text similar to the search term - let's refer to this as textual similarity; ii) the semantic meanings of the image and the search term are similar - let's refer to this as semantic similarity. When building a search functionality, one might prefer semantic similarity to textual similarity, yet we found that CLIP tends to give higher scores to textually similar images. A similar problem has been discussed online already and the problem was even solved by Yannic Kilcher. We came up with two ways to resolve this issue and control the amount of "textness" in search results.

Solutions to over-emphasis on text in images

One option is to engineer the text prompt, but we can't control the text typed in by a user, so using prompts may not be a good solution here. We began with a hypothesis: there exists a direction in the shared vector space in which the "textness" property of images varies a lot whereas other (semantic) properties remain invariant. If we could find this direction, we could take a vector pointing along it and add it to all the image vectors (or to the text vector) before normalizing them and calculating cosine similarities. Let's call this vector the textness_bias vector. In other words, the bias would be applied before the following set of operations:

    image_vectors /= np.linalg.norm(image_vectors, axis=-1, keepdims=True)
    cosine_similarities = text_vector @ image_vectors.T
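
As a minimal sketch of this idea, assume the textness_bias direction is estimated from two small sets of image embeddings, one text-heavy and one text-free, and that a scalar weight controls how far the image vectors are shifted before the normalization and dot product above. The estimation method, function names and sign convention below are only illustrative, not a prescribed recipe.

    # Sketch: suppress "textness" by shifting image vectors along an estimated
    # textness direction before computing cosine similarities.
    import numpy as np

    def estimate_textness_bias(text_heavy_vecs, text_free_vecs):
        # text_heavy_vecs, text_free_vecs: (M, D) CLIP image embeddings of
        # images with and without prominent text (illustrative estimation only).
        direction = text_heavy_vecs.mean(axis=0) - text_free_vecs.mean(axis=0)
        return direction / np.linalg.norm(direction)

    def search_scores(text_vector, image_vectors, textness_bias, bias=-1.0):
        # Shift image vectors along the textness direction, then score as before.
        # bias = 0 leaves CLIP's ranking unchanged; a negative value suppresses
        # textness, a positive value emphasizes it (sign convention assumed).
        image_vectors = image_vectors + bias * textness_bias
        image_vectors /= np.linalg.norm(image_vectors, axis=-1, keepdims=True)
        text_vector = text_vector / np.linalg.norm(text_vector)
        cosine_similarities = text_vector @ image_vectors.T
        return cosine_similarities

With bias = 0 this reproduces the unmodified CLIP scores, so the weight can be tuned until images that merely contain the query text stop dominating the results.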


Distillation of CLIP model

We also used model distillation to reduce the size of the CLIP model (the ViT image encoder to be specific, not the language model) and got promising results. The sizes of the original and the distilled models are 350MB and 48MB respectively with FP32 precision; with FP16 the distilled model shrinks to 24MB. Finally, we converted the distilled models to CoreML format to run on iOS and observed a negligible difference between the search results of the FP16 and FP32 versions.
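
A minimal sketch of such a distillation setup, assuming the student simply learns to reproduce the teacher's image embeddings, is shown below; the student architecture, loss and data pipeline are only illustrative, with timm's vit_tiny_patch16_224 standing in for a smaller ViT and a hypothetical image folder as training data.

    # Sketch: distill CLIP's ViT-B/32 image encoder into a smaller student.
    import clip
    import timm
    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from torchvision.datasets import ImageFolder

    device = "cuda" if torch.cuda.is_available() else "cpu"
    teacher, preprocess = clip.load("ViT-B/32", device=device)
    teacher.eval()

    # Stand-in student: a tiny ViT whose head outputs CLIP's 512-d embedding.
    student = timm.create_model("vit_tiny_patch16_224", pretrained=False, num_classes=512).to(device)
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

    dataset = ImageFolder("unlabeled_images/", transform=preprocess)  # hypothetical path
    loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

    for images, _ in loader:
        images = images.to(device)
        with torch.no_grad():
            target = teacher.encode_image(images).float()
        pred = student(images)
        # Match the teacher's embeddings in both magnitude and direction.
        loss = F.mse_loss(pred, target) + (1.0 - F.cosine_similarity(pred, target, dim=-1).mean())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()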

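A conversion along these lines, sketched here with coremltools and illustrative names and shapes, makes it easy to produce both FP16 and FP32 variants of the encoder and compare their search results.

    # Sketch: export an image encoder to Core ML in FP16 and FP32 with coremltools.
    import coremltools as ct
    import torch

    student = student.cpu().eval()  # distilled image encoder from the previous sketch
    example = torch.randn(1, 3, 224, 224)
    traced = torch.jit.trace(student, example)

    for precision, suffix in [(ct.precision.FLOAT16, "fp16"), (ct.precision.FLOAT32, "fp32")]:
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(name="image", shape=example.shape)],
            convert_to="mlprogram",
            compute_precision=precision,
        )
        mlmodel.save(f"DistilledImageEncoder_{suffix}.mlpackage")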







