CLIP Provider¶
Cross-modal embedding generation using CLIP models.
What is CLIP?¶
CLIP (Contrastive Language-Image Pre-training) encodes both images and text into the same vector space. This enables:
- Searching images with text queries
- Searching text with image queries
- Finding similar images
- Zero-shot image classification
Requirements¶
# Using virtualenv with uv (recommended)
./scripts/setup_venv.sh
uv pip install transformers torch pillow --python .venv/bin/python
# Or install manually
pip install transformers torch pillow
Configuration¶
%% Using virtualenv (recommended)
{ok, State} = barrel_embed:init(#{
    embedder => {clip, #{
        venv => "/absolute/path/to/.venv",
        model => "openai/clip-vit-base-patch32", % default
        timeout => 120000 % default, ms
    }}
}).
%% Using system Python
{ok, State} = barrel_embed:init(#{
    embedder => {clip, #{
        python => "python3", % default
        model => "openai/clip-vit-base-patch32", % default
        timeout => 120000 % default, ms
    }}
}).
Options¶
| Option | Type | Default | Description |
|---|---|---|---|
| venv | string | undefined | Path to virtualenv (recommended) |
| python | string | "python3" | Python executable (if no venv) |
| model | string | "openai/clip-vit-base-patch32" | Model name |
| timeout | integer | 120000 | Timeout in milliseconds |
Supported Models¶
| Model | Dimensions | Notes |
|---|---|---|
| openai/clip-vit-base-patch32 | 512 | Default, fast |
| openai/clip-vit-base-patch16 | 512 | Higher quality |
| openai/clip-vit-large-patch14 | 768 | Best quality |
| laion/CLIP-ViT-B-32-laion2B-s34B-b79K | 512 | LAION trained |
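Because the vector length depends on the model, it can help to check it before storing embeddings in an index. The following is a minimal sketch only: the module `clip_dims` and its functions are hypothetical helpers, not part of barrel_embed, and it assumes embeddings are returned as plain Erlang lists of floats. The dimensions are copied from the table above.

```erlang
-module(clip_dims).
-export([expected_dim/1, check_dim/2]).

%% Expected embedding size per model, taken from the Supported Models table.
%% Model names are Erlang strings, as in the configuration examples.
-spec expected_dim(string()) -> {ok, pos_integer()} | error.
expected_dim("openai/clip-vit-base-patch32") -> {ok, 512};
expected_dim("openai/clip-vit-base-patch16") -> {ok, 512};
expected_dim("openai/clip-vit-large-patch14") -> {ok, 768};
expected_dim("laion/CLIP-ViT-B-32-laion2B-s34B-b79K") -> {ok, 512};
expected_dim(_Other) -> error.

%% Assumption: Vec is a list of numbers. Returns ok when the length matches
%% the model's expected dimension, and a tagged error otherwise.
-spec check_dim(string(), [number()]) ->
    ok | {error, unknown_model | {dimension_mismatch, pos_integer(), non_neg_integer()}}.
check_dim(Model, Vec) when is_list(Vec) ->
    case expected_dim(Model) of
        {ok, Dim} when length(Vec) =:= Dim -> ok;
        {ok, Dim} -> {error, {dimension_mismatch, Dim, length(Vec)}};
        error -> {error, unknown_model}
    end.
```

A mismatch usually means vectors from two different models are being mixed in one index, which makes similarity scores meaningless.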
API¶
Text Embedding¶
%% Text embeddings (same space as images)
{ok, TextVec} = barrel_embed:embed(<<"a photo of a cat">>, State).
{ok, TextVecs} = barrel_embed:embed_batch([<<"cat">>, <<"dog">>], State).
Image Embedding¶
Images must be base64-encoded:
%% Read and encode image
{ok, ImageData} = file:read_file("photo.jpg").
ImageBase64 = base64:encode(ImageData).
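%% Get embedding (Config is the CLIP provider config taken from State,
%% as in the image search example)
{ok, ImageVec} = barrel_embed_clip:embed_image(ImageBase64, Config).
%% Batch (Img1 and Img2 are base64-encoded images, like ImageBase64)
{ok, ImageVecs} = barrel_embed_clip:embed_image_batch([Img1, Img2], Config).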
Example: Image Search with Text¶
%% Initialize
{ok, State} = barrel_embed:init(#{embedder => {clip, #{}}}).
{_, Config} = hd(maps:get(providers, State)).
%% Index images (do once)
Images = [<<"img1.jpg">>, <<"img2.jpg">>, <<"img3.jpg">>],
ImageVecs = lists:map(fun(Path) ->
    {ok, Data} = file:read_file(Path),
    {ok, Vec} = barrel_embed_clip:embed_image(base64:encode(Data), Config),
    {Path, Vec}
end, Images).
%% Search with text query
{ok, QueryVec} = barrel_embed:embed(<<"a sunset over the ocean">>, State).
%% Find most similar images
Scores = [{Path, cosine_similarity(QueryVec, ImgVec)}
          || {Path, ImgVec} <- ImageVecs],
Ranked = lists:reverse(lists:keysort(2, Scores)).
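The examples on this page call `cosine_similarity/2` without defining it. If your project does not already provide one, a minimal sketch could look like the following. The module name `clip_similarity` is hypothetical, and the code assumes embeddings are plain Erlang lists of numbers of equal length; adapt it if your vectors use another representation.

```erlang
-module(clip_similarity).
-export([cosine_similarity/2]).

%% Cosine similarity of two equal-length, non-empty lists of numbers:
%% dot(A, B) / (|A| * |B|). Returns 0.0 if either vector has zero norm.
%% Assumption: embeddings are lists of floats (not binaries or tuples).
-spec cosine_similarity([number()], [number()]) -> float().
cosine_similarity(A, B) when A =/= [], length(A) =:= length(B) ->
    Dot = lists:sum(lists:zipwith(fun(X, Y) -> X * Y end, A, B)),
    NormA = math:sqrt(lists:sum([X * X || X <- A])),
    NormB = math:sqrt(lists:sum([Y * Y || Y <- B])),
    case NormA * NormB of
        Denom when Denom == 0 -> 0.0;
        Denom -> Dot / Denom
    end.
```

From the shell, compile it with `c(clip_similarity).` and call it fully qualified, for example `clip_similarity:cosine_similarity(QueryVec, ImgVec)`. The unqualified form used in the examples only works inside a module that defines or imports the function.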
Example: Similar Image Search¶
%% Find images similar to a reference image
{ok, RefData} = file:read_file("reference.jpg").
{ok, RefVec} = barrel_embed_clip:embed_image(base64:encode(RefData), Config).
%% Compare with other images
Scores = [{Path, cosine_similarity(RefVec, ImgVec)}
          || {Path, ImgVec} <- ImageVecs],
Similar = lists:reverse(lists:keysort(2, Scores)).
Example: Zero-Shot Classification¶
%% Classify an image into categories
Categories = [<<"a photo of a cat">>,
              <<"a photo of a dog">>,
              <<"a photo of a bird">>],
{ok, CategoryVecs} = barrel_embed:embed_batch(Categories, State).
{ok, ImageData} = file:read_file("mystery_animal.jpg"),
{ok, ImageVec} = barrel_embed_clip:embed_image(base64:encode(ImageData), Config).
%% Find best matching category
Scores = lists:zipwith(fun(Cat, CatVec) ->
    {Cat, cosine_similarity(ImageVec, CatVec)}
end, Categories, CategoryVecs),
[{BestCategory, _} | _] = lists:reverse(lists:keysort(2, Scores)).
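Raw cosine similarities between CLIP image and text vectors tend to sit in a narrow range, so they are hard to read as confidences. A common approach is to scale them and apply a softmax over the categories. The sketch below is an illustration, not part of barrel_embed: the module `clip_zero_shot` is hypothetical, and the scale of 100 used in the usage note is an assumption based on the logit scale of the released OpenAI checkpoints (treat it as a tunable parameter).

```erlang
-module(clip_zero_shot).
-export([softmax/2]).

%% Convert [{Label, Similarity}] into [{Label, Probability}].
%% Each similarity is multiplied by Scale, then a softmax is applied.
%% The maximum logit is subtracted before exponentiation for
%% numerical stability; this does not change the result.
-spec softmax([{term(), number()}], number()) -> [{term(), float()}].
softmax(Scores, Scale) when Scores =/= [] ->
    Logits = [Scale * S || {_Label, S} <- Scores],
    Max = lists:max(Logits),
    Exps = [math:exp(L - Max) || L <- Logits],
    Sum = lists:sum(Exps),
    lists:zipwith(fun({Label, _}, E) -> {Label, E / Sum} end, Scores, Exps).
```

Applied to the `Scores` list from the classification example, `clip_zero_shot:softmax(Scores, 100)` returns the same categories with probabilities that sum to 1, which makes it easier to apply a rejection threshold when no category fits well.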
Use Cases¶
- Image search: Find images by description
- Reverse image search: Find similar images
- Content moderation: Classify images automatically
- Multi-modal retrieval: Combined text and image search