
Embed (package)

A stable, blazing-fast, and easy-to-use inference library with a focus on a sync-to-async API

Installation

pip install embed

Why embed?

Embed makes it easy to load any embedding, classification, or reranking model from Hugging Face. It is a wrapper around SyncEngine from infinity_emb, but it is updated less frequently and disentangles the PyPI and Docker releases of infinity.

from embed import BatchedInference

# Run any model
register = BatchedInference(
    model_id=[
        # sentence-embeddings
        "michaelfeil/bge-small-en-v1.5",
        # sentence-embeddings and image-embeddings
        "wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
        # classification models
        "philschmid/tiny-bert-sst2-distilled",
        # rerankers
        "mixedbread-ai/mxbai-rerank-xsmall-v1",
    ],
    # engine can be `torch` or `optimum`
    engine="torch",
    # device can be `cuda` (Nvidia/AMD) or `cpu`
    device="cpu",
)

sentences = ["Paris is in France.", "Berlin is in Germany.", "I love SF"]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
question = "Where is Paris?"

register.embed(sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5")
register.rerank(query=question, docs=sentences, model_id="mixedbread-ai/mxbai-rerank-xsmall-v1")
register.classify(model_id="philschmid/tiny-bert-sst2-distilled", sentences=sentences)
register.image_embed(model_id="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M", images=images)

# Always manually stop the register upon termination to free model memory.
register.stop()
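
Because `stop()` must be called explicitly, one way to guarantee it runs even when an exception is raised is to wrap the calls in a `try`/`finally` block. This is just an illustrative pattern using the same API as above, not something the library requires:

from embed import BatchedInference

register = BatchedInference(
    model_id=["michaelfeil/bge-small-en-v1.5"],
    engine="torch",
    device="cpu",
)
try:
    future = register.embed(
        sentences=["Paris is in France."],
        model_id="michaelfeil/bge-small-en-v1.5",
    )
    # result() blocks until the batch has been processed
    embeddings, token_usage = future.result()
finally:
    # Runs even if embed() raises, so the model memory is always freed.
    register.stop()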

All functions return a Future of (vector_embedding, token_usage), which lets you wait for results only when you need them and keeps the batching logic out of your code.

>>> embedding_fut = register.embed(sentences=sentences, model_id="michaelfeil/bge-small-en-v1.5")
>>> print(embedding_fut)
<Future at 0x7fa0e97e8a60 state=pending>
>>> import time
>>> time.sleep(1); print(embedding_fut)
<Future at 0x7fa0e97e9c30 state=finished returned tuple>
>>> embedding_fut.result()
([array([-3.35943862e-03, ..., -3.22808176e-02], dtype=float32)], 19)
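
Because each call returns a Future immediately, you can queue several requests up front and block only when you collect the results, leaving the batching to the engine. A minimal sketch, reusing the `register` and `sentences` defined above:

# Submit all requests first; each call returns a Future right away.
futures = [
    register.embed(sentences=[s], model_id="michaelfeil/bge-small-en-v1.5")
    for s in sentences
]

# Block only when the results are actually needed.
for fut in futures:
    embeddings, token_usage = fut.result()
    print(embeddings[0].shape, token_usage)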