Generate Embeddings Online

Context and Problem Statement

To run a question-answering (Q&A) session over research papers with a large language model (LLM), we need to process each file: the file is converted to a string, the string is split into chunks, and an embedding vector is generated for each chunk.

Where should these embeddings be generated?
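
The processing steps described above can be sketched roughly as follows. This is a minimal illustration, not JabRef's actual implementation: the `Embedder` interface, the character-based splitting with overlap, and all names and parameter values are assumptions made for the example. The point of the interface is that either option under consideration (a local model or a remote API client) could sit behind it.

```java
import java.util.ArrayList;
import java.util.List;

public class EmbeddingPipelineSketch {

    /** Hypothetical abstraction: a local model or a remote API client could implement this. */
    interface Embedder {
        float[] embed(String chunk);
    }

    /**
     * Splits text into chunks of at most chunkSize characters.
     * Neighbouring chunks share `overlap` characters (an assumed strategy).
     */
    static List<String> split(String text, int chunkSize, int overlap) {
        if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
            throw new IllegalArgumentException("require 0 <= overlap < chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }

    /** Generates one embedding vector per chunk of the document text. */
    static List<float[]> embedAll(String documentText, Embedder embedder,
                                  int chunkSize, int overlap) {
        List<float[]> vectors = new ArrayList<>();
        for (String chunk : split(documentText, chunkSize, overlap)) {
            vectors.add(embedder.embed(chunk));
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Placeholder embedder for demonstration only; it is NOT a real model.
        Embedder fake = chunk -> new float[] {chunk.length()};
        List<float[]> vectors = embedAll("abcdefghij", fake, 4, 1);
        System.out.println(vectors.size()); // prints 3
    }
}
```

With this shape, the decision recorded here only affects which `Embedder` implementation is plugged in; the conversion and chunking steps stay the same.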

Considered Options

Local embedding model with langchain4j
OpenAI embedding API

Decision Drivers

Embedding generation should be fast
Embeddings should have good performance (that is, they should capture the semantics of the text well; see also MTEB)
Generating embeddings should be cheap
Embeddings should be small in size
Embedding models and the library used to generate embeddings should not add much to the size of the distribution binary
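
To make the size driver concrete, here is a rough back-of-the-envelope sketch of how much space stored embeddings might take. It assumes vectors are stored as 32-bit floats, which this record does not specify. The dimension counts are the ones commonly documented for all-MiniLM-L6-v2 (384) and for OpenAI's text-embedding-3-small (1536); the latter model is only an example, since this record does not name a specific OpenAI model. The chunk count is hypothetical.

```java
public class EmbeddingStorageSketch {

    /** Raw size in bytes of chunkCount vectors, assuming one 32-bit float per dimension. */
    static long storageBytes(long chunkCount, int dimensions) {
        return chunkCount * dimensions * Float.BYTES;
    }

    public static void main(String[] args) {
        long chunks = 1_000; // hypothetical number of chunks in a library

        // 384 dimensions: all-MiniLM-L6-v2
        System.out.println(storageBytes(chunks, 384));  // prints 1536000

        // 1536 dimensions: e.g. OpenAI text-embedding-3-small (assumed model)
        System.out.println(storageBytes(chunks, 1536)); // prints 6144000
    }
}
```

The estimate ignores any index or metadata overhead, so real storage would likely be somewhat larger.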

Decision Outcome

Chosen option: “OpenAI embedding API”, because the distribution size of JabRef will be nearly unaffected. It is also fast and performs better than all-MiniLM-L6-v2, the model available in langchain4j.

Pros and Cons of the Options

Local embedding model with `langchain4j`

Good, because it works locally: privacy is preserved and no Internet connection is required
Good, because the user doesn’t pay for anything
Neutral, because the speed of embedding generation depends on the chosen model. It may be small and fast, or big and time-consuming
Neutral, because local embedding models may perform worse than OpenAI’s, for example. (In fact, most embedding models suitable for use in JabRef are only about 50% performant.)
Bad, because embedding generation consumes the computer’s resources
Bad, because the only framework for running embedding models in Java is ONNX, and it adds a lot to the size of the distribution binary

OpenAI embedding API

Good, because we delegate the task of generating embeddings to an online service, so the user’s computer is free to do other work
Good, because OpenAI models typically have better performance
Good, because the JabRef distribution size will be practically unaffected
Bad, because the user must agree to send data to a third-party service, and an Internet connection is required
Bad, because the user pays for embedding generation (see also OpenAI embedding models pricing)