llamaR provides R bindings to llama.cpp for running large language models
locally, with optional Vulkan GPU acceleration via ggmlR. This vignette
walks through the core workflow: get a model, load it, generate text,
tokenize, and extract embeddings. For the chat and server side, see
vignette("chat-and-agents").

llamaR works with GGUF files. Download one from the Hugging Face Hub
(cached under ~/.cache/llamaR/ by default):
# List the GGUF files in a repo
llama_hf_list("TheBloke/Mistral-7B-Instruct-v0.2-GGUF")
# Download one (by filename or by quantization pattern)
path <- llama_hf_download(
  "TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
  pattern = "Q4_K_M"
)

Or point at any GGUF file you already have on disk.
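Before handing an arbitrary file to the loader, it can help to confirm that it really is a GGUF file. GGUF files begin with the four ASCII bytes "GGUF", so a quick check is possible in base R. The helper name is_gguf() below is hypothetical and not part of llamaR; this is only a sketch of a pre-flight check, not a full format validation.

```r
# Hypothetical helper (not part of llamaR): TRUE if the file starts with
# the 4-byte GGUF magic, FALSE otherwise.
is_gguf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  con <- file(path, open = "rb")
  on.exit(close(con))
  magic <- readBin(con, what = "raw", n = 4L)
  identical(magic, charToRaw("GGUF"))
}

# Demo on two throwaway files (these are not real models)
good <- tempfile(fileext = ".gguf")
bad  <- tempfile(fileext = ".gguf")
writeBin(c(charToRaw("GGUF"), as.raw(c(3, 0, 0, 0))), good)
writeBin(charToRaw("not a model"), bad)

is_gguf(good)  # TRUE
is_gguf(bad)   # FALSE
```

A check like this only catches wrong or truncated downloads early; the loader remains the authority on whether a file is usable.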
A model holds the weights; a context holds the working state (KV cache) for one generation session. Both are external pointers with GC finalizers, so explicit freeing is optional.
model <- llama_load_model(path, n_gpu_layers = -1L) # -1 = offload all layers
ctx <- llama_new_context(model, n_ctx = 4096L)
llama_model_info(model) # size, n_params, context length, heads, ...

n_gpu_layers = -1L offloads every layer to the GPU when
Vulkan is available, and falls back to CPU otherwise.
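Because the context owns the KV cache, n_ctx directly affects memory use. The arithmetic below is a back-of-the-envelope sketch, not something llamaR reports: it assumes a 16-bit cache and a Mistral-7B-style shape (32 layers, 8 KV heads, head dimension 128), and it ignores any allocator or backend overhead. The function kv_cache_bytes() is a hypothetical name introduced here for illustration.

```r
# Rough KV-cache size: one K and one V tensor per layer,
# with one slot per context position.
# bytes_per_elem = 2 assumes a 16-bit (f16) cache.
kv_cache_bytes <- function(n_ctx, n_layer, n_kv_head, head_dim,
                           bytes_per_elem = 2) {
  2 * n_layer * n_ctx * n_kv_head * head_dim * bytes_per_elem
}

# Assumed model shape: 32 layers, 8 KV heads, head dimension 128
bytes <- kv_cache_bytes(n_ctx = 4096, n_layer = 32,
                        n_kv_head = 8, head_dim = 128)
bytes / 1024^2  # size in MiB: 512
```

Under these assumptions the estimate scales linearly with n_ctx, so halving the context length halves the cache.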
Sampling is controlled by arguments to llama_generate() (set temp = 0 for
greedy decoding):
llama_generate(
  ctx, "Write a haiku about autumn.",
  max_new_tokens = 64L,
  temp = 0.7,
  top_p = 0.9,
  top_k = 40L,
  repeat_penalty = 1.1
)

Pass with_timings = TRUE to get token throughput
alongside the text.
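To build intuition for what temp, top_k, and top_p do, here is a toy base-R filter over a made-up logits vector. It is a conceptual sketch only: filter_probs() is a hypothetical function, the order in which llama.cpp applies its samplers may differ, and repeat_penalty is not modeled.

```r
# Toy next-token filter: temperature scaling, then top-k, then top-p (nucleus).
# Returns renormalized probabilities; filtered-out tokens get probability 0.
filter_probs <- function(logits, temp = 0.7, top_k = 40L, top_p = 0.9) {
  if (temp <= 0) {
    # Greedy decoding: all probability mass on the highest logit
    p <- numeric(length(logits))
    p[which.max(logits)] <- 1
    return(p)
  }
  z <- logits / temp
  p <- exp(z - max(z))            # numerically stable softmax
  p <- p / sum(p)

  # top-k: keep the k most probable tokens and renormalize
  keep <- order(p, decreasing = TRUE)[seq_len(min(top_k, length(p)))]
  pk <- p[keep] / sum(p[keep])

  # top-p: keep the smallest prefix whose cumulative mass reaches top_p
  n_keep <- which(cumsum(pk) >= top_p)[1]
  if (is.na(n_keep)) n_keep <- length(keep)
  keep <- keep[seq_len(n_keep)]

  out <- numeric(length(p))
  out[keep] <- p[keep] / sum(p[keep])
  out
}

logits <- c(2.0, 1.0, 0.5, -1.0)  # made-up scores for four tokens
filter_probs(logits, temp = 0)               # greedy: only token 1 survives
filter_probs(logits, temp = 1, top_k = 2L)   # at most two tokens survive
```

Lower temperatures sharpen the distribution toward the top token, while top_k and top_p trim the unlikely tail before a token is drawn.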
Instruction-tuned models expect their prompt wrapped in a chat
template ([INST]…[/INST], <|im_start|>…,
etc.). llama_chat_apply_template() builds that prompt from
a list of role/content messages:
messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user", content = "Name three primary colors.")
)
prompt <- llama_chat_apply_template(messages) # uses the model's built-in template
llama_generate(ctx, prompt, max_new_tokens = 64L)

For multi-turn chat with history management, use
chat_llamar() instead; see
vignette("chat-and-agents").
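To make the idea of a chat template concrete, the sketch below hand-builds a ChatML-style prompt (the <|im_start|> family) from the same kind of role/content list. This is purely illustrative: format_chatml() is a hypothetical helper, not llamaR's implementation, and real models may use a different template, which is why llama_chat_apply_template() is the safer route in practice.

```r
# Hypothetical helper: wrap each message in ChatML-style markers and,
# optionally, open an assistant turn for the model to complete.
format_chatml <- function(messages, add_assistant = TRUE) {
  turns <- vapply(
    messages,
    function(m) sprintf("<|im_start|>%s\n%s<|im_end|>\n", m$role, m$content),
    character(1)
  )
  out <- paste(turns, collapse = "")
  if (add_assistant) out <- paste0(out, "<|im_start|>assistant\n")
  out
}

demo_messages <- list(
  list(role = "system", content = "You are a helpful assistant."),
  list(role = "user",   content = "Name three primary colors.")
)
chatml_prompt <- format_chatml(demo_messages)
cat(chatml_prompt)
```

The trailing open assistant turn is what cues the model to answer rather than continue the user's text.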
When tokenizing a prompt that already contains role markers from a
chat template, set parse_special = TRUE so markers like
[INST] become single control tokens rather than literal
characters:
prompt <- llama_chat_apply_template(list(list(role = "user", content = "hi")))
llama_tokenize(ctx, prompt, parse_special = TRUE)

Create the context in embedding mode, then extract vectors. Single text:
emb_model <- llama_load_model("embedding-model.gguf")
emb_ctx <- llama_new_context(emb_model, embedding = TRUE)
v <- llama_embeddings(emb_ctx, "The quick brown fox")
length(v)

A batch of texts in one call:
m <- llama_embed_batch(emb_ctx, c("first text", "second text", "third text"))
dim(m) # one row per input

embed_llamar() is a higher-level helper that loads the
model for you and returns a provider suitable for
ragnar_store_create(embed = ...). Called with a model only,
it returns a closure (partial application); called with text, it returns
a matrix.
library(ragnar)
store <- ragnar_store_create(
  location = "store.duckdb",
  embed = embed_llamar(model = "embedding-model.gguf", n_gpu_layers = -1L)
)
ragnar_store_insert(store, documents)
ragnar_store_build_index(store)
ragnar_retrieve(store, "search query")

Combine this with a local chat_llamar() for a fully
local RAG stack; see vignette("chat-and-agents").
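Conceptually, retrieval over embeddings ranks stored texts by how close their vectors are to the query vector. The base-R sketch below uses cosine similarity on a small made-up matrix standing in for the output of llama_embed_batch(); the function cosine_sims() is hypothetical, and ragnar's actual indexing and scoring may work differently.

```r
# Cosine similarity between a query vector and each row of a matrix.
cosine_sims <- function(query, m) {
  as.vector(m %*% query) / (sqrt(rowSums(m^2)) * sqrt(sum(query^2)))
}

# Made-up 3-dimensional "embeddings", one row per stored text
m <- rbind(
  c(1.0, 0.0, 0.0),
  c(0.0, 1.0, 0.0),
  c(0.9, 0.1, 0.0)
)
query <- c(1, 0, 0)

sims <- cosine_sims(query, m)
ranking <- order(sims, decreasing = TRUE)
ranking  # row indices from most to least similar: 1 3 2
```

With real embeddings the rows would come from the model, but the ranking step stays the same.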
To talk to a model over HTTP, or to use it through the ellmer/ragnar
toolchain, see vignette("chat-and-agents"):
llama_serve_openai(): OpenAI-compatible HTTP server.
chat_llamar(): an ellmer::Chat backed by a local model.

vignette("chat-and-agents"): server, ellmer, ragnar, OpenCode.
?llama_generate, ?llama_chat_apply_template, ?embed_llamar