🚨 Alert. This article is dangerously practical; no AI buzzword bingo here. Oh, and did I mention? We have CODE. 🧑💻
Oh, and one more thing… we’re now on 🔗 LinkedIn! Follow for snippets of knowledge.
Recap
“Build Your Own Siri” is back, this time diving into the most exciting part yet: fine‑tuning the model.
Even though it’s been a while since our last lesson, we’ve got tons to cover. And before we jump in, let’s have a quick recap of what we did last time.
In our last lesson, we used Salesforce’s xlam-function-calling-60k dataset as a few-shot prompt to generate our own function-calling dataset.
Here’s what we did:
Generated fresh examples based on the original structure: query, available tools, and answers;
Paraphrased instructions to produce more human-like responses;
Took out all the unnecessary data, keeping only examples that enhanced training quality;
Injected human-like grounding by referencing patterns from real-world assistant datasets;
Formatted everything using ShareGPT-style dialogs.
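For reference, a single record in the generated dataset looks roughly like this (the field names are the ones we rely on later in the code; the values are purely illustrative, not copied from the actual dataset):
sample = {
    # "query": the user utterance the assistant has to handle
    "query": "What's the weather like in Berlin tomorrow?",
    # "tools": JSON string describing the functions available for this example
    "tools": '[{"name": "get_weather", "description": "Get the forecast for a city", "parameters": {"city": {"type": "string"}, "date": {"type": "string"}}}]',
    # "answers": JSON string with the function call(s) the assistant is expected to produce
    "answers": '[{"name": "get_weather", "arguments": {"city": "Berlin", "date": "tomorrow"}}]',
}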
With a solid and realistic dataset in place, we’re finally ready to fine-tune our model.
Let’s jump into the most exciting part!
This is a 4-part course.
Don’t forget to read the other lessons of this course.
Table of contents
Fine-Tuning Architecture Overview
Model Selection and Fine-Tuning Strategy
Model Quantization and Deployment
Inference Testing with Streamlit
1. Fine-Tuning Architecture Overview
Fine-tuning doesn’t start with code. It starts with intent and a clear definition of the problem the AI system is supposed to solve.
Fine-tuning is a tedious process that requires patience. I know you’d like to see results within a few minutes, and that’s not realistic here, but I do have an ace up my sleeve: Unsloth.
To get a model that reliably does what it’s supposed to do, we need to feed it good, reliable, and consistent data. That’s where the dataset we generated in the last lesson comes in: we’ll use it to train our model.
I’ve prepared a clear, step-by-step workflow that outlines exactly what needs to be done and how to approach it.
Even though we keep coming back to the dataset, it’s worth repeating: the dataset is the most important part of fine-tuning your model.
Think of it like this: if you have a model that was originally trained to recognize animals, but now you want it to recognize cars, continuing with that same model won’t work effectively — even if you change the dataset.
The knowledge embedded in the model is tailored to animals, not vehicles, and it may struggle to adapt. That’s why choosing the right starting model — one that’s already been trained on data similar to yours — is just as important as preparing the right dataset.
In our case, that means identifying a base model that has already been trained on data similar to ours, aligns with our objectives, and is well-suited for edge deployment.
2. Model Selection and Fine-Tuning Strategy
For this research, I identified two models that meet the outlined requirements: LLaMA 3.1 8B Instruct and Gemma 2 2B.
Both models support function-calling tasks and are instruction-tuned, which is essential for building a reliable assistant that can interpret and respond to structured prompts consistently.
They are both small language models (SLMs), making them suitable for deployment on edge devices. An additional selection criterion was compatibility with Unsloth, which allows for efficient fine-tuning on limited hardware.
To determine the best model, we fine-tuned both and tested them on a separate dataset that we generated, counting how many generated outputs were valid or invalid and checking whether each generated function call matched the expected one.
But before we jump into the comparison, let's dive into the fine-tuning strategy.
Fine-Tuning Strategy
We start with a base model (LLaMA 3.1 8B or Gemma 2 2B) and attach LoRA adapters for efficient parameter tuning.
import torch
from unsloth import FastLanguageModel
max_seq_length = 2048 # Unsloth auto supports RoPE Scaling internally!
dtype = None # None for auto detection
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct", # (or "unsloth/gemma-2b-it")
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16, # LoRA rank - suggested values: 8, 16, 32, 64, 128
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0, # Supports any, but = 0 is optimized
    bias="none", # Supports any, but = "none" is optimized
    use_gradient_checkpointing="unsloth", # Ideal for long context tuning
    random_state=3407,
    use_rslora=False, # Disable rank-stabilized LoRA (rsLoRA) for this simpler task
    loftq_config=None # No LoftQ, for standard fine-tuning
)
Afterwards, we adapted the dataset to match the chat template format required by our chosen model, ensuring that the prompts and responses aligned with the model’s expected input structure. This step is crucial for enabling the model to correctly interpret the instruction and produce valid function-call outputs.
from unsloth.chat_templates import get_chat_template
# Initialize the tokenizer with the chat template and mapping
tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3", # (or "gemma")
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
    map_eos_token = True, # Maps <|im_end|> to <|eot_id|> instead
)
def formatting_prompts_func(examples):
    convos = []
    # Iterate through each item in the batch
    for query, tools, answers in zip(examples['query'], examples['tools'], examples['answers']):
        tool_user = {
            "content": f"You are a helpful assistant with access to the following tools or function calls. Your task is to produce a sequence of tools or function calls necessary to generate response to the user utterance. Use the following tools or function calls as required:\n{tools}",
            "role": "system"
        }
        ques_user = {
            "content": f"{query}",
            "role": "user"
        }
        assistant = {
            "content": f"{answers}",
            "role": "assistant"
        }
        convos.append([tool_user, ques_user, assistant])
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}
# Apply the formatting on dataset
dataset = dataset.map(formatting_prompts_func, batched = True,)
We now begin the fine-tuning process, training our model for three epochs. To manage this, we define a custom training setup using Hugging Face’s TrainingArguments, which controls the key aspects of the training loop such as learning rate, batch size, precision settings, and logging frequency. Additionally, we’ve integrated Weights & Biases for real-time experiment tracking, enabling us to monitor training progress and evaluate performance throughout the entire fine-tuning session.
from transformers import TrainingArguments
args = TrainingArguments(
    per_device_train_batch_size = 8, # Controls the batch size per device
    gradient_accumulation_steps = 2, # Accumulates gradients to simulate a larger batch
    warmup_steps = 5,
    learning_rate = 2e-4, # Sets the learning rate for optimization
    num_train_epochs = 3,
    fp16 = not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    optim = "adamw_8bit",
    weight_decay = 0.01, # Regularization term for preventing overfitting
    lr_scheduler_type = "linear", # Chooses a linear learning rate decay
    seed = 3407,
    output_dir = "outputs",
    report_to = "wandb", # Enables Weights & Biases (W&B) logging
    logging_steps = 1, # Sets frequency of logging to W&B
    logging_strategy = "steps", # Logs metrics at each specified step
    save_strategy = "no",
    load_best_model_at_end = True, # Loads the best model at the end
    save_only_model = False # Saves entire model, not only weights
)
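Since we report metrics to Weights & Biases, the run also needs to be initialized before training starts. Here is a minimal sketch (the project and run names are illustrative, pick your own):
import wandb
wandb.login()  # or set the WANDB_API_KEY environment variable
wandb.init(project="siri-function-calling", name="llama-3.1-8b-lora")  # illustrative project/run names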
And now that we have all the needed elements, we can start fine-tuning our model. This part is pretty straightforward, thanks to Unsloth.
We chose Unsloth because it’s designed specifically for fast and memory-efficient fine-tuning of instruction-tuned models. Compared to traditional Hugging Face approaches, Unsloth offers:
2–5× faster training for shorter sequences
Reduced memory usage with 8-bit optimizers
Built-in LoRA and quantization support
Streamlined setup, working seamlessly with trl.SFTTrainer
This means we can achieve high-quality fine-tuning in significantly less time, making it ideal for rapid development — and of course, perfect for engineers who want to skip the long, tedious “training” phase and get straight to results.
from trl import SFTTrainer
trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = args
)
from unsloth import unsloth_train
trainer_stats = unsloth_train(trainer)
print(trainer_stats)
And now we push our model to Hugging Face to preserve a versioned checkpoint and make it accessible for future use. We also stop the Weights & Biases tracking to wrap up the fine-tuning session cleanly.
wandb.finish()
model.push_to_hub("YourUser/YourModel", token = hf_token)
tokenizer.push_to_hub("YourUser/YourTokenizer", token = hf_token)
Model Selection
Now that both models are fine-tuned, it's time for model selection. To determine which one performs better in function-calling scenarios, we run a structured evaluation against our dataset.
Each model is prompted with queries and a list of available tools, and it must generate function calls strictly in JSON format.
We check performance using a function that validates the following:
The output is written in a correct JSON format
The structure matches our expectations
The function names and arguments match the expected ones
Using this function allows us to determine which model best meets our needs.
import json
import re
from datasets import load_dataset
from tqdm import tqdm
def extract_json_arrays(text):
    # Find all JSON array-like blocks
    array_pattern = r"\[\s*{.*?}\s*]" # non-greedy match
    matches = re.findall(array_pattern, text, re.DOTALL)
    # Convert each match into an actual list
    parsed = []
    for match in matches:
        try:
            parsed.extend(json.loads(match))
        except json.JSONDecodeError as e:
            print("Error decoding:", match, e)
    return parsed
dataset = load_dataset("valex95/siri-function-calling", split="train", token=hf_token)
def evaluate_model_on_dataset(dataset, model, tokenizer, max_samples=10):
    total = 0
    correct_calls = 0
    valid_json = 0
    wrong_structure = 0
    results = []
    for example in tqdm(dataset.select(range(min(max_samples, len(dataset))))):
        query = example["query"]
        expected_calls = example["answers"] if isinstance(example["answers"], list) else json.loads(example["answers"])
        tools = example["tools"] if isinstance(example["tools"], list) else json.loads(example["tools"])
        # Prompt model with tools
        chat = [
            {"role": "system", "content": (
                "You are a function-calling assistant that only replies with JSON.\n"
                f"Available functions:\n{json.dumps(tools, indent=2)}"
            )},
            {"role": "user", "content": query}
        ]
        # Generate model response
        inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
        output = model.generate(inputs, max_new_tokens=512)
        # Decode only the newly generated tokens, so the tool definitions in the
        # system prompt are not parsed as if the model had produced them
        response = tokenizer.batch_decode(output[:, inputs.shape[1]:])[0]
        generated_calls = extract_json_arrays(response)
        if generated_calls:
            valid_json += 1
        else:
            wrong_structure += 1
        # Compare only the first function call (for now)
        expected = expected_calls[0] if expected_calls else None
        generated = generated_calls[0] if generated_calls else None
        match = (
            expected is not None and generated is not None
            and generated.get("name") == expected.get("name")
            and generated.get("arguments") == expected.get("arguments")
        )
        if match:
            correct_calls += 1
        total += 1
        results.append({
            "query": query,
            "expected": expected,
            "generated": generated,
            "match": match
        })
    # Print summary
    print("\n📊 Evaluation Summary:")
    print(f"Total Samples Evaluated: {total}")
    print(f"✅ Correct Function Calls: {correct_calls} ({correct_calls/total:.2%})")
    print(f"🧠 Valid JSON Outputs: {valid_json} ({valid_json/total:.2%})")
    print(f"❌ Invalid JSON/Structure: {wrong_structure} ({wrong_structure/total:.2%})")
    return results
results = evaluate_model_on_dataset(dataset, model, tokenizer, max_samples=len(dataset)) # evaluate on the full dataset
For this evaluation, we used a custom dataset consisting of approximately 900 samples.
The results were as follows:
Gemma 2 achieved around 70% JSON validity and 70% correct function calls.
LLaMA 3.1, on the other hand, delivered a perfect score, with 100% valid JSON and 100% accurate function calls.
These results clearly indicate that LLaMA 3.1 is better suited for our specific function-calling use case.
3. Model Quantization and Deployment
Now that we’ve selected the best-performing model for our Siri-like application, the next step is to quantize it.
Quantization significantly reduces the size of the model and its memory footprint while maintaining performance, making it a good fit for an edge deployment.
To package and run the quantized model, we use Ollama, a tool that simplifies the deployment of quantized (GGUF) models. You can install it with the following command:
curl -fsSL https://ollama.com/install.sh | sh
After installing Ollama, we quantize the model with Unsloth and upload it to the Hugging Face Hub in GGUF format, which is optimized for on-device execution.
The following call exports multiple quantized versions of the model (q4_k_m, q5_k_m and q8_0), providing a varied selection ranging from 4-bit to 8-bit quantization.
This lets us choose the most suitable variant for our needs, as some edge devices can handle an 8-bit model:
model.push_to_hub_gguf(
    repo_id="YourUser/YourOllama",
    tokenizer=tokenizer,
    quantization_method=["q4_k_m", "q5_k_m", "q8_0"],
    token=hf_token,
)
With the quantized versions of the model now on Hugging Face, we can pull them into the Streamlit application and see how they perform.
4. Inference Testing with Streamlit
Now, let's test our fine-tuned and quantized model through a Streamlit interface, so we can evaluate its behavior in a real-time setting and observe how it handles function-calling prompts from users (whether they come from speech-to-text or a simple typed prompt).
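Here is a minimal sketch of what such an interface can look like. It assumes the fine-tuned GGUF has already been pulled into a locally running Ollama server (recent Ollama versions can fetch GGUFs straight from the Hub, e.g. ollama run hf.co/YourUser/YourOllama:Q4_K_M), and the model name below is hypothetical:
import requests
import streamlit as st
OLLAMA_URL = "http://localhost:11434/api/chat"  # default local Ollama endpoint
MODEL_NAME = "siri-llama-3.1-q4_k_m"            # hypothetical: whatever name your pulled model has
st.title("Siri-like function-calling assistant")
tools = st.text_area("Available tools (JSON array)", "[]")
query = st.text_input("Your request")
if st.button("Ask") and query:
    messages = [
        {"role": "system", "content": (
            "You are a function-calling assistant that only replies with JSON.\n"
            f"Available functions:\n{tools}"
        )},
        {"role": "user", "content": query},
    ]
    # stream=False makes Ollama return a single JSON object with the full answer
    resp = requests.post(OLLAMA_URL, json={"model": MODEL_NAME, "messages": messages, "stream": False})
    resp.raise_for_status()
    st.code(resp.json()["message"]["content"], language="json")
You can launch the app with streamlit run app.py and point it at any of the quantized variants.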
The complete code for this Streamlit interface will be available on GitHub for anyone interested in running or extending the project. All the fine-tuned and quantized models used throughout this research can be found on HuggingFace.
Wrapping Up
If you’ve followed this far (drop a 🔥 in the comments, it’s been a journey), you've gone through the complete process of building and fine-tuning a lightweight, function-calling assistant from scratch. By this point, you should have:
A solid understanding and modular workflow for fine-tuning models efficiently with Unsloth;
An evaluation pipeline that validates performance through JSON structure and function call accuracy;
A quantized version of the model, optimized for edge deployment;
A Streamlit interface to easily test and interact with the assistant in real-time.
Coming Up Next: Deployment on the edge
In the following article, we’ll discuss how to deploy your model on the edge.
We’ll show you how to:
Build and run the application locally
Integrate the model for on-device (edge) deployment
Test your own Siri-like assistant in action
It’ll be hands-on.
It’ll include real code, real configs, and real gotchas.
And this is the moment your assistant becomes what you built it to be—a true Siri-like companion, responding intelligently and effortlessly.
🔗 Check out the code on GitHub and support us with a ⭐️