So, for the past couple of days I've been downloading small models to test them on my small APU-based machine with Llamafile.
Llamafile is a one-click solution for running local Large Language Models (LLMs). Officially, Llamafile comes in various distributions containing one model each, but you can also download the application and models separately; in that case, the model must be passed to Llamafile via the -m MODELNAME terminal argument. Llamafile accepts the *.gguf model format.
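For example, on Linux, running a separately downloaded model looks roughly like this (the model filename here is just a placeholder):
chmod +x llamafile-0.9.3                      # make the downloaded Llamafile binary executable
./llamafile-0.9.3 -m some-model.Q5_K_M.gguf   # point it at any GGUF model file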
On my machine, any model bigger than 4 billion parameters would be very slow, so 8B was out of the question (even with quantization, though I'd like to test a model in Q2 format). I therefore downloaded models of 3B and smaller. Of course, I never expected them to be any good compared to the models available online.
In this article, I'll talk about how I organize my workflow. I collected multiple models (mostly abliterated ones, as I'm less likely to be able to run them online), and I put them in a sub-folder to avoid clutter. Here's my folder structure:
AI_STUFF_FOLDER
├── bin
│ ├── FluentlyQwen3-1.7B.i1-Q5_K_M.gguf
│ ├── gemma-3-1b-it-abliterated-GRPO.Q6_K.gguf
│ ├── ggml-model-i2_s.gguf
│ ├── huihui-ai.Qwen3-0.6B-abliterated.Q6_K.gguf
│ ├── llamafile-0.9.3
│ └── Phi-3.5-mini-instruct_Uncensored-Q5_K_L.gguf
└── llamafile.sh
I also didn't want to copy/paste the file names every time, so I used the Venice Coding Model to create a script that runs Llamafile for me. I created both a .sh version for Linux and a .bat version for Windows.
The main way to run Llamafile is to open a terminal (or PowerShell on Windows) and type ./llamafile-0.9.3 -m MODELNAME (replace 0.9.3 with your version number). Since I want to be able to use any version of Llamafile and any model, the script keeps the Llamafile version as a configurable value at the top and lets me choose which model to run. Here's the script in its current state:
#!/bin/bash

##############################################
# CONFIGURABLE DEFAULTS - CHANGE THESE VALUES #
##############################################
LLAMAFILE_VERSION="llamafile-0.9.3"
DEFAULT_HOST="127.0.0.1"
DEFAULT_PORT="8080"
DEFAULT_MODEL_DIR="bin"
##############################################

# Check if -m or --model flag exists in arguments
has_model_flag=0
has_server_flag=0
for arg in "$@"; do
    if [ "$arg" = "-m" ] || [[ "$arg" == --model* ]]; then
        has_model_flag=1
    elif [ "$arg" = "--server" ] || [ "$arg" = "-s" ]; then
        has_server_flag=1
    fi
done

if [ $has_model_flag -eq 1 ]; then
    # Direct execution with all arguments
    exec "./bin/$LLAMAFILE_VERSION" "$@"
else
    # Find all GGUF models in model directory
    models=()
    while IFS= read -r -d $'\0' model; do
        models+=("$model")
    done < <(find "$DEFAULT_MODEL_DIR" -maxdepth 1 -type f -name '*.gguf' -print0 2>/dev/null)

    # Handle no models found
    if [ ${#models[@]} -eq 0 ]; then
        echo "ERROR: No .gguf models found in $DEFAULT_MODEL_DIR folder" >&2
        exit 1
    fi

    # Auto-select if only one model exists
    if [ ${#models[@]} -eq 1 ]; then
        selected_model="${models[0]}"
    else
        # Prepare model selection menu
        basenames=()
        for model in "${models[@]}"; do
            basenames+=("$(basename "$model")")
        done

        while true; do
            clear
            echo "=== AVAILABLE MODELS ==="
            for i in "${!basenames[@]}"; do
                printf "%3d) %s\n" $((i+1)) "${basenames[i]}"
            done
            echo -ne "\nSelect model by number or partial name: "
            read choice

            # Process selection
            if [[ "$choice" =~ ^[0-9]+$ ]]; then
                idx=$((choice-1))
                if (( idx >= 0 && idx < ${#models[@]} )); then
                    selected_model="${models[idx]}"
                    break
                else
                    echo -e "\nInvalid number. Please enter a number between 1 and ${#models[@]}."
                    echo "Press any key to continue..."
                    read -rsn1
                fi
            else
                # Fuzzy search by name (case-insensitive)
                matches=()
                choice_lower=$(echo "$choice" | tr '[:upper:]' '[:lower:]')
                for i in "${!basenames[@]}"; do
                    base_lower=$(echo "${basenames[i]}" | tr '[:upper:]' '[:lower:]')
                    if [[ "$base_lower" == *"$choice_lower"* ]]; then
                        matches+=("$i")
                    fi
                done

                if [ ${#matches[@]} -eq 1 ]; then
                    selected_model="${models[${matches[0]}]}"
                    break
                elif [ ${#matches[@]} -eq 0 ]; then
                    echo -e "\nNo matches found for '$choice'."
                else
                    echo -e "\nMultiple matches found (${#matches[@]}):"
                    for i in "${matches[@]}"; do
                        printf " %3d) %s\n" $((i+1)) "${basenames[i]}"
                    done
                fi
                echo "Press any key to continue..."
                read -rsn1
            fi
        done
    fi

    # Build final command with server defaults if needed
    if [ $has_server_flag -eq 1 ]; then
        has_host=0
        has_port=0
        # Check if host/port already specified in arguments
        for ((i=1; i<=$#; i++)); do
            arg="${!i}"
            if [[ "$arg" == "--host"* || "$arg" == "-h"* ]]; then
                has_host=1
            elif [[ "$arg" == "--port"* || "$arg" == "-p"* ]]; then
                has_port=1
            fi
        done

        # Add default host if not specified
        if [ $has_host -eq 0 ]; then
            set -- "$@" "--host" "$DEFAULT_HOST"
        fi
        # Add default port if not specified
        if [ $has_port -eq 0 ]; then
            set -- "$@" "--port" "$DEFAULT_PORT"
        fi
    fi

    # Execute with all arguments
    exec "./bin/$LLAMAFILE_VERSION" -m "$selected_model" "$@"
fi
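For reference, this is how I call it from the AI_STUFF_FOLDER directory (the behaviour noted in the comments follows from the script above):
chmod +x llamafile.sh
./llamafile.sh                     # no arguments: shows the model selection menu
./llamafile.sh --server            # pick a model, then serve on 127.0.0.1:8080 by default
./llamafile.sh -m bin/huihui-ai.Qwen3-0.6B-abliterated.Q6_K.gguf   # -m skips the menu entirely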
There are more features I want to add. For example, arguments that choose the smallest model, the largest one, or even a random one; a feature that passes different values for top_k and temperature depending on the model; and a model-specific config file... But all of these ideas may be way too complex, so I'll refrain for now.
What do you think?
In the future, I'd like to test these models and find the best use-cases for them. I'd love to buy a stronger machine that could run 25B (or even 70B) models, but that's currently out of my price range.
I also downloaded a couple of MoE (Mixture of Experts) models, but they didn't work for some reason... They may be incompatible with Llamafile, or they may be broken... Both models are Qwen-based and created by the same person, so I'd like to test a MoE model created by a different account in the future.
Posted Using INLEO