Needle: Tiny AI Model Runs Fast on Devices
Summary
Cactus Compute has open-sourced Needle, a 26-million-parameter tool-calling model built for devices such as phones, watches, and glasses. Needle challenges the idea that every AI agent action needs a large model: its job is to choose the right tool and fill in that tool's arguments locally, quickly, and cheaply. The company released the weights on Hugging Face and the code on GitHub.

Needle runs at 6,000 tokens per second for prefill and 1,200 tokens per second for decode on consumer devices. It was trained for single-shot function calling, not general chat. In function calling, a model returns structured output that an application uses to invoke external systems. Cactus Compute states that Needle was pretrained on 200 billion tokens and then post-trained on 2 billion synthetic function-calling tokens covering 15 categories, including timers, messaging, and smart home tasks.

Needle's architecture, called a Simple Attention Network, uses attention and gating without MLP (feed-forward) layers. The approach suggests that many AI agent workflows may be using models larger than the task requires.

The bottom line: a small, local model like Needle could give startups a more cost-effective way to build agentic apps by reducing reliance on expensive cloud inference for routine actions.
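To make the idea of single-shot function calling concrete, here is a minimal Python sketch of the application side: it takes the structured output a tool-calling model might emit, checks it against a small tool registry, and runs the matching function. The JSON shape, the tool names (`set_timer`, `send_message`), and their parameters are assumptions chosen for illustration, not Needle's actual output format or API.

```python
import json

# Hypothetical tools: names and parameters are illustrative only,
# not taken from Needle's actual tool schema.
def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes"

def send_message(recipient: str, text: str) -> str:
    return f"Message to {recipient}: {text}"

# Registry mapping each tool name to its function and expected argument types.
TOOLS = {
    "set_timer": (set_timer, {"minutes": int}),
    "send_message": (send_message, {"recipient": str, "text": str}),
}

def dispatch(model_output: str) -> str:
    """Parse a model's structured output and call the matching tool."""
    call = json.loads(model_output)
    name = call["name"]
    args = call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    func, schema = TOOLS[name]
    # Reject missing or unexpected arguments before executing anything.
    if set(args) != set(schema):
        raise ValueError(f"bad arguments for {name}: {sorted(args)}")
    # Reject arguments whose type does not match the registry.
    for key, expected in schema.items():
        if not isinstance(args[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    return func(**args)

# Assumed output for a request like "set a timer for 10 minutes".
example_output = '{"name": "set_timer", "arguments": {"minutes": 10}}'
result = dispatch(example_output)
print(result)
```

In this setup the model only has to pick a tool name and fill in a few arguments, which is the narrow task the summary describes. Validation and execution stay in ordinary application code.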
This is an AI-generated audio summary. Always check the original source for complete reporting.