Best of LLM, February 2026

  1.
    Article
    daily.dev Changelog · 9w

    A live leaderboard for AI coding tools

    The Arena is a real-time leaderboard on daily.dev that tracks developer mindshare for AI coding agents and LLMs using a custom metric called the D-Index, which combines mention volume and sentiment. It covers tools like Cursor, Claude Code, Copilot, Codex, Windsurf, and LLMs like Claude, GPT, DeepSeek, and Gemini. Five spotlight crowns highlight category leaders including Developer's Choice, Most Loved, Fastest Rising, Most Discussed, and Most Controversial. Rankings refresh every 60 seconds and include sentiment scores, 24h mention volume, momentum indicators, and 7-day sparklines. A live highlights feed surfaces notable developer posts with sentiment context.
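daily.dev has not published the D-Index formula; as a minimal sketch of how a mindshare metric combining mention volume and sentiment might be built, the weighting, normalization cap, and score range below are all assumptions, not the actual method:

```python
def d_index(mentions_24h: int, sentiment: float, max_mentions: int = 10_000,
            volume_weight: float = 0.6) -> float:
    """Hypothetical mindshare score in [0, 100].

    mentions_24h: raw mention count over the last 24 hours.
    sentiment:    mean sentiment in [-1.0, 1.0].
    """
    volume = min(mentions_24h / max_mentions, 1.0)   # cap and normalize volume
    positivity = (sentiment + 1.0) / 2.0             # map [-1, 1] onto [0, 1]
    return 100.0 * (volume_weight * volume + (1 - volume_weight) * positivity)

# A heavily mentioned, well-liked tool outscores a quiet, polarizing one.
print(d_index(8_000, 0.5) > d_index(500, -0.2))
```

Any real implementation would also need the momentum and sparkline history the leaderboard displays; this only illustrates the volume-plus-sentiment blend.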

  2.
    Article
    Hackaday · 12w

    How Vibe Coding Is Killing Open Source

    Research suggests LLM-assisted 'vibe coding' may harm open source ecosystems by reducing direct interaction with projects, decreasing website visits and documentation usage, and eliminating organic library selection. The practice replaces developer engagement with chatbot interactions, potentially starving projects of community participation, bug reports, and revenue from sponsorships. Studies show AI coding assistants introduce 41% more bugs and reduce experienced developer productivity by 19%, while degrading cognitive skills. The statistical nature of LLMs means only the most prevalent dependencies in training data get used, similar to how 80% of Spotify artists receive minimal plays and compensation.

  3.
    Article
    Addy Osmani · 9w

    Stop Using /init for AGENTS.md

    Auto-generated AGENTS.md files (produced via /init) hurt AI coding agent performance and inflate costs by 20%+ because they duplicate information agents can already discover by reading the codebase. Two 2026 research papers show LLM-generated context files reduce task success while increasing cost, whereas human-written files help only when they contain non-discoverable information like tooling gotchas, non-obvious conventions, and hidden landmines. The right mental model is to treat AGENTS.md as a minimal, living list of codebase friction points that can't be inferred—not a comprehensive onboarding document. Every discoverable line is noise that competes with the actual task via context dilution. A better architecture involves a routing layer with dynamically loaded, task-specific context rather than a monolithic static file, though tooling support for this is still lacking.
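The kind of minimal, friction-points-only file the post argues for might look like this (the contents are hypothetical examples of non-discoverable information, not from the article):

```markdown
# AGENTS.md — friction points only

- `make test` silently skips integration tests unless DOCKER_HOST is set.
- Never edit files under `src/generated/`; codegen overwrites them.
- The `users` and `accounts` tables look redundant but are not: see ADR-014.
```

Everything an agent could learn by reading the code stays out; only the landmines remain.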

  4.
    Article
    monday Engineering · 12w

    How We Use AI to Turn Figma Designs into Production Code

    Monday.com built an AI-powered system to convert Figma designs into production-ready code that adheres to their design system. Instead of directly generating code, they created a design-system MCP (Model Context Protocol) that exposes structured knowledge about components, tokens, and accessibility rules. An agentic workflow built with LangGraph breaks down design-to-code into 11 focused steps, analyzing layout, resolving tokens, identifying components, and planning implementation. The agent returns structured context rather than code, allowing different teams to generate code in their own style while ensuring design system compliance, accessibility, and proper component usage from the start.
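The key design choice is that the agent returns structured context, not code. A sketch of what such a payload might contain, with field names assumed for illustration (Monday.com's actual schema is not published):

```python
from dataclasses import dataclass, asdict

@dataclass
class DesignContext:
    """Hypothetical structured output of a design-to-code agent:
    enough information for any team to generate code in its own style."""
    layout: str              # e.g. "vertical stack, 16px gap"
    components: list[str]    # design-system components identified in the frame
    tokens: dict[str, str]   # resolved design tokens, not raw hex values
    a11y_notes: list[str]    # accessibility rules that must be honored

ctx = DesignContext(
    layout="vertical stack, 16px gap",
    components=["Button", "TextField"],
    tokens={"color.primary": "#6161FF"},
    a11y_notes=["Button needs aria-label when icon-only"],
)
print(asdict(ctx))
```

Separating this context from code generation is what lets each team render it in its own stack while keeping design-system compliance.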

  5.
    Article
    Hackaday · 10w

    What About The Droid Attack On The Repos?

    Open source maintainers are increasingly overwhelmed by AI-generated 'slop' pull requests submitted by autonomous agents. Jeff Geerling and Daniel Stenberg (curl) are among those raising alarms, with GitHub now offering options to disable PRs entirely or restrict them to invited collaborators only. While the root cause is human behavior—someone configured the agent—the flood of low-quality AI submissions is eroding the collaborative openness that made open source strong. Maintainers may be forced to close off their projects, sacrificing the serendipitous contributions from unknown developers that historically helped squash bugs.

  6.
    Video
    Theo - t3.gg · 11w

    GLM-5 is unbelievable (Opus for 20% the cost??)

    GLM-5, a new open-weight AI model from Chinese lab Zhipu AI, delivers performance comparable to Claude Opus 4.5 and Codex 5.2 at roughly 20% of the cost. With 744 billion parameters (40B active via mixture-of-experts), it excels at long-running agentic tasks, successfully completing hour-long code migrations that previously required closed-weight models. The model achieves the lowest hallucination rate on benchmarks to date, costs $3 per million output tokens versus $15-18 for top closed models, and is MIT-licensed without usage restrictions. While lacking multimodal image support, it demonstrates strong capabilities in code refactoring, UI generation, and extended autonomous work sessions.

  7.
    Video
    Theo - t3.gg · 9w

    Delete your CLAUDE.md (and your AGENT.md too)

    A study found that CLAUDE.md and AGENT.md context files used with AI coding agents either marginally improve performance (+4%) when developer-written, or slightly hurt it (-3%) when LLM-generated, while increasing costs by over 20%. The core argument is that modern LLMs are already good at exploring codebases autonomously, so bloated context files distract rather than help. Best practice is to keep these files minimal—only documenting consistent failure patterns the agent exhibits—and to focus instead on improving codebase structure, tests, and tooling. The author also shares unconventional prompting tricks like intentionally misleading agents to steer behavior, and recommends deleting auto-generated init files entirely.

  8.
    Article
    ByteByteGo · 12w

    How Grab Built a Vision LLM to Scan Images

    Grab built a custom 1B-parameter Vision LLM to extract information from Southeast Asian documents for eKYC verification. Starting with Qwen2-VL 2B, they progressed from LoRA fine-tuning to full parameter training, then built a lightweight model from scratch combining Qwen2-VL's vision encoder with Qwen2.5's compact language decoder. The four-stage training process included projector alignment, vision enhancement, language-specific visual training on synthetic OCR data, and task-specific fine-tuning. The final model achieved comparable accuracy to the 2B version while delivering 48-56% faster latency, addressing challenges with non-Latin scripts and diverse document formats across the region.

  9.
    Article
    Claude · 11w

    Claude Enterprise, now available self-serve

    Claude Enterprise is now available for self-serve purchase with a seat-plus-usage pricing model. The offering provides organization-wide access to Claude, Claude Code, and Cowork with enterprise security features including SSO, SCIM provisioning, audit logs, custom data retention policies, and usage analytics. It integrates with Microsoft 365, Slack, Excel, and PowerPoint through connectors and built-in chat sidebars. Organizations use it across sales, engineering, marketing, product, and finance teams to accelerate workflows and handle complex tasks with large codebases and document sets.

  10.
    Article
    Tailscale · 9w

    LM Link: Access models on your powerful devices you own, as if they were local

    Tailscale and LM Studio have partnered to launch LM Link, a feature that lets users securely access open-weight LLMs running on remote hardware they own as if those models were local. Built on Tailscale's tsnet (a userspace Go library), LM Link creates end-to-end encrypted peer-to-peer connections between devices without exposing anything to the public internet. Setup requires just a few clicks or terminal commands. Use cases include home power users offloading to a beefy desktop GPU, teams sharing large models internally, regulated industries keeping data on-prem, and developers running CI tests against large models privately.

  11.
    Article
    Nx · 10w

    Why we deleted (most of) our MCP tools

    Nx shifted from MCP tools to agent skills for AI assistants after recognizing that modern agentic workflows made many MCP tools redundant. Agents can now execute CLI commands directly and process outputs themselves, making context-dumping MCP tools token-inefficient. Skills provide domain-specific knowledge about when and how to use Nx features, while MCP remains valuable for authenticated APIs and process communication. Benchmarks show skills outperform MCP-only approaches, especially for smaller models, with agents using generators more consistently and validating their work more often.

  12.
    Article
    Snowflake Community · 12w

    SKILLs MD for Analytics: How We Made Snowflake Intelligence Agents Reliable for Production

    PDQ solved AI agent hallucinations in production analytics by encoding Standard Operating Procedures as SKILLs—version-controlled, agent-executable contracts that define inputs, logic, validation, and guardrails. Instead of scaling multiple specialized agents with long prompts, they built one agent with a library of SKILLs deployed via Git-Ops to Snowflake Dynamic Tables and indexed through Cortex Search. This approach eliminated inconsistent answers, improved quality through mandatory validation steps, and made agent reasoning auditable like code. The architecture separates SKILL discovery (lightweight semantic search) from execution (loading complete SOPs) to preserve context windows while ensuring deterministic analytical workflows.
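The discovery/execution split can be sketched in a few lines: a cheap search runs over short skill summaries, and only the chosen skill's full SOP is loaded into context. The skill names, summaries, and word-overlap "search" below are illustrative stand-ins, not PDQ's Cortex Search setup:

```python
# Toy skill library: lightweight summaries for discovery, full SOPs for execution.
SKILLS = {
    "churn_rate": {
        "summary": "monthly customer churn validated against billing",
        "sop": "1. Query active subscriptions...\n2. Validate row counts...",
    },
    "arr": {
        "summary": "annual recurring revenue with currency normalization",
        "sop": "1. Pull invoices...\n2. Normalize to USD...",
    },
}

def discover(query: str) -> str:
    """Stand-in for semantic search: pick the skill whose summary
    shares the most words with the query."""
    words = set(query.lower().replace("?", "").split())
    return max(SKILLS, key=lambda k: len(words & set(SKILLS[k]["summary"].split())))

def execute(skill: str) -> str:
    """Only now does the complete SOP enter the agent's context window."""
    return SKILLS[skill]["sop"]

chosen = discover("what is our monthly churn?")
print(chosen, "->", execute(chosen).splitlines()[0])
```

The point of the split: the context window pays for one full SOP per task, not the whole library.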

  13.
    Article
    Machine Learning Mastery · 11w

    The 7 Biggest Misconceptions About AI Agents (and Why They Matter)

    AI agents are conditional automation systems, not truly autonomous entities. Common misconceptions lead to production failures: agents require explicit boundaries and guardrails; prototypes differ vastly from production-ready systems; more tools and context often degrade performance; behavior is non-stationary and requires continuous monitoring; most failures stem from system design rather than model limitations; and evaluation must focus on behavioral metrics like tool-selection accuracy rather than text quality. Successful deployments treat agents as engineered systems with constraints, not intelligent entities that self-regulate.
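"Explicit boundaries and guardrails" can be as concrete as a tool allowlist and a step budget enforced outside the model. A minimal sketch (tool names and limits are hypothetical):

```python
class GuardrailViolation(Exception):
    """Raised when the agent steps outside its declared boundaries."""

class BoundedAgent:
    def __init__(self, allowed_tools: set[str], max_steps: int):
        self.allowed_tools = allowed_tools
        self.max_steps = max_steps
        self.steps = 0

    def call_tool(self, name: str) -> str:
        self.steps += 1
        if self.steps > self.max_steps:
            raise GuardrailViolation("step budget exceeded")
        if name not in self.allowed_tools:
            raise GuardrailViolation(f"tool {name!r} not allowed")
        return f"ran {name}"  # a real agent would dispatch to the tool here

agent = BoundedAgent(allowed_tools={"search", "read_file"}, max_steps=3)
print(agent.call_tool("search"))
```

The guardrail lives in ordinary code, so it holds regardless of what the model decides to do.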

  14.
    Article
    InfoWorld · 11w

    Is AI killing open source?

    AI-generated pull requests are overwhelming open source maintainers with low-quality contributions that take seconds to create but hours to review. Tools like Claude Code can autonomously submit patches, creating an unsustainable asymmetry where maintainers drown in "slop PRs" lacking context and understanding. Small utility libraries are becoming obsolete as developers generate code on-demand instead of using dependencies. This shift is forcing projects toward stricter contribution barriers and smaller, more exclusive communities where human judgment and relationships matter more than volume. The future of open source may belong to projects that are hardest to contribute to, prioritizing care and curation over accessibility.

  15.
    Article
    Ramp Engineering · 10w

    We fixed ~100 security issues in 6 days with 0 humans

    Ramp's security engineering team built a multi-agent pipeline that autonomously found, validated, and patched nearly 100 security vulnerabilities in their backend codebase in under a week, with no human involvement until PR review. The system used specialized detector agents for specific vulnerability classes (e.g., IDOR), adversarial manager agents to filter false positives (rejecting 40% of initial findings), a validator agent that wrote integration tests to confirm real issues, and a fixer agent that applied patches using test-driven development. The approach uncovered novel high-severity issues missed by penetration testing, bug bounties, and 10+ commercial scanning tools. The entire setup required only a four-hour hackathon and one week of work by a single engineer.
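The detector, manager, validator, fixer shape can be sketched as a pipeline of plain functions over fake findings; Ramp's system drives an LLM agent at each stage, and the endpoint names below are invented:

```python
def detector(endpoints: list[str]) -> list[dict]:
    """Flag candidate IDOR issues: endpoints that fetch by id."""
    return [{"endpoint": ep, "kind": "IDOR"} for ep in endpoints if "by_id" in ep]

def manager(findings: list[dict], known_safe: set[str]) -> list[dict]:
    """Adversarial filter: reject findings judged to be false positives."""
    return [f for f in findings if f["endpoint"] not in known_safe]

def validator(findings: list[dict]) -> list[dict]:
    """Stand-in for writing an integration test that confirms each issue."""
    return [dict(f, confirmed=True) for f in findings]

def fixer(findings: list[dict]) -> list[str]:
    """Stand-in for patching each confirmed issue test-first."""
    return [f"patched {f['endpoint']}" for f in findings]

endpoints = ["get_invoice_by_id", "list_invoices", "get_user_by_id"]
confirmed = validator(manager(detector(endpoints), known_safe={"get_user_by_id"}))
print(fixer(confirmed))
```

The adversarial manager stage is the notable design choice: a second agent whose job is to reject the first agent's output, which is how Ramp filtered out 40% of initial findings before any validation work was spent on them.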

  16.
    Video
    Gamefromscratch · 10w

    The Slop Apocalypse: How AI is Breaking Game Engines

    AI-generated code contributions are overwhelming Godot's open-source maintainers with low-quality pull requests, draining their capacity and morale. Meanwhile, Unity's CEO is making bold AI announcements—promising to generate full casual games from natural language prompts—largely to prop up a stock that dropped from $43 to $18 after Google's Genie 3 demo spooked investors. The author argues Unity's announced AI features already exist in the engine, the announcements are stock-market theater, and that Genie 3 is an impressive interactive video system but not a real game engine. The broader point: AI is disrupting game engines from two opposite directions—flooding open-source projects with slop contributions while pushing public companies into AI hype cycles.

  17.
    Article
    LangChain · 10w

    Agent Observability Powers Agent Evaluation

    Agent observability differs fundamentally from traditional software observability because agents are non-deterministic — you can't predict behavior until runtime. This post explains why debugging agents means debugging reasoning rather than code, introduces three core observability primitives (runs, traces, threads), and shows how these primitives map directly to three levels of agent evaluation: single-step (unit tests for decisions), full-turn (end-to-end trajectory), and multi-turn (context persistence across sessions). Production traces serve triple duty: manual debugging, building offline evaluation datasets from real failures, and powering continuous online evaluation. The key insight is that observability and evaluation are inseparable for agents — traces are the only source of truth for what an agent actually did.
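The three primitives nest naturally, and each eval level asserts on one layer. A sketch with illustrative field names (not LangChain's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class Run:      # one step: a single model call or tool call
    name: str
    output: str

@dataclass
class Trace:    # one full turn: the ordered runs serving one request
    runs: list[Run] = field(default_factory=list)

@dataclass
class Thread:   # one session: traces sharing conversational context
    traces: list[Trace] = field(default_factory=list)

# Single-step eval asserts on a Run, full-turn eval on a Trace's
# trajectory, multi-turn eval on context persistence across a Thread.
t = Thread(traces=[Trace(runs=[Run("search_tool", "3 results")])])
print(t.traces[0].runs[0].name)
```

Because the trace is the only record of what the agent actually did, the same objects back all three evaluation levels.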

  18.
    Video
    bycloud · 11w

    LLM’s Billion Dollar Problem

    Token consumption in LLMs has exploded with thinking models and AI agents, creating scalability challenges. Standard attention mechanisms scale quadratically with context length, making long contexts prohibitively expensive. Three approaches attempt to solve this: sparse attention (restricts which tokens interact), linear attention (accumulates information in shared memory), and compressed attention (compresses tokens before comparison). While sparse and compressed attention help, only linear attention can theoretically scale past 1M context windows. Recent developments show hybrid approaches combining linear attention with standard or compressed attention achieving promising results, with Google's Gemini 3 Flash demonstrating breakthrough performance at 1M context length.
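The quadratic-versus-linear claim comes down to a back-of-envelope cost model: per layer, softmax attention does on the order of n² · d work while linear attention does n · d² by maintaining a fixed-size state (constants omitted throughout):

```python
def softmax_attention_flops(n: int, d: int) -> int:
    """Every token attends to every other token: ~n^2 * d."""
    return n * n * d

def linear_attention_flops(n: int, d: int) -> int:
    """A fixed-size d x d state is updated once per token: ~n * d^2."""
    return n * d * d

n, d = 1_000_000, 128  # 1M-token context, head dimension 128
ratio = softmax_attention_flops(n, d) / linear_attention_flops(n, d)
print(f"softmax does ~{ratio:,.0f}x the work at 1M tokens")
```

The ratio is simply n/d, which is why linear attention is the only family that theoretically keeps scaling past 1M-token contexts.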

  19.
    Article
    ploeh blog · 12w

    Code that fits in a context window

    LLMs struggle with large codebases due to context window limitations, similar to how human short-term memory constrains programming. The author suggests that architectural patterns like Fractal Architecture—organizing code into small, nested components at every abstraction level—could help both humans and AI systems manage complexity more effectively. These principles from "Code That Fits in Your Head" may be equally valuable for making code more accessible to LLMs.

  20.
    Article
    Machine Learning Mastery · 9w

    Introduction to Small Language Models: The Complete Guide for 2026

    Small language models (SLMs), typically under 10 billion parameters, are increasingly preferred in production AI systems due to their cost, latency, and privacy advantages over large models. Modern SLMs like Phi-3 Mini, Llama 3.2 3B, and Mistral 7B achieve competitive performance through techniques like knowledge distillation, high-quality training data, quantization, and architectural optimizations. For 80% of predictable, repeated production tasks, SLMs can cut costs by up to 95% and respond in 50–200ms locally. Real-world use cases include customer support, code assistance, document processing, and mobile apps. A hybrid router pattern—SLMs for routine queries, LLMs for complex ones—is emerging as the practical production standard. Getting started requires only Python skills, domain-specific data, and a few hours of GPU time using tools like Ollama and Hugging Face Transformers.
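The hybrid router pattern reduces to a cheap classification step in front of two model tiers. A sketch with a deliberately naive heuristic; real routers use a classifier or an SLM itself, and the intents and thresholds here are placeholders:

```python
# Intents known to be routine and safe for the small local model.
ROUTINE_INTENTS = {"reset password", "order status", "refund policy"}

def route(query: str) -> str:
    """Send predictable, short queries to the local SLM; escalate the rest."""
    q = query.lower()
    if any(intent in q for intent in ROUTINE_INTENTS) and len(q.split()) < 20:
        return "slm"   # e.g. a local 3B model answering in 50-200ms
    return "llm"       # a large hosted model for open-ended queries

print(route("How do I reset password?"))
print(route("Compare our Q3 churn drivers against the pricing change."))
```

The economics follow directly: if roughly 80% of traffic is routine, 80% of queries never touch the expensive tier.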

  21.
    Video
    developedbyed · 10w

    AI Coding Is here to stay

    A developer shares personal reflections on how AI coding tools have reshaped software development, including killing the traditional tutorial YouTube niche. Practical tips are offered: reset context windows after 100k-200k tokens, avoid pre-made MCP/agent configs in favor of project-specific rules files, ask follow-up questions to learn from AI implementations, and run two parallel agent instances at most. The author also promotes their own React ASCII animation library (AskGen) built in under two days using Claude Opus and Codex, and teases an AI-first interactive coding learning platform.

  22.
    Article
    Windsurf · 12w

    Windsurf Tab v2: 25-75% more accepted chars with Variable Aggression

    Windsurf Tab v2 introduces a completely rewritten autocomplete model that increases accepted characters by 25-75% through improved context engineering and a new "variable aggression" feature. The team optimized the system prompt (76% reduction in length), refined the data pipeline, and used reinforcement learning to train models that predict more code per suggestion while maintaining acceptance rates. Users can now choose between different aggression levels to match their preferences, from conservative suggestions to bolder multi-line predictions. The update focuses on maximizing total keystrokes saved rather than just optimizing for acceptance rate alone.
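The metric shift is simple expected-value arithmetic: a bolder suggestion can win on total characters saved even with a lower acceptance rate. The numbers below are illustrative, not Windsurf's measurements:

```python
def expected_chars_saved(acceptance_rate: float, chars_per_suggestion: float) -> float:
    """Expected value per suggestion shown: what Tab v2 optimizes for."""
    return acceptance_rate * chars_per_suggestion

conservative = expected_chars_saved(0.40, 25)  # short, safe completions
aggressive = expected_chars_saved(0.30, 60)    # bolder multi-line predictions
print(aggressive > conservative)               # lower rate, more value
```

This is why optimizing acceptance rate alone is a trap: it rewards timid suggestions, while "variable aggression" lets users pick a point on this trade-off.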

  23.
    Article
    Netflix TechBlog · 11w

    Scaling LLM Post-Training at Netflix

    Netflix built an internal post-training framework to scale LLM fine-tuning from experimentation to production. The framework abstracts infrastructure complexity across four dimensions: data (streaming, sequence packing, loss masking), model (sharding, LoRA, architecture support), compute (distributed job orchestration, checkpointing, MFU monitoring), and workflow (supporting both SFT and on-policy RL). Key engineering decisions include staying Hugging Face-compatible for interoperability, maintaining optimized internal model implementations for performance, and evolving from SPMD-only execution to hybrid orchestration for RL workflows. The system enables researchers to focus on modeling rather than distributed systems plumbing.
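Sequence packing, one of the data-dimension features listed, is the standard trick of fitting variable-length training sequences into fixed-size bins so compute isn't wasted on padding. A greedy first-fit sketch (Netflix's actual packer is not described in the post):

```python
def pack_sequences(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing: place each sequence into the first
    bin with room, opening a new bin only when none fits."""
    bins: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

packed = pack_sequences([900, 300, 700, 100, 500], max_len=1000)
print(packed)
```

A production packer also needs attention masks so packed sequences don't attend across boundaries, which is where the loss-masking support mentioned above comes in.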

  24.
    Article
    sean goedecke · 10w

    LLM-generated skills work, if you generate them afterwards

    LLM-generated "skills" (explanatory prompts for specific tasks) work better when created after solving a problem rather than before. A recent paper found that pre-generated skills provide no benefit because they bake in incorrect assumptions from training data. The effective approach is to have the LLM solve the problem through iteration first, then distill that learned experience into a reusable skill document. This captures knowledge gained from millions of tokens of problem-solving rather than just regurgitating existing training data.

  25.
    Video
    AICodeKing · 11w

    MiniMax M2.5 (Fully Tested): I've been testing it for the last 4 days and it is AMAZING!!!

    MiniMax M2.5 is a 230 billion parameter language model that delivers performance comparable to Claude Opus at 1/30th the cost. The model excels at agentic coding tasks, costs $1 per hour at 100 tokens/second, and successfully completed multiple complex development projects including Expo apps, Go terminal calculators, and full-stack web applications. Testing showed it performs particularly well in code generation workflows with planning mode enabled, completing tasks in minutes with proper error correction.