Best of NLP — 2025

1
Article
Collections·1y
Train Your Own Large Language Model: A Comprehensive Course
A new comprehensive course by freeCodeCamp teaches learners how to develop their own large language models (LLMs) from scratch. The course covers fundamental concepts, tokenization, Transformer architecture, and fine-tuning techniques like Low-Rank Adaptation (LoRA). Practical applications include working with chat data and developing models for underrepresented languages. Extensive resources provided include slides, notebooks, and code examples.
255
8
2
Article
Hacker News·1y
is-even-ai
Explore the is-even-ai package which utilizes OpenAI's GPT-3.5-turbo model to check if numbers are even or odd. It offers various functions such as checking equality, greater than, and less than comparisons with examples of implementation. Users can adjust the AI model and temperature for more sophisticated uses.
191
58
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
12 Powerful Tools For AI Agents
A comprehensive guide listing 12 powerful tools included in the CrewAI framework for building AI agents. The tools range from file reading and writing, code interpreting, and web scraping to advanced functionalities like RAG-powered searches and natural language to SQL conversion. Additionally, the post highlights a full crash course on AI agents, covering everything from fundamentals to production optimization.
162
4
Article
Daily Dose of Data Science | Avi Chawla | Substack·51w
48 Most Popular Open ML Datasets
A comprehensive compilation of 48 widely-used open machine learning datasets organized by domain including computer vision (ImageNet, COCO), natural language processing (SQuAD, GLUE), recommendation systems (MovieLens, new Yambda-5B), tabular data (UCI datasets, Titanic), reinforcement learning (OpenAI Gym), and multimodal learning (LAION-5B, VQA). Each dataset is briefly described with its primary use case and key characteristics, serving as a reference guide for researchers and practitioners selecting appropriate datasets for their ML projects.
140
1
5
Article
ByteByteGo·40w
How LLMs See Images, Audio, and More
Modern AI systems process images, audio, and video by converting them into discrete tokens, similar to text processing. Images use patch embeddings (dividing into grid squares), vector quantization (learning visual codebooks), or contrastive embeddings. Audio employs neural codecs for quality preservation, ASR transcription for semantic content, or hierarchical approaches for multi-scale representation. Each tokenization method involves trade-offs between computational efficiency, information preservation, and semantic understanding, with the optimal choice depending on specific use cases and requirements.
138
6
Article
Machine Learning Mastery·1y
3 Easy Ways to Fine-Tune Language Models
The post discusses three methods to fine-tune language models: full fine-tuning, parameter-efficient fine-tuning (PEFT), and instruction tuning. Full fine-tuning updates all model parameters, offering state-of-the-art performance but requiring significant computational power. PEFT, including techniques like LoRA, updates only a small portion of parameters, making it resource-efficient. Instruction tuning uses diverse task instructions, enhancing the model's ability to generalize. Code examples and detailed steps are provided for each method.
121
1
7
Article
Friedrich WT·33w
AI Engineers then Vs Now
112
15
8
Article
Hugging Face·1y
The NLP Course is becoming the LLM Course!
Hugging Face is upgrading its NLP course by renaming it to the LLM course, reflecting the latest advancements in AI. The revamped course will include new chapters on fine-tuning LLMs and building reasoning models, alongside maintaining and updating existing NLP content. The goal is to make cutting-edge research accessible and community-driven, with interactive exercises and live sessions available where beneficial.
97
9
Video
freeCodeCamp·1y
DeepSeek-R1 Crash Course
Andrew Brown's crash course introduces DeepSeek, a platform for utilizing and running large language models (LLMs) such as DeepSeek R1 and V3 on local hardware. He demonstrates downloading and setting up the models using tools like Ollama, LM Studio, and Hugging Face, stressing the importance of having capable hardware such as an Intel Lunar Lake AI PC dev kit or a workstation with an RTX 4080 GPU. Troubleshooting tips and the potential for running models with distributed computing are also discussed.
69
10
Article
Machine Learning Mastery·1y
10 Useful LangChain Components for Your Next RAG System
LangChain is a robust framework designed to simplify the development of LLM-powered applications, particularly useful for building retrieval augmented generation (RAG) systems. The post outlines 10 key components of LangChain, such as document loaders, text splitters, embeddings, vector stores, retrievers, LLM wrappers, chains, memory usage, interaction tools, and evaluation tools. These components facilitate data ingestion, text processing, similarity-based search, and interaction with external systems. A simplified Python example demonstrates their use in a question-answering workflow.
64
11
Article
Ars Technica·39w
College student’s “time travel” AI experiment accidentally outputs real 1834 history
A computer science student created TimeCapsuleLLM, an AI language model trained exclusively on Victorian-era London texts from 1800-1875. When prompted with "It was the year of our Lord 1834," the model unexpectedly generated text referencing real historical protests and Lord Palmerston's actions from that exact year. The student discovered through fact-checking that these were actual historical events, demonstrating how AI models trained on period texts can inadvertently capture and reproduce authentic historical information. This project is part of a growing field of Historical Large Language Models (HLLMs) that aim to recreate linguistic patterns and knowledge frameworks from past eras.
56
16
12
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 9
Part 9 of the crash course on RAG systems provides a comprehensive guide to building powerful RAG systems with a focus on vision language models. It includes a detailed breakdown of ColPali, a state-of-the-art RAG system, showcasing its scalability, accuracy, and integration with binary quantization for low latency applications. The series is beginner-friendly and covers everything from fundamentals to advanced optimization and multimodal applications.
56
13
Article
Sebastian Raschka·1y
The State of LLM Reasoning Models
The post explores recent research advancements in reasoning-optimized large language models (LLMs), focusing on inference-time compute scaling methods. It discusses how various techniques, such as chain-of-thought reasoning and test-time preference optimization, improve the reasoning abilities of LLMs without altering underlying model weights. The article highlights the importance of increasing computational resources during inference to enhance performance, making even smaller models more capable. It also touches on other methods like reinforcement learning and supervised fine-tuning that contribute to improved reasoning in LLMs.
53
14
Video
Theo - t3․gg·1y
I can't believe this is real
The rapid advancements in AI, particularly with OpenAI, have been surprising, with changes in pricing and performance among different models. OpenAI's recent o1 Pro API is much more expensive than the alternatives, posing challenges for content creators and developers. The video highlights the comparative advantages of the o3-mini model, emphasizing its cost-effectiveness and performance. Additionally, issues with user experience and pricing structures of various AI models are discussed, leading to frustrations among users. Sevalla, a sponsor mentioned in the video, offers easy deployment solutions for developers.
48
5
15
Article
ElixirStatus·49w
fuelen/html2text
HTML2Text is a high-performance Elixir library that converts HTML documents to plain text using Rust NIFs. It leverages Rust's html2text crate for fast parsing while maintaining content structure and readability. The library offers a simple API built around the HTML2Text.convert/2 function, which accepts HTML content and line width parameters and supports features like heading conversion, list formatting, table rendering, and link preservation.
44
16
Article
openSUSE·33w
GSoC 2025, Building a Semantic Search Engine for Any Video
A GSoC 2025 project that built an end-to-end semantic video search engine capable of finding specific moments within videos using natural language queries. The system uses a two-part architecture: an ingestion pipeline that processes videos with AI models (TransNetV2, WhisperX, BLIP, VideoMAE) to extract shots, transcripts, captions, and actions, then segments them intelligently and enriches them with LLM-generated summaries; and a search application with FastAPI backend that performs hybrid text-visual searches using ChromaDB vector database and Reciprocal Rank Fusion for result ranking, paired with a Streamlit frontend for user interaction.
40
1
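The Reciprocal Rank Fusion step mentioned in the summary above can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the shot identifiers, and the constant k=60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a text search and a visual search.
text_hits = ["shot_3", "shot_1", "shot_7"]
visual_hits = ["shot_1", "shot_9", "shot_3"]

fused = reciprocal_rank_fusion([text_hits, visual_hits])
print(fused)  # ['shot_1', 'shot_3', 'shot_9', 'shot_7']
```

Items that rank well in both lists rise to the top, which is why the technique suits hybrid text-visual search: it needs only ranks, not comparable scores.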
17
Article
Hacker News·1y
klara-research/klarity: See Through Your Models
Klarity is a tool designed to analyze uncertainty in generative model outputs. It combines raw probability analysis with semantic understanding for deeper insights. Key features include dual entropy analysis, semantic clustering, JSON output for generation patterns, and AI-powered analysis. It is compatible with Hugging Face Transformers, supports models like Qwen2.5-7B, and offers comprehensive generative text analysis.
39
18
Article
Daily Dose of Data Science | Avi Chawla | Substack·43w
How Do LLMs Work?
Large Language Models work by predicting the next word in a sequence using conditional probability. They calculate probabilities for each possible next word given the previous context, then select the most likely candidate. To avoid repetitive outputs, LLMs use temperature sampling, which adjusts the probability distribution: low temperature produces focused, predictable text, while high temperature creates more random, creative outputs. The models learn high-dimensional probability distributions over word sequences, with trained weights serving as the parameters of these distributions.
31
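The temperature sampling described in the summary above can be sketched in a few lines. This is a toy illustration under stated assumptions: the three-word vocabulary and the logit values are made up, and real models apply the same idea over a vocabulary of many thousands of tokens.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(vocab, logits, temperature=1.0):
    """Draw one token at random according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return random.choices(vocab, weights=probs, k=1)[0]


# Hypothetical vocabulary and scores for the next word.
vocab = ["cat", "dog", "fish"]
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)  # more peaked
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter
print(cold, hot)
print(sample_next_token(vocab, logits, temperature=0.5))
```

Dividing the scores by a temperature below 1 widens the gaps between them, so the top candidate dominates; a temperature above 1 narrows the gaps and spreads probability across more candidates.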
19
Article
Sebastian Raschka·1y
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
The post provides a comprehensive guide on implementing a Byte Pair Encoding (BPE) tokenizer from scratch for educational purposes. It explains the main idea behind BPE, how to build a vocabulary, and steps for encoding and decoding. Additionally, it includes Python code for the BPE tokenizer, showcasing training, encoding, and decoding processes, and offers insights on saving and loading the tokenizer. The post also demonstrates how to load the original GPT-2 BPE tokenizer from OpenAI.
31
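The train, encode, and decode cycle that the summary above describes can be sketched as follows. This is a minimal byte-level illustration, not the code from the post: the function names, the choice to start from UTF-8 bytes, and the sample text are assumptions.

```python
from collections import Counter


def most_frequent_pair(token_ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(token_ids, token_ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(token_ids, pair, new_id):
    """Replace every occurrence of `pair` with the single id `new_id`."""
    merged, i = [], 0
    while i < len(token_ids):
        if i + 1 < len(token_ids) and (token_ids[i], token_ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(token_ids[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges, starting from raw UTF-8 bytes (ids 0-255)."""
    token_ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(token_ids)
        if pair is None:
            break
        new_id = 256 + step  # new ids start after the byte range
        merges[pair] = new_id
        token_ids = merge_pair(token_ids, pair, new_id)
    return token_ids, merges


def decode(token_ids, merges):
    """Expand merged ids back into bytes and decode them to text."""
    expand = {new_id: pair for pair, new_id in merges.items()}
    out, stack = [], list(reversed(token_ids))
    while stack:
        token = stack.pop()
        if token in expand:
            left, right = expand[token]
            stack.extend([right, left])  # left is popped first
        else:
            out.append(token)
    return bytes(out).decode("utf-8")


text = "low lower lowest"
ids, merges = train_bpe(text, num_merges=3)
print(len(text), len(ids), decode(ids, merges) == text)
```

Each merge shortens the token sequence by folding a frequent pair into one new id, and decoding simply replays the merges in reverse.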
20
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 8
Part 8 of the crash course on building RAG systems focuses on improving rerankers with an in-depth architectural breakdown of the ColBERT model, which balances scalability and accuracy for reranking modules. The series covers foundational components, evaluation, optimization, multimodality, and graph-based RAG systems, designed to help beginners implement reliable RAG systems effectively.
31
21
Article
Machine Learning Mastery·1y
RAG Hallucination Detection Techniques
Large language models (LLMs) can provide factually incorrect answers, often termed hallucinations. Retrieval augmented generation (RAG) mitigates this by retrieving data from a knowledge base, but hallucinations can still occur. The post discusses techniques to detect these hallucinations using metrics from the DeepEval library, the G-Eval framework, and RAG-specific metrics like faithfulness. Practical examples include the installation and usage with code snippets that evaluate the outputs for accuracy, consistency, and relevance.
30
22
Video
Explosion·1y
Best Way to OCR a PDF in Python - spaCy Layout
spaCy layout, a new package from Explosion AI, integrates seamlessly with the spaCy pipeline to enable OCR processing of PDFs in a single line of code. It offers features such as bounding box detection, region detection, table detection, and image processing. The package enhances spaCy’s native capabilities like part-of-speech tagging and named entity recognition, making it particularly useful for handling structured and unstructured data within PDFs. Users can convert tables to formats like Markdown or pandas data frames, facilitating easier downstream processing tasks.
29
23
Article
vLLM·23w
Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
HaluGate is a real-time hallucination detection system for production LLMs that identifies when models generate claims contradicting provided context. It uses a two-stage pipeline: first classifying whether queries need fact-checking (96.4% accuracy, 12ms latency), then performing token-level detection with NLI explanation for factual queries (76-162ms overhead). Built with ModernBERT and native Rust/Candle integration, it runs without Python dependencies, adding negligible latency compared to LLM generation times. The system integrates with vLLM's Signal-Decision Architecture, exposing results via HTTP headers for downstream policy enforcement. Unlike LLM-as-judge approaches, HaluGate provides explainable, consistent verification specifically for extrinsic hallucinations where tool/RAG context exists.
27
1
24
Article
freeCodeCamp·1y
What is Semantic Matching? How to Find Words in a Document Using NLP
Searching for specific words or phrases in a document can be cumbersome if the exact term isn't present. Semantic matching in NLP uses contextual meaning, rather than exact form, to improve search results. By leveraging techniques like word embedding and cosine similarity, it's possible to find similar words or phrases within text documents. Tools like KeyBERT can streamline keyword extraction, further enhancing the search process.
24
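The embedding-plus-cosine-similarity approach in the summary above can be sketched with toy vectors. This is an illustration only: the three-dimensional embeddings are hand-made assumptions, whereas a real system would load vectors from a trained embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical word vectors, invented for this example.
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.05],
    "truck": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}


def most_similar(query, words, embeddings, top_n=2):
    """Rank document words by similarity to the query word."""
    query_vec = embeddings[query]
    scored = [
        (word, cosine_similarity(query_vec, embeddings[word]))
        for word in words
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


document_words = ["automobile", "banana", "truck"]
matches = most_similar("car", document_words, toy_embeddings)
print([word for word, _ in matches])  # ['automobile', 'truck']
```

A search for "car" surfaces "automobile" even though the exact term never appears in the document, which is the point of matching on meaning rather than form.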
25
Article
Product Hunt·46w
Emergent 2.0: World's first agentic vibe-coding platform
Emergent 2.0 introduces an AI-powered platform that generates production-ready applications from natural language descriptions without requiring traditional coding skills. The platform uses what they call 'vibe-coding' to translate user intent into functional software applications.
21
5
