Best of NLP — January 2025

1
Article
Machine Learning Mastery·1y
3 Easy Ways to Fine-Tune Language Models
The post discusses three methods to fine-tune language models: full fine-tuning, parameter-efficient fine-tuning (PEFT), and instruction tuning. Full fine-tuning updates all model parameters, offering state-of-the-art performance but requiring significant computational power. PEFT, including techniques like LoRA, updates only a small portion of parameters, making it resource-efficient. Instruction tuning uses diverse task instructions, enhancing the model's ability to generalize. Code examples and detailed steps are provided for each method.
121
1
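To make the resource argument for LoRA concrete, here is a minimal sketch of the parameter arithmetic. It assumes the standard LoRA formulation, where a frozen weight matrix of shape d_out x d_in is adapted by two small trainable matrices of rank r. The layer size and rank are hypothetical values chosen for illustration, not figures from the post.

```python
def full_param_count(d_out, d_in):
    """Parameters updated when a d_out x d_in weight matrix is fully fine-tuned."""
    return d_out * d_in


def lora_param_count(d_out, d_in, rank):
    """Parameters trained by LoRA: B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 768, 768, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.2%}")
```

The trainable count grows linearly with the rank, so a small rank keeps the update to a few percent of the original matrix.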
2
Video
freeCodeCamp·1y
DeepSeek-R1 Crash Course
Andrew Brown's crash course introduces DeepSeek and shows how to run its large language models (LLMs), such as DeepSeek-R1 and V3, on local hardware. He demonstrates downloading and setting up the models using tools like Ollama, LM Studio, and Hugging Face, stressing the importance of capable hardware such as an Intel Lunar Lake AI PC dev kit or a workstation with an RTX 4080 GPU. Troubleshooting tips and the potential for running models with distributed computing are also discussed.
69
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 9
Part 9 of the crash course on RAG systems provides a comprehensive guide to building powerful RAG systems with a focus on vision language models. It includes a detailed breakdown of ColPali, a state-of-the-art RAG system, showcasing its scalability, accuracy, and integration with binary quantization for low latency applications. The series is beginner-friendly and covers everything from fundamentals to advanced optimization and multimodal applications.
56
4
Article
Sebastian Raschka·1y
Implementing A Byte Pair Encoding (BPE) Tokenizer From Scratch
The post provides a comprehensive guide on implementing a Byte Pair Encoding (BPE) tokenizer from scratch for educational purposes. It explains the main idea behind BPE, how to build a vocabulary, and steps for encoding and decoding. Additionally, it includes Python code for the BPE tokenizer, showcasing training, encoding, and decoding processes, and offers insights on saving and loading the tokenizer. The post also demonstrates how to load the original GPT-2 BPE tokenizer from OpenAI.
31
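The train, encode, and decode steps named in the summary can be sketched as follows. This is a simplified, character-level illustration and not the post's code: GPT-2's tokenizer works on bytes and has extra pre-tokenization rules, and the function names here are hypothetical.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from single characters."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # No pair repeats, so further merges would not compress anything.
        merges.append(best_pair)
        tokens = merge_pair(tokens, best_pair)
    return merges


def encode(text, merges):
    """Split text into characters, then apply the learned merges in order."""
    tokens = list(text)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens


def decode(tokens):
    """Concatenate tokens back into the original string."""
    return "".join(tokens)


merges = train_bpe("low lower lowest", num_merges=3)
tokens = encode("lowest low", merges)
print(merges, tokens, decode(tokens))
```

Because decoding is plain concatenation, encoding followed by decoding always returns the input text unchanged.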
5
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 8
Part 8 of the crash course on building RAG systems focuses on improving rerankers with an in-depth architectural breakdown of the ColBERT model, which balances scalability and accuracy for reranking modules. The series covers foundational components, evaluation, optimization, multimodality, and graph-based RAG systems, designed to help beginners implement reliable RAG systems effectively.
31
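ColBERT is generally described as a late-interaction model: each query token embedding is compared with every document token embedding, the best match per query token is kept, and those maxima are summed. A minimal sketch of that scoring rule follows. The two-dimensional vectors are toy values invented for illustration, and a real system would use learned, normalized embeddings.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best dot product against any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings (hypothetical values, one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # Covers both query tokens.
doc_b = [[1.0, 0.0]]              # Covers only the first query token.

ranked = sorted(
    {"doc_a": doc_a, "doc_b": doc_b}.items(),
    key=lambda item: maxsim_score(query, item[1]),
    reverse=True,
)
print([name for name, _ in ranked])
```

Document embeddings can be computed ahead of time, which is where the scalability mentioned in the summary comes from; only the cheap max-and-sum step runs per query.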
6
Article
Machine Learning Mastery·1y
RAG Hallucination Detection Techniques
Large language models (LLMs) can provide factually incorrect answers, often termed hallucinations. Retrieval augmented generation (RAG) mitigates this by retrieving data from a knowledge base, but hallucinations can still occur. The post discusses techniques to detect these hallucinations using metrics from the DeepEval library, the G-Eval framework, and RAG-specific metrics like faithfulness. Practical examples include the installation and usage with code snippets that evaluate the outputs for accuracy, consistency, and relevance.
30
7
Video
Explosion·1y
Best Way to OCR a PDF in Python - spaCy Layout
spaCy Layout, a new package from Explosion AI, integrates seamlessly with the spaCy pipeline to enable OCR processing of PDFs in a single line of code. It offers features such as bounding box detection, region detection, table detection, and image processing. The package enhances spaCy’s native capabilities like part-of-speech tagging and named entity recognition, making it particularly useful for handling structured and unstructured data within PDFs. Users can convert tables to formats like Markdown or pandas DataFrames, facilitating easier downstream processing tasks.
29
8
Article
freeCodeCamp·1y
What is Semantic Matching? How to Find Words in a Document Using NLP
Searching for specific words or phrases in a document can be cumbersome if the exact term isn't present. Semantic matching in NLP uses contextual meaning, rather than exact form, to improve search results. By leveraging techniques like word embedding and cosine similarity, it's possible to find similar words or phrases within text documents. Tools like KeyBERT can streamline keyword extraction, further enhancing the search process.
24
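The idea of matching by meaning rather than exact form can be sketched with cosine similarity over word vectors. The three-dimensional embeddings below are invented for illustration; a real pipeline would load vectors from a trained embedding model, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy embeddings (hypothetical values standing in for a trained model).
TOY_EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.1, 0.2, 0.9],
}


def most_similar(query, candidates, embeddings=TOY_EMBEDDINGS):
    """Return the candidate whose vector is closest in direction to the query's."""
    return max(
        candidates,
        key=lambda word: cosine_similarity(embeddings[query], embeddings[word]),
    )


print(most_similar("car", ["automobile", "banana"]))
```

A search for "car" can then surface a document that only ever says "automobile", which exact string matching would miss.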
9
Article
Towards AI·1y
Fine-tuning Embeddings for RAG applications
Fine-tuning embeddings can significantly improve the accuracy and relevance of Retrieval-Augmented Generation (RAG) applications. This involves adapting the embedding model so that it aligns closely with the types of questions users might ask, optimizing for better performance in real-world scenarios. The approach is supported by experimental results showing improved retrieval accuracy. Code repositories and methods for fine-tuning, such as TripletMarginLoss and CosineEmbeddingLoss, are provided for further experimentation.
14
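As a rough illustration of what a triplet margin objective optimizes, the sketch below computes the loss for a single (anchor, positive, negative) triple in plain Python. It is a simplified stand-in and not the PyTorch `TripletMarginLoss` implementation, which also handles batching and adds a small epsilon to distances. The vectors and margin are hypothetical.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


# Hypothetical embeddings: a question, a relevant passage, an irrelevant passage.
question = [0.0, 0.0]
relevant = [0.0, 1.0]
irrelevant = [3.0, 4.0]

print(triplet_margin_loss(question, relevant, irrelevant))
print(triplet_margin_loss(question, irrelevant, relevant))
```

Minimizing this loss pulls question embeddings toward the passages that answer them and pushes them away from passages that do not, which is the alignment the summary describes.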
10
Article
Medium·1y
Meta Large Concept Models (LCM): End of LLMs?
Meta has introduced Large Concept Models (LCMs), a novel approach to language modeling that operates at the concept level rather than the token level used by traditional Large Language Models (LLMs). LCMs predict entire ideas or sentences, which is intended to make them more efficient and effective for tasks requiring higher-level reasoning, and they support multiple languages and modalities. This new approach could change how applications such as summarization, story generation, and multimodal reasoning are built.
13
11
Article
Nick Janetakis·1y
Combine grep and sed to Recursively Replace Text in a Pattern of Files
Learn how to use grep and sed to recursively replace text across multiple files. This guide explains how to filter files with grep or ripgrep, preview changes, and then perform the replacements using xargs and sed. Examples include handling different sed syntax on Linux and macOS, as well as using perl for cross-platform compatibility.
12
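The filter, preview, then replace workflow can also be sketched in Python for readers who want the same steps without shell differences between Linux and macOS. This is a stand-in for the grep, xargs, and sed commands the guide uses, not a reproduction of them; the function names are hypothetical, and the search is a literal string rather than a regular expression.

```python
from pathlib import Path


def find_matching_files(root, glob_pattern, search):
    """Roughly `grep -rl`: files under root matching the glob that contain `search`."""
    matches = []
    for path in sorted(Path(root).rglob(glob_pattern)):
        if path.is_file() and search in path.read_text(encoding="utf-8"):
            matches.append(path)
    return matches


def replace_in_files(paths, search, replacement, dry_run=True):
    """Roughly `sed -i`: replace literal text in place, or only count hits on a dry run."""
    counts = {}
    for path in paths:
        text = path.read_text(encoding="utf-8")
        counts[path] = text.count(search)
        if counts[path] and not dry_run:
            path.write_text(text.replace(search, replacement), encoding="utf-8")
    return counts
```

Calling `replace_in_files` with `dry_run=True` first gives the preview step: it reports how many occurrences each file holds without writing anything.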
12
Article
GoPenAI·1y
From Messy Text to Model-Ready Data: A Guide to NLP Preprocessing
NLP preprocessing transforms raw text into structured data ready for machine learning models. Key steps include text cleaning, tokenization, stopword removal, lemmatization, part-of-speech tagging, named entity recognition, and text vectorization. Effective preprocessing enhances model performance, making it crucial for tasks like sentiment analysis, chatbots, and language translation.
12
1
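The first few preprocessing steps listed above (cleaning, tokenization, and stopword removal) can be sketched with the standard library alone. The stopword set is a tiny hypothetical sample; lemmatization, tagging, entity recognition, and vectorization are left out because they normally rely on a dedicated NLP library.

```python
import re

# A deliberately small, hypothetical stopword list for illustration.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to"}


def clean_and_tokenize(text):
    """Lowercase the text and keep only runs of ASCII letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that carry little meaning on their own."""
    return [token for token in tokens if token not in stopwords]


def preprocess(text):
    """Cleaning, tokenization, and stopword removal in one pass."""
    return remove_stopwords(clean_and_tokenize(text))


print(preprocess("The cats are sitting on the mat!"))
```

The regular expression discards punctuation and non-ASCII characters, which is a simplification that suits English text only.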
