Best of Data Processing — November 2024

1
Article
Hacker News·2y
IronCalc
Spreadsheets have been vital for decades, yet finding a universally accessible and high-quality engine remains difficult. IronCalc aims to provide an open-source spreadsheet engine to assist SaaS developers, enable automated spreadsheet processing, support global collaboration, and allow bloggers to embed interactive spreadsheets. Beyond code, IronCalc focuses on advancing spreadsheet technology through research, community collaboration, and building a knowledge base for future developers.
54
2
Article
NVIDIA Developer·2y
Mastering LLM Techniques: Data Preprocessing
Large language models (LLMs) significantly enhance efficiency by automating tasks, but their performance depends heavily on high-quality data. Effective data preprocessing, including text cleaning, deduplication, and quality filtering, is crucial for model accuracy. Synthetic data generation and tools such as NVIDIA NeMo Curator can help with common challenges such as data scarcity, toxic content, and the efficient handling of very large datasets. NeMo Curator uses GPU-accelerated libraries to speed up data processing workflows.
29
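The three preprocessing steps named in the summary above (text cleaning, deduplication, quality filtering) can be sketched in plain Python. This is a minimal illustration using only the standard library, not NeMo Curator's API; the function names and the thresholds `min_words` and `max_symbol_ratio` are hypothetical choices made for the example.

```python
import hashlib
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def exact_dedup(docs):
    """Keep the first occurrence of each document, comparing case-insensitively."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def passes_quality(doc: str, min_words: int = 5, max_symbol_ratio: float = 0.3) -> bool:
    """Hypothetical heuristic: enough words and not dominated by symbols."""
    words = doc.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for ch in doc if not (ch.isalnum() or ch.isspace()))
    return symbols / len(doc) <= max_symbol_ratio


def preprocess(raw_docs):
    """Clean every document, drop exact duplicates, then apply the quality filter."""
    cleaned = [clean_text(d) for d in raw_docs]
    return [d for d in exact_dedup(cleaned) if passes_quality(d)]
```

Production pipelines typically add fuzzy deduplication and model-based quality or toxicity classifiers on top of exact matching and simple heuristics like these.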
3
Article
InfoSec Write-ups·1y
Python for Security Engineers
Python is an essential skill for cybersecurity professionals due to its simple syntax and versatile use cases. This guide covers basic programming skills, working with APIs, data processing, and creating custom scripts. Practical suggestions include standing up Flask apps and building CLI tools, which are crucial skills for automating processes and solving specific challenges in cybersecurity.
14
4
Article
Data Engineer Things·2y
I spent 4 hours learning Apache Spark Resource Allocation
An overview of Apache Spark's resource allocation mechanisms and scheduling modes. It covers static and dynamic resource allocation, highlighting how dynamic allocation uses heuristics for acquiring and removing executors. It also compares FIFO and fair scheduling, explaining how the latter gives jobs a roughly equal share of resources. Additionally, it discusses considerations for gracefully decommissioning executors and the use of an external shuffle service.
13
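As a rough illustration of the settings the Spark summary above refers to, the sketch below collects a few real Spark configuration properties for dynamic allocation, the external shuffle service, and fair scheduling, and renders them as `spark-submit` arguments. The values are illustrative assumptions rather than recommendations from the article, and `to_submit_args` is a hypothetical helper.

```python
# Illustrative values only; tune them for the actual cluster and workload.
DYNAMIC_ALLOCATION_CONF = {
    # Let Spark add and remove executors based on the pending workload.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Executors idle for longer than this become candidates for removal.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # An external shuffle service keeps shuffle files available after
    # the executor that wrote them has been removed.
    "spark.shuffle.service.enabled": "true",
    # Schedule jobs within one application with FAIR instead of the default FIFO.
    "spark.scheduler.mode": "FAIR",
}


def to_submit_args(conf):
    """Render a config mapping as a list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args
```

The resulting list can be appended to a `spark-submit` command line; the same key/value pairs could instead be placed in `spark-defaults.conf`.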
5
Article
DuckDB·1y
DuckDB Tricks – Part 3
This blog post covers advanced DuckDB features and performance optimization techniques, focusing on convenient table operations and faster processing of Parquet and CSV files. Practical examples use the Dutch railway services dataset to demonstrate column renaming with pattern matching, data loading with globbing, reordering Parquet files, and Hive partitioning to speed up queries significantly.
12
6
Article
Metadata·2y
DDIA: Chp 10. Batch Processing
Batch processing allows large-scale data transformations, and Google's MapReduce framework simplified parallel processing by abstracting network communication and failure handling. While Hadoop MapReduce leverages HDFS for distributed storage, newer dataflow engines like Spark and Flink address some limitations of MapReduce by offering more flexible operator connections and optimized computational resources.
10
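The MapReduce model mentioned in the batch processing summary above can be illustrated with a single-process word count. This is a toy sketch of the map, shuffle, and reduce phases only; it deliberately omits the distribution, network communication, and failure handling that the real framework abstracts away, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

For example, `word_count(["a b a", "b c"])` counts `a` and `b` twice each and `c` once. In a real cluster the mappers and reducers run on different machines and the shuffle moves data between them.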
