Best of Data Engineering 2024

  1. Article · Medium · 2y

    How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

    LinkedIn uses Apache Kafka to manage and process up to 7 trillion messages daily. They achieve reliability and scalability through a multi-tiered Kafka deployment across multiple data centers, leveraging local and aggregate clusters. LinkedIn ensures message completeness with an internal auditing tool that tracks sent and consumed messages. They maintain a close relationship with the open-source Kafka community by regularly integrating features and patches from their internal branches into the upstream Kafka branch.

  2. Article · Medium · 2y

    Roadmap to Learn Data Engineering: How I Would Start Again

    A roadmap for learning data engineering, covering Python, SQL, the command line, data warehousing, data modeling, data storage, data processing, data transformation, data orchestration, advanced topics, and how to stay up to date.

  3. Article · Tinybird · 2y

    How to choose the right type of database

    Understanding the different types of databases, factors to consider when choosing a database, and the implications of the CAP theorem on database selection.

  4. Article · KDnuggets · 2y

    5 Tips for Improving SQL Query Performance

    Strong SQL skills are crucial in data roles, where optimizing query performance can significantly impact application efficiency. Key tips include avoiding SELECT * by specifying columns, using GROUP BY instead of SELECT DISTINCT, limiting query results, and employing indexes with caution. Balancing these techniques can improve query performance and ensure efficient database operations.
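
    The tips above can be sketched with Python's built-in sqlite3 module; the orders table and its columns are made up for illustration:

    ```python
    import sqlite3

    # In-memory database with a small, hypothetical orders table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (customer, amount) VALUES (?, ?)",
        [("alice", 10.0), ("bob", 25.0), ("alice", 5.0), ("carol", 40.0)],
    )

    # Tip: select only the columns you need instead of SELECT *.
    rows = conn.execute("SELECT customer, amount FROM orders").fetchall()

    # Tip: prefer GROUP BY over SELECT DISTINCT when you aggregate anyway.
    totals = dict(conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    ).fetchall())

    # Tip: LIMIT bounds the result set for exploratory queries.
    top = conn.execute(
        "SELECT customer, amount FROM orders ORDER BY amount DESC LIMIT 1"
    ).fetchone()

    # Tip: index columns that are filtered or joined on frequently;
    # indexes speed reads but slow writes, hence "with caution".
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
    ```

    The same trade-offs apply in any relational engine; only the index internals differ.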

  5. Article · Community Picks · 2y

    The State of Data Engineering 2024

    The 2024 State of Data Engineering report discusses the influence of GenAI on software infrastructure, the expansion of product offerings due to the economic downturn, and the impact of open table formats and their catalogs in the data lake industry. It also highlights the importance of data version control and observability in AI/ML systems.

  6. Article · Dev Genius · 1y

    Building a Python-Based Data Lake

    Data lakes are vital for modern data ecosystems, allowing organizations to store and analyze large volumes of varied data without requiring a predefined schema. This guide details setting up a Python-based data lake using MinIO, PyIceberg, PyArrow, and Postgres, ideal for small to medium setups due to its simplicity. The step-by-step instructions cover installation of libraries, configuring SQL catalogs, data transformation using Pandas and PyArrow, and querying data. Advanced operations using DuckDB are also explored, showcasing robust data handling with flexibility and scalability.
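
    The guide's stack uses MinIO, PyIceberg, PyArrow, and Postgres; the core idea it builds on, files in a partitioned layout scanned per partition, can be sketched with the standard library alone (directory and field names here are invented for illustration):

    ```python
    import json
    import tempfile
    from pathlib import Path

    # Stdlib-only sketch of a data lake's core layout: Hive-style
    # key=value partition directories, scanned one partition at a time.
    # A real setup would store Parquet in MinIO via PyArrow/PyIceberg.
    lake = Path(tempfile.mkdtemp()) / "sales"

    def write_partition(day: str, records: list[dict]) -> None:
        part_dir = lake / f"date={day}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.jsonl", "w") as f:
            for rec in records:
                f.write(json.dumps(rec) + "\n")

    def scan(day: str) -> list[dict]:
        # Partition pruning: only files under the matching directory are read.
        rows = []
        for path in sorted((lake / f"date={day}").glob("*.jsonl")):
            with open(path) as f:
                rows.extend(json.loads(line) for line in f)
        return rows

    write_partition("2024-01-01", [{"sku": "a", "qty": 2}, {"sku": "b", "qty": 1}])
    write_partition("2024-01-02", [{"sku": "a", "qty": 5}])
    day_one = scan("2024-01-01")
    ```

    Swapping JSON Lines for Parquet and the local filesystem for object storage gives the setup the article describes, without changing the layout idea.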

  7. Article · KDnuggets · 2y

    10 Built-In Python Modules Every Data Engineer Should Know

    Python's standard library includes built-in modules that are essential for data engineering tasks. Key modules like os, pathlib, shutil, csv, json, pickle, sqlite3, datetime, re, and subprocess enable efficient file and directory management, data handling and serialization, database interaction, text processing, and more. Utilizing these modules can streamline your data engineering workflows, providing essential functionality without relying on external libraries.
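
    A small CSV-to-JSON conversion shows several of these modules working together (the file names and rows are made up):

    ```python
    import csv
    import json
    import tempfile
    from datetime import datetime
    from pathlib import Path

    workdir = Path(tempfile.mkdtemp())

    # csv: write delimited data with a header row.
    src = workdir / "events.csv"
    with src.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["event", "day"])
        writer.writerow(["signup", "2024-03-01"])
        writer.writerow(["login", "2024-03-02"])

    # pathlib + csv + datetime: read rows back and parse the date column.
    with src.open(newline="") as f:
        rows = [
            {"event": r["event"], "day": datetime.strptime(r["day"], "%Y-%m-%d").date()}
            for r in csv.DictReader(f)
        ]

    # json: serialize, converting dates back to strings (JSON has no date type).
    dst = workdir / "events.json"
    dst.write_text(json.dumps([{**r, "day": r["day"].isoformat()} for r in rows]))
    loaded = json.loads(dst.read_text())
    ```

    The other modules on the list fill similar gaps: shutil for copying and archiving, sqlite3 for an embedded database, re for text matching, and subprocess for shelling out.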

  8. Article · KDnuggets · 2y

    10 GitHub Repositories to Master Data Engineering

    Learn data engineering through free courses, tutorials, books, tools, guides, roadmaps, practice exercises, projects, and other resources.

  9. Article · Towards Dev · 2y

    Building a Serverless Data Pipeline: A Step-by-Step Guide

    The guide provides step-by-step instructions to build a serverless data pipeline using AWS services. Key components include AWS Lambda for data extraction from the Colombo Stock Market Index API, Amazon Kinesis Data Firehose for data ingestion, Amazon S3 for storage, and AWS Glue for ETL orchestration with Athena for querying data. The pipeline uses event-driven architectures with SQS notifications and Glue crawlers for efficient data processing.

  10. Article · SwirlAI · 1y

    What is AI Engineering?

    AI Engineering is a rapidly evolving role focused on developing and deploying AI systems that utilize Large Language Models (LLMs) to solve business problems. AI Engineers differ from Software Engineers and Machine Learning Engineers in that they deal extensively with non-deterministic systems and require skills in prompt engineering, infrastructure, and data integration. The field is witnessing the rise of Agentic systems, which are advanced AI systems capable of performing complex tasks with a degree of autonomy. AI Engineering is poised to become one of the most in-demand roles in the tech industry with high salaries and growing opportunities.

  11. Article · KDnuggets · 2y

    Project Ideas to Master Data Engineering

    To effectively learn data engineering, working on projects is essential. Key skills to focus on include data transformation, data visualization, building data pipelines, and implementing data storage solutions like data lakes and data warehouses. The post suggests six project ideas to cover these aspects: building an end-to-end data pipeline, transforming data sets, implementing a data lake, creating a data warehouse, processing real-time data, and visualizing data with dashboards.

  12. Article · Machine Learning News · 1y

    Meet MegaParse: An Open-Source AI Tool for Parsing Various Types of Documents for LLM Ingestion

    MegaParse is an open-source tool designed to efficiently parse various types of documents (PDF, Word, Excel, CSV, etc.) for ingestion into large language models (LLMs). It saves users significant time and effort by automating the conversion process while retaining information integrity. The tool is highly versatile, handling different document elements such as tables and images, and supports customizable output formats. Installation is straightforward via pip, with additional setups for dependencies like Poppler, Tesseract, and libmagic. MegaParse also provides advanced usage options and benchmarking capabilities, making it a reliable choice for developers and enterprises looking to streamline their AI data pipeline.

  13. Article · ByteByteGo · 2y

    Trillions of Indexes: How Uber’s LedgerStore Supports Such Massive Scale

    Uber's LedgerStore is a custom-built solution to manage trillions of financial transaction records efficiently. It ensures data immutability and supports various types of indexes including strongly consistent, eventually consistent, and time-range indexes. The migration from DynamoDB to LedgerStore for Uber's payment data was driven by the need for cost savings, simplified architecture, improved performance, and tailored features for financial data management. This transition involved handling 1.2 PB of compressed data with zero data inconsistencies detected over six months.

  14. Article · KDnuggets · 2y

    7 Python Libraries Every Data Engineer Should Know

    Discover some essential Python libraries for data engineers, including Requests for API data extraction, BeautifulSoup for web scraping, Pandas for data manipulation, SQLAlchemy for database work, Airflow for workflow orchestration, PySpark for big data processing, and Kafka-Python for real-time data processing.

  15. Article · Medium · 2y

    The Most Important Soft Skill in Tech

    Interviewing is the most important soft skill in tech. Building rapport, demonstrating value, and projecting professional growth are key objectives in an interview.

  16. Article · Community Picks · 2y

    Top 5 Data Engineering Projects for Beginners 2024

    Get hands-on experience with five beginner-friendly data engineering projects that enhance skills in data processing, analytics, and visualization using tools like Apache Kafka, Python, and cloud platforms. These projects prepare you for real-world challenges and make your resume stand out.

  17. Article · Machine Learning News · 2y

    OmniParse: An AI Platform that Ingests/Parses Any Unstructured Data into Structured, Actionable Data Optimized for GenAI (LLM) Applications

    OmniParse is an AI platform designed to convert various unstructured data types, including documents, images, audio, video, and web content, into structured, actionable data. It supports around 20 different file types and operates entirely locally, ensuring data privacy. OmniParse deploys easily using Docker and Skypilot and works with platforms like Colab. It uses advanced models such as Surya OCR and Whisper, achieving high accuracy and efficiency in data conversion, optimizing it for Generative AI applications.

  18. Article · KDnuggets · 2y

    5 Free Online Courses to Learn Data Engineering Fundamentals

    Explore five free online courses designed to teach the fundamentals of data engineering. These courses range from beginner-friendly introductions to comprehensive professional certificates. Key areas covered include data pipelines, databases, Python and Pandas, cloud computing, and big data tools like Hadoop and Apache Spark.

  19. Article · Machine Learning News · 2y

    Top Data Engineering Courses in 2024

    Data engineering is crucial for organizations relying on data-driven insights. This post lists top courses for mastering data engineering skills such as building scalable data solutions, ETL processes, and leveraging technologies like Apache Spark and cloud platforms. Courses include IBM’s Data Engineering Foundations, Meta Database Engineer Professional Certificate, and Google Cloud Database Engineer Specialization, among others.

  20. Article · Data Engineer Things · 1y

    ETL and ELT

    The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
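
    The ELT idea can be sketched with sqlite3 standing in for a cloud warehouse: raw data lands untouched in a staging table, and the transformation happens inside the warehouse afterward. Table and column names here are illustrative, not from the article:

    ```python
    import sqlite3

    wh = sqlite3.connect(":memory:")  # stand-in for a cloud warehouse

    # Extract + Load: raw records land in a staging table with no upfront cleaning.
    wh.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT)")
    wh.executemany(
        "INSERT INTO raw_orders VALUES (?, ?)",
        [("A1", "1050"), ("A2", "299"), ("A3", "1050")],
    )

    # Transform: modeling happens in-warehouse, after loading --
    # here a cleaned view casting text cents to dollar amounts.
    wh.execute("""
        CREATE VIEW orders AS
        SELECT order_id, CAST(amount_cents AS INTEGER) / 100.0 AS amount_usd
        FROM raw_orders
    """)

    total = wh.execute("SELECT SUM(amount_usd) FROM orders").fetchone()[0]
    ```

    Because the raw table is preserved, the transformation can be revised and re-run at any time, which is the agility the ELT pattern buys.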

  21. Article · Data Engineer Things · 1y

    PoC Data Platform project utilizing modern data stack (Airflow, Spark, DBT, Trino, Hive metastore, Lightdash, Delta Lake)

    The PoC Data Platform demonstrates extracting, loading, and transforming data using modern data technologies like Airflow, Spark, DBT, Trino, Hive Metastore, Lightdash, and Delta Lake. It utilizes AdventureWorks data within a data lake environment and offers insights into configuring these tools for data engineering and system design. The platform provides a comprehensive Docker setup with detailed instructions, making it a valuable resource for both beginners and professionals in data systems.

  22. Article · KDnuggets · 2y

    Landing a Data Engineer Role: Free Courses and Certifications

    Training for a data engineer role doesn't have to be expensive. A curated list of 10 free data engineering courses offers quality education at no cost. Courses cover key areas such as SQL, Python, cloud data engineering, ETL and data pipelines, data warehousing, and Apache Spark. Many courses are provided by edX, and some require prior knowledge of SQL and relational databases. The article argues that, with dedication and persistence, these free resources are enough to reach your data engineering goals.

  23. Article · Towards AI · 2y

    SQL Interview Problem — Solution.

    The post provides a step-by-step solution to an SQL interview problem: find the employee-manager pair with the second-highest average salary. It details how to read the expected output, identify the employee-manager pairing condition, use a self-join to fetch both rows, compute the average salaries, and rank the results to filter for the needed pair.
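
    The approach described (self-join, per-pair average, ranking) can be sketched with sqlite3; the schema and sample salaries are invented for illustration, and SQLite 3.25+ is assumed for the window function:

    ```python
    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL, manager_id INTEGER)")
    db.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "dana", 200.0, None),  # top-level manager, no pair of her own
            (2, "eli", 120.0, 1),
            (3, "fay", 100.0, 1),
            (4, "gus", 90.0, 2),
        ],
    )

    row = db.execute("""
        WITH pair_avg AS (
            -- self-join: each employee row to its manager's row
            SELECT e.name AS employee, m.name AS manager,
                   (e.salary + m.salary) / 2.0 AS avg_salary
            FROM employees e
            JOIN employees m ON e.manager_id = m.id
        ),
        ranked AS (
            SELECT *, DENSE_RANK() OVER (ORDER BY avg_salary DESC) AS rnk
            FROM pair_avg
        )
        SELECT employee, manager, avg_salary FROM ranked WHERE rnk = 2
    """).fetchone()
    ```

    With these numbers the pair averages are eli-dana 160, fay-dana 150, and gus-eli 105, so rank 2 is the fay-dana pair.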

  24. Article · Community Picks · 2y

    Bridging Backend and Data Engineering: Communicating Through Events

    In modern software development, seamless communication between backend services and data engineering pipelines is crucial. Traditional methods like REST APIs and batch processing often fall short for real-time demands. An event-driven architecture (EDA) offers a solution by using asynchronous event communication, enabling integration of diverse systems. A practical approach is setting up a Pub/Sub system where services broadcast and consume events via standardized formats. This method allows for selective event subscription and facilitates efficient asynchronous communication without overhauling the infrastructure.
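
    A minimal in-memory sketch of the Pub/Sub pattern described, with a dict of callbacks standing in for a real broker such as Kafka or Google Pub/Sub (class and topic names are invented):

    ```python
    from collections import defaultdict
    from typing import Callable

    class PubSub:
        """Toy event bus: topics map to subscriber callbacks."""

        def __init__(self) -> None:
            self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

        def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
            self._subscribers[topic].append(handler)

        def publish(self, topic: str, event: dict) -> None:
            # Only handlers subscribed to this topic receive the event.
            for handler in self._subscribers[topic]:
                handler(event)

    bus = PubSub()
    seen: list[dict] = []

    # The data pipeline subscribes selectively, only to events it cares about.
    bus.subscribe("order.created", seen.append)

    # Backend services broadcast events in a standardized envelope.
    bus.publish("order.created", {"type": "order.created", "order_id": "A1"})
    bus.publish("user.deleted", {"type": "user.deleted", "user_id": "U9"})
    ```

    A real broker adds what this sketch omits: durability, delivery across processes, and asynchronous consumption, but the subscribe/publish contract is the same.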

  25. Article · The New Stack · 1y

    5 Python Libraries Every Data Engineer Should Know

    Python is a powerful language for data engineering, enhanced by essential third-party libraries. For beginners, Beautiful Soup 4 and Requests are ideal for web scraping and sending HTTP requests. Intermediate users may benefit from Apache Airflow for workflow automation and Boto3 for integrating AWS services. Advanced users can leverage Pandas for comprehensive data manipulation and analysis.