Best of Big Data: January 2025

  1.
    Article
    DEV · 1y

    How Programming Will Look In the Future?

    Programming has largely stuck to the von Neumann paradigm since the 1940s, but that sequential model is a poor fit for modern multi-core hardware. Traditional concurrency tools such as Go's goroutines add complexity of their own. Data flow programming offers an alternative: it treats programs as networks of independent nodes that pass data, which avoids race conditions and allows natural parallelism. Nevalang is a new language built around this paradigm. It is still in development and looking for contributors.
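
    The node-and-message idea can be sketched in a few lines. This is a minimal illustration in Python, not Nevalang code; the names `node`, `run_pipeline`, and `_DONE` are hypothetical and exist only for this example. Each node owns no shared state and communicates only through queues.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream (hypothetical name)


def node(func, inbox, outbox):
    """Start a node that applies `func` to each message and forwards the result."""
    def run():
        while True:
            msg = inbox.get()
            if msg is _DONE:
                outbox.put(_DONE)  # propagate end-of-stream downstream
                return
            outbox.put(func(msg))

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def run_pipeline(values):
    """Wire two nodes in sequence: double each value, then add one."""
    a, b, c = queue.Queue(), queue.Queue(), queue.Queue()
    node(lambda x: x * 2, a, b)
    node(lambda x: x + 1, b, c)

    for v in values:
        a.put(v)
    a.put(_DONE)

    results = []
    while True:
        msg = c.get()
        if msg is _DONE:
            return results
        results.append(msg)
```

    Both nodes run concurrently, yet neither touches the other's data: the only coupling is the queue between them, which is why this style sidesteps data races on shared variables.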

  2.
    Article
    BigData Boutique blog · 1y

    Elasticsearch vs OpenSearch - 2025 update

    An in-depth 2025 update comparing Elasticsearch and OpenSearch, touching on project status, performance, licensing, vector search capabilities, cost efficiency, and ecosystem solutions. OpenSearch has gained traction with open-source governance and additional vector search engines, while Elasticsearch maintains proprietary features and extensive integration solutions.

  3.
    Article
    Data Engineer Things · 1y

    End to End Data Engineering

    This post details the tools, technologies, and concepts essential for data engineering, emphasizing different paths for success based on roles and backgrounds. It highlights the importance of both analytics and infrastructure sides and mentions popular tools like Airflow and Snowflake. The significance of software engineering principles and specific data engineering roles is also discussed.

  4.
    Article
    Data Engineer Things · 1y

    I spent 6 hours learning AWS Glue. Here is what I found

    AWS Glue is a serverless data integration service that simplifies and automates the ETL process, enabling users to integrate data from various sources, preprocess and transform it, and make it available for analytics. It seamlessly integrates with AWS services like S3, Redshift, and Athena and supports cost-effective and scalable data processing. Key components include Glue Studio, Glue ETL Library with DynamicFrames, and serverless execution with auto-scaling. The Glue Data Catalog acts as a central repository for metadata, facilitating efficient data discovery and management.

  5.
    Article
    Daily Dose of Data Science | Avi Chawla | Substack · 1y

    FireDucks vs. Pandas vs. DuckDB vs. Polars

    FireDucks is an optimized alternative to Pandas with the same API, requiring just an import replacement to use. It demonstrates a significant speed boost for big data operations, achieving an average speed-up of 125x over Pandas. FireDucks' lazy execution builds and optimizes a logical execution plan, unlike Pandas' immediate execution. It can be used with IPython, Jupyter Notebooks, or within existing Pandas pipelines by replacing import statements. Detailed benchmarks and usage examples are provided, showing substantial performance improvements in practical scenarios.
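
    The difference between lazy and immediate execution can be shown with a toy example. The sketch below is plain Python and does not reflect FireDucks' actual internals or API; `LazyTable` and its methods are hypothetical names. Operations are recorded into a plan, and the plan is simplified (adjacent filters are merged) before anything runs.

```python
class LazyTable:
    """Toy table that records operations instead of running them immediately."""

    def __init__(self, rows, plan=None):
        self._rows = rows
        self._plan = plan or []

    def filter(self, pred):
        # Record the step; no rows are touched yet.
        return LazyTable(self._rows, self._plan + [("filter", pred)])

    def select(self, *cols):
        return LazyTable(self._rows, self._plan + [("select", cols)])

    def _optimize(self):
        """Merge runs of adjacent filters into one predicate (one pass over rows)."""
        merged = []
        for op, arg in self._plan:
            if op == "filter" and merged and merged[-1][0] == "filter":
                prev = merged[-1][1]
                merged[-1] = ("filter", lambda r, p=prev, q=arg: p(r) and q(r))
            else:
                merged.append((op, arg))
        return merged

    def collect(self):
        """Run the optimized plan and return the resulting rows."""
        rows = self._rows
        for op, arg in self._optimize():
            if op == "filter":
                rows = [r for r in rows if arg(r)]
            else:  # "select"
                rows = [{c: r[c] for c in arg} for r in rows]
        return rows
```

    An eager library would execute each `filter` as soon as it is called; here nothing happens until `collect()`, which gives the planner a chance to rewrite the whole chain first.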

  6.
    Article
    Decube · 1y

    Introducing Decube's Public API

    Decube has released its Public API to streamline data governance workflows. The API facilitates bulk management of glossaries, manual lineages, and user groups, enhancing efficiency and scalability. It also ensures full accountability through secure audit logging. Upcoming features include data quality scores and monitor configuration, furthering Decube's mission to empower data teams.

  7.
    Article
    Data Engineer Things · 1y

    Why I Love Python as Data Engineer

    Python is favored by data engineers for its versatility, simplicity, and rich library ecosystem. It handles both small and large-scale data tasks, making data manipulation and automation easier. Despite limitations such as slower execution speed and higher memory consumption, its readable code and efficient debugging make it a preferred choice for many. Python integrates well with tools like Apache Spark and with data visualization libraries, adding to its appeal.

  8.
    Article
    Decube · 1y

    S3 Tables with Apache Iceberg: Manage Data at Scale

    Discover how integrating S3 Tables with Apache Iceberg can enhance your data management strategy, providing reliable and scalable systems. Learn about key components like the Iceberg catalog and table, and understand the benefits of using Apache Iceberg with Amazon S3, including improved data scalability, reliability, and cost-efficiency. Explore best practices for managing large-scale deployments, optimizing resources, and ensuring secure data governance.

  9.
    Article
    Flipkart Tech · 1y

    Real-Time Data Propagation with HBase: Exploring Change Data Capture and Its Challenges

    Change Data Capture (CDC) in HBase tracks data changes in real time and makes them available to other systems. HBase, a distributed non-relational data store, uses its Write Ahead Log (WAL) to implement CDC. At Flipkart, this process supports business use cases such as ad campaigns and e-commerce transactions. The post discusses the architecture, the two methods of data propagation (Mutation and Cell Based Change Propagation), the filters applied, and the challenges encountered in using these methods.
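
    The general idea of driving CDC from a write-ahead log can be sketched as follows. This is a toy model in Python under stated assumptions, not the HBase API and not Flipkart's implementation; `TinyStore` and `tail_changes` are hypothetical names. Every mutation is appended to a log before it is applied, and a downstream reader tails that log from a saved sequence number, optionally filtering by key.

```python
class TinyStore:
    """Toy key-value store that logs every mutation before applying it."""

    def __init__(self):
        self.wal = []   # append-only list of (seq, op, key, value)
        self.data = {}

    def put(self, key, value):
        self._log_and_apply("put", key, value)

    def delete(self, key):
        self._log_and_apply("delete", key, None)

    def _log_and_apply(self, op, key, value):
        self.wal.append((len(self.wal), op, key, value))  # log first
        if op == "put":
            self.data[key] = value
        else:
            self.data.pop(key, None)


def tail_changes(store, from_seq, key_filter=lambda key: True):
    """Return (changes, next_seq) for log entries at or after `from_seq`.

    `key_filter` drops entries the consumer does not care about, but the
    returned offset still advances past them so they are not re-read.
    """
    entries = store.wal[from_seq:]
    changes = [e for e in entries if key_filter(e[2])]
    return changes, from_seq + len(entries)
```

    A consumer would call `tail_changes` repeatedly, passing back the `next_seq` it received last time, so each change is delivered once and in log order.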

  10.
    Article
    Netguru · 1y

    Is Java Still Used? Current Trends and Market Demand in 2025

    Java remains widely used in 2025, favored by over 90% of Fortune 500 companies. Its adaptability to modern tech trends like cloud computing, big data, and IoT, along with strong community support and robust developer tools, ensures its relevance. Java is crucial for high-performance, scalable applications, particularly in enterprise settings. It excels in fields like finance, healthcare, and manufacturing, thanks to its security, scalability, and cross-platform capabilities.

  11.
    Article
    Towards Data Science · 1y

    4 Things I Learned Building a Data Platform using Medallion Architecture in the Last 4 Years

    Celebrating four years of working with a medallion architecture data platform, the author shares key lessons learned. These include the importance of flexibility in applying the architecture, the potential need for additional data layers, the significance of proper data cataloging, and the balance between flexibility and maintainability. These insights aim to help others working with similar data organization approaches.