You might also like: Data Science Scala

Spark Release 2.0.0 | Apache Spark 410 retweets

Spark Release 2.0.0 | Apache Spark

Apache Kafka Goes 1.0 409 retweets

The mission-critical deployments, the robust feature set, the long history all say that Kafka is an Enterprise-capable product. Apache Kafka is going 1.0!

Exactly-once Support in Apache Kafka 274 retweets

On Thursday we released a new version of Apache Kafka that dramatically strengthens the semantic guarantees it provides. This release came at the tail end of several years of thinking through how to…

Introducing MLflow: an Open Source Machine Learning Platform 228 retweets

View the MLflow Spark+AI Summit keynote Everyone who has tried to do machine learning development knows that it is complex. Beyond the usual concerns in software development, machine learning (ML) development comes with multiple new challenges. A...

Need ApacheSpark+Python locally? Just* pip install pyspark 192 retweets

I'm excited to announce PySpark is on PyPI - Need ApacheSpark+Python locally? Just* pip install pyspark

Exactly once Semantics is Possible: Here's How Apache Kafka Does it 184 retweets

Exactly once is a hard problem to solve, but we've done it. Available now in Apache Kafka 0.11, exactly once semantics.

Scala Center at EPFL 182 retweets

Thrilled to finally announce the Scala Center! Community-first & for-the-good-of-all :)

Spark/Scala MOOC Capstone Project Now Live on Coursera! 168 retweets

Excited to announce that our "Big Data Analysis with Scala and Spark" course is now LIVE on Coursera! +Capstone too!

Add a time based log index 167 retweets

Kafka now has time-based indexes to allow seeking to a particular point in time

Introducing the Confluent Operator: Apache Kafka on Kubernetes Made Si... 161 retweets

With the Confluent Operator, we are productizing years of Apache Kafka experience with Kubernetes expertise to offer our users the best way to deploy Kafka and Confluent Platform on Kubernetes.

Awesome Error Messages for Dotty 128 retweets

Sexy error messages have arrived in Dotty! (the new Scala compiler) FelixMulder explains

Apache Kafka: The Definitive Guide | Confluent 120 retweets

Learn how to take full advantage of Apache Kafka, understand how Kafka works and how it’s designed with this comprehensive book.

Scala Library Index Reaches Beta! 118 retweets

Hey! The Scala Index (Scaladex) is now ready to use! Check it out & update your libs!

2016 Spark Summit East Keynote: Matei Zaharia 112 retweets

Databricks CTO and Spark creator Matei Zaharia's keynote at Spark Summit East 2016: Planned major expansions to Apache Spark

Building a Robust and Scalable Streaming Pipeline Using Kafka and Akka... 108 retweets

We posted the videos for all the Kafka Summit talks

The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level... 102 retweets

The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project : The Apache Software Foundation Blog

Apache Kafka and Stream Processing O’Reilly Book Bundle 101 retweets

Some reading for the long weekend: - apachekafka: The Definitive Guide by gwenshap nehanarkhede bonkoif - Designing Event-Driven Systems by benstopford - I ❤ Logs by jaykreps - Making Sense of Stream Processing by martinkl

Daniel Tunkelang's answer to What should everyone know about machine l... 100 retweets

"You can have machine learning without sophisticated algorithms, but not without good data."

Lessons from Running Large Scale Spark Workloads 99 retweets

Spark "hall of fame": 8000+ nodes in a single cluster, 1PB+/day ingest, mapping the brain at scale, shuffling 1PB

prettydirect, Jon Pretty's blog 99 retweets

My closing remarks at Scala World discussed division, complexity and confidence in the Scala community. My new blog:

TensorFlow 1.0.0-alpha is out! API changes may be annoying, but it's g... 98 retweets

TensorFlow 1.0.0-alpha is out! API changes may be annoying, but it's great that the API is more NumPy-like now!

The Unreasonable Effectiveness of Deep Learning on Apache Spark 97 retweets

For the past three years, our smartest engineers at Databricks have been working on a stealth project. Today, we are unveiling DeepSpark, a major new milestone in Apache Spark.

Verizon Open Source Engineering 97 retweets

Oh wow, check out all of the open source Scala projects by verizon <3 so cool

Running Spark on a Cluster: The Basics 96 retweets

And... I started a blog! First up, something my students recently needed that I realized most other people need too. Full walkthroughs on how to run Spark on a cluster. - Starting the cluster & spark-shell - Sending real jobs to the cluster...

Machine learning & Kafka KSQL stream processing — bug me when I’ve lef... 94 retweets

Household power consumption fluctuates throughout the day. However, the pattern of electricity use follows a typical curve and becomes predictable after sufficient observations. Would it surprise you…

The Curious Case of the Broken Benchmark: Revisiting Apache Flink® vs 93 retweets

data Artisans took a closer look at a recent Apache Flink vs. Databricks Runtime benchmark carried out by Databricks and arrived at quite a different result

Exactly-once, once more – Jay Kreps – Medium 93 retweets

My last post on exactly-once in Apache Kafka provoked a couple of replies from smart people. Never one to let someone else get the last “Well actually…” I thought I’d give a quick rundown. Henry…

Introducing Scalafix: a code migration tool for Scala 91 retweets

Proud to announce the 1st release of Scalafix! Migrate your code from Scala 2.x to Dotty automagically! by olafurpg

MapReduce and Spark - Cloudera Blog 88 retweets

About a week ago, I posted an article on Cloudera’s strategy on SQL in the Apache Hadoop ecosystem. In the article, I argued that a special-purpose distributed query processing engine will perform better than one that translates work into a general-p...

Apache Flink: Introducing Docker Images for Apache Flink 85 retweets

Apache Flink: Introducing Docker Images for Apache Flink

Apache Kafka: Online Talk Series 83 retweets

Excited to announce we're kicking off a series of free online talks on Kafka, streaming data, & stream processing

Keeping CALM: when distributed consistency is easy 81 retweets

Keeping CALM: when distributed consistency is easy Hellerstein & Alvaro, arXiv 2019 The CALM conjecture (and later theorem) was first introduced to the world in a 2010 keynote talk at PODS. Behind its simple formulation there’s a deep lesson to be le...

A Closer Look with Kafka Streams, KSQL, and Analogies in Scala 79 retweets

I was being poked a lot about my lack of personal blogging so, after a four year hiatus, I can finally share a new article: "Of Streams and Tables in apachekafka and Stream Processing", ft. Kafka Streams & KSQL. Let me know if you find it useful...

How Events Are Reshaping Modern Systems 79 retweets

Event-driven architecture and design have been getting a lot of attention in recent years. It’s an old concept that has been around for decades, so why this sudden peak of interest?  In this talk, we will explore the nature of events, what it mean...

Databricks Delta: A Unified Data Management System for Real-time Big D... 78 retweets

Simplify large-scale data management with Databricks Delta, a unified data management system that combines the best of data warehouses, data lakes, and streaming.

Building a Real-Time Streaming ETL Pipeline in 20 Minutes 78 retweets

A new ETL paradigm is here. Build and implement real-time streaming ETL pipeline using Kafka Streams API, Kafka Connect API, Avro and Schema Registry.

Introducing Azure Databricks - The Databricks Blog 77 retweets

99% of organizations still struggle to get valuable analytics from their Big Data and achieve the full potential of AI. Today I’m excited to announce a new partnership with Microsoft that represents a major leap forward in achieving Databricks’ missi...

Mahout just announced they are putting aside MapReduce and focusing on... 76 retweets

Mahout just announced they are putting aside MapReduce and focusing on ApacheSpark and Scala. Pretty cool!

Apache Spark officially sets a new record in large-scale sorting 75 retweets

A month ago, we shared with you our entry to the 2014 Gray Sort competition, a 3rd-party benchmark measuring how fast a system can sort 100 TB of data (1 trillion records). Today, we are happy to announce that our entry has been reviewed by the bench...

ML Pipelines: A New High-Level API for MLlib 74 retweets

MLlib’s goal is to make practical machine learning (ML) scalable and easy. Besides new algorithms and performance improvements that we have seen in each release, a great deal of time and effort has been spent on making MLlib easy. Similar to Spark Co...

Databricks to run two massive online courses on Apache Spark 73 retweets

In the age of ‘Big Data,’ with datasets rapidly growing in size and complexity and cloud computing becoming more pervasive, data science techniques are fast becoming core components of large-scale data processing pipelines. Apache Spark offers analys...

Why We Open Sourced our Books - Underscore 71 retweets

Advanced Scala with Cats book by underscoreio's davegurnell and noelwelsh now open source:

Introducing Research for Practice - ACM Queue 71 retweets

Very excited to unveil Research for Practice in ACMQueue; think: papers_we_love style reading guides from experts:

Can Spark Streaming survive Chaos Monkey? 67 retweets

With Spark Streaming as our choice of stream processor, we set out to evaluate and share the resiliency story for Spark Streaming in the AWS cloud environment.

Announcing Confluent Cloud: Apache Kafka as a Service 66 retweets

Announcing Confluent Cloud: Apache Kafka as a Service, the simplest, fastest, most robust and cost effective way to run Apache Kafka in the public cloud.

Databricks Launches MOOC: Data Science on Apache Spark 66 retweets

For the past several months, we have been working in collaboration with professors from the University of California Berkeley and University of California Los Angeles to produce two freely available Massive Open Online Courses (MOOCs). We are proud t...

Though we’ve been silent for quite a while recently, it doesn’t mean w... 65 retweets

Though we’ve been silent for quite a while recently, it doesn’t mean we’ve not been busy, so today we’re presenting the new features for Scala plugin RC for IntelliJ IDEA 2016.3 that we’ve just cre…

Messaging as the Single Source of Truth 63 retweets

Learn how to blend Apache Kafka, event sourcing and a microservice architecture to create a single source of truth any service can dip into.

Apache Flink: Apache Flink in 2017: Year in Review 63 retweets

Apache Flink 2017 by the numbers: • 3 major version releases • 10 new committers • More than 2300 commits and 1800 issues resolved We recapped all this and more in our 2017 Year in Review: Thanks to everyone for a most excellent 12 months apachefl...

Introduction to Apache Flink | MapR 61 retweets

My book on ApacheFlink by O'Reilly together with Ellen_Friedman is now available to download from MapR!

Traditional enterprise messaging systems are being eclipsed by apachek 60 retweets

Traditional enterprise messaging systems are being eclipsed by apachekafka. Kafka is not just another queue, it's a streaming platform: it scales company-wide, allows durable long-term message persistence, and rich stream processing.

Benchmarks for Low Latency (Streaming) solutions including Apache Sto... 60 retweets

GitHub - yahoo/streaming-benchmarks: Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, ...

Scala Tutorial | Terms And Types 60 retweets

NEW: The 2 progfun MOOCs materials now exist also as interactive scalaexercises by julienrf! Thank you 47deg!

Python Stream Processing 60 retweets

1/ Faust is a Python library for stream processing with apachekafka from RobinhoodApp. I think it's really cool. It highlights one of the things I think we got right with Kafka Streams: supporting stream processing in Kafka at the protocol lev...

I'm happy to announce High Performance Spark is available in print fro... 60 retweets

I'm happy to announce High Performance Spark is available in print from OReillyMedia - :D

Artificial Intelligence & Apache Spark Conference 59 retweets

Sessions | Artificial Intelligence & Apache Spark Conference

The Scala Programming Language 58 retweets

Woot, the Scala team at EPFL was approved to be a Google Summer of Code mentor! Students, get paid work with Scala!

Running Kafka At Scale | LinkedIn Engineering 58 retweets

If data is the lifeblood of high technology, Apache Kafka is the circulatory system in use at LinkedIn. We use Kafka for moving every type of data around between systems, and it touches virtually every server, every day. The complexity of the infrast...

Announcing Scala with Cats - Underscore 58 retweets

Announcing Scala with Cats - Underscore

Databricks Scala Coding Style Guide 57 retweets

Databricks Scala Coding Style Guide. Contribute to databricks/scala-style-guide development by creating an account on GitHub.

Spark DataFrames: Simple and Fast Analytics on Structured Data at Spar... 57 retweets

Thanks to everyone who came out to my talk on Spark SQL DataFrames at SparkSummit today! Slides can be found here:

ETL Is Dead, Long Live Streams 57 retweets

Neha Narkhede talks about the experience at LinkedIn moving from batch-oriented ETL to real-time streams using Apache Kafka and how the design and implementation of Kafka was driven by this goal of acting as a real-time platform for event data. She c...

Spark wins Daytona Gray Sort 100TB Benchmark 56 retweets

Spark officially sets the 2014 sort benchmark record! 30x the per-node perf of last year's Hadoop record.

Announcing Apache Spark Packages 56 retweets

Today, we are happy to announce Apache Spark Packages, a community package index to track the growing number of open source packages and libraries that work with Apache Spark. Spark Packages makes it easy for users to find...

Python Kafka Client Benchmarking 56 retweets

Benchmark of Python clients for apachekafka. Summary: you can produce ~180k msgs/sec and consume 260k msg/sec.

We ❤ syslogs: Real-time syslog Processing with Apache Kafka and KSQL 55 retweets

In this post, we’re going to see how to use the Confluent Apache Kafka Python client to easily do some push-based alerting driven by the live streams of filtered syslog data that KSQL is populating.

Blogged about automatic flamegraph generation from java scala benchm 55 retweets

Blogged about automatic flamegraph generation from java scala benchmarks using jmh and new tooling shipped in sbt-jmh, though it could be used without the plugin as well!

Hacking on scalac — 0 to PR in an hour 54 retweets

0 to PR in an hour: a writeup of working with the Scala compiler's SBT build using my SI-2712 fix as an example:

KSQL: Streaming SQL for Apache Kafka 54 retweets

KSQL, recently announced by Confluent, is a streaming SQL engine on top of Kafka allowing the definition of streams and tables.