Sal
Peter Hoffmann Director Data Engineering at Blue Yonder. Python Developer, Conference Speaker, Mountaineer

PyData 2015 Berlin - Introduction to the PySpark DataFrame API

This Talk from PyData 2015 Berlin gives an overview of the PySpark Data Frame API.

This talk from PyData 2015 Berlin will give an overview of the PySpark DataFrame API. While Spark core itself is written in Scala and runs on the JVM, PySpark exposes the Spark programming model to Python. The Spark DataFrame API was introduced in Spark 1.3. DataFrames envolve Spark's Resiliant Distributed Datasets model and are inspired by Pandas and R data frames. The API provides simplified operators for filtering, aggregating, and projecting over large datasets. The DataFrame API supports diffferent data sources like JSON datasources, Parquet files, Hive tables and JDBC database connections.