Data-intensive Computing with Spark & Hadoop

Content: Data sets of increasing volume and complexity are often difficult to process with ‘standard’ HPC or DBMS technology. Large-scale data processing is particularly common in linguistics, data mining, machine learning, bioinformatics and the social sciences, but is by no means limited to those disciplines. Open-source frameworks such as Apache Spark and Hadoop were developed with this challenge in mind and can be of great benefit for data-intensive computing.

This workshop gives:

  • Background: learn about the underlying concepts of Apache Spark & Hadoop
  • Hands-on session: get experience with Spark in a Python notebook environment
  • Optional: discuss your own data problem
  • Duration: 6 hours
  • Date and time: see the 2018 schedule
  • Target group: researchers who need to analyze large amounts of data
  • Course leaders: Machiel Jansen / Haukur Pall Jonsson (SURFsara)
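
To give a flavour of the underlying concepts the workshop covers, the MapReduce model behind Hadoop (and, conceptually, behind Spark's distributed collection operations) can be sketched in plain Python. This is an illustrative sketch only; the function names and the sample input are invented for this example and are not part of the course material:

```python
from collections import defaultdict

def map_phase(lines):
    # Map step: emit a (word, 1) pair for every word in every line
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle step: group values by key, as the framework does
    # between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce step: aggregate (here: sum) the values per key
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big compute", "data processing"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts == {"big": 2, "data": 2, "compute": 1, "processing": 1}
```

In Spark itself the same word count would typically be written with `flatMap`, `map` and `reduceByKey` on a distributed collection, so the per-word work runs in parallel across a cluster rather than in a single Python process.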