Data Science with a REST API

Hopsworks Big Data services are easily consumed through a REST API

Hopsworks simplifies access to Spark/TensorFlow/HDFS/Kafka through a single REST API. Clients authenticate against the REST API with JSON Web Tokens (JWTs), and they can be very diverse. IoT devices or smartphones can produce massive volumes of data to Hopsworks’ REST API; that data is then queued in Kafka to be processed by backend Spark or Flink applications. Data scientists use the Hopsworks user interface for interactive data analytics in Jupyter and to run long-running jobs on Spark/TensorFlow/Hive. Reporting and visualization tools can use the REST API to run jobs and access data from the backend data platform. External business intelligence reporting tools (like Tableau and Qlik) can be connected directly to Hive.
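As a rough illustration of a client producing data over a JWT-authenticated REST call, the sketch below builds such a request in Python. The host name, URL path, topic name, and record fields are all hypothetical placeholders rather than actual Hopsworks endpoints, and the "Bearer" authorization scheme is an assumption (it is the common way to carry a JWT in an HTTP header).

```python
import json
import urllib.request

# Hypothetical base URL for illustration; not an actual Hopsworks endpoint.
BASE_URL = "https://hopsworks.example.com/api"


def build_produce_request(token, topic, record):
    """Build (but do not send) a POST request carrying one JSON record.

    token  -- a JWT obtained beforehand from the platform's login flow
    topic  -- hypothetical name of the Kafka topic the record is destined for
    record -- a JSON-serializable dict
    """
    url = f"{BASE_URL}/topics/{topic}/records"  # hypothetical path
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            # Assumption: the JWT is sent using the standard Bearer scheme.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


req = build_produce_request(
    "<jwt-token>", "sensor-readings", {"device": "d1", "temp": 21.5}
)
```

Against a real deployment, the request would be sent with urllib.request.urlopen(req), after replacing the placeholders with the documented endpoint and a valid token.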

Hopsworks Services

Hopsworks integrates some of the most popular services for Data Science and Data Engineering in a single platform, including Hops Hadoop, Spark, TensorFlow, Flink, Kafka, Hive, and Elasticsearch. Hopsworks also integrates the support services needed for a human-friendly platform: Jupyter and Zeppelin notebooks, Kibana/Logstash for real-time logging and visualizations, InfluxDB/Grafana for monitoring applications/services/hosts, and integration with identity providers (Kerberos/LDAP).

Identity/Authentication/Authorization

User identity can be provided by an LDAP server, a Kerberos KDC (Active Directory), or by Hopsworks itself. Internally, Hopsworks uses TLS certificates for security (authentication and network traffic encryption). TLS certificates are the key technology that enables multi-tenancy in Hopsworks (as opposed to Apache Hadoop, which does not support multi-tenancy and is built on Kerberos). All authorization policies (from HopsFS to YARN to Hive to Kafka) are stored centrally in MySQL Cluster.
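To make certificate-based authentication concrete, here is a minimal, generic sketch of a client-side TLS configuration in which the client verifies the server and can also present its own certificate (mutual TLS). It uses only Python's standard ssl module; the function name and file-path parameters are hypothetical, and this is not a description of how Hopsworks issues or distributes certificates internally.

```python
import ssl


def make_mtls_context(ca_file=None, cert_file=None, key_file=None):
    """Return an SSLContext for a client that authenticates with a certificate.

    ca_file   -- CA bundle used to verify the server (system defaults if None)
    cert_file -- the client's own certificate (hypothetical path)
    key_file  -- the private key matching cert_file (hypothetical path)
    """
    # Server verification and hostname checking are enabled by default here.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH, cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # Present the client certificate so the server can authenticate us.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx


ctx = make_mtls_context()
```

With a per-user or per-project certificate, the same context both encrypts traffic and identifies the caller, which is the property the paragraph above relies on for multi-tenancy.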

Versions

The table below shows the services and their versions currently used by Hopsworks (0.4):

Service            Version
Hops Hadoop        2.8.2.3
TensorFlow         1.4.0
Apache Spark       2.2.0
Apache Flink       1.1.3
Apache Kafka       1.0.0
Jupyter            Latest
Apache Zeppelin    Latest
Apache Hive        2.3.0
InfluxDB           1.2.1
Grafana            4.1.1
Telegraf           1.2.1
Kapacitor          1.2.0
Logstash           2.4.1
Kibana             4.6.4
Elasticsearch      2.4.1
Filebeat           5.2.0
Zookeeper          3.4.7
MySQL Cluster      7.5.9
Anaconda           5.0.1
Java               1.8