If you do not have time to dig into every variation of Big Data technology, here is a quick (and far from complete) overview that summarizes the main on-premise and cloud solutions.
Main On-premise Big Data distributions
Hortonworks
Hortonworks was established in 2011 and is the only one of the three distributions that uses pure Apache Hadoop without any proprietary tools or components. It is also the only fully open source distribution of the three.
Cloudera
Cloudera was one of the first Hadoop distributions, established in 2008. Cloudera is based to a large extent on open source components, but not as much as Hortonworks. Cloudera is easier to install and use than Hortonworks. The most important difference from Hortonworks is the proprietary management stack.
MapR
| | Hortonworks | Cloudera | MapR |
Data access | SQL | Hive | Impala | MapR-DB, Hive, Impala, Drill, SparkSQL |
Data access | NoSQL | HBase, Accumulo, Phoenix | HBase | HBase |
Data access | Scripting | Pig | Pig | Pig |
Data access | Batch | MapReduce | Spark, Hive, MapReduce | Spark, Pig, MapReduce |
Data access | Search | Solr | Solr | Solr |
Data access | Graph/ML | GraphX, MLlib, Mahout | ||
Data access | RDBMS | Kudu | MySQL | |
Data access | File system access | Limited, not standard NFS | Limited, not standard NFS | HDFS, read/write NFS (POSIX) |
Data access | Authentication | Kerberos | Kerberos | Kerberos and native |
Data access | Streaming | Storm | Spark | Storm, Spark, MapR-Streams |
Ingestion | Ingestion | Sqoop, Flume, Kafka | Sqoop, Flume, Kafka | Sqoop, Flume |
Operations | Scheduling | Oozie | Oozie | |
Operations | Data lifecycle | Falcon, Atlas | Cloudera Navigator | |
Operations | Resource management | YARN | YARN | |
Operations | Coordination | ZooKeeper | ZooKeeper, Sahara, Myriad | |
Security | Security | Sentry, RecordService | Sentry, RecordService | |
Performance | Data ingestion | Batch | Batch | Batch and streaming (write) |
Performance | Metadata architecture | Centralized | Centralized | Distributed |
Redundancy | HA | Survives single fault | Survives single fault | Survives multiple faults (self-healing) |
Redundancy | MapReduce HA | Restart of jobs | Restart of jobs | Continuous without restart |
Redundancy | Upgrades | With planned downtime | Rolling upgrades | Rolling upgrades |
Redundancy | Replication | Data only | Data only | Data and metadata |
Redundancy | Snapshots | Consistent for closed files | Consistent for closed files | Consistent for all files and tables |
Redundancy | Disaster recovery | None | Scheduled file copy | Data mirroring |
Management | Tools | Ambari, Cloudbreak | Cloudera Manager | MapR Control System |
Management | Heat map, alarms | Supported | Supported | Supported |
Management | REST API | Supported | Supported | Supported |
Management | Data and job placement | None | None | Yes |
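A comparison like the one above is easier to query when a few rows are held as data. The following is only a minimal sketch: it copies three Redundancy rows from the table, and the names `REDUNDANCY` and `distributions_with` are hypothetical helpers introduced for illustration.

```python
from typing import Dict, List

# Three "Redundancy" rows copied from the comparison table above.
# Keys are features; values map each distribution to its table cell.
REDUNDANCY: Dict[str, Dict[str, str]] = {
    "Upgrades": {
        "Hortonworks": "With planned downtime",
        "Cloudera": "Rolling upgrades",
        "MapR": "Rolling upgrades",
    },
    "Replication": {
        "Hortonworks": "Data only",
        "Cloudera": "Data only",
        "MapR": "Data and metadata",
    },
    "Disaster recovery": {
        "Hortonworks": "None",
        "Cloudera": "Scheduled file copy",
        "MapR": "Data mirroring",
    },
}


def distributions_with(feature: str, value: str) -> List[str]:
    """Return the distributions whose cell for `feature` equals `value`."""
    row = REDUNDANCY.get(feature, {})
    return sorted(name for name, cell in row.items() if cell == value)


print(distributions_with("Upgrades", "Rolling upgrades"))  # ['Cloudera', 'MapR']
```

The same lookup works for any other row once it is added to the dictionary; an unknown feature simply returns an empty list.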
Other on-premise solutions
Oracle Cloudera
Oracle Cloudera is a joint solution from Oracle and Cloudera: Oracle based its Big Data platform on a Cloudera distribution. The distribution adds several useful tools and solutions that improve performance, in particular Oracle Big Data Appliance, Oracle Big Data Discovery, Oracle NoSQL Database and Oracle R Enterprise.
Oracle Big Data Appliance is an integrated hardware and software Big Data solution running on a platform based on Engineered Systems (like Exadata). Oracle adds the Big Data Discovery visualization tools on top of Cloudera/Hadoop, while Oracle R Enterprise includes R, an open source tool for advanced statistical analysis.
IBM BigInsights
Cloud solutions
Amazon EMR
Amazon EMR (Elastic MapReduce) is a Hadoop distribution put together by Amazon and running in the Amazon cloud. Amazon EMR is easier to get started with than on-premise Hadoop. Amazon is clearly the biggest cloud provider, but when it comes to Big Data its solution is relatively new compared to Google's.
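To give a feel for why a managed service is quicker to start with, here is a minimal sketch of a cluster request for the boto3 `run_job_flow` call. The helper `build_emr_request` is hypothetical, and the release label, instance type, instance count and role names are all assumptions chosen for illustration, not recommendations. The block only builds the request and does not contact AWS.

```python
def build_emr_request(name, release_label="emr-6.15.0",
                      instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call.

    All default values are illustrative assumptions.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # Components to install on the cluster.
        "Applications": [{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster running after its steps finish.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumed to be the default EMR roles in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_emr_request("overview-demo")

# With boto3 installed and AWS credentials configured, the request
# could be submitted like this (not executed here):
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
print(sorted(request))
```

Compared with an on-premise installation, the provisioning of nodes and the installation of Hadoop components are covered by this single request.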
Google Cloud Platform
Microsoft Azure
| | Amazon Web Services | Google Cloud Platform | Azure (HDInsight) |
Data access | File system storage | Hadoop | Cloud Storage | |
Data access | NoSQL | HBase | Cloud Bigtable | HBase |
Data access | SQL | Hive, Hue, Presto | BigQuery, Cloud SQL | Hive |
Data access | RDBMS | Phoenix | Cloud SQL | |
Data access | Batch | Pig, Spark | Cloud Dataflow | MapReduce, Pig, Spark |
Data access | Streaming | Spark | Google Cloud Pub/Sub | Storm, Spark |
Data access | Script | Pig | ||
Data access | Search | Solr | ||
Ingestion | Ingestion | Sqoop | Cloud Dataflow | |
Visualisation | Visualisation | Cloud Datalab | ||
Analytics | Machine Learning | Mahout | Google Cloud Machine Learning, Speech API, Natural Language API, Translate API, Vision API | R Server, Azure Machine Learning |
Operations | Logging | Logging, Error reporting, Trace | ||
Operations | Coordination | ZooKeeper | ||
Operations | Scheduling | Oozie | ||
Operations | Resource Management | HCatalog, Tez | Cloud Console, Cloud Resource Manager | |
Operations | Monitoring | Ganglia | Monitoring |
This work excluding photos is licensed under a Creative Commons Attribution 4.0 International License.