Big Data – quick overview

If you do not have time to dig into every variation of Big Data technology, here is a quick (and far from complete) overview that summarizes the main on-premise and cloud solutions.

Photo: is am are/Shutterstock.com

Main On-premise Big Data distributions

Hortonworks

Hortonworks was established in 2011 and is the only distribution that uses pure Apache Hadoop without any proprietary tools or components. It is also the only purely open source distribution of the three.

Cloudera

Cloudera was one of the first Hadoop distributions, established in 2008. It is based to a large extent on open source components, but not as much as Hortonworks. Cloudera is easier to install and use than Hortonworks. The most important difference from Hortonworks is the proprietary management stack.

MapR

MapR replaces the HDFS file system with the proprietary MapRFS. MapRFS gives better robustness and redundancy and is considerably simpler to use. MapR is most likely the on-premise distribution that offers the best performance, redundancy and user friendliness. It also offers extensive documentation, courses and other materials.
 
Comparison of the most important Hadoop distributions (based on: “Hadoop buyers guide”)

| Category | Feature | Hortonworks | Cloudera | MapR |
|---|---|---|---|---|
| Data access | SQL | Hive | Impala | MapR-DB, Hive, Impala, Drill, SparkSQL |
| Data access | NoSQL | HBase, Accumulo, Phoenix | HBase | HBase |
| Data access | Scripting | Pig | Pig | Pig |
| Data access | Batch | MapReduce, Spark, Hive | MapReduce, Spark, Pig | MapReduce |
| Data access | Search | Solr | Solr | Solr |
| Data access | Graph/ML | | | GraphX, MLlib, Mahout |
| Data access | RDBMS | | Kudu | MySQL |
| Data access | File system access | Limited, not standard NFS | Limited, not standard NFS | HDFS, read/write NFS (POSIX) |
| Data access | Authentication | Kerberos | Kerberos | Kerberos and native |
| Data access | Streaming | Storm | Spark | Storm, Spark, MapR-Streams |
| Ingestion | Ingestion | Sqoop, Flume, Kafka | Sqoop, Flume, Kafka | Sqoop, Flume |
| Operations | Scheduling | Oozie | | Oozie |
| Operations | Data lifecycle | Falcon, Atlas | Cloudera Navigator | |
| Operations | Resource management | | YARN | YARN |
| Operations | Coordination | ZooKeeper | | ZooKeeper, Sahara, Myriad |
| Security | Security | | Sentry, RecordService | Sentry, RecordService |
| Performance | Data ingestion | Batch | Batch | Batch and streaming (write) |
| Performance | Metadata architecture | Centralized | Centralized | Distributed |
| Redundancy | HA | Survives single fault | Survives single fault | Survives multiple faults (self healing) |
| Redundancy | MapReduce HA | Restart of jobs | Restart of jobs | Continuous without restart |
| Redundancy | Upgrades | With planned downtime | Rolling upgrades | Rolling upgrades |
| Redundancy | Replication | Data only | Data only | Data and metadata |
| Redundancy | Snapshots | Consistent for closed files | Consistent for closed files | Consistent for all files and tables |
| Redundancy | Disaster recovery | None | Scheduled file copy | Data mirroring |
| Management | Tools | Ambari, Cloudbreak | Cloudera Manager | MapR Control System |
| Management | Heat map, alarms | Supported | Supported | Supported |
| Management | REST API | Supported | Supported | Supported |
| Management | Data and job placement | None | None | Yes |
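The "read/write NFS (POSIX)" entry in the file system access row means, in practice, that an application can use ordinary file operations against a mounted cluster file system instead of going through a Hadoop-specific client. The following is a minimal sketch of that idea, not MapR's API: the function name and paths are hypothetical, and a temporary directory stands in for the NFS mount point so the example runs anywhere.

```python
import os
import tempfile
from typing import List


def append_and_read(mount_point: str, relative_path: str, line: str) -> List[str]:
    """Append one line to a file under mount_point and return all its lines.

    Only standard POSIX-style file calls are used, which is what a
    read/write NFS mount is assumed to allow.
    """
    path = os.path.join(mount_point, relative_path)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Open in append mode: existing content is kept, the new line goes last.
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()


if __name__ == "__main__":
    # Assumption: in a real setup mount_point would be the NFS mount of the
    # cluster; here a temporary directory plays that role.
    with tempfile.TemporaryDirectory() as fake_mount:
        append_and_read(fake_mount, "logs/events.txt", "first event")
        print(append_and_read(fake_mount, "logs/events.txt", "second event"))
```

With the "limited, not standard NFS" access listed for the other two distributions, the same code should not be expected to work unchanged.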

Other on-premise solutions

Oracle Cloudera

Oracle Cloudera is a joint solution from Oracle and Cloudera. Oracle based its Big Data platform on a Cloudera distribution. This distribution offers some additional, useful tools and solutions that give increased performance, in particular Oracle Big Data Appliance, Oracle Big Data Discovery, Oracle NoSQL Database and Oracle R Enterprise.

Oracle Big Data Appliance is an integrated hardware and software Big Data solution running on a platform based on Engineered Systems (like Exadata). Oracle adds the Big Data Discovery visualization tools on top of Cloudera/Hadoop, while Oracle R Enterprise includes R, an open source tool for advanced statistical analysis.

IBM BigInsights

IBM BigInsights for Apache Hadoop is a solution from IBM that also builds on top of Hadoop. In addition to Hadoop, BigInsights offers some proprietary analysis tools such as BigSQL, BigSheets and BigInsights Data Scientist, which includes BigR. It also offers the BigInsights Enterprise Management solution and the IBM Spectrum Scale-FPO file system as an alternative to HDFS.

Cloud solutions

Amazon EMR

Amazon EMR (Elastic MapReduce) is a Hadoop distribution put together by Amazon and running in the Amazon cloud. Amazon EMR is easier to get started with than on-premise Hadoop. Amazon is by far the biggest cloud provider, but when it comes to Big Data its solution is relatively new compared to Google's.
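The name refers to MapReduce, the programming model that Hadoop implements. As a rough illustration of that model, here is a single-process word count in Python. It is only a sketch: the function names are illustrative and are not part of the Hadoop or EMR API, and a real job would run the map and reduce steps in parallel across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values of one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data tools"]))
```

Because each map call looks at one line only and each reduce call at one key only, the work can be split across many machines, which is what a Hadoop cluster does.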

Google Cloud Platform

Google also offers Big Data cloud services. The most popular are BigQuery (an SQL-like database), Cloud Dataflow (a processing framework) and Cloud Dataproc (Spark and Hadoop services). Google has been working on Big Data technologies for a long time, which gives it a good starting point when it comes to advanced Big Data tools. GCP offers good analysis and visualization tools as well as an advanced platform for testing solutions (Cloud Datalab).

Microsoft Azure

Microsoft offers three different cloud solutions based on Azure: HDInsight, HDP for Windows and Microsoft Analytics Platform System.
 
Comparison of the most important Big Data cloud solutions

| Category | Feature | Amazon Web Services | Google Cloud Platform | Azure (HDInsight) |
|---|---|---|---|---|
| Data access | File system storage | Hadoop | Cloud Storage | |
| Data access | NoSQL | HBase | Cloud Bigtable | HBase |
| Data access | SQL | Hive, Hue, Presto | BigQuery, Cloud SQL | Hive |
| Data access | RDBMS | Phoenix | Cloud SQL | |
| Data access | Batch | Pig, Spark | Cloud Dataflow | MapReduce, Pig, Spark |
| Data access | Streaming | Spark | Google Cloud Pub/Sub | Storm, Spark |
| Data access | Script | | | Pig |
| Data access | Search | | | Solr |
| Ingestion | Ingestion | Sqoop | Cloud Dataflow | |
| Visualisation | Visualisation | | Cloud Datalab | |
| Analytics | Machine Learning | Mahout | Google Cloud Machine Learning, Speech API, Natural Language API, Translate API, Vision API | R Server, Azure Machine Learning |
| Operations | Logging | | Logging, Error reporting, Trace | |
| Operations | Coordination | ZooKeeper | | |
| Operations | Scheduling | Oozie | | |
| Operations | Resource management | HCatalog, Tez | Cloud Console, Cloud Resource Manager | |
| Operations | Monitoring | Ganglia | Monitoring | |
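One way to use a comparison like this is as a lookup when moving a workload between providers. The sketch below encodes a few rows of the table (NoSQL, SQL, Streaming and Scheduling) as a plain Python dictionary. The variable and function names are hypothetical helpers made up for this example, and the short provider keys are an assumption chosen for brevity.

```python
from typing import Dict, List

# A subset of the comparison table: capability -> provider -> services.
# An empty list means the table has no entry for that provider.
CLOUD_SERVICES: Dict[str, Dict[str, List[str]]] = {
    "NoSQL": {
        "Amazon": ["HBase"],
        "Google": ["Cloud Bigtable"],
        "Azure": ["HBase"],
    },
    "SQL": {
        "Amazon": ["Hive", "Hue", "Presto"],
        "Google": ["BigQuery", "Cloud SQL"],
        "Azure": ["Hive"],
    },
    "Streaming": {
        "Amazon": ["Spark"],
        "Google": ["Google Cloud Pub/Sub"],
        "Azure": ["Storm", "Spark"],
    },
    "Scheduling": {
        "Amazon": ["Oozie"],
        "Google": [],
        "Azure": [],
    },
}


def offerings(capability: str, provider: str) -> List[str]:
    """Return the services a provider lists for a capability."""
    return list(CLOUD_SERVICES[capability][provider])


def providers_without(capability: str) -> List[str]:
    """Return the providers that have no entry for a capability."""
    return [name for name, services in CLOUD_SERVICES[capability].items()
            if not services]


if __name__ == "__main__":
    print(offerings("NoSQL", "Google"))
    print(providers_without("Scheduling"))
```

An empty entry only reflects what the table lists; it does not prove that the provider has no comparable service.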
Creative Commons License

This work excluding photos is licensed under a Creative Commons Attribution 4.0 International License.
