NimbleMind

Big Data – quick overview

If you do not have time to dig into every variation of Big Data technology, here is a quick (and far from complete) overview covering on-premise and cloud solutions.

Main On-premise Big Data distributions

Hortonworks

Hortonworks was established in 2011 and is the only distribution that uses pure Apache Hadoop without any proprietary tools or components. Hortonworks is also the only purely open-source distribution of the three.

Cloudera

Cloudera was one of the first Hadoop distributions, established in 2008. Cloudera is based to a large extent on open-source components, but not as much as Hortonworks. Cloudera is easier to install and use than Hortonworks. The most important difference from Hortonworks is the proprietary management stack.

MapR

MapR replaces the HDFS file system with the proprietary MapRFS. MapRFS gives better robustness and redundancy and is considerably simpler to use. MapR is most likely the on-premise distribution that offers the best performance, redundancy and user friendliness. MapR also offers extensive documentation, courses and other materials.

Comparison of most important Hadoop distributions (based on: “Hadoop buyers guide”)

		Hortonworks	Cloudera	MapR
Data access	SQL	Hive	Impala	MapR-DB Hive Impala Drill SparkSQL
Data access	NoSQL	HBase Accumulo Phoenix	HBase	HBase
Data access	Scripting	Pig	Pig	Pig
Data access	Batch	MapReduce	Spark Hive MapReduce	Spark Pig MapReduce
Data access	Search	Solr	Solr	Solr
Data access	Graph/ML			GraphX MLlib Mahout
Data access	RDBMS		Kudu	MySQL
Data access	File system access	Limited, not standard NFS	Limited, not standard NFS	HDFS, read/write NFS (Posix)
Data access	Authentication	Kerberos	Kerberos	Kerberos and native
Data access	Streaming	Storm	Spark	Storm Spark MapR-Streams
Ingestion	Ingestion	Sqoop Flume Kafka	Sqoop Flume Kafka	Sqoop Flume
Operations	Scheduling	Oozie		Oozie
Operations	Data lifecycle	Falcon Atlas	Cloudera Navigator
Operations	Resource management		YARN	YARN
Operations	Coordination	ZooKeeper		ZooKeeper Sahara Myriad
Security	Security		Sentry RecordService	Sentry RecordService
Performance	Data ingestion	Batch	Batch	Batch and streaming (write)
Performance	Metadata Architecture	Centralized	Centralized	Distributed
Redundancy	HA	Survives single fault	Survives single fault	Survives multiple faults (self healing)
Redundancy	MapReduce HA	Restart of jobs	Restart of jobs	Continuous without restart
Redundancy	Upgrades	With planned downtime	Rolling upgrades	Rolling upgrades
Redundancy	Replication	Data only	Data only	Data and metadata
Redundancy	Snapshots	Consistent for closed files	Consistent for closed files	Consistent for all files and tables
Redundancy	Disaster recovery	None	Scheduled file copy	Data mirroring
Management	Tools	Ambari Cloudbreak	Cloudera Manager	MapR Control System
Management	Heat map, alarms	Supported	Supported	Supported
Management	REST API	Supported	Supported	Supported
Management	Data and job placement	None	None	Yes

Cloud solutions

Amazon EMR

Amazon EMR (Elastic MapReduce) is a Hadoop distribution put together by Amazon and running in the Amazon cloud. Amazon EMR is easier to get started with than on-premise Hadoop. Amazon is by far the biggest cloud provider, but when it comes to Big Data its solution is relatively new compared to Google's.

Google Cloud Platform

Google also offers Big Data cloud services. The most popular are BigQuery (an SQL-like database), Cloud Dataflow (a processing framework) and Cloud Dataproc (Spark and Hadoop services). Google has been working on Big Data technologies for a long time, which gives it a good starting point when it comes to advanced Big Data tools. GCP offers good analysis and visualization tools as well as an advanced platform for testing solutions (Cloud Datalab).

Microsoft Azure

Microsoft offers three different cloud solutions based on Azure: HDInsight, HDP for Windows and Microsoft Analytics Platform System.

Comparison of most important Big Data cloud solutions

		Amazon Web Services	Google Cloud Platform	Azure (HDInsight)
Data access	File system storage	Hadoop	Cloud Storage
Data access	NoSQL	HBase	Cloud Bigtable	HBase
Data access	SQL	Hive Hue Presto	BigQuery Cloud SQL	Hive
Data access	RDBMS	Phoenix	Cloud SQL
Data access	Batch	Pig Spark	Cloud Dataflow	MapReduce Pig Spark
Data access	Streaming	Spark	Google Cloud Pub/Sub	Storm Spark
Data access	Script			Pig
Data access	Search			Solr
Ingestion	Ingestion	Sqoop	Cloud Dataflow
Visualisation	Visualisation		Cloud Datalab
Analytics	Machine Learning	Mahout	Google Cloud Machine Learning Speech API Natural Language API Translate API Vision API	R Server Azure Machine Learning
Operations	Logging		Logging Error reporting Trace
Operations	Coordination	ZooKeeper
Operations	Scheduling	Oozie
Operations	Resource Management	HCatalog Tez	Cloud Console Cloud Resource Manager
Operations	Monitoring	Ganglia	Monitoring

This work excluding photos is licensed under a Creative Commons Attribution 4.0 International License.

Norwegian domestic roaming rates in EU/EEA – first impressions

Ahead of the EU Digital Single Market regulations, several Norwegian operators have introduced domestic rates for roaming in EU/EEA countries on selected subscriptions. Since the majority of Norwegian operators offer an AYCE (all-you-can-eat) subscription with a data usage cap, customers on these subscriptions simply do not incur any extra charges while roaming in EEA countries.

More than four months after this offer was introduced, we are starting to see some interesting implications. In particular, it is easy to spot a few pitfalls when roaming in European countries and areas outside the EU/EEA. Switzerland and the Vatican are probably the biggest surprises for many subscribers; the confusion can cause them to incur high charges.

While Telia Norway includes Switzerland in its new offer, Telenor does not. The rates in countries outside the EEA are often very high, so Norwegian subscribers have to be on guard when transiting these countries or staying in border areas. The fear of high roaming costs is therefore still present to some extent. Telia's move actually seems very smart, because it will certainly reduce the number of customer complaints about incurred charges.

Our own tests, conducted in a few EEA countries (including Poland and Spain), also reveal another interesting dilemma. Operators often use a list of preferred networks, which are always selected first. This is done to reduce costs for the home network operator. However, it does not mean that the customer actually gets the best quality of service (coverage and bit rate). Your phone may still roam onto a 2G service, or prioritize a network offering 3G over one offering 4G. Moreover, it is common for the roaming service to be limited to 2G/3G. We observed this in both countries where we conducted the tests.

We used a Telenor subscription for the tests, so it is difficult to say whether the same applies to Telia. However, this kind of preferred network list may easily become an interesting way of reducing costs for Telenor and Telia. This will be the case if the preferred network also happens to be the slowest one, since that reduces the costs which roaming partners bill to the home network operator. Time will tell whether this happens. If it does, there might be a need for advanced roaming benchmarks that compare the operators and help subscribers choose the one that gives the best performance while roaming as well.
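The effect of a preferred network list can be illustrated with a small sketch. This is a deliberately simplified model, not the actual network selection procedure used by phones or operators: the network names, the `RAT_RANK` ordering and both selection functions are hypothetical and exist only to show how list order can win over radio technology.

```python
from dataclasses import dataclass

# Hypothetical ranking of radio technologies; a higher value means faster.
RAT_RANK = {"2G": 1, "3G": 2, "4G": 3}


@dataclass(frozen=True)
class Network:
    name: str  # hypothetical visited-network name
    rat: str   # best radio technology offered to roamers


def select_by_preference(available, preferred):
    """Return the first available network found on the preferred list.

    Falls back to the first available network if none is on the list.
    """
    by_name = {n.name: n for n in available}
    for name in preferred:
        if name in by_name:
            return by_name[name]
    return available[0] if available else None


def select_by_quality(available):
    """Return the available network with the fastest radio technology."""
    return max(available, key=lambda n: RAT_RANK[n.rat], default=None)


# Two visited networks are in range; the home operator prefers the 3G one.
available = [Network("VisitedA", "4G"), Network("VisitedB", "3G")]
preferred = ["VisitedB", "VisitedA"]

chosen = select_by_preference(available, preferred)
best = select_by_quality(available)
print(chosen.name, chosen.rat)  # VisitedB 3G
print(best.name, best.rat)      # VisitedA 4G
```

In this toy setup the subscriber ends up on the 3G network even though a 4G network is in range, which is the kind of gap a roaming benchmark would need to measure.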

