Big Data solution – generic or specific, cloud or on-premise?

As Big Data becomes more popular and more options become available, selecting a Big Data technology for your business can become a real headache. The number of stacks and tools is huge, ranging from pure Hadoop and Hortonworks to more proprietary solutions from Microsoft, IBM or Google. As if this were not enough, you also need to choose between an on-premise installation and a cloud solution, and the number of proprietary solutions keeps growing rapidly. Here we sum up a few strategies for introducing Big Data in your business.

One of the first questions you will meet when looking into using Big Data for your business is whether you should build a generic platform or a solution for specific needs.

Photo: Vasin Lee/Shutterstock.com

Building for specific needs

In many businesses, if you follow internal processes and project frameworks, you will intuitively ask yourself what purpose or use case you want to support with Big Data technology. This approach may seem correct, but unfortunately there are a number of pitfalls here.

First of all, by building a platform only for specific needs and specific use cases, you will most likely choose a very limited product that only mimics some of the features of a full-blown implementation. Examples might be classical, old-fashioned analytical platforms such as a Data Warehouse, statistical tools or even a plain old relational database. This will be sufficient for implementing your use case, but as soon as you try to reuse it for another use case you will run into its limitations. In particular, you need to decide the structure of the stored data before you start collecting it, you need to transform the data to adapt it to each new use case, and you face scale-up issues every time the data volume increases and your Data Warehouse or relational database is unable to keep up with the volume and velocity of the data. In other words, you will largely limit your flexibility and your ability to explore your data.
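
The limitation of deciding the data structure up front can be illustrated with a small sketch. It contrasts a fixed relational table (often called schema-on-write) with storing raw records and applying structure only when reading (schema-on-read). The event fields, function names and the use of SQLite and JSON lines are assumptions made for illustration only; they stand in for a real warehouse and a real Big Data store.

```python
import json
import sqlite3

# Hypothetical raw events; the field names are invented for illustration.
RAW_EVENTS = [
    {"user": "anna", "action": "login", "device": "mobile"},
    {"user": "ben", "action": "purchase", "amount": 42.5},
]


def load_schema_on_write(events):
    """Store events in a relational table whose columns are fixed up front."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
    for event in events:
        # Fields outside the predefined schema ("device", "amount") are lost.
        conn.execute(
            "INSERT INTO events (user, action) VALUES (?, ?)",
            (event["user"], event["action"]),
        )
    return conn


def store_raw(events):
    """Keep every event unchanged, one JSON document per line."""
    return [json.dumps(event) for event in events]


def total_purchases(raw_lines):
    """Apply structure at read time, for a use case that was added later."""
    total = 0.0
    for line in raw_lines:
        event = json.loads(line)
        if event.get("action") == "purchase":
            total += event.get("amount", 0.0)
    return total


conn = load_schema_on_write(RAW_EVENTS)
columns = [row[1] for row in conn.execute("PRAGMA table_info(events)")]
raw_lines = store_raw(RAW_EVENTS)

print(columns)                     # only the columns chosen in advance
print(total_purchases(raw_lines))  # new question answered from the raw data
```

In the fixed table the purchase amount was never stored, so the later question cannot be answered without changing the schema and reloading the data, whereas the raw store still holds it.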

A solution implemented for specific needs is in practice not really a Big Data solution, although your vendor may insist on calling it Big Data; it is just a Small Data solution. It may still be a viable choice for your business as long as you do not have bigger ambitions or expectations for the future. By introducing more and more solutions like this, however, you will ultimately fragment and disperse your business data across multiple loosely connected systems. The more fragmentation there is, the more difficult it becomes to analyze data across your business.

Building a generic platform

Building a generic platform is much harder, but might be the right thing to do. It requires courage, though, to build a solution and start collecting data, often without an adequate use case to begin with. This is often difficult to advocate for; it is a leap of faith, a bet that your business needs to take. However, if you really want to unleash the power of Big Data, this is the strategy that can give you the flexibility to explore your data, conduct experiments and find new facts, information and ways to use them for your business. A platform of this kind, based on open Big Data technology like Hadoop, will also be easier to scale when needed to process increasing volumes and velocity of data.

The second very basic question you will meet is where to deploy your platform: cloud or on-premise? Although this question may seem unrelated to the first one, it is important to be aware of the implications of choosing the right deployment strategy.

On-premise platform

Choosing an on-premise platform seems like a natural choice for many established businesses with established, in-house IT operations. However, as soon as you choose to build a generic platform you will quickly realize that you need to experiment, since the number of different Big Data stacks, technologies and tools is enormous. You need to be able to change quickly from one solution to another without too much lead time and waste. It may be hard to change the platform once you have invested heavily in an expensive proprietary on-premise platform like Oracle Big Data Appliance or IBM BigInsights. An on-premise platform also requires people with a rather specific skill set to maintain it.

Cloud platform

Cloud-based Big Data platforms like Amazon EMR, Google Cloud Platform or Microsoft Azure provide the flexibility and agility that are crucial when you start experimenting with Big Data. If you want to focus on what matters most, you will concentrate on the core of your business. Setting up hardware, installing Hadoop and running the basic Big Data infrastructure is not what most businesses need to focus on or should prioritize.

The cloud platform is especially relevant in the first, exploration phase, when you are still unsure what to use the technology for. After the exploration phase, when your solution has stabilized, you may reconsider bringing the operation of your Big Data technologies in-house; in most cases, however, you will still want to keep the flexibility of the cloud.

Summary

All in all, the best strategy is a platform that is open and flexible enough to cover future cases; do not build your Big Data solution just for current needs. This is one of the cases where you actually need to concentrate more on technology and capabilities, and not only on the current, short-term business needs.

Creative Commons License

This work excluding photos is licensed under a Creative Commons Attribution 4.0 International License.

Big Data – quick overview

If you do not have time to dig into all the possible variations of Big Data technologies, here is a quick (yet far from complete) overview of them, summarizing on-premise and cloud solutions.

Photo: is am are/Shutterstock.com

Main On-premise Big Data distributions

Hortonworks

Hortonworks was established in 2011 and is the only distribution that uses pure Apache Hadoop without any proprietary tools and components. Hortonworks is also the only pure Open Source project of the three distributions.

Cloudera

Cloudera was one of the first Hadoop distributions, established in 2008. Cloudera is based to a large extent on Open Source components, but not as much as Hortonworks. Cloudera is easier to install and use than Hortonworks. The most important difference from Hortonworks is the proprietary management stack.

MapR

MapR replaces the HDFS file system with the proprietary MapRFS. MapRFS gives better robustness and redundancy and is considerably simpler to use. MapR is most likely the on-premise distribution that offers the best performance, redundancy and user friendliness. MapR also offers extensive documentation, courses and other materials.

Comparison of the most important Hadoop distributions (based on the “Hadoop Buyer's Guide”)
   
Hortonworks
Cloudera
MapR
Data access
SQL
Hive
Impala
MapR-DB
Hive
Impala
Drill
SparkSQL
Data access
NoSQL
HBase
Accumulo
Phoenix
HBase
HBase
Data access
Scripting
Pig
Pig
Pig
Data access
Batch
MapReduce
Spark
Hive
MapReduce
Spark
Pig
MapReduce
Data access
Search
Solr
Solr
Solr
Data access
Graph/ML
   
GraphX
MLlib
Mahout
Data access
RDBMS    Kudu MySQL
Data access
File system access Limited, not standard NFS Limited, not standard NFS HDFS, read/write NFS (Posix)
Data access Authentication Kerberos Kerberos Kerberos and native
Data access Streaming Storm Spark Storm
Spark
MapR-Streams
Ingestion
Ingestion
Sqoop
Flume
Kafka
Sqoop
Flume
Kafka
Sqoop
Flume
Operations
Scheduling
Oozie
 
Oozie
Operations
Data lifecycle
Falcon
Atlas
Cloudera Navigator
 
Operations
Resource management
 
YARN
YARN
Operations
Coordination
ZooKeeper
 
ZooKeeper
Sahara
Myriad
Security
Security
 
Sentry
RecordService
Sentry
Record Service
| Category | Feature | Hortonworks | Cloudera | MapR |
|---|---|---|---|---|
| Performance | Data ingestion | Batch | Batch | Batch and streaming (write) |
| Performance | Metadata architecture | Centralized | Centralized | Distributed |
| Redundancy | HA | Survives single fault | Survives single fault | Survives multiple faults (self-healing) |
| Redundancy | MapReduce HA | Restart of jobs | Restart of jobs | Continuous without restart |
| Redundancy | Upgrades | With planned downtime | Rolling upgrades | Rolling upgrades |
| Redundancy | Replication | Data only | Data only | Data and metadata |
| Redundancy | Snapshots | Consistent for closed files | Consistent for closed files | Consistent for all files and tables |
| Redundancy | Disaster recovery | None | Scheduled file copy | Data mirroring |
| Management | Tools | Ambari, Cloudbreak | Cloudera Manager | MapR Control System |
| Management | Heat map, alarms | Supported | Supported | Supported |
| Management | REST API | Supported | Supported | Supported |
| Management | Data and job placement | None | None | Yes |

Other on-premise solutions

Oracle Cloudera

Oracle Cloudera is a joint solution from Oracle and Cloudera. Oracle based its Big Data platform on the Cloudera distribution. This distribution offers some additional, useful tools and solutions that increase performance, in particular Oracle Big Data Appliance, Oracle Big Data Discovery, Oracle NoSQL Database and Oracle R Enterprise.

Oracle Big Data Appliance is an integrated hardware and software Big Data solution running on a platform based on Engineered Systems (like Exadata). Oracle adds the Big Data Discovery visualization tools on top of Cloudera/Hadoop, while Oracle R Enterprise includes R, an open source tool for advanced statistical analysis.

IBM BigInsights

IBM BigInsights for Apache Hadoop is a solution from IBM that also builds on top of Hadoop. In addition to Hadoop, BigInsights offers some proprietary analysis tools such as BigSQL, BigSheets and BigInsights Data Scientist, which includes BigR.
IBM BigInsights for Apache Hadoop also offers the BigInsights Enterprise Management solution and the IBM Spectrum Scale-FPO file system as an alternative to HDFS.

Cloud solutions

Amazon EMR

Amazon EMR (Elastic MapReduce) is a Hadoop distribution put together by Amazon and running in the Amazon cloud. Amazon EMR is easier to get started with than on-premise Hadoop. Amazon is clearly the biggest cloud provider, but when it comes to Big Data its solution is relatively new compared to Google's.

Google Cloud Platform

Google also offers Big Data cloud services. The most popular are BigQuery (an SQL-like database), Cloud Dataflow (a processing framework) and Cloud Dataproc (Spark and Hadoop services). Google has been working on Big Data technologies for a long time, which gives it a good starting point when it comes to advanced Big Data tools. GCP offers good analysis and visualization tools as well as an advanced platform for testing solutions (Cloud Datalab).

Microsoft Azure

Microsoft offers three different cloud solutions based on Azure: HDInsight, HDP for Windows and Microsoft Analytics Platform System.

Comparison of the most important Big Data cloud solutions
Amazon Web Services
Google Cloud Platform
Azure (HDInsight)
Data access
File system storage
Hadoop
Cloud Storage
 
Data access
NoSQL
HBase
Cloud Bigtable
HBase
Data access
SQL
Hive
Hue
Presto
BigQuery
Cloud SQL
Hive
Data access
RDBMS
Phoenix
Cloud SQL
 
Data access
Batch
Pig
Spark
Cloud Dataflow
Map Reduce
Pig
Spark
Data access
Streaming
Spark
Google Cloud Pub/Sub
Storm
Spark
Data access
Script      Pig
Data access
Search      Solr
Ingestion
Ingestion
Sqoop
Cloud Dataflow
 
Visualisation
Visualisation   Cloud Datalab
Analytics
Machine Learning Mahout Google Cloud Machine Learning
Speech API
Natural Language API
Translate API
Vision API
R Server
Azure Machine Learning
Operations
Logging
 
Logging
Error reporting
Trace
 
Operations
Coordination
ZooKeeper
   
Operations
Scheduling Oozie    
Operations
Resource Management HCatalog
Tez
Cloud Console
Cloud Resource Manager
Operations
Monitoring Ganglia Monitoring  

Norwegian domestic roaming rates in EU/EEA – first impressions

Preempting the EU Digital Single Market regulations, several Norwegian operators have introduced domestic rates for roaming in EU/EEA countries on selected subscriptions. Since the majority of Norwegian operators offer an AYCE (all-you-can-eat) subscription with a data usage cap, this means that customers simply do not incur any extra charges while roaming in EEA countries.


Photo: Pexels
More than four months after this offer was introduced, we are starting to see some interesting implications. In particular, it is easy to see a few pitfalls when roaming in European countries and areas outside the EU/EEA. Switzerland and the Vatican are probably the biggest surprises: they confuse many subscribers and thus cause them to incur high charges.
While Telia Norway includes Switzerland in its new offer, Telenor does not. The rates in countries outside the EEA are often very high, so Norwegian subscribers have to be on guard when transiting these countries or staying in border areas. The fear of high roaming costs is therefore still present to some extent. Telia's move actually seems very smart, because it is bound to reduce the number of customer complaints about incurred charges.
 
Data roaming throughputs in Spain
Data roaming throughputs in Poland

Our own tests, conducted in a few EEA countries (including Poland and Spain), also reveal another interesting dilemma. Operators often use a list of preferred networks that are always selected first. This is done to reduce costs for the home network operator. However, it does not mean that the customer actually gets the best quality of service (coverage and bit rate). Your phone may still roam onto a 2G service, or prioritize a network offering 3G over one offering 4G. Moreover, it is common for the roaming service to be limited to 2G/3G. We observed this in both countries where we conducted the tests. We used a Telenor subscription for the tests, so it is difficult to say whether the same applies to Telia. However, this kind of preferred network list may easily become an interesting way of reducing costs for Telenor and Telia. This will be the case if the preferred network also happens to be the slowest one, since this reduces the costs that roaming partners bill to the home network operator. Time will show whether this happens; if so, there might be a need for advanced roaming benchmarks that compare the operators and help subscribers choose the one that gives the best performance while roaming as well.
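
The effect of a preferred network list can be shown with a deliberately simplified sketch. It is not the actual network selection procedure used by phones or operators; the operator names, throughput figures and function names below are all invented for illustration. It only shows how picking the first preferred network can differ from picking the fastest one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    name: str        # hypothetical visited operator
    technology: str  # "2G", "3G" or "4G"
    mbps: float      # assumed measured downlink throughput


def select_by_speed(visible):
    """Return the visible network with the highest throughput."""
    return max(visible, key=lambda net: net.mbps)


def select_by_preference(visible, preferred_names):
    """Return the first visible network that appears on the preferred list.

    Falls back to the fastest visible network if none is on the list.
    """
    by_name = {net.name: net for net in visible}
    for name in preferred_names:
        if name in by_name:
            return by_name[name]
    return select_by_speed(visible)


visible = [
    Network("Operator A", "3G", 4.0),
    Network("Operator B", "4G", 35.0),
    Network("Operator C", "2G", 0.1),
]
preferred = ["Operator A", "Operator C"]  # assumed home-operator preference

chosen = select_by_preference(visible, preferred)
fastest = select_by_speed(visible)

print(chosen.name, chosen.technology)
print(fastest.name, fastest.technology)
```

With these made-up numbers the preference-based choice lands on the 3G network even though a much faster 4G network is visible, which is the kind of gap a roaming benchmark would need to measure.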
