For organizations of all sizes, data management has shifted from an important competency to a critical differentiator that can determine market winners and has-beens. Fortune 1000 companies and government bodies are starting to benefit from the innovations of the web pioneers. These organizations are defining new initiatives and reevaluating existing strategies to examine how they can transform their businesses using Big Data. In the process, they are learning that Big Data is not a single technology, technique or initiative. Rather, it is a trend across many areas of business and technology.
Big Data refers to technologies and initiatives that involve data that is too diverse, fast-changing or massive for conventional technologies, skills and infra- structure to address efficiently. Said differently, the volume, velocity or variety of data is too great.
But today, new technologies make it possible to realize value from Big Data. For example, retailers can track user web clicks to identify behavioral trends that improve campaigns, pricing and stockage. Utilities can capture household energy usage levels to predict outages and to incent more efficient energy consumption. Governments and even Google can detect and track the emergence of disease outbreaks via social media signals. Oil and gas companies can take the output of sensors in their drilling equipment to make more efficient and safer drilling decisions.
'Big Data' describes data sets so large and complex they are impractical to manage with traditional software tools.
Specifically, Big Data relates to data creation, storage, retrieval and analysis that is remarkable in terms of volume, velocity, and variety:
Volume. A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook ingests 500 terabytes of new data every day; a Boeing 737 will generate 240 terabytes of flight data during a single flight across the US; the proliferation of smart phones, the data they create and consume; sensors embedded into everyday objects will soon result in billions of new, constantly-updated data feeds containing environmental, location, and other information, including video.
Velocity. Clickstreams and ad impressions capture user behavior at millions of events per second; high-frequency stock trading algorithms reflect market changes within microseconds; machine to machine processes exchange data between billions of devices; infrastructure and sensors generate massive log data in real-time; on-line gaming systems support millions of concurrent users, each producing multiple inputs per second.
Variety. Big Data data isn't just numbers, dates, and strings. Big Data is also geospatial data, 3D data, audio and video, and unstructured text, including log files and social media. Traditional database systems were designed to address smaller volumes of structured data, fewer updates or a predictable, consistent data structure. Traditional database systems are also designed to operate on a single server, making increased capacity expensive and finite. As applications have evolved to serve large volumes of users, and as application development practices have become agile, the traditional use of the relational database has become a liability for many companies rather than an enabling factor in their business. Big Data databases, solve these problems and provide companies with the means to create tremendous business value.
Big Data for the Enterprise
With Big Data databases, enterprises can save money, grow revenue, and achieve many other business objectives, in any vertical.
Build new applications: Big data might allow a company to collect billions of real-time data points on its products, resources, or customers – and then repackage that data instantaneously to optimize customer experience or resource utilization. For example, a major US city can cut crime and improve municipal services by collecting and analyzing geospatial data in real-time from over 30 different departments.
Improve the effectiveness and lower the cost of existing applications: Big data technologies can replace highly-customized, expensive legacy systems with a standard solution that runs on commodity hardware. And because many big data technologies are open source, they can be implemented far more cheaply than proprietary technologies. For example, by migrating its reference data management application a bank can dramatically reduced the license and hardware costs associated with the proprietary relational database it previously ran, while also bringing its application into better compliance with regulatory requirements.
Realize new sources of competitive advantage: Big data can help businesses act more nimbly, allowing them to adapt to changes faster than their competitors. .
Increase customer loyalty: Increasing the amount of data shared within the organization – and the speed with which it is updated – allows businesses and other organizations to more rapidly and accurately respond to customer demand.
Selecting a Big Data Technology: Operational vs. Analytical
The Big Data landscape is dominated by two classes of technology: systems that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored; and systems that provide analytical capabilities for retrospective, complex analysis that may touch most or all of the data. These classes of technology are complementary and frequently deployed together.
Operational and analytical workloads for Big Data present opposing requirements and systems have evolved to address their particular demands separately and in very different ways. Each has driven the creation of new technology architectures. Operational systems, such as the NoSQL databases, focus on servicing highly concurrent requests while exhibiting low latency for responses operating on highly selective access criteria. Analytical systems, on the other hand, tend to focus on high throughput; queries can be very complex and touch most if not all of the data in the system at any time. Both systems tend to operate over many servers operating in a cluster, managing tens or hundreds of terabytes of data across billions of records.
Operational Big Data
For operational Big Data workloads, NoSQL Big Data systems such as document databases have emerged to address a broad set of applications, and other architectures, such as key-value stores, column family stores, and graph databases are optimized for more specific applications. NoSQL technologies, which were developed to address the shortcomings of relational databases in the modern computing environment, are faster and scale much more quickly and inexpensivelythan relational databases.
Critically, NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational Big Data workloads much easier to manage, and cheaper and faster to implement.
In addition to user interactions with data, most operational systems need to provide some degree of real-time intelligence about the active data in the system. For example in a multi-user game or financial application, aggregates for user activities or instrument performance are displayed to users to inform their next actions. Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data
Analytical Big Data workloads, on the other hand, tend to be addressed by MPP database systems and MapReduce. These technologies are also a reaction to the limitations of traditional relational databases and their lack of ability to scale beyond the resources of a single server. Furthermore, MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL.
As applications gain traction and their users generate increasing volumes of data, there are a number of retrospective analytical workloads that provide real value to the business. Where these workloads involve algorithms that are more sophisticated than simple aggregation, MapReduce has emerged as the first choice for Big Data analytics. Some NoSQL systems provide native MapReduce functionality that allows for analytics to be performed on operational data in place. Alternately, data can be copied from NoSQL systems into analytical systems such as Hadoopfor MapReduce.
Overview of Operational vs. Analytical Systems
|Latency||1 ms - 100 ms||1 min - 100 min|
|Concurrency||1000 - 100,000||1 - 10|
|Access Pattern||Writes and Reads||Reads|
|End User||Customer||Data Scientist|
|Technology||NoSQL||MapReduce, MPP Database|
Cloud computing refers to a broad set of computing and software products that are sold as a service, managed by a 3rd-party provider and delivered over a network. Infrastructure-as-a-Service (IaaS) is a flavor of cloud computing in which on-demand processing, storage or network resources are provided to the customer. Sold on-demand with limited or no upfront investment for the end-user, consumption is readily scalable to accommodate spikes in usage. Customers pay only for the capacity that is actually used (like a utility), as opposed to self-hosting, where the user pays for system capacity it is are used or not.
As compared to self-hosting, IaaS is:
Inexpensive. To self-host an application, one has to pay for enough resources to handle peak load on an application, at all times. Amazon discovered that before launching its cloud offering it was using only about 10% of its server capacity the vast majority of the time.
Tailored. Small applications can be run for very little cost by taking advantage of spare capacity. Bandwidth, processing and storage capability can be added in relatively small increments.
Elastic. Computing resources can easily be added and released as needed, making it much easier to deal with unexpected traffic spikes.
Reliable. With the cloud, it’s easy and inexpensive to have servers in multiple geographic locations, allowing content to be served locally to users, and also allowing for better disaster recovery and business continuity.
Overall, cloud computing provides better agility and scalability, together with lower costs and faster time to market. However, it does require that applications be engineered to take advantage of this new infrastructure; applications built for the cloud need to be able to scale by adding more servers, for example, instead of adding capacity to existing servers.
On the storage layer, traditional relational databases were not designed to take advantage of horizontal scaling. A class of new database architectures, dubbed NoSQL databases, are designed to take advantage of the cloud computing environment. NoSQL databases are natively able to handle load by spreading data among many servers, making them a natural fit for the cloud computing environment. Part of the reason NoSQL databases can do this is that related data is always stored together, instead of in separate tables.
Combining Operational and Analytical Technologies; Using Hadoop
New technologies like NoSQL, MPP databases, and Hadoop have emerged to address Big Data challenges and to enable new types of products and services to be delivered by the business.
One of the most common ways companies are leveraging the capabilities of both systems is by integrating a NoSQL database such as MongoDB with Hadoop. The connection is easily made by existing APIs and allows analysts and data scientists to perform complex, retroactive queries for Big Data analysis and insights while maintaining the efficiency and ease-of-use of a NoSQL database.
NoSQL, MPP databases and Hadoop are complementary: NoSQL systems should be used to capture Big Data and provide operational intelligence to users, and MPP databases and Hadoop should be used to provide analytical insight for analysts and data scientists. Together, NoSQL, MPP databases and Hadoop enable businesses to capitalize on Big Data.
Considerations for Decision Makers
While many Big Data technologies are mature enough to be used for mission-critical, production use cases, it is still nascent in some regards. Accordingly, the way forward is not always clear. As organizations develop Big Data strategies, there are a number of dimensions to consider when selecting technology partners, including:
1. Online vs. Offline Big Data
2. Software Licensing Models
4. Developer Appeal
6. General Purpose vs. Niche Solutions
1. Online vs. Offline Big Data
Big Data can take both online and offline forms. Online Big Data refers to data that is created, ingested, trans- formed, managed and/or analyzed in real-time to support operational applications and their users. Big Data is born online. Latency for these applications must be very low and availability must be high in order to meet SLAs and user expectations for modern application performance. This includes a vast array of applications, from social networking news feeds, to analytics to real-time ad servers to complex CRM applications. Examples of online Big Data databases include MongoDB and other NoSQL databases.
Offline Big Data encompasses applications that ingest, transform, manage and/or analyze Big Data in a batch context. They typically do not create new data. For these applications, response time can be slow (up to hours or days), which is often acceptable for this type of use case. Since they usually produce a static (vs. operational) output, such as a report or dashboard, they can even go offline temporarily without impacting the overall goal or end product. Examples of offline Big Data applications include Hadoop-based workloads; modern data warehouses; extract, transform, load (ETL) applications; and business intelligence tools.
Organizations evaluating which Big Data technologies to adopt should consider how they intend to use their data. For those looking to build applications that support real-time, operational use cases, they will need an operational data store like MongoDB. For those that need a place to conduct long-running analysis offline, perhaps to inform decision-making processes, offline solutions like Hadoop can be an effective tool. Organizations pursuing both use cases can do so in tandem, and they will sometimes find integrations between online and offline Big Data technologies. For instance, MongoDB provides integration with Hadoop.
2. Software License Model
There are three general types of licenses for Big Data software technologies:
Proprietary. The software product is owned and controlled by a software company. The source code is not available to licensees. Customers typically license the product through a perpetual license that entitles them to indefinite use, with annual maintenance fees for support and software upgrades. Examples of this model include databases from Oracle, IBM and Terradata.
Open-Source. The software product and source code are freely available to use. Companies monetize the software product by selling subscriptions and adjacent products with value-added components, such as management tools and support services. Examples of this model include MongoDB (by MongoDB, Inc.) and Hadoop (by Cloudera and others).
Cloud Service. The service is hosted in a cloud- based environment outside of customers’ data centers and delivered over the public Internet. The predominant business model is metered (i.e., pay-per-use) or subscription-based. Examples of this model include Google App Engine and Amazon Elastic MapReduce.
For many Fortune 1000 companies, regulations and internal policies around data privacy limit their ability to leverage cloud-based solutions. As a result, most Big Data initiatives are driven with technologies deployed on-premise. Most of the Big Data pioneers are web companies that developed powerful software and hardware, which they open-sourced to the larger community. Accordingly, most of the software used for Big Data projects is open-source.
In these early days of Big Data, there is an opportunity to learn from others. Organizations should consider how many other initiatives are being pursued using the same technologies and with similar objectives. To understand a given technology’s adoption, organiza- tions should consider the following:
The number of users
The prevalence of local, community-organized events
The health and activity of online forums such as Google Groups and StackOverflow
The availability of conferences, how frequently they occur and whether they are well-attended
4. Developer Appeal
The market for Big Data talent is tight. The nation’s top engineers and data scientists often flock to companies like Google and Facebook, which are known havens for the brightest minds and places where one will be exposed to leading edge technology. If enterprises want to compete for this talent, they have to offer more than money.
By offering developers the opportunity to work on tough problems, and by using a technology that has strong developer interest, a vibrant community, and an auspicious long-term future, organizations can attract the brightest minds. They can also increase the pool of candidates by choosing technologies that are easy to learn and use — which are often the ones that appeal most to developers. Furthermore, technologies that have strong developer appeal tend to make for more productive teams who feel they are empowered by their tools rather than encumbered by poorly-designed, legacy technology. Productive developer teams reduce time to market for new initiatives and reduce development costs, as well.
Organizations should use Big Data products that enable them to be agile. They will benefit from technologies that get out of the way and allow teams to focus on what they can do with their data, rather than how to deploy new applications and infrastructure. This will make it easy to explore a variety of paths and hypotheses for extracting value from the data and to iterate quickly in response to changing business needs.
In this context, agility comprises three primary components:
Ease of Use. A technology that is easy for developers to learn and understand -- either because of the way it’s architected, the availability of tools and information, or both -- will enable teams to get Big Data projects started and to realize value quickly. Technologies with steep learning curves and fewer resources to support education will make for a longer road to project execution.
Technological Flexibility. The product should make it relatively easy to change requirements on the fly—such as how data is modeled, which data is used, where data is pulled from and how it gets processed as teams develop new findings and adapt to internal and external needs. Dynamic data models (also known as schemas) and scalability are capabilities to seek out.
Licensing Freedom. Open-source products are typically easier to adopt, as teams can get started quickly with free community versions of the software. They are also usually easier to scale from a licensing standpoint, as teams can buy more licenses as requirements increase. By contrast, in many cases proprietary software vendors require large, upfront license purchases, which make it harder for teams to get moving quickly and to scale in the future.
6. General Purpose vs. Niche Solutions
Organizations are constantly trying to standardize on fewer technologies to reduce complexity, to improve their competency in the selected tools and to make their vendor relationships more productive. Organizations should consider whether adopting a Big Data technology helps them address a single initiative or many initiatives. If the technology is general purpose, the expertise, infrastructure, skills, integrations and other investments of the initial project can be amortized across many projects. Organizations may find that a niche technology may be a better fit for a single project, but that a more general purpose tool is the better option for the organization as a whole.