Everyone wants to claim that they work with big data, but what is big data? As with many qualitative concepts, the extremes are easy to identify. Large banks and social media platforms clearly qualify as big data operations, but what about more average companies? When do they need to start worrying about the impact of big data?
Fortunately, big data practitioners generally agree on a small set of criteria to be met that, all together, constitute “big” data. These are commonly known as The 3 Vs of Big Data. They are:
- Volume, or the total amount of data stored.
- Velocity, or how often new data is created and needs to be stored.
- Variety, or how heterogeneous your data structures and sources are.
In recent years, two more Vs have come to be widely recognized as well:
- Veracity, or the “truthiness” and integrity of your data.
- Value, or the significance of your data to your business goals and the impact it has on the bottom line.
We’ll delve further into each of these in a moment, but just know that when your data processes start encroaching on the first three of these areas, your company is entering its epoch of big data. The last two are important, but small data can and should also be high veracity and high value. Veracity and value do raise special concerns at big data scale, but more on that later.
Volume is probably the first thing people think of when they hear the phrase “big data.” To be blunt, if all your data fits on a hard drive you could buy online and throw into your gaming desktop, your data isn’t high volume. Perhaps 10 or 15 years ago, terabytes qualified as high volume data, but these days you’re not really in the big data world unless you’re closer to petabytes (1,000 TB) or exabytes (1 million TB). When you get to hundreds of TB, you’re well on your way – maybe at that point you can begin telling investors your data is “big” in the volume sense, and even show projections that you’ll be in big data territory in the coming fiscal quarters or year.
Cost is one of the two biggest issues with big data. A quick web search reveals that a decent 10 TB hard drive runs at least $300 apiece at the time of this writing. To store a PB of data, that’s 100 drives x $300 USD = $30,000 USD. Maybe you’ll get some kind of bulk discount, but even at 50% off you’re still at $15,000 USD in storage costs alone. Imagine if you wanted to actually do something with that data – like transform or copy it, or just keep a redundant version of it for disaster recovery. You’d need even more disk space!
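To make that arithmetic easy to replay with your own numbers, here is a minimal sketch. The function names are ours, and the drive size and price are just the rough figures quoted above, not real quotes from any vendor.

```python
import math

PB_IN_TB = 1_000  # 1 petabyte = 1,000 terabytes


def drives_needed(data_tb, drive_tb, copies=1):
    """Number of drives required to hold `copies` full copies of the data."""
    return math.ceil(data_tb * copies / drive_tb)


def storage_cost_usd(data_tb, drive_tb=10, price_per_drive=300.0,
                     copies=1, discount=0.0):
    """Raw disk cost only: no servers, networking, power, or staff."""
    count = drives_needed(data_tb, drive_tb, copies)
    return count * price_per_drive * (1 - discount)


print(storage_cost_usd(PB_IN_TB))                # 30000.0
print(storage_cost_usd(PB_IN_TB, discount=0.5))  # 15000.0
print(storage_cost_usd(PB_IN_TB, copies=2))      # 60000.0 with one redundant copy
```

Note how quickly a single redundant copy doubles the bill, and that is before you pay for anything that actually reads the disks.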
If you’re not transforming and processing your data, you’re not extracting any insights out of it, and this brings us to our next problem – as there are no PB sized, single hard disks or SSDs on the market yet, how do you even connect to all these disk drives? Now you must deal with networking issues and optimizations, which is a whole new dimension of the problem space.
Fear not, dear reader, for there are solutions to the problems of big data! The Cloud will be a recurring theme in each Solution section, but we’ll do our best to discuss solutions that work just as well on-premises, too (even though cloud computing is highly resilient and getting cheaper by the day).
Today, there are many options for data warehousing, and one of the most popular options is Amazon Redshift. If you’re interested in learning more about this topic, we encourage you to take a look at our guides on Redshift’s pricing, how Redshift compares to other SQL databases, and how Redshift compares to other options like Hadoop and traditional data warehouses.
Imagine a booming business handling credit card transactions as a service to a growing number of clients, or a machine learning service that is constantly learning from a stream of tick-level stock data, or a social media giant with literally billions of users making posts and comments and uploading videos and photos around the clock. They all deal with high velocity data. Every second there are hundreds, if not thousands (or even millions, for the rare giant company), of transactions occurring, which means data is constantly flowing from millions of laptops and mobile devices around the world to a “single” (well, hopefully well-distributed) data center. There’s no hard and fast threshold that practitioners agree on, such as a number of bytes per payload per millisecond, but if you’re beginning to encounter any of the following problems, you can safely classify your data as high velocity.
High velocity data sounds great on its face – velocity x time = volume, and volume leads to insights, and insights lead to money. However, this path to growing revenue is not without its costs. Firstly, as we just mentioned, high velocity does lead to high volume, and so you get all the problems of volume with velocity (eventually). More unique to velocity, however, are cybersecurity concerns. The game of white hat vs black hat hacking is a never-ending arms race, so you should always assume some (hopefully small) percentage of your incoming data is malicious. How do you inspect every packet of data that comes through your firewall, though? Do you spot check and hope for the best from there? Speaking of processing, how do you process such high frequency data on the fly? You might not consider all the data you could potentially collect from a payload to be vitally important to your business, so it could make sense to process payloads into a more refined and smaller state.
Moreover, a high velocity of data almost always means large swings in the amount of data processed every second. Twitter is much more active during the Super Bowl than at midnight on an average Tuesday. Ergo, building an infrastructure that can handle these fluctuations becomes critically important.
Fortunately, “streaming data” solutions have cropped up in the last decade to come to the rescue. Unfortunately, they can all be complicated, costly, and time-consuming.
The Apache Software Foundation hosts a few popular solutions, Spark and Kafka prominent among them. Spark is written in Scala, which runs on the JVM, so Spark can also be used from Java, and it has a well-established Python API (PySpark) as well. Spark is great for both batch processing and stream processing, and I can speak from experience that PySpark is about as easy to use as the pandas Python library (i.e., very easy). Kafka runs on a pub/sub (publish/subscribe) mechanism: “topics” are “published to” by producer processes and “subscribed to” by consumer processes, and the consumers are themselves grouped into “consumer groups.” Each message is delivered to every consumer group, but to only one consumer within a group. Put all consumers in a single group and you get an evenly balanced, standard queue system; give each consumer its own group and every consumer receives every message, which is a true pub/sub model. In practice, it often makes sense to take the middle ground and group various sets of processes into their own consumer groups. The Wikipedia page on Kafka’s architecture is pretty decent, though you can go right to Apache’s docs for more details.
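If the consumer group idea feels abstract, a toy simulation may help. To be clear, this is plain Python and not Kafka's client API: the `Topic` class and `deliver` function are made up for illustration, and real Kafka assigns partitions to consumers with more sophisticated strategies than the simple round-robin assumed here.

```python
class Topic:
    """Toy in-memory stand-in for a Kafka topic (not the real client API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._next = 0

    def publish(self, message):
        # Round-robin partitioning; real Kafka can also partition by key.
        self.partitions[self._next % len(self.partitions)].append(message)
        self._next += 1


def deliver(topic, groups):
    """Return {group: {consumer: [messages]}}.

    Every group sees every message. Inside a group, each partition is
    owned by exactly one consumer, so the members split the work.
    """
    result = {}
    for group, consumers in groups.items():
        inbox = {consumer: [] for consumer in consumers}
        for i, partition in enumerate(topic.partitions):
            owner = consumers[i % len(consumers)]
            inbox[owner].extend(partition)
        result[group] = inbox
    return result


topic = Topic(num_partitions=4)
for n in range(8):
    topic.publish(f"msg-{n}")

# One group with two workers: queue-like, the work is split between them.
queue_style = deliver(topic, {"workers": ["w1", "w2"]})

# One consumer per group: every consumer receives everything (pub/sub).
pubsub_style = deliver(topic, {"audit": ["a1"], "billing": ["b1"]})

print({c: len(m) for c, m in queue_style["workers"].items()})  # {'w1': 4, 'w2': 4}
print(len(pubsub_style["audit"]["a1"]), len(pubsub_style["billing"]["b1"]))  # 8 8
```

The middle ground mentioned above is simply a `groups` dictionary with several groups, each holding several consumers.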
Amazon also has some nice solutions to the high velocity problem. Amazon Kinesis is a set of related services designed to process streaming data. Amazon also probably has the most popular serverless functions offering, AWS Lambda. Google Cloud Functions (Firebase also has a version of this) is another popular one. Serverless functions obviously must run somewhere, presumably on some server, but the point is that you don’t have to spin one up and manage it yourself – you just request time on a magical server in “The Cloud” where you can run a script in your language of choice (well, almost – each provider supports only a specific list of runtimes, so check that yours is on it). It’s a great black-box solution for managing complex processing of payloads on the fly.
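As a rough sketch of what such a function can look like, here is a Lambda-style Python handler that trims incoming payloads down to a smaller, refined state. The `(event, context)` signature is the one AWS Lambda uses for Python handlers; everything else, including the shape of the event and the `KEEP_FIELDS` list, is an assumption made up for this example.

```python
import json

# Hypothetical list of the fields this business actually cares about.
KEEP_FIELDS = ("user_id", "amount", "currency")


def handler(event, context):
    """Lambda-style entry point: keep only the fields listed in KEEP_FIELDS."""
    refined = []
    for raw in event.get("records", []):
        payload = json.loads(raw)
        refined.append({k: payload[k] for k in KEEP_FIELDS if k in payload})
    return {"count": len(refined), "records": refined}


# Local invocation with a made-up event; no cloud account needed.
sample_event = {
    "records": [
        json.dumps({"user_id": 1, "amount": 9.5, "currency": "USD", "debug": "x" * 50}),
        json.dumps({"user_id": 2, "amount": 3.0, "currency": "EUR", "trace": [1, 2, 3]}),
    ]
}
result = handler(sample_event, None)
print(result["count"])       # 2
print(result["records"][0])  # {'user_id': 1, 'amount': 9.5, 'currency': 'USD'}
```

Because the handler is just a function, you can test it on your laptop exactly like this before deploying it anywhere.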
Now, if you don’t want to deal with the time and expense of creating your own data pipeline, that’s where something like FlyData could come in handy. FlyData seamlessly and securely replicates your Postgres, MySQL, or RDS data into Redshift in near real-time. You can learn more about how we can help you move your data into Redshift here.
The real world is messy so it makes sense that any application dealing with exciting challenges must also deal with messy data. This is really cool because it means work worth its salt gets to exploit a vast richness of data corresponding to real world truths. In practice, however, data heterogeneity is often a major source of stress in building up a data warehouse. Never mind storing videos, photos, and highly hierarchically interconnected comments attached to social graphs of users in the same warehouse – even basic numerical information can come in wildly different data structures. Companies with numerous data partners or dependencies must also deal with variety of data structure and content among the different APIs coming from those various sources. This leads to the challenge of data modeling, wherein one diagrams out the relationships and structure of the data they’re collecting throughout their data warehouse. Never underestimate the importance, need for rigor, and complexity that can be involved in modeling big data.
Perhaps the biggest issue when dealing with high variety data is fitting a square peg into a round hole; JSON, YAML, xSV (x = C(omma), P(ipe), T(ab), etc.), XML – there are many different structures a payload can take on when consuming data into your pipeline, and usually you have to massage it from whatever initial structure it’s taken into a relational database or a (perhaps schemaless, perhaps schema’d) key-value store. This becomes an especially acute pain point when consuming data with columns or keys that are not guaranteed to exist forever, such as from a social media API that sometimes renames, introduces, and/or deprecates columns or keys. So not only are you trying to squeeze many different data structures into a single, uniform state, but the structures you’re taking in can also be in flux over time.
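Here is a minimal sketch of that massaging step using only the Python standard library. The `ALIASES` table and the canonical field names (`user_id`, `amount`) are hypothetical; in a real pipeline they would come out of your data modeling work.

```python
import csv
import io
import json

# Hypothetical map from the key names various sources use to one canonical name.
ALIASES = {
    "userId": "user_id", "uid": "user_id", "user_id": "user_id",
    "amt": "amount", "amount": "amount",
}


def parse(payload, fmt):
    """Parse a JSON or delimiter-separated payload into a list of dicts."""
    if fmt == "json":
        data = json.loads(payload)
        return data if isinstance(data, list) else [data]
    delimiter = {"csv": ",", "psv": "|", "tsv": "\t"}[fmt]
    return list(csv.DictReader(io.StringIO(payload), delimiter=delimiter))


def normalize(record):
    """Rename known keys to canonical names; park unknown keys under 'extra'."""
    out = {"user_id": None, "amount": None, "extra": {}}
    for key, value in record.items():
        if key in ALIASES:
            out[ALIASES[key]] = value
        else:
            out["extra"][key] = value
    # Values from xSV sources arrive as strings, so coerce explicitly.
    if out["user_id"] is not None:
        out["user_id"] = int(out["user_id"])
    if out["amount"] is not None:
        out["amount"] = float(out["amount"])
    return out


rows = [normalize(r) for r in parse('{"userId": 1, "amt": 9.5, "note": "hi"}', "json")]
rows += [normalize(r) for r in parse("uid|amount\n2|3.0\n", "psv")]
print(rows[0])  # {'user_id': 1, 'amount': 9.5, 'extra': {'note': 'hi'}}
print(rows[1])  # {'user_id': 2, 'amount': 3.0, 'extra': {}}
```

Keeping unknown keys under `extra` rather than dropping them is a deliberate choice here: when a source introduces a new key, nothing is lost while you decide what to do with it.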
A tidy way to deal with variety of data is to record every transformation milestone applied to it along the route of your data pipeline. Robust pipelines start with either an API call or some server-side collection of client data. Take this raw data as-is and store it directly in your data lake. A data lake is a hyper-flexible repository of data collected and kept in its rawest form, often a file storage system like Amazon S3. You then want to transform your safely stored raw data into some aggregated and potentially pruned or refined state. Ideally, this refined data is stored as its own file in another location inside your data lake and then loaded into a database of the appropriate SQL or NoSQL structure. Depending on the variety of business teams and data topics your business deals with, you may have several databases that feed into a central “gate,” or entry point, for accessing all data from one application or platform. This is useful for your business and data analysts, who handle analytics and insight derivation for all teams; they should end up with a final dashboard of reports that stakeholders can view and readily digest. Note that this whole process can and should flow asynchronously between steps to minimize delays in data recording.
What we’ve just described here is ELT – Extract, Load, Transform (with a hidden, final Load step). This is in contrast to the smaller-data days of ETL, when transforming your data directly after extracting it from its source was considered best practice. These days, we have the luxury of massive file storage systems like Amazon S3 or Google Cloud Storage, so it is cheap insurance to store all the data you collect in its raw form first and transform it later.
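The raw-first flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: a temporary local directory stands in for an object store such as S3, and the `land_raw` and `refine` helpers, the `raw/` and `refined/` folder names, and the event fields are all invented for the example.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake, batch_id, payload):
    """Extract + Load: write the payload untouched into the lake's raw zone."""
    raw_path = lake / "raw" / f"{batch_id}.json"
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    raw_path.write_text(payload)
    return raw_path


def refine(lake, batch_id):
    """Transform: read the raw copy and write a pruned, aggregated version."""
    events = json.loads((lake / "raw" / f"{batch_id}.json").read_text())
    totals = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]
    refined_path = lake / "refined" / f"{batch_id}.json"
    refined_path.parent.mkdir(parents=True, exist_ok=True)
    refined_path.write_text(json.dumps(totals, sort_keys=True))
    return refined_path


lake = Path(tempfile.mkdtemp())  # stand-in for an object store such as S3
payload = json.dumps([
    {"user": "a", "amount": 5, "ua": "Mozilla/5.0"},
    {"user": "b", "amount": 2, "ua": "curl/8.0"},
    {"user": "a", "amount": 1, "ua": "Mozilla/5.0"},
])
land_raw(lake, "batch-001", payload)
refined = json.loads(refine(lake, "batch-001").read_text())
print(refined)  # {'a': 6, 'b': 2}
```

The refined file drops the `ua` field, but the raw file still has it, so a later change of heart about which fields matter costs you a re-run rather than lost data. The final load into a database is left out of the sketch.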
A solid data pipeline like the one we loosely described above is resilient against heterogeneity between data sources and non-guaranteed constancy within a single data source because whatever its schema or structure is, you can always look back to the raw data you collected and fix any breakages that occurred due to unexpected data source changes after the fact. Obviously you’d like to minimize or eliminate such unexpected breakages to begin with, and that can be handled with good monitoring setups, but even sans monitoring (please monitor your incoming data 24×7, though!) this kind of pipeline at least allows for recovery. Strict ETL does not.
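To illustrate the recovery story, suppose a source quietly renames a key partway through collection. The scenario, the key names, and the two transform versions below are all invented for the example; the point is only that a fixed transform can be replayed over the untouched raw records.

```python
import json

# Raw payloads exactly as collected. The source renamed "amount" to "total"
# partway through (a made-up example of an unannounced API change).
RAW_STORE = [
    '{"user": "a", "amount": 5}',
    '{"user": "b", "total": 2}',
    '{"user": "a", "total": 1}',
]


def transform_v1(record):
    """Original transform, written before the rename."""
    return {"user": record["user"], "amount": record["amount"]}


def transform_v2(record):
    """Fixed transform that accepts both the old and the new key."""
    amount = record["amount"] if "amount" in record else record["total"]
    return {"user": record["user"], "amount": amount}


def run(transform, raw_lines):
    """Apply a transform to every raw line, collecting failures instead of crashing."""
    loaded, failed = [], []
    for line in raw_lines:
        try:
            loaded.append(transform(json.loads(line)))
        except KeyError:
            failed.append(line)
    return loaded, failed


first_loaded, first_failed = run(transform_v1, RAW_STORE)  # breaks on the renamed key
replayed, still_failed = run(transform_v2, RAW_STORE)      # replay from the raw copies

print(len(first_loaded), len(first_failed))  # 1 2
print(len(replayed), len(still_failed))      # 3 0
```

Had the first transform run before anything was stored, the two failed records would simply be gone.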
The 4th and 5th Vs of Big Data: Veracity & Value
As mentioned earlier, I’ve drawn a line between Volume, Velocity, and Variety on one side and Veracity and Value on the other, because the first three are specifically large concerns for big data that don’t usually crop up for small data. Veracity and value are of course of paramount importance in big data, but the same is true of small data. The concerns do scale, though. With big data, more robust and potentially complex monitoring systems must be put in place to analyze how noisy your data collection and transformation processes are and how well they preserve the integrity of your data. This is much simpler and less resource intensive to do with small data. If your data is not sufficiently veracious, of what value is it, after all?
On a similar note, it doesn’t always make sense to collect all the data you can. This is expensive to do, but if you have the money to burn and a great R&D team dedicated to pushing the edge of what your company is capable of every day by doing intensive research on large, long-running datasets collected with no other intent but future research – then by all means, go ahead. However, most companies that are not Google or Netflix don’t have this luxury. It then becomes important for stakeholders and data developers to discuss and agree on collecting only high value data into the company’s pipelines. This reduces both infrastructure cost and the cost of developing and maintaining your pipelines.
The 3+ Vs of Big Data are excellent indicators of when your data is becoming “big” and you need to start tackling your data needs with newer, more scalable approaches. Key to everything we’ve discussed here is scalability. If your solutions don’t scale, you’re essentially doomed. In the age of cheap and ubiquitous sensors and mobile devices and constant streams of data coming from all sides, it’s usually safer to assume your data will become big unless you have very explicit reasons to believe otherwise (though it’s important to consider whether your business goals make sense if you sincerely believe you can avoid big data in today’s market). So take these problems and solutions to heart and build scalably, friends.
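As a small illustration of the kind of veracity monitoring meant here, the sketch below counts records that fail basic sanity checks and flags a batch when too many do. The required fields, the "no negative amounts" rule, and the alert threshold are all assumptions chosen for the example, not recommendations.

```python
def quality_report(records, required=("user_id", "amount")):
    """Count records that are missing required fields or carry impossible values."""
    total = len(records)
    bad = 0
    for record in records:
        missing = any(record.get(field) is None for field in required)
        amount = record.get("amount")
        negative = isinstance(amount, (int, float)) and amount < 0
        if missing or negative:
            bad += 1
    return {"total": total, "bad": bad, "bad_ratio": bad / total if total else 0.0}


sample = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": None, "amount": 3.0},  # missing id
    {"user_id": 3, "amount": -4.0},    # impossible amount
    {"user_id": 4, "amount": 1.0},
]
report = quality_report(sample)

ALERT_THRESHOLD = 0.25  # hypothetical tolerance; tune per pipeline
should_alert = report["bad_ratio"] > ALERT_THRESHOLD

print(report)        # {'total': 4, 'bad': 2, 'bad_ratio': 0.5}
print(should_alert)  # True
```

On small data you can run a check like this over everything; at big data scale you would more likely run it per batch or on a sample, which is part of why the monitoring itself becomes a system to design.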