What is Big Data?

Nov 16, 2021 12:49:22 PM

Big data is the technology that underlies many of our most important advances. Enabled by machine learning and connectivity, large tranches of data can now be mined for insights rapidly and used to improve effectiveness, reduce waste or accelerate processes.

In this post, we’ll briefly look at what big data is, before we move on to what it’s used for, the debates around how it should be obtained, and how it already features in our day-to-day lives.

What is big data?

Big data means large collections of data, typically unstructured. Structured data might be when everyone fills in a form with five fields; most of the time, the data in the ‘phone number’ field is a phone number. Unstructured data might be piles of notes, or a company’s customer usage data from their website, or traffic helicopter camera footage. There’s useful information in there, for sure, but none of it’s labeled or arranged. It’s unstructured.
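To make the distinction concrete, here is a minimal Python sketch. The record, the note, and the phone format are all invented for illustration; real unstructured data is far messier than a single regular expression can handle.

```python
import re

# Structured: every record has the same named fields, so lookup is trivial.
form_record = {"name": "Matt", "phone": "555-0142"}

# Unstructured: the same fact is buried somewhere in free text.
note = "Spoke to Matt on Tuesday, call him back on 555-0142 about the refund."

# Hypothetical pattern: only matches this example's NNN-NNNN format.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{4}\b")


def phone_from_note(text):
    """Return the first phone-like string found in free text, or None."""
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else None


print(form_record["phone"])    # direct field access
print(phone_from_note(note))   # has to be dug out of the text
```

The structured lookup is a single field access; the unstructured one needs a rule for finding the information at all, and that rule has to be written, tested and maintained.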

How is big data used to innovate digital experiences?

In games like Call of Duty, games companies collect data on where users typically pause, stop or quit, and restart the game, as well as numerous other data points. The insights derived from these can help games designers restructure storylines, redesign elements, and remake the user experience based on data from the user.

Lacking this kind of data means you’re designing blind, creating something without ever seeing how it’s used. Collecting user data at scale has the same effect as obtaining any other accurate observation at scale: statistical anomalies get ironed out. If Matt habitually rage quits when he runs out of ammo and he’s 33% of your sample, you might wind up redesigning the game around his preferences. If he’s 0.000001% of your sample — and he is; Call of Duty has a hundred million active players — then his preferences look like the anomaly they are. He disappears into the numbers along with a lot of other statistically insignificant effects. If you want reliable statistics you want large quantities of data — big data, in fact.
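The dilution effect is easy to sketch in a few lines of Python. This is an illustration only: the 5% typical quit rate is an assumed figure, not real Call of Duty data, and `quit_rate` is a hypothetical helper.

```python
def quit_rate(sample_size, typical_rate=0.05):
    """Observed rage-quit rate when one player (Matt) always quits
    and everyone else quits at typical_rate (an assumed figure)."""
    others = sample_size - 1
    quits = 1 + others * typical_rate  # Matt contributes exactly one quit
    return quits / sample_size


for n in (3, 1_000, 100_000_000):
    print(f"{n:>11,} players: observed quit rate {quit_rate(n):.4f}")
```

With three players, Matt drags the observed rate far above the typical one; with a hundred million, his contribution is too small to see.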

If you want reliable insight into what people actually do, you don’t really want to collect it from people who have volunteered to give it to you, because it’s unreliable. Most people want to represent themselves well and have a positive opinion of themselves, one reason why the average road user thinks they’re an above-average driver. They also don’t remember their own motivations, reasoning processes or even actions particularly clearly. And they massage the truth, put their best foot forward, and plain old lie, too.

For these reasons, the very best user data is collected from actual users who may know they’re being watched, but aren’t there to be watched. Test users are on their best behavior; CoD players are trying to play the game.

Obviously, what’s true for games and gamers is true for users of other products. User data from websites is equally baffling to designers who think they’ve laid out a clear path to value. App designers are in the same boat. In the immortal words of Benjamin Brewster (probably), ‘in theory there’s no difference between theory and practice, but in practice, there is.’

Big data allows us to bridge that gap. As a result, it lets us design everything from checkout flows to art gallery walkways to dashboards for heating controls in large buildings based, not on ‘best practices,’ theory or guesswork, but on a detailed picture of what people actually want. These digital experiences are innovative in the best way, with creativity allied to insight derived from big data.

Industries that benefit from big data

Big data is increasingly powering advances in any sector that can benefit from accurate visibility, acceleration and efficiency. That list includes:


Healthcare

Big data powers improvements in treatment and diagnosis accuracy. When doctors need to treat illnesses that are known to ‘mimic’ (share symptoms with other conditions, have no typical progression, and otherwise appear to be something else), easy access to relevant patient records is a must. Yet these records are often fractured, diffused across multiple institutions that don’t talk to each other, and unstructured. Recorded in different formats by different institutions, with notes in margins and other paraphernalia, patient notes might contain the clue to the patient’s real condition, if only doctors could access them in a timely manner and drill down to the relevant information. Big data makes that possible. In the US, Electronic Health Records (EHRs) are used in 94% of hospitals; alongside improving patient outcomes, they had already saved Kaiser Permanente over $1 billion worth of unneeded or duplicate tests and office visits by 2013. That tallies with McKinsey’s estimate that big data could save American healthcare $300 billion to $450 billion a year.

Big data also makes possible the collection and analysis of usage data across organizations and institutions, enabling them to see what they need to order more or less of, without awaiting out-of-date departmental reporting.


Manufacturing

Manufacturing uses big data to manage production processes, logistics, and planning. With enough data, manufacturers can more reliably separate correlation from causation: they can judge whether what they’re seeing is a pattern of cause and effect, or just a coincidence. That’s an absolutely crucial element of continuous improvement, since without it, efforts to make processes more effective and efficient can all too easily waste time or even make things worse. It also makes it possible to isolate outliers and anomalies, just as with the gamers and their preferences discussed above. One-offs that might otherwise exert a disproportionate effect on thinking and planning disappear into the data when the data set is sufficiently large, and can be correctly identified as outliers and discounted.


Retail

Retail and etail are increasingly integrated, and both experiences generate vast quantities of unstructured data. Big data can help retailers stitch together the online and offline personas and activities of their customers, and offer them more enjoyable and effective marketing and other experiences.

Retail is one of the industries with the longest history of collecting personal data, and has been using methods like loyalty programs for decades. More recently, efforts in this direction have focused on collecting IP addresses, cross-marketing to social, and tracing transactions and logins. This, together with in-store data collected the old-fashioned way or through apps, and social media data, allows retail businesses to offer customers increasingly personalized ads, messaging and products. All this is big data in action.

The results can be most interesting when data from two tranches is compared. Walgreens and Pantene partnered with the Weather Channel to predict when high humidity would increase demand for anti-frizz products. Then they served up ads and promotions in-store to drive sales on those days. Sales rose 4% across the hair care category and 10% for Pantene products in just two months!

Media and communications


Media and communications companies are in the business of delivering content to a massive audience, each member of which has slightly different interests and needs. They’re also in the business of advertising. Prior to the advent of big data, targeting content was as difficult as targeting ads; newspapers, magazines and TV channels had to be broad churches, offering a lot of material that many audience members didn’t really want.

Now, that’s very different. Digital publications gather huge quantities of data on audience behaviors; many have moved beyond surface-level metrics like views and clicks, which can lure publications into creating a lot of content that many people look at but few care about. Instead, large, unstructured data tranches are mined for ‘single customer view’ insights into how individuals behave, then matched against personas to deliver newsletter content and website recommendations that actually match that individual’s interests. In effect, the move is toward curated micropublications targeted at audience segments, which is exactly the same trend as in digital advertising.

New insights into audience preferences and behaviors can be used for content and advertising targeting, but also for scheduling optimization and audience retention.


Banking

The banking industry is also using customer data analytics to improve customer services, and to offer customers personalized options for insurance, finance and accounts. But it’s also using predictive analytics derived from big data to improve its fraud detection and prevention systems, building lookalike pictures of suspicious financial activity from hundreds of millions of legitimate and illegitimate transactions and using these to attempt to prevent fraudulent actions in advance.

A simple version of these techniques became available as soon as banking was effectively digitized. If you use your credit card in Sacramento and then again an hour later in Brooklyn, something doesn’t add up. Once banks can see that usage pattern, they can issue a temporary hold on the card and contact you to figure out what’s going on. But big data analytics offers much more sophisticated fraud prevention. This extends to identifying less-obvious forms of fraud through a complex analysis of usage patterns, but also to preventative measures aimed at identity theft.
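The simple check described above can be sketched as follows. This is a minimal illustration, not any bank's actual rule: the coordinates are approximate, the 900 km/h ceiling is an assumption (roughly airliner cruising speed), and `is_impossible_travel` is a hypothetical helper.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 900.0  # assumption: roughly airliner cruising speed


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_travel(first, second, hours_apart):
    """True if the card could not plausibly have moved that far in time."""
    if hours_apart <= 0:
        return first != second
    speed_kmh = distance_km(*first, *second) / hours_apart
    return speed_kmh > MAX_PLAUSIBLE_KMH


SACRAMENTO = (38.58, -121.49)  # approximate (latitude, longitude)
BROOKLYN = (40.68, -73.94)     # approximate (latitude, longitude)

print(is_impossible_travel(SACRAMENTO, BROOKLYN, hours_apart=1.0))
```

A single hard-coded rule like this is the pre-big-data version; the more sophisticated approaches replace the fixed threshold with patterns learned from very large transaction histories.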

Account fraud, which is much more damaging to customers than credit card fraud, has long relied on stolen Social Security numbers and fake ID cards. These are usually enough to open a new account in someone else’s name, or to take over a legitimate bank account so that control of it passes to a fraudster. Increasingly, however, location and behavioral data can be used to deny fraudsters such opportunities.

Some banks are collecting online usage data that goes beyond time of login, and using that to identify potential fraudsters. In April this year, a customer at the National Australia Bank appeared to try to raise her account transaction limit from AU$20,000 (US$15,000) to AU$100,000 (US$75,000). The password and username were correct, but ‘the way she was using her mouse looked different,’ Chris Sheehan, a National Australia Bank investigations manager, told Protocol. ‘The number of clicks on the mouse looked different. Her cutting and pasting details looked different.’

Based on that data, the change was refused — and the bank called the real customer, who was in the process of being conned by a fraudster. Big data analytics saved that customer tens of thousands of dollars — and banks are now doing this for consumers and businesses, at scale and regularly.
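One very simplified way to picture this kind of behavioral check is to compare a session against a customer's own history. The sketch below is an assumption-laden illustration, not the bank's method: the signal names, the numbers, and the three-standard-deviation threshold are all invented.

```python
from statistics import mean, stdev


def z_score(history, value):
    """How many standard deviations value sits from the historical mean."""
    spread = stdev(history)
    if spread == 0:
        return 0.0 if value == history[0] else float("inf")
    return abs(value - mean(history)) / spread


def looks_unusual(baseline, session, threshold=3.0):
    """Flag a session if any tracked signal is far from the user's baseline."""
    return any(
        z_score(baseline[name], session[name]) > threshold
        for name in baseline
    )


# Hypothetical per-session signals for one customer; all numbers are made up.
baseline = {
    "mouse_clicks": [42, 38, 45, 40, 41],
    "paste_events": [0, 1, 0, 0, 1],
}
suspicious_session = {"mouse_clicks": 97, "paste_events": 6}
normal_session = {"mouse_clicks": 43, "paste_events": 1}

print(looks_unusual(baseline, suspicious_session))
print(looks_unusual(baseline, normal_session))
```

Production systems would draw on far more signals and far more history than this, which is where the scale of big data comes in.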


Education

Big data analytics is increasingly used at every level of tertiary education. Many institutions are now tracking student interaction frequency and quality via logins, mouse movements and activity on university sites, for instance, and using this information both to optimize their offerings to students and to help encourage attendance and participation — especially during the pandemic.

Institutions are also using big data analytics to improve efficiency and productivity, and to more effectively allocate resources. Enrollment reports, student reviews, and digital signals like mouse movements, attendance and logins can be used to identify undersubscribed courses or less-effective educators, with the data analysis and insight production done by big data and machine learning.

Big data: challenges and controversies

Big data is a good, but it’s not an unadulterated, unquestionable good. It raises issues of whose data gets used, for what, and what say they should have in that process. And it’s come of age at a time when privacy online is an increasing concern, reflected in laws establishing new data rights in the EU and USA and the growth of businesses that build privacy into their model, as well as the end of some tools like cookies and scraping that made online data relatively easy to access and use.

Two of the biggest issues facing companies that use or hope to use big data are personal information and targeted advertising.

Personal information means any information that can be used to personally identify you, particularly in a negative or harmful way. Much of the data that makes up big data sets is personal information, acquired in ways users are only dimly aware of, if they’re aware of it at all. This can include direct acquisition through cookies, website user recording tools like Hotjar and Crazy Egg, onsite behavioral data, in-store tracking through beacons, and more. But it can also include data acquired from third parties, from bought lists for sales through to data collected by other institutions. People are often unaware that this data is being shared to the extent that it is, and they’re increasingly concerned that it constitutes a violation of privacy. They’re particularly concerned about this where it’s used to target advertising.

Targeted advertising is a key usage for big data analytics and has been since the inception of the field. It’s also one of the biggest areas of concern for consumers: ‘public outcry over company data breaches and the use of targeting to spread fake news and inflame political partisanship,’ say HBR’s Leslie K. John, Tami Kim, and Kate Barasz, ‘have, understandably, put consumers on alert.’

‘Now,’ the authors continue, ‘regulators in some countries are starting to mandate that firms disclose how they gather and use consumers’ personal information.’ That includes the EU, whose General Data Protection Regulation (GDPR) forced a sharp rethink for many marketers, as well as California’s new Consumer Privacy Act (CCPA), which ‘secures new privacy rights for California consumers, including: ... The right to delete personal information collected from them (with some exceptions); The right to opt-out of the sale of their personal information; and the right to non-discrimination for exercising their CCPA rights.’

Much of the impetus for these new privacy regulations comes from consumer and campaigner concerns around cross-device digital tracking, third-party cookies (which Google is phasing out by 2023), and tracking across the web by social networks like Facebook, whose main income stream is ads.

The balance between benefits to users and intrusions on their privacy will be a difficult one to strike, and some of it will happen at the legislative level. However, consumers see the value in targeted ads and indicate a willingness to swap or sell their personal data in exchange for personalized ads or services — they just want to be in charge of the process.

Big data in our day-to-day

Most of us will increasingly encounter both data harvesting and the results of big data analytics in our everyday lives. In fact, several services reliant on big data are already a common part of our expectations. Most mapping services, including Google Maps, rely on big data for their effectiveness. As data lakes from transportation nets, traffic reportage and drivers’ usage data are fed to the algorithms, Google Maps’ advice and directions get more accurate and useful — a recursive big data effect, since we use them more the more useful they get.

Online shopping is another staple of modern life that’s totally reliant on big data. The data lakes behind Amazon’s recommendations engine are huge; postage and delivery are dynamically calculated by comparing locations with previous delivery data.

On a larger scale, the very environments we live in are being transformed by big data. Algorithms powered by enormous unstructured data sets are increasingly used to plan urban construction projects, forecast and manage traffic flows, and design and engineer buildings. Data from how we use these structures and processes is collected and used to optimize further iterations, in a process that’s familiar to the software world even as it becomes normal everywhere.


  • Big data increasingly suffuses everyday life and plays a part in almost all design and engineering projects. We’re using it every time we buy online or plan a journey, and we’re contributing data to it every time we catch a train or drive through a city.
  • The industrial uses of big data underlie advances in efficiency, productivity and accuracy across multiple industrial sectors, and eventually will permit the linkage of smart production, smart procurement and smart supply chains.
  • These are the early days of big data usage. As connectivity and machine learning improve and methods of obtaining data sets for big data analytics acquire clear legitimacy, we can expect to see this technology transform our experience of cities, transportation, and software.

Written by Craig Gosselin

Craig is responsible for client management, sales, and marketing. He has deep mobile experience, including growing businesses in Fortune 500, venture capital and private equity environments. This includes helping launch Richard Branson's startup Virgin Mobile, a successful sale of private equity owned Velocita Wireless, and roles as Senior VP at American Express and leader of a $3.5B business unit at AT&T. In his spare time, Craig is an instructor in the Columbia Business School Venture For All Program.
