Eight Tips for Working with Big Data
By Abigail Lefkowitz, Data Analyst at MaassMedia, LLC
When describing a set of data, the word "big" doesn't just refer to size. "Big" implies that the data comes from multiple sources and is too complex overall for conventional business intelligence (BI) software to process in a practical amount of time and with sufficient computing memory.
By some estimates, the amount of data in the world is growing by about 50 percent each year, which means it more than doubles every two years. Since data volumes are not getting any smaller, it is imperative that organizations attempting to gain knowledge and insights from their data establish a methodology for storing, analyzing, and interpreting big data.
When you are ready to delve into your data, keep these tips in mind:
1. Start small, think big
Sampling is widely accepted as a preferred standard practice. Sampling takes a random, unbiased selection of individuals from a population to estimate the characteristics of that population. A sufficiently large, representative sample should produce results similar to those of the entire population.
First, choose a sample size large enough to support statistically significant results. Stated simply, a result is statistically significant if it is unlikely to have occurred by chance. Repeating the analysis on multiple independent samples of the same size increases confidence that a result is not a fluke. Next, randomly select that number of data points from the population and carry out the analysis as planned.
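As a rough illustration of these two steps, the Python sketch below picks a sample size and then draws a random sample. It assumes the quantity of interest is a proportion (here, a made-up conversion rate) and uses the standard sample-size formula for a proportion, n = z^2 * p * (1 - p) / e^2. The function names, the synthetic population and the 5 percent margin of error are illustrative assumptions, not part of any particular tool.

```python
import math
import random


def sample_size_for_proportion(margin_of_error, z_score=1.96, expected_proportion=0.5):
    """Sample size needed to estimate a proportion in a large population.

    z_score=1.96 corresponds to 95 percent confidence; expected_proportion=0.5
    is the most conservative choice when the true rate is unknown.
    """
    n = (z_score ** 2) * expected_proportion * (1 - expected_proportion) / margin_of_error ** 2
    return math.ceil(n)


def draw_sample(population, size, seed=None):
    """Randomly select `size` items without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, min(size, len(population)))


# Synthetic population (assumption): 100,000 visitors, roughly 10 percent converted.
_rng = random.Random(42)
population = [_rng.random() < 0.10 for _ in range(100_000)]

n = sample_size_for_proportion(margin_of_error=0.05)
sample = draw_sample(population, n, seed=7)

true_rate = sum(population) / len(population)
estimated_rate = sum(sample) / len(sample)
print(f"sample size: {n}")
print(f"true rate: {true_rate:.3f}, estimated from sample: {estimated_rate:.3f}")
```

Running the same analysis with several different seeds is one simple way to check that a finding holds across independent samples.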
2. Use a unique identifier to connect different sources of data
It is advisable to link multiple data sources through a common key in order to create a more complete picture of visitor activity and set the stage for more in-depth, transformative insights. To connect data from various sources, every source must record the same unique ID for a given customer or visitor. Web analytics platforms, such as Google Analytics and Adobe SiteCatalyst, allow unique IDs to pass through the visitor cookies that are assigned when someone visits a website. These IDs can be linked to any other source of data that captures the same unique ID.
For example, a website utilizing a Web-based survey can assign each survey respondent a unique ID that gets passed from the site into your Web analytics platform. As a result, responses to the survey can be connected to behavioral Web data such as page views, time on site, etc. Without having both the unique survey ID and visitor ID, these types of merges would not be possible.
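The survey example can be sketched in a few lines of Python. This is a minimal illustration that assumes both sources have already been exported as lists of records sharing a field called visitor_id; the field names and values are hypothetical and do not reflect the export format of any specific analytics platform.

```python
def join_on_id(web_rows, survey_rows, key="visitor_id"):
    """Attach Web behavior to each survey response that shares the same ID.

    Responses with no matching Web record are dropped (an inner join).
    """
    web_by_id = {row[key]: row for row in web_rows}
    merged = []
    for response in survey_rows:
        behavior = web_by_id.get(response[key])
        if behavior is not None:
            merged.append({**behavior, **response})
    return merged


# Hypothetical exports from a Web analytics platform and a survey tool.
web_rows = [
    {"visitor_id": "v1", "page_views": 12, "time_on_site_sec": 340},
    {"visitor_id": "v2", "page_views": 3, "time_on_site_sec": 45},
]
survey_rows = [
    {"visitor_id": "v1", "satisfaction": 4},
    {"visitor_id": "v3", "satisfaction": 2},  # no Web record with this ID
]

merged = join_on_id(web_rows, survey_rows)
for row in merged:
    print(row)
```

Only visitor v1 appears in the merged output, because it is the only ID captured by both sources.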
3. ETL - Extract, transform and load
To obtain your data, you will need to extract it from each source and store it in a place with sufficient space, such as an in-house FTP server or data warehouse. Unstructured data needs to be transformed into a consistent format that allows for data analysis. Removing duplicates, grouping values and concatenating variables are a few examples of possible transformations. When performing these manipulations on millions of rows of data, using structured query language (SQL) can cut down on both processing and overall analysis time.
After the data has been extracted and transformed, it must be loaded into a storage location or data warehouse. Formatting the data before loading it into the data warehouse makes it easier to access in the future.
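The sketch below shows the three transformations mentioned above (removing duplicates, grouping values and concatenating variables) expressed in SQL and run through Python's built-in sqlite3 module. The in-memory database stands in for a real data warehouse, and the table names, column names and rows are invented for illustration.

```python
import sqlite3

# Hypothetical extracted rows: (visitor_id, first_name, last_name, page_views)
raw_rows = [
    ("v1", "Pat", "Lee", 3),
    ("v1", "Pat", "Lee", 3),  # exact duplicate of the row above
    ("v1", "Pat", "Lee", 5),
    ("v2", "Sam", "Roy", 2),
]


def run_etl(rows):
    """Load raw rows, transform them with SQL, and return the cleaned table."""
    conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
    conn.execute(
        "CREATE TABLE raw_visits ("
        "visitor_id TEXT, first_name TEXT, last_name TEXT, page_views INTEGER)"
    )
    conn.executemany("INSERT INTO raw_visits VALUES (?, ?, ?, ?)", rows)

    # Transform and load in a single statement:
    #   SELECT DISTINCT  removes exact duplicate rows
    #   ||               concatenates first and last name
    #   GROUP BY / SUM   groups page views per visitor
    conn.execute(
        """
        CREATE TABLE visitor_summary AS
        SELECT visitor_id,
               first_name || ' ' || last_name AS full_name,
               SUM(page_views) AS total_page_views
        FROM (SELECT DISTINCT * FROM raw_visits)
        GROUP BY visitor_id, first_name, last_name
        """
    )
    result = conn.execute(
        "SELECT visitor_id, full_name, total_page_views "
        "FROM visitor_summary ORDER BY visitor_id"
    ).fetchall()
    conn.close()
    return result


summary = run_etl(raw_rows)
for row in summary:
    print(row)
```

Pushing the work into the database engine, rather than looping over rows in application code, is what tends to keep processing time manageable as row counts grow.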
4. Select appropriate business intelligence tools
Standard software, like Microsoft Excel, cannot manipulate big data or carry out even simple calculations on it in a timely manner. The data's volume and complexity are part of the problem, as is the fact that it is commonly stored in separate silos with limited accessibility. Microsoft Access is an example of a powerful database that allows you to join and link datasets, which is especially useful when connecting unique IDs from different sources.
You will also need two additional analysis tools: one that can perform in-depth analysis and one that can produce visually appealing graphs and charts. It is difficult to find a program that can tackle both effectively. SAS and IBM SPSS are two examples of statistical analysis programs with excellent data mining capabilities, while Tableau is a tool that produces attractive and dynamic visualizations.
5. Incorporate demographic data
Overlaying demographic information on your dataset is helpful and highly recommended to better understand who your target audience(s) might be. FICO and Experian compile a variety of consumer scores based on credit card transactions and demographic status. For example, FICO can help predict whether a consumer will take their medication every day based on variables such as length of time living in one location and number of years owning a particular car. Again, provided a unique ID is available, the demographic information can be linked to the existing dataset.
6. Hire a "data scientist"
Making sense of big data can be daunting. Instead of taking it on alone, you might be better off hiring experienced analysts who are comfortable working with massive sets of data. Employing one or more people who can dedicate their time to analysis will produce more accurate, meaningful and transformative findings in a shorter amount of time. Look for analysts who are knowledgeable about storage, manipulation, importing and exporting, merging, sub-setting, hypothesis testing and mining for insights.
7. If possible, automate reports
When working with data that is continuously streaming in, automated reporting can save both time and money. Instead of manually reformatting the data each time a new chunk comes in, set up an automatic report with a filter that captures only the necessary variables and shapes them to fit the end goal.
Reports must first be customized to the preference of the end user, and then they can be automatically delivered to one or more recipients at regular intervals. Adobe SiteCatalyst and Google Analytics are two examples of Web analytics platforms that can automate daily, weekly and monthly reports.
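Where a platform's built-in reporting is not available, the filtering step can be approximated with a small script that a scheduler runs at the desired interval. The Python sketch below is one possible shape for such a script: it keeps only the chosen columns, drops rows below a threshold and returns the report as CSV text. The column names, the threshold and the sample rows are assumptions for illustration, and delivery (email, shared folder and so on) is deliberately left out.

```python
import csv
import io

# Hypothetical set of variables the end user asked for.
REPORT_COLUMNS = ["date", "page", "page_views"]


def build_report(rows, columns=REPORT_COLUMNS, min_page_views=1):
    """Return CSV text containing only the requested columns and qualifying rows."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=columns,
        extrasaction="ignore",  # silently drop fields that are not in the report
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        if row["page_views"] >= min_page_views:
            writer.writerow(row)
    return buffer.getvalue()


# A hypothetical new chunk of incoming data with extra, unneeded fields.
incoming = [
    {"date": "2024-01-01", "page": "/home", "page_views": 10, "browser": "Firefox"},
    {"date": "2024-01-01", "page": "/test", "page_views": 0, "browser": "Chrome"},
]

report = build_report(incoming)
print(report)
```

Because the filter lives in one function, each new chunk of data can be passed through it unchanged, with no manual reformatting.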
8. 75 percent investigation, 25 percent discovery
As with any type of analysis, it is a good rule of thumb to spend approximately 75 percent of your time testing pre-developed hypotheses and the other 25 percent poking around in the data with no specific direction in the hopes of uncovering nuggets of knowledge. It is likely that at least one hypothesis you have formulated will lead to additional analytical questions about the scope of the original hypothesis. As a result, most of the time spent testing hypotheses will itself surface findings you did not anticipate. Devote the smaller fraction of your time to slicing and dicing the data into unexpected segments, because buried insights can often be found by looking at familiar information in an unfamiliar way.