What does the data of a super hero look like?
The data of a super hero must meet the following requirements:
Number of Tables
The super hero knows there can only be one. Therefore he will prepare 1 single table. This single table is called the analytical base table (ABT). To get to this ABT, for example from a star schema, some data preparation like performing joins might be required.
Example: From Star Schema to 1 single Analytical Base Table (ABT)
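As a rough illustration of the join step, here is a minimal sketch in plain Python rather than SAS. The table names, column names and values are all hypothetical; the only point is that each fact row is enriched with the attributes of its dimensions, so that one wide table remains.

```python
# Hypothetical star schema: one fact table plus two dimension tables.
sales_fact = [
    {"customer_id": 1, "product_id": 10, "amount": 250.0},
    {"customer_id": 2, "product_id": 11, "amount": 120.0},
    {"customer_id": 1, "product_id": 11, "amount": 80.0},
]
customer_dim = {1: {"customer_city": "Gent"}, 2: {"customer_city": "Leuven"}}
product_dim = {10: {"product_name": "Cape"}, 11: {"product_name": "Mask"}}


def build_abt(facts, customers, products):
    """Left-join every fact row to its dimensions: one wide row per fact."""
    abt = []
    for fact in facts:
        row = dict(fact)  # copy, so the fact table itself is not modified
        row.update(customers.get(fact["customer_id"], {}))
        row.update(products.get(fact["product_id"], {}))
        abt.append(row)
    return abt


abt = build_abt(sales_fact, customer_dim, product_dim)
for row in abt:
    print(row)
```

In SAS the same result would typically come from a join or merge step; the sketch only shows the shape of the output.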
The super hero delivers his single table in his preferred super format: the SAS dataset.
A super hero knows that it's not the size of his data that matters, but what he does with it. Therefore he limits the size of the single table to a maximum of 1 GB. He doesn't need tricks like compressing or zipping SAS datasets to stay within that limit, so the single table is neither compressed nor zipped.
The data of a super hero is like the super hero himself: masked and anonymous. Therefore the data is depersonalized and not confidential.
Example: Depersonalising Tables
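One possible way to depersonalise a table is to drop the directly identifying columns and replace the remaining identifiers with salted hashes. The sketch below is plain Python under stated assumptions: the column names, the salt and the 12-character token length are hypothetical, and whether this is enough to make a dataset non-confidential depends on the data itself.

```python
import hashlib

# Hypothetical salt: keep it private and never ship it with the dataset.
SECRET_SALT = "replace-with-a-private-value"


def pseudonymise(value, salt=SECRET_SALT):
    """Return a short, stable token that replaces an identifier."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return digest[:12]


def depersonalise(rows, drop=("name", "email"), mask=("customer_id",)):
    """Drop directly identifying columns and mask the remaining identifiers."""
    cleaned = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in drop}
        for key in mask:
            if key in new_row:
                new_row[key] = pseudonymise(new_row[key])
        cleaned.append(new_row)
    return cleaned


customers = [
    {"customer_id": 1001, "name": "Clark Kent", "email": "clark@example.com", "city": "Gent"},
    {"customer_id": 1002, "name": "Bruce Wayne", "email": "bruce@example.com", "city": "Leuven"},
]
masked = depersonalise(customers)
for row in masked:
    print(row)
```

Because the same input always yields the same token, masked identifiers can still be used to join tables before the ABT is built.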
Variable Format Types
The super hero loves to KISS, meaning Keeping It Short & Simple. Therefore his data variables are in one of the following formats: Character, Numeric, Date, Time, DateTime.
Besides the 'must have' requirements described above, the super hero tries to follow the guidelines below as much as possible:
Analytical Base Table Layout
A super hero pays a lot of attention to the layout of his battle-suit. Similarly, he pays a lot of attention to the layout of the data table. To be powerful for analytics, the layout of an analytical base table is wide (as opposed to long), as shown in the example below.
Example: Analytical Base Table Layout
Note that 'Sale' and 'CostOfSale' show up in different rows. This is not good for performing analytics. An analytical layout is needed: the table should be flattened by moving rows to columns.
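The flattening can be sketched as follows in plain Python. The 'Sale' and 'CostOfSale' names come from the text above; the 'Month', 'Measure' and 'Value' column names and all figures are made up for illustration.

```python
# Long layout: every measure sits in its own row.
long_rows = [
    {"Month": "2013-01", "Measure": "Sale", "Value": 1000},
    {"Month": "2013-01", "Measure": "CostOfSale", "Value": 600},
    {"Month": "2013-02", "Measure": "Sale", "Value": 1200},
    {"Month": "2013-02", "Measure": "CostOfSale", "Value": 700},
]


def flatten(rows, key, name_col, value_col):
    """Move rows to columns: one output row per key, one column per measure."""
    wide = {}
    for row in rows:
        target = wide.setdefault(row[key], {key: row[key]})
        target[row[name_col]] = row[value_col]
    return list(wide.values())


wide_rows = flatten(long_rows, "Month", "Measure", "Value")
for row in wide_rows:
    print(row)
```

After flattening, 'Sale' and 'CostOfSale' sit side by side in the same row, so a derived measure such as margin becomes a simple column calculation.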
The super hero is always in the place to be. Therefore he makes sure that his dataset contains geographical dimensions such as continent, country, province, city/zip, ...
The super hero realizes that visualizations using a geographical map give a unique power to explore the geographical dimension. Therefore geographical coordinates will be added to his dataset. The coordinate space used must be one of the following: World Geodetic System (WGS84), Web Mercator, British National Grid (OSGB36).
Coordinates for the Belgian cities (zip codes) and provinces can be provided by SAS.
Other geographical coordinates can be obtained via OpenStreetMap as follows:
- Navigate to http://www.openstreetmap.org
- Double click the location you're looking for (the more you zoom, the more precise it will be)
- Click on 'permalink' in the bottom right corner of your screen
- The Latitude & Longitude coordinates are now shown in the URL of the webpage.
- These coordinates are ready for use in SAS Visual Analytics.
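If many locations have to be looked up, reading the coordinates back out of the URL can be scripted. The sketch below is plain Python and assumes two URL layouts: a fragment of the form `#map=<zoom>/<lat>/<lon>` and a query string of the form `?lat=<lat>&lon=<lon>&zoom=<zoom>`. Check the URL your browser actually shows before relying on either; the function name is hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def coordinates_from_osm_url(url):
    """Return (latitude, longitude) from an OpenStreetMap URL, or None."""
    parts = urlparse(url)

    # Assumed layout 1: ...#map=<zoom>/<lat>/<lon>[&layers=...]
    fragment = parts.fragment.split("&")[0]
    if fragment.startswith("map="):
        pieces = fragment[len("map="):].split("/")
        if len(pieces) >= 3:
            return float(pieces[1]), float(pieces[2])

    # Assumed layout 2: ...?lat=<lat>&lon=<lon>&zoom=<zoom>
    query = parse_qs(parts.query)
    if "lat" in query and "lon" in query:
        return float(query["lat"][0]), float(query["lon"][0])

    return None


print(coordinates_from_osm_url("http://www.openstreetmap.org/#map=15/50.8467/4.3525"))
print(coordinates_from_osm_url("http://www.openstreetmap.org/?lat=50.8467&lon=4.3525&zoom=15"))
```

Coordinates read this way are latitude/longitude pairs, which corresponds to the WGS84 option listed earlier.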
Timing is everything, so the super hero's data contains date/time dimensions. These are a must for gaining insight through forecasting and scenario analysis.
Multiple Date/Time Intervals
Everybody needs more time, even a super hero. Therefore he derives other variables from one Date/Time variable. For example, the following variables can be derived from a Date variable: DateByMonth, DateByQuarter, DateByYear. This has several advantages in SAS Visual Analytics, such as creating hierarchies in the date/time dimension or filtering easily on a certain month/quarter/year.
The super hero works at all levels, so he makes sure his dataset is structured in a way that allows hierarchies to be created.
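A minimal sketch of such a derivation in plain Python: the variable names are modelled on those in the text, while the string formats ("2013-08", "2013-Q3") are assumptions chosen for readability and not a SAS convention.

```python
from datetime import date


def derive_date_variables(d):
    """Derive month, quarter and year variables from a single date."""
    quarter = (d.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, ...
    return {
        "Date": d.isoformat(),
        "DateByMonth": f"{d.year}-{d.month:02d}",
        "DateByQuarter": f"{d.year}-Q{quarter}",
        "DateByYear": d.year,
    }


row = derive_date_variables(date(2013, 8, 15))
print(row)
```

Storing the derived variables as separate columns is what makes them directly usable as hierarchy levels and filters.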
The example below shows how the data should be structured to be able to get a hierarchy from Make to Type to Engine:
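As a stand-in sketch in plain Python, a hierarchy of this kind needs one column per level in every row. Only the level names Make, Type and Engine come from the text; the 'Sales' measure and all values are hypothetical.

```python
from collections import defaultdict

# One column per hierarchy level, filled in on every row (values are made up).
cars = [
    {"Make": "Audi", "Type": "Sedan", "Engine": "Diesel", "Sales": 10},
    {"Make": "Audi", "Type": "Sedan", "Engine": "Petrol", "Sales": 7},
    {"Make": "Audi", "Type": "Wagon", "Engine": "Diesel", "Sales": 4},
    {"Make": "Volvo", "Type": "Wagon", "Engine": "Petrol", "Sales": 6},
]


def roll_up(rows, levels, measure):
    """Total a measure at the given depth of the hierarchy."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[level] for level in levels)
        totals[key] += row[measure]
    return dict(totals)


print(roll_up(cars, ["Make"], "Sales"))
print(roll_up(cars, ["Make", "Type"], "Sales"))
print(roll_up(cars, ["Make", "Type", "Engine"], "Sales"))
```

Because each level is a separate column, the same table can be summarized at Make level and then drilled down to Type and Engine without any restructuring.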
The super hero has an analytical mind, so he knows that more is better when it comes to measures. Therefore his dataset contains many measures, which makes correlation analysis possible.
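To make the idea concrete, here is a small plain-Python sketch that computes a Pearson correlation between pairs of measures. The measure names and figures are invented for illustration, and Pearson is just one common choice of correlation measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long measures."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical measures, one value per row of the dataset.
measures = {
    "Sale": [100, 200, 300, 400],
    "CostOfSale": [60, 110, 170, 220],
    "Discount": [40, 30, 20, 10],
}

names = list(measures)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        print(first, second, round(pearson(measures[first], measures[second]), 3))
```

With only one or two measures there is nothing to compare; every extra measure adds a whole set of pairs to explore.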
The super hero doesn't get lost in translation. Therefore he translates all IDs into meaningful values, like in the example below:
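A translation of this kind can be sketched in plain Python as a lookup against a code table. The column name, the codes and the labels below are hypothetical.

```python
# Hypothetical code table: numeric IDs and the labels they stand for.
REGION_LABELS = {1: "Flanders", 2: "Wallonia", 3: "Brussels"}


def translate_ids(rows, column, labels, unknown="Unknown"):
    """Replace the IDs in one column with their meaningful values."""
    return [{**row, column: labels.get(row[column], unknown)} for row in rows]


orders = [
    {"order": "A-1", "region": 1},
    {"order": "A-2", "region": 3},
    {"order": "A-3", "region": 9},  # code missing from the lookup table
]
readable = translate_ids(orders, "region", REGION_LABELS)
for row in readable:
    print(row)
```

Mapping unmatched codes to an explicit "Unknown" label keeps gaps in the code table visible instead of silently dropping rows.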
The super hero wants to make the world a better place and serve the public, so he tries to use open/public data.
Some super heroes will want to solve mobility issues in their town by predicting where and when to expect traffic jams, or by working out how to improve the public road infrastructure so that the jams disappear completely.
Others may want to understand which parameters influence the spread of diseases most, link employment rates to education, or analyze interesting phenomena such as floods, garbage trucks, international trade, world population evolutions, etc.
The data is out there, grab it!
Here are some examples of great sites where you can find open data:
- Data.gov.be - Open data initiative from the Belgian government
- Open Belgium
- Gapminder World data
- Aviz Visual Analytics Project – statistics per country
- Datacatalog World bank - Open data to alleviate poverty
- OpenSpending.org - Public data on government finance
- PublicData.eu - European open data initiative
- The Data Hub - Open data search engine
- The Guardian Data Store - Gateway to open data from governments around the globe
- Open flight data
- US Bureau of Transportation Statistics
Example: PROC Contents of a valid sample dataset: