Archive for the ‘statistics’ Category

Tableau Public

June 1, 2011

After downloading but just not getting around to using the software on three separate occasions, I finally created my first chart (or should I say “visualization”) using the free Tableau Public platform. The visualization I created was for Give Me The Rock and it graphed the top 200 fantasy basketball players on a scatter plot based on how similar or dissimilar they played during the 2010-11 season.

While I am by no means an expert on using the software, here are my first impressions of Tableau Public.

The Good Stuff

I mean, let’s start with the fact that Tableau Public is free to use. In a world where I pay to drink water, that is a very good thing. Loading up the software, I noticed that the interface is very slick and creating a basic chart is fairly easy to do. I’m a guy who likes to jump in head first without reading a manual, and I appreciate that I could do that and still create something cool. For the most part, the software is drag and drop, with your data displayed on the left-hand side of the screen and the visualization on the right. You simply start dragging your data into the appropriate place and the magic happens.

Also, the visualizations look great – on par with or better than what Excel can produce. The shapes, colors, and labels are all sharp and really pop on the screen. But the best thing about the software is that Tableau Public knows exactly what it is and what people use it for. The chart options it provides seem spot on, and the ease and speed with which the software lets you slice and dice your data and change the way you present it is something Excel can’t come close to touching. For example, my visualization of 200 NBA players was originally crammed onto the screen, but Tableau Public has a pages option that allows you to split your presentation into different pages that can be viewed individually. I ended up splitting my graph into pages by player position to make it more readable.

If you’re the type of person who likes to experiment with different ways to present data, then Tableau Public has the potential to save you a lot of time.

Finally, Tableau Public makes it incredibly easy to share visualizations on the web. It provides scripts that you can copy and paste right into your blog, as well as an option to email it to others. You can also download any visualization on the web onto your computer if you have Tableau Public. No more worrying about versions of Excel and compatibility issues.

The Bad Stuff

Like all cloud software that I’ve used, Tableau Public is a touch slow for my tastes, especially in regard to loading and saving data. For a one-off chart (excuse me, visualization) I can deal with it, but given the spastic way I typically work, I wouldn’t want to work with it all day long. Tableau has professional versions of the software (for $999 and $1,999 depending on the version) that I’d assume solve this problem by not requiring you to work in a cloud.

And while the software is generally easy to use, I’d say it’s still a step or two away from being perfectly intuitive. There were a few things that took me a while to figure out. I didn’t realize you could drag the interactive legends/tools around the screen and they would be presented in that exact location when you published the visualization (very handy). And while I split my graph by player position to make it easier to read, I still wanted to add a total page that presented all the data on the screen at once. I never did figure out how to do that.

Finally – and this may bother some people more than others – while Tableau Public is free, you are required to save your visualization to their servers. Once published, your data is in the public domain for all to see. Obviously if you have proprietary data, Tableau Public is not for you (although again, the paid versions will solve this problem).

The Verdict

Overall, I’m impressed with Tableau Public. As a guy who uses Excel on a daily basis, I don’t see it replacing Excel for my day-to-day work, but it has definite advantages over Excel, like its ability to slice and dice data in any number of ways and the fact that it makes sharing visualizations with others very easy. The next time I create a graph that is going to be displayed on a website, I’d hands down use Tableau Public to do that. And did I mention it’s free?

Fake Your Data like a Machine

July 14, 2010

You may have heard the news about a little polling company called Research 2000, which has gotten into a little hot water recently for supposedly faking polling data during the 2008 election. The Daily Kos uncovered the story based on some tips from a few statistical wizards who spotted some abnormalities in the data.

To break it down, in addition to the Daily Kos, Research 2000 provided polling services for a large number of local television and newspaper affiliates. Well, “provided” in the sense that they will likely not be providing said research services much longer. Research 2000 president Del Ali quickly and unsurprisingly shot back against the charges of faking data, writing in a statement:

Every charge against my company and myself are pure lies, plain and simple and the motives as to why Kos is doing it will be revealed in the legal process and not before that. I will share one little minor reason that Kos is doing this and it pertains to the fact they owe us a significant sum of monies that is in the six figure category and payment was on June 15, 2010.

Of course, the fact that he won’t publicly release his data (likely because it doesn’t exist) does hurt his credibility a little.

But this brings us to the larger and more important issue – when you fake your survey data, don’t do it like a human being. See, the world has both a randomness and an order to it that our feeble minds can’t quite grasp. And when we try to randomize things, we do it in a much too orderly fashion.

The thing that brought down Research 2000 is that their data was much too “clean.” It didn’t have the error associated with it that one would expect from a random survey. For example, when you fake your data, don’t make all the breakdowns either even OR odd. It’s best to mix it up a little.

Bad fake data when ALL the male/female comparisons are both either even or odd
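To get a rough feel for why matching parity is a red flag, here is a back-of-the-envelope sketch. It assumes that in an honest poll the male and female figures land on even or odd numbers independently of each other, so any one pair matches about half the time. The function name and the pair counts are made up for illustration; they are not taken from the Research 2000 data.

```python
def prob_all_pairs_match(n_pairs: int) -> float:
    """Chance that every male/female pair shares parity (both even or both odd),
    assuming each pair matches independently with probability 1/2."""
    return 0.5 ** n_pairs


# Hypothetical numbers of male/female comparisons in a set of polls.
for n in (5, 10, 20):
    print(f"{n} pairs all matching: {prob_all_pairs_match(n):.7f}")
```

Under that assumption, the odds of a perfect run shrink by half with every extra comparison, which is why a long streak of matches looks man-made rather than random.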

Also, when you fake your data, you most likely want to make sure it is normally distributed, because that’s how the world operates. See, this Gallup poll demonstrates what a normal distribution looks like. It’s what’s referred to as the bell curve.

This Research 2000 poll on the other hand, demonstrates that humans don’t like the number 0 when faking data.

There were a number of other problems with the data that have been well documented on the Daily Kos website. It all adds up to a damning set of evidence indicating that Research 2000 faked some serious-ass data. Hundreds of thousands of dollars worth of data. And didn’t do it particularly well.

The fact that the company didn’t have a mailing address and operated out of a Kinko’s probably should have been enough of an indication that something wasn’t right.

Categories: poll, research, statistics, survey

Quantitative Analytical Techniques

July 9, 2010

Statistics is a branch of mathematics that deals with the collection, analysis, interpretation, and presentation of masses of numerical data. When a measurement is calculated for an entire population, say the average age, it’s called a parameter. When we calculate the same measurement, again the average age, across a sample, we call it a statistic. Since people make entire careers out of the study of statistics, the point of this post is to present a bird’s-eye overview and brief description of common terms you’ll hear in conversations about quantitative analysis.
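As a minimal sketch of the parameter versus statistic distinction, the snippet below builds a made-up population of ages and compares the average over everyone with the average over a random sample. The population, the sample size of 200, and the variable names are all assumptions chosen for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch gives the same numbers each run

# Hypothetical population: the ages of 10,000 people.
population = [random.randint(18, 90) for _ in range(10_000)]

# Parameter: the measurement taken over the entire population.
parameter_mean = statistics.mean(population)

# Statistic: the same measurement taken over a random sample of 200 people.
sample = random.sample(population, 200)
statistic_mean = statistics.mean(sample)

print(f"population mean (parameter): {parameter_mean:.2f}")
print(f"sample mean (statistic):     {statistic_mean:.2f}")
```

The two numbers will usually be close but not identical, and that gap is what the rest of the toolkit tries to quantify.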

When discussing statistics, researchers usually talk about the data in terms of “variables.” A variable is a characteristic that may assume more than one value (age, income, and birthplace can all have more than one value). A variable can be nominal, ordinal, interval, or ratio in its scale. Nominal variables are also referred to as categorical variables because they represent categories of responses. The color of a car would be represented by a categorical value (for example, black, red, or silver). Categorical variables have no set order, meaning that a black car is not necessarily any better than a silver car.

The level of satisfaction with one’s car on a 1 to 10 scale is an example of an ordinal variable (where a 10 is a better score than a 5 and a 5 is better than a 1). Ordinal variables have a clear, set order, but they still represent categories of responses. Interval and ratio variables are numerical variables whose numbers have direct meaning. The age of a car would be a ratio variable because it can be measured precisely and at equal intervals (in hours, years, or decades) and has a true zero point.

Variables can also be discrete or continuous. Continuous variables, such as time, have an infinite number of possible values, while discrete variables, such as a satisfaction scale, have a finite (in this case 10) number of possible values.
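One practical consequence of the scale is which summary makes sense. The sketch below is only an illustration with invented data: it uses the mode for a nominal variable, the median for an ordinal one, and the mean for a ratio variable. That pairing is a common rule of thumb, not a hard rule.

```python
import statistics

# Invented example data for each scale.
car_colors = ["black", "red", "silver", "black"]  # nominal: categories, no order
satisfaction = [3, 7, 10, 7, 5]                   # ordinal: ordered 1-10 ratings
car_age_years = [1.5, 4.0, 8.25, 12.0]            # ratio: equal intervals, true zero

most_common_color = statistics.mode(car_colors)        # counting is all we can do
median_satisfaction = statistics.median(satisfaction)  # order matters, distances may not
mean_age = statistics.mean(car_age_years)              # arithmetic is meaningful

print(most_common_color, median_satisfaction, mean_age)
```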

Descriptive statistics are simple portrayals of what the variables show. They are summaries of the frequency of the different values (like percentages); the central tendency (mean, median or mode); and the dispersion (like the range and the standard deviation). Cross tabs (short for cross tabulations) are popular for displaying the joint distribution of two or more variables. They are usually presented in a matrix called a contingency table. In a cross tab table, each cell gives the number of respondents that gave a particular combination of responses.
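Here is a small sketch of those summaries on a made-up survey of eight respondents. The responses, the cutoff of 7 for “satisfied,” and the variable names are assumptions for the example.

```python
import statistics
from collections import Counter

# Hypothetical survey responses: (car color, satisfaction on a 1-10 scale).
responses = [
    ("black", 8), ("red", 6), ("silver", 8), ("black", 9),
    ("red", 4), ("silver", 7), ("black", 8), ("silver", 5),
]
scores = [score for _, score in responses]

# Central tendency.
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Dispersion.
value_range = max(scores) - min(scores)
stdev = statistics.stdev(scores)  # sample standard deviation

# Cross tab: number of respondents for each (color, satisfaction band) cell.
crosstab = Counter(
    (color, "7+" if score >= 7 else "under 7") for color, score in responses
)

print(mean, median, mode, value_range, round(stdev, 2))
for cell, count in sorted(crosstab.items()):
    print(cell, count)
```

Each key in `crosstab` is one cell of the contingency table, and its count is the number of respondents who gave that combination of answers.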

Measures of association summarize the relationship between two variables (correlation and regression, for instance). Two variables are associated when information about one can help us predict information about the other. A variety of techniques to measure association are available, each better suited to different classes of variables. When analyzing data, most statisticians use multivariate analysis where the effects of many variables are considered.
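As one concrete measure of association, the sketch below computes the Pearson correlation coefficient by hand for two invented variables, car age and owner satisfaction. The data and the helper name `pearson_r` are made up for illustration, and Pearson’s r is only one of many possible measures.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both spreads, from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented data: older cars tend to get lower satisfaction scores.
car_age = [1, 3, 5, 7, 9]
satisfaction = [9, 8, 6, 5, 2]

r = pearson_r(car_age, satisfaction)
print(f"r = {r:.3f}")  # strongly negative: knowing the age helps predict satisfaction
```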

Tests of statistical significance are used to determine how sure we can feel about the associations found in the data: Could it just be chance? Can we infer that the result can be generalized to the study population? Confidence intervals, chi-square tests and t-tests are the most common statistics used to indicate the probability of saying that there is a difference between two groups when actually there is none (level of significance).
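To make the chi-square test a little more concrete, here is a minimal sketch that computes the statistic for a 2x2 cross tab and compares it with 3.841, the standard critical value for one degree of freedom at the 0.05 level. The counts and names are invented, and the sketch skips refinements such as a continuity correction.

```python
def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over every cell,
    where expected counts assume the rows and columns are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Hypothetical cross tab: rows = men, women; columns = satisfied, not satisfied.
observed = [[30, 20],
            [20, 30]]

CRITICAL_0_05_DF1 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

chi2 = chi_square_statistic(observed)
significant = chi2 > CRITICAL_0_05_DF1
print(f"chi-square = {chi2:.2f}, significant at 0.05: {significant}")
```

With these made-up counts the statistic just clears the threshold, so we would call the difference between men and women significant at the 0.05 level.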

Measures of association can be used in very sophisticated ways. Conjoint analysis can be used to determine trade-offs customers are willing to make among product or service attributes. In addition to understanding current preferences, this technique allows modeling of the impact of the introduction of new factors on preferences.

Discrete choice analysis models selection of a product or concept with many attributes from a set of products or concepts. In essence, it models how people make decisions in the real world. For example, one could test products with varying combinations of features to assess which consumers prefer. As with conjoint analysis, discrete choice analysis allows modeling of the impact of the introduction of a new product or concept on factors such as market share.

Cluster analysis identifies population segments using groups of variables. This provides information to better understand and communicate with customers, or to help you understand your place in the marketplace. In general, whenever one needs to classify a mass of information into manageable and meaningful groups, cluster analysis is a very useful technique.
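Cluster analysis covers many algorithms; the sketch below uses k-means, one of the simplest, purely as an illustration. The customer data, the choice of two segments, and the naive starting centroids are all assumptions, and a real analysis would normally rescale the variables first.

```python
import math


def kmeans(points, k, iterations=20):
    """Bare-bones k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = list(points[:k])  # naive start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]  # keep the old centroid if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customers: (store visits per month, monthly spend in dollars).
customers = [(1, 20), (9, 210), (2, 25), (10, 200), (1, 30), (11, 220)]

centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Nobody told the algorithm which customers are occasional shoppers and which are regulars; the two segments come out of the data.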

Discriminant analysis is used to define which variables best differentiate between predefined groups. The key difference is that discriminant analysis relies on previously defined groups whereas cluster analysis uses the data to discover these groups.

Factor analysis finds the underlying construct behind answers to a series of questions. In other words, factor analysis is designed to classify variables. For clients, it simplifies the interpretation of answers to many questions to a few “factors” that seem to drive answers to all questions. It can be used to determine the key factors that drive aspects like satisfaction, image or customer retention. In addition, factor analysis is used when designing surveys. Often complex concepts (like “leadership”) need to be turned into a group of concrete questions in order to query meaningfully.

Regression analysis (linear, non-linear and logistic) is widely used for forecasting. It estimates the effects of one or more variables on another. The objective of regression analysis is to understand the relationship between several independent (predictor) variables and a dependent (criterion) variable. This allows forecasting or estimation of the change in a dependent variable based on the change in an independent variable.
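As a minimal sketch of the linear case with a single predictor, the snippet below fits a least-squares line and uses it for a forecast. The ad spend and sales figures and the helper name `fit_line` are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented data: ad spend (independent variable) and sales (dependent variable).
ad_spend = [1, 2, 3, 4, 5]
sales = [12, 15, 15, 19, 19]

slope, intercept = fit_line(ad_spend, sales)
forecast = intercept + slope * 6  # predicted sales at an ad spend of 6

print(f"sales = {intercept:.1f} + {slope:.1f} * ad_spend")
print(f"forecast at ad spend 6: {forecast:.1f}")
```

The slope is the estimated change in the dependent variable for a one-unit change in the independent variable, which is exactly the quantity used for forecasting.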
