Thursday, August 22, 2013

Big Data.. So what? Part 2

Sorry for this delay in providing part 2 of this series, but stuff happened that had really high priority, and in addition I was on vacation. But now I'm back in business!

So, last time I left you with some open thoughts on why Big Data can be useful, but also that we need new analysis tools as well as new ways of visualizing data for this to be truly useful. As for analysis, let's have a look at text, which should be simple enough, right? And sometimes it is simple. One useful analysis tool that is often overlooked is Google. Let's give it a shot, just for fun, with two somewhat fierce competitors that we can compare, say Oracle and MySQL. Oracle is much older, both as a technology and as a company, and in addition owns the MySQL brand these days. But on the other hand, the Web is where MySQL has its sweet spot. Just Googling for MySQL and Oracle shows that MySQL seems to be much more discussed: "Oracle" gets 165 000 000 hits, whereas "MySQL" gets 226 000 000. (And no, I haven't turned stupid just because I was on vacation. I realize that there are many sites which don't talk about MySQL at all but have a "powered by MySQL" text or something like that, or where it isn't even shown to the end user, but anyway.)

One issue is that this is not terribly interesting in this case. If I am a small company with a little-known brand and I am trying to make my name known, then this is not a bad way of getting some kind of measurement of how I am doing. But in any case, this tells me nothing about what is being said, whether it's good or bad, or whether it's said by some powerful dude, like "Vladimir Putin says that MySQL sucks" for example, or just by yours truly: "Swedish Ponytailed Geek says MySQL is actually pretty much OK, but not as OK as MariaDB". You see my point here: we have the data, but we must analyze it better, even though this simple analysis is useful in some cases.

So, what kind of analysis do we do here? Text analysis, you say. Maybe we could analyze the text and look for negative or positive sentiments. This is exactly what advanced text analysis tools do, but they aren't foolproof, and, surprisingly, much less so than they used to be. The reason is cyber-language. Analyzing an editorial article in the New York Post is what these technologies rock at, but things such as blogs, Facebook and Twitter are much more difficult.

But I don't give up easily on my tools, so let's for fun try to use Google to look for sentiments. Let's search for "Oracle sucks" and "MySQL sucks" and see what we get. We already know that a simple search for MySQL gets many more hits than plain Oracle. So when I add a slightly negative word to my search, will this relationship remain? The answer is no: "Oracle sucks" gets 1 950 000 hits, whereas "MySQL sucks" gets 1 650 000 hits, i.e. noticeably fewer than Oracle, even though plain MySQL gets more hits overall. Does this tell us something then? Well yes, to an extent it actually does. Not so much that I would want to bet all my money on it, but combined with some other knowledge, this can turn out to be useful.
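To make that comparison a bit more concrete, here is a minimal Python sketch that relates the "sucks" hits to the plain hits for each name. It only uses the counts quoted above, which are Google's rough estimates and not exact figures; the helper name `negative_share` is made up for this illustration.

```python
# Hit counts quoted in the post (Google result estimates, not exact figures).
total_hits = {"Oracle": 165_000_000, "MySQL": 226_000_000}
sucks_hits = {"Oracle": 1_950_000, "MySQL": 1_650_000}


def negative_share(name):
    """Fraction of a name's plain hits that the '<name> sucks' search matches."""
    return sucks_hits[name] / total_hits[name]


for name in total_hits:
    print(f"{name}: {negative_share(name):.2%} of the hits are '{name} sucks'")
```

With these numbers the share comes out at roughly 1.2% for Oracle and 0.7% for MySQL, so relative to how much each name is mentioned, the gap is bigger than the raw "sucks" counts suggest.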

Now for another dimension of data, one that Google isn't terribly good at, but we can at least play with it a bit, and using Google isn't too expensive either! What we are going to look at here is something that is difficult to determine from a document, even when using advanced analysis tools, which is the date and time. Look at a standard article in some newspaper: just look at the top of the page and you see when the article was published. But then the article itself might mention some earlier date that the article really is about, so it is not about the current time at all. This is easy to figure out with the naked eye, but analyzing this programmatically is much more difficult, and it gets even worse with relative dates, like "yesterday" or "in a few days" or so.
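A tiny Python sketch shows why relative dates are awkward. It assumes we already know the publication date, and it only handles phrases that map to an exact number of days; the function name `resolve_relative_date` and the `OFFSETS` table are hypothetical, made up for this illustration, and nowhere near what a real text analysis tool would need.

```python
from datetime import date, timedelta

# Hypothetical lookup table: only phrases with an exact offset in days.
OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}


def resolve_relative_date(phrase, published):
    """Map a relative date phrase to a calendar date, given the publication date.

    Returns None for phrases this sketch cannot pin down, such as "in a few days".
    """
    offset = OFFSETS.get(phrase.strip().lower())
    if offset is None:
        return None
    return published + timedelta(days=offset)


published = date(2013, 8, 22)  # the publication date of this post
print(resolve_relative_date("yesterday", published))      # 2013-08-21
print(resolve_relative_date("in a few days", published))  # None
```

"Yesterday" is easy once the publication date is known, but "in a few days" has no exact answer at all, and that is before we even try to find the phrase in the text and figure out which event it refers to.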

But let's give Google a shot here. I go to Google, and instead of just searching I first click on the "Any time" drop-down, select "Custom range" and select 2003-01-01 to 2003-12-31, and then I search for Oracle and MySQL. Now I see Oracle getting 243 000 hits and MySQL 204 000. Then, to see what the situation is today, I do the same search for "Past year", and this time Oracle gets 94 000 000 results and MySQL only 17 000 000.
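To put those four numbers side by side, here is a small Python sketch that works out MySQL's share of the combined Oracle plus MySQL hits for each period. The counts are the ones quoted above (again, Google's estimates), and the helper name `mysql_share` is made up for this illustration.

```python
# Result counts quoted in the post (Google estimates for each date range).
hits = {
    "2003": {"Oracle": 243_000, "MySQL": 204_000},
    "past year": {"Oracle": 94_000_000, "MySQL": 17_000_000},
}


def mysql_share(period):
    """MySQL's fraction of the combined Oracle + MySQL hits for a period."""
    counts = hits[period]
    return counts["MySQL"] / (counts["Oracle"] + counts["MySQL"])


for period in hits:
    print(f"{period}: MySQL has {mysql_share(period):.1%} of the combined hits")
```

With these counts MySQL goes from a bit under half of the combined hits in 2003 to about 15% for the past year.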

Why are these numbers so much smaller for MySQL now, when the search without a date range gave MySQL more hits than Oracle? I guess it is because the high numbers come from pages that reference MySQL but really have no actual MySQL content. I don't know, I'm just guessing here. Or maybe we are reaching the limit of what Google can do for us here.

See, I have now written a whole blog post on just text analysis, and I have given some simple examples. Think of what can be done with image and video analysis. Or with some more dimensions, such as who wrote the text or shot the video. Right? And if we combine all this, can we agree that we get some interesting insights? And still, we have not really visualized it in some cool way. In my mind, the most interesting thing about all this is that what we get back isn't hard facts, far from the normal Data Warehouse statistics ("We had 25 751 orders last year"). As the result is fuzzy, where we get some insights but far from all, it triggers my curiosity. What query can I ask now that I am armed with this new, fuzzy insight? Yes, Big Data and Big Analysis really are there to trigger us to think up new questions to ask.

I'll do another blog in this series in a few weeks or so!

Cheers
/Karlsson
