Thursday, August 22, 2013

Big Data.. So what? Part 2

Sorry for this delay in providing part 2 of this series, but stuff happened that had really high priority, and in addition I was on vacation. But now I'm back in business!

So, last time I left you with some open thoughts on why Big Data can be useful, but also on why we need new analysis tools and new ways of visualizing data for it to be truly useful. As for analysis, let's have a look at text, which should be simple enough, right? And sometimes it is simple. One useful analysis tool that is often overlooked is Google. Let's give it a shot, just for fun, and think of two fierce competitors, of sorts, that we can compare, say Oracle and MySQL. Oracle is much older, both as a technology and as a company, and in addition owns the MySQL brand these days. On the other hand, the Web is where MySQL has its sweet spot. Just Googling for MySQL and Oracle shows that MySQL seems to be much more discussed: "Oracle" gets 165 000 000 hits, whereas "MySQL" gets 226 000 000. (And no, I haven't turned stupid just because I was on vacation. I realize that there are many sites which don't talk about MySQL at all but have a "powered by MySQL" text or something similar, or where MySQL is used without being shown to the end user. But anyway.)

One issue is that this is not terribly interesting in this case. If I am a small company with a little known brand and I am trying to make my name known, then this is not a bad way of getting some kind of measurement on how I am doing, but in any case, this tells me nothing about what is being said, if it's good or bad, if it's said by some powerful dude, like "Vladimir Putin says that MySQL sucks" for example, or if it's just yours truly: "Swedish Ponytailed Geek says MySQL is actually pretty much OK, but not as OK as MariaDB". You see my point here, we have the data but we must analyze it better, even though this simple analysis is useful in some cases.

So, what kind of analysis do we do here? Text analysis, you say. Maybe we could analyze the text and look for negative or positive sentiments. This is exactly what advanced text analysis tools do, but they aren't foolproof, and, surprisingly, much less so than they used to be. The reason is cyber-language. Analyzing an editorial article in the New York Post is what these technologies rock at, but things such as blogs, Facebook and Twitter are much more difficult.

But I don't give up easily on my tools, so let's, for fun, try to use Google to look for sentiments. Let's search for "Oracle sucks" and "MySQL sucks" and see what we get. We already know that a simple search for MySQL gets many more hits than plain Oracle. So when I add a slightly negative word to my search, will this relationship remain? The answer is no: "Oracle sucks" gets 1 950 000 hits, whereas "MySQL sucks" gets 1 650 000 hits, i.e. noticeably fewer than Oracle. Does this tell us something then? Well yes, to an extent it actually does. Not so much that I would want to bet all my money on it, but combined with some other knowledge, this can turn out to be useful.
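To put the two searches side by side, here is a minimal Python sketch that relates the "sucks" hits to the plain hits for each name. The four numbers are the hit counts quoted above; the `negative_share` helper and the idea of dividing one count by the other are my own illustration, not something Google provides, and phrase hits are not guaranteed to be a subset of the plain hits, so treat the result as a rough indicator only.

```python
# Hit counts as quoted in this post (Google searches, August 2013).
hits = {
    "Oracle": {"total": 165_000_000, "sucks": 1_950_000},
    "MySQL": {"total": 226_000_000, "sucks": 1_650_000},
}


def negative_share(counts):
    """Hypothetical helper: '<name> sucks' hits relative to plain hits."""
    return counts["sucks"] / counts["total"]


shares = {name: negative_share(counts) for name, counts in hits.items()}

for name, share in shares.items():
    print(f"{name}: {share:.2%} of the plain hit count also says 'sucks'")
```

With these numbers Oracle lands at a bit over 1 percent and MySQL at a bit under, which is the same reversal as above, just expressed as a ratio instead of two raw counts.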

Now for another dimension of data, one that Google isn't terribly good at, but we can at least play with it a bit, and using Google isn't too expensive either! What we are going to look at here is something that is difficult to determine from a document, even when using advanced analysis tools, which is the date and time. Look at a standard article in some newspaper: just look at the top of the page and you see when the article was published. But then the article itself might mention some earlier date that the article really is about, so that it is not about the current time at all. This is easy to figure out with the naked eye, but analyzing it programmatically is much more difficult, and it gets even worse with relative dates, like "yesterday" or "in a few days" or so.
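As a small illustration of why relative dates are awkward, here is a Python sketch that resolves a phrase against the publication date of the article. The phrase table and the `resolve_relative_date` function are hypothetical and deliberately tiny; a real text analysis tool would need far more than a lookup table.

```python
from datetime import date, timedelta

# Hypothetical phrase table: days to add to the publication date.
RELATIVE_OFFSETS = {
    "the day before yesterday": -2,
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
}


def resolve_relative_date(phrase, published):
    """Return the date a phrase refers to, or None if it cannot be pinned down."""
    key = phrase.strip().lower()
    if key not in RELATIVE_OFFSETS:
        return None  # e.g. "in a few days" has no exact answer
    return published + timedelta(days=RELATIVE_OFFSETS[key])


published = date(2013, 8, 22)
for phrase in ("yesterday", "tomorrow", "in a few days"):
    print(phrase, "->", resolve_relative_date(phrase, published))
```

Note that the function needs the publication date as input, so you have to find that first, and that a vague phrase like "in a few days" simply comes back as None. That is the fuzzy part.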

But let's give Google a shot here. I go to Google and, instead of just searching, I first click on the "Any time" drop down, select "Custom range" and select 2003-01-01 to 2003-12-31, and then I search for Oracle and MySQL. Now I see Oracle getting 243 000 hits and MySQL 204 000. Then we see what the situation is today and do the same search for "Past year", and this time Oracle gets 94 000 000 results and MySQL only 17 000 000.

Why are these numbers so much smaller for MySQL now? I guess it's because the high overall numbers come from pages that reference MySQL but really have no actual MySQL content. I don't know, I'm just guessing here. Or maybe we are reaching the limit of what Google can do for us here.

See, I have now written a whole blog post on just text analysis, and I have given some simple examples. Think of what can be done with image and video analysis. Or with some more dimensions, such as who wrote the text or shot the video. Right? And if we combine all this, can we agree that we get some interesting insights? And still, we have not really visualized it in some cool way. In my mind, the most interesting thing about all this is that what we get back isn't hard facts, far from the normal Data Warehouse statistics ("We had 25 751 orders last year"), and as the result is fuzzy, giving us some insights but far from all, it triggers my curiosity. What query can I ask now that I am armed with this new, fuzzy insight? Yes, Big Data and Big Analysis really are there to trigger us to think up new questions to ask.

I'll do another blog in this series in a few weeks or so!


Monday, August 5, 2013

Don't let Technophobia kill innovation

What? Me? technophobic? I have the latest iPhone, my office is jam packed with USB gadgets and my car is a Prius, how much more techno friendly can one get?

That is all fine: fun technologies that we play with just for the fun of it, and natural, cool and useful evolutions, come easily to most of us. But can you honestly say (I can't) that you always look at the promise of a new technology, and that you have never judged a new technology only from its first, shaky implementation instead of from the obvious advantages it will bring once it has developed into something useful?

When I was in my early teens (which occurred around the time just after the Mayflower had arrived in New England) my family moved into our first own house. My parents were running a restaurant at the time (they ran one or the other all through my childhood) and my mother had seen most of the weirdo Heath Robinson designed (TM) commercial and domestic kitchen appliances, and when we first entered our new home and mum looked in the kitchen and realized there was a dishwasher in there, her first reaction was "Well, I'm never going to use that one". One month later, the dishwasher was working more or less daily, and my mum never did any dishes by hand again.

Many years later, I, her only son, having spent the better part of my life playing with SQL based relational databases (and looking at some of the code in them, I suspect that Heath Robinson is still around, now as a software engineer), started to look at NoSQL databases, and my reaction was largely that of my mum when she saw the dishwasher: "Nah, I'm not going to use anything like that. Eventual consistency? What kind of silly idea is that?"

Yes, I was wrong, but I am still convinced that NoSQL databases (yes, I know NoSQL is a bad term, but this is Monday morning and I don't have enough energy to think up something better) will not replace SQL based systems. What I do think is that we need both.

Just as I think my mum got it wrong twice: yes, the dishwasher really is a good idea, but some things are better handled without it. The result is that there is an abundant lack of sharp knives in my mum's house (as a dishwasher is a really effective knife-unsharpener). Myself, I use a dishwasher, but knives and beer glasses are still, to this day, washed by hand by yours truly (I don't want any leftover dishwasher enzymes in my beer, as they are used to kill bacteria, including the really tasty bacteria that give beer its distinctive taste).

Too many words have so far been used to say this: the world needs both SQL and NoSQL databases working together, serving different purposes and applications. As for Eventual Consistency, I still think this is bogus; just say what it is, no consistency, and live with it. MongoDB, Cassandra and LevelDB are still very useful technologies, as is MySQL. And in many cases you need ACID properties and atomic transactions and all that, but in many other cases this is gross overkill.

Look at something like Virtualization. In that case, I think I looked at it in the right way, looking at the potential of the new features this brought, and not ignoring, but thinking less about the issues with the first implementations (slow I/O, slow networking, complexity of use, complexity of installation etc) and looking at what it could do in terms of cost reduction, effective systems management etc.

Back then, when I was a big Virtualization supporter, many opposed me by pointing at the obvious issue with databases (which is the field where I work, if this wasn't already obvious), which was that I/O was slow and unreliable. Yes it was, but that can be fixed. This is not a flaw in the technology per se, but in the specific implementation and the limitations of the underlying technology at the time. Not everyone needs the highest of high performance; many can do with less. And some can easily scale out to more machines. All in all, many can benefit from Virtualization, maybe more than you think. These days, I think no one doubts that Virtualization is useful.

This is not to say I am always right, but I am not so technophobic that everything that is not something I already know is something that sucks. Also, we should be careful when comparing things. We often compare based on attributes of existing technologies and tend to forget that new technologies might well have virtues of their own (which we do not use for comparison as we are unfamiliar with these features as they don't exist in the technologies we currently use).

I think one technology that is now in the state of being seen as inferior is Cloud technology. We look at a cloud by taking something we run on some hard iron in-house, throwing it at Amazon and looking at the result. Maybe we should build our applications and infrastructure differently to support clouds, and maybe, if we do that, a Cloud might well be more cost-effective, scalable and performant than the stuff we run in our in-house data center.

So don't let new innovative technologies die just because they lack a 9600 baud modem or a serial port. Or because they are no good for washing beer glasses (even if that is a very important dishwasher feature).


Big Data.. So what? Part 1

This is the first blog post in a series where I hope to rise a bit above the technical stuff and instead focus on how we can put Big Data to effective use. I ran a SkySQL Webinar on the subject recently that you might also want to watch, and a recording is available here:

Yes, so what? Why do you need or want all that data? All data you need from your customers you have in your Data Warehouse, and all data you need on the market you are in, you can get from some analyst? Right?

Well, yes, that is one source of data, but there is more to it than that. The deal with Data is that once you have enough of it, you can start to see things you haven't seen before. Trend analysis is only relevant when you have enough data, and the more you have, the more accurate it gets.

Big Data is different from the data you already have in that it is Bigger, hence the name, but not only that. Big Data also contains much more diverse types of data, such as images, sound, metadata and video. Also, Big Data has much more new data coming in and is constantly shifting. Research says that each day some 25 quintillion bytes of data are created. That is 25 000 000 000 000 000 000 bytes, if you ask (which is some 25 000 petabytes or 25 000 000 terabytes). And yes, that is every day. (And yes, this is using 1000 bytes per kB, not 1024.)
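If you want to check the arithmetic, here is a short Python sketch of the conversion. The daily byte count is the figure quoted above, taken as given; the variable names are mine, and the last lines show what the same volume looks like if you insist on 1024-based units.

```python
# Daily data volume as quoted in this post, taken as given.
BYTES_PER_DAY = 25 * 10**18  # 25 quintillion bytes

# Decimal (1000-based) units, as used in the text.
TERABYTE = 10**12
PETABYTE = 10**15

petabytes = BYTES_PER_DAY // PETABYTE
terabytes = BYTES_PER_DAY // TERABYTE
print(f"{petabytes:,} PB or {terabytes:,} TB per day")

# The same volume in binary (1024-based) units, for comparison.
PEBIBYTE = 2**50
pebibytes = BYTES_PER_DAY / PEBIBYTE
print(f"{pebibytes:,.0f} PiB per day")
```

The decimal figures match the ones in the text, and the binary figure comes out roughly 11 percent lower, which is why it matters which kind of kilobyte you count with.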

As I already said, what is interesting with such huge amounts of data is that once the volumes are high enough, you can infer things that you couldn't with smaller or more focused data. You can infer changes that you couldn't before and in some sense make qualified predictions on what will happen in the world. Does this sound like rocket science? Well, it shouldn't, and the truth is that we have been doing this in at least one field for a very long time, since the 1950s or so, and this was one of the first applications for computers. And no, I'm not talking about Angry Birds here.

What I am talking about is weather forecasting. Using knowledge about how winds blow, temperatures, geographies and statistics, we can reasonably well predict how the weather will be. As we all know, these forecasts aren't always right, but even when they go wrong, we get to know why they went wrong. The way these predictions work is to combine large amounts of data with experience and hard facts on how the weather behaves, and the data isn't directly related to the area where we try to predict the weather either. We can do very little to influence the weather, except of course plan a picnic which is sure to create thunderstorms.

In the case of, say, sales of some consumer product, we are actually able to influence this more than we can influence the weather. And if we then add our knowledge of our market and its dynamics and combine that with truckloads of related and semi-related data, why shouldn't we be able to make some predictions? Not in the sense of knowing exactly what will happen in the future, but at least having an idea of what is the most likely thing to happen, and an idea of how likely it is. Which is how weather forecasts work.

But this isn't all there is to it. Let's pop back to weather forecasting for a second. The analysis done on weather systems is a lot more complex than that done in most data warehouses; there is more to it than some summaries and averages. Also, the way this is presented, using a map with an overlay of symbols (a Sun, a Cloud, some poor soul planning a picnic), is different from how we are used to seeing trend data in our data warehouses.

  • We need ways of dealing with large amounts of fast moving data - Big Data
  • We need new, better and more specialized analysis - Big Analytics
  • We need new ways to view data - Visualizations

I'll be back soon with something more specific on this subject, so don't touch that dial!