Sorry for this delay in providing part 2 of this series, but stuff happened that had really high priority, and in addition I was on vacation. But now I'm back in business!
So, last time I left you with some open thoughts on why Big Data can be useful, but also that we need new analysis tools as well as new ways of visualizing data for this to be truly useful. As for analysis, let's have a look at text, which should be simple enough, right? And sometimes it is simple. One useful analysis tool that is often overlooked is Google. Let's give it a shot, just for fun, and pick two fierce competitors, sort of, that we can compare, say Oracle and MySQL. Oracle is much older, both as a technology and as a company, and in addition owns the MySQL brand these days. But on the other hand, the Web is where MySQL has its sweet spot. Just Googling for MySQL and Oracle shows that MySQL seems to be much more discussed: "Oracle" gets 165 000 000 hits, whereas "MySQL" gets 226 000 000. (And no, I haven't turned stupid just because I was on vacation, so I realize that there are many sites which don't talk about MySQL at all but have a "powered by MySQL" text or something, where it might not even be shown to the end user, but anyway.)
One issue is that this is not terribly interesting in this case. If I am a small company with a little-known brand and I am trying to make my name known, then this is not a bad way of getting some kind of measurement of how I am doing. But in any case, this tells me nothing about what is being said, whether it's good or bad, or who is saying it: some powerful dude, like "Vladimir Putin says that MySQL sucks" for example, or just yours truly: "Swedish Ponytailed Geek says MySQL is actually pretty much OK, but not as OK as MariaDB". You see my point here: we have the data but we must analyze it better, even though this simple analysis is useful in some cases.
So, what kind of analysis do we do here? Text analysis, you say. Maybe we could analyze the text and look for negative or positive sentiments. This is exactly what advanced text analysis tools do, but they aren't foolproof, and, surprisingly, much less so than they used to be. The reason is cyber-language. Analyzing an editorial article in the New York Post is what these technologies rock at, but things such as blogs, Facebook and Twitter are much more difficult.
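To make the cyber-language problem a bit more concrete, here is a minimal sketch of the most naive kind of sentiment scoring: counting words from a positive and a negative list. The word lists and the function name are my own made-up assumptions for illustration; real text analysis tools are far more sophisticated than this.

```python
import re

# Tiny hand-made word lists. These are illustrative assumptions only,
# not a real sentiment lexicon.
POSITIVE = {"good", "great", "ok", "rocks", "love"}
NEGATIVE = {"bad", "sucks", "slow", "hate", "broken"}


def naive_sentiment(text: str) -> int:
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)


if __name__ == "__main__":
    for sample in (
        "Vladimir Putin says that MySQL sucks",
        "MySQL is actually pretty much OK",
        "mysql ftw, oracle is meh imo",  # slang the word lists know nothing about
    ):
        print(naive_sentiment(sample), sample)
```

The first two samples get a clear negative and positive score, but the slang-heavy third one scores as neutral simply because none of its words are in the lists, which is roughly why blogs and tweets are harder than editorials.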
But I don't give up easily on my tools, so let's, for fun, try to use Google to look for sentiments. Let's search for "Oracle sucks" and "MySQL sucks" and see what we get. We already know that a simple search for MySQL has many more hits than plain Oracle. So when I add a slightly negative word to my search, will this relationship remain? The answer is no: "Oracle sucks" gets 1 950 000 hits, whereas "MySQL sucks" gets 1 650 000 hits, i.e. noticeably fewer than Oracle, even though MySQL has more hits overall. Does this tell us something then? Well yes, to an extent it actually does. Not so much that I would want to bet all my money on it, but combined with some other knowledge, this can turn out to be useful.
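The raw hit counts get easier to compare if we normalize them, i.e. look at how big a share of all hits for a name also carries the negative word. Here is a small sketch using only the numbers from my searches above; the dictionary layout and the function name are just something I made up for this example.

```python
# Hit counts from the Google searches described in the text.
hits = {
    "Oracle": {"total": 165_000_000, "sucks": 1_950_000},
    "MySQL": {"total": 226_000_000, "sucks": 1_650_000},
}


def negative_share(name: str) -> float:
    """Fraction of all hits for `name` that also matched '<name> sucks'."""
    counts = hits[name]
    return counts["sucks"] / counts["total"]


if __name__ == "__main__":
    for name in hits:
        print(f"{name}: {negative_share(name):.2%} of hits are 'sucks' hits")
```

With these numbers Oracle ends up at roughly 1.18 % and MySQL at roughly 0.73 %, so the difference is bigger than the raw "sucks" counts suggest. Google hit counts are rough estimates though, so treat this as a hint, not a measurement.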
For another dimension of data, let's look at one that Google isn't terribly good at, but we can at least play with it a bit, and using Google isn't too expensive either! What we are going to look at is something that is difficult to determine from a document, even when using advanced analysis tools: the date and time. Look at a standard article in some newspaper: just look at the top of the page and you see when the article was published. But then the article itself might mention some earlier date that the article really is about, so that it is not about the current time at all. This is easy to figure out with the naked eye, but analyzing it programmatically is much more difficult, and it gets even worse with relative dates, like "yesterday" or "in a few days" or so.
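As a small illustration of the relative date problem, here is a sketch that resolves a few relative expressions against the publication date of the article. The phrase table and the function name are hypothetical, and the point is mostly what it cannot do: "yesterday" is easy, but a vague phrase like "in a few days" has no single correct answer.

```python
from datetime import date, timedelta
from typing import Optional

# Phrases with an unambiguous offset in days. An assumption for this sketch;
# a real tool would need a far larger and language-aware table.
RELATIVE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def resolve_relative_date(phrase: str, published: date) -> Optional[date]:
    """Turn a relative phrase into a calendar date, or None if it is too vague."""
    offset = RELATIVE_OFFSETS.get(phrase.strip().lower())
    if offset is None:
        return None
    return published + timedelta(days=offset)


if __name__ == "__main__":
    published = date(2013, 8, 22)
    for phrase in ("yesterday", "in a few days"):
        print(phrase, "->", resolve_relative_date(phrase, published))
```

Note that this only works at all if we know the publication date, which is exactly the piece of metadata that is so easy to read off the page by eye and so easy to lose in an automated analysis.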
But let's give Google a shot here. I go to Google and, instead of just searching, I first click on the "Any time" drop-down, select "Custom range" and select 2003-01-01 to 2003-12-31, and then I search for Oracle and MySQL. Now I see Oracle getting 243 000 hits and MySQL 204 000. Then, to see what the situation is today, I do the same search for "Past year", and this time Oracle gets 94 000 000 results and MySQL only 17 000 000.
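To compare the two periods, it helps to turn the hit counts into shares, since the total number of pages differs wildly between 2003 and the past year. A quick sketch with the numbers above (the function and variable names are made up for this example):

```python
def mysql_share(oracle_hits: int, mysql_hits: int) -> float:
    """MySQL's fraction of the combined Oracle + MySQL hit count."""
    return mysql_hits / (oracle_hits + mysql_hits)


# Hit counts from the date-restricted searches described in the text.
share_2003 = mysql_share(oracle_hits=243_000, mysql_hits=204_000)
share_past_year = mysql_share(oracle_hits=94_000_000, mysql_hits=17_000_000)

if __name__ == "__main__":
    print(f"2003:      MySQL share {share_2003:.1%}")
    print(f"Past year: MySQL share {share_past_year:.1%}")
```

So MySQL goes from a bit under half of the combined hits in 2003 to somewhere around 15 % for the past year, at least according to these admittedly rough numbers.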
Why are these numbers so much smaller for MySQL now? I guess it is because the high numbers in the plain search come from pages that reference MySQL but really have no actual MySQL content. I don't know, I'm just guessing here. Or maybe we are reaching the limit of what Google can do for us here.
See, I have now written a whole blog post on just text analysis, and I have only given some simple examples. Think of what can be done with image and video analysis. Or with some more dimensions, such as who wrote the text or shot the video. Right? And if we combine all this, can we agree that we get some interesting insights? And still, we have not really visualized it in some cool way. In my mind, the most interesting thing about all this is that what we get back isn't hard facts, far from the normal Data Warehouse statistics ("We had 25 751 orders last year"). And as the result is fuzzy, where we get some insights but far from all, it triggers my curiosity. What query can I ask now that I am armed with this new, fuzzy, insight? Yes, Big Data and Big Analysis really are there to trigger us to think up new questions to ask.
I'll do another blog in this series in a few weeks or so!
Cheers
/Karlsson
I am Anders Karlsson, and I have been working in the RDBMS industry for many, possibly too many, years. In this blog, I write about my thoughts on RDBMS technology, happenings and industry, and also on any wild ideas around that I might think up after a few beers.
Thursday, August 22, 2013
Monday, August 5, 2013
Big Data.. So what? Part 1
This is the first blog post in a series where I hope to rise a bit above the technical stuff and instead focus on how we can put Big Data to effective use. I ran a SkySQL Webinar on the subject recently that you might also want to watch, and a recording is available here: http://bit.ly/17TTQnJ
Yes, so what? Why do you need or want all that data? All data you need from your customers you have in your Data Warehouse, and all data you need on the market you are in, you can get from some analyst? Right?
Well, yes, that is one source of data, but there is more to it than that. The deal with Data is that once you have enough of it, you can start to see things you haven't seen before. Trend analysis is only relevant when you have enough data, and the more you have, the more accurate it gets.
Big Data is different from the data you already have in that it is Bigger, hence the name, but not only that. Big Data also contains much more diverse types of data, such as images, sound, metadata and video. Also, Big Data has much more new data coming in and is constantly shifting. Research says that each day some 25 quintillion bytes of data are created. That is 25 000 000 000 000 000 000 bytes, if you ask (which is some 25 000 petabytes or 25 000 000 terabytes). And yes, that is every day (and yes, this is using 1000 bytes per kB, not 1024).
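If you want to check the unit arithmetic yourself, here is a small sketch. It takes the 25 quintillion bytes per day figure from the text as given and converts it using decimal units (1000 bytes per kB), and then, for comparison, using binary units (1024 bytes per KiB). The constant names are just my own.

```python
BYTES_PER_DAY = 25 * 10**18  # 25 quintillion bytes, the figure used in the text

# Decimal (SI) units: each step is a factor of 1000.
terabytes = BYTES_PER_DAY // 10**12
petabytes = BYTES_PER_DAY // 10**15

# Binary (IEC) units: each step is a factor of 1024, so 1 PiB = 2**50 bytes.
pebibytes = BYTES_PER_DAY / 2**50

if __name__ == "__main__":
    print(f"{terabytes:,} TB")      # decimal terabytes
    print(f"{petabytes:,} PB")      # decimal petabytes
    print(f"{pebibytes:,.0f} PiB")  # binary pebibytes, for comparison
```

With 1024-based units the same amount comes out at a bit over 22 000 pebibytes, so which convention you use makes a difference of more than 10 % at this scale.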
As I already said, what is interesting with such huge amounts of data is that once the volumes are high enough, you can infer things that you couldn't with smaller or more focused data. You can detect changes that you couldn't before and in some sense make qualified predictions on what will happen in the world. Does this sound like rocket science? Well, it shouldn't, and the truth is that we have been doing this in at least one field for a very long time, since the 1950s or so, and this was one of the first applications for computers. And no, I'm not talking about Angry Birds here.
What I am talking about is weather forecasting. Using knowledge about how winds blow, temperatures, geographies and statistics, we can reasonably well predict how the weather will be. As we all know, these forecasts aren't always right, but even when they go wrong, we get to know why they went wrong. The way these predictions work is to combine large amounts of data with experience and hard facts on how the weather behaves, and the data isn't directly related to the area where we try to predict the weather either. We can do very little to influence the weather, except of course plan a picnic which is sure to create thunderstorms.
In the case of, say, sales of some consumer product, we are actually able to influence this more than we can influence the weather. And if we then add our knowledge of our market and the dynamics of it and combine that with truckloads of related and semi-related data, why shouldn't we be able to make some predictions? Not in the sense of knowing exactly what will happen in the future, but at least having an idea of what is the most likely thing to happen and of how likely it is. Which is how weather forecasts work.
But this isn't all there is to it. Let's pop back to weather forecasting for a second. The analysis done on weather systems is a lot more complex than that done in most data warehouses; there is more to this than some summaries and averages. Also, the way this is presented, using a map with an overlay of symbols (a Sun, a Cloud, some poor soul planning a picnic), is different from how we are used to seeing trend data in our data warehouses.
Conclusion:
- We need ways of dealing with large amounts of fast-moving data - Big Data
- We need new, better and more specialized analysis - Big Analytics
- We need new ways to view data - Visualizations
/Karlsson
Monday, January 21, 2013
Talking at the SkySQL Roadshow in Stockholm
SkySQL Roadshow is coming to Stockholm on Feb 7, come by and meet us. I'll be ending the day with a talk on Big Data, which will be a more generic Big Data talk with some MySQL relevance, but with the focus on Big Data in general.
I haven't been blogging much recently, but there are reasons for that. I am since Dec 1 the proud father of twins, a little boy and a little girl. I have yet to teach them to write proper SQL, they have particular issues with subqueries, but we'll get there. In order to create the usual mess of things and to make sure things are at the brink of running out of control, we decided to renovate our flat in the middle of all this. But I'll get there, and once we have a new kitchen installed, I'll do some more blogging, as I have some things piled up to write about.
/Karlsson