We have a lot more storage space available these days, and a lot more data to work with, so Big Data and Big Analytics is getting much more mainstream now. And there are conclusions and insights you can get from that data, any data more or less, but Web data in particular brings a new dimension when combined with more traditional, domain specific data. But this data is also mostly in the shape of plain text, like your blogs, twitters, news articles and other web content. And this in turn means that to combine your organized structures sales data for 20 years with Web data, the Web data first needs to be analyzed.
Web data also brings in a new difficulty: the data is big, and it's not organized at it's core, so you can not easily aggregate or something like that to save space (and why would you want to do that?). It's not until after you have analyzed it that you know what data is interesting and what is not. And to be frank (but I am not, I'm Anders), not even then can you start to aggregate data or throw data away that isn't interesting. And in my mind, this is an mistake that has been done in all sorts of analytics, even with smaller amounts of data.
When it comes to analytics, in my mind "If you think you have all the right answers, you haven't asked all the right questions". This is an important point, analytics is a recurring activity, and the more questions you get answered, the more questions you should get. And with this in mind, how can you know what to aggregate? In particular when it comes to web content?
So, can we live with Web data not being aggregated and how do we do it? What database can support that? Oracle? MySQL? MongoDB? Vertica? And the answer is, in the same way as with analytics, you will not know when you start analyzing, and once you have started doing that, you will be even more in doubt! Which technology supports all the aspects you might need to look at? And the keyword is might!
So, how can we solve this? And my answer is: By using the right tool for the job at hand, and be prepared to combine different tools! Postgres and Oracle are great for temporal analysis, for GIS we have Oracle, MySQL and PostGIS. For handling large amounts of data with good scalability and keeping the cost down, you might want a key-value store like MongoDB or DynamoDB. To search data you might head for Sphinx or Lucene. Etc. etc.
As an example, I'd might want to look at a key-value store for my raw Web data, holding some key for easy lookup. An RDBMS for the attributes of this data. Sphinx for searching it. Sphinx and Lucene are much better tools than your average RDBMS, be it MySQL or Oracle or Whatever, and RDBMS search is different than a text search in web-data!
So the most important aspect to look at, if you ask me, is to choose technologies that can easily be combined and where different aspects of data can be served by different technologies as appropriate. And be prepared to add, remove and change technologies as you go along!
/Karlsson
No comments:
Post a Comment