Big data has exploded over the past few years. It is expected that by 2020 every human will create the equivalent of 1.7 megabytes of data every second. More than 90% of all the data that exists has been generated since 2015.
The data comes from numerous sources: social media, financial transactions, governments, and sensors. With the rise of the internet of things (IoT), the amount of data collected is expected to increase exponentially. It is believed that by 2030 there will be more than 125 billion IoT devices in service, representing a 360% increase from 2017.
However, problems arise when data is collected from multiple sources. It becomes unstructured, messy, and disparate.
“I think this is a categorically important problem or challenge to AI,” Min Wanli, chief machine intelligence scientist at Alibaba Cloud, said at TechCrunch Hangzhou. “Anytime, if you get a new data source, you have to reconcile or do a data ‘massage’ in real time or either offline.”
While this may be easy for offline data, it becomes much harder when dealing with data in real time. Min said that a basic approach is needed when dealing with this data. “This is not from the technology’s side; rather it is from the application side. And first, you’ve got to identify your use case scenario, your vertical application.”
Unstructured data is expected to make up 80% of the world’s 163 zettabytes by 2025. This data, which is not stored in a fixed-record-length format, cannot be easily read by machines. Examples include documents, social media information, pictures, and video. Because so much of it exists, it offers enormous potential for training AI and for machine learning applications.
Despite the challenges unstructured data poses for artificial intelligence, AI has also been applied to solving this very problem. According to IBM, Watson “takes huge amounts of unstructured data, understands it, and uses that data to lay out hypotheses.”
Apart from the difficulties created by unstructured data, Min said that too much data could also be troublesome. “Try to identify the minimum sufficient data input,” he said. “Providing minimum sufficient data is productive and is beneficial.” He said that, from Alibaba’s perspective, not being overburdened with data speeds up its own work. He also said it helps the companies that use its services.
“Their entire implementation process to them is beneficial because they have less burden, less pressure on their side,” said Min.