Google’s search engine for datasets upgraded for improved data sourcing

Google’s search engine for datasets, which the company ‘smartly’ named Dataset Search, has come out of beta. It now includes new tools for filtering searches and provides access to almost 25 million datasets.

Unifying the Fragmented World of Online Data

Dataset Search rolled out in September 2018, with Google hoping to unify the fragmented world of online, open-access data. Although many institutions such as universities, labs, and governments post data online, it is often hard to find through traditional search. However, by adding open-source metadata tags to their webpages, these groups can have their data indexed by Dataset Search, which now covers a wide range of information, from skiing injuries to penguin populations to volcano eruptions.
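As a rough illustration of what such metadata tags can look like, here is a minimal Python sketch that builds a dataset description and wraps it in a script tag for embedding in a webpage. It assumes the tags follow the schema.org Dataset vocabulary expressed as JSON-LD, which the article itself does not spell out. The function names, the sample dataset, and the example.org URLs are all hypothetical.

```python
import json


def build_dataset_jsonld(name, description, url, keywords, license_url):
    """Return a schema.org-style Dataset description as a JSON-LD string.

    Assumption: the property names below (name, description, url,
    keywords, license) come from the schema.org Dataset vocabulary.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "keywords": keywords,
        "license": license_url,
    }
    return json.dumps(record, indent=2)


def wrap_in_script_tag(jsonld):
    """Wrap a JSON-LD string in the script tag used to embed it in HTML."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


# Hypothetical dataset, used only to show the shape of the markup.
example = build_dataset_jsonld(
    name="Penguin population counts",
    description="Yearly counts of penguin colonies at sample sites.",
    url="https://example.org/datasets/penguins",
    keywords=["penguins", "population", "ecology"],
    license_url="https://example.org/licenses/open",
)

print(wrap_in_script_tag(example))
```

A publisher would paste the printed snippet into the page that hosts the dataset. Which fields a given search engine actually requires is not covered here and should be checked against its own documentation.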

Google didn’t share specific usage figures for the search engine, but it said “hundreds of thousands of users” have tried Dataset Search since its launch and that the overall feedback from the scientific community has been positive.

Natasha Noy, a research scientist at Google AI who helped create the tool, tells The Verge that “most [data] repositories have been very responsive” and that the launch of the engine means established scientific institutions are now taking “publishing metadata more seriously.”

Making Scientific Research More Accessible

“For example, [the prestigious scientific journal] Nature is changing its policies to require data sharing with proper metadata,” Noy explains, highlighting a change that should make the data underpinning top-level scientific research more accessible in the future.

The new features in Dataset Search include the ability to filter data by type (tables, text, images, etc.), by whether it’s free to use, and by the geographic areas it covers. The engine is also now available on mobile and offers expanded dataset descriptions.

Google says the corpus covered by the search engine (almost 25 million datasets) is only a “fraction of datasets on the web,” but a “significant” one all the same. The largest topics indexed are biology, geosciences, and agriculture, and the most common queries include “education,” “weather,” “cancer,” “crime,” “soccer,” and “dogs.” The US leads the world in open government datasets, publishing more than 2 million online.

Noy declined to comment on future plans for Dataset Search, but she said the team is considering several features they hope will be useful, including “understanding how datasets are cited and reused” and “helping users explore datasets in Dataset Search when they don’t necessarily know what they are looking for.”

“And, of course, continuing to expand the corpus,” says Noy. There’s always more data out there.

Hello, I’m Anna Yeo. If you like my news coverage, please drop a good word in my inbox. I’m a journalist by profession and have taken part in major reporting assignments across the globe. I like to write crisp and factual news. I have completed my master’s degree in journalism. Feel free to contact me at [email protected]
