Today, we’re excited to open source Scalding, a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop’s MapReduce layer. Scalding is comprised of two main components:
Hadoop, of course, is a distributed system for dealing with large data sets. Within Twitter, we use Scalding to query large data sets like Tweets on HDFS. For example, if we wanted to find out the number of times that each URL is tweeted in a given day, we’d create a Scalding query:
StatusSource()
.flatMapTo('created_date, 'url) { s =>
for( url <- urls(s.getText))
yield (RichDate(s.getCreatedAt).toString(DATE_WITH_DASH), url)
}
.groupBy(‘created_date, ‘url) {
_.size(‘urlCnt) //Count the number of appearences of the URL
}
.write(Tsv(args(“output”)))
In addition, more complex algorithms can be implemented such as a toy page-rank implementation included within the examples. In comparison to languages such as Apache Pig that separate the query language from the user defined functionality, with Scalding everything is integrated into one language. In most cases, one file will describe your job.
If you have any questions, feel free to file any issues on Github or follow the project on Twitter via @scalding.
Did someone say … cookies?
X and its partners use cookies to provide you with a better, safer and
faster service and to support our business. Some cookies are necessary to use
our services, improve our services, and make sure they work properly.
Show more about your choices.