
Airline dataset
To evaluate machine learning algorithms on non-stationary, real-world streaming problems, I prepared a dataset from the data of the Data Expo competition (2009). The original data contain flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008: nearly 120 million records, occupying 11.5 GB. I cleaned the data and sorted the records by arrival/departure date (year, month, and day) and time of flight. The final dataset has around 116 million records and occupies 5.76 GB.
The .names file, which describes the data, can be downloaded here. There are 13 attributes, each in a separate column: Year, Month, Day of Month, Day of Week, CRS Departure Time, CRS Arrival Time, Unique Carrier, Flight Number, Actual Elapsed Time, Origin, Destination, Distance, and Diverted. Details on their meaning can be found in Data Expo (2009). The target variable is the Arrival Delay, given in seconds.
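As an illustration of how a record could be split into its 13 attributes and the target, here is a minimal Python sketch. It assumes that the files are comma-separated, that the columns appear in the order listed above, and that the Arrival Delay is the last column; none of this is confirmed here, so check the .names file first. The function name `iter_records` is hypothetical.

```python
import csv
from typing import Dict, Iterable, Iterator, Tuple

# Attribute names as listed in the text. The column order and the position of
# the target (assumed to be the last column) are assumptions; check .names.
ATTRIBUTES = [
    "Year", "Month", "Day of Month", "Day of Week",
    "CRS Departure Time", "CRS Arrival Time", "Unique Carrier",
    "Flight Number", "Actual Elapsed Time", "Origin", "Destination",
    "Distance", "Diverted",
]


def iter_records(lines: Iterable[str]) -> Iterator[Tuple[Dict[str, str], float]]:
    """Yield (attributes, arrival_delay) pairs from comma-separated lines.

    Rows with an unexpected number of fields, or with a non-numeric target,
    are skipped rather than raising an error.
    """
    for row in csv.reader(lines):
        if len(row) != len(ATTRIBUTES) + 1:
            continue  # blank or malformed row
        *values, target = row
        try:
            delay = float(target)
        except ValueError:
            continue  # missing or non-numeric target
        yield dict(zip(ATTRIBUTES, values)), delay
```

Because the function is a generator, it can be fed an open file object and consumed one record at a time, which matches the streaming setting: the full dataset never has to fit in memory.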

The dataset is divided into separate files, each corresponding to one year of records:
1988; 1989; 1990; 1991; 1992; 1993; 1994; 1995; 1996; 1997; 1998; 1999; 2000; 2001; 2002; 2003; 2004; 2005; 2006; 2007; 2008.
Concatenate these files to get the complete dataset, or download it here.
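Concatenating the yearly files in chronological order preserves the time ordering of the stream. A minimal Python sketch follows; the file names (`1988.csv`, ..., `2008.csv`) and the helper names `yearly_paths` and `concatenate_files` are hypothetical, so adjust them to the names of the downloaded files. The sketch also assumes that the files have no header row and that each one ends with a newline; otherwise the last record of one year would run into the first record of the next.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def yearly_paths(directory: Path, first: int = 1988, last: int = 2008,
                 suffix: str = ".csv") -> List[Path]:
    """Build the list of per-year file paths in chronological order.

    The '<year><suffix>' naming scheme is an assumption for illustration.
    """
    return [directory / f"{year}{suffix}" for year in range(first, last + 1)]


def concatenate_files(parts: Iterable[Path], output: Path) -> None:
    """Append each file in `parts` to `output`, in the order given.

    Files are copied as raw bytes in chunks, so memory use stays small even
    though the complete dataset is several gigabytes.
    """
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
```

For example, `concatenate_files(yearly_paths(Path("airline")), Path("airline/all.csv"))` would write the 21 yearly files into a single file, assuming they are stored in a directory named `airline`.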


Public small- to medium-size datasets for regression
The following 10 publicly available datasets can be used to evaluate machine learning algorithms on regression tasks:

Abalone and the corresponding .names file;
Ailerons and the corresponding .names file;
Cal Housing and the corresponding .names file;
Elevators and the corresponding .names file;
House 8L and the corresponding .names file;
House 16H and the corresponding .names file;
Mv Delve and the corresponding .names file;
Pol and the corresponding .names file;
Wind and the corresponding .names file;
Winequality and the corresponding .names file.