I am the data modeler for an organization which lends to small businesses. In my experience "big data" is in the eye of the beholder, and it's not all about how many gigabytes of data you work with, or how wide or how long it is. The challenges are the same: how to use the data in relevant ways to further organizational goals. In my case the data isn't particularly long in terms of number of rows, but it is exceptionally wide in terms of potential variables. It's enough data that I have to spend a reasonable amount of time thinking about the most efficient way to model it (statistically) and mine it. The issues are similar to other data-oriented jobs I've had: how to determine which variables are relevant, how to clean and transform the data... and ultimately how to turn a big pile of data into a model which effectively predicts the likelihood of charge-off if a loan were approved. Scintillating stuff, but obscenely difficult. Of course, it's harder still because I'm the only modeler and am fairly inexperienced. My last experience building predictive models was a couple of classes in college... which was also my last experience using R (which I prefer to SAS).
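For concreteness, here is a minimal sketch in R of the kind of model I mean: a logistic regression for the probability of charge-off. The file name, data frame, and column names (loans.csv, charged_off, annual_revenue, years_in_business, credit_score) are all made up for illustration; the real variables are different and far more numerous.

    # Hypothetical loan-level data with a 0/1 charge-off indicator.
    loans <- read.csv("loans.csv")

    # Logistic regression: probability of charge-off given a few applicant traits.
    fit <- glm(charged_off ~ annual_revenue + years_in_business + credit_score,
               data = loans, family = binomial)
    summary(fit)

    # Predicted probability of charge-off for each application.
    loans$p_charge_off <- predict(fit, type = "response")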
To answer your implied question, I'd recommend picking up real-world data of ANY size and playing with it. Build statistical models (predictive or otherwise), apply supervised and unsupervised machine learning methods to it, but above all develop a foundation of experience working with real-world data. In my college classes we used "canned" data sets which were already cleaned, validated, organized, and so forth. That made it unrealistically easy to model. In the real world, just working with the data effectively is a hard-won skill. So from the get-go you need to learn how to explore data, visualize it, interpret plots and statistics, clean/transform/normalize it, formulate a question your data can answer, and apply the relevant methods in pursuit of the answers you seek. Once you have the fundamentals down, the size of the data is immaterial; it only requires you to put additional thought into what you can achieve computationally (for instance, how to determine which of 150 candidate variables are statistically relevant).
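To make that loop concrete, here's a rough sketch in R, assuming a hypothetical data frame dat with a 0/1 outcome charged_off and a pile of numeric candidate columns (credit_score is also made up). The crude univariate p-value screen at the end is only one way to build a starting shortlist, not a substitute for proper model selection.

    # Explore and visualize.
    str(dat)
    summary(dat)
    hist(dat$credit_score)  # e.g., distribution of one hypothetical candidate variable

    # Clean/transform: drop rows with a missing outcome, standardize numeric predictors.
    dat <- dat[!is.na(dat$charged_off), ]
    candidates <- setdiff(names(dat)[sapply(dat, is.numeric)], "charged_off")
    dat[candidates] <- scale(dat[candidates])

    # Crude screen: one single-variable logistic model per candidate,
    # keeping the smallest p-values as a shortlist for closer inspection.
    pvals <- sapply(candidates, function(v) {
      f <- glm(reformulate(v, "charged_off"), data = dat, family = binomial)
      coef(summary(f))[2, "Pr(>|z|)"]
    })
    shortlist <- names(sort(pvals))[1:20]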