Quick start
getML is an innovative tool for the end-to-end automation of data science projects. It covers everything from convenient data loading procedures to the deployment of trained models.
Most notably, getML includes advanced algorithms for automated feature engineering (feature learning) on relational data and time series. Feature engineering on relational data is defined as the creation of a flat table by merging and aggregating data. It is sometimes also referred to as data wrangling. Feature engineering is necessary if your data is distributed over more than one data table.
Automated feature engineering
- Saves up to 90% of the time spent on a data science project
- Increases the prediction accuracy over manual feature engineering
Andrew Ng, Professor at Stanford University and Co-founder of Google Brain described manual feature engineering as follows:
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.
The main purpose of getML is to automate this "difficult, time-consuming" process as much as possible.
getML comes with a high-performance Engine written in C++ and an intuitive Python API. Completing a data science project with getML consists of eight simple steps.
1. Launch the Engine
import getml
getml.engine.launch()
getml.engine.set_project('one_minute_to_getml')
2. Load the data into the Engine
df_population = getml.data.DataFrame.from_csv('data_population.csv',
name='population_table')
df_peripheral = getml.data.DataFrame.from_csv('data_peripheral.csv',
name='peripheral_table')
3. Annotate the data
df_population.set_role(cols='target', role=getml.data.role.target)
df_population.set_role(cols='join_key', role=getml.data.role.join_key)
4. Define the data model
dm = getml.data.DataModel(population=df_population.to_placeholder())
dm.add(df_peripheral.to_placeholder())
dm.population.join(
dm.peripheral,
on="join_key",
)
5. Train the feature learning algorithm and the predictor
pipe = getml.pipeline.Pipeline(
data_model=dm,
feature_learners=getml.feature_learning.FastProp(),
predictors=getml.predictors.LinearRegression()
)
pipe.fit(
population=df_population,
peripheral=[df_peripheral]
)
6. Evaluate
pipe.score(
population=df_population_unseen,
peripheral=[df_peripheral_unseen]
)
7. Predict
pipe.predict(
population=df_population_unseen,
peripheral=[df_peripheral_unseen]
)
8. Deploy
# Allow the pipeline to respond to HTTP requests
pipe.deploy(True)
Check out the rest of this documentation to find out how getML achieves top performance on real-world data science projects with many tables and complex data schemes.