Want your uploads to run as smoothly as possible? FlyData supports four data formats for uploading to Amazon Redshift.

By following the guidelines below, you can help ensure that your data gets into Amazon Redshift smoothly.

For more information on using your application with Heroku, see FlyData gem for Heroku.

JSON

FlyData supports the importing of JSON formatted data.

When your data is formatted as JSON, FlyData automatically creates tables and dynamically adds columns when new key-value pairs are received. Dynamic column creation is one of FlyData's most distinctive features, giving you much greater flexibility in handling table schemas. Because this feature is not available for the CSV and TSV formats, you must output your data in JSON to take advantage of it.

The JSON data should follow the guidelines below:

Basic JSON format

When you first sign up for FlyData, you are prompted to specify a table name to correspond to each of your log files. From there, every time your log files are updated with new JSON data rows, FlyData updates the corresponding Amazon Redshift table, using the new JSON key-value pairs to update columns and values in the table.

{ "[column-name]": [column-value], … }

A line in JSON must start with { and end with the newline character, \n. If the data contains invalid JSON formatting, FlyData ignores those lines.
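
For illustration, here is a minimal sketch of writing records in this one-JSON-object-per-line format from Python. The file name and field names are examples only, not anything FlyData requires.

# Illustrative sketch: append newline-delimited JSON records
import json

records = [
    {"user_id": 101, "name": "John", "created_at": "2013-03-12 12:23:45"},
    {"user_id": 102, "name": "Jane", "created_at": "2013-03-12 12:24:10"},
]

# Each line starts with "{" and ends with "\n".
with open("users.json", "a") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")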

Nested JSON format

FlyData also supports JSON data in nested structures that have hashes or arrays of hashes in the value.

When a value is nested, FlyData treats its key as a table name suffix and processes the nested values in the same way as root-level key-values. For example, if a JSON record for a "users" table includes a hash value with the key name "address", the child table takes the name "users_address".

Additionally, FlyData adds an "id" column to every table in order to maintain the relation between parent and child tables. FlyData assigns the "id" value automatically unless the record already includes an "id" key and value. For each child table, FlyData creates a "[parent-table-name]_id" column and sets it to the "id" value of the parent table record.

For instance, the following JSON data for a "users" table is stored in two tables, "users" and "users_address":

// JSON Data
{
  "name": "John",
  "address": {
    "city": "LA"
  }
}

Once FlyData uploads it to Amazon Redshift, the following shows how it appears on the cluster tables:

// Stored Data on the Redshift cluster

// users (parent table)
-----------------
| name | id     |
+------+--------+
| John | 123456 |
-----------------

// users_address (child table)
-------------------
| city | users_id |
+------+----------+
| LA   | 123456   |
-------------------
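
The flattening behavior can be pictured with a short sketch. The id generation and table naming below are simplified assumptions for illustration; they are not FlyData's actual implementation.

# Illustrative sketch: split a record with a nested hash into parent and child rows
import json
from itertools import count

_id_seq = count(123456)  # stand-in for FlyData's automatic id assignment

def flatten(table, record):
    rows = []
    parent = {k: v for k, v in record.items() if not isinstance(v, dict)}
    parent.setdefault("id", next(_id_seq))  # keep an existing "id" if present
    rows.append((table, parent))
    for key, value in record.items():
        if isinstance(value, dict):
            child = dict(value)
            child[f"{table}_id"] = parent["id"]  # link the child back to its parent
            rows.append((f"{table}_{key}", child))
    return rows

print(flatten("users", json.loads('{"name": "John", "address": {"city": "LA"}}')))
# [('users', {'name': 'John', 'id': 123456}),
#  ('users_address', {'city': 'LA', 'users_id': 123456})]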

The value of "address" can also be an array of hashes, in which case every element of the array is processed. However, if the value is an array of non-hash values, the elements are processed differently.

In that case, the child table contains the following columns:

  • index (stores the array index of the element)
  • value
  • [parent-table-name]_id (e.g. "gps_history_id")

Here is an example for a "gps_history" table whose records include an array of coordinate values.

// JSON Data
{
  "created_at": "2013-03-12 12:23:45",
  "coordinates": [
    121.01,
    14.51
  ]
}

// Stored Data on the Redshift cluster

// gps_history (parent table)
--------------------------------
|      created_at     |   id   |
+---------------------+--------+
| 2013-03-12 12:23:45 | 123456 |
--------------------------------

// gps_history_coordinates (child table)
-----------------------------------
| index | value  | gps_history_id |
+-------+--------+----------------+
| 0     | 121.01 | 123456         |
+-------+--------+----------------+
| 1     | 14.51  | 123456         |
-----------------------------------
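
As a companion to the sketch above, an array of non-hash values could be expanded into index/value rows like this. Again, this is only an illustration of the rule described in this section, not FlyData's code.

# Illustrative sketch: expand an array of scalar values into index/value rows
def flatten_array(table, key, values, parent_id):
    rows = [
        {"index": i, "value": v, f"{table}_id": parent_id}
        for i, v in enumerate(values)
    ]
    return f"{table}_{key}", rows

print(flatten_array("gps_history", "coordinates", [121.01, 14.51], 123456))
# ('gps_history_coordinates',
#  [{'index': 0, 'value': 121.01, 'gps_history_id': 123456},
#   {'index': 1, 'value': 14.51, 'gps_history_id': 123456}])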

Column Format Type

FlyData detects the format type from the value when it creates a new table or adds a new column. The supported format types are as follows:

  • integer (4-byte)
  • big integer (8-byte)
  • float
  • boolean
  • varchar (256-byte)
  • date
  • timestamp

If the value of an unknown key is null, an empty hash, or an empty array, FlyData skips it and does not add a new column.
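
The type detection might look roughly like the sketch below; the exact rules and size thresholds are assumptions for illustration, not FlyData's published behavior.

# Illustrative sketch: guess a Redshift column type from a JSON value
def guess_column_type(value):
    if value is None or value == {} or value == []:
        return None                      # skipped: no column is added
    if isinstance(value, bool):          # check bool before int (bool is an int in Python)
        return "BOOLEAN"
    if isinstance(value, int):
        return "INTEGER" if -2**31 <= value < 2**31 else "BIGINT"
    if isinstance(value, float):
        return "FLOAT"
    return "VARCHAR(256)"                # strings, including date/timestamp text

for v in [42, 2**40, 3.14, True, "LA", None]:
    print(v, "->", guess_column_type(v))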

Date and Timestamp format

FlyData converts the text in timestamp or date fields to a date format that Amazon Redshift supports.

Supported formats for timestamp and date format fields include:

  • Date

    YYYY/MM/DD  (ex 2012/05/25)
    YYYY-MM-DD  (ex 2013-04-12)
  • Timestamp

    // W3C http://www.w3.org/TR/NOTE-datetime
    YYYY-MM-DDThh:mm:ssTZD (ex 1997-07-16T19:20:01+01:00)
    YYYY-MM-DDThh:mm:ss.sTZD (ex 1997-07-16T19:20:01.123+01:00)
    
    // Twitter
    ddd MMM dd HH:mm:ss zzzz yyyy (ex Sat Sep 17 14:48:57 +0000 2011)
    
    // Redshift
    YYYY-MM-DD hh:mm:ss (ex 2012-01-23 19:20:15)
    YYYY-MM-DD hh:mm:ss.s (ex 2012-01-23 19:20:15.243)
    
    // Apache
    [DD/MMM/YYYY:HH:mm:ss zzzz] (ex [13/Sep/2006:07:01:53 -0700])
    
    // Others
    YYYY-MM-DD hh:mm:ss zzzz (ex 2012-01-23 19:20:15 +0000)
    YYYY-MM-DD hh:mm:ss.s zzzz (ex 2012-01-23 19:20:15.243 +0000)
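
As a rough sketch of this normalization, the formats above can be parsed with strptime patterns and reprinted in Redshift's "YYYY-MM-DD hh:mm:ss" form. The pattern list below covers only some of the formats, and dropping the timezone (as in the Apache example later in this article) is an assumption here, not a documented rule.

# Illustrative sketch: normalize a supported date/timestamp string for Redshift
from datetime import datetime

PATTERNS = [
    "%Y/%m/%d",                   # 2012/05/25
    "%Y-%m-%d",                   # 2013-04-12
    "%Y-%m-%dT%H:%M:%S%z",        # 1997-07-16T19:20:01+01:00 (W3C)
    "%a %b %d %H:%M:%S %z %Y",    # Sat Sep 17 14:48:57 +0000 2011 (Twitter)
    "%Y-%m-%d %H:%M:%S",          # 2012-01-23 19:20:15 (Redshift)
    "[%d/%b/%Y:%H:%M:%S %z]",     # [13/Sep/2006:07:01:53 -0700] (Apache)
    "%Y-%m-%d %H:%M:%S %z",       # 2012-01-23 19:20:15 +0000
]

def to_redshift_timestamp(text):
    for pattern in PATTERNS:
        try:
            dt = datetime.strptime(text, pattern)
        except ValueError:
            continue
        return dt.replace(tzinfo=None).strftime("%Y-%m-%d %H:%M:%S")
    raise ValueError(f"unsupported format: {text!r}")

print(to_redshift_timestamp("[10/Oct/2000:13:55:36 -0700]"))  # 2000-10-10 13:55:36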

Supported characters

Data can contain only UTF-8 characters up to three bytes long. Invalid characters are replaced with a '?' character.
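
A small sketch of that replacement rule (illustrative only, not FlyData's code):

# Illustrative sketch: replace characters whose UTF-8 encoding exceeds 3 bytes
def sanitize_utf8(text):
    return "".join(c if len(c.encode("utf-8")) <= 3 else "?" for c in text)

print(sanitize_utf8("price: 100円 😀"))  # "price: 100円 ?" (the emoji needs 4 bytes)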

Settings

You can configure the following options from the FlyData settings page:

  • Auto create table
    If this option is turned off and there is no table to store the data, the data is ignored.
  • Auto add columns
    If "support nested JSON" option is on, "id" and "parent id" column will be added automatically even if this option is off.
  • Support nested JSON
    If turned off, array and hash values are stored as serialized JSON text.

    CREATE TABLE your_tablename_here (
       an_integer INT,
       first_string VARCHAR(255),
       second_string VARCHAR(255)
    );

CSV / TSV

In addition to JSON, FlyData supports data formatted into comma or tab separated lists.

The specifications for CSV and TSV are a little different from RFC 4180.

FlyData does not support enclosing a field in double quotes to escape characters such as commas and tabs.
  • FlyData requires backslash escapes for the following characters:

    • , commas, in CSV only.
    • \t tabs, in TSV only.
    • \ backslashes.
    • \n newlines.

For example, if you wanted to upload a CSV file that includes the newline character, this is how you would handle the escape:

01,john,29,Hello I'm John.\\nHow are you doing?
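
One way to read these escaping rules is sketched below; the field values are examples, and this is not a FlyData utility, so double-check the escaped output against the example above before relying on it.

# Illustrative sketch: backslash-escape one CSV field
def escape_field(value, delimiter=","):
    return (
        str(value)
        .replace("\\", "\\\\")                 # escape backslashes first
        .replace("\n", "\\n")                  # then newlines
        .replace(delimiter, "\\" + delimiter)  # then the delimiter
    )

row = ["01", "john", "29", "Hello, I'm John."]
print(",".join(escape_field(f) for f in row))
# 01,john,29,Hello\, I'm John.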

Note: You must create the table on Amazon Redshift before you can upload in either CSV or TSV data format.

Apache Log Format

FlyData supports the commonly used Apache log formats: Common Log Format (CLF) and Combined Log Format.

Using this format, users can upload their Apache access logs to Amazon Redshift and start analyzing their data right away. You don’t even need to create a table in Redshift, as FlyData does the job for you.

Standard log attributes

For example, the following log data in Combined Log Format will be uploaded to the Redshift table as follows:

  • Log Data

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 
    2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
  • Data in Redshift

    // apache_access_log table
    ------------------------------------------------------
    | COLUMN         | VALUE                             |
    +----------------+-----------------------------------+
    | ip             | 127.0.0.1                         |
    | remote_logname | -                                 |
    | remote_user    | frank                             |
    | timestamp      | 2000-10-10 13:55:36               |
    | http_method    | GET                               |
    | resource       | /apache_pb.gif                    |
    | protocol       | HTTP/1.0                          |
    | status         | 200                               |
    | size           | 2326                              |
    | referrer       | http://www.example.com/start.html |
    | user_agent     | Mozilla/4.08 [en] (Win98; I ;Nav) |
    ------------------------------------------------------
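
For reference, a Combined Log Format line like the one above can be parsed with a single regular expression. This sketch is only illustrative (it is not FlyData's parser); the group names mirror the column names in the table.

# Illustrative sketch: parse a Combined Log Format line
import re

COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<remote_logname>\S+) (?P<remote_user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<http_method>\S+) (?P<resource>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
        '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"')

print(COMBINED.match(line).groupdict())
# {'ip': '127.0.0.1', 'remote_logname': '-', 'remote_user': 'frank', ...}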


Request query parameters

In addition to the standard log attributes mentioned above, FlyData also allows users to store the values of request query parameters (for instance, ?user_id=123) in their own corresponding columns on the Redshift table.

If the corresponding column is missing, FlyData creates it automatically so that the user doesn’t have to worry about the table definition.

  • Log Data

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /purchase?user_id=293&item_id=102
    HTTP/1.0" 200 2326 "http://www.example.com/store.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
  • Data in Redshift

    // Apache_access_log table
    ------------------------------------------------------
    | COLUMN         | VALUE                             |
    +----------------+-----------------------------------+
    | ip             | 127.0.0.1                         |
    | remote_logname | -                                 |
    | remote_user    | frank                             |
    | timestamp      | 2000-10-10 13:55:36               |
    | http_method    | GET                               |
    | resource       | /purchase?user_id=293&item_id=102 |
    | protocol       | HTTP/1.0                          |
    | status         | 200                               |
    | size           | 2326                              |
    | referrer       | http://www.example.com/store.html |
    | user_agent     | Mozilla/4.08 [en] (Win98; I ;Nav) |
    | user_id        | 293                               |
    | item_id        | 102                               |
    ------------------------------------------------------
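
The query-string splitting can be pictured with a couple of standard-library calls; this is an illustrative sketch, not FlyData's implementation.

# Illustrative sketch: split request query parameters into key/value pairs
from urllib.parse import urlsplit, parse_qsl

resource = "/purchase?user_id=293&item_id=102"
params = dict(parse_qsl(urlsplit(resource).query))
print(params)  # {'user_id': '293', 'item_id': '102'} -- each key gets its own column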


FlyData extended format

In the event that the user has custom data that fits neither into the standard parameters nor into the request query parameters, they can use the FlyData Extended Log Format.

Extended Log Format adds a double-quoted string to the end of Common Log Format or Combined Log Format. The contents must be a string of key=value pairs concatenated with &, which is the same format as the request query parameters.

  • Log Data

    127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /purchase?user_id=293&item_id=102
    HTTP/1.0" 200 2326 "http://www.example.com/store.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"
    "session_id=rfnq17675gtrfejbtc46n0vi97&response_time=7"

    Here, the last double-quoted string, containing session_id and response_time, is in the FlyData Extended Log Format. During upload, FlyData creates columns for these keys on the Amazon Redshift table.

  • Data in Redshift

    // Apache_access_log table
    ------------------------------------------------------
    | COLUMN         | VALUE                             |
    +----------------+-----------------------------------+
    | ip             | 127.0.0.1                         |
    | remote_logname | -                                 |
    | remote_user    | frank                             |
    | timestamp      | 2000-10-10 13:55:36               |
    | http_method    | GET                               |
    | resource       | /purchase?user_id=293&item_id=102 |
    | protocol       | HTTP/1.0                          |
    | status         | 200                               |
    | size           | 2326                              |
    | referrer       | http://www.example.com/store.html |
    | user_agent     | Mozilla/4.08 [en] (Win98; I ;Nav) |
    | user_id        | 293                               |
    | item_id        | 102                               |
    | session_id     | rfnq17675gtrfejbtc46n0vi97        |
    | response_time  | 7                                 |
    ------------------------------------------------------
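
Because the trailing double-quoted string uses the same key=value&key=value format as a query string, it can be split the same way. This is an illustrative sketch only, not FlyData's implementation.

# Illustrative sketch: split the FlyData extended string into key/value pairs
from urllib.parse import parse_qsl

extended = "session_id=rfnq17675gtrfejbtc46n0vi97&response_time=7"
print(dict(parse_qsl(extended)))
# {'session_id': 'rfnq17675gtrfejbtc46n0vi97', 'response_time': '7'}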

FlyData handles real-time replication for all of the major relational databases, including MySQL, PostgreSQL, Amazon RDS, Amazon Aurora, MariaDB, and Percona.

With an easy one-time setup, you can focus on uncovering insights from your data to make business decisions.
