# Amazon Redshift & UTF-8
One thing to be aware of when transferring data to Amazon Redshift is its handling of UTF-8 characters. When moving data from external data sources into Redshift, you may encounter an error like the following:
```
ERROR: Load into table 'alex_test_table_char' failed. Check stl_load_errors system table for details.
```
And in your `stl_load_errors` table you will see:

```
 err_code | err_reason
----------+------------------------------------------------------------------------------------------------------
     1220 | Multibyte character not supported for CHAR (Hint: try using VARCHAR). Invalid char: e6 97 ad
```
This happens because the column was of the CHAR type, which only supports single-byte ASCII characters. When the data you want to load contains multibyte UTF-8 characters, you need a column of the VARCHAR type instead.
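The `e6 97 ad` in the error hint above is in fact the UTF-8 encoding of a single three-byte character. A quick illustrative sketch in Python shows why one human-visible character can overflow a single-byte CHAR cell:

```python
# Each string below is one "character" to a human, but its UTF-8 encoding
# may span multiple bytes -- the third one matches the bytes in the error hint.
for ch in ["a", "é", "旭"]:
    encoded = ch.encode("utf-8")
    print(f"{ch!r} -> {encoded.hex(' ')} ({len(encoded)} byte(s))")
```

Only the first sample would survive in a single-byte CHAR column; the others require VARCHAR.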
# Using a VARCHAR Column Instead
Now let’s try this again using a VARCHAR(20). The results are:
```
INFO: Load into table 'alex_test_table_varchar' completed, 3 record(s) loaded successfully.
```
So, finally we were able to complete the load successfully.
```
select * from alex_test_table_varchar;
 id |   name
----+------------
  2 | 旭川清
  1 | John Doe
  3 | (´・ω・`)
```
Things have become much more convenient compared to when Amazon Redshift was first released. Redshift now supports up to four-byte UTF-8 characters, which covers most use cases.
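One detail worth remembering when sizing columns: Redshift's `VARCHAR(n)` length is measured in bytes, not characters, so multibyte strings consume their full UTF-8 byte length. A quick Python check (purely illustrative, reusing values in the spirit of the example above):

```python
# Redshift VARCHAR(n) counts *bytes*, not characters. Four-byte UTF-8
# characters (e.g. emoji) are supported, but use 4 bytes of the declared length.
for v in ["John Doe", "旭川清", "😀"]:
    n = len(v.encode("utf-8"))
    print(f"{v!r}: {n} bytes, fits VARCHAR(20): {n <= 20}")
```

A three-character Japanese name takes nine bytes, so `VARCHAR(20)` holds it comfortably, but tighter declared lengths can truncate or reject multibyte data sooner than a character count would suggest.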
# Handling NUL Values
One thing to note about Amazon Redshift is that if a string contains NUL (U+0000) characters, those NUL characters will cause a load error when you try to COPY that data. Let's say your source data is the record `1,a\0a`, where:

- The table schema is `(int, varchar, varchar)`
- `\0` denotes the NUL character
When this record is copied, you will see a load error as follows:
```
err_reason                                                     | raw_line
---------------------------------------------------------------+----------
Missing newline: Unexpected character 0x61 found at location 3 | 1,aa
```
The raw_line appears as `1,aa` because the NUL character is regarded as the end of the line. As a result, Redshift fails to load the record due to the missing third column value. To avoid this, you have to replace NUL values before running the COPY command; for example, escaping NUL characters as the literal text `\x00` is a reliable workaround.
# Using ACCEPTINVCHARS
It is not rare for data to contain invalid UTF-8 byte sequences, which also cause load errors during COPY. For these cases, Redshift provides the `ACCEPTINVCHARS` option for the COPY command, which replaces each invalid character with a replacement character.
```
copy <your_table_name> from <your_data_source.csv>
credentials 'aws_access_key_id=<your_key_id>;aws_secret_access_key=<your_secret_access_key>'
delimiter ',' acceptinvchars;
```
For example, suppose your data contains the stray byte `\xc3`, which is invalid on its own (it is a UTF-8 lead byte with no continuation byte), as in the record `1,aa\xc3aa,bbbb`.
You will see the following log message when copying:
```
INFO: Load into table 'utf8test' completed, 1 record(s) loaded successfully.
INFO: Load into table 'utf8test' completed, 1 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
```
And the record will be:
```
 id | val_1 | val_2
----+-------+-------
  1 | aa?aa | bbbb
```
`?` is the default replacement character. You can change it by specifying a character after the option, e.g. `ACCEPTINVCHARS AS '^'`.
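If you would rather clean invalid bytes client-side before loading instead of fixing them at COPY time, Python's decoder can approximate the same behavior. A sketch, assuming a CSV record containing the stray byte `\xc3` and using `?` to mirror Redshift's default replacement character:

```python
# A lone UTF-8 lead byte (\xc3) with no continuation byte is invalid;
# decode with errors="replace", then swap U+FFFD for '?' to mimic Redshift.
raw = b"1,aa\xc3aa,bbbb"
cleaned = raw.decode("utf-8", errors="replace").replace("\ufffd", "?")
print(cleaned)  # -> 1,aa?aa,bbbb
```

This produces the same `aa?aa` value shown in the loaded record above, but puts the sanitization under your control before the file ever reaches S3.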
# How FlyData Can Help
FlyData provides continuous, near real-time replication between RDS, MySQL and PostgreSQL databases to Amazon Redshift. The FlyData Sync tool is an intuitive, powerful, cost-effective way to integrate your transactional databases and analytical data stores in a single interface with no manual scripting.
You can start a 14-day Free Trial and begin syncing your data within minutes. For questions about FlyData and how we can help accelerate your use case on Amazon Redshift, connect with us at firstname.lastname@example.org.