Amazon Customer Review
This dataset contains over 150M customer reviews of Amazon products. The data is in snappy-compressed Parquet files in AWS S3 that total 49GB in size (compressed). Let's walk through the steps to insert it into ClickHouse.
Note
The queries below were executed on a Production instance of ClickHouse Cloud.
- Without inserting the data into ClickHouse, we can query it in place. Let's grab some rows, so we can see what they look like:
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_2015.snappy.parquet')
LIMIT 10
The rows look like:
┌─review_date─┬─marketplace─┬─customer_id─┬─review_id──────┬─product_id─┬─product_parent─┬─product_title────────────────────────────────────────────────┬─product_category───────┬─star_rating─┬─helpful_votes─┬─total_votes─┬─vine──┬─verified_purchase─┬─review_headline─────────────────────────────────────────────────────────────┬─review_body────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ 16452 │ US │ 21080196 │ R17NMVYCQXEEFW │ B00RSI5DJA │ 904397429 │ Pilot │ Digital_Video_Download │ 5 │ 0 │ 0 │ false │ false │ yes indeed │ OMG- i totally see myself get hook on that show if it happen- love it │
│ 16452 │ US │ 44158214 │ R3MAPJVO9D0ERG │ B00RSI61PU │ 475013967 │ Salem Rogers: Model of the Year 1998 │ Digital_Video_Download │ 5 │ 0 │ 0 │ false │ false │ Halarious show!! │ Loved this pilot episode!! Please pick this up Amazon!!! │
│ 16452 │ US │ 1944630 │ R1Q5YPRE84OVB6 │ B009IU6BIS │ 101502671 │ National Lampoon's Christmas Vacation │ Digital_Video_Download │ 5 │ 0 │ 0 │ false │ false │ Classic! │ This is a holiday classic. How can you not love it! │
│ 16452 │ US │ 49029010 │ RGDK35TBJJ2ZI │ B00RSI68V2 │ 639602030 │ Table 58 │ Digital_Video_Download │ 5 │ 2 │ 3 │ false │ false │ Fun for the whole family!! │ This show is fun! Our family really enjoyed watching the show. I can see this being one of the shows that we watch on Friday nights with our pizza and ice cream. I hope to see more of the show and the great cast of characters. │
│ 16452 │ US │ 52257958 │ R1R2SEOJT8M14Y │ B00RSGIMUE │ 196368495 │ Niko and the Sword of Light │ Digital_Video_Download │ 5 │ 1 │ 2 │ false │ false │ it's a new kind of avatar. great show. make more. │ My 7 year old son and my husband both loved it! It's like avatar the last air bender but with different magical powers. The characters are adorably well developed. The story is interesting. We hope amazon makes the whole season. We can't wait to see more! │
│ 16452 │ US │ 26927978 │ RN0JCPQ6Z4FUB │ B009ITL7UG │ 497741324 │ Lord of the Rings: The Return of the King (Extended Edition) │ Digital_Video_Download │ 5 │ 0 │ 0 │ false │ true │ Definite must-own for any Tolkien buff who has not yet upgraded to Blu-Ray! │ If you liked the theatrical release and are a fan of Middle-Earth then you should get this. │
│ 16452 │ US │ 19626887 │ R15LDVOU1S1DFB │ B00RSGHGB0 │ 576999592 │ Just Add Magic - Season 1 │ Digital_Video_Download │ 5 │ 1 │ 1 │ false │ false │ Great story! So good even my teenage boys said ... │ Great story! So good even my teenage boys said this is actually pretty good!!! Can't wait for the next episode. │
│ 16452 │ US │ 1439383 │ R2DJVLZM1MVFQH │ B002WEQJ3E │ 733651019 │ Six: The Mark Unleashed │ Digital_Video_Download │ 1 │ 0 │ 4 │ false │ false │ I am now less intelligent for having watched an entire 10 minutes of it │ I am now less intelligent for having watched an entire 10 minutes of it. God save my sole as I now must kick out the chair from which I am standing on so that the noose may do its job. Watch the movie at your own risk. The screen will suck your brain cells out of your body. │
│ 16452 │ US │ 46233181 │ R33W2NB9MCRUFV │ B00RSGFYQE │ 464741068 │ Point of Honor │ Digital_Video_Download │ 4 │ 0 │ 0 │ false │ false │ Give it a chance. │ Pilots are just what they are...pilots. A chance to see what works and what doesn't and a chance to smooth out the wrinkles. Point of Honor at least stands a fair chance. │
│ 16452 │ US │ 19537300 │ R2WGJYESHID0ZF │ B00RSGHQJM │ 374287214 │ Down Dog │ Digital_Video_Download │ 5 │ 1 │ 1 │ false │ false │ Five Stars │ great fun │
└─────────────┴─────────────┴─────────────┴────────────────┴────────────┴────────────────┴──────────────────────────────────────────────────────────────┴────────────────────────┴─────────────┴───────────────┴─────────────┴───────┴───────────────────┴─────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
- Let's define a new
MergeTreetable namedamazon_reviewsto store this data in ClickHouse:
CREATE TABLE amazon_reviews
(
review_date Date,
marketplace LowCardinality(String),
customer_id UInt64,
review_id String,
product_id String,
product_parent UInt64,
product_title String,
product_category LowCardinality(String),
star_rating UInt8,
helpful_votes UInt32,
total_votes UInt32,
vine Bool,
verified_purchase Bool,
review_headline String,
review_body String
)
ENGINE = MergeTree
ORDER BY (review_date, product_category);
- The following
INSERTcommand uses thes3Clustertable function, which allows the processing of multiple S3 files in parallel using all the nodes of your cluster. We also use a wildcard to insert any file that starts with the namehttps://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet:
INSERT INTO amazon_reviews
SELECT
*
FROM s3Cluster(
'default',
'https://datasets-documentation.s3.eu-west-3.amazonaws.com/amazon_reviews/amazon_reviews_*.snappy.parquet'
);
Tip
In ClickHouse Cloud, the name of the cluster is default. Change default to the name of your cluster...or use the s3 table function (instead of s3Cluster) if you do not have a cluster.
- That query doesn't take long - averaging about 300,000 rows per second. within 5 minutes or so you should see all the rows inserted:
SELECT formatReadableQuantity(count())
FROM amazon_reviews;
┌─formatReadableQuantity(count())─┐
│ 150.96 million │
└─────────────────────────────────┘
1 row in set. Elapsed: 0.005 sec.
- Let's see how much space our data is using:
SELECT
disk_name,
formatReadableSize(sum(data_compressed_bytes) AS size) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes) AS usize) AS uncompressed,
round(usize / size, 2) AS compr_rate,
sum(rows) AS rows,
count() AS part_count
FROM system.parts
WHERE (active = 1) AND (table = 'amazon_reviews')
GROUP BY disk_name
ORDER BY size DESC;
The original data was about 70G, but compressed in ClickHouse it takes up about 30G:
┌─disk_name─┬─compressed─┬─uncompressed─┬─compr_rate─┬──────rows─┬─part_count─┐
│ s3disk │ 30.05 GiB │ 70.47 GiB │ 2.35 │ 150957260 │ 14 │
└───────────┴────────────┴──────────────┴────────────┴───────────┴────────────┘
- Let's run some queries...here are the top 10 most-helpful reviews in the dataset:
SELECT
product_title,
review_headline
FROM amazon_reviews
ORDER BY helpful_votes DESC
LIMIT 10;
Notice the query has to process all 151M rows, but takes less than one second!
┌─product_title────────────────────────────────────────────────────────────────────────────┬─review_headline───────────────────────────────────────────────────────┐
│ Kindle: Amazon's Original Wireless Reading Device (1st generation) │ Why and how the Kindle changes everything │
│ BIC Cristal For Her Ball Pen, 1.0mm, Black, 16ct (MSLP16-Blk) │ FINALLY! │
│ The Mountain Kids 100% Cotton Three Wolf Moon T-Shirt │ Dual Function Design │
│ Kindle Keyboard 3G, Free 3G + Wi-Fi, 6" E Ink Display │ Kindle vs. Nook (updated) │