Why did Spark stop responding after calling pipeline fit and a distinct count?


I’m trying to use Spark NLP to detect English-language texts in a large dataset. The notebook (ipynb) runs on Google Colab.
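
For context, the pipeline is set up roughly like this; the pretrained model name and the CSV path are placeholders from memory, so the exact values may differ:

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import LanguageDetectorDL

spark = sparknlp.start()

# placeholder path; the real dataset is a large CSV of tweets
df = spark.read.csv("tweets.csv", header=True)

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# pretrained multilingual language detector (model name from memory)
language_detector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("language")

pipeline = Pipeline(stages=[document_assembler, language_detector])
result = pipeline.fit(df).transform(df)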

Everything works fine, from reading the CSV through pipeline fit to selecting the results. It looks like this:

import pyspark.sql.functions as F

(result.select(F.explode(F.arrays_zip(result.document.result,
                                      result.language.result)).alias("cols"))
       .select(F.expr("cols['0']").alias("Document"),
               F.expr("cols['1']").alias("Language"))
       .show(truncate=100))


+----------------------------------------------------------------------------------------------------+--------+
|                                                                                            Document|Language|
+----------------------------------------------------------------------------------------------------+--------+
|#Elecciones2020 | En #Florida: #JoeBiden dice que #DonaldTrump solo se preocupa por él mismo. El ...|      es|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                                 0.0| Unknown|
|                                                                                                    |    null|
|                                                                                                    |    null|
|                                                                                              1860.0| Unknown|
|#Trump: As a student I used to hear for years, for ten years, I heard China! In 2019! And we have...|      en|
|                                                                                                 1.0| Unknown|
|                     2 hours since last tweet from #Trump! Maybe he is VERY busy. Tremendously busy.|      en|
|@CLady62 Her 15 minutes were over long time ago. Omarosa never represented the black community! #...|      en|
|                                                                                                 0.0| Unknown|
|                             @richardmarx Glad u got out of the house! DICK!!#trump 2020????????????????????????|      en|
|@DeeviousDenise @realDonaldTrump @nypost There won’t be many of them.  Unless you all have been v...|      en|
|One of the single most effective remedies to eradicate another round of #Trump Plague in our #Whi...|      en|
|                                                                               #Election2020 #Trump |      en|
|                                                                                                 0.0| Unknown|
+----------------------------------------------------------------------------------------------------+--------+

But then I need to keep only the English texts. I tried select(F.expr("cols['1']")).distinct().count(), filter(F.expr("cols['1']") == 'en'), and even a list comprehension after collect(), but none of them ever returned. After waiting an hour I interrupted the cell, and from then on Spark seemed to stop responding altogether; for example, even spark.read.csv() no longer returned results.
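
Written out, the attempts look roughly like this (cols here is just my own name for the exploded frame):

cols = result.select(F.explode(F.arrays_zip(result.document.result,
                                            result.language.result)).alias("cols"))

# attempt 1: count the distinct language labels -- never returns
cols.select(F.expr("cols['1']").alias("Language")).distinct().count()

# attempt 2: keep only the rows detected as English -- also hangs
cols.filter(F.expr("cols['1']") == 'en').show()

# attempt 3: collect to the driver and filter in Python -- hangs as well
en_rows = [r for r in cols.collect() if r["cols"]["1"] == "en"]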

Why is this happening, and how can I fix it?
