---
layout: post
title: "SQL Errors Summary"
category: SQL
tags: SQL
---
MySQL Error

Error Code: 1046. No database selected. Select the default DB to be used by double-clicking its name in the SCHEMAS list in the sidebar.

You have to tell MySQL which database you are going to use:
```sql
SHOW DATABASES;
USE mysql;
```
The following example works in MySQL.
```sql
SHOW DATABASES;
USE mysql;

CREATE TABLE test (
  id INT(11),
  name VARCHAR(25)
);

INSERT INTO test VALUES(1, "23");
INSERT INTO test VALUES(1, "测试");
INSERT INTO test VALUES(1, "12测试");

-- names that start with a Chinese character (first byte > 127)
SELECT name FROM test WHERE ascii(name) > 127;

-- names that do not start with a Chinese character
SELECT name FROM test WHERE ascii(name) < 127;

-- names containing Chinese characters
-- ('%[吖-座]%' with LIKE is SQL Server syntax; in MySQL 8.0+ use a REGEXP range)
SELECT * FROM test WHERE name REGEXP '[吖-座]';

-- ASCII codes: digits 48-57, letters 65-90 and 97-122;
-- multi-byte characters such as Chinese use byte values above 127
-- find names that start with a digit (e.g. before deleting them)
SELECT * FROM test WHERE ascii(name) BETWEEN 48 AND 57;

-- pre-8.0 MySQL REGEXP is byte-based, so this pattern is read literally:
-- a match means the name contains ASCII digits or letters, hence "not Chinese"
SELECT name,
  CASE name REGEXP "[u0391-uFFE5]"
    WHEN 1 THEN "not Chinese characters"
    ELSE "Chinese characters"
  END AS "is it Chinese?"
FROM test;

-- LENGTH() counts bytes: in utf8 a Chinese character takes three bytes,
-- a digit or letter takes one
-- CHAR_LENGTH() counts characters: Chinese characters, digits, and letters all count as one
-- check whether a name contains Chinese characters
SELECT * FROM test
WHERE LENGTH(name) > CHAR_LENGTH(name);
```
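The same byte-versus-character distinction can be checked outside MySQL. A quick sketch in Python 2 (matching the Python used later in this post), with the test value '测试' from the table above:

```python
# -*- coding: utf-8 -*-
s = u'测试'
print len(s)                  # 2 characters, like CHAR_LENGTH()
print len(s.encode('utf-8'))  # 6 bytes, like LENGTH() on a utf8 column
```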
Python block for reading big data in chunks and handling it:
```python
import pandas as pd
import time

start = time.clock()
reader = pd.read_csv('data.csv', iterator=True, low_memory=False)
loop = True
chunkSize = 5000000
chunks = []
while loop:
    try:
        chunk = reader.get_chunk(chunkSize)
        chunks.append(chunk)
    except StopIteration:
        loop = False
        print "Iteration is stopped."
df = pd.concat(chunks, ignore_index=True)
print time.clock() - start
```
```python
# Chinese characters in Python 2 source and strings: declare the encoding at the
# top of the file, otherwise you may hit
# UnicodeEncodeError: 'ascii' codec can't encode characters in position ...
# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf8')

# export a DataFrame with Chinese text to csv (avoids encoding problems)
df.to_csv(path + '\\' + filename + '.csv', encoding='gb18030', index=False)

# DataFrame to list
company_list = companys.values.tolist()

# print a list of Chinese strings without escape sequences
print str(company_list).replace('[', '').replace(']', '').decode('string_escape')

# read a csv containing Chinese text
df = pd.read_csv("data.csv", encoding='gbk')

# DataFrame to dictionary
data = full_data.set_index('id')['values'].to_dict()

# set to list
l = list(set([0, 1, 2]))
```
---
layout: post
title: "SQL Update Table Operations"
category: SQL
tags: SQL
---
These statements work in PostgreSQL.
```sql
-- add a column
alter table cece_table add name varchar(100);

-- rename a column
alter table cece_table rename column name to another_name;

-- update the values in one column ('个人' means "individual")
update cece_table set name = '个人' where length(number) < 4;

-- update a column from another table
update cece_table set name = table2.name
from table2
where cece_table.id = table2.id;

-- delete some rows
delete from cece_table where id in (select id from table2);
```
With too many features, most of them correlated with each other, models suffer from poor accuracy, and large volumes of data become hard to handle.

PCA is a method of extracting important variables from a large set of variables in a dataset. It extracts a low-dimensional set of features from high-dimensional data with the motive of capturing as much information as possible.
A principal component is a normalized linear combination of the original predictors in a data set.
Let's say we have a set of predictors $X_1, X_2, \ldots, X_p$.

The first principal component can be written as:

$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \phi_{31}X_3 + \cdots + \phi_{p1}X_p$$

where,

$Z_1$ is the first principal component.

$\phi_1 = (\phi_{11}, \phi_{21}, \ldots, \phi_{p1})$ is the loading vector comprising the loadings of the first principal component. The loadings are constrained so that their sum of squares equals one, $\sum_{j=1}^{p} \phi_{j1}^2 = 1$, because loadings of large magnitude could otherwise inflate the variance arbitrarily. The loading vector also defines the direction of the principal component $Z_1$, along which the data vary the most; it yields the line in $p$-dimensional space that is closest to the $n$ observations, where closeness is measured by average squared Euclidean distance.

$X_1, \ldots, X_p$ are normalized predictors, each with mean zero and standard deviation one.
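To make this concrete, here is a minimal sketch using scikit-learn (assuming it is installed; the data array `X` is made up for illustration). It standardizes the predictors, fits PCA, and checks that the first loading vector has unit sum of squares and that $Z_1$ is exactly the linear combination above:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# made-up data: n = 100 observations, p = 5 predictors, two of them correlated
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
X[:, 1] = X[:, 0] + 0.1 * rng.randn(100)

# normalize the predictors: mean zero, standard deviation one
X_std = StandardScaler().fit_transform(X)

pca = PCA()
Z = pca.fit_transform(X_std)      # principal component scores
phi1 = pca.components_[0]         # loading vector of the first principal component

print(np.sum(phi1 ** 2))                      # sum of squared loadings: 1.0
print(np.allclose(Z[:, 0], X_std.dot(phi1)))  # Z1 is the linear combination: True
```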
As noted above, the results of PCA depend on the scaling of the variables. A scale-invariant form of PCA has been developed.
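A quick way to see the scale dependence: run PCA on the same data with and without standardization and compare the leading component. The sketch below uses made-up data in which one column is measured in much larger units; standardizing first (equivalent to PCA on the correlation matrix) is the usual way to remove this dependence.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)
X = rng.randn(200, 3)
X[:, 2] *= 100.0   # one predictor measured on a much larger scale

# PCA on the raw data: the large-scale column dominates the first component
print(PCA(n_components=1).fit(X).components_[0])

# PCA after standardizing: the first component no longer depends on the units
X_std = StandardScaler().fit_transform(X)
print(PCA(n_components=1).fit(X_std).components_[0])
```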